Ad verba per numeros

HOT!, Research
Friday, April 23, 2010, 08:46 AM
That's the title of a paper I've recently submitted to a journal. I didn't intend to talk about it because I cannot publish it as a preprint; however, some recent independent events have touched a nerve and I'd like to "put in my two cents" on the topic.

The first "event": a tweet by @zephoria linking to a post about Big Data and the social sciences. She hits the nail on the head with that post; I especially liked this part (mainly because of my paper):

[...] Big Data presents new opportunities for understanding social practice. Of course the next statement must begin with a “but.” And that “but” is simple: Just because you see traces of data doesn’t mean you always know the intention or cultural logic behind them. And just because you have a big N doesn’t mean that it’s representative or generalizable.

Amen to that!

Second "event": a tweet by @munmun10 about the bias towards publishing positive results. She links to an interesting article by Ars Technica which describes a study on the infamous "file-drawer effect". That fancy name refers to researchers' tendency to report only positive results while leaving negative results undiscussed --which, of course, can be equally important.

OK, enough, I'll talk, I'll talk.

Why do these two unrelated tweets push me to urgently describe my own paper? Mainly because it deals with negative results, lessons learned from strong assumptions about exploiting Big Data, and it gives some warnings about different pitfalls one can find when doing Social Media research.

First of all, the abstract:

A warning against converting Twitter into the next Literary Digest. Daniel Gayo-Avello (2010). User-generated content has experienced vertiginous growth both in the diversity of applications and in the volume of topics covered by users. Content published in micro-blogging systems such as Twitter is thought to be feasible to data-mine in order to "take the pulse" of society. At this moment, plenty of positive experiences have been published, praising the goodness of relatively simple approaches to sampling, opinion mining, and sentiment analysis. In this paper I'd like to play devil's advocate by describing a careful study in which such simple approaches largely overestimate Obama's victory in the 2008 U.S. presidential election. A thorough post-mortem of that study is conducted and several important lessons are extracted.

The study described in the paper had been in my drawer since mid-2009 because I thought its outcome made it unpublishable: my data predicted an Obama victory (good), but the margin was too big. And when I say too big, I mean that Obama won Texas according to Twitter data (bad).

All of this reminded me of the (infamous) Literary Digest poll, which totally failed to predict the outcome of the 1936 U.S. presidential election. Thus, I simply assumed (in 2009) that using Twitter to predict elections in 2008 was like polling car owners in 1936 to predict who would be the next POTUS. Without further ado, I simply moved on.

Then, this year three different papers appeared in a short time span:

All three papers are worth a careful reading:
  • The one by O'Connor et al. links Twitter sentiment to public opinion (e.g. consumer confidence and presidential job approval); interestingly, they did not find any strong correlation between Twitter sentiment and surveys conducted during the 2008 presidential campaign.
  • The study by Tumasjan et al., on the contrary, asserts that "the mere number of tweets reflects voter preferences and comes close to traditional election polls". In fact, they were able to predict the outcome of the last German elections with Twitter data.
  • Lastly, the paper by Asur and Huberman describes the correlation between the volume of conversation about a movie on Twitter and its earnings in the opening weekend. In fact, in a recent interview, predicting elections is described as a possible field of application for the same methods.
So, here I was with a complete report on how to predict a landslide victory that never happened, a report that (1) was consistent with an independent study (the one by O'Connor et al.), and (2) seemed to reach the opposite conclusion of another study (the one by Tumasjan et al.).

Of course, the problem here is overgeneralization. My study does not prove that elections cannot be predicted by mining Twitter; it proves only that I wasn't able to predict the 2008 U.S. elections with my data and my sentiment analysis methods (by the way, I tested four different ones). Nor does the study by Tumasjan et al. prove that elections can be predicted; it demonstrates that it was possible to predict one particular election in one particular country.

Hence, I decided to write a paper that (1) argues for the need to publish negative results, (2) conducts a post-mortem of a failed Social Media study, analyzing the sources of bias and ways to correct them, and (3) provides some lessons and caveats for future research in the field.

So, here are the lessons I extracted from this; I hope you find them useful or, at least, that you can give me some feedback on them:

  1. The Big Data fallacy. Social Media are extremely appealing because researchers can easily obtain large data collections to be mined. However, just being large does not make such collections statistically representative of the global population.
  2. Beware of naïve sentiment analysis. It is certainly possible that some applications can achieve reasonable results by merely counting topic frequency or using simple approaches to sentiment detection. However, noisy instruments should be avoided and one should carefully check whether one is using --maybe unknowingly-- a random classifier.
  3. Be careful with demographic bias. Social Media users tend to be relatively young and, depending on the population of interest, this can introduce an important bias. To improve results it is imperative to know users' ages and try to correct the age bias in the data.
  4. What is essential is invisible to the eye. Non-responses can play an even more important role than the collected data. If the lack of information mostly affects just one group, the results can greatly depart from reality. Needless to say, estimating the degree of non-response and its nature is extremely difficult --if not outright impossible. Thus, we must be very conscious of this issue.
  5. (A few) Past positive results do not guarantee generalization. As researchers we must be aware of the file drawer effect and, hence, we should carefully evaluate positive reports before assuming the reported methods can be straightforwardly applied to any similar scenario with identical (positive) results.
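
To make lesson 3 a bit more concrete, here is a minimal Python sketch of one common way to correct an age bias: post-stratification, i.e. reweighting each age group so that it counts in proportion to its share of the target population. This is only an illustration, not necessarily the procedure used in the paper; the function names, the two age groups, the population shares and the toy sample are all made up for the example.

```python
from collections import Counter


def raw_share(sample):
    """Unweighted fraction of sampled users who support the candidate."""
    return sum(int(supports) for _, supports in sample) / len(sample)


def poststratified_share(sample, population_shares):
    """Support estimate reweighted by age group (post-stratification).

    sample: list of (age_group, supports_candidate) pairs, one per user.
    population_shares: dict mapping age_group -> fraction of the target
        population (hypothetical figures; they should sum to 1).
    """
    totals = Counter()
    supporters = Counter()
    for group, supports in sample:
        totals[group] += 1
        supporters[group] += int(supports)

    estimate = 0.0
    covered = 0.0
    for group, share in population_shares.items():
        if totals[group] == 0:
            # No users observed in this group: weighting cannot fix that
            # (this is the non-response problem of lesson 4).
            continue
        estimate += share * supporters[group] / totals[group]
        covered += share
    if covered == 0:
        raise ValueError("sample covers none of the population groups")
    # Renormalise over the groups that were actually observed.
    return estimate / covered


# Toy data (made up): young users are heavily over-represented.
sample = (
    [("18-29", True)] * 60 + [("18-29", False)] * 20
    + [("30+", True)] * 8 + [("30+", False)] * 12
)
population_shares = {"18-29": 0.25, "30+": 0.75}  # assumed, not real figures

print(f"raw share:             {raw_share(sample):.2f}")
print(f"post-stratified share: {poststratified_share(sample, population_shares):.2f}")
```

In this toy sample the raw share suggests a comfortable majority, while the reweighted estimate falls below one half, simply because the over-represented young group leans much more towards the candidate. Note that the correction only works for groups that appear in the sample at all, and only if users' ages are known.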
Of course, if you are interested in the paper just e-mail me (dani AT uniovi DOT es) and I'll send you a copy.