Friday, August 16, 2013, 12:06 AM
Warning: This post is a follow-up to this one. Unless you have read that one, you probably won't get the point of this one.

In my previous post I said that one of the issues with the DiGrazia et al. (2013) paper is that they did not mention the work by Morstatter et al. (2013).

The latter work is crucial because it shows that results obtained with the public Streaming API (i.e. the randomly sampled 1% stream of tweets that most researchers use) are quite different from those obtained with the firehose (the whole stream of tweets).

In my previous post I said that, on the basis of the work by Morstatter et al., DiGrazia et al. should have at least acknowledged that they didn't know how representative their data was. [Addendum, August 18: However, as Alex Hanna accurately pointed out, this would not exactly apply to the study by DiGrazia et al. since they used the gardenhose (10% sample) and not the 1% public Streaming API.]

To justify that, I simply said that the work by Morstatter et al. preceded that of DiGrazia et al. However, Emilio Ferrara told me that I was wrong about that, since the work by Morstatter et al. was published in July and the first draft by DiGrazia et al. was published in February.

I was sure, however, that DiGrazia et al. were aware of that work because I had pointed them to a preprint on April 25. That was the reason I still pointed out that flaw in the paper.

However, I was not aware of the deadlines for the annual meeting of the ASA: papers were due on January 9, 2013, and March 18, 2013 was the date on which decision letters were to be sent to authors. Besides, the final program for the conference was to be announced on April 30, 2013, so I assume that authors had to submit the camera-ready version of their papers between late March and early April.

So, in short: the work by Morstatter et al. was available online at least as of April 25, and DiGrazia et al. should have known of it at least from that date because of my e-mail. However, it's very likely that by that day they had already submitted the camera-ready version of their paper, which would explain the lack of that reference. [Addendum, August 18] Besides, as mentioned above, it's debatable whether the findings by Morstatter et al., obtained by comparing the public Streaming API with the firehose, apply to the gardenhose employed by DiGrazia et al.

Because of this I have struck through that particular piece of criticism in my previous post.

Nevertheless, the findings by Morstatter et al. are a source of concern for all of us working with Twitter's public Streaming API (the 1% sample), and maybe even for those working with the gardenhose (the 10% sample), since we simply don't know what biases these samples may exhibit when compared with the whole firehose.

