Ad verba per numeros

Wednesday, April 7, 2010, 09:36 AM
It's been a time since my last post because I've been busy working on a paper. Indeed, this post has got two goals.

On one hand, to describe a little bit of that research, mainly the "juicy parts".

On the other hand, to encourage you to read it and send me any feedback/comment/critic/ask for clarification you want. In this sense I'm trying to follow the excellent advice of Daniel Lemire. If you want to drop me a few lines my e-mail is dani AT uniovi DOT es, if you just have time for 140 characters just tweet me @pfcdgayo.

So, what's the paper about? Let's start with the formal presentation:

Nepotistic Relationships in Twitter and their Impact on Rank Prestige Algorithms

Micro-blogging services such as Twitter allow anyone to publish anything, anytime. Nonetheless to say, many of the available contents can be diminished as babble or spam. However, given the number and diversity of users, some valuable pieces of information should arise from the stream of tweets. Thus, such services can develop into valuable sources of up-to-date information (the so-called real-time web) provided a way to find the most relevant/trustworthy/authoritative users is available. Hence, this makes a highly pertinent question for which graph centrality methods can provide an answer. In this paper the author offers a comprehensive survey of feasible algorithms for ranking users in social networks, he examines their vulnerabilities to linking malpractice in such networks, and suggests an objective criterion against which to compare such algorithms. Additionally, he suggests a first step towards "desensitizing" prestige algorithms against cheating by spammers and other abusive users.

Now, a few details.

One of my research interests is Web Information Retrieval and, hence, real-time Web search is an appealing topic. One of the problems with real-time Web search is, obviously, separating "good" users from "bad" users, where "good" means trustworthy and relevant sources of information and "bad", well, just the opposite. One pretty obvious way to do that is applying PageRank (or any other rank prestige method) to the user graph to rank the users. In fact, Google seems to be doing that, indeed.

Of course, there are other ways which tend to be, IMHO, quite "ad hoc". I assume you have heard of several of them: they rely on several parameters as number of tweets, retweets, conversations, hashtags, followers, frequency of tweeting, etc. In the midst of all this discussion on Influence on Twitter (please notice the upper case), Daniel Tunkelang proposed a new rank prestige method which was named TunkRank. Shortly after this new algorithm was described, Jason Adams provided an implementation for it.

As an academic I find here several nice questions to (try to) answer. Which rank prestige method is the best to rank users in Twitter? How can you objectively evaluate that? Which other ranking methods, if any, are available? Could I do my bit in this field?

The answers to those questions are in the paper but, in short:

  • There are plenty of rank prestige methods, the paper surveys PageRank, HITS, NodeRanking, TunkRank, and TwitterRank. There are somewhat related methods to find relevant users in Web 2.0 environments --e.g. SPEAR-- or to compute influence in Twitter in different and appealing ways --e.g. the work by Lee et al., ask Haewoon for a longer version-- but none of these was replicated.
  • An objective way, in my opinion, to compare ranking methods against each other is checking their sensitivity to spammers and other abusive users. That is, a method is better than another one when the former consistently provides spammers with lower rankings.
  • With respect to that evaluation criterion, TunkRank is the best available method.
  • Finally, my bit was proposing nepotistic links (in fact, a way to discount nepotistic links) as a feature to include in rank prestige algorithms. It seems to be a good way to separate "useful" from "useless" sources of information, but the algorithm based on that idea still have two open issues which need further research.
Nonetheless to say, to work in this stuff a Twitter user graph was needed and I had to collect my own (about 28M tweets and 1.8M users from the English-speaking twittersphere). Thus, the paper includes an appendix describing the dataset but here you have a few interesting bits:
  • Half of the users provide a valid geographical location.
  • About one third of the users provide their personal name.
  • 58.6% of the Twitter users are men and 41.4% are women but female users surpass male users in the 10-19 years interval.
  • Very few users provide age information so this must be taken with a grain of salt: the average Twitter user is 21 years old, and those between 15 and 29 years account for 70% of the users.
With regards to the dataset: I do not have any plan to release it. I started this collection after the 2008 infochimps' fiasco and I'm a bit tired of seeing promises of massive datasets releases which end in nothing because of legal pressure. Thus, I'm promising you NOT releasing the data; however, there are some people braver than me: Kwak et al. and Munmun De Choudhury.

And that's all folks! I hope you can read (or at least skim) the paper and share your opinion or tell me of weak points in the report.

Back Next