Ad verba per numeros


Tuesday, January 11, 2011, 06:51 PM
This morning I discussed with a couple of friendsters (i.e., Twitter friends) different methods to determine the language (the natural language, that is) in which a tweet is written. A pretty obvious thing to do is to rely on a dictionary: the larger the dictionary, the better.

However, which words should you use if you want a really short dictionary? Those most likely to appear in any conceivable text. Words such as "the", "of", "with" in English, or "el", "un", "de" in Spanish would be a good choice. These words are called stop words and there are several lists available for different languages (e.g. English, French, Spanish).
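Just as an illustration (this is not the code linked at the end of the post, and the word lists are tiny made-up samples), a stop-word-based guesser could be sketched in PHP like this:

<?php
// Sketch of stop-word-based language guessing. The word lists here are
// tiny illustrative samples; a real system would use full stop-word lists.
function guessLanguageByStopWords($text, array $stopWords) {
    $tokens = preg_split('/\s+/u', mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $bestLanguage = 'unknown';
    $bestHits = 0;
    foreach ($stopWords as $language => $words) {
        // Count how many tokens of the text are stop words of this language.
        $hits = count(array_intersect($tokens, $words));
        if ($hits > $bestHits) {
            $bestHits = $hits;
            $bestLanguage = $language;
        }
    }
    return $bestLanguage;
}

$stopWords = array(
    'English' => array('the', 'of', 'with', 'and', 'to'),
    'Spanish' => array('el', 'un', 'de', 'la', 'y'),
);
echo guessLanguageByStopWords('el perro de un amigo', $stopWords); // Spanish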

Nevertheless, this dictionary-based method is not error-free. For instance, "de" is a stop word not only in Spanish but also in Portuguese and Dutch.

Moreover, I argued that the method is especially problematic when trying to identify very short texts, particularly if they are written using "imaginative" grammar and spelling. For instance, this real tweet is in English but, except for "amazing", none of its words would appear in a (sensible) dictionary:


@justinbieber omg Justin bieber ur amazing lol : )

Therefore, I argued that using n-grams could be a more robust approach and, besides, the model could be trained on different data once we are sure the tweets are actually written in a given language.

So, first of all, what's an n-gram? A subsequence of n successive characters extracted from a given text string. For example, in the previous tweet we'd find the following 3-grams:

@ju
jus
ust
sti

...

lol
ol
l :
 : )

N-grams can be obtained for texts of any length and, thus, the underlying idea is to collect a list of n-grams (ideally with their relative frequency or, even better, their usage probability) from a collection of documents.
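As a sketch (again, not the actual implementation), a per-language profile with relative frequencies could be built on top of the charNgrams() function above:

<?php
// Sketch: build an n-gram profile (n-gram => relative frequency) from a
// training text for one language. Reuses charNgrams() from the snippet above.
function ngramProfile($trainingText, $n) {
    $counts = array_count_values(charNgrams($trainingText, $n));
    $total = array_sum($counts);
    $profile = array();
    foreach ($counts as $ngram => $count) {
        $profile[$ngram] = $count / $total; // relative frequency
    }
    return $profile;
}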

Ideally, the collection should be similar to the documents you are going to identify; that is, if you are going to classify tweets you shouldn't train on Shakespeare's works. In practice, however, you will probably use whatever documents you can find (for this post I've used the text of "The Universal Declaration of Human Rights").

Then, for any document you want to classify, you just need to obtain a similar n-gram vector and compute its similarity to each language model (e.g. cosine, Jaccard, or Dice).
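For instance, cosine similarity between two profiles could be sketched like this (again, just an illustration, not the linked code):

<?php
// Sketch: cosine similarity between two n-gram profiles, i.e. associative
// arrays mapping n-gram => relative frequency (or any other weight).
function cosineSimilarity(array $a, array $b) {
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $ngram => $weight) {
        $normA += $weight * $weight;
        if (isset($b[$ngram])) {
            $dot += $weight * $b[$ngram];
        }
    }
    foreach ($b as $weight) {
        $normB += $weight * $weight;
    }
    if ($normA == 0.0 || $normB == 0.0) {
        return 0.0;
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}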

Needless to say, when the document to classify is very short (such as a tweet) most of the n-grams appearing in it will occur only once and, thus, awkward results can be obtained. If you are performing language identification on such short texts it's much better to simply count how many of the short text's n-grams appear in each language model and choose the language with the largest coverage.
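A sketch of this coverage-based classification, assuming each language model is simply the set of n-grams observed in its training text:

<?php
// Sketch: for a very short text, count how many of its distinct n-grams
// appear in each language model (here, a plain array of known n-grams)
// and pick the language with the largest intersection.
function classifyByCoverage($text, array $models, $n = 4) {
    $textNgrams = array_unique(charNgrams($text, $n));
    $bestLanguage = 'unknown';
    $bestHits = 0;
    foreach ($models as $language => $knownNgrams) {
        $hits = count(array_intersect($textNgrams, $knownNgrams));
        if ($hits > $bestHits) {
            $bestHits = $hits;
            $bestLanguage = $language;
        }
    }
    return array($bestLanguage, $bestHits);
}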

For instance, let's take the following short texts:

(German, 48 4-grams) Als er erwachte, war der Dinosaurier immer noch da.
(Galician, 44 4-grams) Cando espertou, o dinosauro aínda estaba alí.
(Spanish, 51 4-grams) Cuando despertó, el dinosaurio todavía estaba allí.
(Basque, 43 4-grams) Esnatu zenean, dinosauroa han zegoen oraindik.
(Catalan, 46 4-grams) Quan va despertar, el dinosaure encara era allà.
(English, 43 4-grams) When [s]he awoke, the dinosaur was still there.

Using the model I've built, each of the texts has the following significant intersections:

Als er erwachte, war der Dinosaurier immer noch da. => German, 27 common 4-grams
Cando espertou, o dinosauro aínda estaba alí. => Portuguese, 18 common 4-grams
Cando espertou, o dinosauro aínda estaba alí. => Galician, 17 common 4-grams
Cuando despertó, el dinosaurio todavía estaba allí. => Spanish, 21 common 4-grams
Cuando despertó, el dinosaurio todavía estaba allí. => Asturian, 20 common 4-grams
Esnatu zenean, dinosauroa han zegoen oraindik. => Basque, 17 common 4-grams
Quan va despertar, el dinosaure encara era allà. => Catalan, 21 common 4-grams
Quan va despertar, el dinosaure encara era allà. => Spanish, 20 common 4-grams
Quan va despertar, el dinosaure encara era allà. => Asturian, 20 common 4-grams
When [s]he awoke, the dinosaur was still there. => English, 15 common 4-grams

If we choose the language with the largest intersection, then each text is classified as follows:
Als er erwachte, war der Dinosaurier immer noch da. => German, Correct!
Cando espertou, o dinosauro aínda estaba alí. => Portuguese, Incorrect, but a near miss
Cuando despertó, el dinosaurio todavía estaba allí. => Spanish, Correct!
Esnatu zenean, dinosauroa han zegoen oraindik. => Basque, Correct!
Quan va despertar, el dinosaure encara era allà. => Catalan, Correct!
When [s]he awoke, the dinosaur was still there. => English, Correct!

Another advantage of n-gram models is that they degrade gracefully.

For instance, classifying a short text written in Galician as Portuguese is rather acceptable.

Or let's take this text:

"Hrvatski jezik skupni je naziv za standardni jezik Hrvata, i za skup narjecja i govora kojima govore ili su nekada govorili Hrvati."

It's actually Croatian but, since I did not train my system on Croatian samples, it's classified as Serbian, which, again, is reasonable.

In addition to this (hopefully) explanatory post, I've developed a bit of source code. You can try the demo and download the source code and data files (it's PHP, so proceed at your discretion).

As usual, if you want to discuss something on this post, just tweet me at @pfcdgayo


