Trigram language guessing

I found a nice Python module for language guessing that is based on trigram vectors. The idea is very simple: a document is analysed by looking at triplets of characters, and the frequency of each three-character sequence is calculated. When treated as a vector, this frequency profile can be compared to the profiles of other texts, and the difference between them seen as an angle. The cosine of this angle varies between 1 for complete similarity and 0 for utter difference. Since letter combinations are characteristic of a language, this can be used to determine the language of a body of text.
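To make the idea concrete, here is a minimal sketch of the technique in plain Python. It is not the module mentioned above, only an illustration under simple assumptions: the function names (`trigram_vector`, `cosine_similarity`, `guess_language`) are made up for this example, text is lowercased and whitespace is collapsed, and the reference profiles are built from tiny sample sentences rather than a real corpus.

```python
from collections import Counter
from math import sqrt


def trigram_vector(text):
    """Count every three-character sequence in the text (hypothetical helper)."""
    # Lowercase and collapse whitespace so layout does not affect the counts.
    text = " ".join(text.lower().split())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def cosine_similarity(a, b):
    """Cosine of the angle between two trigram count vectors."""
    # Counter returns 0 for missing trigrams, so unseen trigrams add nothing.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def guess_language(text, references):
    """Return the key of the reference profile most similar to the text."""
    target = trigram_vector(text)
    return max(references, key=lambda lang: cosine_similarity(target, references[lang]))


# Toy reference profiles; a real guesser would train on much larger samples.
references = {
    "english": trigram_vector(
        "the quick brown fox jumps over the lazy dog and then the other thing"
    ),
    "french": trigram_vector(
        "le renard brun saute par dessus le chien paresseux et puis les autres"
    ),
}

print(guess_language("the dog and the fox", references))
```

Because the counts are never negative, the cosine stays between 0 and 1, which matches the range described above. With reference samples this small the guess is fragile; the quality of the reference profiles matters far more than the comparison itself.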

These methods belong to the family of N-gram algorithms, which are also used in speech recognition.

I integrated this module into my tagger so that it switches between different language parameter files. You can see the results in the difference between the tags generated without and with the language guesser module below:

2005-10-03 13:42:01,542 - main - INFO - -- Tagging session started

2005-10-03 14:05:08,076 - main - INFO - The number of messages tagged is: 190

2005-10-03 14:05:08,122 - main - INFO - The number of new tags created is: 162

2005-10-03 14:05:08,123 - main - INFO - The number of messages dumped is: 47

2005-10-03 14:05:08,123 - main - INFO - -- Tagging session ended
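For readers wondering how a tagger might switch parameter files based on the guessed language, here is a rough sketch. It is not the actual tagger code: the file names, the `PARAM_FILES` mapping, the `select_param_file` function and the `min_score` threshold are all hypothetical, and the language guesser is passed in as a function that is assumed to return a `(language, score)` pair.

```python
from pathlib import Path

# Hypothetical mapping from language code to parameter file.
PARAM_FILES = {
    "en": Path("params_en.cfg"),
    "fr": Path("params_fr.cfg"),
    "it": Path("params_it.cfg"),
}
DEFAULT_LANGUAGE = "en"  # assumed fallback when the guess is unreliable


def select_param_file(text, guess_language, min_score=0.3):
    """Pick the parameter file for the language guessed from the text.

    guess_language is any callable returning (language_code, similarity);
    min_score is an illustrative threshold, not a tuned value.
    """
    language, score = guess_language(text)
    # Fall back to the default for weak guesses or unsupported languages.
    if score < min_score or language not in PARAM_FILES:
        language = DEFAULT_LANGUAGE
    return PARAM_FILES[language]
```

Falling back to a default language keeps the tagger running on very short messages, where trigram profiles tend to be too sparse to trust.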

Tags: natural language processing, tagging

Mauro Cherubini

Professor at the University of Lausanne, Switzerland
