1. Latent Semantic Analysis is a technique that can be used to measure the similarity between messages. So far I have not found a good implementation in Python, except the one by James Stanley. The pros: it is reliable and based on solid statistical methods. The cons: it works with a fixed corpus supplied by the authors, one that does not change over time (a supervised method).
There are other approaches that instead rely on Google or other web resources; see for instance this paper by Marco Baroni. Using LSA we can ultimately graph something like this:
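As a rough illustration of the LSA step (not the James Stanley implementation mentioned above), one could sketch it with scikit-learn: build a tf-idf term-document matrix, reduce it with truncated SVD, and compare messages by cosine similarity in the latent space. The messages below are invented placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative messages only; in practice these would be the geo-tagged texts.
messages = [
    "concert tonight at the main square",
    "live music event in the city square",
    "road works on the northern bridge",
]

# Term-document matrix weighted by tf-idf.
tfidf = TfidfVectorizer().fit_transform(messages)

# Project into a low-rank latent semantic space (2 dimensions here).
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# Pairwise cosine similarity of messages in the latent space.
sim = cosine_similarity(lsa)
```

Here the two music-related messages end up closer to each other in the latent space than either is to the road-works message.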
2. Why clustering? What we want is to support the user in the exploration of space. The clusters then map to the concept of a Landmark, a pinpointing of “features” onto the physical space.
3. About LSA. Once the similarity has been computed with the LSA algorithm, we only have to cluster points with numerical features using a geographical criterion. We lose the semantic dimension. We compute the LSA similarity only on the keyword set, to avoid noise.
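The geographical criterion could, for instance, be a density-based spatial clustering such as DBSCAN (my choice for the sketch; the post does not commit to a specific algorithm). The coordinates and the `eps` radius below are made-up illustration values in raw degrees.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# (lat, lon) of geo-tagged messages: two tight groups plus one outlier.
# These coordinates are invented for illustration.
points = np.array([
    [45.070, 7.685], [45.071, 7.686], [45.072, 7.684],  # group A
    [45.110, 7.720], [45.111, 7.721],                   # group B
    [45.300, 7.900],                                    # isolated point
])

# eps is a neighbourhood radius in degrees; min_samples=2 means a point
# needs at least one close neighbour to belong to a cluster.
labels = DBSCAN(eps=0.005, min_samples=2).fit_predict(points)
```

DBSCAN labels the two dense groups as separate clusters and marks the isolated point as noise (`-1`), which matches the intuition of landmarks emerging from dense pockets of messages.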
4. The clusters are static within a certain period of time and are updated periodically. A search query can be mapped against the clusters’ keywords (a composition of the message keywords that generated them) using LSA again; this gives a gradient of matching results within the DB content. A simplification might be:
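The query-matching step could be sketched as follows: represent each cluster by the composition of its keywords, fit the LSA space on those keyword sets, then project the query into the same space and rank clusters by cosine similarity. The cluster keyword strings and the helper `match` are my own invention for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Each string stands for the keyword composition of one cluster (invented).
cluster_keywords = [
    "music concert stage band festival",
    "food market vegetables stall cheese",
    "traffic accident road closed detour",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(cluster_keywords)
svd = TruncatedSVD(n_components=2, random_state=0).fit(tfidf)
cluster_vecs = svd.transform(tfidf)

def match(query):
    """Rank clusters by LSA similarity to the query (hypothetical helper)."""
    q = svd.transform(vectorizer.transform([query]))
    scores = cosine_similarity(q, cluster_vecs)[0]
    return sorted(enumerate(scores), key=lambda s: -s[1])

ranking = match("live band playing")
```

The result is not a binary match but a graded list of (cluster, score) pairs, i.e. the “gradient of matching results” described above.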
5. The cluster machine. What is still missing here is the inner working of the cluster machine. In fact, I have to clarify whether this will be based on an existing method or derived from one of them into a custom-made formula. I am currently reviewing a survey of these methods, where I discovered Non-Spatial-Data-Dominant Generalization, which should provide more flexibility in the aggregation of data.
Tags: clustering, map algorithms, Latent Semantic Analysis, spatial clustering, statistics, tagging