Notes from the last meeting with M. Rajman (30.5.2006)
Standard Information Retrieval techniques provide a solid treatment of similarity computation and ranking, based on well-tested and widely accepted methodologies. In a multidimensional space, the similarity between two points is computed from the angle between the vectors representing them: the smaller the angle, the more similar the points. The measure used is the cosine of that angle, and it is called Cosine Similarity.
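As a reminder for myself, Cosine Similarity can be sketched in a few lines. This is only a minimal illustration, assuming documents are already represented as plain lists of numeric term weights of equal length; the function name `cosine_similarity` is my own and does not come from any particular library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: a zero vector is treated as having no similarity to anything.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Vectors pointing in the same direction give a value of 1 and orthogonal vectors give 0, regardless of their lengths.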
Standard IR ranking measures are enhanced variants of the tf*idf formula, such as BM25 (a.k.a. “Okapi”) and Divergence From Randomness, DFR (a.k.a. “Prosit”).
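To make the BM25 idea concrete, here is a rough sketch of one commonly used variant of the formula. It is not the Terrier implementation: the function name `bm25_scores`, the defaults k1 = 1.2 and b = 0.75, and the particular smoothed idf are assumptions chosen for illustration, and documents are assumed to be pre-tokenised lists of terms.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against the query terms."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed idf (an assumed variant that never goes negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalised by length via b.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores
```

Compared with plain tf*idf, the term-frequency contribution saturates instead of growing linearly, and longer documents are penalised relative to the average length.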
One piece of advice from Dr. Rajman was to rely on these standard Information Retrieval techniques instead of spending energy on implementing a new ranking system. A standard Information Retrieval platform such as “Terrier” makes these standard solutions available.
As a second step in the discussion, we talked about Multidimensional Filtering. One of the objectives of the thesis is to define a criterion for mixing the different types of relevance that we are considering (e.g., semantic relevance, geographic relevance, popularity / social relevance). These features are neither compatible nor comparable, so one idea is not to mix them at all.
We can use a technique called sequence filtering, which works on the principle that “rejection is local and acceptance is global”: any single feature can reject a document, but a document is accepted only if it passes every feature. We start filtering with the feature that discriminates the most and then move on to the next one. In other words, this method takes all the features into account without combining them into a single score.
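My understanding of this filtering scheme can be sketched as follows. This is only an illustration under my own assumptions: the stages are passed in already ordered from most to least discriminating, each stage is a simple accept/reject test on a document, and the names `sequential_filter`, `semantic` and `distance_km` are hypothetical.

```python
def sequential_filter(documents, stages):
    """Apply (name, accept) stages in order and return the survivors.

    Rejection is local: one failed stage removes a document.
    Acceptance is global: survivors have passed every stage.
    """
    survivors = list(documents)
    for _name, accept in stages:
        survivors = [doc for doc in survivors if accept(doc)]
        if not survivors:
            break  # nothing left to filter
    return survivors


# Hypothetical documents with two unrelated relevance features.
example_docs = [
    {"id": 1, "semantic": 0.9, "distance_km": 2},
    {"id": 2, "semantic": 0.2, "distance_km": 1},
    {"id": 3, "semantic": 0.8, "distance_km": 50},
]
example_stages = [
    ("semantic", lambda doc: doc["semantic"] >= 0.5),
    ("geographic", lambda doc: doc["distance_km"] <= 10),
]
accepted = sequential_filter(example_docs, example_stages)
```

Each feature keeps its own scale and its own threshold, so semantic scores and distances never have to be added or weighted against each other.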
As a third step in the discussion, Dr. Rajman illustrated his view on another big challenge: deciding when a document is accepted as relevant. The final step in defining the acceptance criterion for a given feature is to define its acceptance boundaries.
Since relevance is distributed as a continuum over the document set, we need to decide where acceptance starts. This can be done on the relevance axis, using a minimum relevance boundary Rmin, or on the document axis, using a k-best cutoff over the documents ordered by relevance. A combination of these criteria is also possible. One of the big challenges of my thesis work will be to infer these boundaries experimentally.
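The two kinds of boundary, and their combination, can be sketched like this. It is a minimal illustration assuming each document already has a single relevance score for the feature in question; the function name `select_accepted` and the parameter names `r_min` and `k` are mine.

```python
def select_accepted(scored_docs, r_min=None, k=None):
    """Return (doc, relevance) pairs accepted under the given boundaries.

    r_min cuts on the relevance axis, k cuts on the document axis.
    Passing both applies the threshold first, then keeps the k best.
    """
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    if r_min is not None:
        ranked = [pair for pair in ranked if pair[1] >= r_min]
    if k is not None:
        ranked = ranked[:k]
    return ranked
```

With only `r_min`, the number of accepted documents varies with the query; with only `k`, the quality of the last accepted document varies. Using both bounds the result on each axis.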
To summarise, the challenges of my thesis work are: 1) selecting the relevant features needed for the retrieval process; 2) running a user study to define the acceptance boundaries for the retrieval process; and 3) validating these parameters through an experimental study.
Tags: data mining, Exploratory Data Analysis, information retrieval, experimental design