The project, with partners here at EPFL, conducts research into the design, use, and interoperability of topic-specific search engines, with the goal of developing an open source prototype of a distributed, semantic-based search engine (the architecture is shown in the figure below).
Existing search engines provide a poor foundation for Semantic Web operations, and US companies such as Google are approaching monopoly status, distorting the entire information landscape. Our approach is not the traditional Semantic Web approach of hand-coded or semi-automatically extracted metadata, but rather an engine that builds on content through automatic analysis. Linguistic processing runs inside the search engine, and a probabilistic document model provides a principled evaluation of relevance that complements existing standard authority scores. This facilitates semantic retrieval and incorporates pre-existing domain ontologies through facilities for import and maintenance.

The distributed design rests on exposing search objects as resources and on using implicit, automatically generated semantics (not ontologies) to distribute queries and merge results. Because semantic expressivity and interoperability are competing goals, developing a system that is both distributed and semantic-based is the key challenge: the research addresses both the statistical and linguistic form of the semantic internals and the extent to which those internals are exposed at the interface.
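To make the idea of a probabilistic relevance score complementing an authority score concrete, here is a minimal sketch in Python. It is not the project's actual model: it assumes a standard query-likelihood unigram language model with Dirichlet smoothing for relevance, and an unspecified link-based authority value (e.g. a PageRank-like score) mixed in log-linearly. The function names, the smoothing parameter `mu`, and the mixing weight `alpha` are all illustrative choices, not part of the project.

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, collection_freq, collection_len, mu=2000.0):
    """Log P(query | document) under a unigram language model
    with Dirichlet smoothing against collection statistics."""
    tf = Counter(doc_terms)
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_coll = collection_freq.get(t, 0) / collection_len
        if p_coll == 0:
            continue  # term unseen in the whole collection: skip rather than score -inf
        score += math.log((tf[t] + mu * p_coll) / (dl + mu))
    return score

def combined_score(relevance, authority, alpha=0.8):
    """Log-linear mix of model-based relevance and a link-based
    authority score (authority assumed in (0, 1])."""
    return alpha * relevance + (1 - alpha) * math.log(authority)

# Toy collection: two short "documents".
docs = {
    "d1": "semantic search engine".split(),
    "d2": "open source prototype".split(),
}
all_terms = [t for terms in docs.values() for t in terms]
cf = Counter(all_terms)
cl = len(all_terms)

query = ["semantic", "search"]
for doc_id, terms in docs.items():
    rel = query_likelihood(query, terms, cf, cl)
    print(doc_id, combined_score(rel, authority=0.5))
```

A document containing the query terms ("d1") scores higher than one that does not ("d2"); the authority term then shifts the ranking among documents of comparable relevance.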
Tags: google, information retrieval, open source, p2p, search engine