With hundreds of millions of pages of information on the Internet, search has become a fundamental service. The abundance of information sets new challenges for even the best current search engines; qualitatively better ways of answering user queries are needed. Our research develops a search engine kernel to support next-generation search services: a subject-specific node in a distributed, hierarchical system for navigation and search. The system could be deployed over part of the Internet, an intranet, or another document repository. A subject-specific node may hold tens of millions of pages, and needs to automatically build its own hierarchies for topic, genre, and terminology, aspects of the document set that we together call a concept map. This project will provide the statistical computing techniques, and their implementations, needed to build a search engine kernel for large (gigabyte- to terabyte-scale) document sets. These techniques are needed to implement features such as hierarchical multi-aspect clustering, automatic extraction of subject-specific topic hierarchies, and intelligent query matching. In addition to basic research on methodological aspects, the project will also develop C/C++ libraries based on existing Open Source scientific libraries. All code developed for the kernel will follow Open Source licensing in both a source and a library model, to allow its inclusion in other projects.
The strength of the Internet is that there are billions of pages of information available on an amazing variety of topics, in formats ranging from newsgroups, magazines, and references to technical data, tutorials, and sales literature. The weakness of the Internet is that there are billions of pages of information, most of them titled at the whim of their authors, using subtly different terminology that defeats keyword search. Subject-specific search sites have emerged to help with this situation, yet they are time-consuming to maintain, only sometimes provide good coverage (Citeseer, for computer science research papers, is one successful example), and rarely provide a sophisticated interface. More sophisticated methods, such as analyzing the structure of pages with their mixed topics, stylistic variations, and choices of terminology, are only beginning to be understood.
One approach to addressing this is Semantic Web research, which defines standards for integrating semantics into pages, based for instance on an ontology or the Resource Description Framework (RDF). This approach is arguably suitable within a single organization, especially where the functionality required is narrow enough that ontologies or schemas can be limited and thus realistically supported; it has proven useful for routing users to subject-specific sites. A second, related approach uses machine learning to extract specific information and thus populate a knowledge base or the semantic tags attached to pages. This, however, fails to support general search with its arbitrary queries. Given the chaotic nature of the Internet and of other large document collections where arbitrary search is required, we can instead hope to automatically generate some of the semantic information needed to support more meaningful search and browsing, but not a full ontology. It is known that, at large scale, organizing pages hierarchically by topic provides superior classification and structuring in terms of both relevance and computation. Returning results by topic improves the end-user experience and allows search to be distributed. In more restricted laboratory settings, researchers have shown that integrating synonyms and topics into the search system can improve the quality of results; however, such integration has many pitfalls, and naïve integration can in fact be damaging. Thus we are developing a software kernel to support a subject-specific node in a distributed, hierarchical system for navigation and search over pages. The node may have tens of millions of pages, and needs to automatically build its own hierarchies for topic, genre, and terminology, aspects of the document set that we together call a concept map.
We stress that for our purposes the concept map is a technical construct, and exists to power the search and navigation process at a subject-specific site. It requires neither grammatically correct linguistic analysis nor carefully defined, expert-approved ontologies.
Consider the following interactions one might support on such a server. On a history database, two different queries might be:
The general query would reflect suggestions one might expect from a library system or librarian. A more specific query would be given in free text. The specialized presentation of results reflects a detailed analysis of the pages in the collection, at a level a librarian could not be expected to master; indeed, it could come as a surprise to anyone not familiar with the period.