Scalable Probabilistic Methods for the Next Generation Search Engine (PROSE)
With hundreds of millions of pages of information on the Internet,
search has become a fundamental service. The abundance of information
sets new challenges for even the best current search
engines. Qualitatively better ways are needed to answer user
queries. Our research develops a search engine kernel to
support next generation search services: a subject-specific node in a
distributed, hierarchical system for supporting navigation and
search. The system would be used on some part of the Internet, an
intranet or other document repository. The subject-specific node may
have tens of millions of pages, and needs to automatically build its
own hierarchies for topic, genre, and terminology - aspects of the
document set that we together call a concept map. This project will
provide the statistical computing techniques and their implementations
needed to build a search engine kernel intended for large (gigabyte
and terabyte) document data sets. These techniques are needed to
implement features such as hierarchical multi-aspect clustering,
automatic extraction of subject-specific topic hierarchies, and
intelligent query matching. In addition to the basic research on
methodological aspects, the project will also develop C/C++ libraries
based on existing Open Source
scientific libraries. All the code to be developed for the kernel will
follow Open Source licensing in both a source and library model to
allow its inclusion in other projects.
Search is a fundamental service in information systems
The strength of the Internet is that there are billions of pages of
information available on an amazing
variety of topics in the form of newsgroups, magazines, references,
technical data, tutorials, sales literature, etc. The weakness of the
Internet is that there are billions of pages of information, most of
them titled according to the whim of their author using subtly
different terminology to fool keyword search. Subject-specific search
sites have emerged to provide help for this situation, yet they are
time consuming to maintain, only sometimes provide good coverage
(Citeseer for computer science research papers is one successful
example), and rarely provide a sophisticated interface. More
sophisticated methods such as the analysis and structure of pages with
their mixed topics, stylistic variations, and choice of terminology
are just beginning to be understood.
The future of search: a scalable, distributed, hierarchical concept map.
One approach to addressing the above is Semantic Web research. This
defines standards for integrating semantics into pages, based for
instance on an ontology or resource description framework (RDF). This
approach is arguably suitable for a single organization, especially
where particular functionality exists so that ontologies or schemas
can be limited and thus can be realistically supported. This approach
has proven useful to route users to subject-specific sites. A second,
related approach uses machine learning to extract specific information
and thus populate a knowledge base or the semantic tags attached to
pages. This, however, fails to support general search with its arbitrary
queries. With the chaotic nature of the Internet and other large
document collections where arbitrary search is required, we can
instead hope to automatically generate some of the semantic
information necessary to support more meaningful search and browsing,
but not a full ontology. For large-scale collections, organizing
pages hierarchically by topic is known to provide superior
classification and structuring in terms of relevance and
computation. Returning results by topic improves the end-user
experience, and allows search to be distributed. In more restricted
laboratory settings, researchers have shown that integrating synonyms
and topics into the search system can improve the quality of results.
Such an approach has many pitfalls, however, and naive
integration can in fact be damaging. Thus we are developing a software
kernel to support a subject-specific node in a distributed,
hierarchical system for supporting navigation and search on pages. The
node may have tens of millions of pages, and needs to automatically
build its own hierarchies for topic, genre, and terminology, aspects
of the document set that we together call a concept map. We stress
that for our purposes the concept map is a technical construct, and
exists to power the search and navigation process at a
subject-specific site. It requires neither grammatically correct
linguistic analysis nor carefully defined and expert-approved
ontologies.
An Example
Consider the following interactions one might support on such a
server. On a history database two different queries would be:
- General: key phrases like "Roman empire". No results are given, but
  suggestions are offered for specializing the query with topic or
  genre, such as military achievements, politics, Christianity,
  discussion groups, introductions, book reviews, etc.
- Specific: free text such as "sexuality in the Roman empire" plus the
  keyword "Diana" to be matched exactly. Results might be ordered under
  headings such as homosexuality, moral norms, art, and religious
  ceremonies. Results should have intersecting topics with the free
  text but not necessarily the same words.
- Refinement: a "find more pages like this" option. This could be
  selected from one of the results pages.
The general query would reflect suggestions one might expect from a
library system or
librarian. A more specific query would be given in free text. The
specialized presentation of results reflects a detailed analysis of
the pages in the collection at a level a librarian could not be
expected to master and indeed could come as a surprise to anyone not
familiar with the period.
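
The three interactions above can be illustrated with a minimal Python
sketch. It is not the project's kernel: the Page class, the function
names, the four toy pages, and their hand-assigned topic and genre
labels are all assumptions made for illustration. In the system
described, the labels would come from the automatically built concept
map, and the free text of a query would first be mapped to topics by
that same analysis; here the query topics are simply passed in.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    title: str
    text: str
    topics: frozenset  # hand-assigned here; a real system would infer these
    genre: str


# Hypothetical toy collection, invented for this sketch.
PAGES = [
    Page("Legions on the Rhine",
         "The army held the northern frontier for two centuries.",
         frozenset({"military achievements"}), "introduction"),
    Page("The cult of Diana at Nemi",
         "Diana was honoured with torch-lit processions.",
         frozenset({"religious ceremonies", "moral norms"}), "reference"),
    Page("Diana in wall painting",
         "Frescoes show Diana and her hunting party.",
         frozenset({"art"}), "book review"),
    Page("Spring festival rites",
         "Priests led public rites each spring.",
         frozenset({"religious ceremonies"}), "reference"),
]


def suggest_refinements(pages):
    """General query: return topic and genre suggestions, not results."""
    topics = sorted({t for p in pages for t in p.topics})
    genres = sorted({p.genre for p in pages})
    return topics, genres


def specific_query(pages, query_topics, exact_keyword):
    """Specific query: group pages under the topics shared with the query.

    A page qualifies if its text contains exact_keyword verbatim and it
    shares at least one topic with query_topics. Sharing words with the
    free text is not required.
    """
    grouped = defaultdict(list)
    for page in pages:
        if exact_keyword not in page.text:
            continue
        for topic in sorted(page.topics & query_topics):
            grouped[topic].append(page.title)
    return dict(grouped)


def more_like_this(pages, seed):
    """Refinement: rank other pages by Jaccard overlap of topic sets."""
    scored = []
    for page in pages:
        if page is seed:
            continue
        union = page.topics | seed.topics
        score = len(page.topics & seed.topics) / len(union) if union else 0.0
        if score > 0:
            scored.append((score, page.title))
    # Highest overlap first; ties broken alphabetically by title.
    return [title for _, title in sorted(scored, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    print(suggest_refinements(PAGES))
    wanted = frozenset({"moral norms", "art", "religious ceremonies"})
    print(specific_query(PAGES, wanted, "Diana"))
    print(more_like_this(PAGES, PAGES[1]))
```

Topic-set overlap stands in here for whatever similarity measure the
kernel would actually use. It is chosen only because it makes the point
of the example visible: a page can match on shared topics without
sharing any words with the query.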