Building blocks for semantic search engines:
Ranking and compact indexing in entity-relation graphs
Soumen Chakrabarti
We see an evolutionary path to supporting semantic search over text
facilitated by 1. extractors and annotators for ever-growing collections
of entity and relation types and 2. search systems that exploit a smooth
continuum between structured entities and relations on one hand and
uninterpreted text on the other. The extractors and annotators will be
imperfect and incomplete. Unlike in traditional data warehousing, the
unstructured source cannot be forgotten once structured data is curated.
The source text lives on, with annotations as probabilistic connections
to one or more ontologies. Queries involve ontology elements as well as
uninterpreted strings. Searchers know only bits and pieces of schema.
The query language must enable schema-free searches but reward schema
knowledge. In such a system, ranking of results cannot be made all
explicit as in SQL. In the first part of the talk I visit two issues
related to scoring and ranking results. In one scenario we wish to rank
mentions of instances of a given type (e.g. distance) based on their
textual proximity to query keywords (e.g. Paris Helsinki). Such queries
are crude but effective filters for simple questions like "what is the
distance between Paris and Helsinki". In the other scenario I generalize
the setting to learning ranking functions in arbitrary E-R graphs, given
partial order preferences. In the second part of the talk I discuss the
difficulties faced in indexing and query processing. I describe new
techniques to estimate query-processing cost and cost-driven index
compaction based on query logs. The work described is embodied in a
system we are building that we plan to release in the public domain.