Building blocks for semantic search engines: Ranking and compact indexing in entity-relation graphs

Soumen Chakrabarti

We see an evolutionary path to supporting semantic search over text facilitated by 1. extractors and annotators for ever-growing collections of entity and relation types and 2. search systems that exploit a smooth continuum between structured entities and relations on one hand and uninterpreted text on the other. The extractors and annotators will be imperfect and incomplete. Unlike in traditional data warehousing, the unstructured source cannot be forgotten once structured data is curated. The source text lives on, with annotations as probabilistic connections to one or more ontologies. Queries involve ontology elements as well as uninterpreted strings. Searchers know only bits and pieces of schema. The query language must enable schema-free searches but reward schema knowledge. In such a system, ranking of results cannot be made all explicit as in SQL. In the first part of the talk I visit two issues related to scoring and ranking results. In one scenario we wish to rank mentions of instances of a given type (e.g. distance) based on their textual proximity to query keywords (e.g. Paris Helsinki). Such queries are crude but effective filters for simple questions like "what is the distance between Paris and Helsinki". In the other scenario I generalize the setting to learning ranking functions in arbitrary E-R graphs, given partial order preferences. In the second part of the talk I discuss the difficulties faced in indexing and query processing. I describe new techniques to estimate query-processing cost and cost-driven index compaction based on query logs. The work described is embodied in a system we are building that we plan to release in the public domain.