Generating Accurate Training Data from Implicit Feedback

Thorsten Joachims
Cornell University, Department of Computer Science


Machine learning methods have shown much promise in designing adaptive
and personalized information systems, ranging from email readers to
search engines. But how can we generate the data to train these
systems? The availability of training data is the crucial bottleneck
in many of these applications, since generating training data manually
is time-consuming and often goes beyond the user's willingness to
participate. To overcome this bottleneck, researchers have tried to
infer training data from observable user behavior. Such implicit
feedback can be collected at low cost and in huge quantities, but does
it provide valid training data? In this talk, we propose and analyze
strategies for generating training data from observable user
behavior. Focusing on clickthrough data in web search, we conducted an
eye-tracking study to analyze the relationship between user behavior
and the relevance of a page. The study shows that a particular
interpretation of clickthrough data provides reliable training
data. While clicks do not indicate the relevance of a page on an
absolute scale, they do accurately convey relative preferences of
the kind "for query Q, document A should be ranked higher than
document B".
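
To make the relative interpretation concrete, the following is a minimal
sketch of how pairwise preferences might be extracted from a single
query's click log. The abstract does not spell out the exact strategy, so
the heuristic used here (a clicked result is preferred over every
unclicked result ranked above it) is an assumption, and the function name
preferences_from_clicks and the document identifiers are hypothetical.

```python
from typing import List, Set, Tuple


def preferences_from_clicks(
    ranking: List[str], clicked: Set[str]
) -> List[Tuple[str, str]]:
    """Return (preferred, less_preferred) document pairs for one query.

    Assumed heuristic (not taken from the abstract): a clicked document
    is preferred over each unclicked document shown above it.
    """
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        # Every unclicked document ranked above the click was skipped.
        for skipped in ranking[:pos]:
            if skipped not in clicked:
                pairs.append((doc, skipped))
    return pairs


# Hypothetical log entry: results shown in this order, user clicked d3.
ranking = ["d1", "d2", "d3", "d4"]
clicked = {"d3"}
for better, worse in preferences_from_clicks(ranking, clicked):
    print(f"for this query, {better} should be ranked higher than {worse}")
```

Note that the sketch makes no statement about d4: under this assumed
heuristic, a click says nothing about results ranked below it, and
nothing about the absolute relevance of any page.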