Generating Accurate Training Data from Implicit Feedback

Thorsten Joachims
Cornell University, Department of Computer Science

Machine learning methods have shown much promise in designing adaptive and personalized information systems, ranging from email readers to search engines. But how can we generate the data to train these systems? The availability of training data is the crucial bottleneck in many of these applications, since generating training data manually is time-consuming and often goes beyond the user's willingness to participate. To overcome this bottleneck, researchers have tried to infer training data from observable user behavior. Such implicit feedback can be collected at low cost and in huge quantities, but does it provide valid training data?

In this talk, we propose and analyze strategies for generating training data from observable user behavior. Focusing on clickthrough data in web search, we conducted an eye-tracking study to analyze the relationship between user behavior and the relevance of a page. The study shows that a particular interpretation of clickthrough data provides reliable training data. While clicks do not indicate the relevance of a page on an absolute scale, they do accurately indicate relative preferences of the kind "for query Q, document A should be ranked higher than document B".
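
To make the idea of relative training data concrete, the following is a minimal sketch of how pairwise preferences might be extracted from a single logged query. The abstract does not spell out the interpretation it evaluates, so the rule used here (a clicked document is preferred over every unclicked document ranked above it) is an assumption chosen for illustration, and the function name and document identifiers are hypothetical.

```python
from typing import List, Set, Tuple


def preferences_from_clicks(ranking: List[str], clicked: Set[str]) -> List[Tuple[str, str]]:
    """Turn one query's clicks into (preferred, less_preferred) document pairs.

    Assumed rule (illustrative, not taken from the abstract): a clicked
    document is preferred over each unclicked document shown above it.
    """
    pairs = []
    for pos, doc in enumerate(ranking):
        if doc not in clicked:
            continue
        # Every unclicked document ranked above the click was seen and skipped.
        for skipped in ranking[:pos]:
            if skipped not in clicked:
                pairs.append((doc, skipped))
    return pairs


# Hypothetical log entry: four results shown for query Q, only the third clicked.
ranking = ["d1", "d2", "d3", "d4"]
clicked = {"d3"}
pairs = preferences_from_clicks(ranking, clicked)
print(pairs)
```

Each pair reads as "for query Q, the first document should be ranked higher than the second". Note that the sketch says nothing about d4, which sits below the click, and it assigns no absolute relevance label to any document.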