Abstract
This paper considers how web search engines can learn from the successful searches recorded in their user logs. Document Transformation is a feasible approach that uses these logs to improve document representations. Existing test collections do not allow an adequate investigation of Document Transformation, but we show how a rigorous evaluation of this method can be carried out using the referer logs kept by web servers. We also describe a new strategy for Document Transformation that is suitable for long-term incremental learning. Our experiments show that Document Transformation improves retrieval performance over a medium-sized collection of web pages. Commercial search engines may be able to achieve similar improvements by incorporating this approach.
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kemp, C., Ramamohanarao, K. (2002). Long-Term Learning for Web Search Engines. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_22
DOI: https://doi.org/10.1007/3-540-45681-3_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0
eBook Packages: Springer Book Archive