TupleRank: Ranking Discovered Content in Virtual Databases

  • Conference paper
Next Generation Information Technologies and Systems (NGITS 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4032)

Abstract

Recently, the problem of data integration has been newly addressed by methods based on machine learning and discovery. Such methods are intended to automate, at least in part, the laborious process of information integration, by which existing data sources are incorporated in a virtual database. Essentially, these methods scan new data sources, attempting to discover possible mappings to the virtual database. Like all discovery processes, this process is intrinsically probabilistic; that is, each discovery is associated with a specific value that denotes assurance of its appropriateness. Consequently, the rows in a discovered virtual table have mixed assurance levels, with some rows being more credible than others. We argue that rows in discovered virtual databases should be ranked, and we describe a ranking method, called TupleRank, for calculating such a ranking order. Roughly speaking, TupleRank calibrates the probabilities calculated during a discovery process with historical information about the performance of the system. The work is done in the framework of the Autoplex system for discovering content for virtual databases, and initial experimentation is reported and discussed.
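The calibration idea in the abstract can be illustrated with a small sketch. This is not the TupleRank algorithm from the paper, whose details are not given here. It assumes a simple histogram-binning calibration: raw discovery probabilities are grouped into bins, each bin is mapped to the accuracy observed in past discoveries, and rows are ranked by the calibrated score. The function names (`fit_bins`, `calibrate`, `rank_rows`), the bin count, and the sample data are all hypothetical.

```python
# Hypothetical sketch: histogram-binning calibration, NOT the paper's method.

def fit_bins(history, n_bins=5):
    """Estimate observed accuracy per probability bin.

    history: list of (raw_probability, was_correct) pairs from past
    discoveries (an assumed form of the "historical information").
    Returns a list of length n_bins; empty bins are None.
    """
    hits = [0] * n_bins
    totals = [0] * n_bins
    for p, ok in history:
        b = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        totals[b] += 1
        hits[b] += 1 if ok else 0
    return [h / t if t else None for h, t in zip(hits, totals)]


def calibrate(p, bin_accuracy):
    """Map a raw probability to the historical accuracy of its bin.

    Falls back to the raw probability when the bin has no history.
    """
    n_bins = len(bin_accuracy)
    b = min(int(p * n_bins), n_bins - 1)
    acc = bin_accuracy[b]
    return p if acc is None else acc


def rank_rows(rows, bin_accuracy):
    """Sort (row, raw_probability) pairs by calibrated score, best first."""
    scored = [(calibrate(p, bin_accuracy), row) for row, p in rows]
    return sorted(scored, key=lambda s: s[0], reverse=True)


# Made-up history: (probability reported at discovery time, was it correct?)
history = [(0.9, True), (0.95, True), (0.85, False),
           (0.3, False), (0.35, True), (0.1, False)]
bin_accuracy = fit_bins(history, n_bins=5)

# Made-up discovered rows with their raw assurance values.
rows = [("r1", 0.32), ("r2", 0.92), ("r3", 0.05)]
ranked = rank_rows(rows, bin_accuracy)
```

In this sketch a row's final position depends on how well the system has historically performed at that assurance level, not on the raw value alone; a real system would need far more history per bin than the six samples used here.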

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Berlin, J., Motro, A. (2006). TupleRank: Ranking Discovered Content in Virtual Databases. In: Etzion, O., Kuflik, T., Motro, A. (eds) Next Generation Information Technologies and Systems. NGITS 2006. Lecture Notes in Computer Science, vol 4032. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11780991_2

  • DOI: https://doi.org/10.1007/11780991_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35472-7

  • Online ISBN: 978-3-540-35473-4

  • eBook Packages: Computer Science, Computer Science (R0)