Identifying Value Mappings for Data Integration: An Unsupervised Approach

  • Conference paper
Web Information Systems Engineering – WISE 2005 (WISE 2005)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 3806))

Abstract

The Web is a distributed network of information sources where the individual sources are autonomously created and maintained. Consequently, syntactic and semantic heterogeneity of data among sources abounds. Most current data cleaning solutions assume that data values referencing the same object bear some textual similarity. However, this assumption is often violated in practice. For example, “Two-door front wheel drive” can be represented as “2DR-FWD” or “R2FD”, or even as “CAR TYPE 3” in different data sources. To address this problem, we propose a novel two-step automated technique that exploits statistical dependency structures among objects, which are invariant to the tokens representing the objects. The algorithm achieved high accuracy in our empirical study, suggesting that it can be a useful addition to existing information integration techniques.
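The abstract does not spell out the two steps, so the following is only an illustrative sketch of the underlying idea, not the authors' algorithm. It assumes each source exposes two categorical columns, that both sources contain the same number of distinct values per column, and that matching values co-occur with similar relative frequencies. The function names, the toy records, and every code other than “2DR-FWD” are hypothetical.

```python
from collections import Counter
from itertools import permutations


def joint_distribution(records):
    """Return (row_values, col_values, P), where P[i][j] is the relative
    frequency of the pair (row_values[i], col_values[j]) in records."""
    counts = Counter(records)
    rows = sorted({r for r, _ in records})
    cols = sorted({c for _, c in records})
    total = len(records)
    matrix = [[counts[(r, c)] / total for c in cols] for r in rows]
    return rows, cols, matrix


def match_values(records_a, records_b):
    """Brute-force the row and column alignment that minimizes the squared
    distance between the two joint distributions. Only frequencies are
    compared, so the spelling of the tokens never matters."""
    rows_a, cols_a, p = joint_distribution(records_a)
    rows_b, cols_b, q = joint_distribution(records_b)
    if len(rows_a) != len(rows_b) or len(cols_a) != len(cols_b):
        raise ValueError("this sketch assumes equally sized value sets")
    best = None
    for row_perm in permutations(range(len(rows_b))):
        for col_perm in permutations(range(len(cols_b))):
            cost = sum(
                (p[i][j] - q[row_perm[i]][col_perm[j]]) ** 2
                for i in range(len(rows_a))
                for j in range(len(cols_a))
            )
            if best is None or cost < best[0]:
                best = (cost, row_perm, col_perm)
    _, row_perm, col_perm = best
    row_map = {rows_a[i]: rows_b[row_perm[i]] for i in range(len(rows_a))}
    col_map = {cols_a[j]: cols_b[col_perm[j]] for j in range(len(cols_a))}
    return row_map, col_map


# Hypothetical toy data: (drive type, body style) pairs from two sources
# that share no tokens at all.
records_a = (
    [("Two-door front wheel drive", "coupe")] * 5
    + [("Two-door front wheel drive", "sedan")] * 1
    + [("Four-door rear wheel drive", "coupe")] * 1
    + [("Four-door rear wheel drive", "sedan")] * 3
)
records_b = (
    [("2DR-FWD", "T1")] * 11
    + [("2DR-FWD", "T2")] * 2
    + [("4DR-RWD", "T1")] * 2
    + [("4DR-RWD", "T2")] * 5
)

row_map, col_map = match_values(records_a, records_b)
print(row_map)
print(col_map)
```

On this toy data the search pairs “Two-door front wheel drive” with “2DR-FWD” even though the strings share no tokens. The exhaustive search is factorial in the number of distinct values, so anything beyond a handful of values would need an assignment solver or another approximation.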

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Kang, J., Lee, D., Mitra, P. (2005). Identifying Value Mappings for Data Integration: An Unsupervised Approach. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_46

  • DOI: https://doi.org/10.1007/11581062_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30017-5

  • Online ISBN: 978-3-540-32286-3

  • eBook Packages: Computer Science, Computer Science (R0)
