Identifying Value Mappings for Data Integration: An Unsupervised Approach

Kang, Jaewoo; Lee, Dongwon; Mitra, Prasenjit

doi:10.1007/11581062_46

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3806)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

The Web is a distributed network of information sources in which individual sources are created and maintained autonomously. Consequently, syntactic and semantic heterogeneity of data among sources abounds. Most current data cleaning solutions assume that data values referencing the same object bear some textual similarity. However, this assumption is often violated in practice. For example, “Two-door front wheel drive” can be represented as “2DR-FWD” or “R2FD”, or even as “CAR TYPE 3”, in different data sources. To address this problem, we propose a novel two-step automated technique that exploits statistical dependency structures among objects, which are invariant to the tokens representing the objects. The algorithm achieved high accuracy in our empirical study, suggesting that it can be a useful addition to existing information integration techniques.
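The invariance claim in the abstract can be illustrated with a small sketch. This is not the authors' algorithm, which the abstract does not spell out. It assumes mutual information between two columns as one possible dependency measure, and the toy table, the price bands, and the code "CAR TYPE 7" are hypothetical. It only shows that such a measure depends on co-occurrence counts, not on the strings used to encode the values.

```python
from collections import Counter
from math import log2


def mutual_information(rows, i, j):
    """Mutual information (in bits) between columns i and j of a list of tuples."""
    n = len(rows)
    joint = Counter((r[i], r[j]) for r in rows)   # co-occurrence counts
    left = Counter(r[i] for r in rows)            # marginal counts, column i
    right = Counter(r[j] for r in rows)           # marginal counts, column j
    return sum(
        (c / n) * log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )


# Hypothetical source A: (car type, price band), with readable values.
source_a = [
    ("Two-door front wheel drive", "low"),
    ("Two-door front wheel drive", "low"),
    ("Four-door all wheel drive", "high"),
    ("Four-door all wheel drive", "high"),
    ("Two-door front wheel drive", "high"),
    ("Four-door all wheel drive", "low"),
]

# Hypothetical source B: the same records under an opaque encoding
# that shares no text with source A.
encode = {
    "Two-door front wheel drive": "CAR TYPE 3",
    "Four-door all wheel drive": "CAR TYPE 7",
    "low": "P1",
    "high": "P2",
}
source_b = [(encode[x], encode[y]) for x, y in source_a]

mi_a = mutual_information(source_a, 0, 1)
mi_b = mutual_information(source_b, 0, 1)

# The dependency strength is the same in both sources even though
# no value in B is textually similar to its counterpart in A.
print(mi_a, mi_b)
```

Because the two scores agree regardless of encoding, dependency statistics of this kind could in principle serve as evidence for matching values where string similarity fails.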

Author information

Authors and Affiliations

NC State University, Raleigh, NC, 27695, USA
Jaewoo Kang
Penn State University, University Park, PA, 16802, USA
Dongwon Lee & Prasenjit Mitra

Editor information

Editors and Affiliations

Texas State University, San Marcos, TX, USA
Anne H. H. Ngu
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
University of Vienna, Vienna, Austria
Erich J. Neuhold
IBM Research Division, Thomas J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY, 10598, USA
Jen-Yao Chung
School of Computer Science and Engineering, University of New South Wales, NSW 2052, Sydney, Australia
Quan Z. Sheng

About this paper

Cite this paper

Kang, J., Lee, D., Mitra, P. (2005). Identifying Value Mappings for Data Integration: An Unsupervised Approach. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_46

DOI: https://doi.org/10.1007/11581062_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30017-5
Online ISBN: 978-3-540-32286-3
eBook Packages: Computer Science, Computer Science (R0)
