Abstract
The Web is a distributed network of information sources, each created and maintained autonomously. Consequently, syntactic and semantic heterogeneity of data among sources abounds. Most current data cleaning solutions assume that data values referring to the same object bear some textual similarity. However, this assumption is often violated in practice: “Two-door front wheel drive” can be represented as “2DR-FWD” or “R2FD”, or even as “CAR TYPE 3”, in different data sources. To address this problem, we propose a novel two-step automated technique that exploits statistical dependency structures among objects, which are invariant to the tokens representing the objects. The algorithm achieved high accuracy in our empirical study, suggesting that it can be a useful addition to existing information integration techniques.
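To make the idea of token-invariant dependency structure concrete, the following is a minimal sketch and not the two-step algorithm of the paper, whose details the abstract does not give. It assumes two hypothetical toy sources that record the same kinds of cars under unrelated codes, summarizes each source by how often pairs of values co-occur in a record, and then searches all relabelings by brute force for the one that best aligns the two co-occurrence matrices. The function names, the data, and the squared-error cost are all assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations


def cooccurrence(records, values):
    """Relative frequency with which each ordered pair of values shares a record."""
    counts = Counter()
    for rec in records:
        for a in rec:
            for b in rec:
                if a != b:
                    counts[(a, b)] += 1
    n = len(records)
    return [[counts[(a, b)] / n for b in values] for a in values]


def best_mapping(records1, records2):
    """Map values of source 1 to values of source 2 using co-occurrence only.

    Brute force over all permutations: feasible only for a handful of values.
    """
    vals1 = sorted({v for rec in records1 for v in rec})
    vals2 = sorted({v for rec in records2 for v in rec})
    if len(vals1) != len(vals2):
        raise ValueError("this sketch assumes equally sized value sets")
    m1 = cooccurrence(records1, vals1)
    best, best_cost = None, float("inf")
    for perm in permutations(vals2):
        m2 = cooccurrence(records2, list(perm))
        # Squared difference between the two dependency matrices.
        cost = sum(
            (m1[i][j] - m2[i][j]) ** 2
            for i in range(len(vals1))
            for j in range(len(vals1))
        )
        if cost < best_cost:
            best, best_cost = perm, cost
    return dict(zip(vals1, best))


# Hypothetical toy data: (body style, drive type) per record.
src1 = (
    [("2DR", "FWD")] * 5 + [("2DR", "RWD")] * 1
    + [("4DR", "FWD")] * 2 + [("4DR", "RWD")] * 4
)
# Same underlying cars, but encoded with opaque, textually unrelated tokens.
src2 = (
    [("A", "X")] * 5 + [("A", "Y")] * 1
    + [("B", "X")] * 2 + [("B", "Y")] * 4
)

mapping = best_mapping(src1, src2)
print(mapping)  # {'2DR': 'A', '4DR': 'B', 'FWD': 'X', 'RWD': 'Y'}
```

No string comparison between the two vocabularies is performed anywhere; the mapping is recovered purely from how the values depend on one another. The exhaustive search grows factorially with the number of values, so a practical system would need a more scalable matching step.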
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kang, J., Lee, D., Mitra, P. (2005). Identifying Value Mappings for Data Integration: An Unsupervised Approach. In: Ngu, A.H.H., Kitsuregawa, M., Neuhold, E.J., Chung, JY., Sheng, Q.Z. (eds) Web Information Systems Engineering – WISE 2005. WISE 2005. Lecture Notes in Computer Science, vol 3806. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11581062_46
DOI: https://doi.org/10.1007/11581062_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30017-5
Online ISBN: 978-3-540-32286-3
eBook Packages: Computer Science, Computer Science (R0)