Learning expressive linkage rules using genetic programming

Published: 01 July 2012

Abstract

    A central problem in data integration and data cleansing is to find entities in different data sources that describe the same real-world object. Many existing methods for identifying such entities rely on explicit linkage rules, which specify the conditions that two entities must fulfill to be considered descriptions of the same real-world object. In this paper, we present the GenLink algorithm for learning expressive linkage rules from a set of existing reference links using genetic programming. The algorithm is capable of generating linkage rules that select discriminative properties for comparison, apply chains of data transformations to normalize property values, choose appropriate distance measures and thresholds, and combine the results of multiple comparisons using non-linear aggregation functions. Our experiments show that the GenLink algorithm outperforms the state-of-the-art genetic programming approach to learning linkage rules recently presented by Carvalho et al. and is capable of learning linkage rules that achieve accuracy similar to that of human-written rules for the same problem.
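    To make the four ingredients named in the abstract concrete (property selection, transformation chains, distance measures with thresholds, and non-linear aggregation), the following is a minimal Python sketch of how such a linkage rule could be represented and evaluated as a small tree. It is not the GenLink implementation and does not show the genetic programming search itself. The class names `Comparison` and `Aggregation`, the scoring formula `1 - distance / threshold`, the `cutoff` parameter, and the sample entities are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Union

Entity = Dict[str, str]  # hypothetical entity: property name -> value


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance, used here as one possible distance measure."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


@dataclass
class Comparison:
    """Leaf node: compares one property of each entity."""
    prop_a: str                                  # selected property in source A
    prop_b: str                                  # selected property in source B
    transforms: Sequence[Callable[[str], str]]   # normalization chain
    distance: Callable[[str, str], float]        # distance measure
    threshold: float                             # assumed > 0

    def score(self, a: Entity, b: Entity) -> float:
        va, vb = a[self.prop_a], b[self.prop_b]
        for t in self.transforms:                # apply the chain in order
            va, vb = t(va), t(vb)
        d = self.distance(va, vb)
        # Assumed scoring: 1.0 at distance 0, 0.0 at or beyond the threshold.
        return max(0.0, 1.0 - d / self.threshold)


@dataclass
class Aggregation:
    """Inner node: combines child scores, e.g. with min or max."""
    combine: Callable[[List[float]], float]
    children: List[Union["Comparison", "Aggregation"]]

    def score(self, a: Entity, b: Entity) -> float:
        return self.combine([c.score(a, b) for c in self.children])


def is_link(rule: Aggregation, a: Entity, b: Entity, cutoff: float = 0.5) -> bool:
    """Declare a link when the aggregated score reaches the assumed cutoff."""
    return rule.score(a, b) >= cutoff


# Hypothetical rule: names and country codes must both match after normalization.
rule = Aggregation(min, [
    Comparison("name", "label", [str.strip, str.lower], levenshtein, 2.0),
    Comparison("country", "cc", [str.lower], levenshtein, 1.0),
])

a = {"name": "  Berlin ", "country": "DE"}
b = {"label": "berlin", "cc": "de"}
c = {"label": "Bern", "cc": "ch"}

print(is_link(rule, a, b))  # same city after normalization
print(is_link(rule, a, c))  # different city and country
```

    Using `min` as the aggregation makes the rule behave like a conjunction: every comparison must score well for the pair to be linked. Swapping in `max` would give a disjunction instead, which is one way a non-linear aggregation differs from a weighted average.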

    References

    [1]
    A. Arasu, M. Götz, and R. Kaushik. On active learning of record matching packages. In Proceedings of the 16th ACM SIGMOD International Conference on Management of Data, pages 783--794, 2010.
    [2]
    M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 39--48, 2003.
    [3]
    M. Bilenko and R. J. Mooney. Learning to combine trained distance metrics for duplicate detection in databases. Technical Report AI 02--296, Artificial Intelligence Laboratory, University of Texas at Austin, 2002.
    [4]
    C. Bizer, T. Heath, and T. Berners-Lee. Linked Data - the story so far. International Journal on Semantic Web and Information Systems, 4(2):1--22, 2009.
    [5]
    D. Brickley and L. Miller. FOAF Vocabulary Specification. http://xmlns.com/foaf/0.1/, 2005.
    [6]
    M. Carvalho, A. Laender, M. Gonçalves, and A. da Silva. Replica identification using genetic programming. In Proceedings of the 23rd Annual ACM Symposium on Applied Computing, pages 1801--1806, 2008.
    [7]
    C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273--297, 1995.
    [8]
    N. Cramer. A representation for the adaptive generation of simple sequential programs. In Proceedings of the First International Conference on Genetic Algorithms, pages 183--187, 1985.
    [9]
    M. G. de Carvalho, M. A. Gonçalves, A. H. F. Laender, and A. S. da Silva. Learning to deduplicate. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 41--50, 2006.
    [10]
    M. G. de Carvalho, A. H. F. Laender, M. A. Gonçalves, and A. S. da Silva. A genetic programming approach to record deduplication. IEEE Transactions on Knowledge and Data Engineering, 24(3):399--412, 2012.
    [11]
    M. Elfeky, V. Verykios, and A. Elmagarmid. Tailor: A record linkage toolbox. In Proceedings of 18th International Conference on Data Engineering, pages 17--28, 2002.
    [12]
    A. Elmagarmid, P. Ipeirotis, and V. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19(1):1--16, 2007.
    [13]
    J. Euzenat et al. Results of the ontology alignment evaluation initiative 2011. In 6th International Workshop on Ontology Matching, pages 85--113, 2011.
    [14]
    J. Euzenat and P. Shvaiko. Ontology Matching. Springer-Verlag, Heidelberg (DE), 2007.
    [15]
    I. P. Fellegi and A. B. Sunter. A Theory for Record Linkage. Journal of the American Statistical Association, 64(328):1183--1210, 1969.
    [16]
    L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences, 2003.
    [17]
    J. Holland. Adaptation in natural and artificial systems. The University of Michigan Press, 1975.
    [18]
    W. Hu, J. Chen, G. Cheng, and Y. Qu. ObjectCoref & Falcon-AO: Results for OAEI 2010. In 5th International Workshop on Ontology Matching, pages 158--165, 2010.
    [19]
    R. Isele and C. Bizer. Learning linkage rules using genetic programming. In 6th International Workshop on Ontology Matching, pages 13--24, 2011.
    [20]
    R. Isele, A. Jentzsch, and C. Bizer. Silk Server - adding missing links while consuming linked data. In 1st International Workshop on Consuming Linked Data, pages 85--96, 2010.
    [21]
    R. Isele, A. Jentzsch, and C. Bizer. Active learning of expressive linkage rules for the web of data. In 12th International Conference on Web Engineering, pages 411--418, 2012.
    [22]
    H. Köpcke and E. Rahm. Frameworks for entity matching: A comparison. Data & Knowledge Engineering, 69(2):197--210, 2010.
    [23]
    J. Koza, M. Keane, M. Streeter, W. Mydlowec, J. Yu, and G. Lanza. Genetic programming IV: Routine human-competitive machine intelligence. Springer Verlag, 2005.
    [24]
    J. R. Koza. Genetic programming - on the programming of computers by means of natural selection. MIT Press, 1993.
    [25]
    B. Matthews. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442--451, 1975.
    [26]
    D. Montana. Strongly typed genetic programming. Evolutionary computation, 3(2):199--230, 1995.
    [27]
    S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 269--278, 2002.
    [28]
    S. Tejada, C. Knoblock, and S. Minton. Learning object identification rules for information integration. Information Systems, 26(8):607--633, 2001.
    [29]
    S. Tejada, C. A. Knoblock, and S. Minton. Learning domain-independent string transformation weights for high accuracy object identification. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 350--359, 2002.
    [30]
    Z. Wang, X. Zhang, L. Hou, Y. Zhao, J. Li, Y. Qi, and J. Tang. RiMOM results for OAEI 2010. In 5th International Workshop on Ontology Matching, pages 194--201, 2010.
    [31]
    W. E. Winkler. Matching and record linkage. In Business Survey Methods, pages 355--384, 1995.
    [32]
    W. E. Winkler. Methods for record linkage and bayesian networks. Technical report, Series RRS2002/05, U. S. Bureau of the Census, 2002.

    Published In

    Proceedings of the VLDB Endowment  Volume 5, Issue 11
    July 2012
    608 pages

    Publisher

    VLDB Endowment

    Qualifiers

    • Research-article

    Cited By

    • (2023) Matching Roles from Temporal Data: Why Joe Biden is not only President, but also Commander-in-Chief. Proceedings of the ACM on Management of Data 1:1 (1-26). doi:10.1145/3588919. Online publication date: 30-May-2023
    • (2023) Geospatial Data Science. Online publication date: 9-Jun-2023
    • (2021) High-Value Token-Blocking: Efficient Blocking Method for Record Linkage. ACM Transactions on Knowledge Discovery from Data 16:2 (1-17). doi:10.1145/3450527. Online publication date: 21-Jul-2021
    • (2020) Learning expressive linkage rules from sparse data. Semantic Web 11:3 (549-567). doi:10.3233/SW-190356. Online publication date: 1-Jan-2020
    • (2020) A benchmarking study of embedding-based entity alignment for knowledge graphs. Proceedings of the VLDB Endowment 13:12 (2326-2340). doi:10.14778/3407790.3407828. Online publication date: 14-Sep-2020
    • (2020) Introducing Context and Context-awareness in Data Integration. Proceedings of the 22nd International Conference on Information Integration and Web-based Applications & Services (178-183). doi:10.1145/3428757.3429116. Online publication date: 30-Nov-2020
    • (2020) An Overview of End-to-End Entity Resolution for Big Data. ACM Computing Surveys 53:6 (1-42). doi:10.1145/3418896. Online publication date: 6-Dec-2020
    • (2020) Domain- and Structure-Agnostic End-to-End Entity Resolution with JedAI. ACM SIGMOD Record 48:4 (30-36). doi:10.1145/3385658.3385664. Online publication date: 25-Feb-2020
    • (2020) Selecting suitable configurations for automated link discovery. Proceedings of the 35th Annual ACM Symposium on Applied Computing (907-914). doi:10.1145/3341105.3373882. Online publication date: 30-Mar-2020
    • (2019) Robust Active Learning of Expressive Linkage Rules. Proceedings of the 9th International Conference on Web Intelligence, Mining and Semantics (1-7). doi:10.1145/3326467.3326484. Online publication date: 26-Jun-2019
