research-article

A Graduate-Level Course on Entity Resolution and Information Quality: A Step toward ER Education

Published: 01 March 2013

    Abstract

    This article discusses the topics, approaches, and lessons learned in teaching a graduate-level course covering entity resolution (ER) and its relationship to information quality (IQ). The course surveys a broad spectrum of ER topics and activities including entity reference extraction, entity reference preparation, entity reference resolution techniques, entity identity management, and entity relationship analysis. The course content also attempts to balance aspects of ER theory with practical application through a series of laboratory exercises coordinated with the lecture topics. As an additional teaching aid, a configurable, open-source entity resolution engine (OYSTER) was developed that allows students to experiment with different types of ER architectures including merge-purge, record linking, identity resolution, and identity capture.

      Reviews

      Fjodor J. Ruzic

      Anybody interested in entity resolution, the process of determining whether two references to an entity refer to the same object or different objects, should read this valuable paper. The authors assume that entity resolution and information quality converge, and that this convergence should be studied in graduate-level courses. The authors state that solving information quality problems is a prerequisite for properly preparing the reference sources for entity resolution decisions. This approach contributes to the quality of an information product created from several information sources. Further, they describe a core course subject dealing with entity identity, information quality, and record linkage models and techniques. Readers are also presented with the authors' notions on practical exercises that focus on the course material and the use of various tools, such as the OYSTER (Open sYSTem Entity Resolution) system, an open-source software development project that can be used as a teaching tool to give students experience performing entity resolution. The concluding remarks state that students benefit from a series of exercises that help them understand how improvements in the quality of information sources can improve overall results in the entity resolution process. I found the paper interesting and recommend it to students and professionals working in any sector or field where information integration is a core issue.

      Online Computing Reviews Service

      Published In

      Journal of Data and Information Quality, Volume 4, Issue 2
      Special Issue on Entity Resolution
      March 2013
      88 pages
      ISSN:1936-1955
      EISSN:1936-1963
      DOI:10.1145/2435221
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 March 2013
      Accepted: 01 June 2012
      Revised: 01 May 2012
      Received: 01 January 2011
      Published in JDIQ Volume 4, Issue 2

      Author Tags

      1. Entity resolution
      2. corporate house-holding
      3. data quality
      4. graduate-level ER course
      5. information quality
      6. measurement
      7. record linkage

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Funding Sources

      • SAS DataFlux®
      • Infoglide Software®
      • Arkansas Department of Education
