Domain- and Structure-Agnostic End-to-End Entity Resolution with JedAI

Published: 25 February 2020

Abstract

We present JedAI, a new open-source toolkit for end-to-end Entity Resolution. JedAI is domain-agnostic in the sense that it does not depend on background expert knowledge, applying seamlessly to data of any domain with minimal human intervention. JedAI is also structure-agnostic, as it can process any type of data, ranging from structured (relational) to semi-structured (RDF) and unstructured (free-text) entity descriptions. JedAI consists of two parts: (i) JedAI-core is a library of numerous state-of-the-art methods that can be mixed and matched to form (thousands of) end-to-end workflows, making it easy to benchmark their relative performance. (ii) JedAI-gui is a user-friendly desktop application that facilitates the composition of complex workflows via a wizard-like interface. It is suitable for both lay and power users, offering concrete guidelines and automatic configuration, as well as manual configuration options, visual exploration, and detailed statistics for each method's performance. In this paper, we also delve into the new features of JedAI's latest version (2.1), and demonstrate its performance experimentally.
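
To make the idea of mixing and matching methods into an end-to-end workflow more concrete, the following is a minimal Python sketch of a structure-agnostic pipeline with three interchangeable steps: blocking, matching, and clustering. It is an illustration only and does not use JedAI's actual API. All function names (`token_blocking`, `jaccard_matching`, `connected_components`, `run_workflow`), the 0.5 threshold, and the sample profiles are assumptions introduced for this example.

```python
from collections import defaultdict
from itertools import combinations


def tokens(profile):
    """Structure-agnostic view of a profile: every lowercase token of every value."""
    return {tok for value in profile.values() for tok in str(value).lower().split()}


def token_blocking(profiles):
    """Blocking step (hypothetical): pair up entities that share at least one token."""
    blocks = defaultdict(set)
    for eid, profile in profiles.items():
        for tok in tokens(profile):
            blocks[tok].add(eid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard_matching(profiles, pairs, threshold=0.5):
    """Matching step (hypothetical): keep pairs whose token-set Jaccard similarity
    reaches the threshold. The threshold value is an arbitrary assumption."""
    matches = []
    for a, b in pairs:
        ta, tb = tokens(profiles[a]), tokens(profiles[b])
        if len(ta & tb) / len(ta | tb) >= threshold:
            matches.append((a, b))
    return matches


def connected_components(ids, matches):
    """Clustering step (hypothetical): transitive closure of matches via union-find."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)
    clusters = defaultdict(set)
    for i in parent:
        clusters[find(i)].add(i)
    return sorted(sorted(c) for c in clusters.values())


def run_workflow(profiles, blocker, matcher, clusterer):
    """Compose one method per step into an end-to-end workflow."""
    pairs = blocker(profiles)
    matches = matcher(profiles, pairs)
    return clusterer(profiles.keys(), matches)


# Sample profiles with different structure (attribute-value pairs vs. free text).
profiles = {
    1: {"name": "John Smith", "city": "Athens"},
    2: {"text": "john smith athens greece"},
    3: {"title": "Entity Resolution toolkit"},
}
clusters = run_workflow(profiles, token_blocking, jaccard_matching, connected_components)
print(clusters)
```

Because each step is passed in as a plain function, swapping in a different blocker, matcher, or clusterer yields a different workflow over the same data, which is the kind of composition that allows the relative performance of alternatives to be compared.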


Published In

ACM SIGMOD Record  Volume 48, Issue 4
December 2019
52 pages
ISSN:0163-5808
DOI:10.1145/3385658
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
