Abstract
Schema matching is a critical problem in many applications where the main goal is to match attributes coming from heterogeneous sources. In this paper, we propose PROCLAIM (PROfile-based Cluster-Labeling for AttrIbute Matching), an automatic, unsupervised, clustering-based approach to match the attributes of a large number of heterogeneous sources. We define the concept of an attribute profile, which characterizes the main properties of an attribute using (i) the statistical distribution and the dimension of the attribute’s values and (ii) the name and textual descriptions related to the attribute. The attribute matchings produced by PROCLAIM give the best representation of heterogeneous sources thanks to the cluster-labeling function we define. We evaluate PROCLAIM on 45,000 different data sources from an oil and gas authority open data website (the data is published under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)). The results we obtain are promising and validate our approach.
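To make the idea of profile-based matching concrete, the following is a minimal sketch and not the PROCLAIM implementation. The abstract does not specify the profile features, the distance, the clustering algorithm, or the labeling function, so everything here is an assumption chosen for illustration: the profile keeps only the mean and standard deviation of the values plus name and description tokens (the dimension of the values is not modeled), the distance is an equal-weight mix of a statistical and a textual term, clustering is a simple threshold-based single-link grouping, and the label is the most frequent token in a cluster. All names (`AttributeProfile`, `build_profile`, `distance`, `cluster`, `label`) are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, pstdev
import re


@dataclass
class AttributeProfile:
    """Hypothetical profile: value statistics plus name/description tokens."""
    source: str
    name: str
    stats: tuple       # (mean, population standard deviation) of the values
    tokens: frozenset  # lowercase word tokens from name and description


def tokenize(text):
    return frozenset(re.findall(r"[a-z]+", text.lower()))


def build_profile(source, name, description, values):
    return AttributeProfile(source, name, (mean(values), pstdev(values)),
                            tokenize(name + " " + description))


def distance(a, b, text_weight=0.5):
    """Assumed distance in [0, 1]: weighted mix of text and statistics."""
    diffs = []
    for x, y in zip(a.stats, b.stats):
        scale = abs(x) + abs(y)
        diffs.append(abs(x - y) / scale if scale else 0.0)
    stat_d = mean(diffs)
    union = a.tokens | b.tokens
    # Jaccard distance between the two token sets
    text_d = 1 - len(a.tokens & b.tokens) / len(union) if union else 0.0
    return text_weight * text_d + (1 - text_weight) * stat_d


def cluster(profiles, threshold=0.5):
    """Single-link grouping: profiles closer than `threshold` are merged."""
    parent = list(range(len(profiles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(profiles)):
        for j in range(i + 1, len(profiles)):
            if distance(profiles[i], profiles[j]) <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, p in enumerate(profiles):
        groups.setdefault(find(i), []).append(p)
    return list(groups.values())


def label(group):
    """Assumed labeling: most frequent token, ties broken alphabetically."""
    counts = Counter(t for p in group for t in p.tokens)
    return min(counts, key=lambda t: (-counts[t], t), default="")


# Toy data, invented for illustration only
profiles = [
    build_profile("A", "depth_m", "well depth in metres", [1000, 1500, 2000]),
    build_profile("B", "well_depth", "depth of the well", [1100, 1400, 2100]),
    build_profile("C", "temperature", "reservoir temperature", [80, 95, 110]),
]
for group in cluster(profiles):
    print(label(group), [p.source + "." + p.name for p in group])
```

On this toy input the two depth attributes end up in one cluster and the temperature attribute in another. The threshold and the equal weighting are arbitrary; a density-based clustering method could replace the single-link grouping without changing the profile or distance code.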
Notes
1. Data from: https://www.kaggle.com/.
2. The data is published under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Arman, M., Wlodarczyk, S., Bennacer Seghouani, N., Bugiotti, F. (2020). PROCLAIM: An Unsupervised Approach to Discover Domain-Specific Attribute Matchings from Heterogeneous Sources. In: Herbaut, N., La Rosa, M. (eds) Advanced Information Systems Engineering. CAiSE 2020. Lecture Notes in Business Information Processing, vol 386. Springer, Cham. https://doi.org/10.1007/978-3-030-58135-0_2
DOI: https://doi.org/10.1007/978-3-030-58135-0_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58134-3
Online ISBN: 978-3-030-58135-0
eBook Packages: Computer Science, Computer Science (R0)