Abstract
Scientific research requires accessing, analyzing, and sharing data that is distributed across heterogeneous data sources at the scale of the Internet. An eager extract, transform, and load (ETL) process constructs an integrated data repository as its first step, integrating and loading the data in its entirety from the data sources. Bootstrapping this process is inefficient for scientific research that requires access to data from very large and typically numerous distributed data sources. A lazy ETL process loads only the metadata, but still eagerly. Lazy ETL therefore bootstraps faster. However, queries on the integrated data repository of eager ETL perform faster, because the entire data is available beforehand. In this paper, we propose a novel ETL approach for scientific data integration: a hybrid of the eager and lazy ETL approaches, applied to both data and metadata. In this way, hybrid ETL supports incremental integration and loading of metadata and data from the data sources. We incorporate a human-in-the-loop approach to enhance the hybrid ETL, with selective data integration driven by user queries and sharing of integrated data between users. We implement our hybrid ETL approach in a prototype platform, Óbidos, and evaluate it in the context of data sharing for medical research. Óbidos outperforms both the eager ETL and lazy ETL approaches for scientific research data integration and sharing, through its selective loading of data and metadata, while storing the integrated data in a scalable integrated data repository.
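The contrast between eager loading and query-driven, incremental loading can be illustrated with a minimal Python sketch. It is not the Óbidos implementation: the `Source` and `HybridRepository` classes, the `eager_load` function, and the in-memory dictionaries are hypothetical stand-ins chosen for illustration. The sketch assumes that each source can list its record identifiers cheaply (metadata), that fetching a record is the expensive step (data), and that a user query names the source and records it needs.

```python
class Source:
    """Hypothetical remote data source: cheap metadata, expensive data."""

    def __init__(self, name, records):
        self.name = name
        self.records = dict(records)  # record id -> payload (stands in for remote data)
        self.fetch_count = 0          # counts simulated expensive transfers

    def list_ids(self):
        # Metadata access: only identifiers, no payloads.
        return list(self.records)

    def fetch(self, record_id):
        # Data access: each call simulates one transfer from the source.
        self.fetch_count += 1
        return self.records[record_id]


class HybridRepository:
    """Illustrative repository that loads metadata and data only on demand."""

    def __init__(self, sources):
        self.sources = sources  # source name -> Source
        self.metadata = {}      # source name -> set of known record ids
        self.data = {}          # (source name, record id) -> payload

    def query(self, source_name, wanted_ids):
        source = self.sources[source_name]
        # Load metadata only for the source that this query touches.
        if source_name not in self.metadata:
            self.metadata[source_name] = set(source.list_ids())
        results = {}
        for record_id in wanted_ids:
            if record_id not in self.metadata[source_name]:
                continue  # unknown record: nothing to load
            key = (source_name, record_id)
            # Fetch each record at most once; later queries reuse the stored copy.
            if key not in self.data:
                self.data[key] = source.fetch(record_id)
            results[record_id] = self.data[key]
        return results


def eager_load(sources):
    """Eager baseline: copy every record from every source before any query."""
    return {
        (source.name, record_id): source.fetch(record_id)
        for source in sources.values()
        for record_id in source.list_ids()
    }
```

In this sketch, a query for two records of one source transfers only those two records and leaves every other source untouched, and a later query from another user of the same repository reuses what is already stored. By contrast, `eager_load` transfers everything before the first query can run.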
Notes
Óbidos is a medieval fortified town that has been patronized by various Portuguese queens. It is known for its sweet wine, served in a chocolate cup.
Acknowledgements
This work was supported by NCI U01 [1U01CA187013-01], Resources for development and validation of Radiomic Analyses & Adaptive Therapy, Fred Prior, Ashish Sharma (UAMS, Emory), National funds through Fundação para a Ciência e a Tecnologia with reference UID/CEC/50021/2013, PTDC/EEI-SCR/6945/2014, a Google Summer of Code project, and a PhD grant offered by the Erasmus Mundus Joint Doctorate in Distributed Computing (EMJD-DC) under grant agreement 2012-0030.
Cite this article
Kathiravelu, P., Sharma, A., Galhardas, H. et al. On-demand big data integration. Distrib Parallel Databases 37, 273–295 (2019). https://doi.org/10.1007/s10619-018-7248-y