Abstract
Extract-Transform-Load (ETL) flows are essential to the success of a data warehouse and of the business intelligence and decision support mechanisms attached to it. During both the design phase and the entire ETL lifecycle, the ETL architect needs to devise and refine an ETL design so that it satisfies both performance and correctness guarantees, and she often has to choose among alternative designs. In this paper, we focus on ways to predict the maintenance effort of ETL workflows, and we explore techniques for assessing the quality of ETL designs under the prism of evolution. We focus on a set of graph-theoretic metrics for predicting evolution impact, and we investigate how well they fit real-world ETL scenarios. We present our experimental findings and describe the lessons we learned from working on real-world cases.
Additional information
Work conducted in the context of the “EICOS: foundations for personalized Cooperative Information Ecosystems” project of the “Thales” Programme. The only source of funding for this research comes from the European Social Fund (ESF)-European Union (EU) and National Resources of the Greek State under the Operational Programme “Education and Lifelong Learning (EdLL)”.
Cite this article
Papastefanatos, G., Vassiliadis, P., Simitsis, A. et al. Metrics for the Prediction of Evolution Impact in ETL Ecosystems: A Case Study. J Data Semant 1, 75–97 (2012). https://doi.org/10.1007/s13740-012-0006-9
DOI: https://doi.org/10.1007/s13740-012-0006-9