DOI: 10.1145/3448016.3457549
Research article · Open access

ExDRa: Exploratory Data Science on Federated Raw Data

Published: 18 June 2021

Abstract

Data science workflows are largely exploratory, dealing with under-specified objectives, open-ended problems, and unknown business value. Therefore, little investment is made in systematic acquisition, integration, and pre-processing of data. This lack of infrastructure results in redundant manual effort and computation. Furthermore, central data consolidation is not always technically or economically desirable or even feasible (e.g., due to privacy and/or data ownership). The ExDRa system aims to provide system infrastructure for this exploratory data science process on federated, heterogeneous raw data sources. Technical focus areas include (1) ad-hoc and federated data integration on raw data, (2) data organization and reuse of intermediates, and (3) optimization of the data science lifecycle, under awareness of partially accessible data. In this paper, we describe use cases, the overall system architecture, selected features of SystemDS' new federated backend (for federated linear algebra programs, federated parameter servers, and federated data preparation), as well as promising initial results. Beyond existing work on federated learning, ExDRa focuses on enterprise federated ML and related data pre-processing challenges. In this context, federated ML has the potential to create a more fine-grained spectrum of data ownership and thus, even new markets.
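The federated parameter servers mentioned in the abstract build on federated model averaging. As a minimal, illustrative sketch (not ExDRa or SystemDS code): each site takes a gradient step on its private data, and only model parameters, never raw rows, are sent to a coordinator, which aggregates them weighted by local data size. All function names here are hypothetical.

```python
# Illustrative sketch of federated model averaging (the FedAvg idea):
# sites exchange parameters, not raw data.
from typing import List, Tuple

def local_step(weights: List[float], X: List[List[float]], y: List[float],
               lr: float = 0.1) -> List[float]:
    """One gradient step of linear regression on a site's private data."""
    n, d = len(X), len(weights)
    grad = [0.0] * d
    for xi, yi in zip(X, y):
        err = sum(w * x for w, x in zip(weights, xi)) - yi
        for j in range(d):
            grad[j] += 2 * err * xi[j] / n
    return [w - lr * g for w, g in zip(weights, grad)]

def federated_average(site_models: List[Tuple[List[float], int]]) -> List[float]:
    """Coordinator-side aggregation: average site models, weighted by
    each site's number of local examples."""
    total = sum(n for _, n in site_models)
    d = len(site_models[0][0])
    return [sum(w[j] * n for w, n in site_models) / total for j in range(d)]
```

A coordinator would repeat these two steps per round: broadcast the averaged model, let each site run `local_step` on its own data, then call `federated_average` on the returned (weights, sample count) pairs.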

Supplementary Material

Read me (3448016.3457549_readme.pdf)
Source Code (3448016.3457549_source_code.zip)
MP4 File (3448016.3457549.mp4)


Cited By

  • (2024) Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly. In Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, 39-50. DOI: 10.1145/3650203.3663331
  • (2023) GIO: Generating Efficient Matrix and Frame Readers for Custom Data Formats by Example. Proceedings of the ACM on Management of Data 1, 2, 1-26. DOI: 10.1145/3589265
  • (2023) FEAST: A Communication-efficient Federated Feature Selection Framework for Relational Data. Proceedings of the ACM on Management of Data 1, 1, 1-28. DOI: 10.1145/3588961
  • (2023) AWARE: Workload-aware, Redundancy-exploiting Linear Algebra. Proceedings of the ACM on Management of Data 1, 1, 1-28. DOI: 10.1145/3588682
  • (2022) VF-PS. In Proceedings of the 36th International Conference on Neural Information Processing Systems, 2088-2101. DOI: 10.5555/3600270.3600422
  • (2022) UPLIFT. Proceedings of the VLDB Endowment 15, 11, 2929-2938. DOI: 10.14778/3551793.3551842
  • (2022) BETZE: Benchmarking Data Exploration Tools with (Almost) Zero Effort. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), 2385-2398. DOI: 10.1109/ICDE53745.2022.00224
  • (2022) Chunk-oriented dimension ordering for efficient range query processing on sparse multidimensional data. World Wide Web 26, 4, 1395-1433. DOI: 10.1007/s11280-022-01098-z


      Published In

      SIGMOD '21: Proceedings of the 2021 International Conference on Management of Data
      June 2021, 2969 pages
      ISBN: 9781450383431
      DOI: 10.1145/3448016


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. data science
      2. federated learning
      3. ml pipelines
      4. raw data


      Funding Sources

      • German Federal Ministry for Economic Affairs and Energy
      • Austrian Federal Ministry for Climate Action Environment Energy Mobility Innovation and Technology

      Conference

      SIGMOD/PODS '21

      Acceptance Rates

      Overall Acceptance Rate 785 of 4,003 submissions, 20%

