Building Big Data Processing and Visualization Pipeline through Apache Zeppelin

Authors:

Duen Horng Chau

PEARC '18: Proceedings of the Practice and Experience on Advanced Research Computing: Seamless Creativity

Article No.: 57, Pages 1 - 7

https://doi.org/10.1145/3219104.3229288

Published: 22 July 2018

Abstract

Big data analytics pipelines have become popular for processing large volumes of data. Apache Zeppelin provides an integrated environment for data ingestion, data discovery, data analytics, data visualization, and collaboration, with an extensible framework that allows different programming languages and data processing back ends to be plugged in. The supported languages include Scala, Python, SQL, and shell script, and the supported big data processing back ends include Hadoop, Spark, and Hive. With these tool sets, interactive and dynamic data analysis can be done on the fly through heterogeneous programming interfaces. Although Zeppelin is well suited to code development and interactive analysis on small-scale data sets for proof-of-concept or use-case presentations, in some cases the data processing pipeline still needs to run in batch mode for performance and robustness, and to fit into an automated workflow. We are developing a tool that converts a Zeppelin notebook into a workflow consisting of a set of code files that can run in batch mode through a command-line interface without a running Zeppelin instance, so that prototype code can be deployed seamlessly on a production cluster after the demo stage. The entire workflow can be preserved, configured manually, and run automatically. Zeppelin also provides a flexible way to integrate visualization functionality; the second contribution of this paper is an extension of Zeppelin's existing built-in visualization component for D3Network. With these two added features, Zeppelin can help users develop big data pipelines and visualize graph data quickly and efficiently.
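To make the notebook-to-batch idea concrete, the following is a minimal sketch of its first step: splitting a notebook into ordered (interpreter, code) steps. It is not the authors' tool. It assumes the note has already been exported as JSON and loaded into a dictionary with a top-level "paragraphs" list whose entries carry their source in a "text" field, optionally starting with an interpreter directive such as %spark, %pyspark, or %sh. The function name split_paragraphs and the default interpreter are hypothetical choices made for illustration.

```python
import re

# Leading interpreter directive, e.g. "%spark", "%pyspark", "%sh", "%spark.sql".
DIRECTIVE = re.compile(r"^\s*%([\w.]+)[ \t]*\n?")


def split_paragraphs(note, default_interpreter="spark"):
    """Return ordered (interpreter, code) pairs for a loaded note dictionary.

    Assumption: `note` has a "paragraphs" list and each paragraph stores its
    source in a "text" field. Paragraphs without code are skipped.
    """
    steps = []
    for paragraph in note.get("paragraphs", []):
        text = paragraph.get("text") or ""
        match = DIRECTIVE.match(text)
        if match:
            interpreter = match.group(1)
            code = text[match.end():]
        else:
            # No directive: fall back to the (assumed) default interpreter.
            interpreter = default_interpreter
            code = text
        code = code.strip()
        if code:
            steps.append((interpreter, code))
    return steps


if __name__ == "__main__":
    # Tiny hand-written note used only for illustration.
    demo_note = {
        "name": "demo",
        "paragraphs": [
            {"text": "%sh\necho hello"},
            {"text": "%pyspark\nprint(1)\n"},
            {"text": "val x = 1"},
        ],
    }
    for interpreter, code in split_paragraphs(demo_note):
        print(f"[{interpreter}] {code}")
```

Each resulting step could then be written to a file with the matching extension and submitted through the corresponding command-line tool; how the steps are chained and configured is left open here, since the abstract does not describe it.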

Index Terms

1. Human-centered computing
2. Information systems

Published In

PEARC '18: Proceedings of the Practice and Experience on Advanced Research Computing: Seamless Creativity

July 2018

652 pages

ISBN:9781450364461

DOI:10.1145/3219104

General Chair:
Sergiu Sanielevici
Pittsburgh Supercomputing Center

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

PEARC '18

PEARC '18: Practice and Experience in Advanced Research Computing

July 22 - 26, 2018

Pittsburgh, PA, USA

Acceptance Rates

PEARC '18 Paper Acceptance Rate 79 of 123 submissions, 64%;

Overall Acceptance Rate 133 of 202 submissions, 66%
