DOI: 10.1145/3219104.3229288

Building Big Data Processing and Visualization Pipeline through Apache Zeppelin

Published: 22 July 2018

Abstract

Big data analytics pipelines have become popular for processing large volumes of data. Apache Zeppelin provides an integrated environment for data ingestion, data discovery, data analytics, data visualization, and collaboration, with an extensible framework that allows different programming languages and data processing back ends to be plugged in. The supported languages include Scala, Python, SQL, and shell script, and the supported big data processing back ends include Hadoop, Spark, and Hive. With these tool sets, interactive and dynamic data analysis can be done on the fly through heterogeneous programming interfaces. Although Zeppelin is well suited to code development and interactive analysis of small-scale data sets for proof-of-concept or use-case presentations, in some cases the data processing pipeline still needs to run in batch mode for performance and robustness, and to fit into an automated workflow. We are developing a tool that converts a Zeppelin notebook into a workflow made up of code that can run in batch mode through a command-line interface without a running Zeppelin instance, so that prototype code can be deployed seamlessly on a production cluster after the demo stage. The entire workflow can be preserved, configured manually, and run automatically. Zeppelin also provides a flexible way to integrate visualization functionality; another contribution of this paper extends Zeppelin's existing built-in visualization component for D3Network. With these two added features, Zeppelin can help users develop big data pipelines and visualize graph data quickly and efficiently.
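The notebook-to-batch conversion described above can be illustrated with a minimal sketch. This is not the authors' tool, whose design the abstract does not detail. It assumes a notebook exported as JSON with a top-level "paragraphs" list whose entries carry a "text" field that may begin with an interpreter directive such as %spark or %sh. The helper names (split_paragraph, notebook_to_steps, step_filenames) and the directive-to-extension mapping are hypothetical, chosen only for illustration.

```python
import json

# Assumed mapping from interpreter directive to script extension (illustrative only).
EXTENSIONS = {"spark": "scala", "pyspark": "py", "sql": "sql", "sh": "sh"}


def load_note(path):
    """Read an exported notebook JSON file into a dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def split_paragraph(text, default="spark"):
    """Return (interpreter, code) for one paragraph's text."""
    stripped = text.lstrip()
    if not stripped.startswith("%"):
        return default, text  # no directive: assume the default interpreter
    parts = stripped[1:].split(None, 1)
    name = parts[0] if parts else default
    code = parts[1] if len(parts) > 1 else ""
    return name, code


def notebook_to_steps(note):
    """Merge consecutive paragraphs that share an interpreter into ordered steps."""
    steps = []
    for paragraph in note.get("paragraphs", []):
        text = paragraph.get("text") or ""
        interpreter, code = split_paragraph(text)
        if interpreter not in EXTENSIONS or not code.strip():
            continue  # skip documentation (e.g. %md) and empty paragraphs
        if steps and steps[-1][0] == interpreter:
            steps[-1] = (interpreter, steps[-1][1] + "\n" + code)
        else:
            steps.append((interpreter, code))
    return steps


def step_filenames(steps):
    """Name one script per step, numbered in execution order."""
    return [f"step_{i:02d}.{EXTENSIONS[name]}" for i, (name, _) in enumerate(steps, 1)]


if __name__ == "__main__":
    demo = {"paragraphs": [
        {"text": "%md\n# Demo notebook"},
        {"text": "%spark\nval edges = sc.textFile(\"edges.txt\")"},
        {"text": "%spark\nprintln(edges.count())"},
        {"text": "%sh\nls -l"},
    ]}
    for filename, (_, code) in zip(step_filenames(notebook_to_steps(demo)),
                                   notebook_to_steps(demo)):
        print(filename, "<-", repr(code))
```

Each resulting step could then be written to its own file and submitted by whatever command-line runner the target cluster provides. Merging consecutive paragraphs keeps variables defined in one paragraph visible to the next, which mirrors how a notebook session shares state within one interpreter.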



      Published In

      PEARC '18: Proceedings of the Practice and Experience on Advanced Research Computing: Seamless Creativity
      July 2018
      652 pages
      ISBN:9781450364461
      DOI:10.1145/3219104

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. Batch processing
      2. Big Data
      3. Scala
      4. Spark
      5. Visualization

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      PEARC '18

      Acceptance Rates

      PEARC '18 Paper Acceptance Rate 79 of 123 submissions, 64%;
      Overall Acceptance Rate 133 of 202 submissions, 66%

      Cited By

      • (2025) Data Lakehouse: A survey and experimental study. Information Systems 127 (102460). DOI: 10.1016/j.is.2024.102460. Online publication date: Jan-2025
      • (2023) An Introduction to Data Visualisation. Handbook of Research on AI and Knowledge Engineering for Real-Time Business Intelligence (34-53). DOI: 10.4018/978-1-6684-6519-6.ch003. Online publication date: 7-Apr-2023
      • (2023) An Electronic Commerce Big Data Analytics Architecture and Platform. Applied Sciences 13:19 (10962). DOI: 10.3390/app131910962. Online publication date: 4-Oct-2023
      • (2022) Data Streams Management: Multidimensional Summary with Big Data Tools. 2022 5th International Conference on Computing and Big Data (ICCBD) (50-55). DOI: 10.1109/ICCBD56965.2022.10080310. Online publication date: 16-Dec-2022
      • (2021) Efficient IoT Data Management for Geological Disasters Based on Big Data-Turbocharged Data Lake Architecture. ISPRS International Journal of Geo-Information 10:11 (743). DOI: 10.3390/ijgi10110743. Online publication date: 2-Nov-2021
      • (2021) Building Architecture of Intelligent Control System for Urban Rail Transit System. World of Transport and Transportation 19:1 (18-46). DOI: 10.30932/1992-3252-2021-19-1-18-46. Online publication date: 8-Sep-2021
      • (2021) FPGA acceleration in EVOLVE’s Converged Cloud-HPC Infrastructure. 2021 31st International Conference on Field-Programmable Logic and Applications (FPL) (376-377). DOI: 10.1109/FPL53798.2021.00072. Online publication date: Aug-2021
      • (2020) Using Containers to Create More Interactive Online Training and Education Materials. Practice and Experience in Advanced Research Computing 2020: Catch the Wave (246-251). DOI: 10.1145/3311790.3396641. Online publication date: 26-Jul-2020
      • (2020) Big Data Pipeline with ML-Based and Crowd Sourced Dynamically Created and Maintained Columnar Data Warehouse for Structured and Unstructured Big Data. 2020 3rd International Conference on Information and Computer Technologies (ICICT) (60-67). DOI: 10.1109/ICICT50521.2020.00018. Online publication date: Mar-2020
      • (2020) Research on big data mining and fault prediction based on elevator life cycle. 2020 International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE) (103-107). DOI: 10.1109/ICBASE51474.2020.00030. Online publication date: Oct-2020
