
Search Results (8)

Search Parameters:
Keywords = Apache Flink

30 pages, 618 KiB  
Article
Benchmarking Big Data Systems: Performance and Decision-Making Implications in Emerging Technologies
by Leonidas Theodorakopoulos, Aristeidis Karras, Alexandra Theodoropoulou and Georgios Kampiotis
Technologies 2024, 12(11), 217; https://doi.org/10.3390/technologies12110217 - 3 Nov 2024
Cited by 2 | Viewed by 3096
Abstract
Systems for graph processing are a key enabler for insights from large-scale graphs that are critical to many new advanced technologies such as Artificial Intelligence, the Internet of Things, and blockchain. In this study, we benchmark two widely utilized graph processing systems, Apache Spark GraphX and Apache Flink, against the key performance criteria of response time, scalability, and computational complexity. Our results show the capability of each system for real-world graph applications and hence provide a quantitative basis for selecting the system best suited to a given purpose. GraphX's strength lay in processing batch in-memory workloads typical of blockchain and machine learning model optimization, while Flink excelled at processing stream data, which is timely and important for the IoT world. These performance characteristics emphasize how the capabilities of graph processing systems can be matched to the performance requirements of different emerging technology applications. Our findings ultimately inform practitioners not only about system efficiencies and limitations, but also about recent advances in hardware accelerators and algorithmic improvements shaping the new graph processing frontier in diverse technology domains. Full article
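As a rough, framework-agnostic illustration of the kind of measurement such a benchmark performs, the sketch below times a toy graph workload the way a response-time benchmark would; the workload and helper names are invented for illustration and are not from the paper:

```python
import time
from collections import defaultdict

def out_degrees(edges):
    """Toy graph workload: compute the out-degree of each vertex."""
    deg = defaultdict(int)
    for src, _dst in edges:
        deg[src] += 1
    return dict(deg)

def measure(workload, data):
    """Run a workload once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = workload(data)
    return result, time.perf_counter() - start

edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
result, elapsed = measure(out_degrees, edges)
```

A real benchmark would repeat such runs at growing input sizes to chart scalability, which is the axis on which GraphX and Flink diverge in the study.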

28 pages, 1843 KiB  
Article
Performance Evaluation Analysis of Spark Streaming Backpressure for Data-Intensive Pipelines
by Kassiano J. Matteussi, Julio C. S. dos Anjos, Valderi R. Q. Leithardt and Claudio F. R. Geyer
Sensors 2022, 22(13), 4756; https://doi.org/10.3390/s22134756 - 23 Jun 2022
Cited by 7 | Viewed by 2806
Abstract
A significant rise in the adoption of streaming applications has changed the decision-making processes in the last decade. This movement has led to the emergence of several Big Data technologies for in-memory processing, such as the systems Apache Storm, Spark, Heron, Samza, Flink, and others. Spark Streaming, a widespread open-source implementation, processes data-intensive applications that often require large amounts of memory. However, Spark Unified Memory Manager cannot properly manage sudden or intensive data surges and their related in-memory caching needs, resulting in performance and throughput degradation, high latency, a large number of garbage collection operations, out-of-memory issues, and data loss. This work presents a comprehensive performance evaluation of Spark Streaming backpressure to investigate the hypothesis that it could support data-intensive pipelines under specific pressure requirements. The results reveal that backpressure is suitable only for small and medium pipelines for stateless and stateful applications. Furthermore, it points out the Spark Streaming limitations that lead to in-memory-based issues for data-intensive pipelines and stateful applications. In addition, the work indicates potential solutions. Full article
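Spark Streaming's backpressure throttles the ingestion rate with a PID-based rate estimator; the sketch below captures the general flavor of that feedback loop. The function name, gains, and formula are illustrative assumptions, not Spark's actual implementation:

```python
def next_ingest_rate(current_rate, processed_per_sec, backlog_delay_sec,
                     batch_interval_sec, kp=1.0, ki=0.2):
    """Illustrative PID-style rate update: slow ingestion when processing
    lags behind, speed it up when there is headroom."""
    error = current_rate - processed_per_sec                 # instantaneous lag
    backlog = backlog_delay_sec * processed_per_sec / batch_interval_sec
    new_rate = current_rate - kp * error - ki * backlog
    return max(new_rate, 1.0)                                # keep a positive floor
```

The paper's finding that backpressure suits only small and medium pipelines reflects the limits of this kind of reactive throttling under sustained data-intensive load.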

19 pages, 3842 KiB  
Article
The Metamorphosis (of RAM3S)
by Ilaria Bartolini and Marco Patella
Appl. Sci. 2021, 11(24), 11584; https://doi.org/10.3390/app112411584 - 7 Dec 2021
Cited by 2 | Viewed by 1867
Abstract
The real-time analysis of Big Data streams is a terrific resource for transforming data into value. For this, Big Data technologies for smart processing of massive data streams are available, but the facilities they offer are often too raw to be effectively exploited by analysts. RAM3S (Real-time Analysis of Massive MultiMedia Streams) is a framework that acts as a middleware software layer between multimedia stream analysis techniques and Big Data streaming platforms, so as to facilitate the implementation of the former on top of the latter. RAM3S has been proven helpful in simplifying the deployment of non-parallel techniques to streaming platforms, such as Apache Storm or Apache Flink. In this paper, we show how RAM3S has been updated to incorporate novel stream processing platforms, such as Apache Samza, and to be able to communicate with different message brokers, such as Apache Kafka. Abstracting from the message broker also provides us with the ability to pipeline several RAM3S instances that can, therefore, perform different processing tasks. This represents a richer model for stream analysis with respect to the one already available in the original RAM3S version. The generality of this new RAM3S version is demonstrated through experiments conducted on three different multimedia applications, proving that RAM3S is a formidable asset for enabling efficient and effective Data Mining and Machine Learning on multimedia data streams. Full article
(This article belongs to the Special Issue Data Mining and Machine Learning in Multimedia Databases)
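The broker-abstraction idea described above (one interface behind which Kafka or another broker can sit, enabling pipelined processing stages) can be sketched as follows; all names are invented for illustration and this is not RAM3S code:

```python
from collections import defaultdict, deque

class InMemoryBroker:
    """Stand-in for a message broker (e.g., Apache Kafka) behind one interface."""
    def __init__(self):
        self.topics = defaultdict(deque)

    def publish(self, topic, msg):
        self.topics[topic].append(msg)

    def consume(self, topic):
        q = self.topics[topic]
        return q.popleft() if q else None

def stage(broker, in_topic, out_topic, fn):
    """One pipelined stage: drain the input topic, transform, forward."""
    while (msg := broker.consume(in_topic)) is not None:
        broker.publish(out_topic, fn(msg))

broker = InMemoryBroker()
broker.publish("raw", 1)
broker.publish("raw", 2)
stage(broker, "raw", "doubled", lambda x: x * 2)   # first pipelined task
stage(broker, "doubled", "final", lambda x: x + 1) # second pipelined task
```

Chaining stages through topics is what lets several instances perform different processing tasks in sequence, as the new RAM3S version allows.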

26 pages, 2234 KiB  
Article
SPOT: Testing Stream Processing Programs with Symbolic Execution and Stream Synthesizing
by Qian Ye and Minyan Lu
Appl. Sci. 2021, 11(17), 8057; https://doi.org/10.3390/app11178057 - 30 Aug 2021
Cited by 1 | Viewed by 2280
Abstract
Adoption of distributed stream processing (DSP) systems such as Apache Flink in real-time big data processing is increasing. However, DSP programs are prone to bugs, especially when a programmer neglects some DSP features (e.g., source data reordering), which motivates the development of approaches for testing and verification. In this paper, we focus on the test data generation problem for DSP programs. Currently, no approach generates test data for DSP programs with both high path coverage and coverage of different stream reordering situations. We present a novel solution, SPOT (Stream Processing Program Test), to achieve these two goals simultaneously. First, SPOT generates a set of individual test data representing each path of a DSP program through symbolic execution. Then, SPOT composes these independent data into various time series data (i.e., streams) with diverse reorderings. Finally, we can perform a test by feeding the DSP program these streams continuously. To automatically support symbolic analysis, we also developed JPF-Flink, a JPF (Java Pathfinder) extension to coordinate the execution of Flink programs. We present four case studies to illustrate that: (1) SPOT can support symbolic analysis for the commonly used DSP operators; (2) test data generated by SPOT can achieve high JDU (Joint Dataflow and UDF) path coverage more efficiently than two recent DSP testing approaches; (3) test data generated by SPOT can trigger software failures more easily than those two DSP testing approaches; and (4) the data randomly generated by those two test techniques are highly skewed in terms of stream reordering, as measured by an entropy metric, whereas the test data from SPOT are evenly distributed. Full article
(This article belongs to the Special Issue Analytics, Privacy and Security for IoT and Big Data)

24 pages, 1008 KiB  
Article
SPARQL2Flink: Evaluation of SPARQL Queries on Apache Flink
by Oscar Ceballos, Carlos Alberto Ramírez Restrepo, María Constanza Pabón, Andres M. Castillo and Oscar Corcho
Appl. Sci. 2021, 11(15), 7033; https://doi.org/10.3390/app11157033 - 30 Jul 2021
Cited by 5 | Viewed by 2936
Abstract
Existing SPARQL query engines and triple stores are continuously improved to handle more massive datasets. Several approaches have been developed in this context proposing the storage and querying of RDF data in a distributed fashion, mainly using the MapReduce Programming Model and Hadoop-based ecosystems. New trends in Big Data technologies have also emerged (e.g., Apache Spark, Apache Flink); they use distributed in-memory processing and promise to deliver higher data processing performance. In this paper, we present a formal interpretation of some PACT transformations implemented in the Apache Flink DataSet API. We use this formalization to provide a mapping to translate a SPARQL query to a Flink program. The mapping was implemented in a prototype used to determine the correctness and performance of the solution. The source code of the project is available on GitHub under the MIT license. Full article
(This article belongs to the Special Issue Big Data Management and Analysis with Distributed or Cloud Computing)
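The core idea of mapping a SPARQL basic graph pattern onto dataflow transformations — match each triple pattern, then join the resulting variable bindings on shared variables — can be sketched over an in-memory triple list. This is illustrative only; SPARQL2Flink targets the Flink DataSet API:

```python
def match_pattern(triples, pattern):
    """Match one triple pattern against a triple list; terms starting
    with '?' are variables. Returns a list of binding dicts."""
    bindings = []
    for t in triples:
        b = {}
        for term, val in zip(pattern, t):
            if term.startswith("?"):
                b[term] = val
            elif term != val:
                b = None
                break
        if b is not None:
            bindings.append(b)
    return bindings

def join(left, right):
    """Natural join of two binding sets on their shared variables,
    analogous to a dataflow join operator."""
    out = []
    for l in left:
        for r in right:
            if all(l[k] == r[k] for k in l.keys() & r.keys()):
                out.append({**l, **r})
    return out

triples = [("alice", "knows", "bob"),
           ("bob", "knows", "carol"),
           ("alice", "age", "30")]
result = join(match_pattern(triples, ("?x", "knows", "?y")),
              match_pattern(triples, ("?y", "knows", "?z")))
```

Each pattern match corresponds to a filter/map over the triple set and each shared-variable join to a join transformation, which is roughly the shape of the translation the paper formalizes.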

33 pages, 3235 KiB  
Article
s2p: Provenance Research for Stream Processing System
by Qian Ye and Minyan Lu
Appl. Sci. 2021, 11(12), 5523; https://doi.org/10.3390/app11125523 - 15 Jun 2021
Cited by 3 | Viewed by 2410
Abstract
The main purpose of our provenance research for DSP (distributed stream processing) systems is to analyze abnormal results. Provenance for these systems is nontrivial because of the ephemerality of stream data and the instant data processing mode of modern DSP systems. Challenges include, but are not limited to, an optimization solution for avoiding excessive runtime overhead, reducing provenance-related data storage, and providing provenance in an easy-to-use fashion. Without any prior knowledge about which kinds of data may ultimately lead to abnormal results, we have to track all transformations in detail, which potentially imposes a heavy system burden. This paper proposes s2p (Stream Process Provenance), which mainly consists of online provenance and offline provenance, to provide fine- and coarse-grained provenance at different precisions. We base our design of s2p on the fact that, for a mature online DSP system, abnormal results are rare, and the results that require a detailed analysis are even rarer. We also consider state transition in our provenance explanation. We implement s2p on Apache Flink as s2p-flink and conduct three experiments to evaluate its scalability, efficiency, and overhead in terms of end-to-end cost, throughput, and space overhead. Our evaluation shows that s2p-flink incurs a 13% to 32% cost overhead, an 11% to 24% decline in throughput, and little additional space cost in the online provenance phase. Experiments also demonstrate that s2p-flink scales well. A case study is presented to demonstrate the feasibility of the whole s2p solution. Full article
(This article belongs to the Collection Big Data Analysis and Visualization Ⅱ)
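The underlying notion — record lineage edges while processing so that a rare abnormal output can later be traced back to its source records — can be sketched as follows. This is toy code illustrating the general technique, not s2p's design:

```python
def traced_map(fn, stream, lineage):
    """Apply fn to each record while recording, for every output id,
    the input ids that produced it (one-to-one here)."""
    out = []
    for i, record in enumerate(stream):
        lineage.setdefault(len(out), []).append(i)  # output id -> source ids
        out.append(fn(record))
    return out

def backtrace(lineage, out_id):
    """Return the source record ids that contributed to one output."""
    return lineage.get(out_id, [])

lineage = {}
out = traced_map(lambda x: x * 2, [3, 4, 5], lineage)
```

Keeping such edges for every record is exactly the "hard burden" the abstract warns about, which motivates s2p's split into a cheap online phase and a detailed offline phase for the rare cases worth replaying.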

36 pages, 2883 KiB  
Article
Industry 4.0 towards Forestry 4.0: Fire Detection Use Case
by Radhya Sahal, Saeed H. Alsamhi, John G. Breslin and Muhammad Intizar Ali
Sensors 2021, 21(3), 694; https://doi.org/10.3390/s21030694 - 20 Jan 2021
Cited by 45 | Viewed by 7300
Abstract
Forestry 4.0 is inspired by the Industry 4.0 concept, which plays a vital role in the next industrial revolution, and it is ushering in a new era of efficient and sustainable forest management. Environmental sustainability and climate change are related challenges for promoting sustainable forest management of natural resources. The Internet of Forest Things (IoFT) is an emerging technology that helps manage forest sustainability and protect forests from hazards by distributing smart devices that gather data streams for monitoring and fire detection. Stream processing is a well-known research area, and recently it has gained further significance due to the emergence of IoFT devices. Distributed stream processing platforms have emerged, e.g., Apache Flink, Storm, and Spark. Query windowing is the heart of any stream-processing platform; it splits an infinite data stream into chunks of finite data on which a query executes. Dynamic query window-based processing can reduce the reporting time in case of missing and delayed events caused by data drift. In this paper, we present a novel dynamic mechanism to recommend the optimal window size and type based on the dynamic context of an IoFT application. In particular, we designed a dynamic window selector for stream queries that considers input stream data characteristics, application workload, and resource constraints to recommend the optimal stream query window configuration. A research gap on the likelihood of adopting smart IoFT devices for environmental sustainability indicates a lack of empirical studies pursuing forest sustainability, i.e., sustainable forestry applications. We therefore focus on forest fire management and detection, one of the dynamic environmental management challenges posed by climate change, as a use case of Forestry 4.0 to deliver sustainable forestry goals. According to the dynamic window selector's experimental results, the end-to-end latency of reported fire alerts is reduced by dynamically adapting the window size as the IoFT stream rate changes. Full article
(This article belongs to the Section Internet of Things)
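One simple way to derive a window size from the observed stream rate — the general flavor of a dynamic window selector, though not the paper's actual mechanism, which also weighs workload and resource constraints — is to size windows so each holds a roughly constant number of events:

```python
def recommend_window(events_per_sec, target_events_per_window=100,
                     min_size=1.0, max_size=60.0):
    """Pick a time-window length in seconds so that each window holds
    roughly target_events_per_window events, clamped to sane bounds.
    All thresholds here are invented for illustration."""
    if events_per_sec <= 0:
        return max_size                      # idle stream: widest window
    size = target_events_per_window / events_per_sec
    return min(max(size, min_size), max_size)
```

Shrinking the window as the event rate rises is what lets a fire alert surface sooner during a burst instead of waiting out a fixed-length window.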

21 pages, 6376 KiB  
Article
SLA-Based Adaptation Schemes in Distributed Stream Processing Engines
by Muhammad Hanif, Eunsam Kim, Sumi Helal and Choonhwa Lee
Appl. Sci. 2019, 9(6), 1045; https://doi.org/10.3390/app9061045 - 13 Mar 2019
Cited by 3 | Viewed by 4220
Abstract
With the upswing in the volume of data and information online and the enormous growth of cloud applications, big data analytics has become mainstream in research communities in industry as well as in the scholarly world. This has prompted the emergence and development of real-time distributed stream processing frameworks, such as Flink, Storm, Spark, and Samza. These frameworks allow complex queries on streaming data to be distributed across multiple worker nodes in a cluster. A few of these stream processing frameworks provide fundamental support for controlling the latency and throughput of the system as well as the correctness of the results; however, none has the ability to handle them on the fly at runtime. We present a well-informed and efficient adaptive watermarking and dynamic buffering timeout mechanism for distributed streaming frameworks. It is designed to increase the overall throughput of the system by making the watermarks adaptive to the stream of incoming workload and scaling the buffering timeout dynamically for each task tracker on the fly, while maintaining the Service Level Agreement (SLA)-based end-to-end latency of the system. This work focuses on tuning the parameters of the system (such as window correctness, buffering timeout, and so on) based on the prediction of incoming workloads, and assesses whether a given workload will breach an SLA using output metrics including latency, throughput, and correctness of both intermediate and final results. We used Apache Flink as our testbed distributed processing engine for this work; however, the proposed mechanism can be applied to other streaming frameworks as well. Our results on the testbed model indicate that the proposed system outperforms the status quo of stream processing. With the inclusion of learning models such as naïve Bayes, multilayer perceptron (MLP), and sequential minimal optimization (SMO), the system shows further progress in keeping the SLA intact as well as maintaining quality of service (QoS). Full article
(This article belongs to the Section Computing and Artificial Intelligence)
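Adaptive watermarking in general means widening or narrowing the assumed out-of-orderness bound as disorder is observed, so the watermark tracks the stream instead of using a fixed delay. A minimal sketch follows; the exponential-smoothing scheme is invented for illustration and is not the paper's mechanism:

```python
class AdaptiveWatermark:
    """Track the max event time and an adaptive lateness bound;
    the watermark is max_ts - bound, with the bound following
    the observed lateness of out-of-order events."""

    def __init__(self, initial_bound=0.0, alpha=0.5):
        self.max_ts = float("-inf")
        self.bound = initial_bound
        self.alpha = alpha   # smoothing factor for the lateness estimate

    def observe(self, ts):
        if ts < self.max_ts:
            # Late event: move the bound toward its observed lateness.
            lateness = self.max_ts - ts
            self.bound = (1 - self.alpha) * self.bound + self.alpha * lateness
        else:
            self.max_ts = ts
        return self.watermark()

    def watermark(self):
        return self.max_ts - self.bound

wm = AdaptiveWatermark()
wm.observe(10.0)   # in-order event advances the watermark
wm.observe(7.0)    # late event grows the bound, holding the watermark back
```

An orderly stream keeps the bound near zero (low latency), while a disorderly one grows it (better window correctness), which is the throughput/correctness trade-off the SLA-based scheme manages at runtime.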
