
4 Building Blocks of a Streaming Data Architecture

Whitepaper
Streaming data is becoming a core component of enterprise data architecture.
Streaming technologies are not new, but they have considerably matured over
the past year. The industry is moving from painstaking integration of
technologies like Kafka and Storm, towards full stack solutions that provide an
end-to-end streaming data architecture.

What is Streaming Data Architecture?


A streaming data architecture can ingest and process large volumes of streaming data from multiple sources. While traditional data solutions focused on writing and reading data in batches, a streaming data architecture consumes data immediately as it is generated, persists it to storage, and may perform real-time processing, data manipulation and analytics.

Why Streaming Data Architecture? Benefits of Stream Processing
Stream processing is becoming an essential data infrastructure for many
organizations. Typical use cases include clickstream analytics, which allows
companies to track web visitor activities and personalize content; eCommerce
analytics which helps online retailers avoid shopping cart abandonment and
display more relevant offers; and analysis of large volumes of streaming data
from sensors and connected devices in the Internet of Things (IoT). Stream
processing provides several benefits that other data platforms cannot:

• Able to deal with never-ending streams of events—some data is naturally structured this way. Traditional batch processing tools require stopping the stream of events, capturing batches of data and combining the batches to draw overall conclusions. While combining and capturing data from multiple streams can be challenging, stream processing lets you derive immediate insights from large volumes of streaming data.

• Real-time or near-real-time processing—most organizations adopt stream processing to enable real-time data analytics. While real-time analytics is also possible with high-performance database systems, the data often lends itself to a stream processing model.

• Detecting patterns in time-series data—detecting patterns over time, for example looking for trends in website traffic data, requires data to be continuously processed and analyzed. Batch processing makes this more difficult because it breaks data into batches, meaning some events are broken across two or more batches (see the windowing sketch after this list).

• Easy data scalability—growing data volumes can break a batch processing system, requiring you to provision more resources or modify the architecture. Modern stream processing infrastructure is hyper-scalable, able to deal with gigabytes of data per second with a single stream processor. This allows you to easily handle growing data volumes without infrastructure changes.
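
To make the windowing idea concrete, below is a minimal Python sketch of continuously processing a never-ending event stream with a sliding time window. The event generator, the page-view payload and the five-minute window size are illustrative assumptions, not part of any particular product.

# A minimal sketch of windowed processing over an unbounded event stream.
# The event source and the 300-second window are illustrative assumptions.
import time
from collections import deque

WINDOW_SECONDS = 300  # sliding five-minute window

def event_stream():
    """Stand-in for an unbounded source such as a message broker topic."""
    while True:
        yield {"ts": time.time(), "page": "/pricing"}  # one page-view event
        time.sleep(0.1)

window = deque()  # events currently inside the window, oldest first
for event in event_stream():
    window.append(event)
    # Evict events that have slid out of the window.
    while window and window[0]["ts"] < event["ts"] - WINDOW_SECONDS:
        window.popleft()
    # A trivial "pattern": traffic volume over the last five minutes,
    # recomputed on every event instead of once per batch.
    print(f"page views in last {WINDOW_SECONDS}s: {len(window)}")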

The Components of a Traditional Streaming Architecture
1. The Message Broker
This is the element that takes data from a source, called a producer, translates it
into a standard message format, and streams it on an ongoing basis. Other
components can then listen in and consume the messages passed on by the
broker.

The first generation of message brokers, such as RabbitMQ and Apache ActiveMQ, relied on the Message Oriented Middleware (MOM) paradigm. Later, hyper-performant messaging platforms emerged which are more suitable for a streaming paradigm. Two popular streaming brokers are Apache Kafka and Amazon Kinesis Data Streams.

Unlike the old MOM brokers, streaming brokers support very high performance with persistence, have massive capacity of a gigabyte per second or more of message traffic, and are tightly focused on streaming with no support for data transformations or task scheduling. You can learn more about message brokers in our article on analyzing Apache Kafka data.
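
To illustrate the producer/consumer pattern the broker enables, here is a minimal sketch using the kafka-python client. The localhost broker address and the "clickstream" topic are assumptions made for the example, not details of any specific deployment.

# A minimal sketch of producing and consuming a message stream,
# assuming a local Kafka broker and the kafka-python client library.
import json
from kafka import KafkaConsumer, KafkaProducer

# Producer: translate source records into a standard message format (JSON)
# and publish them to a topic on an ongoing basis.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)
producer.send("clickstream", {"user_id": 42, "page": "/pricing", "ts": 1735689600})
producer.flush()

# Consumer: any downstream component can listen in and read the same stream.
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # e.g. {'user_id': 42, 'page': '/pricing', 'ts': ...}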

2. Stream Processor / Streaming Data Aggregator


The stream processor collects data streams from one or more message brokers. It receives queries from users, fetches events from message queues and applies the query to generate a result. The result may be an API call, an action, a visualization, an alert, or in some cases a new data stream.

A few examples of stream processors are Apache Storm, Spark Streaming and
WSO2 Stream Processor. While stream processors work in different ways, they
are all capable of listening to message streams, processing the data and saving
it to storage. Some stream processors, including Spark and WSO2, provide a SQL
syntax for querying and manipulating the data.
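
As one possible illustration of this pattern, here is a minimal sketch using Spark Structured Streaming from Python. The broker address, the "clickstream" topic and its schema carry over from the earlier example and are assumptions, and running it requires the Spark Kafka connector package to be available.

# A minimal sketch of a stream processor built on Spark Structured Streaming.
# Assumes a local Kafka broker, the hypothetical "clickstream" topic, and the
# spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("clickstream-processor").getOrCreate()

schema = StructType([
    StructField("user_id", LongType()),
    StructField("page", StringType()),
    StructField("ts", LongType()),
])

# Listen to the message stream coming from the broker and parse the JSON payload.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.*")
)

# Apply a SQL query to the stream; the aggregated result is itself a stream.
events.createOrReplaceTempView("clicks")
page_counts = spark.sql("SELECT page, COUNT(*) AS views FROM clicks GROUP BY page")

# Write the running result out; a real pipeline would use a file or table sink.
query = page_counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()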

3. Data Analytics Engine


After streaming data is prepared for consumption by the stream processor, it
must be analyzed to provide value. There are many different approaches to
streaming data analytics. Here are some of the tools most commonly used for
streaming data analytics.

Analytics Tool | Streaming Use Case | Example Setup

Amazon Athena | Distributed SQL engine | Streaming data is saved to S3. You can set up ad hoc SQL queries via the AWS Management Console; Athena runs them serverlessly and returns results.

Amazon Redshift | Data warehouse | Amazon Kinesis Data Firehose can be used to save streaming data to Redshift. This enables near-real-time analytics with the BI tools and dashboards you have already integrated with Redshift.

Elasticsearch | Text search | Kafka Connect can be used to stream topics directly into Elasticsearch. If you use the Avro data format and a schema registry, Elasticsearch mappings with correct datatypes are created automatically. You can then perform rapid text search or analytics within Elasticsearch.

Cassandra | Low-latency serving of streaming events to apps | Kafka streams can be processed and persisted to a Cassandra cluster. You can implement another Kafka instance that receives a stream of changes from Cassandra and serves them to applications for real-time decision making.
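
As a concrete example of the first row, here is a minimal sketch of running an ad hoc Athena query over event data landed in S3, using boto3. The region, the events.clickstream table and the results bucket are assumptions for the example.

# A minimal sketch of querying streaming data that has landed in S3 with Athena.
# The region, database/table names and results bucket are illustrative assumptions.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Submit an ad hoc SQL query over event data written to S3 by the pipeline.
execution = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS views FROM events.clickstream GROUP BY page",
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])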

4. Streaming Data Storage


With the advent of low cost storage technologies, most organizations today are
storing their streaming event data. Here are several options for storing
streaming data, and their pros and cons.

Streaming Data Storage Options | Pros | Cons

In a database or data warehouse (for example, PostgreSQL or Amazon Redshift) | Easy SQL-based data analysis | Hard to scale and manage. If cloud-based, storage is expensive.

In the message broker (for example, using Kafka persistent storage) | Agile, no need to structure data into tables. Easy to set up, no additional components. | Data retention is an issue, since Kafka storage is up to 10x more expensive compared to data lake storage. Kafka performance is best for reading recent (cached) data.

In a data lake (for example, Amazon S3) | Agile, no need to structure data into tables. Low-cost storage. | High latency makes real-time analysis difficult. Difficult to perform SQL-based analysis.

A data lake is the most flexible and inexpensive option for storing event data,
but it has several limitations for streaming data applications. Upsolver provides
a data lake platform that ingests streaming data into a data lake, creates
schema-on-read, and extracts metadata. This allows data consumers to easily
prepare data for analytics tools and real time analytics.
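
As a rough sketch of the data lake option, the snippet below batches raw events into small date-partitioned objects on S3 using boto3. The bucket name, key layout and newline-delimited JSON format are assumptions for illustration, not a description of Upsolver's implementation.

# A minimal sketch of persisting raw events from a stream into a data lake.
# The bucket name, key layout and JSON-lines format are illustrative assumptions.
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")

def flush_to_data_lake(events, bucket="my-event-lake"):
    """Write a micro-batch of events as one newline-delimited JSON object,
    partitioned by date so downstream query engines can prune cheaply."""
    now = datetime.now(timezone.utc)
    key = f"events/dt={now:%Y-%m-%d}/{uuid.uuid4()}.json"
    body = "\n".join(json.dumps(event) for event in events)
    s3.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
    return key

# Example micro-batch collected from a consumer loop like the one shown earlier.
flush_to_data_lake([
    {"user_id": 42, "page": "/pricing", "ts": 1735689600},
    {"user_id": 7, "page": "/docs", "ts": 1735689601},
])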

Modern Streaming Architecture


In modern streaming data deployments, many organizations are adopting a full stack approach. Vendors are providing technology solutions, most of them based on Kafka, which can take streaming data and perform the entire process, from message ingestion through ETL, storage management and preparing data for analytics.

Benefits of a modern streaming architecture:

• Can eliminate the need for large data engineering projects

• Performance, high availability and fault tolerance built in

• Newer platforms are cloud-based and can be deployed very quickly with no
upfront investment

• Flexibility and support for multiple use cases

The Future of Streaming Data in 2019 and Beyond


Streaming data architecture is in constant flux. Three trends we believe will be
significant in 2019 and beyond:

• Fast adoption of platforms that decouple storage and compute—streaming data growth is making traditional data warehouse platforms too expensive and cumbersome to manage. Data lakes are increasingly used, both as a cheap persistence option for storing large volumes of event data, and as a flexible integration point, allowing tools outside the streaming ecosystem to access streaming data.

• From table modeling to schemaless development—data consumers don't always know the questions they will ask in advance. They want to run an interactive, iterative process with as little initial setup as possible. Lengthy table modeling, schema detection and metadata extraction are a burden.

• Automation of data plumbing—organizations are becoming reluctant to
spend precious data engineering time on data plumbing, instead of activities
that add value, such as data cleansing or enrichment. Increasingly, data
teams prefer full stack platforms that reduce time-to-value, over tailored
home-grown solutions.

You can read more of our predictions for streaming data trends here.

Want to enhance your streaming architecture? Upsolver's streaming data platform processes event data and ingests it into data lakes, data warehouses, serverless platforms, Elasticsearch, and much more. Furthermore, it enables real-time analytics, using low-latency consumers that read from a Kafka stream in parallel. It is a fully integrated solution that can be set up within hours.

By using Upsolver, you get the best of both worlds—low cost storage on a data
lake, easy transformation to tabular formats, and real time support. Begin your
free trial to start building a next-gen streaming data architecture.

