Kudu Forrester Webinar

© Cloudera, Inc. All rights reserved.
Apache Kudu Webinar Series
Understanding and Unlocking
the Value of Real-Time Data
Ryan Lippert | Cloudera
Michele Goetz | Forrester (Special Guest)

Kudu Webinar Series
Part 1: Lambda Architectures – Simplified by Apache Kudu
A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can
dramatically simplify real-time analytics.
Part 2: Extending the Capabilities of Operational and Analytical Databases
An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and
Analytical databases can handle.
Part 3: Data-in-Motion: Unlock the Value of Real-Time Data
Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will
discuss how to make it a reality.
Part 4: Technical Deep-Dive into Apache Kudu
An in-depth examination of the technical architecture and design of Apache Kudu, straight from a PMC
Member.

Updateable Analytic Storage
Simple real-time analytics and updates with Apache Kudu
Kudu: Storage for fast analytics on fast data
• Simplified architecture for building real-time analytic
applications
• Designed for next-generation hardware for faster analytic
performance across frameworks
• Native Hadoop storage engine
Flexibility for the right tools for the right use
case in one platform
• Only analytic database for big data with Kudu + Impala
• Simple real-time applications with Kudu + Spark
Use cases
• Time series data
• Machine data analytics
• Online reporting
INTEGRATE
• STRUCTURED: Sqoop
• UNSTRUCTURED: Kafka, Flume
PROCESS, ANALYZE, SERVE
• BATCH: Spark, Hive, Pig, MapReduce
• STREAM: Spark
• SQL: Impala
• SEARCH: Solr
• OTHER: Kite
UNIFIED SERVICES
• RESOURCE MANAGEMENT: YARN
• SECURITY: Sentry, RecordService
STORE
• NoSQL: HBase
• OTHER: Object Store
• FILESYSTEM: HDFS
• RELATIONAL: Kudu

Real-Time Data
• Ingest data of any type or volume
• Process data as it arrives
• Serve data to users and applications

Agenda
Drivers for agile, real-time data platforms
What key use cases are driving businesses towards real-time
platforms?
Data on adoption trends for real-time technologies
What is Forrester seeing in the market for real-time technologies?
Deploying a real-time OSS architecture to grow your business
How can you build a scalable, cost-effective platform to grow your
business?

© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Michele Goetz
Special Guest Speaker
Principal Analyst Serving Enterprise Architecture Professionals

Superior CX depends on data and insights

Fraud and risk management requires real-time data

IoT heat map shows where data matters most, now

Data bottlenecks are catalysts for transition

Create a road map for a real-time, agile data platform

Leaders are focused on the technologies that allow data and
insights to be consumed across the organization
What are your firm's plans for the following data driven initiatives?
Base: 3005 global data and analytics decision-makers.
Source: Business Technographics® Global Data & Analytics Survey, 2016
(Expanding/Implemented | Planning to implement within the next 12 months)
• Creating an organizational center of excellence for business intelligence: 51% | 22%
• Combine content management and data management programs into a unified information management program: 51% | 22%
• Changing our processes to promote data stewardship and sharing: 51% | 22%
• Investing in platforms to and share out data content: 51% | 22%
• Creating a business led data stewardship or governance program: 51% | 22%
• Changing management incentives to promote data sharing: 49% | 24%
• Implementing analytics insights in software systems to aid customers or support employee decisions: 52% | 22%
• Investing more in business friendly, self-service visualization and analytics: 52% | 23%
• Engaging external services providers or strategic business consultants for data and analytics or insights services: 54% | 22%
• Providing data preparation tools for self-service data management: 54% | 23%
• Investing in distributed real time insight delivery technology: 58% | 22%

54% of data and analytics technology decision-makers increased
their budgets for data and analytics from 2015 to 2016
Which of the following describes your [TDM=”IT budget data and analytics technology or services”; BDM=”business budget for data and analytics technology or services”] from 2015 to 2016?
• Decrease by 5% to 10%: 4%
• Don’t know: 5%
• Decrease by 1-4%: 6%
• Increase by more than 10%: 6%
• Increase by 5% to 10%: 22%
• Increase by 1-4%: 26%
• Stay about the same: 30%
Base: 325 global data and analytics technology decision-makers. “Don’t know” not shown.

Companies of all sizes are spending millions for data & analytics
Please estimate, in millions, how much your data and analytics budget is for 2016? (Note: Number is in US Dollars)
(SMB (20-999 employees)* | Enterprise (1,000 or more employees))
• Less than $1 million: 55% | 32%
• $1 million to under $10 million: 22% | 30%
• $10 million to under $100 million: 9% | 13%
• $100 million to under $500 million: 1% | 4%
• $500 million to under $1 billion: 1% | 2%
• $1 billion to under $5 billion: 0% | 2%
• $5 billion or more: 0% | 1%
Note: Don’t know excluded. Base: 765*, 1,288 global data and analytics decision makers

Among the DM technologies Forrester tracks, interest for stream
processing tools has grown the most YoY
What are your firm's plans to use the following data management technologies?
Base: 2094 and *1805 global data and analytics technology decision-makers.
% with commitment (expanding, implemented, or planning to implement in the next 12 months), earlier survey to later survey, with year-over-year change:
• Stream processing tools: 59% to 64% (+5 p.p.)
• Inverted index database: 61% to 64% (+3 p.p.)
• Distributed NoSQL databases: 63% to 61% (-2 p.p.)
• Hadoop: 63% to 62% (-1 p.p.)
• Associative index databases: 60% to 58% (-2 p.p.)
• RDF, triple store: 59% to 56% (-3 p.p.)
% with interest, but no immediate plans (two series, values as charted):
-20% -19% -19% -20% -19% -19%
-13% -13% -16% -14% -14% -13%

Streaming analytics high in the list of big data plans
Which of the following are included in your plans for big data?
• NoSQL other than Hadoop: 16%
• A MPP (massively parallel processing) data warehouse: 18%
• Semantic technologies (ontology building, search, auto curation, graph, etc.): 22%
• Hadoop (including HBase or Accumulo): 23%
• Data anonymization or de-identification: 23%
• Creating or building out a data lake: 26%
• Marketing or digital data management platforms and service providers that brand their offerings as big data: 26%
• Packaged analytics technologies that brand themselves as big data: 27%
• Unstructured data mining / analytics: 28%
• Distributed in memory databases, grids, analytics tools: 30%
• Streaming analytics / computing: 33%
• Large scale predictive modeling, data mining or other advanced analytics: 36%
• Public cloud big data services: 40%
Base: Total: 2094

Deploying a real-time OSS architecture to grow your business

Trend Towards Real-Time Data Platforms is Clear
Drivers for Real-Time Platforms
• Enhancing customer experiences
• Risk Management
• Advancement of IoT and broader instrumentation
Adoption is Accelerating
• Top data-driven initiative by investment: distributed delivery of
real-time data
• DM technology with highest momentum: stream processing
• Top big data plans: streaming analytics is top 3
• Broad, large investments: 90% of decision makers are either
continuing or increasing their investments in data and analytics;
millions/billions being spent

The Underlying Driver
What drives a use case to real-time?
High Frequency Trading
APT Detection
Fraud Detection
Predictive Maintenance
Next Best Offer
Inventory Management
Shipping/Logistic Systems
CRM Systems
Employee Management
Strategic Planning
Real-time data management use cases are
defined by a common set of characteristics.
• Narrow time window in which to make a decision
(automated or manual)
• Opportunity for the data points to change the
decision path
• Decreasing value of data over time
Not all use cases have a pressing need for
real-time data.
• Broader strategic decisions, for example, do not
require real-time data input
• Over time, decreases in HW costs and increases in
availability of real-time systems will lead most use
cases to be conducted in real-time
(Use cases listed above range from Real Time to Some Latency Acceptable.)

Moving to Real-Time and Leveraging Analytics
What do we have to gain?
Traditional Architectures: “Monitoring System”
Sensors are automatically monitored and programmed to deliver
warnings when readings are delivered outside of an “optimal zone”.
Basic models developed over small subsets of data.
Real-Time Analytic Capabilities: “Predictive System”
Ingestion and processing of all sensor data into an unlimited data
store with analytic capabilities enables machine learning, which can
provide automated optimization and predictive maintenance.
“Only 1 percent of data from an oil rig with 30,000 sensors
is examined. The data that are used today are mostly for
anomaly detection and control, not optimization and
prediction, which provide the greatest value.”
- McKinsey & Company
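As a rough illustration of the “Monitoring System” approach described above, the following Python sketch flags readings that fall outside an “optimal zone”. The `Reading` type, the `out_of_zone` function, the sensor names, and the 50.0 to 90.0 thresholds are all hypothetical and chosen only for the example; they do not come from the webinar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One sensor measurement (hypothetical structure for illustration)."""
    sensor_id: str
    value: float


def out_of_zone(readings, low, high):
    """Return the readings whose value lies outside the [low, high] optimal zone."""
    return [r for r in readings if not (low <= r.value <= high)]


# Hypothetical readings and thresholds, assumed for this example only.
readings = [
    Reading("pump-1", 71.5),
    Reading("pump-2", 96.2),
    Reading("pump-3", 48.0),
]
warnings = out_of_zone(readings, low=50.0, high=90.0)

for r in warnings:
    print(f"WARNING {r.sensor_id}: reading {r.value} is outside the optimal zone")
```

A check like this only reacts to individual readings. The “Predictive System” on the same slide instead depends on storing and analyzing all sensor data, which this sketch does not attempt.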

Ingestion at Cloudera
• Apache Sqoop for data from
relational databases
• Apache Flume for logs, event
based data
• Apache Kafka for fast,
scalable, and fault-tolerant
messaging
Partners, such as StreamSets,
provide rich visualization tools
Ingestion in Real-Time
Stream Ingestion is a Must for Many Use Cases
Ingestion isn’t just about internal business data anymore.
• Traditional ingestion was internally focused, and often a matter of
moving data from one silo or system to another
• Today, businesses aim to take in data from a variety of external
sources, IoT sensors, and machine-generated (user/network)
data
Your data journey can’t start until the data arrives.
• Each step of the ingest/process/serve data pipeline must occur
at real-time speed if decisions are to be made in time to affect
the course of business
Visualization helps practitioners understand their data.
• Complex tasks can be made less complex via graphical
representations; data ingestion is no different

Stream Processing at Cloudera
Spark Streaming, the leading
open-source framework for real-
time use cases, is deployed in
Cloudera’s real-time
architectures.
Cloudera has the broadest base
of Hadoop-adjacent experience
with Spark and integrating it
with Apache components.
Processing in Real-Time
Unlocking Value at Speed
For some use cases, batch just isn’t enough.
• Batch processing can lead to bottlenecks and delays in data
transformations that cause missed opportunities.
Apache Spark is gaining momentum for a reason.
• Leveraging Apache Spark for stream processing enables real-time
use cases with sub-second latency and best-in-class APIs.
Spark has a best-in-class ecosystem.
• Machine learning (via MLlib) is seamlessly integrated into Spark.
• Broadest set of vendors and contributors working on Spark
among available processing engines, leading to rapid innovation.

Data Serving at Cloudera
Apache Kudu provides batch
analysis and real-time serving within
the same storage layer
Apache HBase yields the best
read/write performance
Cloudera Search enables SQL-like
faceted search in natural language
Apache Kafka can be used to serve
data to applications and users
Serving in Real-Time
Inject Data into Real-Time Decisions
You need options that suit your use case.
• Platform proliferation hurts IT departments as skillsets are
divided; fewer platforms with broad capabilities help.
Apache Kudu changes the game for open source
software.
• Combining real-time serving with analytic scans through a
relational database required a complex lambda architecture
until Kudu
• Together, simplification and affordability should drive more use
cases to real-time automated processes, in turn driving
increased revenue, decreased risk, and better service for
companies deploying Kudu

Apache Kudu: Filling the Analytic Gap
(Diagram axes: Pace of Analysis, Pace of Data. Axis labels: Unchanging; Fast Changing, Frequent Updates; Append-Only; Real-Time.)
• HDFS: Fast Scans, Analytics and Processing of Stored Data
• HBase: Fast On-Line Updates & Data Serving
• Arbitrary Storage (Active Archive)
• Kudu: Fast Analytics (on fast-changing or frequently-updated data)
Analytic Gap: Modern analytic applications often require complex data
flow & difficult integration work to move data between HBase & HDFS.
Kudu fills the Gap.
Real-Time Data Analysis at Work
Customer 360  “Next Best Offer 2.0”
(Diagram components: Data Sources; Kafka; Spark Streaming; Kudu; Spark MLlib; Spark; Application. Diagram labels: Individual Session Customer Interaction; Full Model/Learning.)
• Data Request Sent For Stream Processing
• Data Cleaned/Ordered/Processed, Then Delivered to Kudu for Modelling
• User’s navigation returns the results they are looking for, in addition to offers and suggestions hyper-customized for them.
(Illustrative; models will likely have >2 dimensions.)
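The flow on this slide (events are cleaned and ordered in a stream-processing step, then written to an updatable store that can also be scanned) can be sketched in plain Python. This is a simplified simulation under stated assumptions, not Kafka, Spark Streaming, or Kudu client code: `KeyedStore` is a hypothetical in-memory stand-in for an updatable table, and the event fields (`customer_id`, `ts`, `item`) are invented for the example.

```python
def clean_and_order(events):
    """Drop malformed events, then sort the rest by timestamp."""
    valid = [
        e for e in events
        if e.get("customer_id") and e.get("ts") is not None and e.get("item")
    ]
    return sorted(valid, key=lambda e: e["ts"])


class KeyedStore:
    """Hypothetical in-memory stand-in for an updatable table keyed by customer."""

    def __init__(self):
        self._rows = {}

    def upsert(self, key, row):
        # Insert the row, or overwrite the existing row with the same key.
        self._rows[key] = row

    def get(self, key):
        return self._rows.get(key)

    def scan(self):
        # Full scan over all rows, as an analytic query would do.
        return list(self._rows.values())


def ingest(store, events):
    """Apply cleaned, time-ordered events to the store as per-customer upserts."""
    for e in clean_and_order(events):
        key = e["customer_id"]
        current = store.get(key) or {"customer_id": key, "views": 0, "last_item": None}
        store.upsert(key, {
            "customer_id": key,
            "views": current["views"] + 1,
            "last_item": e["item"],
        })


# Hypothetical events, deliberately out of order and with one malformed record.
events = [
    {"customer_id": "c1", "ts": 3, "item": "boots"},
    {"customer_id": "c2", "ts": 1, "item": "tent"},
    {"customer_id": "c1", "ts": 2, "item": "jacket"},
    {"customer_id": None, "ts": 4, "item": "stove"},
]

store = KeyedStore()
ingest(store, events)

# The same store answers a point lookup and a scan-style aggregate.
most_active = max(store.scan(), key=lambda row: row["views"])
print(store.get("c1"), most_active["customer_id"])
```

The point of the sketch is only that updates and scans run against one store; a real deployment would replace each piece with the corresponding component named on the slide.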

Machine Learning
Kudu opens the door to machine learning
Kudu provides the ability
to leverage real-time
updates and analytic
scans together - critical for
many machine learning
applications.
Source: GHOSTS IN THE MACHINE: Artificial intelligence, risks and regulation in financial markets

The Time for
Real-Time Data
and Analytics
is Now.
And the platform for it
is Cloudera Enterprise.
