1© Cloudera, Inc. All rights reserved.
Apache Kudu Webinar Series
Understanding and Unlocking
the Value of Real-Time Data
Ryan Lippert | Cloudera
Michele Goetz | Forrester (Special Guest)
2© Cloudera, Inc. All rights reserved.
Kudu Webinar Series
Part 1: Lambda Architectures – Simplified by Apache Kudu
A look into the potential trouble involved with a lambda architecture, and how Apache Kudu can
dramatically simplify real-time analytics.
Part 2: Extending the Capabilities of Operational and Analytical Databases
An examination of how Apache Kudu expands the set of use cases that Cloudera’s Operational and
Analytical databases can handle.
Part 3: Data-in-Motion: Unlock the Value of Real-Time Data
Forrester will discuss their research into real-time data pipelines and analytics, and Cloudera will
discuss how to make it a reality.
Part 4: Technical Deep-Dive into Apache Kudu
An in-depth examination of the technical architecture and design of Apache Kudu, straight from a PMC
Member.
3© Cloudera, Inc. All rights reserved.
Updateable Analytic Storage
Simple real-time analytics and updates with Apache Kudu
Kudu: Storage for fast analytics on fast data
• Simplified architecture for building real-time analytic
applications
• Designed for next-generation hardware for faster analytic
performance across frameworks
• Native Hadoop storage engine
Flexibility for the right tools for the right use
case in one platform
• Only analytic database for big data with Kudu + Impala
• Simple real-time applications with Kudu + Spark
Use cases
• Time series data
• Machine data analytics
• Online reporting
STRUCTURED
Sqoop
UNSTRUCTURED
Kafka, Flume
PROCESS, ANALYZE, SERVE
UNIFIED SERVICES
RESOURCE MANAGEMENT
YARN
SECURITY
Sentry, RecordService
STORE
INTEGRATE
BATCH
Spark, Hive, Pig
MapReduce
STREAM
Spark
SQL
Impala
SEARCH
Solr
OTHER
Kite
NoSQL
HBase
OTHER
Object Store
FILESYSTEM
HDFS
RELATIONAL
Kudu
4© Cloudera, Inc. All rights reserved.
Ingest data of any
type or volume
Process data as it
arrives
Serve data to users
and applications
Real-Time Data
5© Cloudera, Inc. All rights reserved.
Agenda
Drivers for agile, real-time data platforms
What are the key use cases driving businesses towards real-time
platforms?
Data on adoption trends for real-time technologies
What is Forrester seeing in the market for real-time technologies?
Deploying a real-time OSS architecture to grow your business
How can you build a scalable, cost-effective platform to grow your
business?
© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Michele Goetz
Special Guest Speaker
Principal Analyst Serving Enterprise Architecture Professionals
7© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Agenda
Drivers for agile, real-time data platforms
What are the key use cases driving businesses towards real-time
platforms?
Data on adoption trends for real-time technologies
What is Forrester seeing in the market for real-time technologies?
Deploying a real-time OSS architecture to grow your business
How can you build a scalable, cost-effective platform to grow your
business?
8© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Superior CX depends on data and insights
9© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Fraud and risk management requires real-time data
10© 2017 FORRESTER. REPRODUCTION PROHIBITED.
IoT heat map shows where data matters most, now
11© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Data bottlenecks are catalysts for transition
12© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Create a road map for a real-time, agile data platform
13© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Agenda
Drivers for agile, real-time data platforms
What are the key use cases driving businesses towards real-time
platforms?
Data on adoption trends for real-time technologies
What is Forrester seeing in the market for real-time technologies?
Deploying a real-time OSS architecture to grow your business
How can you build a scalable, cost-effective platform to grow your
business?
14© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Leaders are focused on the technologies that allow data and
insights to be consumed across the organization
What are your firm's plans for the following data driven initiatives?
Base: 3005 global data and analytics decision-makers.
Source: Business Technographics® Global Data & Analytics Survey, 2016
Initiative: Expanding/Implemented | Planning to implement within the next 12 months
• Creating an organizational center of excellence for business intelligence: 51% | 22%
• Combine content management and data management programs into a unified information management program: 51% | 22%
• Changing our processes to promote data stewardship and sharing: 51% | 22%
• Investing in platforms to and share out data content: 51% | 22%
• Creating a business led data stewardship or governance program: 51% | 22%
• Changing management incentives to promote data sharing: 49% | 24%
• Implementing analytics insights in software systems to aid customers or support employee decisions: 52% | 22%
• Investing more in business friendly, self-service visualization and analytics: 52% | 23%
• Engaging external services providers or strategic business consultants for data and analytics or insights services: 54% | 22%
• Providing data preparation tools for self-service data management: 54% | 23%
• Investing in distributed real time insight delivery technology: 58% | 22%
15© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Base: 325 global data and analytics technology decision-makers. “Don’t know” not shown.
Source: Business Technographics® Global Data & Analytics Survey, 2016
Which of the following describes your [TDM=”IT budget data and analytics technology or
services”; BDM=”business budget
for data and analytics technology or services”] from 2015 to 2016?
• Decrease by 5% to 10%: 4%
• Don’t know: 5%
• Decrease by 1-4%: 6%
• Increase by more than 10%: 6%
• Increase by 5% to 10%: 22%
• Increase by 1-4%: 26%
• Stay about the same: 30%
54% of data and analytics technology decision-makers increased
their budgets for data and analytics from 2015 to 2016
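The 54% headline can be checked against the chart values. The following is a minimal Python sketch, assuming the seven percentages pair up with the seven response labels in the order they appear on the slide; the dictionary name is illustrative only.

```python
# Share of respondents per answer, in percent, as read off the chart.
# Assumption: values and labels pair up in the order shown on the slide.
budget_change = {
    "Decrease by 5% to 10%": 4,
    "Don't know": 5,
    "Decrease by 1-4%": 6,
    "Increase by more than 10%": 6,
    "Increase by 5% to 10%": 22,
    "Increase by 1-4%": 26,
    "Stay about the same": 30,
}

# Add up every "Increase ..." bucket to reproduce the headline figure.
increased = sum(
    share for label, share in budget_change.items() if label.startswith("Increase")
)
total = sum(budget_change.values())

print(increased)  # 54
print(total)      # 99 (the published shares are rounded)
```

The three "Increase" buckets (6 + 22 + 26) account for the 54% quoted on the slide.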
16© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Companies of all sizes are spending millions for data & analytics
Note: Don’t know excluded. Base: 765*, 1,288 global data and analytics decision makers
Source: Business Technographics® Global Data & Analytics Survey, 2016
Please estimate, in millions, how much your data and analytics budget is for 2016? (Note:
Number is in US Dollars)
Budget range: SMB (20-999 employees)* | Enterprise (1,000 or more employees)
• Less than $1 million: 55% | 32%
• $1 million to under $10 million: 22% | 30%
• $10 million to under $100 million: 9% | 13%
• $100 million to under $500 million: 1% | 4%
• $500 million to under $1 billion: 1% | 2%
• $1 billion to under $5 billion: 0% | 2%
• $5 billion or more: 0% | 1%
17© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Among the DM technologies Forrester tracks, interest for stream
processing tools has grown the most YoY
What are your firm's plans to use the following data management technologies?
Base: 2094 and *1805 global data and analytics technology decision-makers.
Source: Business Technographics® Global Data & Analytics Survey, 2016
[Chart: for each technology, % with commitment (expanding, implemented, or planning to implement in the next 12 months) and % with interest, but no immediate plans]
Year-over-year change:
• Stream processing tools: +5 p.p.
• Inverted index database: +3 p.p.
• Distributed NoSQL databases: -2 p.p.
• Hadoop: -1 p.p.
• Associative index databases: -2 p.p.
• RDF, triple store: -3 p.p.
18© 2017 FORRESTER. REPRODUCTION PROHIBITED.
Base: Total: 2094
Source: Business Technographics® Global Data & Analytics Survey, 2016
Which of the following are included in your plans for big data?
• NoSQL other than Hadoop: 16%
• A MPP (massively parallel processing) data warehouse: 18%
• Semantic technologies (ontology building, search, auto curation, graph, etc.): 22%
• Hadoop (including HBase or Accumulo): 23%
• Data anonymization or de-identification: 23%
• Creating or building out a data lake: 26%
• Marketing or digital data management platforms and service providers that brand their offerings as big data: 26%
• Packaged analytics technologies that brand themselves as big data: 27%
• Unstructured data mining / analytics: 28%
• Distributed in memory databases, grids, analytics tools: 30%
• Streaming analytics / computing: 33%
• Large scale predictive modeling, data mining or other advanced analytics: 36%
• Public cloud big data services: 40%
Streaming analytics high in the list of big data plans
19© Cloudera, Inc. All rights reserved.
Agenda
Drivers for agile, real-time data platforms
What are the key use cases driving businesses towards real-time
platforms?
Data on adoption trends for real-time technologies
What is Forrester seeing in the market for real-time technologies?
Deploying a real-time OSS architecture to grow your
business
How can you build a scalable, cost-effective platform to grow your
business?
20© Cloudera, Inc. All rights reserved.
Trend Towards Real-Time Data Platforms is Clear
Drivers for Real-Time Platforms
• Enhancing customer experiences
• Risk Management
• Advancement of IoT and broader instrumentation
Adoption is Accelerating
• Top data-driven initiative by investment: distributed delivery of
real-time data
• DM technology with highest momentum: stream processing
• Top big data plans: streaming analytics is top 3
• Broad, large investments: 90% of decision makers are either
continuing or increasing their investments in data and analytics;
millions/billions being spent
21© Cloudera, Inc. All rights reserved.
The Underlying Driver
What drives a use case to real-time?
High Frequency Trading
APT Detection
Fraud Detection
Predictive Maintenance
Next Best Offer
Inventory Management
Shipping/Logistic Systems
CRM Systems
Employee Management
Strategic Planning
Real-time data management use cases are
defined by a common set of characteristics.
• Narrow time window in which to make a decision
(automated or manual)
• Opportunity for the data points to change the
decision path
• Decreasing value of data over time
Not all use cases have a pressing need for
real-time data.
• Broader strategic decisions, for example, do not
require real-time data input
• Over time, decreases in HW costs and increases in
availability of real-time systems will lead most use
cases to be conducted in real-time
Real Time
Some Latency
Acceptable
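One way to make "decreasing value of data over time" concrete is to model it as decay. The sketch below is an illustration only: the exponential half-life model, the function names, and the example half-lives are assumptions for the sake of the example, not figures from the presentation.

```python
def remaining_value(age_seconds: float, half_life_seconds: float,
                    initial_value: float = 1.0) -> float:
    """Value left in a data point after it has aged.

    Assumes (hypothetically) that value halves every `half_life_seconds`.
    """
    return initial_value * 0.5 ** (age_seconds / half_life_seconds)


def worth_acting_on(age_seconds: float, half_life_seconds: float,
                    threshold: float = 0.5) -> bool:
    """True while the data point still retains at least `threshold` of its value."""
    return remaining_value(age_seconds, half_life_seconds) >= threshold


# Hypothetical half-lives: a fraud signal goes stale in about a second,
# while a strategic-planning input stays useful for months.
FRAUD_HALF_LIFE = 1.0
PLANNING_HALF_LIFE = 90 * 24 * 3600.0

print(worth_acting_on(5.0, FRAUD_HALF_LIFE))        # False: 5 s old is too late
print(worth_acting_on(24 * 3600.0, PLANNING_HALF_LIFE))  # True: a day of latency is fine
```

Under this toy model, use cases with a short half-life sit at the "Real Time" end of the spectrum, and those with a long half-life sit at the "Some Latency Acceptable" end.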
22© Cloudera, Inc. All rights reserved.
Moving to Real-Time and Leveraging Analytics
What do we have to gain?
“Monitoring System”
Sensors are automatically
monitored and
programmed to deliver
warnings when readings
are delivered outside of
an “optimal zone”.
Basic models developed
over small subsets of
data.
“Predictive System”
Ingestion and processing
of all sensor data into an
unlimited data store with
analytic capabilities
enables machine
learning, which can
provide automated
optimization and
predictive maintenance.
“Only 1 percent of data from an oil rig with 30,000 sensors
is examined. The data that are used today are mostly for
anomaly detection and control, not optimization and
prediction, which provide the greatest value.”
- McKinsey & Company
Traditional Architectures Real-Time Analytic Capabilities
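The difference between the two systems can be sketched in a few lines of Python. This is a simplified illustration, not a production approach: the "optimal zone" bounds, the function names, and the choice of a straight-line trend as the "model" are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical bounds for the "optimal zone" of a sensor reading.
OPTIMAL_ZONE: Tuple[float, float] = (20.0, 80.0)


def monitoring_alerts(readings: List[float],
                      zone: Tuple[float, float] = OPTIMAL_ZONE) -> List[int]:
    """Monitoring system: return indices of readings outside the zone."""
    low, high = zone
    return [i for i, r in enumerate(readings) if r < low or r > high]


def steps_until_breach(readings: List[float],
                       zone: Tuple[float, float] = OPTIMAL_ZONE) -> Optional[float]:
    """Predictive system (toy version): fit a least-squares line through the
    readings and estimate how many steps after the last reading the upper
    bound would be crossed. Returns None if the trend is flat or falling."""
    n = len(readings)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(readings))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_x = (zone[1] - intercept) / slope
    return max(0.0, crossing_x - (n - 1))


readings = [50.0, 55.0, 60.0, 65.0, 70.0]
print(monitoring_alerts(readings))   # [] (nothing is out of range yet)
print(steps_until_breach(readings))  # 2.0 (the trend crosses 80 in two steps)
```

The monitoring function stays silent until a reading is already out of range, while the trend-based function raises a concern two steps earlier. A real predictive system would train richer models over the full sensor history.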
23© Cloudera, Inc. All rights reserved.
Ingest data of any
type or volume
Process data as it
arrives
Serve data to users
and applications
Real-Time Data
24© Cloudera, Inc. All rights reserved.
Ingestion at Cloudera
• Apache Sqoop for data from
relational databases
• Apache Flume for logs, event
based data
• Apache Kafka is fast,
scalable, and fault-tolerant
messaging
Partners, such as StreamSets,
provide rich visualization tools
Ingestion in Real-Time
Stream Ingestion is a Must for Many Use Cases
Ingestion isn’t just about internal business data anymore.
• Traditional ingestion was internally focused, and often a matter of
moving data from one silo or system to another
• Today, businesses aim to take in data from a variety of external
sources, IoT sensors, and machine-generated (user/network)
data
Your data journey can’t start until the data arrives.
• Each step of the ingest/process/serve data pipeline must occur
at real-time speed if decisions are to be made in time to affect
the course of business
Visualization helps practitioners understand their data.
• Complex tasks can be made less complex via graphical
representations; data ingestion is no different
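The ingest/process/serve pipeline described on this slide can be sketched with plain Python generators, where each event flows through all three stages as soon as it arrives instead of waiting for a batch. This is a toy, single-process illustration; the function names and the "sensor_id,reading" line format are assumptions, and a real deployment would use components such as Flume, Kafka, Spark Streaming, and Kudu.

```python
from typing import Dict, Iterable, Iterator


def ingest(raw_lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Ingest: parse 'sensor_id,reading' lines one at a time, as they arrive."""
    for line in raw_lines:
        sensor_id, _, reading = line.strip().partition(",")
        yield {"sensor_id": sensor_id, "reading": reading}


def process(events: Iterable[Dict[str, str]]) -> Iterator[Dict[str, object]]:
    """Process: drop malformed events and convert readings to floats."""
    for event in events:
        try:
            value = float(event["reading"])
        except ValueError:
            continue  # skip readings that are not numeric
        yield {"sensor_id": event["sensor_id"], "reading": value}


def serve(events: Iterable[Dict[str, object]], store: Dict[str, object]) -> None:
    """Serve: keep the latest reading per sensor available for lookups."""
    for event in events:
        store[event["sensor_id"]] = event["reading"]


latest: Dict[str, object] = {}
serve(process(ingest(["a,1.5", "b,oops", "a,2.5", "c,7"])), latest)
print(latest)  # {'a': 2.5, 'c': 7.0}
```

Because the stages are chained lazily, no stage waits for the previous one to finish the whole input, which is the property the slide asks of every step in the pipeline.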
25© Cloudera, Inc. All rights reserved.
Stream Processing at Cloudera
Spark Streaming, the leading
open-source framework for
real-time use cases, is deployed in
Cloudera’s real-time
architectures.
Cloudera has the broadest base
of Hadoop-adjacent experience
with Spark and integrating it
with Apache components.
Processing in Real-Time
Unlocking Value at Speed
For some use cases, batch just isn’t enough.
• Batch processing can lead to bottlenecks and delays in data
transformations that cause missed opportunities.
Apache Spark is gaining momentum for a reason.
• Leveraging Apache Spark for stream processing enables
real-time use cases with sub-second latency and best-in-class APIs.
Spark has a best-in-class ecosystem.
• Machine learning (via MLlib) is seamlessly integrated into Spark.
• Broadest set of vendors and contributors working on Spark
among available processing engines, leading to rapid innovation.
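Spark Streaming's DStream model handles a stream as a sequence of small batches cut at a fixed batch interval. The pure-Python sketch below illustrates only that grouping idea; it does not use the Spark API, and the function name and event format are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def micro_batches(events: Iterable[Tuple[float, str]],
                  interval: float) -> Dict[int, List[str]]:
    """Assign (timestamp_seconds, payload) events to fixed-length batches.

    Batch k holds every event with k * interval <= timestamp < (k + 1) * interval.
    """
    batches: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        batches[int(timestamp // interval)].append(payload)
    return dict(batches)


events = [(0.1, "a"), (0.4, "b"), (0.6, "c"), (1.2, "d")]
print(micro_batches(events, interval=0.5))
# {0: ['a', 'b'], 1: ['c'], 2: ['d']}
```

A shorter interval lowers latency at the cost of more, smaller batches to schedule, which is the trade-off behind sub-second stream processing.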
26© Cloudera, Inc. All rights reserved.
Data Serving at Cloudera
Apache Kudu provides batch
analysis and real-time serving within
the same storage layer
Apache HBase yields the best
read/write performance
Cloudera Search enables SQL-like
faceted search in natural language
Apache Kafka can be used to serve
data to applications and users
Serving in Real-Time
Inject Data into Real-Time Decisions
You need options that suit your use case.
• Platform proliferation hurts IT departments as skillsets are
divided; fewer platforms with broad capabilities help.
Apache Kudu changes the game for open source
software.
• Until Kudu, combining real-time serving with analytic scans in a
relational database required a complex lambda
architecture
• Together, simplification and affordability should drive more use
cases to real-time automated processes, in turn driving
increased revenue, decreased risk, and better service for
companies deploying Kudu
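The idea of serving row-level updates and analytic scans from one storage layer can be illustrated with a toy in-memory table. This is a conceptual sketch only: `ToyTable` and its methods are hypothetical and are not the Kudu client API, and a Python dictionary says nothing about how Kudu achieves this at scale.

```python
from typing import Dict, Optional


class ToyTable:
    """In-memory stand-in for a table keyed by primary key (hypothetical)."""

    def __init__(self) -> None:
        self._rows: Dict[str, Dict[str, float]] = {}

    def upsert(self, key: str, **columns: float) -> None:
        """Insert the row, or update its columns in place if the key exists."""
        self._rows.setdefault(key, {}).update(columns)

    def get(self, key: str) -> Optional[Dict[str, float]]:
        """Real-time serving path: look up a single row by key."""
        return self._rows.get(key)

    def scan_avg(self, column: str) -> Optional[float]:
        """Analytic path: scan every row and average one column."""
        values = [row[column] for row in self._rows.values() if column in row]
        return sum(values) / len(values) if values else None


table = ToyTable()
table.upsert("sensor-1", temp=10.0)
table.upsert("sensor-2", temp=20.0)
table.upsert("sensor-1", temp=30.0)   # update in place; no second copy of the row

print(table.get("sensor-1"))   # {'temp': 30.0}
print(table.scan_avg("temp"))  # 25.0
```

The scan sees the updated value immediately because both paths read the same store. In a lambda architecture, the same result typically requires reconciling a batch layer with a separate speed layer.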
27© Cloudera, Inc. All rights reserved.
HDFS
Fast Scans, Analytics
and Processing of
Stored Data
Fast On-Line
Updates &
Data Serving
Arbitrary Storage
(Active Archive)
Fast Analytics
(on fast-changing or
frequently-updated data)
Apache Kudu: Filling the Analytic Gap
Unchanging
Fast Changing
Frequent Updates
HBase
Append-Only
Real-Time
Kudu
Kudu fills the Gap
Modern analytic
applications often
require complex data
flow & difficult
integration work to
move data between
HBase & HDFS
Analytic
Gap
Pace of Analysis
Pace of Data
28© Cloudera, Inc. All rights reserved.
Real-Time Data Analysis at Work
Customer 360  “Next Best Offer 2.0”
Kafka
Spark
Streaming
Kudu
Spark MLlib
Application
Data
Sources
Individual Session
Customer
Interaction
Spark
Full Model/Learning
Data Request Sent For Stream Processing
Data Cleaned/Ordered/Processed, Then
Delivered to Kudu for Modelling
User’s navigation returns the results they
are looking for, in addition to offers and
suggestions hyper-customized for them.
Illustrative,
models will
likely have
>2
dimensions
29© Cloudera, Inc. All rights reserved.
Machine Learning
Kudu opens the door to machine learning
Kudu provides the ability
to leverage real-time
updates and analytic
scans together - critical for
many machine learning
applications.
Source: GHOSTS IN THE MACHINE: Artificial intelligence, risks and regulation in financial markets
30© Cloudera, Inc. All rights reserved.
The Time for
Real-Time Data
and Analytics
is Now.
And the platform for it
is Cloudera Enterprise.
31© Cloudera, Inc. All rights reserved.

Editor's Notes

  1. Ingest: Collecting the Data. Today’s data-in-motion conversation, like the data journey itself, starts with ingestion. The increase in sensor-generated data associated with IoT, combined with the demand for social media data collection, has created a deluge of unstructured data that organizations find difficult to handle. Because ingestion is a common first bottleneck in the data-in-motion journey, organizations often reach for a robust ingestion solution. However, it’s important to understand ingestion as part of a broader real-time data context; it’s a critical component, but only the first of three. Cloudera takes an open-source approach to ingestion, as it does with all three stages of the data-in-motion journey. Identifying the need for a streaming data capture system, Cloudera led the development of Apache Flume, the open standard for collecting and moving large amounts of log data. The subsequent integration of Flume with Apache Kafka created an ingest architecture that has been replicated across Cloudera’s customer base in a variety of use cases. With Flume and Kafka, Cloudera deploys the leading streaming ingest platform. Flume provides lightweight agents that can be deployed on hundreds or thousands of edge nodes and arranged in tiers to enable efficient ingest topologies. The integration between Kafka and Flume is bidirectional, meaning either component can be a producer or consumer of data depending on the specifics of your use case. A rising trend in data ingestion is the use of a rich visual interface that lets users interact with their ingestion architecture easily. While Cloudera delivers all the functionality underneath, we work with best-in-class partners such as StreamSets, Cask, and others to deliver rich visualization. This lets Cloudera focus on its core competency of data management, while vendors with large engineering teams dedicated to visualization focus on theirs. Portability, neutrality, and the history of success of companies like Informatica, Talend, and others in similar spaces create the best experience for our customers.
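The tiered topology described in note 1 can be pictured with a small toy model. This is a minimal sketch in plain Python, not Flume or Kafka code: the `Agent` and `Topic` classes, the `receive` and `flush` methods, and the batch sizes are all hypothetical names chosen for illustration. Edge agents batch events and forward them to a collector agent, which in turn forwards larger batches to a log that stands in for a Kafka topic.

```python
class Topic:
    """Hypothetical stand-in for a Kafka topic: an append-only log."""

    def __init__(self):
        self.log = []

    def receive(self, batch):
        self.log.extend(batch)


class Agent:
    """Hypothetical stand-in for a lightweight collection agent.

    An agent buffers events and forwards them to its sink in batches.
    The sink can be another Agent (a higher tier) or a Topic.
    """

    def __init__(self, name, sink, batch_size):
        self.name = name
        self.sink = sink
        self.batch_size = batch_size
        self.pending = []

    def receive(self, batch):
        self.pending.extend(batch)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # Hand the buffered events to the next tier and clear the buffer.
        if self.pending:
            batch, self.pending = self.pending, []
            self.sink.receive(batch)


def run_tiered_ingest(events, n_edges=3):
    """Route events round-robin through edge agents -> collector -> topic."""
    topic = Topic()
    collector = Agent("collector", topic, batch_size=4)
    edges = [Agent(f"edge-{i}", collector, batch_size=2) for i in range(n_edges)]

    for i, event in enumerate(events):
        edges[i % n_edges].receive([event])

    # Drain whatever is still buffered, lowest tier first.
    for edge in edges:
        edge.flush()
    collector.flush()
    return topic.log


if __name__ == "__main__":
    print(run_tiered_ingest([f"e{i}" for i in range(7)]))
```

Because each tier forwards only full batches, the collector sees far fewer, larger writes than the number of raw events, which is the point of tiering. Arrival order at the topic is not the original event order, so a real pipeline that needs ordering would have to handle that separately.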
  2. Cloudera relies on Spark Streaming to process data once it is ingested. As the leading open-source processing framework for real-time use cases, Spark Streaming is an open standard and one of the most easily recognizable components of the broader Apache Hadoop™ ecosystem. Cloudera has the broadest base of Hadoop-adjacent experience with Apache Spark™ and Spark Streaming; this is a product of early adoption and integration of these projects into Cloudera Enterprise. Spark Streaming provides the strongest processing solution for data-in-motion use cases as a result of:
     • Best-in-class performance:
       - High throughput ensures that jobs will not bottleneck at the processing stage
       - Sub-second latency enables real-time capabilities
     • Best-in-class API and features:
       - Easy-to-use, SQL-based APIs for authoring streaming jobs help expand the number of use cases and the value of data in motion
       - “Exactly once” stream processing semantics help ensure accuracy
       - Sliding window computations enable fast insights into time-period data slices
       - Built-in APIs for maintaining and updating in-memory information
     • Best-in-class ecosystem:
       - Largest set of vendors working with and around Spark among available processing engines, enabling access to the latest innovations
       - Broadest and deepest machine learning library (MLlib) is seamlessly integrated
     Spark Streaming from Cloudera, in particular, benefits users through the most robust integration into the ingestion and serving phases that bookend the data-in-motion story. This integration ensures fast, easy, and secure delivery of processed data to the serving stage of data in motion.
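To make the sliding window idea in note 2 concrete, here is a minimal sketch in plain Python rather than Spark Streaming itself. It assumes events arrive as `(timestamp_seconds, key)` pairs already sorted by timestamp; the function name `sliding_window_counts` and its parameters are hypothetical and are not part of any Spark API. Each emitted window covers the interval `(window_end - window_seconds, window_end]`, and a new window is emitted every `slide_seconds`.

```python
import math
from collections import Counter, deque


def sliding_window_counts(events, window_seconds, slide_seconds):
    """Yield (window_end, counts_by_key) once per slide interval.

    events: iterable of (timestamp_seconds, key), sorted by timestamp.
    """
    buffer = deque()    # events that may still fall inside a window
    counts = Counter()  # per-key counts for the buffered events
    next_emit = None    # end time of the next window to emit

    def emit(window_end):
        # Evict events that are too old for this window, then report.
        while buffer and buffer[0][0] <= window_end - window_seconds:
            _, old_key = buffer.popleft()
            counts[old_key] -= 1
            if counts[old_key] == 0:
                del counts[old_key]
        return window_end, dict(counts)

    for ts, key in events:
        if next_emit is None:
            # First window ends at the first slide boundary at or after ts.
            next_emit = math.ceil(ts / slide_seconds) * slide_seconds
        # Emit every window that closed before this event arrived.
        while ts > next_emit:
            yield emit(next_emit)
            next_emit += slide_seconds
        buffer.append((ts, key))
        counts[key] += 1

    if next_emit is not None:
        yield emit(next_emit)


if __name__ == "__main__":
    sample = [(1, "a"), (2, "b"), (6, "a"), (11, "a")]
    for window_end, window_counts in sliding_window_counts(sample, 10, 5):
        print(window_end, window_counts)
```

With a 10-second window sliding every 5 seconds, each event contributes to two consecutive windows, which is what lets a dashboard show a smoothed "last 10 seconds" figure that refreshes every 5 seconds.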
  3. Whereas ingestion and processing have a relatively consistent flow irrespective of use case, the serving phase of a data-in-motion solution requires a variety of options in order to deliver the right data, to the right place, at the right time. Without this ability to quickly serve data to decision points, a solution loses its real-time capability and ceases to be a data-in-motion solution. Cloudera has a variety of options that help serve the diverse needs of individual use cases:
     • Apache Kudu™: A new, Cloudera-initiated Apache project, Kudu offers the unique ability to do fast scans on fast data. With an overwhelming number of data-in-motion use cases requiring analysis or visualization of streaming data, Kudu can enable the required batch analysis and real-time serving within the same storage layer.
     • Apache HBase™: HBase offers the best random read/write performance of any component within the Hadoop ecosystem. This capability, combined with high levels of concurrent access, enables online applications and operational needs that require the ability to query the latest data.
     • Cloudera Search: Powered by Apache Solr™, Cloudera Search democratizes data by enabling non-technical users to perform SQL-like, faceted search in natural language. Solr’s native integration into Cloudera Enterprise generates faster and more secure results.
     • Apache Kafka: Kafka’s fast, scalable, and durable design enables hundreds of megabytes of reads and writes per second, from thousands of clients. In addition to playing a role in ingestion, Kafka can be used to serve data to applications and users.
     This “last mile” step in the data-in-motion story is arguably the most critical one, which is why this breadth of options is necessary. Each use case, including the tendencies and workflows of the expected users, requires a different set of data access capabilities. Cloudera can meet any requirement through these tools, and can do so as the final step in an end-to-end data-in-motion story.
  4. Kudu allows you to have your cake and eat it too