Building a Single
Logical Data Lake
For Advanced Analytics, Data Science, and
Business Operations
Ravi Shankar, CMO
September, 2018
2
• Competition from a low cost
vendor
• Lower the price, affecting
margins?
• Or, maintain high price, but
differentiate in other ways?
3
Benefits
Large Heavy Equipment Manufacturer
Self-service / Predictive Analytics – IoT Integration
Improved asset performance and
proactive maintenance
Increased revenue from sale of
services and parts
Reduced warranty costs of parts
failure
4
“Big Data Challenges Impede Big Data Vision”
5
IT – Business Dilemma
IT Architecture is Unmanageable & Brittle because:
IT Focuses on
Data Collection
& Storage
Business
Focuses on Data
Visualization &
Analysis
No One Focused on Data Delivery
– So teams create hundreds to thousands of brittle direct connections and
replicate large volumes of data
Inventory System
(MS SQL Server)
Product Catalog
(Web Service - SOAP)
BI / Reporting
JDBC, ODBC,
ADO .NET
Web / Mobile
WS – REST JSON,
XML, HTML, RSS
Log files
(.txt/.log files)
CRM
(MySQL)
Billing System
(Web Service - REST)
ETL
Portals
JSR168 / 286,
MS Web Parts
SOA, Middleware,
Enterprise Apps
WS – SOAP
Java API
Customer Voice
(Internet, Unstructured)
6
Big Data Fabric Architecture – Forrester Research, 2016
7
Big Data Fabric – Data Abstraction Layer
Abstracts access to disparate
data sources
Acts as a single repository
(virtual)
Makes data available in
real-time to consumers
8
1. Connect to disparate data sources
2. Combine related data into views
3. Consume in business applications
DATA CONSUMERS
Enterprise Applications, Reporting, BI, Portals, ESB, Mobile, Web, Users, IoT/Streaming Data
DISPARATE DATA SOURCES
Databases & Warehouses, Cloud/SaaS Applications, Big Data, NoSQL, Web, XML, Excel, PDF, Word...
Less Structured
More Structured
Multiple protocols,
formats
Linked data services
query, search, browse
Request/Reply,
event driven
Secure
delivery
Library of
wrappers
Web
automation
Any data
or content
Read
& Write
DATA VIRTUALIZATION
DATA CONSUMERS
Analytical Operational
CONNECT COMBINE CONSUME
Share, Deliver,
Publish, Govern,
Collaborate
Discover,
Transform,
Prepare, Improve
Quality, Integrate
Normalized
views of
disparate data
Agile Development
Performance
Resource Management
Lifecycle Management Data Services
Data Catalog
Governance & Metadata
Security & Data Privacy
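The connect, combine, consume flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how the Denodo Platform is implemented: the two sources (an in-memory SQLite table standing in for a CRM database and a function standing in for a billing web service), the names `connect_crm`, `fetch_billing`, `customer_view`, and `consume`, and all sample data are hypothetical.

```python
import sqlite3


def connect_crm():
    """Connect: a hypothetical relational source (stand-in for a CRM database)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
    )
    return conn


def fetch_billing():
    """Connect: a hypothetical web-service source returning JSON-like records."""
    return [
        {"customer_id": 1, "balance": 120.0},
        {"customer_id": 2, "balance": 0.0},
    ]


def customer_view(crm_conn, billing_fetcher):
    """Combine: join both sources at query time; nothing is copied or stored."""
    balances = {r["customer_id"]: r["balance"] for r in billing_fetcher()}
    rows = crm_conn.execute("SELECT id, name FROM customer ORDER BY id")
    for customer_id, name in rows:
        yield {"id": customer_id, "name": name, "balance": balances.get(customer_id)}


def consume():
    """Consume: an application reads the combined view as if it were one table."""
    return list(customer_view(connect_crm(), fetch_billing))
```

The view is a generator, so each call reads the sources again rather than serving a replicated copy, which is the property the slide describes as a virtual single repository.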
9
Logical Data Lake – Use Cases
Data Warehouse Offloading
IoT Integration
10
Data Virtualization in the IoT Ecosystem
Other RDBMS
(apps, CRM, SAP)
Other Sources
(SaaS, SFDC, etc.)
Ingestion Streaming
analytics
Big Data
Storage
Batch analytics,
Machine learning
Streaming data
Traditional batch
processing
(ETL to EDW)
Semantic
Model
Secure
+
Combine
+
Enrich
11
DV Layer @ Edge
DV Layer @ Data
Center/Cloud
Lightweight DV Layer
Beyond Edge
Layer 1 Layer 2 Layer 3 Layer 4
Edge Computing Using Data Virtualization
12
Six Essential Capabilities of Data Virtualization
1. Data abstraction
2. Zero replication, zero relocation
3. Real-time information
4. Self-service data services
5. Centralized metadata, security &
governance
6. Location-agnostic architecture for
multi-cloud, hybrid acceleration
13
Big Data Queries Faster with Denodo Platform
1. Data Virtualization delivers better performance without needing to replicate data into Hadoop.
2. Data Virtualization leverages Data Source Architectures for what they are good at.
Performance comparison of 5 different queries

Query     Impala Hadoop-only   Denodo        Denodo Runtime
          Runtime (s)          Runtime (s)   w/ Cache (s)
Query 1   199                  120           68
Query 2   187                  96            88
Query 3   120                  212           115
Query 4   timeout              328           69
Query 5   46                   91            56

Data Volumes
Queries 1, 2, 3, 5
• Exadata Row Count: ~5M
• Impala Row Count: ~500k
Query 4
• Exadata Row Count: ~5M
• Impala Row Count: ~2M
14
Denodo Dynamic Query Optimizer
System           Execution Time   Data Transferred   Optimization Technique
Denodo           9 sec.           4 M                Aggregation push-down
Without Denodo   125 sec.         292 M              None: full scan
SELECT c.id, SUM(s.amount) as total
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.id
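[Diagram: without push-down, Sales (290 M rows) and Customer (2 M rows) are transferred in full, then joined and grouped. With push-down, the group by runs at the Sales source, so 2 M rows from Sales and 2 M rows from Customer are transferred before the join.]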
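Aggregation push-down for the query on this slide can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not Denodo's optimizer: the in-memory lists stand in for the two remote sources, "rows transferred" is simply the number of rows that would cross the network, and the function names and sample data are hypothetical.

```python
from collections import defaultdict


def make_sources(n_customers=5, sales_per_customer=4):
    """Build hypothetical customer and sales tables as lists of dicts."""
    customers = [{"id": i} for i in range(n_customers)]
    sales = [
        {"customer_id": i, "amount": 10 * (k + 1)}
        for i in range(n_customers)
        for k in range(sales_per_customer)
    ]
    return customers, sales


def total_without_pushdown(customers, sales):
    """Ship every row from both sources, then join and group centrally."""
    transferred = len(customers) + len(sales)
    ids = {c["id"] for c in customers}
    totals = defaultdict(int)
    for s in sales:
        if s["customer_id"] in ids:  # inner join on customer id
            totals[s["customer_id"]] += s["amount"]
    return dict(totals), transferred


def total_with_pushdown(customers, sales):
    """Group sales at the source first, so one row per customer is shipped."""
    partial = defaultdict(int)
    for s in sales:  # this loop models work done inside the sales source
        partial[s["customer_id"]] += s["amount"]
    transferred = len(customers) + len(partial)
    ids = {c["id"] for c in customers}
    totals = {cid: amt for cid, amt in partial.items() if cid in ids}
    return totals, transferred
```

Both functions return the same totals; only the row count moved between systems differs. On the slide's figures the same effect shows up as 4 M rows transferred instead of 292 M.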
15
Data Virtualization with Massively Parallel Processing

System     Execution Time   Optimization Techniques
Others     ~ 10 min         Traditional Federation Tools
No MPP     43 sec           Aggregation push-down
With MPP   11 sec           Aggregation push-down + MPP integration (Impala 8 nodes)

With MPP Integration
1. Partial Aggregation Push-down
Maximizes source processing
Reduces network traffic
2. Integrated with Cost-based Optimizer
Based on data volume estimation and the cost of these particular operations,
CBO can decide to move all or part of the execution tree to the MPP
3. On-demand data transfer
For SQL-on-Hadoop systems, Denodo automatically generates
and uploads Parquet files
4. Integration with local and pre-cached data
The optimization engine detects when data is cached, or when it is a
native table in the MPP
5. Fast parallel execution
Support for Spark, Presto, and Impala for fast analytical processing using
inexpensive Hadoop-based solutions

[Diagram: Sales (300 million rows) is grouped by customer ID at the source, yielding 2M rows (sales by customer). These rows are joined with Customer (2M rows) and the result is grouped by ZIP.]
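Partial aggregation push-down, as used for the group-by-ZIP example on this slide, relies on SUM being decomposable: summing per customer first and then per ZIP gives the same result as summing the joined rows directly. The Python sketch below shows that equivalence. It is an illustration only; the function names and sample tuples are hypothetical, and the parallel MPP stage is not modeled.

```python
from collections import defaultdict


def sales_by_customer(sales):
    """Partial aggregation pushed to the sales source: group by customer ID."""
    partial = defaultdict(int)
    for customer_id, amount in sales:
        partial[customer_id] += amount
    return dict(partial)


def totals_by_zip(partial, customers):
    """Join the partial sums with customers, then finish the group by ZIP."""
    by_zip = defaultdict(int)
    for customer_id, zip_code in customers:
        if customer_id in partial:  # inner join on customer ID
            by_zip[zip_code] += partial[customer_id]
    return dict(by_zip)


def naive_totals_by_zip(sales, customers):
    """Reference result: join every sales row, then group by ZIP."""
    zip_of = dict(customers)
    by_zip = defaultdict(int)
    for customer_id, amount in sales:
        if customer_id in zip_of:
            by_zip[zip_of[customer_id]] += amount
    return dict(by_zip)
```

For example, with `sales = [(1, 10), (1, 5), (2, 7), (3, 1)]` and `customers = [(1, "94301"), (2, "94301"), (3, "10001")]`, the two-step path ships three partial rows instead of four sales rows and still produces the same per-ZIP totals as the naive path.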
16
Customer-reported projected savings by percentage
ROI and TCO of Data Virtualization
Data Integration Cost reduction
▪ 60-80% savings
Traditional Call Centres, Portals
▪ 30-70% savings
BI and Reporting
▪ 40-60% savings
ETL and Data Warehousing
▪ Project timelines of 6-12 months reduced to 3-6 months
▪ Up to 85% reduction in time
• New sources can be configured in
minutes, and fully integrated within
days.
• 100’s of application entities can be
integrated within weeks.
• New business functionality can be
added within days.
• Existing functionality can be
enhanced with new data within days.
• Data proliferation can be significantly
reduced.
• Common, consistent and timely
access to all data via preferred
visualization tools.
17
Customer Centricity / MDM
✓ Complete View of Customer
Data Services
✓ Data as a Service
✓ Data Marketplace
✓ Data Services
✓ Application and Data Migration
Cloud Solutions
Data Governance
✓ GRC
✓ GDPR
✓ Data Privacy / Masking
BI and Analytics
✓ Self-Service Analytics
✓ Logical Data Warehouse
✓ Enterprise Data Fabric
Big Data
✓ Logical Data Lake
✓ Data Warehouse Offloading
✓ IoT Analytics
✓ Cloud Modernization
✓ Cloud Analytics
✓ Hybrid Data Fabric
Denodo ‘Horizontal Solution’ Categories
18
Autodesk – Integration of Streaming Data
2017 CIO 100 Award
FINALIST in 2017 Excellence Awards
19
Asurion: Data Virtualization On-Premises and Cloud
2017 Best Practices
Award
20
Denodo
The Leader in Data Virtualization
DENODO OFFICES, CUSTOMERS, PARTNERS
Palo Alto, CA.
Global presence throughout North America,
EMEA, APAC, and Latin America.
LEADERSHIP
▪ Longest continuous focus on data
virtualization – since 1999
▪ Leader in 2018 Forrester Wave – Big
Data Fabric
▪ Winner of numerous awards
CUSTOMERS
~500 customers, including many F500 and
G2000 companies across every major industry
have gained significant business agility and ROI.
FINANCIALS
Backed by $4B+ private equity firm.
50+% annual growth; Profitable.
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm,
without the prior written authorization from Denodo Technologies.
