Building a Single Logical Data Lake: For Advanced Analytics, Data Science, and Business Operations

Ravi Shankar, CMO
September, 2018

2
• Competition from a low-cost vendor
• Lower the price, affecting margins?
• Or maintain a high price, but differentiate in other ways?

3
Large Heavy Equipment Manufacturer
Self-service / Predictive Analytics – IoT Integration
Benefits
• Improved asset performance and proactive maintenance
• Increased revenue from sale of services and parts
• Reduced warranty costs of parts failure

4
“Big Data Challenges Impede Big Data Vision”

5
IT – Business Dilemma
IT architecture is unmanageable and brittle because:
• IT focuses on data collection and storage
• Business focuses on data visualization and analysis
• No one is focused on data delivery, so hundreds to thousands of brittle direct connections are created and large volumes of data are replicated
Systems and interfaces shown in the diagram:
• Inventory System (MS SQL Server)
• Product Catalog (Web Service - SOAP)
• Log files (.txt/.log files)
• CRM (MySQL)
• Billing System (Web Service - REST)
• Customer Voice (Internet, unstructured)
• ETL
• BI / Reporting: JDBC, ODBC, ADO.NET
• Web / Mobile: WS – REST JSON, XML, HTML, RSS
• Portals: JSR 168 / 286, MS Web Parts
• SOA, Middleware, Enterprise Apps: WS – SOAP, Java API

6
Big Data Fabric Architecture – Forrester Research, 2016

7
Big Data Fabric – Data Abstraction Layer
• Abstracts access to disparate data sources
• Acts as a single (virtual) repository
• Makes data available in real time to consumers

8
DATA VIRTUALIZATION: CONNECT, COMBINE, CONSUME
1. Connect to disparate data sources
2. Combine related data into views
3. Consume in business applications
DISPARATE DATA SOURCES (more structured to less structured)
Databases & Warehouses, Cloud/SaaS Applications, Big Data, NoSQL, Web, XML, Excel, PDF, Word...
DATA CONSUMERS (analytical and operational)
Enterprise Applications, Reporting, BI, Portals, ESB, Mobile, Web, Users, IoT/Streaming Data
CONNECT: Normalized views of disparate data
• Library of wrappers
• Web automation
• Any data or content
• Read & Write
COMBINE: Discover, Transform, Prepare, Improve Quality, Integrate
CONSUME: Share, Deliver, Publish, Govern, Collaborate
• Multiple protocols, formats
• Linked data services: query, search, browse
• Request/Reply, event driven
• Secure delivery
Supporting capabilities: Agile Development, Performance, Resource Management, Lifecycle Management, Data Services, Data Catalog, Governance & Metadata, Security & Data Privacy
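The connect, combine, consume flow on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular data virtualization product is implemented: the two sources (`connect_inventory`, `fetch_catalog`), their contents, and the `product_availability` view are all hypothetical stand-ins. The point is that the combined view is computed at query time from the sources in place, with no copy kept.

```python
import sqlite3


def connect_inventory():
    """CONNECT: hypothetical relational source (stand-in for an inventory DB)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE stock (sku TEXT PRIMARY KEY, on_hand INTEGER)")
    db.executemany(
        "INSERT INTO stock VALUES (?, ?)",
        [("A-1", 12), ("B-2", 0), ("C-3", 7)],
    )
    return db


def fetch_catalog():
    """CONNECT: hypothetical web-service source returning JSON-like records."""
    return [
        {"sku": "A-1", "name": "Hydraulic pump"},
        {"sku": "B-2", "name": "Track roller"},
        {"sku": "C-3", "name": "Fuel filter"},
    ]


def product_availability(inventory_db):
    """COMBINE: a virtual view, evaluated on demand from both sources."""
    names = {row["sku"]: row["name"] for row in fetch_catalog()}
    rows = inventory_db.execute("SELECT sku, on_hand FROM stock ORDER BY sku")
    return [
        {"sku": sku, "name": names.get(sku), "in_stock": on_hand > 0}
        for sku, on_hand in rows
    ]


# CONSUME: a report or application reads the view like any other dataset.
inventory = connect_inventory()
view = product_availability(inventory)
for record in view:
    print(record)
```

Because the view is a function of the live sources, a change in either source shows up on the next read without a reload step.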

9
Logical Data Lake – Use Cases
• Data Warehouse Offloading
• IoT Integration

10
Data Virtualization in the IoT Ecosystem
Components shown in the diagram:
• Streaming data
• Ingestion
• Streaming analytics
• Big Data Storage
• Batch analytics, Machine learning
• Traditional batch processing (ETL to EDW)
• Other RDBMS (apps, CRM, SAP)
• Other Sources (SaaS, SFDC, etc.)
• Semantic Model: Secure + Combine + Enrich

11
Edge Computing Using Data Virtualization
Layers shown in the diagram: Layer 1, Layer 2, Layer 3, Layer 4
• Lightweight DV Layer Beyond Edge
• DV Layer @ Edge
• DV Layer @ Data Center/Cloud

12
Six Essential Capabilities of Data Virtualization
1. Data abstraction
2. Zero replication, zero relocation
3. Real-time information
4. Self-service data services
5. Centralized metadata, security & governance
6. Location-agnostic architecture for multi-cloud, hybrid acceleration

13
Big Data Queries Faster with Denodo Platform
1. Data Virtualization delivers better performance without needing to replicate data into Hadoop.
2. Data Virtualization leverages Data Source Architectures for what they are good at.
Performance comparison of 5 different queries

Query     Impala Hadoop-only Runtime (s)   Denodo Runtime (s)   Denodo Runtime w/ Cache (s)
Query 1   199                              120                  68
Query 2   187                              96                   88
Query 3   120                              212                  115
Query 4   timeout                          328                  69
Query 5   46                               91                   56

Data Volumes
Queries 1, 2, 3, 5
• Exadata Row Count: ~5M
• Impala Row Count: ~500k
Query 4
• Exadata Row Count: ~5M
• Impala Row Count: ~2M

14
Denodo Dynamic Query Optimizer
System           Execution Time   Data Transferred   Optimization Technique
Denodo           9 sec.           4 M                Aggregation push-down
Without Denodo   125 sec.         292 M              None: full scan
SELECT c.id, SUM(s.amount) as total
FROM customer c JOIN sales s
ON c.id = s.customer_id
GROUP BY c.id
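[Diagram: without push-down, 290 M Sales rows and 2 M Customer rows are transferred, then joined and grouped. With aggregation push-down, the group by runs at the Sales source, so 2 M aggregated Sales rows and 2 M Customer rows are transferred before the join.]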
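The aggregation push-down on this slide can be illustrated with a small, self-contained sketch. It uses Python's built-in `sqlite3` module and a handful of made-up rows as a stand-in for the real sources; it does not reproduce Denodo's optimizer. The first query is the one from the slide (join, then group). The second is a hand-written equivalent in which `sales` is aggregated per customer before the join, which is valid here because `customer.id` is unique.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sales (customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO sales VALUES (1, 10.0), (1, 5.0), (2, 7.5), (2, 2.5), (2, 1.0);
""")

# Plan without push-down: every sales row reaches the join, then is grouped.
naive = db.execute("""
    SELECT c.id, SUM(s.amount) AS total
    FROM customer c JOIN sales s
      ON c.id = s.customer_id
    GROUP BY c.id
    ORDER BY c.id
""").fetchall()

# Plan with push-down: sales is reduced to one row per customer first,
# so far fewer rows have to leave the sales source.
pushed = db.execute("""
    SELECT c.id, s.total
    FROM customer c
    JOIN (SELECT customer_id, SUM(amount) AS total
          FROM sales
          GROUP BY customer_id) s
      ON c.id = s.customer_id
    ORDER BY c.id
""").fetchall()

# Rows that would cross the network from the sales source in each plan.
rows_without_pushdown = db.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
rows_with_pushdown = db.execute(
    "SELECT COUNT(DISTINCT customer_id) FROM sales"
).fetchone()[0]

print(naive == pushed, rows_without_pushdown, rows_with_pushdown)
```

Both plans return the same totals; only the number of sales rows shipped to the join differs, which is the effect the slide's "Data Transferred" column reports at scale.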

15
Data Virtualization with Massively Parallel Processing
Example query: Sales (300 million rows) joined with Customer (2M rows), grouped by ZIP. With push-down, Sales is first grouped by customer ID, yielding 2M rows (sales by customer), which are then joined and grouped by ZIP.

System     Execution Time   Optimization Techniques
Others     ~ 10 min         Traditional Federation Tools
No MPP     43 sec           Aggregation push-down
With MPP   11 sec           Aggregation push-down + MPP integration (Impala 8 nodes)

With MPP Integration
1. Partial Aggregation Push-down
Maximizes source processing. Reduces network traffic.
2. Integrated with Cost-based Optimizer
Based on data volume estimation and the cost of these particular operations, CBO can decide to move all or part of the execution tree to the MPP.
3. On-demand data transfer
For SQL-on-Hadoop systems, Denodo automatically generates and uploads Parquet files.
4. Integration with local and pre-cached data
The optimization engine detects when data is cached, or when it is a native table in the MPP.
5. Fast parallel execution
Support for Spark, Presto, and Impala for fast analytical processing using inexpensive Hadoop-based solutions.
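The partial aggregation push-down described on this slide can be sketched as a two-step computation. The example below is a minimal illustration with invented rows, using `sqlite3` as a stand-in for the Sales and Customer sources and a plain Python loop as a stand-in for the layer that performs the final aggregation (the MPP engine on the slide). It is not Denodo's implementation.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, zip TEXT);
    CREATE TABLE sales (customer_id INTEGER, amount INTEGER);
    INSERT INTO customer VALUES (1, '94301'), (2, '94301'), (3, '10001');
    INSERT INTO sales VALUES (1, 10), (1, 5), (2, 20), (3, 7), (3, 3);
""")

# Reference result: join everything, then group by ZIP in one query.
direct = db.execute("""
    SELECT c.zip, SUM(s.amount)
    FROM sales s JOIN customer c ON s.customer_id = c.id
    GROUP BY c.zip
    ORDER BY c.zip
""").fetchall()

# Step 1 (pushed to the sales source): partial aggregation by customer ID.
# Sales has no ZIP column, so this is as far as the source can aggregate.
partial = db.execute("""
    SELECT customer_id, SUM(amount)
    FROM sales
    GROUP BY customer_id
""").fetchall()

# Step 2 (outside the source): join the partial sums to customers and
# finish the aggregation by ZIP.
zip_of = dict(db.execute("SELECT id, zip FROM customer").fetchall())
totals = {}
for customer_id, subtotal in partial:
    zip_code = zip_of[customer_id]
    totals[zip_code] = totals.get(zip_code, 0) + subtotal
two_step = sorted(totals.items())

print(direct == two_step)
```

The two-step plan works because SUM can be computed from partial sums; the source only ships one row per customer instead of one row per sale.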

16
Customer-reported projected savings by percentage
ROI and TCO of Data Virtualization
Data Integration Cost reduction
▪ 60-80% savings
Traditional Call Centres, Portals
▪ 30-70% savings
BI and Reporting
▪ 40-60% savings
ETL and Data Warehousing
▪ Project timelines of 6-12 months reduced to 3-6 months
▪ Up to 85% reduction in time
• New sources can be configured in minutes, and fully integrated within days.
• Hundreds of application entities can be integrated within weeks.
• New business functionality can be added within days.
• Existing functionality can be enhanced with new data within days.
• Data proliferation can be significantly reduced.
• Common, consistent and timely access to all data via preferred visualization tools.

17
Denodo ‘Horizontal Solution’ Categories
Customer Centricity / MDM
✓ Complete View of Customer
Data Services
✓ Data as a Service
✓ Data Marketplace
✓ Data Services
✓ Application and Data Migration
Cloud Solutions
✓ Cloud Modernization
✓ Cloud Analytics
✓ Hybrid Data Fabric
Data Governance
✓ GRC
✓ GDPR
✓ Data Privacy / Masking
BI and Analytics
✓ Self-Service Analytics
✓ Logical Data Warehouse
✓ Enterprise Data Fabric
Big Data
✓ Logical Data Lake
✓ Data Warehouse Offloading
✓ IoT Analytics

18
Autodesk – Integration of Streaming Data
• 2017 CIO 100 Award
• Finalist in 2017 Excellence Awards

19
Asurion: Data Virtualization On-Premises and Cloud
• 2017 Best Practices Award

20
Denodo
The Leader in Data Virtualization
DENODO OFFICES, CUSTOMERS, PARTNERS
Palo Alto, CA. Global presence throughout North America, EMEA, APAC, and Latin America.
LEADERSHIP
▪ Longest continuous focus on data virtualization – since 1999
▪ Leader in 2018 Forrester Wave – Big Data Fabric
▪ Winner of numerous awards
CUSTOMERS
~500 customers, including many F500 and G2000 companies across every major industry, have gained significant business agility and ROI.
FINANCIALS
Backed by $4B+ private equity firm. 50+% annual growth; profitable.

www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved
Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.
