ADV Slides: Building and Growing Organizational Analytics with Data Lakes
•
0 likes•584 views
Data lakes are providing immense value to organizations embracing data science.
In this webinar, William will discuss the value of having broad, detailed, and seemingly obscure data available in cloud storage for purposes of expanding Data Science in the organization.
1 of 30
Download to read offline
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
More Related Content
ADV Slides: Building and Growing Organizational Analytics with Data Lakes
1. Building and Growing
Organizational Analytics with
Data Lakes
Presented by: William McKnight
“#1 Global Influencer in Data Warehousing” Onalytica
President, McKnight Consulting Group
An Inc. 5000 Company in 2018 and 2017
@williammcknight
www.mcknightcg.com
(214) 514-1444
Second Thursday of Every Month, at 2:00 ET
#AdvAnalytics
2. William McKnight
President, McKnight Consulting Group
• Frequent keynote speaker and trainer internationally
• Consulted to Pfizer, Scotiabank, Fidelity, TD Ameritrade, Teva
Pharmaceuticals, Verizon, and many other Global 1000 companies
• Hundreds of articles, blogs, benchmarks and white papers in publication
• Focused on delivering business value and solving business problems utilizing
proven, streamlined approaches to information management
• Former Database Engineer, Fortune 50 Information Technology executive
and Ernst&Young Entrepreneur of Year Finalist
• Owner/consultant: 2018 & 2017 Inc. 5000 Data Strategy and Implementation
consulting firm
• Brings 25+ years of information management and DBMS experience
3. Implementation
McKnight Consulting Group Offerings
Strategy
Training
Strategy
Trusted Advisor
Action Plans
Roadmaps
Tool Selections
Program Management
Training
Classes
Workshops
Implementation
Data/Data Warehousing/Business
Intelligence/Analytics
Master Data Management
Governance/Quality
Big Data
3
5. Right-Fitting Analytic Platforms
• The Enterprise Data Warehouse
• A Dependent Data Mart
• The Enterprise Data Lake
• A Big Data Cluster
• An Independent Data Mart
5
8. HDFS vs Cloud Storage
• Cloud Storage is more scalable and persistent
• Cloud Storage is backed up and supports
compression, making the cost of big data less
• HDFS has better query performance
• Cloud Storage has object size and single PUT
limits that need workarounds
8
9. Leveraging Cloud Storage for Data Lakes
• More Achievable separate compute and storage architecture
• Compute resources (Map/Reduce, Hive, Spark, etc.) can be taken
down, scaled up or out, or interchanged without data movement
• Storage can be centralized, but compute can be distributed
• Major players have mechanism to ensure consistency to achieve
ACID-like compliance for remote data changes
• Some vendors also have remote data replication to ensure
redundancy and recovery
• Most of the query execution is processing time, and not data
transport, so if cloud compute and storage are in the same cloud
vendor region, performance is hardly impacted
9
10. Data Lakes (Cloud Storage) vs Data Warehouses
(RDBMS)
Usually no pre-specified data model
A Data Lake is great for history data
A Data Lake easily accepts all data types
Fewer users
More exploration and discovery
Lighter data governance
Limitless big data
11. Balance of Analytics
Analytic Applications
DW
Data Lake
Analytic Applications
DW
Data Lake
Analytic Applications
DW Data Lake
DW
13. Data Lake
Data Warehouse Data Lake
Machine
Learning
Categorical Model
(e.g. Decision Tree)
Categorical Data Quantitative Data
Split
Quantitative Model
(e.g. Regression)
Train Train
Score Score
Evaluate
Historical
Transaction Data
Deploy
Scores
Real Time
Transactions
Actions
Conceptual Model
18. Data Governance for the Data Lake
• Data stewardship coverage
• Data catalog coverage
• Less transformation
• Nulls, blanks
• Date formatting
• Data ranges, patterns, outliers
• Relationships
• Business rules
• Data classification
• Confidence scores
18
Full Data Governance Data Lake Data Governance
21. Data Lake Setup
• Managed deployments in the Hadoop family of
products
• External tables in Hive metastore that point at
cloud storage (Amazon S3, Google Cloud
Storage, Azure Data Lake Storage Gen 2)
– To run SQL against the data
– HiveQL and Spark SQL require entries in the
metastore
21
22. Object Storage Instances
• Object Storage instances/clusters have local
storage, i.e., on the physical drives mounted to the
instances themselves, that is HDFS and Hive
• Object Storage technologies access their cloud
vendor’s respective cloud storage—viz.:
– Amazon EMR accesses S3
– Dataproc accesses Google Cloud Storage
– HDI accesses Azure Data Lake Storage Gen2
• Local storage is used by the Object Storage
platform for housekeeping
22
23. The Data Lake of the Future
• Pair a lake with an analytical engine that charges
only by what you use
• If you have a ton of data that can sit in cold storage
and only needs to be accessed or analyzed
occasionally, store it in Amazon S3/Azure Blob
Storage/Google Cloud Storage
– Use a database (on-premise or in the cloud) that can
create external tables that point at the storage
– Analysts can query directly against it, or draw down a
subset for some deeper/intensive analysis
– The GB/month storage fee plus data transfer/egress
fees will be much cheaper than leaving it in a data
warehouse
23
24. Sample Cluster Configuration
Google BigQuery
Cloud Provider Google Cloud
Platform Version 3.6
Hadoop Version 2.7.3
Hive Version 1.2.1
Spark Version 2.3.2
Instance Type n1-highmem-16
Head/Master Nodes 1
Worker Nodes 16 and 32
vCPUs (per node) 16
RAM (per node) 104 GB
Compute Cost
(per node per hour)
$0.947
Platform Premium (per node per hour) $0.160
24
25. Tips
• If possible, configure remote data to be stored in parquet
format, as opposed to comma-separated or other text format
• As new data sources are added to cloud storage, use a code
distribution system—like Github—to distribute new table
definitions to distributed teams
• Use data partitioning to improve performance—but don’t
forget new partitions have to be declared to the Hive
metastore when they are added to the data
• Co-locate compute and storage in the same region
• Use encryption on cloud storage bucket to ensure encryption
at-rest
• Drop commonly used data in the lake, like master data from
MDM
25
27. All Data is AI Data
• Call center recordings and chat logs
• Streaming sensor data, historical maintenance records and
search logs
• Customer account data and purchase history
• Email response metrics
• Product catalogs and data sheets
• Public references
• YouTube video content audio tracks
• User website behaviors
• Sentiment analysis, user-generated content, social graph data,
and other external data sources
27
28. Artificial Intelligence and Machine Learning
• Looming on the horizon is an injection of AI/ML
into every piece of software
• Consider the domain of data integration
– Predicting with high accuracy the steps ahead
– Fixing its bugs
• Machine learning is being built into databases so
the data will be analyzed as it is loaded
– I.e., Python with TensorFlow and Scala on Spark.
• The split of the necessary AI/ML between the
"edge" of corporate users and the software itself is
still to be determined
28
29. Training Data for Machine Learning & Artificial
Intelligence
• You must have enough data to analyze to build
models
• Your data determines the depth of AI you can
achieve -- for example, statistical modeling,
machine learning, or deep learning -- and its
accuracy
29
30. Building and Growing
Organizational Analytics with
Data Lakes
Presented by: William McKnight
“#1 Global Influencer in Data Warehousing” Onalytica
President, McKnight Consulting Group
An Inc. 5000 Company in 2018 and 2017
@williammcknight
www.mcknightcg.com
(214) 514-1444
Second Thursday of Every Month, at 2:00 ET
#AdvAnalytics