Confidential and Proprietary to Daugherty Business Solutions
Goals
By the end of this session you will understand
• The role of popular Big Data technologies
• The scope of data engineering and data analytics
• How this group supports local companies using these technologies
How does this help me?
Big Data 101
Agenda
• Introduction
• Big Data
• Innovations
• Data Engineering
• Data Analytics
• St. Louis
• Questions as we head into 2020
• Conclusion
St. Louis Big Data Innovations, Data Engineering, and Analytics Group
Or more simply put: St. Louis Big Data IDEA
But let’s break that down…
St. Louis Companies
• Bayer
• Mastercard
• ESI
• Centene
• AB
• RGA
• Panera
• Label Insight
• Nestle Purina
• Enterprise Holdings
• Maritz
• Edward Jones
• Graybar
• Mercy
• Charter
• Magellan Health
Big Data – A Mental Picture
The 3 Vs: Volume, Velocity, and Variety
Big Data Technologies to Know
• HDFS
• Hive/Impala
• Spark
• HBase
• Solr
• Zeppelin
• Kafka
• NiFi
• Avro/Parquet/ORC
• S3/Ozone/ADLS
• Cloud technologies
HDFS
• In the beginning…
• How big is your block?
• Replication
• Partitioning & Compression
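The block-size and replication points on the HDFS slide come down to simple arithmetic. The sketch below assumes the common HDFS defaults of a 128 MB block size and a replication factor of 3; both are configurable, and the function name `hdfs_footprint` is hypothetical, not part of any Hadoop API.

```python
MB = 1024 * 1024

def hdfs_footprint(file_size_bytes, block_size_bytes=128 * MB, replication=3):
    """Return (number of blocks, raw bytes stored across the cluster).

    Assumed defaults: 128 MB blocks, 3 replicas. Real clusters may differ.
    """
    # Ceiling division: the last block may be only partially filled,
    # and a partial block occupies only the bytes it actually holds.
    blocks = -(-file_size_bytes // block_size_bytes)
    # Every byte of the file is stored once per replica.
    raw_bytes = file_size_bytes * replication
    return blocks, raw_bytes

blocks, raw = hdfs_footprint(1000 * MB)
print(f"{blocks} blocks, {raw // MB} MB of raw storage")
```

A larger block size means fewer blocks for the NameNode to track, while a higher replication factor trades raw disk for fault tolerance.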
Serialization Formats
• CSV
• JSON
• Avro
• Parquet
• ORC
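To make the difference between these formats more concrete, here is a minimal sketch using only the Python standard library. It writes the same records as CSV and as JSON lines (both row-oriented text), then pivots them into a column-oriented layout, which is the basic idea behind Parquet and ORC. Avro, Parquet, and ORC are binary formats that need their own libraries, so they are not produced here. The sample records and helper names are made up for illustration.

```python
import csv
import io
import json

# Hypothetical sample records, for illustration only.
records = [
    {"city": "St. Louis", "year": 2019, "attendees": 40},
    {"city": "Chicago", "year": 2019, "attendees": 55},
]

def to_csv(rows):
    """Row-oriented text with a header; type information is lost."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

def to_json_lines(rows):
    """Row-oriented text, one self-describing object per line."""
    return "\n".join(json.dumps(row) for row in rows)

def to_columns(rows):
    """Column-oriented layout: all values of one field stored together."""
    return {name: [row[name] for row in rows] for name in rows[0]}

csv_text = to_csv(records)
json_text = to_json_lines(records)
columns = to_columns(records)

# An aggregate over one field only needs to touch that one column.
total_attendees = sum(columns["attendees"])
print(csv_text)
print(json_text)
print(total_attendees)
```

The columnar layout is why analytic queries that read a few fields out of many tend to favor Parquet or ORC, while row-oriented formats suit record-at-a-time exchange.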
Hive/Impala
• SQL abstraction on top of a storage layer
• Good for OLAP style queries with slowly changing dimensions
• Improvements
– Tez
– Calcite
– LLAP
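The kind of OLAP-style query meant here joins a large fact table to a small dimension table and aggregates. The sketch below uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; it is not Hive or Impala, and the table names and rows are hypothetical. The SQL itself is plain enough that a similar statement would be expected to work in HiveQL.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a SQL-on-storage engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (store_id INTEGER, amount REAL);
INSERT INTO dim_store VALUES (1, 'Midwest'), (2, 'South');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 75.0);
""")

# Aggregate the fact table by an attribute of the dimension table.
rows = conn.execute("""
SELECT d.region, COUNT(*) AS orders, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_store d ON d.store_id = f.store_id
GROUP BY d.region
ORDER BY d.region
""").fetchall()

for region, orders, revenue in rows:
    print(region, orders, revenue)
```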
Spark
• General purpose, distributed computational framework
• First class support for Scala, Java, Python
• Runs on individual machines, Kubernetes, or Hadoop
What is HBase?
HBase is the diesel-powered race car engine that drives applications with Hadoop.
Apache Solr
• Derives from work for Apache Lucene
• Search
– Full-text
– Faceted
– Hit Handling
– Real-time indexing
– Database integration
– Dynamic clustering
– NoSQL Features
• ELK stack
– Elasticsearch – Search
– Logstash – Data collection and log parsing
– Kibana – Analytics and visualization platform
Zeppelin
• Data science notebook that supports Spark, SQL, Python, and 25 other interpreters
• Allows users to share data science documents
Apache Kafka
• Publish and Subscribe Message Topics
• Process the data
• Store data
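The three Kafka roles listed here (publish/subscribe, process, store) can be illustrated with a toy in-memory model. This is a sketch of the concept only, not the Kafka client API: the `Topic` class and its methods are hypothetical, and real Kafka adds partitions, durable on-disk logs, and distributed brokers.

```python
from collections import defaultdict

class Topic:
    """Toy append-only log with an independent read offset per consumer group."""

    def __init__(self):
        self._log = []                    # "store": messages are retained
        self._offsets = defaultdict(int)  # next offset to read, per group

    def publish(self, message):
        """Append a message and return its offset in the log."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, group, max_messages=10):
        """Return the next unread messages for this group and advance its offset."""
        start = self._offsets[group]
        batch = self._log[start:start + max_messages]
        self._offsets[group] = start + len(batch)
        return batch

topic = Topic()
for event in ["click", "view", "purchase"]:
    topic.publish(event)

# Two groups read the same stored messages independently.
analytics_batch = topic.poll("analytics")
processed = [event.upper() for event in analytics_batch]  # "process"
audit_batch = topic.poll("audit", max_messages=2)
print(processed, audit_batch)
```

Because reading does not delete messages, a new consumer group can start from offset 0 and replay the full history.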
NiFi
• Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
• Extensible
• Configurable
• Visual programming and monitoring
• Data provenance built-in
Block storage
Cloud Technologies
Technology          AWS      Azure                    GCP
Object storage      S3       Azure Blob Storage       Cloud Storage (GCS)
Serverless compute  Lambda   Azure Functions          Cloud Functions
Servers             EC2      Azure Virtual Machines   Compute Engine
Database            RDS      Azure SQL Database       Cloud Bigtable
Turn and Talk
• What are some of the technologies that you are using that aren’t on the list?
• What are some of the technologies that you want to use that aren’t on the list?
• What technologies are you using that you want to replace?
© Cloudera, Inc. All rights reserved.
MARKET DEVELOPMENT & PHASES OF GROWTH
• BIG DATA TECH – IT early adopters & Developers
• DATA PLATFORM – CIO & Data Admins
• ML, ANALYTICS & CLOUD – LOB & Data Scientists
• ENTERPRISE DATA CLOUD – Digital Transformation – C-suite & Boards
HIERARCHY OF NEEDS FOR THE DATA-DRIVEN ENTERPRISE
The “AI Ladder”
AI
MACHINE LEARNING
DATA SCIENCE
ANALYTICS
"BIG DATA"
ENTERPRISE DATA CLOUD ARCHITECTURE
• Multi-function analytics
• Hybrid and multi-cloud
• Secure and governed
• Open platform
Diagram labels: IOT, INGEST & STREAMING | DATA WAREHOUSING | ML / AI DATA SCIENCE | SECURITY & GOVERNANCE | PUBLIC CLOUDS (compute & storage) | DATACENTER (compute & storage)
CLOUDERA DATA PLATFORM
• Public, private & hybrid clouds
• Shared data experience
• Powered by open source
• Analytics from the Edge to AI
• Unified data control plane
Infrastructure: Private Cloud | Hybrid Cloud | Public Multi-Cloud | Edge
Data anywhere: Catalog | Schema | Migration | Security | Governance
Analytic experiences: DF-X (Data Flow & Streaming) | DE-X (Data Engineering) | DW-X (Data Warehouse) | OD-X (Operational Database) | ML-X (Machine Learning)
Unified control plane: Identity | Orchestration | Management | Operations
Altus | DataPlane
Open source distribution: DISTRO-X
Note: “DF-X”, “DE-X”, “DW-X”, “OD-X”, “ML-X” and “DISTRO-X” names are project placeholders, pending CDP release later this year
THE ENTERPRISE DATA CLOUD COMPANY
Any Cloud | Multi-Function | Open | Secure & Governed
Everyone is excited about Data Science
What is Data Engineering?
Architecting Distributed Systems
• Big Data – Using the right mix of Big Data ecosystem components to deliver the business value desired.
• Cloud – Using the right mix of open source, cloud-native, and managed solutions to deliver solutions.
• Containers – Containers simplify the process of deployment, making it reliable and repeatable.
• Streaming – Because yesterday’s data might be too old.
Creating Reliable Pipelines
It’s not enough to do it once.
Reproducible
Performant
Robust
Flexible
Monitored
Governed
Shaping Data Sources
Collaborating with Data Scientists
Daugherty recommends embedding 2-5 data engineers for each data scientist in order to maximize the productivity of the data scientist.
Architecting Data Storage
• Storage Mechanisms
• Serialization Framework
• Compression Mechanisms
Data Engineering Examples
• Streaming data
• Batch data analysis
• ETL @ Scale
• Machine Learning Pipelines
• Data Governance
Data Engineering Maturity
• Manual – Copies files
• Batch – Triggered process that runs jobs
• Tooled – Tool-based ETL/ELT
• Integrated – Combines ETL with Data Governance capabilities
• Streaming – Low latency transfers
• Insightful – Integrated with Data Science processes
Turn and Talk
• Which aspects of data engineering are most in line with your understanding?
• Which aspects of data engineering are most foreign to your understanding?
What is Data Analytics?
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analytics involves applying algorithmic or mechanical processes to derive insights.
The conversion of data into information can take many forms:
• Visualizations
• Statistical analytics
• Computational analytics
Visualizations and Big Data
How do you represent data when one of its defining characteristics is Volume?
Complex Cause and Effect Relationships
• Regulation, optimization
• Basic machine learning systems stabilize drones, using simple inputs to determine how much power to send to each rotor
Cause Effect
Forecasting and Prediction
Algorithms predict the weather based on the previous day’s weather and sensor readings.
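As a rough illustration of forecasting from the previous day's value, the sketch below fits a straight line, tomorrow = slope × today + intercept, by ordinary least squares in plain Python. It is a deliberately tiny stand-in: real weather models use many sensor inputs, and the temperature series and function names here are invented for the example.

```python
def fit_next_day(series):
    """Fit tomorrow = slope * today + intercept by ordinary least squares."""
    xs = series[:-1]  # each day's value...
    ys = series[1:]   # ...paired with the following day's value
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_next(series):
    """Predict the value that follows the last observation in the series."""
    slope, intercept = fit_next_day(series)
    return slope * series[-1] + intercept

# Hypothetical daily temperatures rising by two degrees a day.
temperatures = [10.0, 12.0, 14.0, 16.0]
print(predict_next(temperatures))
```

Adding sensor readings would turn this into a multi-variable regression, which is where a numerical library becomes the more practical choice.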
Categorization/Segmentation
Netflix makes movie recommendations by grouping users based on viewing habits, and recommending movies enjoyed by other users in the same group.
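The grouping idea can be sketched in a few lines. This is not Netflix's actual method: it is a toy that assigns each user to a segment by their most-watched genre, then suggests titles their segment peers watched that they have not. The viewing data and function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical viewing history: user -> list of (title, genre).
history = {
    "ana": [("Alien", "sci-fi"), ("Arrival", "sci-fi"), ("Heat", "crime")],
    "ben": [("Alien", "sci-fi"), ("Dune", "sci-fi")],
    "cy": [("Heat", "crime"), ("Ronin", "crime")],
}

def segment(history):
    """Group users by the genre they watch most often."""
    groups = defaultdict(set)
    for user, views in history.items():
        top_genre = Counter(genre for _, genre in views).most_common(1)[0][0]
        groups[top_genre].add(user)
    return groups

def recommend(user, history):
    """Suggest titles watched by others in the same segment but not by this user."""
    groups = segment(history)
    seen = {title for title, _ in history[user]}
    peers = next(members for members in groups.values() if user in members)
    suggestions = set()
    for peer in peers - {user}:
        suggestions.update(title for title, _ in history[peer])
    return sorted(suggestions - seen)

print(recommend("ben", history))
```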
Sensory Recognition
Siri, Google Voice, etc., for voice recognition
Network Analysis
Facebook recommends possible connections based on existing network connections.
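One simple way to recommend connections from an existing network is to rank friends-of-friends by the number of mutual connections. The sketch below shows that approach on a made-up graph; it is an illustration of the general idea, not a description of how Facebook actually does it, and `suggest_connections` is a hypothetical name.

```python
from collections import Counter

# Hypothetical undirected friendship graph: user -> set of direct connections.
friends = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "cy", "dee"},
    "cy": {"ana", "ben", "dee", "eli"},
    "dee": {"ben", "cy"},
    "eli": {"cy"},
}

def suggest_connections(user, graph):
    """Rank people the user is not yet connected to by mutual connection count."""
    mutual = Counter()
    for friend in graph[user]:
        for candidate in graph[friend]:
            # Skip the user themselves and anyone already connected.
            if candidate != user and candidate not in graph[user]:
                mutual[candidate] += 1
    # Most mutual connections first; ties broken alphabetically.
    return sorted(mutual.items(), key=lambda item: (-item[1], item[0]))

print(suggest_connections("ana", friends))
```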
Turn and Talk
• What are some examples of data science that you encounter in your everyday life?
Is Hadoop Dead?
Hadoop has not died, but it is evolving….
What is the Dividing Line Between Spark and Hadoop?
Hadoop has always been about (distributed) storage.
Spark is about compute.
Storage and compute will forever be separated.
Mergers/Acquisitions of 2019
How Does the Cloud Affect Hadoop?
The cloud enables Hadoop workloads, and Hadoop enables the cloud.
Focus will shift to security, governance, and customer choice.
What About Streaming?
Streaming is alive and well.
Streaming is becoming a necessity for any use case.
Streaming will be the foundation for all ML and AI.
What is the Future of Open Source?
Apache will always be Apache.
New open source licensing will be implemented to protect the innocent.
So What is the STL Big Data IDEA interested in?
• Local Companies
• Big Data
– Hadoop
– Cloud deployments
– Cloud-native technologies
– Spark
– Kafka
• Innovation
– New Big Data projects
– New Big Data services
– New Big Data applications
• Data Engineering
– Streaming data
– Batch data analysis
– Machine Learning Pipelines
– Data Governance
– ETL @ Scale
• Analytics
– Visualization
– Machine Learning
– Reporting
– Forecasting
Questions?

Big Data IDEA 101 2019
