1. Confidential and Proprietary to Daugherty Business Solutions
Goals
By the end of this session you will understand:
• The role of popular Big Data technologies
• The scope of data engineering and data analytics
• How this group supports local companies using these technologies
How does this help me?
3. Agenda
• Introduction
• Big Data
• Innovations
• Data Engineering
• Data Analytics
• St. Louis
• Questions as we head into 2020
• Conclusion
4. St. Louis Big Data Innovations, Data Engineering, and Analytics Group
Or more simply put: St. Louis Big Data IDEA
But let’s break that down…
5. St. Louis Companies
• Bayer
• Mastercard
• ESI
• Centene
• AB
• RGA
• Panera
• Label Insight
• Nestle Purina
• Enterprise Holdings
• Maritz
• Edward Jones
• Graybar
• Mercy
• Charter
• Magellan Health
8. Big Data Technologies to Know
• HDFS
• Hive/Impala
• Spark
• HBase
• Solr
• Zeppelin
• Kafka
• NiFi
• Avro/Parquet/ORC
• S3/Ozone/ADLS
• Cloud technologies
9. HDFS
• In the beginning…
• How big is your block?
• Replication
• Partitioning & Compression
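The block-size and replication bullets come down to simple arithmetic, sketched below. This is a minimal illustration that assumes the stock HDFS defaults of a 128 MiB block size and a replication factor of 3. The helper name `hdfs_footprint` is hypothetical and not part of any Hadoop API, and NameNode metadata overhead is ignored.

```python
MIB = 1024 * 1024

def hdfs_footprint(file_bytes, block_bytes=128 * MIB, replication=3):
    """Return (block_count, raw_bytes_stored) for a single file.

    Hypothetical helper for illustration; defaults are assumptions.
    """
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = (file_bytes + block_bytes - 1) // block_bytes
    # Each block is stored `replication` times, but a partial block only
    # occupies the bytes it actually holds.
    raw_bytes = file_bytes * replication
    return blocks, raw_bytes

blocks, raw_bytes = hdfs_footprint(1024 * MIB)
print(blocks, raw_bytes // MIB)  # prints "8 3072"
```

A file one byte larger than a block needs a second block, which is why many small files put pressure on cluster metadata even when they use little disk.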
10. Serialization Formats
• CSV
• JSON
• Avro
• Parquet
• ORC
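One practical difference between the text formats on this list is how much type information survives a round trip. The sketch below uses only the Python standard library and made-up sample records. Avro, Parquet, and ORC are binary formats that carry a schema and need third-party libraries, so they are not shown here.

```python
import csv
import io
import json

# Made-up sample records for illustration.
records = [
    {"id": 1, "city": "St. Louis", "temp_f": 71.5},
    {"id": 2, "city": "Clayton", "temp_f": 69.0},
]

# CSV: a header row plus delimited values; no type information is stored.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "city", "temp_f"])
writer.writeheader()
writer.writerows(records)
csv_rows = list(csv.DictReader(io.StringIO(buf.getvalue())))

# JSON Lines: one self-describing object per line; numbers stay numbers.
json_text = "\n".join(json.dumps(r) for r in records)
json_rows = [json.loads(line) for line in json_text.splitlines()]

print(type(csv_rows[0]["id"]).__name__, type(json_rows[0]["id"]).__name__)
# prints "str int"
```

Every CSV value comes back as a string, so the reader has to re-apply a schema that the file itself never recorded.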
11. Hive/Impala
• SQL abstraction on top of a storage layer
• Good for OLAP-style queries with slowly changing dimensions
• Improvements
– Tez
– Calcite
– LLAP
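An OLAP-style query typically joins a large fact table to a small dimension table and aggregates the result. The sketch below shows that query shape using Python's built-in `sqlite3` as a stand-in engine. It is not Hive or Impala, and the table and column names (`fact_sales`, `dim_store`) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (store_id INTEGER, amount REAL);
INSERT INTO dim_store VALUES (1, 'Midwest'), (2, 'Midwest'), (3, 'South');
INSERT INTO fact_sales VALUES (1, 100.0), (1, 50.0), (2, 25.0), (3, 10.0);
""")

# Aggregate the fact table by an attribute that lives in the dimension table.
rows = conn.execute("""
SELECT d.region, COUNT(*) AS sales, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_store d ON d.store_id = f.store_id
GROUP BY d.region
ORDER BY d.region
""").fetchall()

print(rows)  # prints [('Midwest', 3, 175.0), ('South', 1, 10.0)]
```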
12. Spark
• General-purpose distributed computation framework
• First-class support for Scala, Java, and Python
• Runs on individual machines, Kubernetes, or Hadoop
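The core idea behind a distributed computation framework is to split data into partitions, compute on each partition independently, and then merge the partial results. The sketch below illustrates that map-then-reduce pattern in plain Python. It does not use the Spark API, and the function names and sample data are hypothetical.

```python
from collections import Counter
from functools import reduce

def count_words(partition):
    """Map step: count the words within one partition of lines."""
    return Counter(word.lower() for line in partition for word in line.split())

def merge(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Two partitions of made-up input; on a cluster these would live on different nodes.
partitions = [
    ["big data", "data engineering"],
    ["data analytics", "big ideas"],
]

partials = [count_words(p) for p in partitions]  # independent, so parallelizable
totals = reduce(merge, partials, Counter())

print(totals["data"], totals["big"])  # prints "3 2"
```

Because each partition is processed without reference to the others, the map step can run on as many machines as there are partitions.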
13. What is HBase?
HBase is the diesel-powered race car engine that drives applications with Hadoop.
14. Apache Solr
• Built on Apache Lucene
• Search
– Full-text
– Faceted
– Hit highlighting
– Real-time indexing
– Database integration
– Dynamic clustering
– NoSQL features
• ELK stack
– Elasticsearch: search
– Logstash: data collection and log parsing
– Kibana: analytics and visualization platform
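Full-text and faceted search both rest on an inverted index, which maps each term to the documents that contain it. The sketch below is a toy version in plain Python. It is not the Solr or Lucene API, and the documents, field names, and helper functions are invented for illustration.

```python
from collections import Counter, defaultdict

# Made-up documents with a text field and a category field to facet on.
docs = {
    1: {"text": "Spark streaming with Kafka", "category": "streaming"},
    2: {"text": "Batch ETL with Spark", "category": "batch"},
    3: {"text": "Kafka topics and consumers", "category": "streaming"},
}

# Inverted index: term -> set of ids of the documents containing that term.
index = defaultdict(set)
for doc_id, doc in docs.items():
    for term in doc["text"].lower().split():
        index[term].add(doc_id)

def search(query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))

def facets(doc_ids):
    """Count matching documents per category."""
    return Counter(docs[doc_id]["category"] for doc_id in doc_ids)

print(sorted(search("spark")), facets(search("kafka")))
```

A real search engine adds tokenization, relevance scoring, and distributed shards on top of this structure.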
15. Zeppelin
• Data science notebook that supports Spark, SQL, Python, and 25 other interpreters
• Allows users to share data science documents
16. Apache Kafka
• Publish and subscribe to message topics
• Process the data as it arrives
• Store the data durably
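The three bullets fit together because a topic is an append-only log: publishing appends to it, storage is the log itself, and each consumer group processes messages at its own pace by tracking an offset. The sketch below is a toy in-memory model. It is not the Kafka client API, and the `Topic` class and group names are hypothetical.

```python
from collections import defaultdict

class Topic:
    """Toy append-only log with per-group read offsets (illustration only)."""

    def __init__(self):
        self.log = []                    # stored messages, in arrival order
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def publish(self, message):
        self.log.append(message)
        return len(self.log) - 1         # offset assigned to the new message

    def poll(self, group, max_messages=10):
        start = self.offsets[group]
        batch = self.log[start:start + max_messages]
        self.offsets[group] = start + len(batch)
        return batch

topic = Topic()
for reading in (70, 72, 75):
    topic.publish(reading)

alerts = [r for r in topic.poll("alerting") if r > 71]  # process on read
archive = topic.poll("archiver")                        # independent consumer

print(alerts, archive, topic.poll("alerting"))
# prints "[72, 75] [70, 72, 75] []"
```

Reading does not remove messages, so a new consumer group can replay the full history without affecting existing ones.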
17. NiFi
• Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.
• Extensible
• Configurable
• Visual programming and monitoring
• Data provenance built-in
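A directed graph of routing and transformation logic can be pictured as named processors, each of which transforms a record and then chooses the next processor. The sketch below models that idea in plain Python and records a provenance trail along the way. It is not how NiFi is configured or programmed, and the node names and `run_flow` helper are hypothetical.

```python
def run_flow(graph, start, record):
    """Push one record through a directed graph of processors.

    `graph` maps a node name to (transform, route). `transform` returns the
    updated record; `route` returns the next node name, or None to stop.
    Returns the final record and the list of nodes it passed through.
    """
    provenance = []
    node = start
    while node is not None:
        transform, route = graph[node]
        record = transform(record)
        provenance.append(node)
        node = route(record)
    return record, provenance

# Hypothetical flow: parse a reading, flag hot readings, then store.
graph = {
    "parse": (
        lambda r: {**r, "temp": float(r["temp"])},
        lambda r: "flag_hot" if r["temp"] > 90 else "store",
    ),
    "flag_hot": (lambda r: {**r, "alert": True}, lambda r: "store"),
    "store": (lambda r: r, lambda r: None),
}

result, trail = run_flow(graph, "parse", {"sensor": "a1", "temp": "95.2"})
print(trail)  # prints "['parse', 'flag_hot', 'store']"
```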
19. Cloud Technologies
Technology | AWS | Azure | GCP
Object storage | S3 | Azure Blob Storage | Google Cloud Storage (GCS)
Serverless compute | Lambda | Azure Functions | Cloud Functions
Servers | EC2 | Azure Virtual Machines | Compute Engine
Database | RDS | Azure Database | Cloud Bigtable
20. Turn and Talk
• What are some of the technologies that you are using that aren’t on the list?
• What are some of the technologies that you want to use that aren’t on the list?
• What technologies are you using that you want to replace?
28. Architecting Distributed Systems
• Big Data: Using the right mix of Big Data ecosystem components to deliver the business value desired.
• Cloud: Using the right mix of open source, cloud-native, and managed solutions to deliver solutions.
• Containers: Containers simplify the process of deployment, making it reliable and repeatable.
• Streaming: Because yesterday’s data might be too old.
29. Creating Reliable Pipelines
It’s not enough to do it once.
• Reproducible
• Performant
• Robust
• Flexible
• Monitored
• Governed
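Two of these properties, robust and monitored, can be illustrated with a small wrapper that retries a failing step and logs every attempt. This is a minimal sketch that assumes transient failures are safe to retry. The `run_step` helper and the `flaky_extract` step are hypothetical, and the other four properties are not covered here.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def run_step(step, *, attempts=3, delay_seconds=0.0):
    """Run a step, retrying on failure and logging the outcome of each attempt."""
    for attempt in range(1, attempts + 1):
        try:
            result = step()
            log.info("step=%s attempt=%d status=ok", step.__name__, attempt)
            return result
        except Exception as exc:
            log.warning("step=%s attempt=%d status=failed error=%s",
                        step.__name__, attempt, exc)
            if attempt == attempts:
                raise  # out of retries: surface the failure to the caller
            time.sleep(delay_seconds)

calls = {"count": 0}

def flaky_extract():
    """Simulated source that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("source unavailable")
    return [1, 2, 3]

rows = run_step(flaky_extract)
print(rows, calls["count"])  # prints "[1, 2, 3] 3"
```

Retrying is only safe when the step is idempotent, which is one reason reproducibility appears on the same list.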
31. Collaborating with Data Scientists
Daugherty recommends embedding 2-5 data engineers for each data scientist in order to maximize the productivity of the data scientist.
32. Architecting Data Storage
• Storage Mechanisms
• Serialization Framework
• Compression Mechanisms
33. Data Engineering Examples
• Streaming data
• Batch data analysis
• ETL @ Scale
• Machine Learning Pipelines
• Data Governance
34. Data Engineering Maturity
• Manual – Copies files
• Batch – Triggered process that runs jobs
• Tooled – Tool-based ETL/ELT
• Integrated – Combines ETL with Data Governance capabilities
• Streaming – Low latency transfers
• Insightful – Integrated with Data Science processes
35. Turn and Talk
• Which aspects of data engineering are most in line with your understanding?
• Which aspects of data engineering are most foreign to your understanding?
36. What is Data Analytics?
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analytics involves applying algorithmic or mechanical processes to derive insights.
The conversion of data into information can take many forms:
• Visualizations
• Statistical analytics
• Computational analytics
37. Visualizations and Big Data
How do you represent data when one of its defining characteristics is Volume?
38. Complex Cause and Effect Relationships
• Regulation, optimization
• Basic machine learning systems stabilize drones, using simple inputs to determine how much power to send to each rotor
Cause → Effect
39. Forecasting and Prediction
Algorithms predict the weather based on the previous day’s weather and sensor readings.
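The simplest form of predicting from recent history is a moving average, sketched below. This is a deliberately naive baseline under the assumption that tomorrow resembles the last few days. Real weather models are far more sophisticated, and the function name and sample temperatures are made up.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if not history:
        raise ValueError("history must contain at least one observation")
    recent = history[-window:]
    return sum(recent) / len(recent)

daily_highs_f = [60, 62, 64, 66]  # made-up readings, oldest first
print(moving_average_forecast(daily_highs_f))  # prints "64.0"
```

With `window=1` this reduces to a persistence forecast: tomorrow will match today.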
40. Categorization/Segmentation
Netflix makes movie recommendations by grouping users based on viewing habits, and recommending movies enjoyed by other users in the same group.
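Grouping users by viewing habits can be sketched as a nearest-neighbor lookup: find the user whose history overlaps most with yours, then suggest what they watched and you have not. The code below uses Jaccard similarity on made-up viewing histories. It is an illustration of the general idea, not a description of Netflix's actual system.

```python
def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, history, k=1):
    """Suggest titles watched by the k most similar users but not by `user`."""
    seen = history[user]
    peers = sorted(
        (other for other in history if other != user),
        key=lambda other: jaccard(seen, history[other]),
        reverse=True,
    )[:k]
    suggestions = set()
    for peer in peers:
        suggestions |= history[peer] - seen
    return suggestions

# Made-up viewing histories.
history = {
    "ann": {"Alien", "Arrival", "Dune"},
    "bob": {"Alien", "Arrival", "Solaris"},
    "cat": {"Clueless", "Grease"},
}

print(recommend("ann", history))  # prints "{'Solaris'}"
```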
41. Sensory Recognition
Voice recognition in Siri, Google Voice, and similar assistants.
42. Network Analysis
Facebook recommends possible connections based on existing network connections.
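One common way to recommend connections from an existing network is to count mutual friends: the more friends two strangers share, the stronger the suggestion. The sketch below applies that heuristic to a small made-up graph. It illustrates the technique only and makes no claim about how Facebook implements it.

```python
from collections import Counter

def suggest_connections(person, friends):
    """Rank people who are not yet friends by number of mutual connections."""
    mutual = Counter()
    for friend in friends[person]:
        for candidate in friends[friend]:
            if candidate != person and candidate not in friends[person]:
                mutual[candidate] += 1
    return mutual.most_common()

# Made-up undirected friendship graph as adjacency sets.
friends = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "dee", "eli"},
    "cy": {"ana", "dee"},
    "dee": {"ben", "cy"},
    "eli": {"ben"},
}

print(suggest_connections("ana", friends))  # prints "[('dee', 2), ('eli', 1)]"
```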
43. Turn and Talk
• What are some examples of data science that you encounter in your everyday life?
44. Is Hadoop Dead?
Hadoop has not died, but it is evolving…
45. What is the Dividing Line Between Spark and Hadoop?
Hadoop has always been about distributed storage.
Spark is about compute.
Storage and compute will forever be separated.
47. How Does the Cloud Affect Hadoop?
The cloud enables Hadoop workloads, and Hadoop enables the cloud.
Focus will shift to security, governance, and customer choice.
48. What About Streaming?
Streaming is alive and well.
Streaming is becoming a necessity for any use case.
Streaming will be the foundation for all ML and AI.
49. What is the Future of Open Source?
Apache will always be Apache.
New open source licensing will be implemented to protect the innocent.
50. So What is the STL Big Data IDEA interested in?
• Local Companies
• Big Data
– Hadoop
– Cloud deployments
– Cloud-native technologies
– Spark
– Kafka
• Innovation
– New Big Data projects
– New Big Data services
– New Big Data applications
• Data Engineering
– Streaming data
– Batch data analysis
– Machine Learning Pipelines
– Data Governance
– ETL @ Scale
• Analytics
– Visualization
– Machine Learning
– Reporting
– Forecasting