Retooling on the Modern Data and Analytics Stack
2/2020
What is the Modern Tech Stack?
The tools and technologies needed to solve problems that are difficult because of their size, speed, and complexity.
Competencies
Information Management
Data Solutions
Modern Data Architectures
Data Science
Data Governance
Information Management
Data loading
Data modeling
Querying
IM Focus: NoSQL
IM Focus: Platforms & Services
IM Focus: Serialization
JSON
Data Solutions
Tell me a story…
Data Solutions Focus
Profiling
Sampling
Aggregation
Data Governance
Governance outputs are eternal.
Focus On…
Scale
Data Engineering + Data Science = Decision Science
Data Science
Focus On…
Modern Data Architecture
Programmatic data manipulation
Cloud
Big Data
Streaming
Kafka for Publish/Subscribe
KSQL – Kafka + SQL
Debezium – Change Data Capture
Data Engineering
https://www.logicalclocks.com/blog/feature-store-the-missing-data-layer-in-ml-pipelines
Focus on…
Five Steps to Retooling
Awareness
Exposure
Guided Practice
Evolving Practice
Growing Expertise
AWARENESS
Aware, -ish
AWARENESS
Podcasts
Data science / advanced tech groups
Major tech companies
EXPOSURE
Use case studies
Blogs
Try-it-for-free
GUIDED PRACTICE
Online courses
Free AWS and Azure accounts
Open-source downloads
EVOLVING PRACTICE
1. Pick something familiar
2. Make it a little strange
3. Rinse & repeat
Starting pipeline (all local): Ingest People Data → Clean People Data → Find Match Candidates → Pick Best Candidate → Save Results. Each step is a local Python script; match lookups come from a local MySQL database; input and output are local files.
Same pipeline, with the local input and output files replaced by objects in AWS S3; the steps are still local Python against local MySQL.
Same pipeline, with the local MySQL lookup database replaced by AWS RDS; the steps are still local Python, reading from and writing to S3.
Same pipeline, with each local Python step rewritten as a Python AWS Lambda function; data lives in S3 and RDS.
Same pipeline, now orchestrated by an AWS Step Function that invokes the five Python Lambda functions; data lives in S3 and RDS.
GROWING EXPERTISE
1. Add new features
Add New Features: the same Step Function / Lambda pipeline, with a DynamoDB table added for configuration data; other data remains in S3 and RDS.
GROWING EXPERTISE
1. Add new features
2. Improve scalability
Improve Scalability: the Clean People Data, Find Match Candidates, and Pick Best Candidate steps move from Python Lambdas to ECS Tasks; Ingest and Save remain Python Lambdas; the Step Function still orchestrates, with data in S3, RDS, and DynamoDB.
GROWING EXPERTISE
1. Add new features
2. Improve scalability
3. Improve performance
Improve Performance: the Find Match Candidates step moves from an ECS Task to a Spark (Scala) job on EMR; the other steps remain Python Lambdas and ECS Tasks under the Step Function, with data in S3 and DynamoDB.
GROWING EXPERTISE
1. Add new features
2. Improve scalability
3. Improve performance
4. Batch vs. stream
GROWING EXPERTISE
1. Add new features
2. Improve scalability
3. Improve performance
4. Batch vs. stream
5. Automation
Conclusion
Don’t do too much at once!
Questions?
Resources
General
• https://www.analyticsvidhya.com/blog/2018/11/data-engineer-comprehensive-list-resources-get-started/
• https://towardsdatascience.com/who-is-a-data-engineer-how-to-become-a-data-engineer-1167ddc12811
• https://www.dataquest.io/path/data-engineer/
• https://dataengweekly.com/
Podcasts
• https://towardsdatascience.com/our-podcast-c5c1129bc5cf
• https://www.stitcher.com/podcast/httpanalyticshourlibsyncom/the-digital-analytics-power-hour
• https://www.stitcher.com/podcast/data-stories-podcast/data-stories
• https://www.stitcher.com/podcast/data-skeptic-podcast/the-data-skeptic-podcast (Data Science focused)
• https://www.stitcher.com/podcast/oreilly-media-2/the-oreilly-data-show-podcast?refid=stpr
• https://www.dataengineeringpodcast.com/
Reference Architectures
• https://medium.com/refraction-tech-everything/how-netflix-works-the-hugely-simplified-complex-stuff-that-happens-every-time-you-hit-play-3a40c9be254b
• http://highscalability.com/blog/2015/11/9/a-360-degree-view-of-the-entire-netflix-stack.html (older but interesting)
• https://medium.com/airbnb-engineering/airbnb-engineering-infrastructure/home
• https://towardsdatascience.com/how-linkedin-uber-lyft-airbnb-and-netflix-are-solving-data-management-and-discovery-for-machine-9b79ee9184bb
Resources – continued
Use Cases
• https://www.mongodb.com/use-cases
• https://www.confluent.io/blog/category/use-cases/
• https://kafka.apache.org/uses
• https://aws.amazon.com/big-data/use-cases/
• https://www.dataversity.net/eight-big-data-analytics-options-on-microsoft-azure/
• https://www.toptal.com/spark/introduction-to-apache-spark
Try it for Free
• https://neo4j.com/sandbox/ (Neo4j)
• https://www.mongodb.com/cloud/atlas/lp/general/try (MongoDB)
• https://www.postman.com/ + https://www.guru99.com/postman-tutorial.html (trying out APIs)
• https://databricks.com/try-databricks (Spark)
• https://jupyter.org/try
Resources – continued
Open Source Downloads + Guides
• https://spark.apache.org/docs/latest/index.html
• https://kafka.apache.org/documentation/#gettingStarted
• https://www.mongodb.com/download-center/community
• https://www.python.org/about/gettingstarted/
Free Cloud Accounts
• https://aws.amazon.com/free/
• https://azure.microsoft.com/en-us/free/
Online Training
• www.acloud.guru *Recommended
• www.coursera.org
• www.udemy.com
Editor's Notes
  1. The modern tech stack is the tools and technologies needed to solve difficult problems due to their size, speed, and complexity.
  2. That set of tools and technologies is broad. To make it more manageable, it’s useful to divide it into areas of focus, or competencies: Information Management, Data Solutions, Modern Data Architecture, Data Science, and Data Governance.
  3. Information management is the competency that deals with databases. Loading, modeling, storing, querying. SQL is eternal. As you’ll see – many of the more modern technologies expose SQL interfaces.
  4. Traditional relational databases work well when your data is predictable and fits well into tables, columns, and rows, and where queries are not very join-intensive. But if your data is unpredictable or unstructured, or when it is highly connected or you need lightning-fast performance, you may consider a NoSQL database. NoSQL databases are an important part of the modern Information Management landscape. They fall into roughly four categories: Key-Value, Columnar, Document, and Graph. It’s a good idea to have a high-level understanding of each kind of NoSQL database, and to know the use cases for each. Broadly speaking: key-value stores are great for caching. Columnar stores such as Cassandra are great when you’re dealing with big, big data. Document databases do a great job of storing semi-structured data. Graph databases are suitable when you are dealing with a rich, highly connected data domain.
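To make the document-database case concrete, here is a minimal sketch using the pymongo client against a local MongoDB instance; the database, collection, and field names are hypothetical, and the records deliberately have different shapes to show how a document store absorbs semi-structured data.

```python
from pymongo import MongoClient

# Assumes a MongoDB instance is running locally (e.g., the free community edition).
client = MongoClient("mongodb://localhost:27017")
people = client["demo_db"]["people"]  # hypothetical database and collection names

# Documents in the same collection don't need identical structure.
people.insert_many([
    {"name": "Ada", "skills": ["python", "spark"], "city": "St. Louis"},
    {"name": "Grace", "skills": ["kafka"], "manager": {"name": "Linus"}},
])

# Query on a field that only some documents have.
print(people.find_one({"skills": "spark"}))
```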
  5. Platforms such as Snowflake DB provide “warehouse as a service” and accommodate both structured data from relational sources, as well as semi-structured data. Open-source search tools (or services) such as ElasticSearch and search data ingestion tools such as Logstash also allow you to ingest and search data in almost any format. These are available as open-source downloads for you to deploy, as cloud deployments, and also as SaaS.
  6. Data serialization is the process of converting structured data to a format that allows sharing or storage of the data in a form that allows recovery of its original structure. In some cases, the secondary intention of data serialization is to minimize the data’s size which then reduces disk space or bandwidth requirements. Data serialization converts data objects into a byte stream for storage, transfer and distribution purposes on physical devices. Computer systems may vary in their hardware architecture, OS, addressing mechanisms. Internal binary representations of data also vary accordingly in every environment. Storing and exchanging data between such varying environments requires a platform-and-language-neutral data format that all systems understand. Choice of data serialization format for an application depends on factors such as data complexity, need for human readability, speed and storage space constraints. Three common formats are AVRO, JSON and Parquet. Of these three, only JSON is human-readable. The biggest difference between Avro and Parquet is how they store the data. Parquet stores data in columns, while Avro stores data in a row-based format. Column-oriented data stores are optimized for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads.
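As a small illustration of the row-versus-column distinction, the sketch below writes the same records as human-readable JSON and as columnar Parquet using pandas; it assumes pandas and pyarrow are installed, and the file names are just examples.

```python
import pandas as pd

records = [
    {"id": 1, "name": "Ada", "score": 0.91},
    {"id": 2, "name": "Grace", "score": 0.87},
]
df = pd.DataFrame(records)

# JSON Lines: row-oriented and human-readable; good for interchange and debugging.
df.to_json("people.jsonl", orient="records", lines=True)

# Parquet: column-oriented binary format; good for read-heavy analytical scans.
# Requires pyarrow (or fastparquet) to be installed.
df.to_parquet("people.parquet", index=False)

# Reading back a single column touches only that column's data in Parquet.
scores = pd.read_parquet("people.parquet", columns=["score"])
print(scores)
```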
  7. Data Solutions is the competency that deals with the visual expression of data. Dashboard and Infographic development would fall into this competency. The real skill is story telling with data, or unearthing information hidden in all of the data. New tools in this space include Tableau, PowerBI, and Google Looker.
  8. As the size of datasets has exploded, the challenge of telling a story with those datasets has become more complex. A modern practitioner must master the skills of data profiling, sampling, and aggregation. Data profiling simply helps the visualization expert understand the dataset. Tools such as Talend and Informatica come with data profiling capabilities. An understanding of sampling strategies, such as probability-based vs. non-probability-based sampling and when to use which kind is an important skill for data visualization. (https://towardsdatascience.com/sampling-techniques-a4e34111d808) This is a crucial step, since the accuracy of insights from data analysis depends heavily on the amount and quality of data used. It is important to gather high-quality accurate data and a large enough amount to create relevant results.  Finally, what aggregations are meaningful to your industry and/or project? What aggregations might obfuscate vs. illustrate? How important are outliers? (Health Effects Story – Susan)
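Here is a minimal pandas sketch of the three skills mentioned above (profiling, sampling, and aggregation); the file and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("claims.csv")  # hypothetical dataset

# Profiling: understand shape, types, missing values, and basic statistics.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# Sampling: a simple 10% random sample (probability-based); fix the seed for repeatability.
sample = df.sample(frac=0.10, random_state=42)

# Aggregation: summarize a measure by a dimension before visualizing it.
summary = sample.groupby("region")["claim_amount"].agg(["count", "mean", "median"])
print(summary)
```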
  9. Data Governance is the competency that changes least in terms of outputs. Changes most in terms of scale. Cheap storage enables sophisticated data lineage storage. Data Stewardship.
  10. Data Governance is the competency that changes least in terms of outputs. Changes most in terms of scale. Cheap storage enables sophisticated data lineage storage. Data Stewardship.
  11. Data Engineering combines with Data Science to form Decision Science.
  12. Data Science is the competency that garners the most hype. Rumplestiltskin-ing data into insights. Advanced analytics. Tools are commodifying data science to a degree. Machine learning is not just running Data Robot and picking the winner. Operationalizing data science. Need foundational math and statistical knowledge.
  13. Data Science is the competency that garners the most hype. Rumplestiltskin-ing data into insights. Advanced analytics. Tools are commodifying data science to a degree. Machine learning is not just running Data Robot and picking the winner. Operationalizing data science. Need foundational math and statistical knowledge.
  14. SUSAN: Modern Data Architecture is the competency most directly associated with the modern tech stack. MDA is broad, and it includes Cloud, Big Data, Streaming, and Data Engineering. At its heart, MDA is the practice of accomplishing business tasks using programmatic manipulations of the data.
  15. The major cloud platform players continue to be Amazon with AWS, Microsoft with Azure, and Google Cloud. In November 2019, a Goldman Sachs report concluded that AWS, Azure, Google Cloud, and Alibaba Cloud made up 56% of the total cloud market, with that projected to grow to 84% in 2020. The report shows AWS in the considerable lead with 47% of the market projected for this year, with Azure and Google trailing at 22% and 8% market share, respectively. With that said, it will be interesting to see how the actual numbers play out, especially as Google is positioning itself for multi-cloud support, and Azure shows aggressive growth rates. The number of services available in each cloud platform is rapidly expanding, and it is impossible to be familiar with them all. AWS offers over 190 services at this time. I would recommend starting with the foundational services: for AWS that would be EC2 (virtual instances); S3 (object storage); Virtual Private Cloud (setting up your private network in the cloud); and IAM (identity and access management). Take a look at the core services of one platform, then port that knowledge to the other. They all contain roughly the same capabilities – at least at this point in time -- just packaged differently and aimed at users of different experience levels.
  16. So you may wonder: Is Hadoop dead? Not entirely, but it is steadily in decline. It is down to one major vendor: Cloudera. Hadoop has serious issues processing smaller datasets, has security shortcomings, and is limited to batch processing. It also requires mid-level programming skills. More and more, programmers are finding workarounds or fixes to Hadoop’s problems of security and medium-skill programming. For instance, new tools speed up Hadoop’s MapReduce functionality: Apache Spark processes data up to 100 times faster, provides APIs for Python, Scala, Java, and R as well as a SQL interface, and also supports streaming. Another popular option is Kubernetes, which clusters containers across public, private, and hybrid clouds. The open-source container orchestration technology is picking up major traction as developers overwhelmingly embrace container technology. Kubernetes’ speed offers near real-time data analysis, something that Hadoop and MapReduce just can’t offer. A comparison of Google search results indicates that Kubernetes is on the rise just as sharply as Hadoop is in decline. https://trends.google.com/trends/explore?date=all&geo=US&q=hadoop,kubernetes NOTE to self: Remember that Hadoop includes a number of components: the Hadoop Distributed File System (HDFS), which stores files in a Hadoop-native format and parallelizes them across a cluster; YARN, a scheduler that coordinates application runtimes; and MapReduce, the algorithm that actually processes the data in parallel. Hadoop also includes Hive, a SQL-like interface allowing users to run queries on HDFS.
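For a feel of how approachable Spark's Python and SQL interfaces are, here is a minimal local PySpark sketch; the input file and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

# Runs locally; the same code scales out to a cluster unchanged.
spark = SparkSession.builder.master("local[*]").appName("retooling-demo").getOrCreate()

df = spark.read.csv("people.csv", header=True, inferSchema=True)  # hypothetical input

# DataFrame API: count people per state.
counts = df.groupBy("state").agg(F.count("*").alias("people"))
counts.show()

# The SQL interface over the same data.
df.createOrReplaceTempView("people")
spark.sql("SELECT state, COUNT(*) AS people FROM people GROUP BY state").show()

spark.stop()
```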
  17. The ability to perform near real-time analytics is now a basic enterprise expectation. And that often involves data streaming. One extremely popular technology for near real-time data transfer is Kafka. Like many of the technologies mentioned here today, it started as an Apache project. It is so broadly used that it really dominates messaging and pub-sub, so it’s a good technology to understand. And also like many of the technologies mentioned here today, as it has gained in popularity a SQL-like interface has been exposed to lower the barrier to entry: KSQL is the open-source streaming SQL engine on top of Apache Kafka. Another technology that is actually built on top of Kafka is Debezium, an open-source, distributed platform for change data capture.
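A minimal publish/subscribe sketch with the kafka-python client, assuming a broker on localhost:9092; the topic name and payload are hypothetical.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a JSON-encoded event to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("people-events", {"id": 42, "action": "created"})
producer.flush()

# Consumer: subscribe to the same topic and read events as they arrive.
consumer = KafkaConsumer(
    "people-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break  # stop after one message for the sake of the example
```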
  18. Data Engineering is a relatively new area within Modern Data Architectures. Data Engineers are responsible for the creation and maintenance of analytics infrastructures that enable other functions, commonly – but not only – Data Science. Data Engineers use many of the technologies we’ve just discussed. One notable difference between data engineering and traditional information management is that data engineering manages the creation and maintenance of analytics infrastructures within a software engineering framework. Two common programming languages used are Python and Scala, with Python having a considerably lower barrier to entry. Another difference is the kind of work that data engineering enables. Data engineers enable the delivery of machine learning solutions in production and at large scale. They work to reduce model training time and infrastructure costs, and to make it easier and cheaper to build new models. One way to understand how data engineering approaches problem solving is to look at the Feature Store. The feature store is a central place to store curated features within an organization. A feature is a measurable property of some data sample. Features can be extracted directly from files and database tables, or can be derived values computed from one or more data sources. Importantly, the feature store holds data in a format that is understandable for predictive models. Data engineers expose APIs for reading, searching, and adding features. Data Scientists can search for features and use them to build models with minimal data engineering. In addition, features can be cached and reused by other models, reducing model training time and infrastructure costs. Features are now a managed, governed asset in the Enterprise.
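The feature store described above is a product category (Hopsworks, Feast, and others), not a single API; purely as an illustration of the idea, here is a hypothetical, minimal in-process version showing the add / search / read flow a data engineer might expose.

```python
import pandas as pd

class ToyFeatureStore:
    """Hypothetical, in-memory stand-in for a real feature store; illustration only."""

    def __init__(self):
        self._features = {}  # feature name -> (description, Series indexed by entity id)

    def add_feature(self, name, description, values: pd.Series):
        # In a real store this would be versioned, governed, and persisted.
        self._features[name] = (description, values)

    def search(self, keyword):
        return [n for n, (desc, _) in self._features.items() if keyword in desc or keyword in n]

    def get_training_frame(self, names):
        # Join the requested features on the entity id so a model can train on them.
        return pd.concat({n: self._features[n][1] for n in names}, axis=1)

store = ToyFeatureStore()
store.add_feature("avg_order_value", "mean order value per customer", pd.Series({1: 52.0, 2: 17.5}))
store.add_feature("orders_last_30d", "order count per customer, last 30 days", pd.Series({1: 4, 2: 1}))
print(store.search("order"))
print(store.get_training_frame(["avg_order_value", "orders_last_30d"]))
```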
  19. MDA should focus on acquiring skills with Python, Spark, Kafka, containerization (whether that’s Docker or Kubernetes), and at least one cloud platform. If you have an interest in data engineering, understanding software engineering practices is also helpful.
  20. Given the modern data and analytics landscape, how do you retool yourself? Retooling is a five-step process: Awareness, Exposure, Guided Practice, Evolving Practice, Growing Expertise. Awareness is the first step. You can’t learn something that you don’t know that you don’t know. Unconscious incompetence. Awareness can be achieved by reading up on the state of the art in modern data. Recommendations are Data Engineering Weekly and the Data Engineering Podcast. Meetups. Exposure is learning about the tool. Conscious incompetence. What does it do? What problems does it solve? Does it seem like the kind of thing that you want to learn? That you need to learn? Reviews – positive and negative. Guided Practice. Conscious competence. Color-by-numbers exercises. Tutorials. Get someone to help you. Consulting? Evolving Practice. Start with a small problem. Possibly something you’ve solved before. Point a bazooka at an ant hill. Scott’s CD organizer. Retrospective. More complex problem. POC. Growing Expertise. Unconscious competence. Start on eventual development. Start probing for the edges of the technology. Not every tool is appropriate for every problem.
  21. As Shakespeare would say, I had awareness thrust upon me. I was moved to a new team that would be using all new tech.
  22. So if you don’t have awareness thrust upon you, or if you want to find ways of increasing your awareness, what do you do? There is an EXPLOSION of new tech out there. It can be overwhelming. If you want to narrow it down a bit and figure out where to concentrate, I recommend: Listening to Podcasts: these often feature new technology that “has legs” If your organization has an advanced technology or data science group, ask and see what they are using or looking at using Read up on some of the big companies (Amazon, Netflix, AirBnB). There are a lot of interesting articles out there on their data architectures.
  23. The team had already made a few tech selections: Kafka, Neo4J, Spark Streaming. They had an idea for how these would be used together, and I began to understand – at a high level – what the capabilities of these technologies were. TIP: I also read a fair number of use cases on these technologies. Do this before deciding where to invest your time or money. TIP: Many use cases are in blogs. TIP: Books are expensive and become outdated quickly. TIP: There are LOTS of try-it-for-frees… these are also useful for the next step: Guided Practice.
  24. Many of the technologies in use today are open-source, and have a low barrier to entry. I was able to download Kafka, and follow the getting started guide to get everything working on my local machine. That worked great because it was open-source, and there is a getting started guide right on the site. Through the course of working through these guides, I also got an understanding of JSON and Avro. Much of the modern tech stack is either 1) Available for free or 2) In the cloud – where you can sign up for a free account. And there are getting started guides for just about any technology.
  25. I was able to get a fair amount of exposure to Kafka, REST-ful services, and Neo4J. But I didn’t start getting any depth of exposure until I moved to a position at a new company. In this position, I would be building data pipelines and all the tech would be cloud-native, specifically AWS. The key here is to focus on that old saying: Jack of All Trades; Master of None. Don’t get hung up on mastery unless you are in a position where you can really afford to do so. Much of the power of the modern tech stack can only be realized when you use these tools in conjunction with one another. I decided that in order to do this job, I was going to need to really understand the tech I was using. But it was going to be overwhelming to build it all in a new environment. And I am also someone who works better within a known framework or frame of reference. So here’s what I did: I picked a data pipeline that was familiar to me. And I decided to build it on my own, using only one new piece of tech at a time. That would give me a little depth of practice while building out the larger picture.
  26. I picked a simple pipeline: ingest a file with some people data, clean that people data, match it against some known people, pick the best match for each input, and save the results. I knew that picking the best “match” for a person was going to be a feature our data scientists would want to use. In the past I had done this sort of thing with a basic ETL process developed using ETL tools. Data would be ingested, staged in tables, joined to other tables, etc. But one of the key features of the modern tech stack is a movement away from strictly SQL-based ETL, and a move towards code-based ETL. More specifically, JVM-based ETL (for performance reasons on large datasets). I knew that Python would be a valuable skill going forward and had a lower barrier to entry, so instead of going for a JVM-based language I started with Python. I had some experience with Python from tooling around with it at home. I wrote very simple Python scripts that did a bare minimum of each step, one calling the other. This got me experience with manipulating data in Python, and also introduced me to a variety of Python libraries.
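A skeletal version of what those first scripts might have looked like: each step is a small function, one calling the next; the function and file names are hypothetical and the real logic is elided.

```python
import csv

def ingest_people(path):
    # Read the raw people file into a list of dicts.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean_people(rows):
    # Bare-minimum cleaning: trim whitespace and normalize case.
    return [{k: (v or "").strip().lower() for k, v in row.items()} for row in rows]

def find_match_candidates(rows, known_people):
    # Placeholder matching rule: same last name counts as a candidate.
    return [(row, [p for p in known_people if p["last_name"] == row["last_name"]]) for row in rows]

def pick_best_candidate(matches):
    # Trivial selection: first candidate, if any.
    return [(row, candidates[0] if candidates else None) for row, candidates in matches]

def save_results(results, path):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row, best in results:
            writer.writerow([row["last_name"], best["last_name"] if best else ""])

if __name__ == "__main__":
    known = [{"last_name": "smith"}, {"last_name": "jones"}]  # stand-in for the MySQL lookup
    rows = clean_people(ingest_people("people_in.csv"))
    save_results(pick_best_candidate(find_match_candidates(rows, known)), "people_out.csv")
```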
  27. Next, I decided to replace the local input file and output files with files read from Amazon’s Simple Storage Service, or S3. S3 is a cornerstone service offering, and it is basically an object store. By deciding to read from and write to S3, I got experience using not just S3, but also using the AWS APIs for interacting with S3 (called boto for Python). Additionally, it got me accustomed to working with AWS security credentials and understanding how role based security works in AWS Identity Access Manager, as well as setting security policies on S3. TIP: Don’t skimp on security. Consider it from the beginning. You can’t afford to put it in last.
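A minimal sketch of the S3 swap using boto3 (the current Python AWS SDK); bucket and key names are hypothetical, and credentials are assumed to come from the environment or an IAM role.

```python
import boto3

s3 = boto3.client("s3")

# Read the input file from S3 instead of the local filesystem.
obj = s3.get_object(Bucket="retooling-demo-bucket", Key="input/people_in.csv")
raw_csv = obj["Body"].read().decode("utf-8")

# ... run the existing cleaning / matching steps on raw_csv ...
results_csv = raw_csv  # placeholder for the real pipeline output

# Write the results back to S3 instead of a local file.
s3.put_object(
    Bucket="retooling-demo-bucket",
    Key="output/people_out.csv",
    Body=results_csv.encode("utf-8"),
)
```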
  28. Next I decided to replace my local MySQL database used for lookups with an AWS Relational Database Service instance (PostgreSQL flavor). This got me experience with creating RDS instances, accessing those instances, setting up security on those instances, and accessing data via the AWS APIs.
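Once the lookup data lives in an RDS PostgreSQL instance, the Python code talks to it like any other Postgres database; a minimal sketch with psycopg2, with the endpoint, credentials, and table name all hypothetical.

```python
import psycopg2

# The host is the RDS instance endpoint; credentials would normally come from a secrets store.
conn = psycopg2.connect(
    host="retooling-demo.abc123.us-east-1.rds.amazonaws.com",
    dbname="people",
    user="app_user",
    password="example-password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT person_id, last_name FROM known_people WHERE last_name = %s", ("smith",))
    for person_id, last_name in cur.fetchall():
        print(person_id, last_name)
conn.close()
```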
  29. Next I took my Python code, and used it to create AWS Lambdas – small, serverless functions you can deploy in AWS in a variety of languages. Lambdas can be easily triggered by an object landing in an S3 bucket. But they can only run for 15 minutes. This got me experience with triggering, Lambda creation and deployment.
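The shape of one of those Lambda functions, triggered by an object landing in S3; a minimal sketch with only the bucket and key handling shown, and the real cleaning logic elided.

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # For an S3 trigger, the event lists the bucket and object key that caused the invocation.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    cleaned = body.lower()  # placeholder for the real cleaning step

    s3.put_object(Bucket=bucket, Key=f"cleaned/{key}", Body=cleaned.encode("utf-8"))
    return {"status": "ok", "key": key}
```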
  30. All my components were now in AWS, but I was having to manually run each Lambda function. The last piece to consider was orchestration. I opted to use an AWS Step Function as it allowed me to orchestrate my workflow, as well as to maintain a state machine that would catch errors.
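A fragment of what the orchestration might look like in Amazon States Language, written here as a Python dict and registered with boto3; the state machine covers only the first two steps, with a simple Catch to route errors, and all ARNs and names are hypothetical.

```python
import json
import boto3

definition = {
    "StartAt": "CleanPeopleData",
    "States": {
        "CleanPeopleData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:clean-people-data",
            "Next": "FindMatchCandidates",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
        },
        "FindMatchCandidates": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:find-match-candidates",
            "End": True,
        },
        "HandleFailure": {"Type": "Fail", "Cause": "Pipeline step failed"},
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="people-matching-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-function-role",
)
```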
  31. In order to grow your expertise, you can take your MVP and build upon it. Consider adding new features. New features may introduce new tools, or they may get you more experience with the tools you already use.
  32. For example, I decided to make the format of the input files configurable and used a DynamoDB key/value store in AWS to keep track of my configuration data.
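A minimal boto3 sketch of that configuration lookup; the table name, key, and attributes are hypothetical.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
config_table = dynamodb.Table("pipeline-config")  # hypothetical table keyed by source name

# Store the expected layout for one input source.
config_table.put_item(Item={
    "source": "vendor_a",
    "delimiter": "|",
    "columns": ["first_name", "last_name", "dob"],
})

# Later, the ingest step reads the layout before parsing the file.
item = config_table.get_item(Key={"source": "vendor_a"}).get("Item", {})
print(item["delimiter"], item["columns"])
```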
  33. You can also grow your expertise by subjecting your MVP to increasing scale, or by deciding you want to support increased scalability.
  34. For example, AWS Lambdas can only run for 15 minutes. If I wanted to clean, find candidate matches for, and pick the best candidate for a number of inputs beyond, say, 30K… I was going to have to find ways to make my processes run longer. So I containerized my Python code (Docker!), pushed my containers to Amazon’s container registry (ECR!), and set up an ECS Cluster where the containers could be run. By doing cleansing, candidate generation, and candidate selection in ECS, I could support longer-running processes on larger input files.
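Once the long-running steps are packaged as container images, something has to launch them; a minimal boto3 sketch of kicking off one ECS task on Fargate, with the cluster, task definition, container name, and network identifiers all hypothetical.

```python
import boto3

ecs = boto3.client("ecs")

response = ecs.run_task(
    cluster="people-matching-cluster",
    taskDefinition="find-match-candidates:1",
    launchType="FARGATE",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
    overrides={
        "containerOverrides": [
            {"name": "matcher", "environment": [{"name": "INPUT_KEY", "value": "input/people_in.csv"}]}
        ]
    },
)
print(response["tasks"][0]["taskArn"])
```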
  35. So that’s great that my processes can run longer now on larger inputs, but I’d really like my slowest tasks to become more performant. I am limited to referencing an RDS instance. But what if I could parallelize the candidate generation process by using Spark?
  36. I looked at spinning up an AWS Elastic MapReduce (AWS’s managed Hadoop solution) cluster and submitting candidate match selection jobs to it. I knew that Scala was more performant with Spark than Python, and so wrote a matching app using Scala. I didn’t even end up using it! But the experience I got using Spark, Scala, and EMR was put to good use on other data pipelines.
  37. Where would I go next? What if I wanted to stream in records and match them as they stream in, rather than waiting for a file to land? How would that affect my solution? Are any of my components re-usable? How could I re-architect to make them re-usable? At this point, this is a thought exercise.
  38. Introducing automation, such as test automation, continuous integration and deployment pipelines (Jenkins, Groovy), and infrastructure as code (Terraform), can also expose you to a wider variety of tools and technologies, as well as deepen your knowledge and discipline.
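On the test-automation side, even a couple of pytest cases around a pipeline step pay off quickly; a hypothetical sketch against the cleaning function from the earlier pipeline example.

```python
# test_clean_people.py (run with pytest)
from pipeline import clean_people  # hypothetical module containing the cleaning step

def test_clean_people_strips_and_lowercases():
    rows = [{"last_name": "  Smith ", "city": "St. Louis"}]
    assert clean_people(rows) == [{"last_name": "smith", "city": "st. louis"}]

def test_clean_people_handles_missing_values():
    rows = [{"last_name": None, "city": ""}]
    assert clean_people(rows) == [{"last_name": "", "city": ""}]
```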
  39. Conclusion: don’t boil the ocean. Choose tools that you are actually going to get to use, and that you want to use. Concentrate all firepower on the Super Star Destroyer. Don’t try to do too much at once.