Retooling on the Modern Data and Analytics Stack
2/2020
What is the Modern Tech Stack?
The tools and technologies needed to solve problems that are difficult because of their size, speed, and complexity.
Competencies
Information Management
Data Solutions
Modern Data Architectures
Data Science
Data Governance
Information Management
Data loading
Data modeling
Querying
IM Focus: NoSQL
IM Focus: Platforms & Services
IM Focus: Serialization
JSON
Data Solutions
Tell me a story…
Data Solutions Focus
Profiling
Sampling
Aggregation
Data Governance
Governance outputs are eternal.
Focus On…
Scale
Data Engineering + Data Science = Decision Science
Data Science
Focus On…
Modern Data Architecture
Programmatic data manipulation
Cloud
Big Data
Streaming
Kafka for Publish/Subscribe
KSQL – Kafka + SQL
Debezium – Change Data Capture
Data Engineering
https://www.logicalclocks.com/blog/feature-store-the-missing-data-layer-in-ml-pipelines
Focus on…
Five Steps to Retooling
Awareness
Exposure
Guided Practice
Evolving Practice
Growing Expertise
AWARENESS
Aware, -ish
AWARENESS
Podcasts
Data science / advanced tech groups
Major tech companies
EXPOSURE
Use case studies
Blogs
Try-it-for-free
GUIDED PRACTICE
Online courses
Free AWS and Azure accounts
Open-source downloads
EVOLVING PRACTICE
1. Pick something familiar
2. Make it a little strange
3. Rinse & repeat
Starting pipeline (all local): Ingest People Data → Clean People Data → Find Match Candidates → Pick Best Candidate → Save Results. Each step is a local Python script; match lookups come from a local MySQL database; input and output are local files.
Same pipeline, with the local input and output files replaced by objects in AWS S3; the steps are still local Python against local MySQL.
Same pipeline, with the local MySQL lookup database replaced by AWS RDS; the steps are still local Python, reading from and writing to S3.
Same pipeline, with each local Python step rewritten as a Python AWS Lambda function; data lives in S3 and RDS.
Same pipeline, now orchestrated by an AWS Step Function that invokes the five Python Lambda functions; data lives in S3 and RDS.
GROWING EXPERTISE
1. Add new features
Add New Features: the same Step Function / Lambda pipeline, with a DynamoDB table added for configuration data; other data remains in S3 and RDS.
GROWING EXPERTISE
1. Add new features
2. Improve scalability
Improve Scalability: the Clean People Data, Find Match Candidates, and Pick Best Candidate steps move from Python Lambdas to ECS Tasks; Ingest and Save remain Python Lambdas; the Step Function still orchestrates, with data in S3, RDS, and DynamoDB.
GROWING EXPERTISE
1. Add new features
2. Improve scalability
3. Improve performance
Improve Performance: the Find Match Candidates step moves from an ECS Task to a Spark (Scala) job on EMR; the other steps remain Python Lambdas and ECS Tasks under the Step Function, with data in S3 and DynamoDB.
GROWING EXPERTISE
1. Add new features
2. Improve scalability
3. Improve performance
4. Batch vs. stream
GROWING EXPERTISE
1. Add new features
2. Improve scalability
3. Improve performance
4. Batch vs. stream
5. Automation
Conclusion
Don’t do too much at once!
Questions?
Resources
General
• https://www.analyticsvidhya.com/blog/2018/11/data-engineer-comprehensive-list-resources-get-started/
• https://towardsdatascience.com/who-is-a-data-engineer-how-to-become-a-data-engineer-1167ddc12811
• https://www.dataquest.io/path/data-engineer/
• https://dataengweekly.com/
Podcasts
• https://towardsdatascience.com/our-podcast-c5c1129bc5cf
• https://www.stitcher.com/podcast/httpanalyticshourlibsyncom/the-digital-analytics-power-hour
• https://www.stitcher.com/podcast/data-stories-podcast/data-stories
• https://www.stitcher.com/podcast/data-skeptic-podcast/the-data-skeptic-podcast (Data Science focused)
• https://www.stitcher.com/podcast/oreilly-media-2/the-oreilly-data-show-podcast?refid=stpr
• https://www.dataengineeringpodcast.com/
Reference Architectures
• https://medium.com/refraction-tech-everything/how-netflix-works-the-hugely-simplified-complex-stuff-that-happens-every-time-you-hit-play-3a40c9be254b
• http://highscalability.com/blog/2015/11/9/a-360-degree-view-of-the-entire-netflix-stack.html (older but interesting)
• https://medium.com/airbnb-engineering/airbnb-engineering-infrastructure/home
• https://towardsdatascience.com/how-linkedin-uber-lyft-airbnb-and-netflix-are-solving-data-management-and-discovery-for-machine-9b79ee9184bb
Resources – continued
Use Cases
• https://www.mongodb.com/use-cases
• https://www.confluent.io/blog/category/use-cases/
• https://kafka.apache.org/uses
• https://aws.amazon.com/big-data/use-cases/
• https://www.dataversity.net/eight-big-data-analytics-options-on-microsoft-azure/
• https://www.toptal.com/spark/introduction-to-apache-spark
Try it for Free
• https://neo4j.com/sandbox/ (Neo4j)
• https://www.mongodb.com/cloud/atlas/lp/general/try (MongoDB)
• https://www.postman.com/ + https://www.guru99.com/postman-tutorial.html (trying out APIs)
• https://databricks.com/try-databricks (Spark)
• https://jupyter.org/try
Resources – continued
Open Source Downloads + Guides
• https://spark.apache.org/docs/latest/index.html
• https://kafka.apache.org/documentation/#gettingStarted
• https://www.mongodb.com/download-center/community
• https://www.python.org/about/gettingstarted/
Free Cloud Accounts
• https://aws.amazon.com/free/
• https://azure.microsoft.com/en-us/free/
Online Training
• www.acloud.guru *Recommended
• www.coursera.org
• www.udemy.com
Editor's Notes
  1. The modern tech stack is the tools and technologies needed to solve difficult problems due to their size, speed, and complexity.
  2. That set of tools and technologies is broad. To make it more manageable, it’s useful to divide it into areas of focus, or competencies: Information Management, Data Solutions, Modern Data Architecture, Data Science, and Data Governance.
  3. Information management is the competency that deals with databases. Loading, modeling, storing, querying. SQL is eternal. As you’ll see – many of the more modern technologies expose SQL interfaces.
  4. Traditional relational databases work well when your data is predictable and fits well into tables, columns, and rows, and where queries are not very join-intensive. But if your data is unpredictable or unstructured, or when it is highly connected or you need lightning-fast performance, you may consider a NoSQL database. NoSQL databases are an important part of the modern Information Management landscape. They fall into roughly four categories: Key-Value, Columnar, Document, and Graph. It’s a good idea to have a high-level understanding of each kind of NoSQL database, and to know the use cases for each. Broadly speaking: key-value stores are great for caching. Columnar stores such as Cassandra are great when you’re dealing with big, big data. Document databases do a great job of storing semi-structured data. Graph databases are suitable when you are dealing with a rich, highly connected data domain.
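To make the document-database case concrete, here is a minimal sketch using the pymongo client against a local MongoDB instance; the database, collection, and field names are hypothetical, and the records deliberately have different shapes to show how a document store absorbs semi-structured data.

```python
from pymongo import MongoClient

# Assumes a MongoDB instance is running locally (e.g., the free community edition).
client = MongoClient("mongodb://localhost:27017")
people = client["demo_db"]["people"]  # hypothetical database and collection names

# Documents in the same collection don't need identical structure.
people.insert_many([
    {"name": "Ada", "skills": ["python", "spark"], "city": "St. Louis"},
    {"name": "Grace", "skills": ["kafka"], "manager": {"name": "Linus"}},
])

# Query on a field that only some documents have.
print(people.find_one({"skills": "spark"}))
```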
  5. Platforms such as Snowflake DB provide “warehouse as a service” and accommodate both structured data from relational sources, as well as semi-structured data. Open-source search tools (or services) such as ElasticSearch and search data ingestion tools such as Logstash also allow you to ingest and search data in almost any format. These are available as open-source downloads for you to deploy, as cloud deployments, and also as SaaS.
  6. Data serialization is the process of converting structured data to a format that allows sharing or storage of the data in a form that allows recovery of its original structure. In some cases, the secondary intention of data serialization is to minimize the data’s size which then reduces disk space or bandwidth requirements. Data serialization converts data objects into a byte stream for storage, transfer and distribution purposes on physical devices. Computer systems may vary in their hardware architecture, OS, addressing mechanisms. Internal binary representations of data also vary accordingly in every environment. Storing and exchanging data between such varying environments requires a platform-and-language-neutral data format that all systems understand. Choice of data serialization format for an application depends on factors such as data complexity, need for human readability, speed and storage space constraints. Three common formats are AVRO, JSON and Parquet. Of these three, only JSON is human-readable. The biggest difference between Avro and Parquet is how they store the data. Parquet stores data in columns, while Avro stores data in a row-based format. Column-oriented data stores are optimized for read-heavy analytical workloads, while row-based databases are best for write-heavy transactional workloads.
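As a small illustration of the row-versus-column distinction, the sketch below writes the same records as human-readable JSON and as columnar Parquet using pandas; it assumes pandas and pyarrow are installed, and the file names are just examples.

```python
import pandas as pd

records = [
    {"id": 1, "name": "Ada", "score": 0.91},
    {"id": 2, "name": "Grace", "score": 0.87},
]
df = pd.DataFrame(records)

# JSON Lines: row-oriented and human-readable; good for interchange and debugging.
df.to_json("people.jsonl", orient="records", lines=True)

# Parquet: column-oriented binary format; good for read-heavy analytical scans.
# Requires pyarrow (or fastparquet) to be installed.
df.to_parquet("people.parquet", index=False)

# Reading back a single column touches only that column's data in Parquet.
scores = pd.read_parquet("people.parquet", columns=["score"])
print(scores)
```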
  7. Data Solutions is the competency that deals with the visual expression of data. Dashboard and Infographic development would fall into this competency. The real skill is story telling with data, or unearthing information hidden in all of the data. New tools in this space include Tableau, PowerBI, and Google Looker.
  8. As the size of datasets has exploded, the challenge of telling a story with those datasets has become more complex. A modern practitioner must master the skills of data profiling, sampling, and aggregation. Data profiling simply helps the visualization expert understand the dataset. Tools such as Talend and Informatica come with data profiling capabilities. An understanding of sampling strategies, such as probability-based vs. non-probability-based sampling and when to use which kind is an important skill for data visualization. (https://towardsdatascience.com/sampling-techniques-a4e34111d808) This is a crucial step, since the accuracy of insights from data analysis depends heavily on the amount and quality of data used. It is important to gather high-quality accurate data and a large enough amount to create relevant results.  Finally, what aggregations are meaningful to your industry and/or project? What aggregations might obfuscate vs. illustrate? How important are outliers? (Health Effects Story – Susan)
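Here is a minimal pandas sketch of the three skills mentioned above (profiling, sampling, and aggregation); the file and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("claims.csv")  # hypothetical dataset

# Profiling: understand shape, types, missing values, and basic statistics.
print(df.shape)
print(df.dtypes)
print(df.isna().sum())
print(df.describe())

# Sampling: a simple 10% random sample (probability-based); fix the seed for repeatability.
sample = df.sample(frac=0.10, random_state=42)

# Aggregation: summarize a measure by a dimension before visualizing it.
summary = sample.groupby("region")["claim_amount"].agg(["count", "mean", "median"])
print(summary)
```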
  9. Data Governance is the competency that changes least in terms of outputs. Changes most in terms of scale. Cheap storage enables sophisticated data lineage storage. Data Stewardship.
  10. Data Governance is the competency that changes least in terms of outputs. Changes most in terms of scale. Cheap storage enables sophisticated data lineage storage. Data Stewardship.
  11. Data Engineering combines with Data Science to form Decision Science.
  12. Data Science is the competency that garners the most hype. Rumplestiltskin-ing data into insights. Advanced analytics. Tools are commodifying data science to a degree. Machine learning is not just running Data Robot and picking the winner. Operationalizing data science. Need foundational math and statistical knowledge.
  13. Data Science is the competency that garners the most hype. Rumplestiltskin-ing data into insights. Advanced analytics. Tools are commodifying data science to a degree. Machine learning is not just running Data Robot and picking the winner. Operationalizing data science. Need foundational math and statistical knowledge.
  14. SUSAN: Modern Data Architecture is the competency most directly associated with the modern tech stack. MDA is broad, and it includes Cloud, Big Data, Streaming, and Data Engineering. At its heart, MDA is the practice of accomplishing business tasks using programmatic manipulations of the data.
  15. The major cloud platform players continue to be Amazon with AWS, Microsoft with Azure, and Google Cloud. In November 2019, a Goldman Sachs report concluded that AWS, Azure, Google Cloud, and Alibaba Cloud made up 56% of the total cloud market, with that projected to grow to 84% in 2020. The report shows AWS in the considerable lead with 47% of the market projected for this year, with Azure and Google trailing at 22% and 8% market share, respectively. With that said, it will be interesting to see how the actual numbers play out, especially as Google is positioning itself for multi-cloud support, and Azure shows aggressive growth rates. The number of services available in each cloud platform is rapidly expanding, and it is impossible to be familiar with them all. AWS offers over 190 services at this time. I would recommend starting with the foundational services: for AWS that would be EC2 (virtual instances); S3 (object storage); Virtual Private Cloud (setting up your private network in the cloud); and IAM (identity and access management). Take a look at the core services of one platform, then port that knowledge to the other. They all contain roughly the same capabilities – at least at this point in time -- just packaged differently and aimed at users of different experience levels.
  16. So you may wonder: Is Hadoop dead? Not entirely, but it is steadily in decline. It is down to one major vendor: Cloudera. Hadoop has serious issues processing smaller datasets, has security shortcomings, and is limited to batch processing. It also requires mid-level programming skills. More and more, programmers are finding workarounds or fixes to Hadoop’s problems of security and medium-skill programming. For instance, new tools speed up Hadoop’s MapReduce functionality: Apache Spark processes data up to 100 times faster, provides APIs for Python, Scala, Java, and R as well as a SQL interface, and also supports streaming. Another popular option is Kubernetes, which clusters containers across public, private, and hybrid clouds. The open-source container orchestration technology is picking up major traction as developers overwhelmingly embrace container technology. Kubernetes’ speed offers near real-time data analysis, something that Hadoop and MapReduce just can’t offer. A comparison of Google search results indicates that Kubernetes is on the rise just as sharply as Hadoop is in decline. https://trends.google.com/trends/explore?date=all&geo=US&q=hadoop,kubernetes NOTE to self: Remember that Hadoop includes a number of components: the Hadoop Distributed File System (HDFS), which stores files in a Hadoop-native format and parallelizes them across a cluster; YARN, a scheduler that coordinates application runtimes; and MapReduce, the algorithm that actually processes the data in parallel. Hadoop also includes Hive, a SQL-like interface allowing users to run queries on HDFS.
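For a feel of how approachable Spark's Python and SQL interfaces are, here is a minimal local PySpark sketch; the input file and column names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

# Runs locally; the same code scales out to a cluster unchanged.
spark = SparkSession.builder.master("local[*]").appName("retooling-demo").getOrCreate()

df = spark.read.csv("people.csv", header=True, inferSchema=True)  # hypothetical input

# DataFrame API: count people per state.
counts = df.groupBy("state").agg(F.count("*").alias("people"))
counts.show()

# The SQL interface over the same data.
df.createOrReplaceTempView("people")
spark.sql("SELECT state, COUNT(*) AS people FROM people GROUP BY state").show()

spark.stop()
```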
  17. The ability to perform near real-time analytics is now a basic enterprise expectation. And that often involves data streaming. One extremely popular technology for near real-time data transfer is Kafka. Like many of the technologies mentioned here today, it started as an Apache project. It is so broadly used that it really dominates messaging and pub-sub, so it’s a good technology to understand. And also like many of the technologies mentioned here today, as it has gained in popularity a SQL-like interface has been exposed to lower the barrier to entry: KSQL is the open-source streaming SQL engine on top of Apache Kafka. Another technology that is actually built on top of Kafka is Debezium, an open-source, distributed platform for change data capture.
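A minimal publish/subscribe sketch with the kafka-python client, assuming a broker on localhost:9092; the topic name and payload are hypothetical.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a JSON-encoded event to a topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("people-events", {"id": 42, "action": "created"})
producer.flush()

# Consumer: subscribe to the same topic and read events as they arrive.
consumer = KafkaConsumer(
    "people-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)
    break  # stop after one message for the sake of the example
```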
  18. Data Engineering is a relatively new area within Modern Data Architectures. Data Engineers are responsible for the creation and maintenance of analytics infrastructures that enable other functions, commonly – but not only – Data Science. Data Engineers use many of the technologies we’ve just discussed. One notable difference between data engineering and traditional information management is that data engineering manages the creation and maintenance of analytics infrastructures within a software engineering framework. Two common programming languages used are Python and Scala, with Python having a considerably lower barrier to entry. Another difference is the kind of work that data engineering enables. Data engineers enable the delivery of machine learning solutions in production and at large scale. They work to reduce model training time and infrastructure costs, and to make it easier and cheaper to build new models. One way to understand how data engineering approaches problem solving is to look at the Feature Store. The feature store is a central place to store curated features within an organization. A feature is a measurable property of some data sample. Features can be extracted directly from files and database tables, or can be derived values computed from one or more data sources. Importantly, the feature store holds data in a format that is understandable for predictive models. Data engineers expose APIs for reading, searching, and adding features. Data Scientists can search for features and use them to build models with minimal data engineering. In addition, features can be cached and reused by other models, reducing model training time and infrastructure costs. Features are now a managed, governed asset in the Enterprise.
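The feature store described above is a product category (Hopsworks, Feast, and others), not a single API; purely as an illustration of the idea, here is a hypothetical, minimal in-process version showing the add / search / read flow a data engineer might expose.

```python
import pandas as pd

class ToyFeatureStore:
    """Hypothetical, in-memory stand-in for a real feature store; illustration only."""

    def __init__(self):
        self._features = {}  # feature name -> (description, Series indexed by entity id)

    def add_feature(self, name, description, values: pd.Series):
        # In a real store this would be versioned, governed, and persisted.
        self._features[name] = (description, values)

    def search(self, keyword):
        return [n for n, (desc, _) in self._features.items() if keyword in desc or keyword in n]

    def get_training_frame(self, names):
        # Join the requested features on the entity id so a model can train on them.
        return pd.concat({n: self._features[n][1] for n in names}, axis=1)

store = ToyFeatureStore()
store.add_feature("avg_order_value", "mean order value per customer", pd.Series({1: 52.0, 2: 17.5}))
store.add_feature("orders_last_30d", "order count per customer, last 30 days", pd.Series({1: 4, 2: 1}))
print(store.search("order"))
print(store.get_training_frame(["avg_order_value", "orders_last_30d"]))
```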
  19. MDA should focus on acquiring skills with Python, Spark, Kafka, containerization (whether that’s Docker or Kubernetes), and at least one cloud platform. If you have an interest in data engineering, understanding software engineering practices is also helpful.
  20. Given the modern data and analytics landscape, how do you retool yourself? Retooling is a five-step process: Awareness, Exposure, Guided Practice, Evolving Practice, Growing Expertise. Awareness is the first step. You can’t learn something that you don’t know that you don’t know. Unconscious incompetence. Awareness can be achieved by reading up on the state of the art in modern data. Recommendations are Data Engineering Weekly and the Data Engineering Podcast. Meetups. Exposure is learning about the tool. Conscious incompetence. What does it do? What problems does it solve? Does it seem like the kind of thing that you want to learn? That you need to learn? Reviews – positive and negative. Guided Practice. Conscious competence. Color-by-numbers exercises. Tutorials. Get someone to help you. Consulting? Evolving Practice. Start with a small problem. Possibly something you’ve solved before. Point a bazooka at an ant hill. Scott’s CD organizer. Retrospective. More complex problem. POC. Growing Expertise. Unconscious competence. Start on eventual development. Start probing for the edges of the technology. Not every tool is appropriate for every problem.
  21. As Shakespeare would say, I had awareness thrust upon me. I was moved to a new team that would be using all new tech.
  22. So if you don’t have awareness thrust upon you, or if you want to find ways of increasing your awareness, what do you do? There is an EXPLOSION of new tech out there. It can be overwhelming. If you want to narrow it down a bit and figure out where to concentrate, I recommend: Listening to Podcasts: these often feature new technology that “has legs” If your organization has an advanced technology or data science group, ask and see what they are using or looking at using Read up on some of the big companies (Amazon, Netflix, AirBnB). There are a lot of interesting articles out there on their data architectures.
  23. The team had already made a few tech selections: Kafka, Neo4J, Spark Streaming. They had an idea for how these would be used together, and I began to understand – at a high level – what the capabilities of these technologies were. TIP: I also read a fair number of use cases on these technologies. Do this before deciding where to invest your time or money. TIP: Many use cases are in blogs. TIP: Books are expensive and become outdated quickly. TIP: There are LOTS of try-it-for-frees… these are also useful for the next step: Guided Practice.
  24. Many of the technologies in use today are open-source, and have a low barrier to entry. I was able to download Kafka, and follow the getting started guide to get everything working on my local machine. That worked great because it was open-source, and there is a getting started guide right on the site. Through the course of working through these guides, I also got an understanding of JSON and Avro. Much of the modern tech stack is either 1) Available for free or 2) In the cloud – where you can sign up for a free account. And there are getting started guides for just about any technology.
  25. I was able to get a fair amount of exposure to Kafka, REST-ful services, and Neo4J. But I didn’t start getting any depth of exposure until I moved to a position at a new company. In this position, I would be building data pipelines and all the tech would be cloud-native, specifically AWS. The key here is to focus on that old saying: Jack of All Trades; Master of None. Don’t get hung up on mastery unless you are in a position where you can really afford to do so. Much of the power of the modern tech stack can only be realized when you use these tools in conjunction with one another. I decided that in order to do this job, I was going to need to really understand the tech I was using. But it was going to be overwhelming to build it all in a new environment. And I am also someone who works better within a known framework or frame of reference. So here’s what I did: I picked a data pipeline that was familiar to me. And I decided to build it on my own, using only one new piece of tech at a time. That would give me a little depth of practice while building out the larger picture.
  26. I picked a simple pipeline: ingest a file with some people data, clean that people data, match it against some known people, pick the best match for each input, and save the results. I knew that picking the best “match” for a person was going to be a feature our data scientists would want to use. In the past I had done this sort of thing with a basic ETL process developed using ETL tools. Data would be ingested, staged in tables, joined to other tables, etc. But one of the key features of the modern tech stack is a movement away from strictly SQL-based ETL, and a move towards code-based ETL. More specifically, JVM-based ETL (for performance reasons on large datasets). I knew that Python would be a valuable skill going forward and had a lower barrier to entry, so instead of going for a JVM-based language I started with Python. I had some experience with Python from tooling around with it at home. I wrote very simple Python scripts that did a bare minimum of each step, one calling the other. This got me experience with manipulating data in Python, and also introduced me to a variety of Python libraries.
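A skeletal version of what those first scripts might have looked like: each step is a small function, one calling the next; the function and file names are hypothetical and the real logic is elided.

```python
import csv

def ingest_people(path):
    # Read the raw people file into a list of dicts.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def clean_people(rows):
    # Bare-minimum cleaning: trim whitespace and normalize case.
    return [{k: (v or "").strip().lower() for k, v in row.items()} for row in rows]

def find_match_candidates(rows, known_people):
    # Placeholder matching rule: same last name counts as a candidate.
    return [(row, [p for p in known_people if p["last_name"] == row["last_name"]]) for row in rows]

def pick_best_candidate(matches):
    # Trivial selection: first candidate, if any.
    return [(row, candidates[0] if candidates else None) for row, candidates in matches]

def save_results(results, path):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row, best in results:
            writer.writerow([row["last_name"], best["last_name"] if best else ""])

if __name__ == "__main__":
    known = [{"last_name": "smith"}, {"last_name": "jones"}]  # stand-in for the MySQL lookup
    rows = clean_people(ingest_people("people_in.csv"))
    save_results(pick_best_candidate(find_match_candidates(rows, known)), "people_out.csv")
```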
  27. Next, I decided to replace the local input file and output files with files read from Amazon’s Simple Storage Service, or S3. S3 is a cornerstone service offering, and it is basically an object store. By deciding to read from and write to S3, I got experience using not just S3, but also using the AWS APIs for interacting with S3 (called boto for Python). Additionally, it got me accustomed to working with AWS security credentials and understanding how role based security works in AWS Identity Access Manager, as well as setting security policies on S3. TIP: Don’t skimp on security. Consider it from the beginning. You can’t afford to put it in last.
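A minimal sketch of the S3 swap using boto3 (the current Python AWS SDK); bucket and key names are hypothetical, and credentials are assumed to come from the environment or an IAM role.

```python
import boto3

s3 = boto3.client("s3")

# Read the input file from S3 instead of the local filesystem.
obj = s3.get_object(Bucket="retooling-demo-bucket", Key="input/people_in.csv")
raw_csv = obj["Body"].read().decode("utf-8")

# ... run the existing cleaning / matching steps on raw_csv ...
results_csv = raw_csv  # placeholder for the real pipeline output

# Write the results back to S3 instead of a local file.
s3.put_object(
    Bucket="retooling-demo-bucket",
    Key="output/people_out.csv",
    Body=results_csv.encode("utf-8"),
)
```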
  28. Next I decided to replace my local MySQL database used for lookups with an AWS Relational Database Service instance (PostgreSQL flavor). This got me experience with creating RDS instances, accessing those instances, setting up security on those instances, and accessing data via the AWS APIs.
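Once the lookup data lives in an RDS PostgreSQL instance, the Python code talks to it like any other Postgres database; a minimal sketch with psycopg2, with the endpoint, credentials, and table name all hypothetical.

```python
import psycopg2

# The host is the RDS instance endpoint; credentials would normally come from a secrets store.
conn = psycopg2.connect(
    host="retooling-demo.abc123.us-east-1.rds.amazonaws.com",
    dbname="people",
    user="app_user",
    password="example-password",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT person_id, last_name FROM known_people WHERE last_name = %s", ("smith",))
    for person_id, last_name in cur.fetchall():
        print(person_id, last_name)
conn.close()
```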
  29. Next I took my Python code, and used it to create AWS Lambdas – small, serverless functions you can deploy in AWS in a variety of languages. Lambdas can be easily triggered by an object landing in an S3 bucket. But they can only run for 15 minutes. This got me experience with triggering, Lambda creation and deployment.
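The shape of one of those Lambda functions, triggered by an object landing in S3; a minimal sketch with only the bucket and key handling shown, and the real cleaning logic elided.

```python
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # For an S3 trigger, the event lists the bucket and object key that caused the invocation.
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    cleaned = body.lower()  # placeholder for the real cleaning step

    s3.put_object(Bucket=bucket, Key=f"cleaned/{key}", Body=cleaned.encode("utf-8"))
    return {"status": "ok", "key": key}
```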
  30. All my components were now in AWS, but I was having to manually run each Lambda function. The last piece to consider was orchestration. I opted to use an AWS Step Function as it allowed me to orchestrate my workflow, as well as to maintain a state machine that would catch errors.
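A fragment of what the orchestration might look like in Amazon States Language, written here as a Python dict and registered with boto3; the state machine covers only the first two steps, with a simple Catch to route errors, and all ARNs and names are hypothetical.

```python
import json
import boto3

definition = {
    "StartAt": "CleanPeopleData",
    "States": {
        "CleanPeopleData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:clean-people-data",
            "Next": "FindMatchCandidates",
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "HandleFailure"}],
        },
        "FindMatchCandidates": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:find-match-candidates",
            "End": True,
        },
        "HandleFailure": {"Type": "Fail", "Cause": "Pipeline step failed"},
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="people-matching-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/step-function-role",
)
```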
  31. In order to grow your expertise, you can take your MVP and build upon it. Consider adding new features. New features may introduce new tools, or they may get you more experience with the tools you already use.
  32. For example, I decided to make the format of the input files configurable and used a DynamoDB key/value store in AWS to keep track of my configuration data.
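A minimal boto3 sketch of that configuration lookup; the table name, key, and attributes are hypothetical.

```python
import boto3

dynamodb = boto3.resource("dynamodb")
config_table = dynamodb.Table("pipeline-config")  # hypothetical table keyed by source name

# Store the expected layout for one input source.
config_table.put_item(Item={
    "source": "vendor_a",
    "delimiter": "|",
    "columns": ["first_name", "last_name", "dob"],
})

# Later, the ingest step reads the layout before parsing the file.
item = config_table.get_item(Key={"source": "vendor_a"}).get("Item", {})
print(item["delimiter"], item["columns"])
```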
  33. You can also grow your expertise by subjecting your MVP to increasing scale, or by deciding you want to support increased scalability.
  34. For example, AWS Lambdas can only run for 15 minutes. If I wanted to clean, find candidate matches for, and pick the best candidate for a number of inputs beyond, say, 30K… I was going to have to find ways to make my processes run longer. So I containerized my Python code (Docker!), pushed my containers to Amazon’s container registry (ECR!), and set up an ECS Cluster where the containers could be run. By doing cleansing, candidate generation, and candidate selection in ECS, I could support longer-running processes on larger input files.
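Once the long-running steps are packaged as container images, something has to launch them; a minimal boto3 sketch of kicking off one ECS task on Fargate, with the cluster, task definition, container name, and network identifiers all hypothetical.

```python
import boto3

ecs = boto3.client("ecs")

response = ecs.run_task(
    cluster="people-matching-cluster",
    taskDefinition="find-match-candidates:1",
    launchType="FARGATE",
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
    overrides={
        "containerOverrides": [
            {"name": "matcher", "environment": [{"name": "INPUT_KEY", "value": "input/people_in.csv"}]}
        ]
    },
)
print(response["tasks"][0]["taskArn"])
```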
  35. So that’s great that my processes can run longer now on larger inputs, but I’d really like my slowest tasks to become more performant. I am limited to referencing an RDS instance. But what if I could parallelize the candidate generation process by using Spark?
  36. I looked at spinning up an AWS Elastic MapReduce (AWS’s managed Hadoop solution) cluster and submitting candidate match selection jobs to it. I knew that Scala was more performant with Spark than Python, and so wrote a matching app using Scala. I didn’t even end up using it! But the experience I got using Spark, Scala, and EMR was put to good use on other data pipelines.
  37. Where would I go next? What if I wanted to stream in records and match them as they stream in, rather than waiting for a file to land? How would that affect my solution? Are any of my components re-usable? How could I re-architect to make them re-usable? At this point, this is a thought exercise.
  38. Introducing automation, such as test automation, continuous integration and deployment pipelines (Jenkins, Groovy), and infrastructure as code (Terraform), can also expose you to a wider variety of tools and technologies, as well as deepen your knowledge and discipline.
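On the test-automation side, even a couple of pytest cases around a pipeline step pay off quickly; a hypothetical sketch against the cleaning function from the earlier pipeline example.

```python
# test_clean_people.py (run with pytest)
from pipeline import clean_people  # hypothetical module containing the cleaning step

def test_clean_people_strips_and_lowercases():
    rows = [{"last_name": "  Smith ", "city": "St. Louis"}]
    assert clean_people(rows) == [{"last_name": "smith", "city": "st. louis"}]

def test_clean_people_handles_missing_values():
    rows = [{"last_name": None, "city": ""}]
    assert clean_people(rows) == [{"last_name": "", "city": ""}]
```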
  39. Conclusion: don’t boil the ocean. Choose tools that you are actually going to get to use, and that you want to use. Concentrate all firepower on the Super Star Destroyer. Don’t try to do too much at once.