Data Engineering
April 2019 - STL Big Data User Group
Discussion Topics
Introductions
Background - Chris
Distributed systems - Chris
Day in the life - Tim
Demo - Kit
Why Data Engineering?
● It's not just about analytics.
● Treating data as an asset.
● Data science is built upon the work of good data engineering.
● The data we collect is changing. The way we store and access it must as well.
● Automate, automate, automate.
Data engineering   Stl Big Data IDEA user group
Distributed Systems
What does it mean?
A distributed system is a system whose components are located on
different networked computers, which communicate and coordinate
their actions by passing messages to one another.
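As a rough illustration of "coordinating by passing messages", the sketch below runs two components that share no state and only exchange messages. It assumes two threads in one process as a stand-in for separate networked computers; the names `worker` and `run_demo` are made up for this example.

```python
import queue
import threading


def worker(inbox, outbox):
    """A 'node' that only reacts to messages it receives."""
    while True:
        msg = inbox.get()
        if msg == "stop":
            break
        # Reply with a message instead of writing to shared state.
        outbox.put(("sum", sum(msg)))


def run_demo():
    to_worker = queue.Queue()       # stand-in for the network link to the worker
    to_coordinator = queue.Queue()  # stand-in for the link back
    node = threading.Thread(target=worker, args=(to_worker, to_coordinator))
    node.start()
    to_worker.put([1, 2, 3])              # request
    reply = to_coordinator.get(timeout=5)  # response
    to_worker.put("stop")
    node.join()
    return reply
```

On a real network the queues would be replaced by sockets or a broker, and either side would also have to handle lost or delayed messages.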
Distributed Compute
1. Overview
2. Scaling
3. What does it mean for you?
Data Architecture - Lambda & Kappa
Lambda: Batch and Speed Layer
● Stores raw data
● Computes tables in batch
● Has streaming component
● Performs data modeling
● Exposes end tables to users
Kappa: Speed Layer
● All data as a stream
● In-stream processing
● Real-time views
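One way to picture the difference between the two architectures is the toy sketch below. It assumes made-up `(item, qty)` events and plain in-memory structures; `batch_view`, `lambda_query` and `kappa_view` are hypothetical names, not part of any framework.

```python
from collections import Counter

# Made-up (item, quantity) events for illustration only.
events = [("shirt", 2), ("hat", 1), ("shirt", 3), ("hat", 4)]


def batch_view(raw_events):
    """Batch layer: recompute a table from stored raw data."""
    view = Counter()
    for item, qty in raw_events:
        view[item] += qty
    return view


def lambda_query(raw_store, recent_events):
    """Lambda: merge the batch table with the speed layer's recent events."""
    merged = batch_view(raw_store)
    merged.update(batch_view(recent_events))  # Counter.update adds counts
    return merged


def kappa_view(stream):
    """Kappa: one streaming job keeps the view current after every event."""
    view = Counter()
    for item, qty in stream:
        view[item] += qty
        yield dict(view)
```

In the Lambda sketch the same aggregation logic has to live in two layers; in the Kappa sketch there is a single code path, and reprocessing means replaying the stream.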
A Day In The Life
What’s not modern about this architecture?
● Cron job runs once a day to check for presence of a file on an FTP server
● CSV file is downloaded and loaded into an RDBMS star schema
Message Queues
● Publishers
● Exchanges
● Queues
● Consumers
● Messages are removed from
queues once a consumer reads
them
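The pieces listed above can be sketched in a few lines. This is a minimal in-memory model, not a real broker: the `Exchange` class and the `"orders"` routing key are assumptions made for illustration.

```python
from collections import deque


class Exchange:
    """Routes each published message to the queue bound to its routing key."""

    def __init__(self):
        self.bindings = {}  # routing key -> queue

    def bind(self, routing_key, q):
        self.bindings[routing_key] = q

    def publish(self, routing_key, body):
        self.bindings[routing_key].append(body)


orders = deque()                        # the queue
exchange = Exchange()
exchange.bind("orders", orders)
exchange.publish("orders", "order-1")   # the publisher
exchange.publish("orders", "order-2")
first = orders.popleft()                # the consumer; reading removes the message
```

After the consumer reads `order-1`, only `order-2` is left in the queue, which is the behaviour the last bullet describes.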
Scaling Up Message Queues
What if a single consumer can’t keep up with the volume?
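A common answer is to attach several consumers to the same queue so that each message is handled by exactly one of them. The sketch below assumes threads as stand-ins for consumer processes; `consume_with_workers` is a hypothetical helper.

```python
import queue
import threading


def consume_with_workers(messages, worker_count):
    """Drain one queue with several competing consumers."""
    work = queue.Queue()
    for message in messages:
        work.put(message)

    handled = []
    lock = threading.Lock()

    def worker(name):
        while True:
            try:
                msg = work.get_nowait()  # each message goes to one worker only
            except queue.Empty:
                return
            with lock:
                handled.append((name, msg))

    threads = [
        threading.Thread(target=worker, args=(f"consumer-{i}",))
        for i in range(worker_count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

Which consumer gets which message is not deterministic, and ordering across consumers is no longer guaranteed; only the "each message processed once" property holds.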
Scaling Them Up Even Further...
What happens if each message needs to go to more than one set of consumers?
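With a queue-based broker, one way to handle this is a fan-out exchange that copies each message into one queue per set of consumers. The sketch below is an in-memory illustration; `FanoutExchange`, `billing` and `shipping` are made-up names.

```python
from collections import deque


class FanoutExchange:
    """Copies every published message into all bound queues."""

    def __init__(self):
        self.queues = []

    def bind(self, q):
        self.queues.append(q)

    def publish(self, body):
        for q in self.queues:
            q.append(body)


billing, shipping = deque(), deque()
fanout = FanoutExchange()
fanout.bind(billing)
fanout.bind(shipping)
fanout.publish("order-1")
billed = billing.popleft()  # billing consuming its copy leaves shipping's copy alone
```

The cost is one stored copy of each message per consumer set, which is part of what motivates the log-based approach.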
Distributed Logs
● Topic - a named stream of messages
● Partition - an ordered list of messages. A topic is made up of a set of partitions
● Producer - a process that publishes messages to a topic
● Consumer - a process that consumes messages from a topic
● Consumer group - a named group of consumers who together consume
messages from a set of topics
Topics can be consumed by multiple consumer groups independently.
Each consumer group is responsible for tracking its own position in message
processing.
Messages stay in the log for the defined retention period; they are not removed once
all consumer groups have processed them.
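The terms above can be tied together with a small in-memory model. This is a sketch under simplifying assumptions, not how any particular log system is implemented: the `Topic` and `ConsumerGroup` classes are hypothetical, partitioning uses a CRC32 of the key, and time-based retention is not modelled.

```python
import zlib


class Topic:
    """A named stream split into ordered, append-only partitions."""

    def __init__(self, name, partition_count):
        self.name = name
        self.partitions = [[] for _ in range(partition_count)]

    def produce(self, key, value):
        # The same key always lands in the same partition, preserving its order.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(value)
        return index


class ConsumerGroup:
    """Tracks its own read position (offset) per partition."""

    def __init__(self, name, topic):
        self.name = name
        self.topic = topic
        self.offsets = [0] * len(topic.partitions)

    def poll(self):
        batch = []
        for index, log in enumerate(self.topic.partitions):
            batch.extend(log[self.offsets[index]:])
            self.offsets[index] = len(log)  # advance the position; nothing is deleted
        return batch
```

Because reading only moves an offset, a second consumer group created later can still read every message from the start of the log, which a queue that deletes on read cannot offer.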
Distributed Logs
Data Formats
Apache Avro
● Row oriented
● Fast at writes
● Rich support for schema evolution
{
  "type" : "record",
  "namespace" : "stl.bigdata",
  "name" : "Inventory",
  "fields" : [
    { "name" : "date" , "type" : { "type" : "int", "logicalType" : "date" } },
    { "name" : "price" , "type" : "int" },
    { "name" : "size" , "type" : "int" }
  ]
}
Apache Avro
● Row oriented
● Fast at writes
● Rich support for schema evolution
{
  "type" : "record",
  "namespace" : "stl.bigdata",
  "name" : "Inventory",
  "fields" : [
    { "name" : "date" , "type" : { "type" : "int", "logicalType" : "date" } },
    { "name" : "price" , "type" : "int" },
    { "name" : "size" , "type" : "int" },
    { "name" : "qty", "type" : [ "null", "int" ], "default" : null }
  ]
}
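Adding `qty` with a default is what lets a reader using the newer schema still read records written with the older one. The sketch below imitates that rule in plain Python; it does not use an Avro library, it ignores types and binary encoding, and `resolve`, `old_record` and the sample values are made up for illustration.

```python
# A record written with the older schema (no "qty"); values are placeholders.
old_record = {"date": "2019-04-01", "price": 1999, "size": 10}

# Reader-side view of the newer schema: only names and defaults are modelled.
reader_fields = [
    {"name": "date"},
    {"name": "price"},
    {"name": "size"},
    {"name": "qty", "default": None},
]


def resolve(record, fields):
    """Fill fields missing from the written record using the reader's defaults."""
    resolved = {}
    for field in fields:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


new_view = resolve(old_record, reader_fields)
```

Had `qty` been added without a default, the same old record could not be resolved, which is why new fields are usually introduced as nullable with a default.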
Column Oriented Storage
● Instead of storing all columns for a row together on disk, store each column
together on disk
● Enables higher compression ratios
● What kind of queries will you be executing against the data?
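A toy example may help show why storing a column together tends to compress better and suits aggregate queries. It assumes a handful of made-up inventory rows and uses run-length encoding as one simple compression scheme; `to_columns` and `run_length_encode` are hypothetical helpers, not the encoding of any specific file format.

```python
from itertools import groupby

# Made-up rows: (date, price, size)
rows = [
    ("2019-04-01", 1999, 10),
    ("2019-04-01", 2499, 12),
    ("2019-04-01", 1999, 10),
    ("2019-04-02", 1999, 8),
]


def to_columns(rows, names):
    """Pivot row tuples into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows, ("date", "price", "size"))
encoded_dates = run_length_encode(columns["date"])
total_price = sum(columns["price"])  # touches one column, not whole rows
```

Similar values sit next to each other in a column, so the four dates collapse to two pairs, and a query such as a price total never has to read the other columns. A query that needs every field of a single row favours the row layout instead.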
Apache ORC
● Column oriented
● Enables fast reading
● Compresses extremely well
● Commonly used with Hive and Presto
Apache Parquet
● Column oriented
● Enables fast reading
● Compresses well
● Commonly used with Impala
Compression Comparison
Format    Size
CSV       2.5 GB
Avro      1.2 GB
Parquet   817 MB
ORC       796 MB
Source: Netflix Prize Data https://www.kaggle.com/netflix-inc/netflix-prize-data
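To read the figures as ratios, the short calculation below expresses each format's size as a fraction of the CSV size. It assumes 1 GB = 1000 MB, which the slide does not state; `sizes_mb` and `relative_size` are names made up for this example.

```python
# Sizes from the comparison above, converted to MB assuming 1 GB = 1000 MB.
sizes_mb = {"CSV": 2500, "Avro": 1200, "Parquet": 817, "ORC": 796}


def relative_size(fmt):
    """Size of a format as a fraction of the CSV baseline."""
    return sizes_mb[fmt] / sizes_mb["CSV"]


for fmt in sizes_mb:
    print(f"{fmt:8s} {relative_size(fmt):.0%} of CSV")
```

Under that assumption Avro comes to roughly half the CSV size and the two columnar formats to roughly a third, for this one dataset.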
Demo
https://github.com/kitmenke/spark-hello-world
AWS Kinesis
API
AWS Comprehend
Appendix
Best in class modern architecture
… At 1904labs, we have a passion for open source
technologies, strive to build cloud first applications, and
are motivated by our desire to transform businesses into
data-driven enterprises.
Infrastructure
Bare Metal / Servers On Site
Build & Maintain
Bundled Vendors
Cloud
These open source tools
are already used broadly
across industries today
Open-Source Adoption in the Industry
1904labs’ primary focus is to
help our clients design, build
and operate world class and
modern data, development
operations and decision science
capabilities...
...Utilizing best of breed open
source and/or commercial open
source tools and platforms.
