Introduction To Data Engineering

Uploaded by

Ankit Tiwari

Introduction to Data Engineering

Speaker – Baitanik Talukder

| 2024-03-12 | Page 1 Ericsson

Course Content

Module 1: Introduction to Data Engineering
• Overview of Data Engineering
• Role of Data Engineers
• Data Science vs Data Engineering

Module 2: Python Fundamentals
• Basic Data Types and Data Structures in Python
• Control Structures and Functions in Python
• Handling Exceptions and Errors
• Working with Files and File I/O Operations
• Assignment

Module 3: Data Serialization and Different Connectors
• Overview of Data Serialization module
• Working with JSON and CSV Data Formats
• Working with databases: RDBMS (Postgres) and NoSQL (Elasticsearch)
• Connect with Message Bus (Kafka)
• Assignment (Data exchange between SQL and NoSQL)

Module 4: Introduction to DataFrame
• Using Pandas
• DataFrame Operations: Aggregation and Transformations
• Additional tool (PySpark)
• Assignment (Handle large-scale data processing)

Module 5: Packaging & Project Setup
• Using setup.py to build a wheel file
• Running in a containerized environment
• Introduction to workflow management (Argo/Airflow)
• Assignment (Application packaging & deployment)

Module 6: Capstone Project
• Real-world Data Engineering Project
• Presenting and Demonstrating the Capstone Project

What is Data Engineering

● Data engineering is a field in computer science that focuses on designing, building, and maintaining systems and infrastructure for managing large volumes of data. Data engineers are responsible for the development and operation of data pipelines, data warehouses, and other data infrastructure components that enable organizations to collect, store, process, and analyze data efficiently and reliably.

Aspects and their descriptions:

Data Acquisition: Data engineers are involved in sourcing data from various internal and external sources, such as databases, APIs, streaming platforms, logs, sensors, and other data repositories.

Data Storage: Data engineers design and implement storage solutions that are optimized for the organization's data requirements. This includes selecting appropriate data storage technologies such as relational databases, NoSQL databases, data lakes, distributed file systems, and cloud storage services.

Data Integration: Data engineers integrate data from disparate sources and formats to create unified and consistent views of the data. This involves resolving data schema inconsistencies, managing data quality issues, and ensuring data integrity across the organization.

Data Transformation: Data engineers develop and maintain ETL (Extract, Transform, Load) processes to move data between different systems and formats. They may use batch processing or real-time streaming techniques depending on the requirements of the use case.
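
The ETL idea described under Data Transformation can be illustrated with a small batch example. The following is only a minimal sketch using the Python standard library: the sample CSV, the `users` table, and the function names `extract`, `transform`, `load`, and `run_pipeline` are all hypothetical and chosen for illustration, and an in-memory SQLite database stands in for a real target system.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this could come from a file, an API, or a queue.
RAW_CSV = """id,name,signup_date,country
1, Alice ,2024-01-05,se
2,Bob,2024-02-10,IN
3,,2024-03-01,us
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim whitespace, upper-case country codes, drop rows with no name."""
    cleaned = []
    for row in rows:
        name = row["name"].strip()
        if not name:
            continue  # skip incomplete records
        country = row["country"].strip().upper()
        cleaned.append((int(row["id"]), name, row["signup_date"], country))
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users "
        "(id INTEGER PRIMARY KEY, name TEXT, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO users VALUES (?, ?, ?, ?)", records)
    conn.commit()


def run_pipeline(conn):
    """Run extract -> transform -> load, then read back what was stored."""
    load(transform(extract(RAW_CSV)), conn)
    return conn.execute("SELECT id, name, country FROM users ORDER BY id").fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # assumption: in-memory DB as a stand-in target
    print(run_pipeline(conn))
    conn.close()
```

Keeping the three stages as separate functions is one possible design choice; it lets each stage be tested or replaced on its own, for example by swapping the CSV source for a streaming consumer.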

What is Data Engineering (Cont.)

Overall, data engineering plays a critical role in enabling organizations to derive actionable insights, make data-driven decisions, and drive innovation by providing reliable, scalable, and efficient data infrastructure and processes. Data engineers collaborate closely with data scientists, analysts, and other stakeholders to ensure that data solutions meet the organization's business objectives and data requirements.

Aspects and their descriptions:

Data Quality and Governance: Data engineers implement data quality checks, monitoring, and governance mechanisms to ensure the accuracy, completeness, and reliability of the data. This includes establishing data quality metrics, implementing data validation rules, and enforcing data governance policies.

Scalability and Performance: Data engineers design data systems that can scale horizontally and vertically to accommodate growing data volumes and user demands. They optimize data pipelines and infrastructure for performance, reliability, and cost-effectiveness.

Infrastructure Automation: Data engineers leverage automation tools and frameworks to provision, configure, and manage data infrastructure resources efficiently. This may include using infrastructure as code (IaC) tools, containerization technologies, and cloud services for deployment and orchestration.
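
The data validation rules and data quality metrics mentioned under Data Quality and Governance can be sketched in a few lines of Python. This is an illustrative example only: the sample records, the rule names in `RULES`, and the `quality_report` function are assumptions made for this sketch, not part of any particular tool.

```python
from datetime import date

# Hypothetical sample records with deliberate problems:
# an empty email, an impossible date, and a duplicated id.
RECORDS = [
    {"id": 1, "email": "a@example.com", "signup_date": "2024-01-05"},
    {"id": 2, "email": "", "signup_date": "2024-02-30"},
    {"id": 2, "email": "b@example.com", "signup_date": "2024-03-01"},
]


def is_iso_date(value):
    """Return True if value is a valid YYYY-MM-DD calendar date."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# Validation rules: each maps a rule name to a predicate over one record.
RULES = {
    "email_present": lambda r: bool(r["email"]),
    "email_has_at": lambda r: "@" in r["email"],
    "valid_signup_date": lambda r: is_iso_date(r["signup_date"]),
}


def quality_report(records):
    """Count rule failures and compute simple quality metrics for a batch."""
    total = len(records)
    failures = {name: 0 for name in RULES}
    for record in records:
        for name, rule in RULES.items():
            if not rule(record):
                failures[name] += 1
    ids = [r["id"] for r in records]
    non_empty_emails = sum(1 for r in records if r["email"])
    return {
        "total": total,
        "failures": failures,
        "duplicate_ids": len(ids) - len(set(ids)),
        "email_completeness": non_empty_emails / total if total else 0.0,
    }


if __name__ == "__main__":
    print(quality_report(RECORDS))
```

A report like this could be produced per batch and compared against thresholds, so that a pipeline stops or raises an alert when quality drops.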

The Evolving Role of the Data Engineer

Data engineers work in various settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret. Their ultimate goal is to make data accessible so that organizations can use it to evaluate and optimize their performance.

Data Engineering vs Data Science

Area: Focus
Data Engineering: Primarily concerned with the design, development, and maintenance of data pipelines and infrastructure. Data engineers focus on the collection, storage, and processing of data at scale, ensuring its accessibility, reliability, and efficiency for downstream analytics and applications.
Data Science: Focuses on extracting insights and knowledge from data through advanced analytics, statistical modeling, machine learning, and data visualization techniques. Data scientists leverage data to solve complex problems, make predictions, and drive decision-making processes.

Area: Skills
Data Engineering: Requires strong programming skills, particularly in languages like Python, Java, or Scala, along with expertise in data storage technologies (e.g., databases, data lakes, distributed file systems), data processing frameworks (e.g., Apache Spark, Hadoop), and proficiency in ETL (Extract, Transform, Load) processes.
Data Science: Requires a combination of skills in statistics, mathematics, programming (often in Python or R), machine learning, data visualization, and domain expertise. Data scientists must be adept at exploratory data analysis, predictive modeling, and communicating insights effectively.

Area: Responsibilities
Data Engineering: Responsibilities include designing and building data pipelines, integrating data from various sources, maintaining data infrastructure, optimizing data storage and retrieval, ensuring data quality and reliability, and collaborating with other teams (e.g., data science, software engineering) to support analytical and operational needs.
Data Science: Responsibilities include identifying business problems that can be addressed with data analysis, collecting and exploring relevant data, preprocessing and transforming data for analysis, developing and validating predictive models, interpreting results, and communicating findings to stakeholders.

Area: Tools and Technologies
Data Engineering: Utilizes tools and technologies for data storage (e.g., relational databases, NoSQL databases, data lakes), data processing (e.g., Apache Spark, Apache Hadoop), workflow management (e.g., Apache Airflow, Luigi), and infrastructure automation (e.g., Kubernetes, Docker).
Data Science: Relies on tools and technologies for data manipulation and analysis (e.g., Pandas, NumPy), statistical modeling and machine learning (e.g., scikit-learn, TensorFlow, PyTorch), and data visualization (e.g., Matplotlib, Seaborn, Plotly).

Area: End Goals
Data Engineering: Aims to ensure efficient, reliable, and scalable data infrastructure to support various data-driven applications and analytical needs within an organization.
Data Science: Aims to extract actionable insights, patterns, and predictions from data to inform decision-making, optimize processes, drive innovation, and create value for businesses and organizations.

While there is overlap between data engineering and data science, particularly in areas such as data preprocessing and feature engineering, they represent distinct skill sets and roles within the broader domain of data analytics. Effective collaboration between data engineers and data scientists is crucial for successful data-driven initiatives, as they complement each other's expertise in building end-to-end data solutions and extracting meaningful insights from data.
