Course1_summary
Data engineering is the practice of designing and building systems for the aggregation, storage and
analysis of data at scale. Data engineers empower organizations to get insights in real time from large
datasets.
From social media and marketing metrics to employee performance statistics and trend forecasts,
enterprises have all the data they need to compile a holistic view of their operations. Data engineers
transform massive quantities of data into valuable strategic findings.
Organizations have access to more data—and more data types—than ever before. Every bit of data can
potentially inform a crucial business decision.
Data Pipeline:
Data pipelines are the main building block of data lifecycle management. A data pipeline is a set of tools and processes for collecting, processing, and delivering data from one or more sources to a destination where it can be analyzed.
Data pipelines form the backbone of a well-functioning data infrastructure and are informed by the data
architecture requirements of the business they serve. Data observability is the practice by which data
engineers monitor their pipelines to ensure that end users receive reliable data.
DataOps- the practice of automating data pipelines and improving collaboration between data producers and data consumers.
"Event streaming" refers to a system where data is continuously published (by producers) to a central
hub called a "broker," which then distributes the data to interested "consumers" in real-time.
Producer: The entity that generates and sends data (events) to the broker.
Broker: The central hub that receives, stores, and distributes events to consumers.
Consumer: The entity that receives and processes events published by producers via the broker.
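A minimal pure-Python sketch of this pattern (illustrative only: production systems use a dedicated broker such as Apache Kafka, and the Broker class, topic name, and event fields below are made up):

```python
import queue
from collections import defaultdict

class Broker:
    """Central hub: receives events from producers and fans them out to consumers."""
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> list of subscriber queues

    def subscribe(self, topic):
        q = queue.Queue()
        self.topics[topic].append(q)
        return q

    def publish(self, topic, event):
        for q in self.topics[topic]:
            q.put(event)

broker = Broker()
inbox = broker.subscribe("page_views")                        # consumer registers interest
broker.publish("page_views", {"user": 42, "url": "/home"})    # producer emits an event
print(inbox.get())                                            # consumer processes the event
```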
Storage-
Serialization is used to store or transport data more efficiently: data is serialized into a standard format before it is sent around, and deserialized on the receiving end.
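For example, serializing a record to JSON (one common standard format among many, alongside Avro, Parquet, and Protobuf) and deserializing it on the receiving side:

```python
import json

record = {"id": 1, "name": "Ada", "active": True}

payload = json.dumps(record)      # serialize: Python object -> JSON string for storage/transport
restored = json.loads(payload)    # deserialize: JSON string -> Python object on the receiving end

assert restored == record
```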
Data compression is the process of reducing the size of data by removing redundancy. Caching is the practice of storing frequently accessed data in a temporary, faster storage location. The two work together to improve performance by minimizing data transfer and access times, especially in web applications and databases: compression shrinks the data stored in the cache, so more of it is readily available.
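A small sketch of the two ideas working together, assuming payloads are gzip-compressed before being placed in a simple dictionary cache (the cache design and the report data are illustrative stand-ins for something like Redis):

```python
import gzip
import json

cache = {}  # key -> compressed bytes (stand-in for a real cache service)

def get_report(key, load_from_source):
    """Return a report, serving compressed cached copies when available."""
    if key in cache:
        return json.loads(gzip.decompress(cache[key]))            # cache hit: decompress and return
    data = load_from_source(key)                                  # cache miss: fetch from the slow source
    cache[key] = gzip.compress(json.dumps(data).encode())         # compress so more entries fit in the cache
    return data

report = get_report("sales_q1", lambda k: {"region": "EU", "total": 125000})
```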
Scaling alters the capacity of a system: we either shrink or expand it to meet expected demand.
Vertical Scaling: adding more resources (CPU, memory, storage) to an existing machine to meet higher demand.
Horizontal Scaling: adding more machines or server racks to the existing system to meet higher demand.
Online analytical processing (OLAP) and online transaction processing (OLTP) are both data processing
systems that store and analyze business data.
OLAP systems are used to analyze data for decision-making, while OLTP systems are used to manage
real-time transactions.
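A rough illustration of the difference using SQLite from Python: OLTP-style work touches individual rows in real time, while OLAP-style work aggregates over many rows to support analysis (the schema and data here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")

# OLTP: small, frequent, real-time transactions (inserting/updating single rows)
conn.execute("INSERT INTO orders VALUES (1, 'EU', 99.0)")
conn.execute("INSERT INTO orders VALUES (2, 'US', 250.0)")
conn.commit()

# OLAP: analytical query scanning many rows for decision-making
for region, total in conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(region, total)
```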
Data Warehouse: Optimized for structured data, data warehouses store information in a relational format, supporting complex queries and analytics. They are typically used for business intelligence and reporting.
Data Lake: Designed for unstructured, semi-structured, and structured data, Data Lakes store raw data in
its native format.
Lakehouse: Combines features of both, offering a unified approach to data storage. It enables structured
data analytics and unstructured data processing within the same system.
Data ingestion- Data ingestion is the movement of data from various sources into a single ecosystem.
These sources can include databases, cloud computing platforms such as Amazon Web Services (AWS),
IoT devices, data lakes and warehouses, websites and other customer touchpoints. Data engineers use
APIs to connect many of these data points into their pipelines.
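A hedged sketch of ingesting records from a source API into the pipeline using the requests library (the URL and any field names are placeholders, not a real endpoint):

```python
import requests

def ingest(url):
    """Pull raw records from a source API into the pipeline (URL is a placeholder)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()     # fail loudly if the source is unavailable
    return response.json()          # raw records, ready for the transformation step

# raw_records = ingest("https://example.com/api/orders")  # hypothetical endpoint
```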
Data transformation- Data transformation prepares the ingested data for end users such as executives or
machine learning engineers. It is a hygiene exercise that finds and corrects errors, removes duplicate
entries and normalizes data for greater data reliability. Then, the data is converted into the format
required by the end user.
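A minimal pandas sketch of the hygiene steps described above (the column names and normalization choice are illustrative):

```python
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Ada", "Ada", "Bob", None],
    "spend": [100.0, 100.0, 250.0, 80.0],
})

clean = (
    raw.dropna(subset=["customer"])   # correct errors: drop rows missing a key field
       .drop_duplicates()             # remove duplicate entries
       .assign(spend_normalized=lambda d: (d["spend"] - d["spend"].mean()) / d["spend"].std())
)                                      # normalize for downstream reliability
```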
Data serving- Once the data has been collected and processed, it’s delivered to the end user. Real-time
data modeling and visualization, machine learning datasets and automated reporting systems are all
examples of common data serving methods.
Orchestration-
Orchestration: the central place from which most data engineering processes are managed.
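A toy pure-Python sketch of the idea: the pipeline steps are declared in one central place and run in order (a real deployment would use an orchestrator such as Apache Airflow, Dagster, or Prefect, which also adds scheduling, retries, and monitoring; the step functions are placeholders):

```python
def ingest():    print("ingest raw data")
def transform(): print("clean and normalize")
def serve():     print("publish to dashboards")

# One central definition of the pipeline and its ordering.
pipeline = [ingest, transform, serve]

for step in pipeline:
    step()
```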
"tightly coupled" refers to a system where different components are highly dependent on each other,
meaning changes in one part can significantly impact other parts, while "loosely coupled" describes a
system where components are more independent, allowing modifications to one part without major
repercussions on others, promoting greater flexibility and maintainability.
Data architecture: team size, speed, and integration needs should be considered when designing a data architecture.
Transformation" refers to the specific act of changing raw data into a desired format by cleaning,
filtering, and manipulating it, while "orchestration" is the broader process of managing and coordinating
the entire data pipeline, including multiple transformation steps, data ingestion, and delivery, ensuring
all operations happen smoothly .
Data lineage" refers to the process of tracking the movement of data from its source to its final
destination, recording all transformations and changes it undergoes.
ML Stack- Building and deploying machine learning models requires a set of software tools and
frameworks known collectively as an ML stack. Libraries for activities like data visualization and statistical
analysis are also common components of the ML stack, along with tools for data preparation, model
training, and model deployment.
Preparation– Tools for cleaning, preprocessing, and feature engineering are used to get data ready for machine learning models. Pandas and NumPy are examples of data manipulation tools, while scikit-learn also provides utilities for preparing data.
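For instance, a hedged preparation snippet combining the three libraries (the feature names and scaling choice are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 32, 47], "income": [40_000, 52_000, 88_000]})
df["log_income"] = np.log(df["income"])           # simple feature engineering with NumPy

scaler = StandardScaler()                         # scikit-learn preprocessing
X = scaler.fit_transform(df[["age", "log_income"]])
```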
Frameworks– Machine learning frameworks are libraries used to build and train machine learning models. Examples include TensorFlow, PyTorch, and Keras.
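As a small illustration, defining a tiny model and running one training step in PyTorch (the architecture and data are arbitrary, not a recommended setup):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))   # tiny network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

X, y = torch.randn(16, 4), torch.randn(16, 1)   # dummy batch
loss = loss_fn(model(X), y)
loss.backward()                                  # backpropagation
optimizer.step()                                 # one training step
```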
Deployment– Deployment tools are used to put machine learning models into production use in real-life applications. Flask, Docker, and Kubernetes are all examples of deployment tools.
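A hedged Flask sketch of exposing a model behind an HTTP endpoint (the route and the stand-in prediction logic are placeholders; containerizing this app with Docker and running it on Kubernetes are the usual next steps):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. {"features": [1.0, 2.0, 3.0]}
    prediction = sum(features)                   # stand-in for a real model.predict(...) call
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```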
Analysis– Tools for visualizing and analyzing data are used at various stages of model creation.
Matplotlib and Seaborn are two examples of visualization tools; Jupyter Notebook and Google Colab are
two examples of analytical tools that encourage collaboration and active learning via data exploration.
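For example, a quick Matplotlib plot of the kind typically produced while exploring results in a Jupyter notebook (the data points are made up):

```python
import matplotlib.pyplot as plt

epochs = [1, 2, 3, 4, 5]
loss = [0.9, 0.6, 0.45, 0.38, 0.35]

plt.plot(epochs, loss, marker="o")   # visualize training progress
plt.xlabel("epoch")
plt.ylabel("loss")
plt.title("Model training loss")
plt.show()
```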