Course1_summary

Data engineering involves designing and building systems for data aggregation, storage, and analysis, enabling real-time insights from large datasets. It encompasses data pipelines, data architecture, and various storage solutions like data warehouses and data lakes, with a focus on data transformation and orchestration. Key practices include data ingestion, monitoring data pipelines for reliability, and utilizing machine learning frameworks for model deployment.


What Is Data Engineering?

Data engineering is the practice of designing and building systems for the aggregation, storage and
analysis of data at scale. Data engineers empower organizations to get insights in real time from large
datasets.

From social media and marketing metrics to employee performance statistics and trend forecasts,
enterprises have all the data they need to compile a holistic view of their operations. Data engineers
transform massive quantities of data into valuable strategic findings.

With proper data engineering, stakeholders across an organization, including executives, developers, data scientists and business intelligence (BI) analysts, can access the datasets they need at any time in a manner that is reliable, convenient and secure.

Organizations have access to more data—and more data types—than ever before. Every bit of data can
potentially inform a crucial business decision.

Use Cases of Data Engineering-

· Machine learning

· Real-time data analysis

· Data collection, storage and management

Data Pipeline:

Data pipelines are the main building block of data lifecycle management. A data pipeline is a set of tools and processes
for collecting, processing, and delivering data from one or more sources to a destination where it can be
analyzed.

Data pipelines form the backbone of a well-functioning data infrastructure and are informed by the data
architecture requirements of the business they serve. Data observability is the practice by which data
engineers monitor their pipelines to ensure that end users receive reliable data.

· Batch Data Pipelines

· Real Time Data Pipeline

· Cloud Native Data Pipeline

· Open Source Data Pipeline

Architecture: Data Ingestion -> Data Preprocessing -> Data Extraction -> Data Presentation
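A minimal sketch of these four stages chained as plain Python functions; the record shape and stage logic are illustrative, not from the course.

```python
# Minimal sketch of the four pipeline stages chained together.
# The record format and the stage logic are illustrative only.

def ingest():
    # In practice this would read from an API, database, or file.
    return [{"user": "a", "clicks": "5"}, {"user": "b", "clicks": None}]

def preprocess(records):
    # Drop incomplete records and cast types.
    return [{"user": r["user"], "clicks": int(r["clicks"])}
            for r in records if r["clicks"] is not None]

def extract(records):
    # Derive the value we actually care about.
    return sum(r["clicks"] for r in records)

def present(total):
    print(f"Total clicks: {total}")

present(extract(preprocess(ingest())))
```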


Data Maturity: how effectively an organization handles and makes use of increasingly complex data.

DataOps- automating data pipelines, improving collaboration between data producers and consumers.

"Event streaming" refers to a system where data is continuously published (by producers) to a central
hub called a "broker," which then distributes the data to interested "consumers" in real-time.

Producer: The entity that generates and sends data (events) to the broker.

Broker: The central hub that receives, stores, and distributes events to consumers.

Consumer: The entity that receives and processes events published by producers via the broker.
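A toy in-memory illustration of the producer, broker, and consumer roles; real event-streaming systems (Apache Kafka, for example) add persistence, partitioning, and delivery guarantees that this sketch leaves out.

```python
# Toy in-memory broker: producers publish events to a topic,
# and the broker fans them out to every subscribed consumer.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of consumer callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        for callback in self.subscribers[topic]:
            callback(event)

broker = Broker()
broker.subscribe("clicks", lambda e: print("consumer got:", e))  # consumer
broker.publish("clicks", {"user": "a", "page": "/home"})         # producer
```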

Storage-

Storage is the central part of a data pipeline; every stage writes to or reads from some storage layer.

Serialization is used to save or transport data more efficiently. We serialize data into a standard format that is
sent around and deserialized on the receiving end; common formats are listed below, with a short example after them.

Row-based serialization: XML, JSON, CSV

Column-based serialization: Parquet, ORC, Arrow
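A small comparison of row-oriented and column-oriented serialization using pandas; writing Parquet assumes the pyarrow (or fastparquet) engine is installed.

```python
# Serialize the same table in row-oriented and column-oriented formats.
import pandas as pd

df = pd.DataFrame({"user": ["a", "b"], "clicks": [5, 3]})

df.to_json("events.json", orient="records")   # row-based: one object per record
df.to_csv("events.csv", index=False)          # row-based: one line per record
df.to_parquet("events.parquet")               # column-based: values stored per column
                                              # (requires pyarrow or fastparquet)

print(pd.read_parquet("events.parquet"))      # deserialize on the receiving end
```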

Data compression is the process of reducing the size of data by removing redundancy. Caching involves
storing frequently accessed data in a temporary, faster storage location. The two work together to improve
performance by minimizing data transfer and access times, especially in web applications and databases:
compression shrinks the data stored in the cache, so more of it can be kept readily available.
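A minimal sketch of the two ideas working together: payloads are gzip-compressed before being placed in a simple in-memory cache, so entries take less space and transfers are smaller. The cache key and the report contents are made up for the example.

```python
# Compress a payload before caching it; decompress on a cache hit.
import gzip
import json

cache = {}  # simple in-memory cache: key -> compressed bytes

def get_report(key, compute):
    if key not in cache:                                   # cache miss
        payload = json.dumps(compute()).encode("utf-8")
        cache[key] = gzip.compress(payload)                # store compressed
    return json.loads(gzip.decompress(cache[key]))         # serve from cache

report = get_report("daily", lambda: {"visits": 1000, "signups": 42})
print(report)
```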

Scaling alters the size of a system: we either shrink or expand the system to meet the expected load.

Vertical Scaling: when new resources (such as CPU or memory) are added to the existing machine to meet higher demand.

Horizontal Scaling: when new machines or server racks are added to the existing system to meet higher demand.

(A huge number of people access data concurrently, so horizontal scaling is needed.)

ACID: Atomicity, Consistency, Isolation, Durability (single-machine transactions, relational databases)

BASE: Basically Available, Soft state, Eventual consistency (distributed databases)
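A small single-machine illustration of the "A" in ACID using SQLite: either both inserts commit together or, when one fails, the whole transaction rolls back. The table and values are invented for the example.

```python
# Atomicity on a single machine: either both rows are written or neither is.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")

try:
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
        conn.execute("INSERT INTO accounts VALUES ('alice', 50)")  # violates PRIMARY KEY
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT COUNT(*) FROM accounts").fetchone())  # (0,) - nothing committed
```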

Online analytical processing (OLAP) and online transaction processing (OLTP) are both data processing
systems that store and analyze business data.

OLAP systems are used to analyze data for decision-making, while OLTP systems are used to manage
real-time transactions.

Data Warehouse: Optimized for structured data, Data Warehouses store information in a relational
format, supporting complex queries and analytics. They are typically used for business intelligence and reporting.

Data Lake: Designed for unstructured, semi-structured, and structured data, Data Lakes store raw data in
its native format.

Lakehouse: Combines features of both, offering a unified approach to data storage. It enables structured
data analytics and unstructured data processing within the same system.

Data ingestion- Data ingestion is the movement of data from various sources into a single ecosystem.
These sources can include databases, cloud computing platforms such as Amazon Web Services (AWS),
IoT devices, data lakes and warehouses, websites and other customer touchpoints. Data engineers use
APIs to connect many of these data points into their pipelines.
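A minimal ingestion sketch that pulls records from an HTTP API into a pipeline; the endpoint URL and response shape are hypothetical, and the requests library is assumed to be installed.

```python
# Pull records from a (hypothetical) REST endpoint into the pipeline.
import requests

def ingest_from_api(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()          # fail loudly on a bad status code
    return response.json()               # assumed to return a list of records

# Example (hypothetical endpoint):
# records = ingest_from_api("https://api.example.com/v1/events")
```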

Data transformation- Data transformation prepares the ingested data for end users such as executives or
machine learning engineers. It is a hygiene exercise that finds and corrects errors, removes duplicate
entries and normalizes data for greater data reliability. Then, the data is converted into the format
required by the end user.
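A small pandas sketch of the hygiene steps described above: dropping incomplete entries, fixing types, normalizing values, and removing duplicates. The column names and values are illustrative.

```python
# Typical transformation hygiene: drop bad rows, normalize, fix types, dedupe.
import pandas as pd

raw = pd.DataFrame({
    "email": ["A@X.COM", "a@x.com", "b@y.com", None],
    "amount": ["10", "10", "7.5", "3"],
})

clean = (
    raw.dropna(subset=["email"])                             # remove incomplete entries
       .assign(email=lambda d: d["email"].str.lower(),       # normalize casing
               amount=lambda d: d["amount"].astype(float))   # convert to the required type
       .drop_duplicates(subset=["email"])                    # remove duplicate entries
)

print(clean)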

Data serving- Once the data has been collected and processed, it’s delivered to the end user. Real-time
data modeling and visualization, machine learning datasets and automated reporting systems are all
examples of common data serving methods.

Orchestration-

Orchestration: the central place from which most data engineering processes are managed.

"tightly coupled" refers to a system where different components are highly dependent on each other,
meaning changes in one part can significantly impact other parts, while "loosely coupled" describes a
system where components are more independent, allowing modifications to one part without major
repercussions on others, promoting greater flexibility and maintainability.
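A short illustration of the difference: the tightly coupled pipeline hard-codes one storage class, while the loosely coupled pipeline accepts any writer object, so storage can change without touching the pipeline. The class and method names are invented for the example.

```python
# Tightly coupled: the pipeline constructs a specific storage backend itself.
class CsvWriter:
    def write(self, records):
        print("writing CSV:", records)

class TightPipeline:
    def run(self, records):
        CsvWriter().write(records)       # changing storage means editing this class

# Loosely coupled: the writer is injected, so any object with .write() works.
class LoosePipeline:
    def __init__(self, writer):
        self.writer = writer

    def run(self, records):
        self.writer.write(records)

LoosePipeline(CsvWriter()).run([{"id": 1}])
```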

Data architecture: team size, speed, and integration needs should be checked while building a data architecture.

Lambda and Kappa- two common architectures for combining batch and streaming data: Lambda maintains separate batch and real-time (speed) layers, while Kappa processes everything as a single stream.

Modern data stack: data source -> ingestion -> cloud storage -> BI


BI stack: data sources -> data engineering -> data analysts

Transformation" refers to the specific act of changing raw data into a desired format by cleaning,
filtering, and manipulating it, while "orchestration" is the broader process of managing and coordinating
the entire data pipeline, including multiple transformation steps, data ingestion, and delivery, ensuring
all operations happen smoothly .

Data lineage" refers to the process of tracking the movement of data from its source to its final
destination, recording all transformations and changes it undergoes.

ML Stack- Building and deploying machine learning models requires a set of software tools and
frameworks known collectively as an ML stack. Libraries for activities like data visualization and statistical
analysis are also common components of the ML stack, along with tools for data preparation, model
training, and model deployment.

Preparation– Data preparation and processing tools cover cleaning, preprocessing, and feature engineering to get data
ready for use in machine learning models. Pandas and NumPy are examples of data manipulation tools, while
scikit-learn provides utilities for preparing data.
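A brief example of the preparation step with the tools named above; it assumes pandas, NumPy, and scikit-learn are installed, and the small table is synthetic.

```python
# Basic feature preparation: impute missing values, then scale numeric columns.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 32, np.nan, 41],
                   "income": [40_000, 52_000, 61_000, np.nan]})

df = df.fillna(df.mean())                          # simple imputation with pandas
features = StandardScaler().fit_transform(df)      # scikit-learn scaling for modelling
print(features)
```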

Frameworks– Machine learning frameworks are libraries used to construct and train machine learning
models. Examples include TensorFlow, PyTorch, and Keras.
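A tiny model definition to show what constructing and training a model with a framework looks like; this sketch uses Keras, assumes TensorFlow is installed, and trains on random synthetic data just so it runs end to end.

```python
# Define and train a tiny binary classifier with Keras (TensorFlow backend assumed).
import numpy as np
import tensorflow as tf

X = np.random.rand(100, 4).astype("float32")       # synthetic features, for illustration only
y = (X.sum(axis=1) > 2).astype("float32")          # synthetic labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, verbose=0)
print(model.evaluate(X, y, verbose=0))
```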

Deployment– Deployment tools are used to put machine learning models into active use
in real-life applications. Flask, Docker, and Kubernetes are all examples of deployment tools.
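A minimal sketch of serving a model behind a Flask endpoint; the route, payload shape, and the placeholder model are illustrative, and packaging this app with Docker or Kubernetes would be the next deployment step.

```python
# Minimal model-serving endpoint with Flask (route and payload shape are illustrative).
from flask import Flask, jsonify, request

app = Flask(__name__)
model = None  # placeholder: load a trained model here, e.g. with joblib or Keras

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]
    # prediction = model.predict([features])   # real call once a model is loaded
    prediction = [0]                            # stub so the sketch runs without a model
    return jsonify({"prediction": list(prediction)})

if __name__ == "__main__":
    app.run(port=5000)
```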

Analysis– Tools for visualizing and analyzing data are used at various stages of model creation.
Matplotlib and Seaborn are two examples of visualization tools; Jupyter Notebook and Google Colab are
two examples of analytical tools that encourage collaboration and active learning via data exploration.
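A quick visualization example with the libraries mentioned; it assumes Matplotlib and Seaborn are installed, uses synthetic data, and runs the same way in a script, a Jupyter Notebook, or a Google Colab cell.

```python
# Plot a feature distribution during exploratory analysis.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

values = np.random.normal(loc=50, scale=10, size=500)   # synthetic data for illustration
sns.histplot(values, kde=True)
plt.title("Feature distribution")
plt.savefig("feature_distribution.png")                 # or plt.show() in a notebook
```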
