Course1_summary
Data engineering is the practice of designing and building systems for the aggregation, storage and
analysis of data at scale. Data engineers empower organizations to get insights in real time from large
datasets.
From social media and marketing metrics to employee performance statistics and trend forecasts,
enterprises have all the data they need to compile a holistic view of their operations. Data engineers
transform massive quantities of data into valuable strategic findings.
Organizations have access to more data—and more data types—than ever before. Every bit of data can
potentially inform a crucial business decision.
Data Pipeline:
Data pipelines are the main building block of data lifecycle management. A data pipeline is a set of tools and processes for collecting, processing, and delivering data from one or more sources to a destination where it can be analyzed.
Data pipelines form the backbone of a well-functioning data infrastructure and are informed by the data
architecture requirements of the business they serve. Data observability is the practice by which data
engineers monitor their pipelines to ensure that end users receive reliable data.
DataOps- the practice of automating data pipelines and improving collaboration between data producers and data consumers.
"Event streaming" refers to a system where data is continuously published (by producers) to a central
hub called a "broker," which then distributes the data to interested "consumers" in real-time.
Producer: The entity that generates and sends data (events) to the broker.
Broker: The central hub that receives, stores, and distributes events to consumers.
Consumer: The entity that receives and processes events published by producers via the broker.
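A minimal pure-Python sketch of this pattern (illustrative only: production systems use a dedicated broker such as Apache Kafka, and the Broker class, topic name, and event fields below are made up):

```python
import queue
from collections import defaultdict

class Broker:
    """Central hub: receives events from producers and fans them out to consumers."""
    def __init__(self):
        self.topics = defaultdict(list)  # topic name -> list of subscriber queues

    def subscribe(self, topic):
        q = queue.Queue()
        self.topics[topic].append(q)
        return q

    def publish(self, topic, event):
        for q in self.topics[topic]:
            q.put(event)

broker = Broker()
inbox = broker.subscribe("page_views")                        # consumer registers interest
broker.publish("page_views", {"user": 42, "url": "/home"})    # producer emits an event
print(inbox.get())                                            # consumer processes the event
```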
Storage-
Serialization is used to store or transport data more efficiently: data is serialized into a standard format before it is sent around, and deserialized on the receiving end.
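For example, serializing a record to JSON (one common standard format among many, alongside Avro, Parquet, and Protobuf) and deserializing it on the receiving side:

```python
import json

record = {"id": 1, "name": "Ada", "active": True}

payload = json.dumps(record)      # serialize: Python object -> JSON string for storage/transport
restored = json.loads(payload)    # deserialize: JSON string -> Python object on the receiving end

assert restored == record
```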
Data compression is the process of reducing the size of data by removing redundancy. Caching is the practice of storing frequently accessed data in a temporary, faster storage location. The two work together to improve performance by minimizing data transfer and access times, especially in web applications and databases: compression shrinks the data stored in the cache, so more of it is readily available.
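A small sketch of the two ideas working together, assuming payloads are gzip-compressed before being placed in a simple dictionary cache (the cache design and the report data are illustrative stand-ins for something like Redis):

```python
import gzip
import json

cache = {}  # key -> compressed bytes (stand-in for a real cache service)

def get_report(key, load_from_source):
    """Return a report, serving compressed cached copies when available."""
    if key in cache:
        return json.loads(gzip.decompress(cache[key]))            # cache hit: decompress and return
    data = load_from_source(key)                                  # cache miss: fetch from the slow source
    cache[key] = gzip.compress(json.dumps(data).encode())         # compress so more entries fit in the cache
    return data

report = get_report("sales_q1", lambda k: {"region": "EU", "total": 125000})
```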
Scaling alters the capacity of a system: we either shrink or expand it to meet expected demand.
Vertical Scaling: adding more resources (CPU, memory, storage) to an existing machine to meet higher demand.
Horizontal Scaling: adding more machines or server racks to the existing system to meet higher demand.
Online analytical processing (OLAP) and online transaction processing (OLTP) are both data processing
systems that store and analyze business data.
OLAP systems are used to analyze data for decision-making, while OLTP systems are used to manage
real-time transactions.
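A rough illustration of the difference using SQLite from Python: OLTP-style work touches individual rows in real time, while OLAP-style work aggregates over many rows to support analysis (the schema and data here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")

# OLTP: small, frequent, real-time transactions (inserting/updating single rows)
conn.execute("INSERT INTO orders VALUES (1, 'EU', 99.0)")
conn.execute("INSERT INTO orders VALUES (2, 'US', 250.0)")
conn.commit()

# OLAP: analytical query scanning many rows for decision-making
for region, total in conn.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(region, total)
```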
Data Warehouse: Optimized for structured data, data warehouses store information in a relational format, supporting complex queries and analytics. They are typically used for business intelligence and reporting.
Data Lake: Designed for unstructured, semi-structured, and structured data, Data Lakes store raw data in
its native format.
Lakehouse: Combines features of both, offering a unified approach to data storage. It enables structured
data analytics and unstructured data processing within the same system.
Data ingestion- Data ingestion is the movement of data from various sources into a single ecosystem.
These sources can include databases, cloud computing platforms such as Amazon Web Services (AWS),
IoT devices, data lakes and warehouses, websites and other customer touchpoints. Data engineers use
APIs to connect many of these data points into their pipelines.
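A hedged sketch of ingesting records from a source API into the pipeline using the requests library (the URL and any field names are placeholders, not a real endpoint):

```python
import requests

def ingest(url):
    """Pull raw records from a source API into the pipeline (URL is a placeholder)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()     # fail loudly if the source is unavailable
    return response.json()          # raw records, ready for the transformation step

# raw_records = ingest("https://example.com/api/orders")  # hypothetical endpoint
```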
Data transformation- Data transformation prepares the ingested data for end users such as executives or
machine learning engineers. It is a hygiene exercise that finds and corrects errors, removes duplicate
entries and normalizes data for greater data reliability. Then, the data is converted into the format
required by the end user.
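A minimal pandas sketch of the hygiene steps described above (the column names and normalization choice are illustrative):

```python
import pandas as pd

raw = pd.DataFrame({
    "customer": ["Ada", "Ada", "Bob", None],
    "spend": [100.0, 100.0, 250.0, 80.0],
})

clean = (
    raw.dropna(subset=["customer"])   # correct errors: drop rows missing a key field
       .drop_duplicates()             # remove duplicate entries
       .assign(spend_normalized=lambda d: (d["spend"] - d["spend"].mean()) / d["spend"].std())
)                                      # normalize for downstream reliability
```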
Data serving- Once the data has been collected and processed, it’s delivered to the end user. Real-time
data modeling and visualization, machine learning datasets and automated reporting systems are all
examples of common data serving methods.
Orchestration-
Orchestration: the central place from which most data engineering processes are managed.
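A toy pure-Python sketch of the idea: the pipeline steps are declared in one central place and run in order (a real deployment would use an orchestrator such as Apache Airflow, Dagster, or Prefect, which also adds scheduling, retries, and monitoring; the step functions are placeholders):

```python
def ingest():    print("ingest raw data")
def transform(): print("clean and normalize")
def serve():     print("publish to dashboards")

# One central definition of the pipeline and its ordering.
pipeline = [ingest, transform, serve]

for step in pipeline:
    step()
```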
"tightly coupled" refers to a system where different components are highly dependent on each other,
meaning changes in one part can significantly impact other parts, while "loosely coupled" describes a
system where components are more independent, allowing modifications to one part without major
repercussions on others, promoting greater flexibility and maintainability.
Data architecture: team size, speed, and integration needs should be considered when designing a data architecture.
Transformation" refers to the specific act of changing raw data into a desired format by cleaning,
filtering, and manipulating it, while "orchestration" is the broader process of managing and coordinating
the entire data pipeline, including multiple transformation steps, data ingestion, and delivery, ensuring
all operations happen smoothly .
Data lineage" refers to the process of tracking the movement of data from its source to its final
destination, recording all transformations and changes it undergoes.
ML Stack- Building and deploying machine learning models requires a set of software tools and
frameworks known collectively as an ML stack. Libraries for activities like data visualization and statistical
analysis are also common components of the ML stack, along with tools for data preparation, model
training, and model deployment.
Preparation– Tools for cleaning, preprocessing, and feature engineering are used to get data ready for machine learning models. Pandas and NumPy are examples of data manipulation tools, while scikit-learn also provides utilities for preparing data.
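For instance, a hedged preparation snippet combining the three libraries (the feature names and scaling choice are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 32, 47], "income": [40_000, 52_000, 88_000]})
df["log_income"] = np.log(df["income"])           # simple feature engineering with NumPy

scaler = StandardScaler()                         # scikit-learn preprocessing
X = scaler.fit_transform(df[["age", "log_income"]])
```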
Frameworks– Machine learning frameworks are libraries used to build and train machine learning models. Examples include TensorFlow, PyTorch, and Keras.
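As a small illustration, defining a tiny model and running one training step in PyTorch (the architecture and data are arbitrary, not a recommended setup):

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))   # tiny network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

X, y = torch.randn(16, 4), torch.randn(16, 1)   # dummy batch
loss = loss_fn(model(X), y)
loss.backward()                                  # backpropagation
optimizer.step()                                 # one training step
```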
Deployment– Deployment tools are used to put machine learning models into production use in real-life applications. Flask, Docker, and Kubernetes are all examples of deployment tools.
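A hedged Flask sketch of exposing a model behind an HTTP endpoint (the route and the stand-in prediction logic are placeholders; containerizing this app with Docker and running it on Kubernetes are the usual next steps):

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]   # e.g. {"features": [1.0, 2.0, 3.0]}
    prediction = sum(features)                   # stand-in for a real model.predict(...) call
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```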
Analysis– Tools for visualizing and analyzing data are used at various stages of model creation.
Matplotlib and Seaborn are two examples of visualization tools; Jupyter Notebook and Google Colab are
two examples of analytical tools that encourage collaboration and active learning via data exploration.
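For example, a quick Matplotlib plot of the kind typically produced while exploring results in a Jupyter notebook (the data points are made up):

```python
import matplotlib.pyplot as plt

epochs = [1, 2, 3, 4, 5]
loss = [0.9, 0.6, 0.45, 0.38, 0.35]

plt.plot(epochs, loss, marker="o")   # visualize training progress
plt.xlabel("epoch")
plt.ylabel("loss")
plt.title("Model training loss")
plt.show()
```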