Data Integration
| | Data Warehouse | Data Lake | Data Lakehouse |
| --- | --- | --- | --- |
| Storage Data Type | Works well with structured data | Works well with semi-structured and unstructured data | Can handle structured, semi-structured, and unstructured data |
| Purpose | Optimal for data analytics and business intelligence (BI) use cases | Suitable for machine learning (ML) and artificial intelligence (AI) workloads | Suitable for both data analytics and machine learning workloads |
| Cost | Storage is costly and time-consuming | Storage is cost-effective, fast, and flexible | Storage is cost-effective, fast, and flexible |
Five Approaches to Data Integration
To implement these processes, data engineers, architects, and
developers can either manually code an architecture using SQL
or, more often, set up and manage a data integration tool,
which streamlines development and automates the system.
1. ETL (Extract, Transform, and Load)
Extract: the process of pulling data from a source such as a SQL
or NoSQL database, an XML file, or a cloud platform holding data
for systems such as marketing tools, CRM systems, or transactional
systems.
Transform: the process of converting the format or structure of
the data set to match the target system.
Load: the process of placing the data set into the target system,
which can be a database, an application such as a CRM platform, or
a cloud data warehouse, data lake, or data lakehouse from providers
such as Snowflake, Amazon Redshift, and Google BigQuery.
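The three steps above can be sketched in Python. This is a minimal illustration, not a production pipeline: the CSV contents, column names, and the use of SQLite as a stand-in for a cloud warehouse are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Extract: pull rows from a source (an in-memory CSV stands in for a
# database export or a file from a marketing/CRM system).
source_csv = io.StringIO("order_id,amount_usd\n1,19.99\n2,5.00\n")
rows = list(csv.DictReader(source_csv))

# Transform: convert the format to match the target system
# (here, store amounts as integer cents).
transformed = [(int(r["order_id"]), round(float(r["amount_usd"]) * 100))
               for r in rows]

# Load: place the data set into the target system (SQLite stands in for
# a warehouse such as Snowflake, Amazon Redshift, or Google BigQuery).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount_cents INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", transformed)
print(conn.execute("SELECT * FROM orders").fetchall())  # [(1, 1999), (2, 500)]
```

Note that the transformation happens before the load, on the pipeline's own compute: the defining trait of ETL.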
1. What is ETL?
2. ELT (Extract, Load, and Transform)
The ELT process is broken down as follows:
Extract. A data extraction tool pulls data from a source or
sources such as SQL or NoSQL databases, cloud platforms or XML
files. This extracted data is often stored temporarily in a staging area
in a database to confirm data integrity and to apply any necessary
business rules.
Load. The second step involves placing the data into the target
system, typically a cloud data warehouse, where it is ready to be
analyzed by BI tools or data analytics tools.
Transform. Data transformation refers to converting the
structure or format of a data set to match that of the target system.
Examples of transformations include data mapping, replacing codes
with values, and applying concatenations or calculations.
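The key difference from ETL is that raw data lands in the target first and is transformed there, typically with SQL. The sketch below uses SQLite as an assumed stand-in for a cloud warehouse, with invented sample rows.

```python
import sqlite3

# Extract + Load: raw rows go into the target system as-is.
conn = sqlite3.connect(":memory:")  # stands in for a cloud data warehouse
conn.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                 [("north", 10.0), ("north", 5.0), ("south", 7.5)])

# Transform: run inside the target, using its SQL engine
# (data mapping via UPPER, a calculation via SUM).
conn.execute("""
    CREATE TABLE sales_by_region AS
    SELECT UPPER(region) AS region, SUM(amount) AS total
    FROM raw_sales
    GROUP BY region
""")
print(conn.execute("SELECT * FROM sales_by_region ORDER BY region").fetchall())
# [('NORTH', 15.0), ('SOUTH', 7.5)]
```

Because the warehouse does the transformation work, ELT scales with the warehouse's compute rather than with a separate transformation server.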
3. Data Streaming
Instead of loading data into a new repository in batches,
streaming data integration moves data continuously, in real time,
from source to target.
Modern data integration (DI) platforms can deliver analytics-ready
data into streaming and cloud platforms, data warehouses,
and data lakes.
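The batch-versus-streaming contrast can be sketched with a generator: each record is moved the moment it arrives rather than accumulated and loaded periodically. The event source and the list-based sink are invented stand-ins; a real pipeline would read from a message queue and write to a warehouse or lake.

```python
def source_events():
    # Stand-in for a continuous source such as a message queue;
    # a real stream would not terminate.
    for i in range(3):
        yield {"event_id": i, "value": i * 10}

target = []  # stands in for a cloud warehouse / data lake sink

for event in source_events():
    # Streaming: each record moves from source to target as it arrives,
    # instead of waiting for a scheduled batch load.
    target.append(event)

print(target)
```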
4. Application Integration
Application integration allows separate applications to
work together by moving and syncing data between them.
The most typical use case is to support operational needs, such
as ensuring that your HR system has the same data as your
finance system.
Application integration must provide consistency between the
data sets.
Also, because these applications usually have unique APIs for
giving and taking data, SaaS application automation tools
can help you create and maintain native API integrations
efficiently and at scale.
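The HR-to-finance example can be sketched as a one-way sync. The two dictionaries below are hypothetical in-memory stand-ins for the systems; in practice each side would be read and written through its own API.

```python
# Hypothetical records in an HR system (the source of truth here).
hr_employees = {101: {"name": "Ada", "salary": 90000},
                102: {"name": "Grace", "salary": 95000}}
# The finance system is missing one employee and has a stale salary.
finance_employees = {101: {"name": "Ada", "salary": 88000}}

def sync(source, target):
    # Copy every record from the source into the target so the two
    # systems hold consistent data sets.
    for emp_id, record in source.items():
        target[emp_id] = dict(record)

sync(hr_employees, finance_employees)
print(finance_employees)
```

A production integration would also handle deletions, conflicts, and incremental updates, which is exactly the work the automation tools mentioned above take on.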
5. Data Virtualization
Data Integration Benefits
Integrating Data using Python
Exercise:
In this exercise, we'll merge the details of students from two
datasets, namely student.csv and marks.csv.
The student dataset contains columns such as Age, Gender,
Grade, and Employed.
Integrating Data using Python
Load the student.csv and marks.csv datasets into the
stud_data and mark_data pandas DataFrames:
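A sketch of this step follows. The exercise reads student.csv and marks.csv from disk; here `io.StringIO` stands in for the files so the snippet runs anywhere, and the sample rows are invented for illustration.

```python
import io
import pandas as pd

# Stand-ins for student.csv and marks.csv with invented sample rows.
student_csv = io.StringIO(
    "Student_id,Age,Gender,Grade,Employed\n"
    "1,19,Male,1st Class,yes\n"
    "2,20,Female,2nd Class,no\n")
marks_csv = io.StringIO("Student_id,Mark\n1,95\n2,70\n")

# In the exercise these would be pd.read_csv("student.csv") and
# pd.read_csv("marks.csv").
stud_data = pd.read_csv(student_csv)
mark_data = pd.read_csv(marks_csv)
print(stud_data.head())
print(mark_data.head())
```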
Integrating Data using Python
Student_id is common to both datasets. Perform data
integration on the two DataFrames with respect to the
Student_id column using the pd.merge() function, and then
print the first 10 rows of the new DataFrame:
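The merge step can be sketched as follows; the two DataFrames are built inline with invented values so the snippet is self-contained, standing in for the frames loaded from student.csv and marks.csv.

```python
import pandas as pd

# Invented stand-ins for the DataFrames loaded from the two CSV files.
stud_data = pd.DataFrame({"Student_id": [1, 2, 3], "Age": [19, 20, 21]})
mark_data = pd.DataFrame({"Student_id": [1, 2, 3], "Mark": [95, 70, 88]})

# Integrate the two DataFrames on the common Student_id column,
# then show the first 10 rows of the result.
df = pd.merge(stud_data, mark_data, on="Student_id")
print(df.head(10))
```

By default `pd.merge` performs an inner join, so only students present in both datasets appear in the result; passing `how="outer"` would keep unmatched rows as well.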