Data Integration Concepts - 1315060699

Uploaded by

loggerrkey
Data Integration and Synchronization
Module 4

Learning outcomes
• To differentiate data synchronization from data integration
• To discuss the ETL process
• To enumerate data integration security challenges

What is data synchronization?
• Data synchronization is the process of keeping data consistent across multiple devices, systems, or locations. Changes made in one location are propagated, in real time, near real time, or at scheduled intervals, so that they are reflected in every other location.

Data Synchronization Types
Key types of data synchronization include:
1. One-way synchronization - Data is updated from a source to a destination without any updates going back to the source.
2. Two-way synchronization - Changes made in both locations (source and destination) are synchronized in both directions, ensuring both locations are up-to-date.
3. Real-time synchronization - Changes are reflected immediately across systems, often used in cloud services, collaboration tools, and mobile devices.
4. Periodic synchronization - Data is synchronized at intervals, instead of in real-time, which may be useful when immediate updates aren't necessary or bandwidth is limited.

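The difference between one-way and two-way synchronization can be sketched with two in-memory stores. This is a minimal illustration, not a production design: the `Record` type, the integer `modified_at` timestamp, and the "newest version wins" conflict rule are all assumptions made for the example, and deletions are not handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    value: str
    modified_at: int  # hypothetical logical timestamp; larger means newer


def one_way_sync(source: dict[str, Record], destination: dict[str, Record]) -> None:
    """Copy every source record to the destination; the source is never changed."""
    for key, record in source.items():
        destination[key] = record


def two_way_sync(a: dict[str, Record], b: dict[str, Record]) -> None:
    """Reconcile both stores so each holds the newest version of every key.

    Assumed conflict rule: the record with the larger modified_at wins.
    """
    for key in a.keys() | b.keys():
        candidates = [r for r in (a.get(key), b.get(key)) if r is not None]
        newest = max(candidates, key=lambda r: r.modified_at)
        a[key] = newest
        b[key] = newest
```

After `two_way_sync(laptop, phone)` both dictionaries hold the same records, whereas `one_way_sync(laptop, backup)` only ever writes to `backup`.
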
Data Synchronization Types: Examples
1. One-way synchronization - Backing up files from your local computer to a cloud storage service like Google Drive or Amazon S3. Files are pushed from your computer to the cloud, but changes in the cloud don't affect your local files.
2. Two-way synchronization - Syncing your emails between your phone and computer using an email service like Gmail. If you delete an email on your phone, it will be deleted on your computer, and vice versa.
3. Real-time synchronization - Collaboration on a Google Docs document by multiple users. Any edits made by one person are instantly visible to others, ensuring everyone sees the most up-to-date version.
4. Periodic synchronization - Syncing fitness data from a smartwatch to a mobile app. Instead of syncing data constantly, the watch sends data to the app at set intervals, such as every hour or when the user opens the app.

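The smartwatch case can be sketched as a buffer that collects readings and sends them in batches. The class below is hypothetical: the names, the integer readings, and passing the current time in as `now` (to keep the example deterministic) are assumptions, and the `synced` list merely stands in for the mobile app's copy of the data.

```python
class PeriodicSyncBuffer:
    """Hypothetical watch-side buffer that syncs in batches instead of per reading."""

    def __init__(self, interval_seconds: float) -> None:
        self.interval_seconds = interval_seconds
        self.pending: list[int] = []  # readings not yet sent
        self.synced: list[int] = []   # stands in for the mobile app's copy
        self.last_sync = 0.0

    def record(self, reading: int, now: float) -> None:
        """Store a reading and sync only if the interval has elapsed."""
        self.pending.append(reading)
        if now - self.last_sync >= self.interval_seconds:
            self.flush(now)

    def flush(self, now: float) -> None:
        """Send everything pending in one batch (also called when the app opens)."""
        self.synced.extend(self.pending)
        self.pending.clear()
        self.last_sync = now
```

Calling `flush` directly models the "when the user opens the app" trigger; `record` models the hourly interval when constructed with `interval_seconds=3600`.
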
Importance of data synchronization
• Consistency across devices and systems - It ensures that users and systems are working with the most up-to-date information, preventing data conflicts and inconsistencies. For example, if you edit a file on your laptop, it will be updated on your phone as well.
• Improved collaboration - Real-time synchronization allows teams to work on the same data or documents simultaneously without overwriting each other's work. This is crucial in collaborative environments, like shared documents or project management tools.
• Data availability - Synchronization ensures data is available anytime and anywhere, regardless of the device being used. It enables access to up-to-date information even when switching between devices, like syncing calendars or contacts between a phone and computer.
• Data backup and recovery - In one-way sync scenarios (e.g., backups), synchronization ensures that critical data is stored securely in the cloud or on external systems. This minimizes the risk of data loss due to device failure or accidental deletion.
• Support for offline work - With synchronization, users can work offline (e.g., on a document or app), and when they regain internet connectivity, their work will automatically sync, ensuring that they don't lose progress.

Data Integration vs Data Synchronization: Purpose
• Data Integration - The goal is to combine data from multiple, often diverse, sources into a unified view or system. Integration allows data from different databases, applications, or platforms to be used together for analytics, reporting, or business processes. It's about making disparate data work together.
• Data Synchronization - The goal is to ensure that data is consistent and up-to-date across multiple locations, devices, or systems. Synchronization ensures that when data is changed in one place, the changes are reflected everywhere the data is used or stored.

Data Integration vs Data Synchronization: Process
• Data Integration - Involves extracting, transforming, and loading (ETL) data from different sources into a centralized system (like a data warehouse or data lake). It may involve combining structured and unstructured data, standardizing formats, or merging datasets.
• Data Synchronization - Involves regularly updating multiple copies of the same data across different systems or devices to ensure consistency. This is often done in real-time or at regular intervals.

Data Integration vs Data Synchronization: Use Case
• Data Integration - Often used in business intelligence (BI) systems, analytics, and reporting, where data from different systems (like CRM, ERP, and marketing platforms) needs to be combined to provide insights and support decision-making.
• Data Synchronization - Common in scenarios where users or systems access the same dataset from different locations or devices, such as syncing contacts between a phone and a cloud server, or ensuring a database replica stays updated.

Data Integration vs Data Synchronization: Scope
• Data Integration - Focuses on combining and transforming data, often dealing with data from unrelated sources. Integration is more about creating a cohesive data ecosystem for analysis and decision-making.
• Data Synchronization - Focuses on keeping specific datasets or data points up-to-date and identical across locations. The data involved is generally the same but must be kept current across multiple systems.

Data Integration vs Data Synchronization: Example
• Data Integration - An organization integrates data from its sales, marketing, and customer service databases to create a comprehensive view of customer behavior.
• Data Synchronization - When you update a calendar event on your smartphone, the change is synchronized across your laptop, tablet, and other devices.

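To make the integration example concrete, the sketch below merges rows from three separate sources into one per-customer view. The field names (`customer_id`, `amount`, `campaign`, `status`) and the shape of the unified record are assumptions chosen for illustration; real sales, marketing, and support systems will differ.

```python
from collections import defaultdict


def build_customer_view(sales, marketing, support):
    """Combine rows from three hypothetical sources, keyed by customer_id."""
    view = defaultdict(
        lambda: {"total_spent": 0.0, "campaigns": [], "open_tickets": 0}
    )
    for row in sales:
        view[row["customer_id"]]["total_spent"] += row["amount"]
    for row in marketing:
        view[row["customer_id"]]["campaigns"].append(row["campaign"])
    for row in support:
        if row["status"] == "open":
            view[row["customer_id"]]["open_tickets"] += 1
    return dict(view)
```

Note the contrast with synchronization: the three inputs hold different data about the same customers, and the output is a new combined dataset rather than a copy of any one source.
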
Extract, Transform, Load (ETL)

ETL Examples
• The ETL process is typically automated through tools like Talend, Apache NiFi, and Microsoft SSIS, or cloud services like AWS Glue and Google Cloud Dataflow, which help streamline the process.
• Example
• A retail company may extract data from its point-of-sale system, online store, and customer support system.
• In the transformation phase, they clean and unify customer records, ensuring consistency in names, addresses, and purchase histories.
• Finally, they load this data into a data warehouse where the marketing and sales teams can run reports and analyses to understand customer behavior and improve targeting.

Extract, Transform, Load: Extract
• In the extraction phase, data is collected from different sources, such as databases, files, APIs, or cloud services. These sources can be structured (e.g., relational databases), semi-structured (e.g., JSON files), or unstructured (e.g., text files).
• Connect to the source systems.
• Extract the required data efficiently.
• Handle different formats (e.g., databases, CSVs, logs).
• Minimize disruption to the source systems by using optimized queries and scheduling.
• Example: Extracting customer data from multiple systems like a CRM, an ERP, and a marketing platform.

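As a rough illustration of handling different formats during extraction, the sketch below reads a CSV export and a JSON export into one common list of dictionaries, tagging each row with its origin. The inline sample data, the `source` tag, and the function names are assumptions for the example; a real pipeline would read from files, database connections, or APIs instead of string literals.

```python
import csv
import io
import json


def extract_csv(text: str, source: str) -> list[dict]:
    """Parse CSV text into dictionaries, recording which system it came from."""
    rows = csv.DictReader(io.StringIO(text))
    return [{**row, "source": source} for row in rows]


def extract_json(text: str, source: str) -> list[dict]:
    """Parse a JSON array of objects, recording which system it came from."""
    return [{**item, "source": source} for item in json.loads(text)]


# Hypothetical exports standing in for a CRM and an ERP system.
CRM_CSV = "customer_id,name\n1,Ada Lovelace\n2,Alan Turing\n"
ERP_JSON = '[{"customer_id": "2", "name": "Alan Turing"}]'

records = extract_csv(CRM_CSV, "crm") + extract_json(ERP_JSON, "erp")
```

Keeping the `source` tag on every row makes it possible to trace a record back to its system of origin during later cleaning steps.
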
Extract, Transform, Load: Transform
• The transformation phase is where the extracted data is cleaned, enriched, and formatted to meet the requirements of the target system. It often involves converting data types, resolving inconsistencies, or standardizing data.
• Data cleaning - Removing duplicates, handling missing data, correcting errors (e.g., misspelled names, wrong dates).
• Data transformation - Converting data into the required format, such as changing all dates to a standard format, applying business rules, or creating calculated fields.
• Data aggregation - Summarizing data or combining data from multiple sources into a unified form.
• Data enrichment - Adding new information to enhance the dataset (e.g., adding geo-location data based on an address).
• Example: Converting data from multiple systems into a common format, such as making all date fields consistent (e.g., MM/DD/YYYY), and removing any duplicate customer records.

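The date standardization and duplicate removal described in the example might look like the sketch below. The list of accepted input formats, the record fields (`name`, `email`, `signup_date`), and the rule that two rows are duplicates when name and email match case-insensitively are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed input formats; a real pipeline would list whatever its sources emit.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def normalize_date(raw: str) -> str:
    """Return the date as MM/DD/YYYY, trying each known input format in turn."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%m/%d/%Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def transform(records: list[dict]) -> list[dict]:
    """Standardize fields and drop duplicate customers (first occurrence wins)."""
    seen = set()
    cleaned = []
    for record in records:
        key = (record["name"].strip().lower(), record["email"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(
            {
                "name": record["name"].strip(),
                "email": record["email"].strip().lower(),
                "signup_date": normalize_date(record["signup_date"]),
            }
        )
    return cleaned
```

Raising on an unrecognized date, rather than silently passing it through, is a deliberate choice here so that bad input is caught before loading.
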
Extract, Transform, Load: Load
• The final step is loading the transformed data into the target system, usually a data warehouse, a database, or a data lake. Depending on the system's requirements and the business's needs, this can be done in bulk (all data at once) or incrementally (only new or changed records, typically at regular intervals).
• Ensure the target system is not overloaded during loading (especially in high-volume data scenarios).
• Schedule load times to minimize disruption to business operations.
• Apply indexing or partitioning to improve query performance in the data warehouse.
• Example: Loading the cleaned and standardized customer data into a data warehouse like Amazon Redshift or Google BigQuery, where it can be used for reporting and analysis.

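An incremental load can be sketched with Python's built-in `sqlite3` module standing in for the warehouse. SQLite is used here only because it needs no setup; the table layout and the `INSERT OR REPLACE` strategy are assumptions for the example, and warehouses such as Redshift or BigQuery have their own loading mechanisms.

```python
import sqlite3


def load(conn: sqlite3.Connection, rows: list[dict]) -> int:
    """Insert new customers and overwrite changed ones; return the table size."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers ("
        "customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL, signup_date TEXT)"
    )
    # An index on a commonly filtered column, as the slide suggests.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_customers_signup ON customers (signup_date)"
    )
    with conn:  # one transaction: either the whole batch loads or none of it
        conn.executemany(
            "INSERT OR REPLACE INTO customers (customer_id, name, signup_date) "
            "VALUES (:customer_id, :name, :signup_date)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
```

Running `load` a second time with a batch that contains one existing and one new `customer_id` updates the former and adds the latter, which is the incremental behavior described above.
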
Security issues in data integration: Data Breaches During Transfer
• Risk - Data can be intercepted by malicious actors during extraction or transfer between systems. In an unencrypted or improperly secured connection, sensitive information like personal data, financial records, or proprietary business information can be exposed.
• Mitigation - Use strong encryption protocols (e.g., TLS, the successor to the now-deprecated SSL) during data transfer, and secure network connections to prevent unauthorized access or eavesdropping.

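One way to enforce encrypted transfer from a Python extraction job is to build a TLS context that verifies certificates and refuses old protocol versions. The sketch below uses only the standard `ssl` module; the choice of TLS 1.2 as the minimum version is an assumption, and an organization's policy may require a stricter setting.

```python
import ssl


def make_tls_context() -> ssl.SSLContext:
    """Return a client context that verifies certificates and hostnames."""
    context = ssl.create_default_context()
    # Assumed policy: reject anything older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

The resulting context can be passed to calls that accept one, for example `urllib.request.urlopen(url, context=make_tls_context())`.
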
Security issues in data integration: Insecure APIs and Data Sources
• Risk - Many data integration processes rely on APIs to extract data from external systems. If these APIs are not properly secured (e.g., lack of authentication, outdated encryption), they can be an entry point for attackers to access or manipulate data.
• Mitigation - Secure APIs with robust authentication (e.g., OAuth, API keys), use secure API gateways, and regularly update API security protocols.

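On the client side of an integration job, authenticated API access might be sketched as follows. The function only builds the request and does not send it; the bearer-token scheme, the example URL, and the rule of rejecting non-HTTPS URLs are assumptions for illustration, since each API defines its own authentication method.

```python
import urllib.request
from urllib.parse import urlsplit


def build_request(url: str, token: str) -> urllib.request.Request:
    """Build an authenticated request, refusing to attach a token to plain HTTP."""
    if urlsplit(url).scheme != "https":
        raise ValueError("refusing to send credentials over a non-HTTPS URL")
    return urllib.request.Request(
        url,
        headers={
            "Authorization": f"Bearer {token}",  # assumed bearer-token scheme
            "Accept": "application/json",
        },
    )
```

In practice the token would come from a secret store or an environment variable rather than from source code, so that it does not end up in version control.
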
Security issues in data integration: Data Leakage
• Risk - Data leakage can occur when sensitive data is inadvertently exposed during the integration process. For example, if personally identifiable information (PII) is unintentionally included in a dataset that is publicly shared or transferred to an unauthorized system, it can lead to compliance violations or breaches.
• Mitigation - Implement strong access controls and data masking techniques to ensure sensitive data is only accessible by authorized personnel. Use role-based access control (RBAC) to restrict permissions.

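Data masking can be illustrated with a small function that hides sensitive fields before a record leaves the integration pipeline. The field names (`email`, `ssn`) and the masking rules are assumptions for the example; which fields count as sensitive, and how much of each may remain visible, depends on the applicable policy.

```python
def mask_email(email: str) -> str:
    """Keep the first character and the domain, e.g. a**@example.com."""
    local, _, domain = email.partition("@")
    if not domain:
        return "*" * len(email)
    return local[:1] + "*" * max(len(local) - 1, 0) + "@" + domain


def mask_all_but_last4(value: str) -> str:
    """Hide letters and digits except the last four characters."""
    if len(value) <= 4:
        return "*" * len(value)
    head, tail = value[:-4], value[-4:]
    return "".join("*" if ch.isalnum() else ch for ch in head) + tail


# Hypothetical mapping from sensitive field name to its masking rule.
MASKING_RULES = {"email": mask_email, "ssn": mask_all_but_last4}


def mask_record(record: dict) -> dict:
    """Return a copy of the record with every sensitive field masked."""
    return {
        key: MASKING_RULES[key](value) if key in MASKING_RULES else value
        for key, value in record.items()
    }
```

Because `mask_record` returns a copy, the unmasked original can stay inside the restricted system while only the masked version is shared.
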
Security issues in data integration: Inadequate Access Controls
• Risk - Without proper access controls, unauthorized users could gain access to integrated data, leading to potential data theft or manipulation. This is particularly problematic if sensitive data from multiple sources is merged and accessible in one location.
• Mitigation - Use multi-factor authentication (MFA) and granular permissions to control who has access to the integrated data. Ensure that access control policies are consistently applied across all systems and sources.

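Granular permissions combined with an MFA requirement can be sketched as a deny-by-default check. The role names, permission strings, and the set of permissions that require MFA are hypothetical; the `mfa_verified` flag is assumed to be set by whatever authentication system sits in front of the data.

```python
# Hypothetical role-to-permission mapping.
ROLE_PERMISSIONS = {
    "analyst": {"read:aggregates"},
    "data_engineer": {"read:aggregates", "read:raw", "write:warehouse"},
}

# Assumed policy: these permissions are only usable after MFA.
MFA_REQUIRED = {"read:raw", "write:warehouse"}


def is_allowed(roles: list[str], permission: str, mfa_verified: bool) -> bool:
    """Deny by default; grant only if a role has the permission and MFA rules pass."""
    granted = any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)
    if permission in MFA_REQUIRED and not mfa_verified:
        return False
    return granted
```

Unknown roles map to an empty permission set, so a typo in a role name results in denial rather than accidental access.
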
Security issues in data integration: Data Integrity Issues
• Risk - During the transformation phase, if data is altered or corrupted (intentionally or accidentally), it could lead to inaccurate reports and business decisions. Attackers might exploit vulnerabilities in the transformation process to inject malicious data.
• Mitigation - Implement data validation checks, ensure audit logs are maintained to track changes, and use checksums or hashes to verify data integrity throughout the ETL process.

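A checksum over a batch of rows can be computed before transfer and verified afterwards. The sketch below hashes a canonical JSON serialization with SHA-256; the serialization choice is an assumption for the example. An unkeyed hash like this detects accidental corruption, but on its own it does not stop an attacker who can alter both the data and the checksum, which would call for a keyed construction such as an HMAC.

```python
import hashlib
import hmac
import json


def checksum(rows: list[dict]) -> str:
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def verify(rows: list[dict], expected: str) -> bool:
    """Compare checksums in constant time."""
    return hmac.compare_digest(checksum(rows), expected)
```

A typical use is to record `checksum(rows)` at the end of the extract step and call `verify(rows, recorded)` just before loading.
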
Security issues in data integration: Compliance and Privacy Violations
• Risk - Data integration processes that involve PII, health data, or financial data (e.g., cardholder data covered by PCI DSS) may violate compliance regulations if handled improperly. For instance, transferring sensitive data between systems located in different countries might breach data residency or cross-border transfer rules, such as those in GDPR.
• Mitigation - Ensure compliance with local and international regulations by applying appropriate data handling, encryption, and anonymization techniques. Conduct regular audits to ensure compliance with standards like GDPR, CCPA, HIPAA, etc.

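One common data-handling technique is to replace direct identifiers with keyed hashes before data is integrated, so that records can still be joined without exposing the original value. The sketch below uses HMAC-SHA256 from the standard library; the normalization step and the key handling are assumptions for the example. Strictly speaking this is pseudonymization rather than anonymization: anyone holding the key can recompute the mapping, so the output is generally still treated as personal data.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash that is stable for the same key."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
```

Using a keyed hash rather than a plain one makes it harder to reverse the mapping by hashing guessed identifiers, provided the key is stored separately from the data.
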
