Data processing involves collecting raw data and converting it into a usable format through six stages: data collection, preparation, input, processing, output/interpretation, and storage. Data transformation changes the structure or values of data and generally occurs in two places in the data pipeline: during an ETL process with on-premises warehouses, or at query time with cloud warehouses using an ELT process. It involves interpreting, checking, translating, and re-checking the data through multiple steps to ensure high-quality transformed data. Both data processing and transformation aim to organize data for easier human and computer use and to facilitate compatibility across systems.
Data Processing, Data Transformation, and Data Analysis
Data processing

Data processing occurs when data is collected and translated into usable information. Usually performed by a data scientist or team of data scientists, it is important for data processing to be done correctly so as not to negatively affect the end product, or data output.

Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization.

Six stages of data processing

1. Data collection
2. Data preparation
3. Data input
4. Processing
5. Data output/interpretation
6. Data storage

1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. It is important that the data sources available are trustworthy and well built so the data collected (and later used as information) is of the highest possible quality.

2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to as “pre-processing,” is the stage at which raw data is cleaned up and organized for the following stage of data processing.

3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse like Redshift) and translated into a language that it can understand. Data input is the first stage in which raw data begins to take the form of usable information.

4. Processing
During this stage, the data entered in the previous stage is actually processed for interpretation. Processing is done using machine learning algorithms, though the process itself may vary slightly depending on the source of the data being processed (data lakes, social networks, connected devices, etc.).

5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable by non-data scientists. It is translated, readable, and often presented in the form of graphs, videos, images, plain text, etc.

6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for future use.

Data transformation

Data transformation is the process of changing the format, structure, or values of data. For data analytics projects, data may be transformed at two stages of the data pipeline. Organizations that use on-premises data warehouses generally use an ETL (extract, transform, load) process, in which data transformation is the middle step.

Most organizations use cloud-based data warehouses, which can scale compute and storage resources with latency measured in seconds or minutes. The scalability of the cloud platform lets organizations skip preload transformations and load raw data into the data warehouse, then transform it at query time, a model called ELT (extract, load, transform).

Steps of data transformation

Step 1: Data interpretation
The first step in data transformation is interpreting your data to determine which type of data you currently have and what you need to transform it into. Data interpretation can be harder than it looks. As a simple example, consider the fact that many operating systems and applications make assumptions about how data is formatted based on the extension appended to a file name. Thus, your computer is likely to assume that a file named video.avi is a video file, or that text.doc is a Microsoft Word file.
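To make the data-interpretation step concrete, here is a minimal, hypothetical sketch (the helper name and the example file name are illustrative assumptions, not part of the original material) that inspects a file's contents rather than trusting its extension, distinguishing JSON from delimited text with Python's standard library.

```python
import csv


def detect_format(path: str) -> str:
    """Guess whether a file holds JSON or delimited text by inspecting its
    contents instead of trusting the file extension."""
    with open(path, "r", encoding="utf-8") as f:
        sample = f.read(4096)

    first_char = sample.lstrip()[:1]
    if first_char in ("{", "["):
        return "json"  # looks like a JSON object or array

    try:
        dialect = csv.Sniffer().sniff(sample)  # detect the delimiter: ',', '\t', ';', ...
        return f"delimited (sep={dialect.delimiter!r})"
    except csv.Error:
        return "unknown"


# Hypothetical usage:
# print(detect_format("orders_export.dat"))  # e.g. "delimited (sep=',')"
```

Knowing which format the source data is actually in then determines which target representation and translation rules apply in the later steps.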
Step 2: Pre-translation data quality check
Once you have figured out which kinds of data formats you are working with and which forms you will transform the data into, you should run a data quality check on the data. A data quality check allows you to identify problems in the source data, such as missing or corrupt values within a database, that could lead to problems during later steps of the data transformation process. (A small quality-check sketch appears at the end of this section.)

Step 3: Data translation
After the quality of your source data has been maximized, you can begin the process of actually translating data. Data translation means taking each part of your source data and replacing it with data that fits within the formatting requirements of your target data format.

Step 4: Post-translation data quality check
In order to ensure that your translated data will be maximally useful, you will also want to perform a data quality check. In this step of the process, you look for inconsistencies, missing information, or other errors that may have been introduced during the data translation process. Even if your data was error-free before translation, there is a decent chance that problems will have been introduced during translation.

Data transformation may be:
1. Constructive (adding, copying, and replicating data)
2. Destructive (deleting fields and records)
3. Aesthetic (standardizing salutations or street names)
4. Structural (renaming, moving, and combining columns in a database)

Benefits of data transformation
1. Data is transformed to make it better organized. Transformed data may be easier for both humans and computers to use.
2. Properly formatted and validated data improves data quality and protects applications from potential landmines such as null values, unexpected duplicates, incorrect indexing, and incompatible formats.
3. Data transformation facilitates compatibility between applications, systems, and types of data.

Challenges of data transformation
1. Data transformation can be expensive. The cost depends on the specific infrastructure, software, and tools used to process data.
2. Data transformation processes can be resource-intensive.
3. Lack of expertise and carelessness can introduce problems during transformation. Data analysts without appropriate subject matter expertise are less likely to notice typos or incorrect data because they are less familiar with the range of accurate and permissible values.

How to transform data?
The first phase of data transformation should include operations such as data type conversion and flattening of hierarchical data. These operations shape data to increase compatibility with analytics systems. Data analysts and data scientists can then implement further transformations additively, as necessary, as individual layers of processing. Each layer of processing should be designed to perform a specific set of tasks that meet a known business or technical requirement. (A minimal sketch of these first-phase operations also appears at the end of this section.)

Data Analysis
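To ground the pre- and post-translation quality checks described in Steps 2 and 4 above, here is a small, hedged sketch (the DataFrame, column names, and allowed ranges are invented for illustration) that profiles a pandas DataFrame for the kinds of problems the text mentions: missing values, duplicate rows, and out-of-range values. The same report can be run on the source data before translation and again on the translated output.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, allowed_ranges: dict) -> dict:
    """Summarize common data quality problems: missing values, duplicate rows,
    and values falling outside an allowed (min, max) range per column."""
    report = {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "out_of_range": {},
    }
    for column, (lo, hi) in allowed_ranges.items():
        out_of_range = df[(df[column] < lo) | (df[column] > hi)]
        report["out_of_range"][column] = len(out_of_range)
    return report


# Hypothetical usage on a tiny sample:
orders = pd.DataFrame({"order_id": [1, 2, 2, 4], "quantity": [3, -1, -1, None]})
print(quality_report(orders, allowed_ranges={"quantity": (0, 100)}))
```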
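Returning to the "How to transform data?" point above, this second hypothetical sketch (the record layout and field names are assumptions) shows the two first-phase operations the text names, flattening hierarchical data and converting data types, using pandas.

```python
import pandas as pd

# Hypothetical nested records, e.g. as loaded from a JSON API or a document store.
records = [
    {"id": "1", "customer": {"name": "Ada", "city": "London"}, "total": "19.99"},
    {"id": "2", "customer": {"name": "Lin", "city": "Taipei"}, "total": "5.00"},
]

# Flatten the hierarchy: nested keys become columns such as "customer.name".
df = pd.json_normalize(records)

# Data type conversion: ids and totals arrive as strings, so cast them explicitly.
df["id"] = df["id"].astype(int)
df["total"] = pd.to_numeric(df["total"])

print(df.dtypes)
print(df)
```

Further transformations can then be added as separate layers on top of this flattened table, each serving a specific business or technical requirement.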