
Data Processing, Data Transformation and Data Analysis




Data processing
Data processing occurs when data is collected and translated into
usable information.
Data processing is usually performed by a data scientist or a team of
data scientists. It must be done correctly so as not to negatively
affect the end product, or data output.
Data processing starts with data in its raw form and converts it into a
more readable format (graphs, documents, etc.), giving it the form and
context necessary to be interpreted by computers and utilized by
employees throughout an organization.
Six stages of data processing
1. Data collection
2. Data preparation
3. Data input
4. Processing
5. Data output/interpretation
6. Data storage
1. Data collection
Collecting data is the first step in data processing.
Data is pulled from available sources, including data lakes and data
warehouses.
It is important that the data sources available are trustworthy and
well-built so the data collected (and later used as information) is of
the highest possible quality.
2. Data preparation
Once the data is collected, it then enters the data preparation stage.
Data preparation, often referred to as “pre-processing”, is the stage
at which raw data is cleaned up and organized for the following stage
of data processing.
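As a minimal sketch of what "cleaned up and organized" can mean in practice, the Python snippet below trims whitespace, normalizes case, drops incomplete records, and removes duplicates. The field names (name, email) and the sample records are hypothetical and only serve the illustration.

```python
def prepare(records):
    """Clean raw records: trim whitespace, lowercase emails,
    drop incomplete rows, and remove duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        name = (rec.get("name") or "").strip()
        email = (rec.get("email") or "").strip().lower()
        if not name or not email:
            continue  # drop records missing a required field
        if email in seen:
            continue  # drop duplicates, keyed on the normalized email
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned


# Hypothetical raw input with stray whitespace, a duplicate, and gaps.
raw = [
    {"name": "  Ada Lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},
    {"name": "", "email": "nobody@example.com"},
    {"name": "Alan Turing", "email": None},
]
clean = prepare(raw)
```

Only one record survives here: the duplicate and the two incomplete rows are dropped.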
3. Data input
The clean data is then entered into its destination (perhaps a CRM
like Salesforce or a data warehouse like Redshift) and translated into
a format that the destination system can understand.
Data input is the first stage in which raw data begins to take the
form of usable information.
4. Processing
During this stage, the data entered into the computer in the previous
stage is actually processed for interpretation.
Processing is often done using machine learning algorithms, though the
process itself may vary slightly depending on the source of the data being
processed (data lakes, social networks, connected devices, etc.).
5. Data output/interpretation
The output/interpretation stage is the stage at which data finally
becomes usable by non-data scientists.
It is translated, readable, and often in the form of graphs, videos,
images, plain text, etc.
6. Data storage
The final stage of data processing is storage.
After all of the data is processed, it is then stored for future use.
Data transformation
Data transformation is the process of changing the format, structure,
or values of data.
For data analytics projects, data may be transformed at two stages
of the data pipeline.
Organizations that use on-premises data warehouses generally use
an ETL (extract, transform, load) process, in which data
transformation is the middle step.
Most organizations use cloud-based data warehouses, which can
scale compute and storage resources with latency measured in
seconds or minutes.
The scalability of the cloud platform lets organizations skip preload
transformations and load raw data into the data warehouse, then
transform it at query time. This model is called ELT (extract, load,
transform).
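The difference between the two models can be sketched in Python. The example below uses the standard-library sqlite3 module as a stand-in for a data warehouse, which is an assumption made only to keep the sketch self-contained. The table names and sample rows are hypothetical.

```python
import sqlite3

# Hypothetical raw source rows: untrimmed names, amounts stored as text.
raw_rows = [("  alice ", "10.50"), ("BOB", "3")]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse

# ETL: transform in application code first, then load the cleaned rows.
conn.execute("CREATE TABLE sales_etl (customer TEXT, amount REAL)")
transformed = [(name.strip().lower(), float(amount)) for name, amount in raw_rows]
conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", transformed)

# ELT: load the raw rows unchanged, then transform at query time in SQL.
conn.execute("CREATE TABLE sales_raw (customer TEXT, amount TEXT)")
conn.executemany("INSERT INTO sales_raw VALUES (?, ?)", raw_rows)
elt_query = (
    "SELECT lower(trim(customer)), CAST(amount AS REAL) "
    "FROM sales_raw ORDER BY rowid"
)

etl_result = conn.execute(
    "SELECT customer, amount FROM sales_etl ORDER BY rowid"
).fetchall()
elt_result = conn.execute(elt_query).fetchall()
```

Both paths yield the same rows. What differs is where and when the transformation runs.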
Steps of data transformation
Step 1: Data interpretation
The first step in data transformation is interpreting your data to determine
which type of data you currently have, and what you need to transform it into.
Data interpretation can be harder than it looks.
As a simple example, consider the fact that many operating systems and
applications make assumptions about how data is formatted based on the
extension that is appended to a file name. Thus, your computer is likely to
assume that a file named video.avi is a video file, or that text.doc is a
Microsoft Word file, whether or not the contents actually match.
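One way to avoid relying on the extension alone is to inspect the first bytes of the file. The Python sketch below compares what a file name claims with what the content suggests. The signature table is deliberately tiny, and the function names are hypothetical.

```python
import os

# A few well-known file signatures ("magic bytes"); far from exhaustive.
SIGNATURES = {
    b"%PDF": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
}


def type_from_name(filename):
    """Return what the extension claims the file is."""
    return os.path.splitext(filename)[1].lstrip(".").lower()


def type_from_content(data):
    """Return what the leading bytes suggest, or None if unrecognized."""
    for magic, kind in SIGNATURES.items():
        if data.startswith(magic):
            return kind
    return None


# A file named like a Word document whose content is actually a PDF.
claimed = type_from_name("report.doc")
actual = type_from_content(b"%PDF-1.7 ...")
mismatch = actual is not None and actual != claimed
```

A mismatch like this one is a signal to interpret the data by its content rather than by its name.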
Step 2: Pre-translation data quality check
Once you have figured out which kind of data formats you are
working with and which forms you will transform data into, you
should run a data quality check on the data.
A data quality check allows you to identify problems, such as
missing or corrupt values within a database, in the source data that
could lead to problems during later steps of the data transformation
process.
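A pre-translation check of this kind might look like the following Python sketch, which counts missing values per required field and flags values that cannot be parsed as numbers. The field names and sample rows are hypothetical.

```python
def quality_report(rows, required, numeric):
    """Count missing required values and collect unparseable numeric values."""
    report = {"missing": {field: 0 for field in required}, "corrupt": []}
    for index, row in enumerate(rows):
        for field in required:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                report["missing"][field] += 1
        for field in numeric:
            value = row.get(field)
            if value in (None, ""):
                continue  # empty values are not treated as corrupt here
            try:
                float(value)
            except (TypeError, ValueError):
                report["corrupt"].append((index, field, value))
    return report


# Hypothetical source data with one missing amount, one missing id,
# and one amount that is not a number.
source = [
    {"id": "1", "amount": "19.99"},
    {"id": "2", "amount": ""},
    {"id": "", "amount": "abc"},
]
report = quality_report(source, required=["id", "amount"], numeric=["amount"])
```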
Step 3: Data translation
Once the quality of your source data has been checked and improved, you
can begin the process of actually translating data.
Data translation means taking each part of your source data and
replacing it with data that fits within the formatting requirements of
your target data format.
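As an illustration, the Python sketch below translates one source row into a target format with different field names, an ISO date, and numeric types. Both the source layout and the target layout are assumptions chosen for the example.

```python
from datetime import datetime


def translate(row):
    """Map one source row onto the (hypothetical) target format."""
    return {
        # Source ids are strings; the target expects integers.
        "order_id": int(row["id"]),
        # Source dates are DD/MM/YYYY; the target expects ISO 8601.
        "order_date": datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat(),
        # Source amounts are strings; the target expects floats.
        "amount": float(row["amount"]),
    }


translated = translate({"id": "7", "date": "31/12/2023", "amount": "19.99"})
```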
Step 4: Post-translation data quality check
In order to ensure that your translated data will be maximally
useful, you will also want to perform a data quality check.
In this step of the process, you look for inconsistencies, missing
information or other errors that may have been introduced during the
data translation process.
Even if your data was error-free before translation, there is a decent
chance that problems will have been introduced during translation.
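A post-translation check can be as simple as comparing row counts and confirming that required fields survived. The Python sketch below is one possible shape for such a check. The field name order_id and the sample data are hypothetical.

```python
def post_check(source_rows, translated_rows, required):
    """Return a list of human-readable problems found after translation."""
    problems = []
    if len(source_rows) != len(translated_rows):
        problems.append(
            f"row count changed: {len(source_rows)} -> {len(translated_rows)}"
        )
    for index, row in enumerate(translated_rows):
        for field in required:
            if row.get(field) is None:
                problems.append(f"row {index}: missing {field}")
    return problems


# Hypothetical case: one row was lost and one id was nulled in translation.
source = [{"id": "1"}, {"id": "2"}, {"id": "3"}]
translated = [{"order_id": 1}, {"order_id": None}]
problems = post_check(source, translated, required=["order_id"])
```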
Data transformation may be:
Constructive (adding, copying, and replicating data)
Destructive (deleting fields and records)
Aesthetic (standardizing salutations or street names)
Structural (renaming, moving, and combining columns in a
database)
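The four kinds of transformation can be shown on a single record. The Python sketch below applies one of each to a hypothetical customer record. The field names and the salutation table are assumptions.

```python
# Hypothetical customer record before transformation.
record = {"first": "ada", "last": "lovelace", "title": "ms", "temp_note": "call back"}

# Constructive: add a field derived from existing data.
record["full_name"] = f"{record['first']} {record['last']}".title()

# Destructive: delete a field that is no longer needed.
del record["temp_note"]

# Aesthetic: standardize a salutation.
SALUTATIONS = {"ms": "Ms.", "mr": "Mr.", "dr": "Dr."}
record["title"] = SALUTATIONS.get(record["title"].lower(), record["title"])

# Structural: rename a column.
record["surname"] = record.pop("last")
```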
Benefits of data transformation:
Data is transformed to make it better-organized. Transformed data
may be easier for both humans and computers to use.
Properly formatted and validated data improves data quality and
protects applications from potential landmines such as null values,
unexpected duplicates, incorrect indexing, and incompatible formats.
Data transformation facilitates compatibility between applications,
systems, and types of data.
Challenges of data transformation
Data transformation can be expensive. The cost is dependent on the
specific infrastructure, software, and tools used to process data.
Data transformation processes can be resource-intensive.
Lack of expertise and carelessness can introduce problems during
transformation.
Data analysts without appropriate subject matter expertise are less
likely to notice typos or incorrect data because they are less familiar
with the range of accurate and permissible values.
How to transform data?
The first phase of data transformations should include things like data
type conversion and flattening of hierarchical data.
These operations shape data to increase compatibility with analytics
systems.
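For instance, flattening hierarchical data and converting a type might look like the Python sketch below. The dotted key naming and the sample order are assumptions made for illustration.

```python
def flatten(nested, prefix=""):
    """Flatten nested dictionaries into one level with dotted keys."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical hierarchical record, as it might arrive from a JSON API.
order = {"id": "42", "customer": {"name": "Ada", "address": {"city": "London"}}}

flat = flatten(order)
flat["id"] = int(flat["id"])  # data type conversion: string to integer
```

The flat result maps directly onto the columns of a table, which is what most analytics systems expect.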
Data analysts and data scientists can then add further
transformations as needed, each as an individual layer of
processing.
Each layer of processing should be designed to perform a specific set
of tasks that meet a known business or technical requirement.
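One way to picture layered processing is as a list of small functions applied in order, each with a single job. The Python sketch below assumes three hypothetical layers. The business rules (dropping non-positive amounts, a 20% tax rate) are invented for the example.

```python
def convert_types(rows):
    """Layer 1: cast amounts from strings to floats."""
    return [{**row, "amount": float(row["amount"])} for row in rows]


def drop_refunds(rows):
    """Layer 2: hypothetical business rule, keep only positive amounts."""
    return [row for row in rows if row["amount"] > 0]


def add_tax(rows, rate=0.2):
    """Layer 3: hypothetical requirement, add a tax-inclusive total."""
    return [{**row, "total": round(row["amount"] * (1 + rate), 2)} for row in rows]


LAYERS = [convert_types, drop_refunds, add_tax]


def run_pipeline(rows, layers=LAYERS):
    """Apply each layer in order, feeding its output to the next."""
    for layer in layers:
        rows = layer(rows)
    return rows


result = run_pipeline([{"amount": "10"}, {"amount": "-5"}])
```

Because each layer does one thing, a new requirement can be met by adding or replacing a single function without touching the others.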
Data Analysis
