Data processing involves collecting raw data and converting it into a usable format through six stages: data collection, preparation, input, processing, output/interpretation, and storage. Data transformation changes the structure or values of data and generally occurs in two places in the data pipeline: during an ETL process with on-premises warehouses, or at query time with cloud warehouses using an ELT process. It involves interpreting, checking, translating, and re-checking the data through multiple steps to ensure high-quality transformed data. Both data processing and transformation aim to organize data for easier human and computer use and to facilitate compatibility across systems.
Data Processing, Data Transformation, and Data Analysis
Data processing

Data processing occurs when data is collected and translated into usable information. Usually performed by a data scientist or team of data scientists, it is important for data processing to be done correctly so as not to negatively affect the end product, or data output.

Data processing starts with data in its raw form and converts it into a more readable format (graphs, documents, etc.), giving it the form and context necessary to be interpreted by computers and utilized by employees throughout an organization.

Six stages of data processing

1. Data collection
2. Data preparation
3. Data input
4. Processing
5. Data output/interpretation
6. Data storage

1. Data collection
Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses. It is important that the data sources available are trustworthy and well built so the data collected (and later used as information) is of the highest possible quality.

2. Data preparation
Once the data is collected, it then enters the data preparation stage. Data preparation, often referred to as “pre-processing,” is the stage at which raw data is cleaned up and organized for the following stage of data processing.

3. Data input
The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse like Redshift) and translated into a language that it can understand. Data input is the first stage in which raw data begins to take the form of usable information.

4. Processing
During this stage, the data entered in the previous stage is actually processed for interpretation. Processing is done using machine learning algorithms, though the process itself may vary slightly depending on the source of the data being processed (data lakes, social networks, connected devices, etc.).

5. Data output/interpretation
The output/interpretation stage is the stage at which data is finally usable by non-data scientists. It is translated, readable, and often presented in the form of graphs, videos, images, plain text, etc.

6. Data storage
The final stage of data processing is storage. After all of the data is processed, it is then stored for future use.

Data transformation

Data transformation is the process of changing the format, structure, or values of data. For data analytics projects, data may be transformed at two stages of the data pipeline. Organizations that use on-premises data warehouses generally use an ETL (extract, transform, load) process, in which data transformation is the middle step.

Most organizations use cloud-based data warehouses, which can scale compute and storage resources with latency measured in seconds or minutes. The scalability of the cloud platform lets organizations skip preload transformations and load raw data into the data warehouse, then transform it at query time, a model called ELT (extract, load, transform).

Steps of data transformation

Step 1: Data interpretation
The first step in data transformation is interpreting your data to determine which type of data you currently have and what you need to transform it into. Data interpretation can be harder than it looks. As a simple example, consider the fact that many operating systems and applications make assumptions about how data is formatted based on the extension appended to a file name. Thus, your computer is likely to assume that a file named video.avi is a video file, or that text.doc is a Microsoft Word file.
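To make the data-interpretation step concrete, here is a minimal, hypothetical sketch (the helper name and the example file name are illustrative assumptions, not part of the original material) that inspects a file's contents rather than trusting its extension, distinguishing JSON from delimited text with Python's standard library.

```python
import csv


def detect_format(path: str) -> str:
    """Guess whether a file holds JSON or delimited text by inspecting its
    contents instead of trusting the file extension."""
    with open(path, "r", encoding="utf-8") as f:
        sample = f.read(4096)

    first_char = sample.lstrip()[:1]
    if first_char in ("{", "["):
        return "json"  # looks like a JSON object or array

    try:
        dialect = csv.Sniffer().sniff(sample)  # detect the delimiter: ',', '\t', ';', ...
        return f"delimited (sep={dialect.delimiter!r})"
    except csv.Error:
        return "unknown"


# Hypothetical usage:
# print(detect_format("orders_export.dat"))  # e.g. "delimited (sep=',')"
```

Knowing which format the source data is actually in then determines which target representation and translation rules apply in the later steps.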
Step 2: Pre-translation data quality check
Once you have figured out which kinds of data formats you are working with and which forms you will transform the data into, you should run a data quality check on the data. A data quality check allows you to identify problems in the source data, such as missing or corrupt values within a database, that could lead to problems during later steps of the data transformation process. (A small quality-check sketch appears at the end of this section.)

Step 3: Data translation
After the quality of your source data has been maximized, you can begin the process of actually translating data. Data translation means taking each part of your source data and replacing it with data that fits within the formatting requirements of your target data format.

Step 4: Post-translation data quality check
In order to ensure that your translated data will be maximally useful, you will also want to perform a data quality check. In this step of the process, you look for inconsistencies, missing information, or other errors that may have been introduced during the data translation process. Even if your data was error-free before translation, there is a decent chance that problems will have been introduced during translation.

Data transformation may be:
1. Constructive (adding, copying, and replicating data)
2. Destructive (deleting fields and records)
3. Aesthetic (standardizing salutations or street names)
4. Structural (renaming, moving, and combining columns in a database)

Benefits of data transformation
1. Data is transformed to make it better organized. Transformed data may be easier for both humans and computers to use.
2. Properly formatted and validated data improves data quality and protects applications from potential landmines such as null values, unexpected duplicates, incorrect indexing, and incompatible formats.
3. Data transformation facilitates compatibility between applications, systems, and types of data.

Challenges of data transformation
1. Data transformation can be expensive. The cost depends on the specific infrastructure, software, and tools used to process data.
2. Data transformation processes can be resource-intensive.
3. Lack of expertise and carelessness can introduce problems during transformation. Data analysts without appropriate subject matter expertise are less likely to notice typos or incorrect data because they are less familiar with the range of accurate and permissible values.

How to transform data?
The first phase of data transformation should include operations such as data type conversion and flattening of hierarchical data. These operations shape data to increase compatibility with analytics systems. Data analysts and data scientists can then implement further transformations additively, as necessary, as individual layers of processing. Each layer of processing should be designed to perform a specific set of tasks that meet a known business or technical requirement. (A minimal sketch of these first-phase operations also appears at the end of this section.)

Data Analysis
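To ground the pre- and post-translation quality checks described in Steps 2 and 4 above, here is a small, hedged sketch (the DataFrame, column names, and allowed ranges are invented for illustration) that profiles a pandas DataFrame for the kinds of problems the text mentions: missing values, duplicate rows, and out-of-range values. The same report can be run on the source data before translation and again on the translated output.

```python
import pandas as pd


def quality_report(df: pd.DataFrame, allowed_ranges: dict) -> dict:
    """Summarize common data quality problems: missing values, duplicate rows,
    and values falling outside an allowed (min, max) range per column."""
    report = {
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "out_of_range": {},
    }
    for column, (lo, hi) in allowed_ranges.items():
        out_of_range = df[(df[column] < lo) | (df[column] > hi)]
        report["out_of_range"][column] = len(out_of_range)
    return report


# Hypothetical usage on a tiny sample:
orders = pd.DataFrame({"order_id": [1, 2, 2, 4], "quantity": [3, -1, -1, None]})
print(quality_report(orders, allowed_ranges={"quantity": (0, 100)}))
```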
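Returning to the "How to transform data?" point above, this second hypothetical sketch (the record layout and field names are assumptions) shows the two first-phase operations the text names, flattening hierarchical data and converting data types, using pandas.

```python
import pandas as pd

# Hypothetical nested records, e.g. as loaded from a JSON API or a document store.
records = [
    {"id": "1", "customer": {"name": "Ada", "city": "London"}, "total": "19.99"},
    {"id": "2", "customer": {"name": "Lin", "city": "Taipei"}, "total": "5.00"},
]

# Flatten the hierarchy: nested keys become columns such as "customer.name".
df = pd.json_normalize(records)

# Data type conversion: ids and totals arrive as strings, so cast them explicitly.
df["id"] = df["id"].astype(int)
df["total"] = pd.to_numeric(df["total"])

print(df.dtypes)
print(df)
```

Further transformations can then be added as separate layers on top of this flattened table, each serving a specific business or technical requirement.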