Data Engineering Assignment Report
Background
The healthcare analytics project applies data engineering techniques to process and
analyze the Health Insurance Marketplace Public Use Files, which contain data on health and
dental plans offered through the US Health Insurance Marketplace.
Objectives
DATA SOURCE
Dataset Information
The Health Insurance Marketplace Public Use Files, originally prepared and released by the Centers
for Medicare & Medicaid Services (CMS), contain data on health and dental plans offered to
individuals and small businesses through the US Health Insurance Marketplace.
The processed version of the dataset consists of the following six CSV files:
• BenefitsCostSharing.csv
• BusinessRules.csv
• Network.csv
• PlanAttributes.csv
• Rate.csv
• ServiceArea.csv
Installation Process:
1. Install Python:
• Ensure that Python is installed on your system.
• Apache Airflow 2.x requires Python 3; Python 2 is no longer supported, so use a
current Python 3 release.
2. Install Apache Airflow:
• You can install Apache Airflow using pip, a package installer for Python. Run the
following command:
Bash command
pip install apache-airflow
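A hedged alternative, following the official installation guide, is to pin dependency versions with a constraints file (the example assumes Airflow 2.1.2 and Python 3.8):
Bash command
pip install "apache-airflow==2.1.2" --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.1.2/constraints-3.8.txt"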
3. Initialize Airflow Database:
• Initialize the metadata database used by Airflow to store its configuration settings and
job metadata.
Bash command
airflow db init
4. Start the Airflow Web Server:
• Start the web server, which provides the Airflow user interface.
Bash command
airflow webserver --port 8080
5. Start the Scheduler:
• Start the scheduler, which orchestrates the execution of tasks defined in your DAGs.
Bash command
airflow scheduler
6. Access the Airflow UI:
• Open a web browser and navigate to http://localhost:8080 to access the Airflow UI.
Operating System:
• Apache Airflow runs on POSIX-compliant operating systems such as Linux and macOS; on
Windows it is typically run inside WSL 2 or a Docker container.
Python Virtual Environment (Optional but Recommended):
• It's good practice to create a virtual environment to isolate the dependencies of your project.
Bash command
python3 -m venv myenv
source myenv/bin/activate  # On Unix
Additional Tools (if needed for specific tasks):
• Database:
• If you plan to use a specific database backend (e.g., PostgreSQL, MySQL, SQLite), you
may need to install the corresponding database software and Python drivers.
• Additional Python Libraries:
• Depending on the specific data processing tasks in your ETL process, you may need to
install additional Python libraries. For example:
Bash command
pip install pandas numpy sqlalchemy
• Other Dependencies:
• Some tasks might require additional tools or libraries. Ensure you install them based
on your project requirements.
Docker Integration:
1. Install Docker:
• Ensure Docker is installed on your system. You can download and install Docker from the
official website: https://www.docker.com/
2. Create Dockerfile:
• Create a Dockerfile in your project directory to define the Docker image for your Apache
Airflow environment. An example Dockerfile might look like this:
FROM apache/airflow:2.1.2
USER root
RUN pip install pandas numpy sqlalchemy
USER airflow
3. Build Docker Image:
• In the same directory as your Dockerfile, run the following command to build the Docker
image:
Bash command
docker build -t my_airflow_image .
4. Run Apache Airflow in Docker Container:
• Start Apache Airflow within a Docker container using the built image:
Bash command
docker run -p 8080:8080 my_airflow_image
• Access the Airflow UI at http://localhost:8080.
Data Cleaning:
Handling Missing Values:
• Identify and handle any missing or null values in the dataset. Depending on the context and the
specific columns with missing values, you might choose to impute them using statistical measures or
remove rows/columns.
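As an illustrative sketch (the column names below are placeholders, not necessarily the ones used in the assignment), missing values can be inspected and imputed with pandas:
Python example
import pandas as pd
import numpy as np

# Small frame standing in for one of the marketplace CSVs.
df = pd.DataFrame({
    "IndividualRate": [30.0, np.nan, 45.5, 28.0],
    "StateCode": ["AK", "AL", None, "AZ"],
})

# Inspect missingness per column before choosing a strategy.
print(df.isnull().sum())

# Impute the numeric column with its median and the categorical column
# with an explicit "Unknown" label; heavily incomplete rows or columns
# could instead be dropped with dropna().
df["IndividualRate"] = df["IndividualRate"].fillna(df["IndividualRate"].median())
df["StateCode"] = df["StateCode"].fillna("Unknown")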
Outlier Detection and Removal:
• Identify and handle outliers that might adversely affect the analysis. This could involve using statistical
methods or domain knowledge to define and remove outliers.
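For example, a simple interquartile-range (IQR) rule can flag extreme values; the numbers below are made up purely for illustration:
Python example
import pandas as pd

# Illustrative premium values; a real run would load them from Rate.csv.
rates = pd.Series([28.0, 30.0, 31.5, 29.0, 950.0])

# Keep values within 1.5 * IQR of the middle 50% (a common rule of thumb).
q1, q3 = rates.quantile(0.25), rates.quantile(0.75)
iqr = q3 - q1
in_range = rates.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(rates[~in_range])   # review candidate outliers before removing them
clean_rates = rates[in_range]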
Normalization:
Scaling Numerical Features:
• Normalize numerical features to ensure they are on a similar scale. Techniques like Min-Max scaling or
Z-score normalization can be applied to prevent certain features from dominating others in the
analysis.
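A minimal sketch of both techniques using pandas (the column name is illustrative):
Python example
import pandas as pd

df = pd.DataFrame({"IndividualRate": [28.0, 30.0, 45.5, 60.0]})
col = df["IndividualRate"]

# Min-Max scaling maps the column onto the [0, 1] interval.
df["rate_minmax"] = (col - col.min()) / (col.max() - col.min())

# Z-score normalization centers the column at 0 with unit standard deviation.
df["rate_zscore"] = (col - col.mean()) / col.std()

print(df)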
Categorical Variable Encoding:
• Encode categorical variables using techniques such as one-hot encoding or label encoding. This is
crucial for machine learning models and ensures that categorical variables are represented in a format
suitable for analysis.
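Both encodings can be produced directly in pandas; the MetalLevel column below is used purely as an illustration:
Python example
import pandas as pd

df = pd.DataFrame({"MetalLevel": ["Bronze", "Silver", "Gold", "Silver"]})

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["MetalLevel"], prefix="MetalLevel")

# Label encoding: map each category to an integer code instead.
df["MetalLevel_code"] = df["MetalLevel"].astype("category").cat.codes

print(pd.concat([df, one_hot], axis=1))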
Logging:
• Implement logging within Python scripts to capture key events, errors, or any other relevant
information. This aids in troubleshooting and monitoring the ETL process.
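A minimal logging setup of the kind described (the logger name and transformation are placeholders) might look like:
Python example
import logging

# Configure logging once, near the top of each ETL script.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger("etl.rates")

def transform(rows):
    logger.info("Transforming %d rows", len(rows))
    try:
        return [r.upper() for r in rows]
    except Exception:
        # logger.exception records the full traceback for troubleshooting.
        logger.exception("Transformation failed")
        raise

transform(["bronze", "silver"])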
Monitoring:
• Set up monitoring tools within Apache Airflow to track the progress of the workflow. This
includes checking for successful task completion, identifying failures, and triggering alerts if
needed.
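Within Airflow itself, simple failure monitoring can be attached to every task through the DAG's default arguments; the callback and e-mail address below are placeholders:
Python example
from datetime import timedelta

def notify_failure(context):
    # Airflow calls this when a task fails; context carries task metadata.
    print(f"Task failed: {context['task_instance'].task_id}")

default_args = {
    "owner": "data-engineering",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    "email": ["alerts@example.com"],    # placeholder address
    "email_on_failure": True,
    "on_failure_callback": notify_failure,
}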
APACHE AIRFLOW
Apache Airflow is an open-source platform for orchestrating complex workflows as Directed Acyclic
Graphs (DAGs). It's widely used for managing and scheduling ETL (Extract, Transform, Load)
processes. Below are some basics of Apache Airflow specific to ETL processes:
Deploying and Running a DAG:
• Save the DAG script (a minimal sketch is shown after this list) and place it in the DAGs
directory configured in your Airflow installation.
• Airflow will automatically detect and schedule the DAG based on the defined
schedule_interval.
• Monitor the progress and logs through the Airflow web UI (http://localhost:8080 by
default).
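A minimal DAG script of the kind referred to above might look like the sketch below; the DAG id, schedule, file path, and task logic are illustrative placeholders rather than the assignment's actual pipeline:
Python example
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Placeholder extract step: read one of the marketplace CSVs.
    return pd.read_csv("/opt/airflow/data/Rate.csv").shape[0]

def report(ti):
    # Pull the row count produced by the extract task via XCom.
    rows = ti.xcom_pull(task_ids="extract")
    print(f"Extracted {rows} rows")

with DAG(
    dag_id="marketplace_etl",            # placeholder DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    report_task = PythonOperator(task_id="report", python_callable=report)

    extract_task >> report_task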
LEARNING OUTCOMES
1. Understanding of ETL Processes:
• Gain a deep understanding of Extract, Transform, Load (ETL) processes and the
importance of orchestrating workflows in data engineering.
2. Apache Airflow Proficiency:
• Acquire proficiency in using Apache Airflow as a workflow orchestration tool.
• Understand how to define Directed Acyclic Graphs (DAGs) for complex workflows.
3. Python Scripting for Data Processing:
• Develop skills in writing Python scripts for data processing tasks within the Apache
Airflow framework.
• Learn to use Python operators for various data manipulation operations.
4. Data Quality Checks:
• Implement data quality checks to ensure the accuracy and reliability of the processed
data.
5. Framework Integration:
• Gain experience in integrating Apache Airflow with other frameworks, tools, or services
to enhance the overall ETL pipeline.
6. Docker Integration:
• Learn how to use Docker for containerization, allowing for consistent and reproducible
environments.
7. Version Control:
• Utilize version control tools (e.g., Git) to manage code changes and collaborate
effectively with team members.
8. Documentation Skills:
• Practice creating comprehensive project documentation, including README files and
reports, to communicate the project details effectively.
9. Project Management:
• Gain project management skills by organizing and managing the development,
testing, and deployment phases of the ETL project.
10. Troubleshooting and Debugging:
• Develop skills in troubleshooting and debugging Apache Airflow workflows, Python
scripts, and any issues that arise during the ETL process.
11. Collaboration and Communication:
• Enhance collaboration and communication skills through interactions with team
members, stakeholders, and the broader data engineering community.
12. Practical Experience:
• Acquire hands-on, practical experience in building end-to-end ETL workflows,
providing a real-world understanding of data engineering challenges and solutions.