
josephmachado/de_project


Build a data engineering project, with step-by-step instructions

Data used

Let's assume we are working with a car part seller database (TPC-H). The data is available in a DuckDB database. See the data model below:

TPC-H data model

We can create fake input data using the create_input_data.py script.
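For reference, here is a minimal sketch of how fake TPC-H data can be generated with DuckDB's built-in tpch extension. The repo's create_input_data.py may do this differently, and the database filename here is an assumption:

import duckdb

# Connect to (or create) the DuckDB database file.
# "tpch.db" is an assumed filename; check create_input_data.py for the real one.
con = duckdb.connect("tpch.db")

# DuckDB ships a tpch extension that can generate the benchmark tables.
con.execute("INSTALL tpch; LOAD tpch;")
con.execute("CALL dbgen(sf=0.01)")  # small scale factor for local runs

# The standard TPC-H tables (customer, orders, lineitem, ...) now exist.
print(con.execute("SHOW TABLES").fetchall())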

Architecture

Most data teams have their own version of the 3-hop architecture. For example, dbt has its version (stage, intermediate, mart), and the Spark ecosystem has the medallion (bronze, silver, gold) architecture. A sketch of the three hops follows the data flow diagram below.

Data Flow
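As an illustration, here is a minimal sketch of the three hops using Polars on the TPC-H customer table. The file paths and the exact transformations are assumptions, not the repo's actual pipeline:

import polars as pl

# Stage/bronze: read the raw input as-is (path is hypothetical)
bronze = pl.read_parquet("data/raw/customer.parquet")

# Intermediate/silver: clean and filter
silver = bronze.with_columns(
    pl.col("c_name").str.strip_chars()
).filter(pl.col("c_acctbal") >= 0)

# Mart/gold: aggregate for end users
gold = silver.group_by("c_nationkey").agg(
    pl.col("c_acctbal").mean().alias("avg_acct_bal")
)
gold.write_parquet("data/mart/customer_by_nation.parquet")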

Tools used:

  1. Polars
  2. Docker
  3. Apache Airflow
  4. Pytest
  5. DuckDB

Setup

You have two options to run the exercises in this repo.

Option 1: GitHub Codespaces (Recommended)

Steps:

  1. Create GitHub Codespaces with this link.
  2. Wait for GitHub to install requirements.txt. This step can take about 5 minutes.
  3. Open setup-data-project.ipynb; it will open in a Jupyter notebook interface. When asked for a kernel choice, choose Python Environments and then python3.12.00 Global.
  4. Work through the setup-data-project notebook, which goes over how to create a data pipeline.
  5. In the terminal, run the following commands to set up input data, run the ETL pipeline, and run tests.
# setup input data
python ./setup/create_input_data.py
# run pipeline
python dags/run_pipeline.py
# run tests
python -m pytest dags/tests/unit/test_dim_customer.py
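For context, dags/run_pipeline.py is the pipeline's entry point. A simplified, plain-Python sketch of what such a script might do is below; the actual table names, transformations, output paths, and any Airflow orchestration are assumptions:

import duckdb
import polars as pl

def extract(con: duckdb.DuckDBPyConnection) -> pl.DataFrame:
    # Pull the raw customer table from DuckDB into Polars
    return con.execute("SELECT * FROM customer").pl()

def transform(df: pl.DataFrame) -> pl.DataFrame:
    # Example transformation: count customers per nation
    return df.group_by("c_nationkey").agg(pl.len().alias("num_customers"))

def load(df: pl.DataFrame) -> None:
    df.write_parquet("dim_customer.parquet")  # hypothetical output path

if __name__ == "__main__":
    con = duckdb.connect("tpch.db")  # assumed database file
    load(transform(extract(con)))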

Option 2: Run locally

Steps:

  1. Clone this repo and cd into the cloned repo.
  2. Start a virtual env and install requirements.
  3. Start Jupyter Lab and run the setup-data-project.ipynb notebook, which goes over how to create a data pipeline.
git clone https://github.com/josephmachado/de_project.git
cd de_project
rm -rf env # remove any existing virtual env
python -m venv ./env # create a virtual env
source env/bin/activate # use virtual environment
pip install -r requirements.txt # install dependencies
jupyter lab
  4. In the terminal, run the following commands to set up input data, run the ETL pipeline, and run tests.
# setup input data
python ./setup/create_input_data.py
# run pipeline
python dags/run_pipeline.py
# run tests
python -m pytest dags/tests/unit/test_dim_customer.py
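To give a feel for the test layer, a unit test for a dimension table might look like the sketch below. This is illustrative only; the create_dim_customer function and its join logic are hypothetical, and the repo's actual test_dim_customer.py asserts against its own transformation code:

import polars as pl

def create_dim_customer(customer: pl.DataFrame, nation: pl.DataFrame) -> pl.DataFrame:
    # Hypothetical transformation: enrich customer with its nation
    return customer.join(nation, left_on="c_nationkey", right_on="n_nationkey")

def test_dim_customer_has_nation_name():
    # Small in-memory inputs keep the unit test fast and deterministic
    customer = pl.DataFrame({"c_custkey": [1], "c_nationkey": [7]})
    nation = pl.DataFrame({"n_nationkey": [7], "n_name": ["GERMANY"]})
    result = create_dim_customer(customer, nation)
    assert result["n_name"].to_list() == ["GERMANY"]

Run it with python -m pytest as shown above.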