- Code for the blog: Build data engineering projects with step-by-step instructions
- Live workshop link
Let's assume we are working with a car part seller database (TPC-H). The data is available in a DuckDB database. See the data model below:
We can create fake input data using the setup/create_input_data.py script.
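For intuition, here is a minimal sketch of one way such a database could be generated, using DuckDB's built-in tpch extension; the actual setup/create_input_data.py script may build the data differently:

```python
import duckdb

# Minimal sketch, assuming DuckDB's tpch extension is available.
# The repo's setup/create_input_data.py may generate the data differently.
con = duckdb.connect("tpch.db")     # file-backed DuckDB database
con.execute("INSTALL tpch")         # fetch the TPC-H data generator extension
con.execute("LOAD tpch")
con.execute("CALL dbgen(sf=0.1)")   # small scale factor keeps the data tiny
print(con.execute("SHOW TABLES").fetchall())
con.close()
```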
Most data teams have their own version of the 3-hop architecture. For example, dbt has its own version (stage, intermediate, mart), and Spark has the medallion (bronze, silver, gold) architecture.
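To make the three hops concrete, here is a minimal sketch in DuckDB SQL run from Python. The table and column names (stg_customer, int_customer_nation, dim_customer) are hypothetical illustrations, not the repo's actual models:

```python
import duckdb

con = duckdb.connect("tpch.db")

# Hop 1 (stage / bronze): raw data, lightly cleaned and renamed.
con.execute("""
    CREATE OR REPLACE TABLE stg_customer AS
    SELECT c_custkey, trim(c_name) AS c_name, c_nationkey
    FROM customer
""")

# Hop 2 (intermediate / silver): business logic such as joins and enrichment.
con.execute("""
    CREATE OR REPLACE TABLE int_customer_nation AS
    SELECT s.c_custkey, s.c_name, n.n_name AS nation_name
    FROM stg_customer s
    JOIN nation n ON s.c_nationkey = n.n_nationkey
""")

# Hop 3 (mart / gold): analytics-ready tables for end users.
con.execute("""
    CREATE OR REPLACE TABLE dim_customer AS
    SELECT c_custkey AS customer_key, c_name AS customer_name, nation_name
    FROM int_customer_nation
""")
con.close()
```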
Tools used: Python, DuckDB, Jupyter Lab, and pytest (installed via requirements.txt).
You have two options to run the exercises in this repo: GitHub Codespaces or your local machine.
Option 1: GitHub Codespaces

Steps:
- Create a GitHub Codespace with this link.
- Wait for GitHub to install the packages from requirements.txt. This step can take about 5 minutes.
- Now open the setup-data-project.ipynb file and it will open in a Jupyter notebook interface. When asked for your kernel choice, choose Python Environments and then python3.12.00 Global.
- The setup-data-project notebook goes over how to create a data pipeline.
- In the terminal, run the following commands to set up input data, run the ETL pipeline, and run the tests.
```sh
# setup input data
python ./setup/create_input_data.py

# run pipeline
python dags/run_pipeline.py

# run tests
python -m pytest dags/tests/unit/test_dim_customer.py
```
Option 2: Run locally

Steps:
- Clone this repo and cd into the cloned directory.
- Start a virtual env and install requirements.
- Start Jupyter Lab and run the setup-data-project.ipynb notebook, which goes over how to create a data pipeline.
```sh
git clone https://github.com/josephmachado/de_project.git
cd de_project
rm -rf env
python -m venv ./env # create a virtual env
source env/bin/activate # use virtual environment
pip install -r requirements.txt
jupyter lab
```
- In the terminal, run the following commands to set up input data, run the ETL pipeline, and run the tests.
```sh
# setup input data
python ./setup/create_input_data.py

# run pipeline
python dags/run_pipeline.py

# run tests
python -m pytest dags/tests/unit/test_dim_customer.py
```
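For reference, a unit test such as test_dim_customer.py typically builds a tiny in-memory fixture and asserts on the transformed output. The sketch below is hypothetical (the repo's actual test and dim_customer logic will differ) but shows the pattern:

```python
import duckdb

def dim_customer_sql() -> str:
    # Hypothetical transformation: denormalize the nation name
    # into the customer dimension.
    return """
        SELECT c.c_custkey AS customer_key,
               c.c_name    AS customer_name,
               n.n_name    AS nation_name
        FROM customer c
        JOIN nation n ON c.c_nationkey = n.n_nationkey
    """

def test_dim_customer_joins_nation_name():
    con = duckdb.connect()  # in-memory DuckDB acts as the test fixture
    con.execute("CREATE TABLE nation (n_nationkey INT, n_name VARCHAR)")
    con.execute("INSERT INTO nation VALUES (1, 'FRANCE')")
    con.execute(
        "CREATE TABLE customer (c_custkey INT, c_name VARCHAR, c_nationkey INT)"
    )
    con.execute("INSERT INTO customer VALUES (42, 'Customer#42', 1)")

    result = con.execute(dim_customer_sql()).fetchall()

    assert result == [(42, 'Customer#42', 'FRANCE')]
```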