
josephmachado/de_project


Build a data engineering project, with step-by-step instructions

Data used

Let's assume we are working with a car part seller database (TPC-H). The data is available in a DuckDB database. See the data model below:

TPC-H data model

We can create fake input data using the create_input_data.py script.
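For reference, here is a minimal sketch of how fake TPC-H data can be generated with DuckDB's built-in tpch extension. The repo's create_input_data.py may do this differently, and the database filename here is an assumption:

import duckdb

# Connect to (or create) the DuckDB database file.
# "tpch.db" is an assumed filename; check create_input_data.py for the real one.
con = duckdb.connect("tpch.db")

# DuckDB ships a tpch extension that can generate the benchmark tables.
con.execute("INSTALL tpch; LOAD tpch;")
con.execute("CALL dbgen(sf=0.01)")  # small scale factor for local runs

# The standard TPC-H tables (customer, orders, lineitem, ...) now exist.
print(con.execute("SHOW TABLES").fetchall())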

Architecture

Most data teams have their own version of the 3-hop architecture. For example, dbt has its version (stage, intermediate, mart), and the Spark ecosystem has the medallion (bronze, silver, gold) architecture. A sketch of the three hops follows the data flow diagram below.

Data Flow
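As an illustration, here is a minimal sketch of the three hops using Polars on the TPC-H customer table. The file paths and the exact transformations are assumptions, not the repo's actual pipeline:

import polars as pl

# Stage/bronze: read the raw input as-is (path is hypothetical)
bronze = pl.read_parquet("data/raw/customer.parquet")

# Intermediate/silver: clean and filter
silver = bronze.with_columns(
    pl.col("c_name").str.strip_chars()
).filter(pl.col("c_acctbal") >= 0)

# Mart/gold: aggregate for end users
gold = silver.group_by("c_nationkey").agg(
    pl.col("c_acctbal").mean().alias("avg_acct_bal")
)
gold.write_parquet("data/mart/customer_by_nation.parquet")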

Tools used:

  1. Polars
  2. Docker
  3. Apache Airflow
  4. Pytest
  5. DuckDB

Setup

You have two options to run the exercises in this repo.

Option 1: GitHub Codespaces (Recommended)

Steps:

  1. Create GitHub Codespaces with this link.
  2. Wait for GitHub to install requirements.txt. This step can take about 5 minutes.
  3. Open setup-data-project.ipynb; it will open in a Jupyter notebook interface. When asked for a kernel choice, choose Python Environments and then python3.12.00 Global.
  4. Work through the setup-data-project notebook, which goes over how to create a data pipeline.
  5. In the terminal, run the following commands to set up input data, run the ETL pipeline, and run tests.
# setup input data
python ./setup/create_input_data.py
# run pipeline
python dags/run_pipeline.py
# run tests
python -m pytest dags/tests/unit/test_dim_customer.py
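For context, dags/run_pipeline.py is the pipeline's entry point. A simplified, plain-Python sketch of what such a script might do is below; the actual table names, transformations, output paths, and any Airflow orchestration are assumptions:

import duckdb
import polars as pl

def extract(con: duckdb.DuckDBPyConnection) -> pl.DataFrame:
    # Pull the raw customer table from DuckDB into Polars
    return con.execute("SELECT * FROM customer").pl()

def transform(df: pl.DataFrame) -> pl.DataFrame:
    # Example transformation: count customers per nation
    return df.group_by("c_nationkey").agg(pl.len().alias("num_customers"))

def load(df: pl.DataFrame) -> None:
    df.write_parquet("dim_customer.parquet")  # hypothetical output path

if __name__ == "__main__":
    con = duckdb.connect("tpch.db")  # assumed database file
    load(transform(extract(con)))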

Option 2: Run locally

Steps:

  1. Clone this repo and cd into the cloned repo.
  2. Start a virtual env and install requirements.
  3. Start Jupyter Lab and run the setup-data-project.ipynb notebook, which goes over how to create a data pipeline.
git clone https://github.com/josephmachado/de_project.git
cd de_project
rm -rf env # remove any existing virtual env
python -m venv ./env # create a virtual env
source env/bin/activate # use virtual environment
pip install -r requirements.txt # install dependencies
jupyter lab
  4. In the terminal, run the following commands to set up input data, run the ETL pipeline, and run tests.
# setup input data
python ./setup/create_input_data.py
# run pipeline
python dags/run_pipeline.py
# run tests
python -m pytest dags/tests/unit/test_dim_customer.py
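To give a feel for the test layer, a unit test for a dimension table might look like the sketch below. This is illustrative only; the create_dim_customer function and its join logic are hypothetical, and the repo's actual test_dim_customer.py asserts against its own transformation code:

import polars as pl

def create_dim_customer(customer: pl.DataFrame, nation: pl.DataFrame) -> pl.DataFrame:
    # Hypothetical transformation: enrich customer with its nation
    return customer.join(nation, left_on="c_nationkey", right_on="n_nationkey")

def test_dim_customer_has_nation_name():
    # Small in-memory inputs keep the unit test fast and deterministic
    customer = pl.DataFrame({"c_custkey": [1], "c_nationkey": [7]})
    nation = pl.DataFrame({"n_nationkey": [7], "n_name": ["GERMANY"]})
    result = create_dim_customer(customer, nation)
    assert result["n_name"].to_list() == ["GERMANY"]

Run it with python -m pytest as shown above.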