Sparkify-Amazon-Redshift

ETL pipeline leveraging S3, Redshift and Infrastructure as Code (IAC) to build and load an analytics Data Warehouse. The pipeline extracts user activity and song data from JSON logs stored in S3, stages the data in a Redshift cluster and transforms it into a Data Warehouse with a star schema. This project is part of the Udacity Data Engineering Nanodegree. For convenience, the infrastructure is set up with Infrastructure as Code (IAC) using the AWS SDK for Python (Boto3).

About

This project was created for Sparkify, a (fictional) music streaming provider. A continuously and rapidly growing user base requires Sparkify to leverage the scalability of the cloud. Currently the data resides in S3 as JSON logs of user activity and JSON song metadata. To improve the analysis of the user data by the analytics team, a star-schema Data Warehouse is required. The project builds the Data Warehouse, extracts the data from S3, stages it in AWS Redshift and transforms & loads it into the Data Warehouse.

Raw Data

The raw data consists of two data sources, the song data and the log data. Both data sources are stored in S3:

  • Song data s3://udacity-dend/song_data
  • Log data s3://udacity-dend/log_data
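
To get a first look at the raw data, a few objects can be listed straight from the bucket. The snippet below is a minimal sketch, assuming AWS credentials are already available to boto3; the bucket name and prefixes are the ones listed above.

import boto3

# Assumes credentials are available to boto3 (environment variables,
# ~/.aws/credentials or the KEY/SECRET configured in dwh.cfg).
s3 = boto3.client("s3", region_name="us-west-2")

# Print a handful of song and log file keys from the Udacity bucket.
for prefix in ("song_data", "log_data"):
    response = s3.list_objects_v2(Bucket="udacity-dend", Prefix=prefix, MaxKeys=5)
    for obj in response.get("Contents", []):
        print(obj["Key"])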

Data Warehouse Design

A star schema is used to facilitate analytics.

Fact Table:

  • songplays

Dimension Tables:

  • users
  • songs
  • time
  • artists
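
The table definitions themselves live in sql_queries.py as Redshift CREATE TABLE statements held in Python strings. The sketch below illustrates the general shape of the fact table and one dimension table; the column names and types are illustrative assumptions, not the project's exact DDL.

# Illustrative sketch only - the authoritative DDL is defined in sql_queries.py.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL,
    user_id     INT NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""

The fact table references the dimension tables through these ids, which keeps analytical joins shallow and queries simple.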

ETL Design

The file etl.py contains two steps:

  1. Loading/staging the data into Amazon Redshift as-is.
  2. Transforming and loading the data into the star-schema Data Warehouse.
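
A minimal sketch of these two steps, assuming psycopg2 for the database connection; the dwh.cfg section/key names, the staging column names and the COPY options are assumptions for illustration, and the real statements are defined in sql_queries.py.

import configparser
import psycopg2

# Read cluster endpoint, credentials and IAM role from dwh.cfg
# (section and key names are assumed here).
config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# Step 1: stage the raw JSON logs in Redshift as-is. COPY is Redshift's
# bulk-load command and reads the files directly from S3.
cur.execute("""
    COPY staging_events
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{}'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""".format(config["IAM_ROLE"]["ARN"]))

# Step 2: transform and load from the staging tables into the star schema,
# one INSERT ... SELECT per fact/dimension table.
cur.execute("""
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE userId IS NOT NULL;
""")

conn.commit()
conn.close()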

Tech/Framework used

  • AWS S3
  • AWS Redshift
  • AWS IAM
  • Python
  • AWS SDK for Python (boto3)

Getting Started

Setting Up the Environment

This project requires an AWS Account. It is assumed that Anaconda is installed.

In order to run this project, create a virtual environment:

conda env create -f environment.yml

This will create the virtual environment aws-etl. Activate the environment with the following command.

conda activate aws-etl

Optional: run the following command to make this environment available as a kernel in your (base) Jupyter notebook.

ipython kernel install --name "aws-etl" --user

Credentials

This project requires an AWS Account.

Rename dwh_example.cfg to dwh.cfg and configure it accordingly.

To leverage Infrastructure as Code, an IAM user is required.

Creating a new IAM user:

  1. Go to AWS IAM service and create a new IAM user.
  2. Name the user accordingly
  3. Choose "Programmatic access" as access type.
  4. Select Attach existing policies directly and choose "AdministratorAccess".
  5. Add tags according to your company guidelines and create user.

Update dwh.cfg with the KEY and SECRET of the IAM user.
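
A minimal sketch of how this key pair might then be read and handed to boto3 (the section and key names below are assumptions; dwh_example.cfg defines the actual layout):

import configparser
import boto3

# Load the IAM user's key pair from dwh.cfg (section/key names assumed).
config = configparser.ConfigParser()
config.read("dwh.cfg")

# Every client created from this session acts as the new IAM user.
session = boto3.session.Session(
    aws_access_key_id=config["AWS"]["KEY"],
    aws_secret_access_key=config["AWS"]["SECRET"],
    region_name="us-west-2",  # assumed region
)
redshift = session.client("redshift")
iam = session.client("iam")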

Create Infrastructure using Python SDK

Ensure that the environment is activated and that the credentials have been added to dwh.cfg. Run IAC.ipynb to set up the infrastructure (and, later, to shut it down again). Once the cluster is available, open an incoming TCP port and test the connection from within IAC.ipynb.
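
The provisioning code itself lives in IAC.ipynb; the sketch below only indicates the kind of boto3 calls involved. The role name, cluster identifier, node type and credentials are illustrative assumptions; in the notebook these parameters come from dwh.cfg.

import boto3

# Clients created with the IAM user's credentials (see the previous sketch).
iam = boto3.client("iam", region_name="us-west-2")
redshift = boto3.client("redshift", region_name="us-west-2")

# Look up the ARN of a role that allows Redshift to read from S3.
role_arn = iam.get_role(RoleName="dwhRole")["Role"]["Arn"]

# Provision the cluster; IAC.ipynb reads these parameters from dwh.cfg.
redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dwh",
    ClusterIdentifier="dwhCluster",
    MasterUsername="dwhuser",
    MasterUserPassword="Passw0rd123",
    IamRoles=[role_arn],
)

# Tear everything down again once the analysis is done:
# redshift.delete_cluster(ClusterIdentifier="dwhCluster",
#                         SkipFinalClusterSnapshot=True)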

Create Data Warehouse

To create the Data Warehouse tables, ensure that the aws-etl environment is active and run:

python create_tables.py
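
create_tables.py executes the DDL from sql_queries.py. A rough sketch of its flow, assuming sql_queries.py exposes lists of drop and create statements (the names below are assumptions), with the connection opened from dwh.cfg as in the ETL sketch above:

from sql_queries import create_table_queries, drop_table_queries  # assumed names

def drop_tables(cur, conn):
    # Drop any existing tables so repeated runs start from a clean schema.
    for query in drop_table_queries:
        cur.execute(query)
        conn.commit()

def create_tables(cur, conn):
    # Create the staging tables and the star-schema tables.
    for query in create_table_queries:
        cur.execute(query)
        conn.commit()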

Run ETL Pipeline

Run the ETL pipeline with:

python etl.py

Troubleshooting

To check the raw data, open the S3 console (modify the link according to your needs):

https://s3.console.aws.amazon.com/s3/buckets/udacity-dend?region=us-west-2&tab=objects

For troubleshooting, log into the AWS Redshift Query Editor and run: select * from stl_load_errors;
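
The same check can be scripted from the activated environment; a small sketch using psycopg2, with the same assumed dwh.cfg layout as above:

import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# stl_load_errors keeps one row per record that failed during COPY,
# including the offending file and the reason the load failed.
cur.execute("""
    SELECT starttime, filename, err_reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 10;
""")
for row in cur.fetchall():
    print(row)

conn.close()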

Files

  • README.md Documentation of the project.
  • dwh.cfg Data Warehouse and ETL Pipeline configuration file.
  • environment.yml Conda environment file for creating the environment.
  • create_tables.py Python script that creates the required Data Warehouse tables.
  • etl.py ETL Pipeline file.
  • sql_queries.py SQL Queries used for create_tables.py and etl.py.
  • IAC.ipynb Infrastructure As Code for deployment and disposal of the infrastructure.
  • .gitignore Python gitignore provided by GitHub and adapted to Project needs.

Credits and Acknowledgments

This project is part of the [Udacity Data Engineering Nanodegree](https://www.udacity.com/course/data-engineer-nanodegree--nd027).
