Guide - Data Science 2.0 Capstone Project
Capstone Project
Guidebook
Introduction
Congratulations on successfully completing the data science program! Throughout this journey,
you have acquired a wide range of data science skills, including data analysis, machine learning,
and data visualization. Now, it's time to put all that learning into practice by working on your
data science capstone project.
The data science capstone project is a pivotal part of the program and serves as a showcase of
your abilities as a data scientist. This project is an opportunity for you to demonstrate your
problem-solving skills, analytical thinking, and creativity in tackling real-world data challenges.
In this guide, we will provide you with essential points and guidelines for your data science
capstone project. While we offer some project outlines, you are also encouraged to come up
with your own unique project idea that aligns with your interests and showcases your skills
effectively.
Completing this capstone project is a mandatory requirement to successfully graduate from the
program, and it will be a gradable activity. So, make sure to approach it with enthusiasm and
dedication.
If you have any questions or need guidance during the project development, don't hesitate to
reach out to your buddy or learning advisor. We are here to support you throughout this
journey.
Happy coding and best of luck with your data science capstone project!
Essential Points and Guidelines
1. Data Exploration and Analysis: Conduct thorough data exploration and analysis to gain
insights into the data's characteristics, patterns, and relationships. Use appropriate
visualizations and descriptive statistics to understand the data better.
2. Data Preprocessing: Clean, preprocess, and transform the data to make it suitable for
modeling. Handle missing values, outliers, and data inconsistencies appropriately.
3. Feature Engineering: Create new features or transform existing features to improve
model performance and capture relevant information from the data.
4. Model Selection: Choose appropriate machine learning algorithms or statistical models
based on the problem type and data characteristics. Consider the trade-offs between
interpretability and predictive power.
5. Model Training and Evaluation: Split the data into training and test sets and train the
model on the training data. Evaluate the model's performance on the test data using
relevant evaluation metrics.
6. Model Interpretability: Pay attention to model interpretability, especially if the project
has real-world implications. Use feature importance to explain model predictions (a
minimal sketch follows this list).
7. Documentation: Provide clear and comprehensive documentation for your project.
Describe the problem statement, data sources, data preprocessing steps, feature
engineering, model selection, and evaluation metrics.
8. Data Visualization: Use effective data visualizations to communicate insights and model
results. Visuals should be clear, informative, and visually appealing.
9. Model Deployment: Consider the deployment of the trained model in a real-world
scenario if applicable. Discuss how the model can be integrated into an existing system
or make predictions in real-time.
10. Communication: Clearly communicate your findings, methodology, and results in a way
that is understandable to both technical and non-technical stakeholders.
11. Bonus Points: For bonus points, go beyond the core requirements; project-specific
bonus criteria (such as packaging your solution with a README) are listed under each
project below.
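To make item 6 concrete, here is a minimal, self-contained sketch of inspecting feature
importances with scikit-learn. It uses a synthetic dataset so it runs as-is; in your project,
substitute your own fitted model and training features.

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; replace with your own features and target.
X, y = make_classification(n_samples=500, n_features=6, random_state=42)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

model = RandomForestClassifier(random_state=42).fit(X, y)

# Higher values indicate features that drive the model's predictions more.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))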
During Development: In this phase, organize your project files in the following folder structure:
project-folder: Name this folder to reflect your project's name, in all-lowercase
kebab-case (hyphen-separated, with no spaces). An example layout is shown after the list
below.
1. notebooks: This folder should contain Jupyter notebooks or Python scripts where you
perform data exploration, preprocessing, feature engineering, and model building.
2. data: This folder should contain the dataset(s) used in the project. Include both raw and
processed data, along with a README file explaining the dataset's attributes.
3. models: This folder should contain the trained machine learning models or statistical
models used in the project. Include model serialization files (e.g., Pickle files) and any
model artifacts.
4. visuals: This folder should contain data visualizations and plots generated during
exploratory data analysis or model evaluation. Visualizations should be saved as image
files (e.g., PNG or JPEG).
5. README.md: This Markdown file should include a detailed description of your project,
problem statement, data sources, and explanations of your code and models. Also,
provide instructions on how to run your code and reproduce the results.
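For illustration, a project following this structure might look like the layout below; the
project name and file names are examples only:

anoma-data/
    notebooks/
        01_eda.ipynb
        02_modeling.ipynb
    data/
        raw/
        processed/
        README.md
    models/
        model.pkl
    visuals/
        correlation_heatmap.png
    README.md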
Deployment: When setting up your project for deployment, follow these steps:
1. Package Dependencies: Ensure you have a requirements.txt file that lists all the Python
dependencies required to run your project. This file is essential for installing the
necessary libraries on the deployment server.
2. Model Serialization: Serialize your trained models into files that can be easily loaded and
used for predictions. Save them in a specific folder (e.g., "models") within your project.
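A quick way to generate the dependency list is to run pip freeze > requirements.txt inside
your project's environment. For serialization, here is a minimal pickle sketch; the tiny
stand-in model and the file paths are placeholders for your own artifacts.

import os
import pickle
from sklearn.linear_model import LogisticRegression

# Stand-in model so the example runs; in your project, this is the
# estimator you actually trained.
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

os.makedirs("models", exist_ok=True)
with open("models/model.pkl", "wb") as f:
    pickle.dump(model, f)

# Later (e.g., on the deployment server), reload and predict:
with open("models/model.pkl", "rb") as f:
    loaded = pickle.load(f)
print(loaded.predict([[0.2]]))  # class prediction for a new sample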
Remember to update your README.md with instructions on how to access and interact with
your deployed project. Providing clear and concise documentation ensures ease of evaluation
and demonstrates your professionalism as a data scientist.
By organizing your data science capstone project assets effectively, you make it easier for
evaluators to understand your work and appreciate the effort you put into building a compelling
data-driven solution. Good luck with your data science capstone project!
Submitting the Data Science Project
Before submitting your data science capstone project, make sure to check the following points:
1. Functionality: Ensure that your data science project is working correctly and meets the
defined objectives. All data preprocessing, modeling, and evaluation steps should be
functional and produce valid results.
2. Code Structure: Organize your data science project code according to the advised folder
structure. Keep all relevant files, notebooks, and scripts in appropriate directories to
maintain a clear and logical project structure.
3. GitHub Repository: Store your well-organized data science project code in a GitHub
repository. Make sure to include all necessary files and folders, including data,
notebooks, scripts, models, and visualizations.
If you can confidently answer "Yes" to all the above points, you are ready to submit your data
science capstone project. Submit the following to your learning advisor:
1. GitHub Repo URL: Provide the URL to your GitHub repository containing the well-
organized and documented data science project code.
2. Deployed Application URL: If applicable, include the URL to the deployed application.
Ensure that the application is publicly accessible for evaluation purposes.
What Should I Build?
Your data science capstone project is an opportunity to showcase your skills and abilities as a
data scientist. It should reflect your expertise and demonstrate your problem-solving capabilities
using data-driven approaches. Here are some guidelines and ideas to help you decide what to
build for your data science capstone project:
1. Choose a Relevant Problem: Select a problem that is relevant and has practical
applications in real-world scenarios. Look for challenges that involve data analysis,
predictive modeling, recommendation systems, or any other data-related tasks that align
with your interests.
2. Real Data: Whenever possible, work with real-world datasets. Real data provides unique
challenges and insights that can enrich your learning experience and demonstrate your
ability to handle real-world data.
3. Focus on Depth and Quality: Instead of tackling multiple small projects, focus on
building a single project with depth and high-quality results. A comprehensive, well-
executed project demonstrates your ability to work on complex data science problems.
4. Industry Relevance: If you have a specific industry or domain of interest, try to align
your project with it. Demonstrating domain knowledge and expertise can be
advantageous in certain job applications.
5. Bring Your Own Ideas: While the guide may provide some ideas, feel free to come up
with your own project idea that excites you. Personal projects that reflect your interests
and passion often lead to better outcomes.
Remember, the goal is not just to complete the program but to create an impressive and
impactful data science capstone project that showcases your abilities and sets you apart from
other candidates. Put in the time, effort, and attention to detail to create a project that you can
be proud of and that leaves a lasting impression on potential employers.
Project 1
AnomaData (Automated Anomaly Detection for Predictive Maintenance)
Problem Statement:
Many different industries need predictive maintenance solutions to reduce risks and gain
actionable insights through processing data from their equipment.
Although system failure is a very general issue that can occur in any machine, predicting the
failure and taking steps to prevent such failure is most important for any machine or software
application.
This Capstone project is aimed at predicting the machine breakdown by identifying the
anomalies in the data.
The data we have contains more than 18,000 rows collected over a few days. The column ‘y’
contains the binary labels, with 1 denoting an anomaly. The rest of the columns are predictors.
The following steps are recommended for approaching this problem statement:
Exploratory Data Analysis: Analyze and understand the data to identify patterns,
relationships, and trends, using descriptive statistics and visualizations.
Data Cleaning: This might include standardization and handling missing values and
outliers in the data.
Feature Engineering: Create new features or transform the existing features for better
performance of the ML Models.
Model Selection: Choose the most appropriate model that can be used for this project.
Model Training: Split the data into train & test sets and use the train set to estimate the
best model parameters.
Model Validation: Evaluate the performance of the model on data that was not used
during the training process. The goal is to estimate the model's ability to generalize to
new, unseen data and to identify any issues with the model, such as overfitting.
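As a concrete starting point, here is a hedged sketch of the training and validation steps. It
assumes the CSV has been saved as data/AnomaData.csv with the label column ‘y’ described
above; the file path and the choice of model are assumptions, not requirements.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("data/AnomaData.csv")  # assumed local path to the dataset
# Assumes predictors are numeric; drop or encode timestamp columns first.
X, y = df.drop(columns=["y"]), df["y"]

# Stratify so the rare anomaly class appears in both splits; for this
# time-ordered data, also consider a chronological split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# class_weight="balanced" compensates for the scarcity of anomalies.
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))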
Deliverables
Tasks/Activities List
Collect the time series data from the CSV file linked below.
Exploratory Data Analysis (EDA): Perform a data quality check and treat missing values,
outliers, etc., if any.
Success Metrics
Below are the metrics for the successful submission of this case study.
The accuracy of the model on the test data set should be > 75% (Subjective in nature)
Bonus Points
You can package your solution in a zip file included with a README that explains the
installation and execution of the end-to-end pipeline.
You can demonstrate your documentation skills by describing how it benefits our
company.
Data
The dataset for this project can be accessed by clicking the link provided below.
Anoma_data.csv
Project 2
FindDefault (Prediction of Credit Card fraud)
Problem Statement:
A credit card is one of the most widely used financial products for making online purchases and
payments. Though credit cards can be a convenient way to manage your finances, they can also
be risky. Credit card fraud is the unauthorized use of someone else's credit card or credit card
information to make purchases or withdraw cash.
It is important that credit card companies are able to recognize fraudulent credit card
transactions so that customers are not charged for items that they did not purchase.
The dataset contains transactions made by credit cards in September 2013 by European
cardholders. This dataset presents transactions that occurred over two days, with 492 frauds
out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds)
accounts for 0.172% of all transactions.
The following steps are recommended for approaching this problem statement:
Exploratory Data Analysis: Analyze and understand the data to identify patterns,
relationships, and trends, using descriptive statistics and visualizations.
Data Cleaning: This might include standardization and handling missing values and
outliers in the data.
Dealing with Imbalanced data: This data set is highly imbalanced. The data should be
balanced using appropriate methods before moving on to model building (a sketch
follows this list).
Feature Engineering: Create new features or transform the existing features for better
performance of the ML Models.
Model Selection: Choose the most appropriate model that can be used for this project.
Model Training: Split the data into train & test sets and use the train set to estimate the
best model parameters.
Model Validation: Evaluate the performance of the model on data that was not used
during the training process. The goal is to estimate the model's ability to generalize to
new, unseen data and to identify any issues with the model, such as overfitting.
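To illustrate the imbalanced-data step, here is a hedged sketch using SMOTE from the
third-party imbalanced-learn package (pip install imbalanced-learn). The file path and the label
column name "Class" are assumptions about the provided CSV.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE

df = pd.read_csv("data/creditcard.csv")  # assumed path; "Class" = 1 for fraud
X, y = df.drop(columns=["Class"]), df["Class"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Oversample the minority (fraud) class on the training data only, so the
# test set keeps the real-world class distribution.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
print(classification_report(y_test, clf.predict(X_test)))

Note that with frauds at 0.172% of transactions, plain accuracy is uninformative (always
predicting "not fraud" scores above 99%); precision, recall, and F1 on the fraud class are more
meaningful.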
Deliverables
Tasks/Activities List
Collect the data from the CSV file linked below.
Exploratory Data Analysis (EDA): Perform a data quality check and treat missing values,
outliers, etc., if any.
Success Metrics
Below are the metrics for the successful submission of this case study.
The accuracy of the model on the test data set should be > 75% (Subjective in nature)
Bonus Points
You can package your solution in a zip file included with a README that explains the
installation and execution of the end-to-end pipeline.
You can demonstrate your documentation skills by describing how it benefits our
company.
Data:
The dataset for this project can be accessed by clicking the link provided below.
creditcard.csv
Project 3
DocAssist (Building Intelligent Medical Decision Support System)
Problem Statement
The objective of this project is to develop an intelligent medical decision support system that
analyzes patient data to assist doctors in making informed decisions about the best treatment
options for individual patients. By leveraging machine learning and data analysis, the system will
provide personalized treatment recommendations based on the patient's medical history,
symptoms, lab results, and other relevant factors.
Focus Areas:
Data Collection: Gather patient data from electronic health records (EHRs), medical
databases, and other relevant sources.
Data Preprocessing: Clean, normalize, and handle missing values in the patient data to
prepare it for analysis.
Feature Engineering: Identify and extract meaningful features from the patient data,
such as demographic information, medical history, diagnostic test results, and
medication history.
User Interface: Design an intuitive user interface for doctors to input patient data and
receive treatment recommendations.
Deliverables:
Tasks/Activities List:
Data Collection:
o Gather patient data from electronic health records (EHRs) and medical databases.
Data Preprocessing:
o Clean and preprocess the patient data to handle missing values and remove
noise (a preprocessing sketch follows this list).
Feature Engineering:
o Extract relevant features from the patient data, such as demographics, medical
history, and diagnostic results.
Model Development:
Model Interpretability:
User Interface:
o Design a user-friendly interface that allows doctors to input patient data and
receive treatment recommendations.
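As an illustration of the preprocessing task, here is a minimal sketch that imputes missing
values and scales numeric columns with scikit-learn. The file name follows the Data section
below; the actual EHR column names are unknown here, so the sketch selects numeric columns
generically.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("data/dataset.csv")  # file name per the Data section below
numeric_cols = df.select_dtypes(include="number").columns

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # normalize numeric ranges
])
X_numeric = preprocess.fit_transform(df[numeric_cols])
print(X_numeric.shape)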
Success Metrics:
The predictive models should achieve high accuracy and precision in treatment outcome
predictions.
Doctors' feedback on the usefulness and effectiveness of the treatment
recommendations.
Bonus Points:
Ensure compliance with healthcare regulations and data privacy laws throughout the
project.
Data:
The dataset for this project can be accessed by clicking the link provided below.
dataset.csv
Project 4
For-rest from Fires (Fire Detection Model using Deep Learning)
Problem Statement:
There have been many fire outbreaks lately, and this growing problem causes tremendous
damage to nature and human interests. Such incidents have highlighted the need for more
effective and efficient fire detection systems.
The traditional systems that rely on temperature or smoke sensors have limitations such as slow
response time and inefficiency when the fire is at a distance from the detectors.
Moreover, these systems are also costly. As an alternative, researchers are exploring computer
vision and image processing techniques as a cost-effective solution. One such method is the use
of surveillance cameras to detect fires and alert the relevant parties.
Computer vision-based fire detection, which utilizes image processing techniques, has the
potential to be a useful approach in situations where traditional methods are not feasible. The
algorithm for fire detection uses visual characteristics of fires such as brightness, color, texture,
flicker, and trembling edges to distinguish them from other stimuli.
This project is aimed at building a Fire Detection Model using Deep Learning.
Objectives:
The main goal of this project is to create a fire detection system. The key objective is to
identify fires in images obtained from surveillance systems or other sources.
The following steps are recommended for approaching this problem statement:
Data collection: Collect data on different types of fires, including images and videos of
flames, smoke, and heat. The data should be collected in a controlled environment to
ensure that it is representative of real-world fires.
Data preprocessing: Clean and preprocess the data to ensure that it is ready for use in
training the deep learning model. This may include resizing images, normalizing pixel
values, and splitting the data into training, validation, and test sets.
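Building on these steps, here is a minimal Keras sketch of a binary fire/no-fire image
classifier. The directory layout (one subfolder per class under data/train and data/val), image
size, and architecture are illustrative assumptions, not a required design.

import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/train", image_size=(128, 128), batch_size=32)
val_ds = tf.keras.utils.image_dataset_from_directory(
    "data/val", image_size=(128, 128), batch_size=32)

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 255, input_shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # fire vs. no fire
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, validation_data=val_ds, epochs=10)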
Timeline
Deliverables
Tasks/Activities List
● Data Preprocessing.
● Train/Test Split
Success Metrics
Below are the metrics for the successful submission of this case study.
● The accuracy of the model on the test data set should be > 85% (Subjective in nature)
Bonus Points
● You can package your solution in a zip file included with a README that explains the
installation and execution of the end-to-end pipeline.
● You can demonstrate your documentation skills by describing how it benefits our
company.
Data:
The dataset for this project can be accessed by clicking the link provided below.
FOREST_FIRE_SMOKE_AND_NON_FIRE_DATASET.zip
Project 5
MoodforMusic (An Intelligent Mood Detection and Music Recommendation
Application)
Problem Statement:
The objective of this project is to build an application that detects the mood of users using still
images or videos and recommends music accordingly. The system will use image or video
analysis to infer the user's mood and provide personalized music recommendations to enhance
their emotional experience.
Focus Areas:
Image/Video Analysis: Develop algorithms to analyze still images or videos and extract
mood-related features.
Mood Classification: Create a machine learning model that classifies the user's mood
based on the extracted features.
Deliverables:
User Interface
Comprehensive Documentation
Source Code
Tasks/Activities List:
Model Training: Split the data into training and testing sets. Train the mood
classification model.
User Interface Development: Design and develop an interactive user interface for
mood capture and music recommendations.
Testing and Validation: Evaluate the performance of the system using test data and user
feedback.
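The recommendation step can start as simply as a lookup from the predicted mood to curated
playlists, refined later with user feedback. A toy sketch, where the mood labels and playlist
names are illustrative placeholders:

PLAYLISTS = {
    "happy": ["Upbeat Mix", "Feel-Good Pop"],
    "sad": ["Mellow Acoustic", "Rainy Day Piano"],
    "angry": ["Hard Rock Energy", "Workout Metal"],
    "calm": ["Lo-fi Beats", "Ambient Focus"],
}

def recommend(mood, k=2):
    # Fall back to a default list for moods the classifier hasn't seen.
    return PLAYLISTS.get(mood, ["Editor's Picks"])[:k]

print(recommend("happy"))  # ['Upbeat Mix', 'Feel-Good Pop']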
Success Metrics:
Bonus Points:
Packaging and README: Package the solution with a detailed README explaining
installation and execution of the end-to-end pipeline.
Data:
The dataset for this project can be accessed by clicking the link provided below.
MoodforMusic.zip
Project 6
NutriGro (Building a Recommender System)
Problem Statement
The goal of this project is to develop a recommender system that provides product
recommendations.
The following steps are recommended for approaching this problem statement:
Data Collection: Gather historical order data from the database or e-commerce platform.
Collect healthiness information through surveys or health records (if available).
Data Preprocessing: Clean and preprocess the historical order and healthiness data,
handle missing values, and encode categorical variables.
Feature Engineering:
Recommender System: Train and fine-tune the recommender system using historical
order data.
Deployment: Deploy the system on a web server or cloud platform for public use.
Timeline
Deliverables
Tasks/Activities List
Data Collection
Data Preprocessing
Feature Engineering
Recommender System
Evaluation
Integration
User Interface/API
Model Refinement
Documentation
Deployment Plan
Success Metrics
Below are the metrics for the successful submission of this case study.
The accuracy of the model on the test data set should be > 75% (Subjective in nature)
Bonus Points
You can package your solution in a zip file included with a README that explains the
installation and execution of the end-to-end pipeline.
You can demonstrate your documentation skills by describing how it benefits our
company.
Data
We construct a dataset by collecting data from Allrecipes.com. The website was chosen because
it is one of the largest food-oriented social networks, with 1.5 billion visits per year. In particular,
we crawl 52,821 recipes from 27 categories posted between 2000 and 2018. For each recipe, we
crawl its ingredients, image, and the corresponding ratings from users. To ensure the quality of
the data, we filter out recipes that have no images, contain repeated ingredients, or have zero
reviews. We then obtain raw_data, comprising 1,160,267 users and 49,698 recipes with 38,131
ingredients and 3,794,003 interactions, organized in the following files:
recipe_file:
each recipe is presented on one line, consisting of recipe id, name, average rating,
image url, number of reviews, ingredients, cooking directions, nutrition information,
and reviews.
interaction_file:
each interaction is presented on one line, consisting of user id, recipe id, rating, and
dateLastModified.
images:
49,698 images, named with recipe_id.jpg.
In addition, to evaluate recommendation models' performance, we hold out the latest 30% of
the interaction history to construct the test set and split the remaining data into training (60%)
and validation (10%) sets. We then retain users that occur in both the training and test sets,
obtaining 68,768 users and 45,630 recipes with 33,147 ingredients and 1,093,845 interactions.
We call this dataset core_data; it consists of the following files:
recipe_file: each recipe is presented on one line, consisting of recipe id, name, image url,
ingredients, cooking directions, and nutrition information.
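A hedged pandas sketch of the chronological split described above, assuming the interaction
file loads as a CSV with a dateLastModified column and a user id column (called user_id here;
adjust to the actual header):

import pandas as pd

interactions = pd.read_csv("data/interaction_file.csv",
                           parse_dates=["dateLastModified"])
interactions = interactions.sort_values("dateLastModified")

n = len(interactions)
train = interactions.iloc[: int(0.6 * n)]             # oldest 60%
val = interactions.iloc[int(0.6 * n): int(0.7 * n)]   # next 10%
test = interactions.iloc[int(0.7 * n):]               # latest 30% held out

# Retain only users that occur in both training and test sets, as described.
common = set(train["user_id"]) & set(test["user_id"])
train = train[train["user_id"].isin(common)]
test = test[test["user_id"].isin(common)]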
The dataset for this project can be accessed by clicking the link provided below.
NutriGro.zip
Project 7
Propensify (Propensity model to identify how likely customers in certain target groups
are to respond to a marketing campaign)
Problem Statement:
Are you aware of what, when, and why your customers will make a purchase? Many businesses
undertake an intense pursuit to discover these answers, dedicating valuable resources to data-
driven campaigns and high-cost strategies - yet the actual outcomes often remain elusive and
disappointing.
Customer information is considered a valuable asset; however, its true worth can only be
established when it is put to use. Many companies have large collections of data that appear
impressive, but on closer examination they may contain outdated or unimportant information.
Propensity modeling is a method that aims to forecast the chance that individuals, leads, and
customers will engage in specific actions. This method uses statistical analysis which takes into
account all the independent and confounding factors that impact customer behavior.
Suppose you are working for a company as a Data Scientist. Your company is commissioned by
an insurance company to develop a tool to optimize their marketing efforts.
Data:
The insurance company has provided you with a historical dataset (train.csv). The company has
also provided a list of potential customers to whom to market (test.csv). From this list of
potential customers, you need to determine, yes or no, whether you wish to market to each one.
(Note: Ignore any columns other than those listed in the table below.)
Your focus in this project should be on the following recommended steps:
Exploratory Data Analysis: Analyze and understand the data to identify patterns,
relationships, and trends, using descriptive statistics and visualizations.
Data Cleaning: This might include standardization and handling missing values and
outliers in the data.
Dealing with Imbalanced data: This data set is highly imbalanced. The data should be
balanced using appropriate methods before moving on to model building.
Feature Engineering: Create new features or transform the existing features for better
performance of the ML Models.
Model Selection: Choose the most appropriate model that can be used for this project.
Model Training: Split the data into train & test sets and use the train set to estimate the
best model parameters.
Model Validation: Evaluate the performance of the model on data that was not used
during the training process. The goal is to estimate the model's ability to generalize to
new, unseen data and to identify any issues with the model, such as overfitting.
Deliverables
Tasks/Activities List
Collect the data from the CSV file linked below.
Exploratory Data Analysis (EDA): Perform a data quality check and treat missing values,
outliers, etc., if any.
Please add a column to the testingCandidate.csv file. In this column, indicate with a 1
(yes) or a 0 (no) whether you wish to market to each candidate.
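A hedged sketch of producing this column, assuming train.csv and testingCandidate.csv share
the same (already numeric or encoded) feature columns; the target column name "responded"
and the output column name "market_to" are assumptions, not given by the brief:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

train = pd.read_csv("data/train.csv")
candidates = pd.read_csv("data/testingCandidate.csv")

# "responded" is an assumed target name; replace with the real label column.
feature_cols = [c for c in train.columns if c != "responded"]
clf = GradientBoostingClassifier().fit(train[feature_cols], train["responded"])

candidates["market_to"] = clf.predict(candidates[feature_cols]).astype(int)  # 1 = yes, 0 = no
candidates.to_csv("data/testingCandidate_with_predictions.csv", index=False)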
Success Metrics
Below are the metrics for the successful submission of this case study.
The accuracy of the model on the test data set should be > 85% (Subjective in nature)
Bonus Points
You can package your solution in a zip file included with a README that explains the
installation and execution of the end-to-end pipeline.
You can demonstrate your documentation skills by describing how it benefits our
company.
Data
The dataset for this project can be accessed by clicking the link provided below.
Propensify.zip
Project 8
Recommender (Build intelligence to help customers discover products they may like
and most likely purchase)
Problem Statement
In the e-commerce industry, providing personalized product recommendations to customers is
crucial for enhancing user experience, increasing customer engagement, and boosting sales.
Building an effective shopping recommender system can significantly improve customer
satisfaction and drive repeat purchases. As a data scientist, your task is to develop a shopping
recommender system that suggests relevant products to customers based on their preferences
and browsing history, thereby increasing the likelihood of purchase and overall customer
retention.
Project Focus: The main focus of this exercise is to build a shopping recommender system that
can recommend products to customers based on their historical purchase data and browsing
behavior.
Deliverables: Among the expected deliverables are:
An explanation of how the system benefits the e-commerce organization and its
customers.
A user guide or README file describing how to install and run the end-to-end pipeline.
Tasks/Activities List: The project code should include the following activities and analyses:
1. Data Collection: Collect product data, customer profiles, purchase history, and browsing
behavior from the e-commerce platform.
2. Data Preprocessing: Clean, preprocess, and transform the data to make it suitable for
recommendation modeling.
3. Evaluation Metrics: Utilize metrics like precision, recall, F1-score, and mean average
precision to evaluate the performance of the recommender system.
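For ranked recommendations, precision is usually computed over the top k suggestions. A
self-contained sketch with hypothetical product ids:

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommended items the user actually engaged with."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Example: 1 of the top 3 recommendations was actually purchased.
print(precision_at_k(["p3", "p7", "p1"], {"p7", "p9"}, k=3))  # 0.333...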
Success Metrics: The project's success will be evaluated based on the following metrics:
Bonus Points: To earn bonus points, you can consider the following:
1. Package your solution in a well-structured zip file with a detailed README explaining
how to set up and run the shopping recommender system.
Data:
The dataset for this project can be accessed by clicking the link provided below.
Recommnder.zip
Project 9
FaultFindy (Build intelligence using Machine Learning to predict the faulty tyre in
manufacturing)
Problem Statement:
The objective of this project is to develop an intelligent system using deep learning to predict
faults in manufacturing processes. By analyzing various manufacturing parameters and process
data, the system will predict faulty tyres generated during production. This predictive capability
will enable manufacturers to proactively optimize their processes, reduce waste, and improve
overall production efficiency.
Focus Areas:
Data Collection: Gather historical manufacturing data, including images of both good
and faulty tyres.
Data Preprocessing: Clean, preprocess, and transform the data to make it suitable for
deep learning models.
Feature Engineering: Extract relevant features and identify key process variables that
impact faulty tyre generation.
Model Selection: Choose appropriate machine learning algorithms for faulty tyre
prediction.
Model Training: Train the selected models using the preprocessed data.
Model Evaluation: Assess the performance of the trained models using appropriate
evaluation metrics.
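Since the focus areas call for deep learning on tyre images, here is a hedged transfer-learning
sketch using a pretrained MobileNetV2 backbone; the directory layout, image size, and choice
of backbone are assumptions, not requirements.

import tensorflow as tf

train_ds = tf.keras.utils.image_dataset_from_directory(
    "data/tyres/train", image_size=(224, 224), batch_size=32)

base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # freeze pretrained weights for the first training pass

model = tf.keras.Sequential([
    tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1),  # MobileNetV2 expects [-1, 1]
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # faulty vs. good
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)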
Deliverables:
Tasks/Activities List:
Data Collection:
Data Preprocessing:
Feature Engineering:
Model Selection:
Model Training:
Model Evaluation:
Hyperparameter Tuning:
Success Metrics:
Bonus Points:
Data:
The dataset for this project can be accessed by clicking the link provided below
Faultfindy.zip
Project 10
Intensity Analysis (Build your own model using NLP and Python)
Problem Statement:
The objective of this project is to develop an intelligent system using NLP to predict the
intensity of text reviews. By analyzing various parameters and process data, the system will
predict whether a review expresses happiness, anger, or sadness. This predictive capability will
enable teams to proactively optimize their processes and improve overall customer satisfaction.
Focus Areas:
Data Collection: Gather all the intensity data, including the text and its intensity.
Data Preprocessing: Clean, preprocess, and transform the data to make it suitable for
machine learning models.
Feature Engineering: Extract relevant features and identify key process variables that
impact intensity.
Model Training: Train the selected models using the preprocessed data.
Model Evaluation: Assess the performance of the trained models using appropriate
evaluation metrics.
The following steps are recommended for approaching this problem statement:
Data Collection: Collect and preprocess data for natural language processing (NLP)
tasks. This may include text data for training the language model.
Testing and Validation: Test the classifier in different scenarios and environments to
ensure it performs appropriately. The performance of the model should be evaluated,
and any issues should be addressed.
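As a baseline for this pipeline, here is a hedged sketch pairing TF-IDF features with logistic
regression; the file path and the column names text and intensity are assumptions about the
provided data.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

df = pd.read_csv("data/intensity.csv")  # assumed columns: text, intensity
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["intensity"], test_size=0.2,
    stratify=df["intensity"], random_state=42
)

clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # simple text features
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))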
Timeline
Tasks/Activities List
Data Preprocessing.
Train/Test Split
Success Metrics
Below are the metrics for the successful submission of this case study.
The accuracy of the model on the test data set should be > 85% (Subjective in nature)
Bonus Points
You can package your solution in a zip file included with a README that explains the
installation and execution of the end-to-end pipeline.
You can demonstrate your documentation skills by describing how it benefits our
company.
Data:
The dataset for this project can be accessed by clicking the link provided below
Intensity_data.zip