
Architecture of Data Science Projects: Components


The basic workflow to solve any data science problem is as follows:

1. Identify the problem
2. Acquire test and training data
3. Clean the data
4. Analyze the data
5. Model, predict and solve the problem
6. Visualize the data and come up with a solution

The data science life-cycle thus looks somewhat like:


1. Data acquisition
2. Data preparation
3. Hypothesis and modeling
4. Evaluation and Interpretation
5. Deployment
6. Operations
7. Optimization

Architecture of Data Science Projects


In this article, I summarize the components of any data science / machine learning /
statistical project, as well as the cross-dependencies between these components. This will
give you a general idea of what a data science or other analytic project is about.
https://www.datasciencecentral.com/profiles/blogs/the-data-science-zoo

Components
1. Problem
This is the top, fundamental component. I have listed 24 potential problems in my article
24 uses of statistical modeling. It can be anything from building a market segmentation
or a recommendation system, to association rule discovery for fraud detection, to
simulations that predict extreme events such as floods.
2. Data
It comes in many shapes: transactional (credit card transactions), real-time, sensor data
(IoT), unstructured data (tweets), big data, images or videos, and so on. Typically raw
data needs to be identified or even built and put into databases (NoSQL or traditional),
then cleaned and aggregated using EDA (exploratory data analysis). The process can
include selecting and defining metrics.
3. Algorithms
Also called techniques. Examples include decision trees, indexation algorithms, Bayesian
networks, or support vector machines. A rather big list can be found here:
https://www.datasciencecentral.com/profiles/blogs/40-techniques-used-by-data-scientists
4. Models
By models, I mean testing algorithms, selecting, fine-tuning, and combining the best
algorithms using techniques such as model fitting, model blending, data reduction,
feature selection, and assessing the yield of each model, over the baseline. It also
includes calibrating or normalizing data, imputation techniques for missing data, outliers
processing, cross-validation, over-fitting avoidance, robustness testing and boosting,
and maintenance. Criteria that make a model desirable include robustness or stability,
scalability, simplicity, speed, portability, adaptability (to changes in the data), and
accuracy (sometimes measured using R-squared, though I recommend this
alternative instead).
5. Programming
There is almost always some code involved, even if you use a black-box solution.
Typically, data scientists use Python, R or Java, and SQL. However, I've completed
some projects that did not involve real coding, but instead, machine-to-machine
communications via APIs. Automation of code production (and of data science in
general) is a hot topic, as evidenced by the publication of articles such as The
Automated Statistician, and my own work to design simple, robust black-box solutions.
6. Environments
Some call it packages. It can be anything such as a bare Unix box accessed remotely
combined with scripting languages and data science libraries such as Pandas (Python),
or something more structured such as Hadoop. Or it can be an integrated database
system from Teradata, Pivotal or other vendors, or a package like SPSS, SAS,
RapidMiner or MATLAB, or typically, a combination of these.
7. Presentation
By presentation, I mean presenting the results. Not all data science projects run
continuously in the background, for instance to automatically buy stocks or predict the
weather. Some are just ad-hoc analyses that need to be presented to decision makers,
using Excel, Tableau and other tools. In some cases, the data scientist must work with
business analysts to create dashboards, or to design alarm systems, with results from
analysis e-mailed to selected people based on priority rules.
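The Models component above mentions that accuracy is sometimes measured using R-squared. As a minimal sketch (not the alternative metric the author recommends, which is not reproduced here), R-squared can be computed as one minus the ratio of the model's squared error to the squared error of a baseline that always predicts the mean. The function name and the toy numbers are assumptions made for illustration.

```python
def r_squared(actual, predicted):
    """R-squared: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    # Squared error of the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of the baseline that always predicts the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values, made up for illustration only.
actual = [1, 2, 3]
print(r_squared(actual, [1, 2, 3]))  # perfect predictions
print(r_squared(actual, [2, 2, 2]))  # no better than the mean baseline
```

A value near 1 means the model explains most of the variation; a value near 0 means it does no better than the mean baseline.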
Cross-Dependencies
These components interact as follows. I invite you to create a nice graph from the
dependencies table below. The first relationship reads as "the problem impacts or
dictates the data".
Problem -> Data
Problem -> Algorithms
Algorithms -> Models
Algorithms -> Programming
Algorithms -> Environment
Data -> Environment
Environment -> Data
Data -> Algorithms
Data -> Problem
Problem -> Presentation
Models -> Presentation
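One way to turn the dependencies table above into a graph is an adjacency list. The sketch below copies the edges from the table as written; the helper names build_graph and mutual_dependencies are assumptions introduced for illustration, and the second helper simply lists the pairs that point at each other in both directions.

```python
from collections import defaultdict

# Edges copied from the dependencies table; "A -> B" reads "A impacts or dictates B".
EDGES = [
    ("Problem", "Data"),
    ("Problem", "Algorithms"),
    ("Algorithms", "Models"),
    ("Algorithms", "Programming"),
    ("Algorithms", "Environment"),
    ("Data", "Environment"),
    ("Environment", "Data"),
    ("Data", "Algorithms"),
    ("Data", "Problem"),
    ("Problem", "Presentation"),
    ("Models", "Presentation"),
]


def build_graph(edges):
    """Return {component: [components it impacts]} in table order."""
    graph = defaultdict(list)
    for source, target in edges:
        graph[source].append(target)
        graph.setdefault(target, [])  # keep components that impact nothing
    return dict(graph)


def mutual_dependencies(edges):
    """Return the sorted pairs of components that impact each other."""
    edge_set = set(edges)
    return sorted({tuple(sorted((a, b))) for a, b in edge_set if (b, a) in edge_set})


graph = build_graph(EDGES)
for component, impacted in graph.items():
    print(f"{component} -> {', '.join(impacted) or '(nothing)'}")
print("Two-way links:", mutual_dependencies(EDGES))
```

The adjacency list can then be handed to any plotting or graph library of your choice to draw the diagram.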
Also read the lifecycle of data science projects (see also this article).

Steps in a data science project


To prepare a data science project plan, you need to know the steps in a data
science project. Data scientists perform the following tasks in rough
chronological order:

1. Obtain data - Obtain the data that we need from available data
sources.
2. Scrub data - Clean the data by handling missing values, removing
noisy data, normalizing irregularities, etc.
3. Explore data - Apply statistical analyses to find patterns and
significant trends using visualization. This is also a good time to
validate your business question.
4. Modelling - Implement machine learning models that suit your
dataset and evaluate their performance.
5. iNterpret - Interpret the model and the results, and propose some
potential actionable items to the stakeholders and decision makers.
These steps are commonly known as the OSEMN framework.
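The five steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the records are made-up toy data, the function names are hypothetical, and a hand-written least-squares line stands in for whatever machine learning model suits a real dataset.

```python
from statistics import mean


def obtain():
    # Made-up toy records standing in for a real data source.
    return [
        {"x": 1.0, "y": 2.0},
        {"x": 2.0, "y": 4.0},
        {"x": 3.0, "y": None},  # a missing value to be scrubbed
        {"x": 4.0, "y": 8.0},
    ]


def scrub(rows):
    # Handle missing values the simplest way: drop incomplete rows.
    return [r for r in rows if all(v is not None for v in r.values())]


def explore(rows):
    # Summarize the data before modelling.
    return {
        "n": len(rows),
        "mean_x": mean(r["x"] for r in rows),
        "mean_y": mean(r["y"] for r in rows),
    }


def model(rows):
    # Ordinary least squares fit of y = slope * x + intercept.
    xs = [r["x"] for r in rows]
    ys = [r["y"] for r in rows]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def interpret(slope, intercept):
    # Turn the fitted numbers into a statement for decision makers.
    return f"Each unit increase in x is associated with a change of {slope:.2f} in y."


clean = scrub(obtain())
summary = explore(clean)
slope, intercept = model(clean)
print(summary)
print(interpret(slope, intercept))
```

Keeping each step as its own function makes it easy to swap in a real data source or a different model without touching the rest of the pipeline.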

In my view, a data science project involves six major steps:

1. Asking the right questions - The user presents a question, and
different questions lead to different results, so the question itself
determines the objective and target of the exploration.
2. Data collection - Once the goal of the experiment is determined,
the user can start collecting data from the data source, with
regard to the exploration target. The collected data is mostly
unorganized and diverse in format, so it has to be sorted out,
which is the next step.
3. Data munging - This phase maps the raw data into a more
convenient format for consumption. It includes many processes,
such as data parsing, sorting, merging, and filtering, that
transform and organize the data.
4. Basic exploratory data analysis - After data munging, the data
is ready for further analysis. The most basic form is exploratory
data analysis, which involves analyzing a dataset by summarizing
its characteristics.
5. Advanced exploratory data analysis - Up to this point,
descriptive statistics give a general description of the data features.
To go further and derive inference rules that predict those features,
the user applies machine learning to build a more inferential
model.
6. Model assessment - Finally, this step assesses whether the
generated model gives the best performance in estimating the data.
Model selection here involves several steps, including data
preprocessing, tuning the parameters, and even switching
machine learning algorithms.
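Model assessment is often done with cross-validation, one of the techniques listed under the Models component. The sketch below is a minimal illustration, assuming made-up toy data and two hypothetical candidate models (a constant mean predictor and a least-squares line). It compares them by mean squared error on held-out folds, which is one way to decide whether to switch algorithms.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_mean(xs, ys):
    # Baseline candidate: always predict the training mean.
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    # Second candidate: ordinary least squares line.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def cross_val_mse(fit, xs, ys, k=3):
    """Mean squared error over all held-out points across k folds."""
    errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)
        errors.extend((predict(xs[i]) - ys[i]) ** 2 for i in fold)
    return sum(errors) / len(errors)


# Toy data, made up for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

for name, fit in [("mean baseline", fit_mean), ("least-squares line", fit_line)]:
    print(name, cross_val_mse(fit, xs, ys))
```

The folds here are contiguous and unshuffled to keep the sketch short; with real data you would usually shuffle the rows before splitting.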

TDSP Lifecycle
https://docs.microsoft.com/en-us/azure/machine-learning/team-data-science-process/overview

A Data Science Framework


https://www.kaggle.com/ldfreeman3/a-data-science-framework-to-achieve-99-accuracy
