
Data Science Module

Introduction to Data Science


Fresh Graduate Academy
Digital Talent Scholarship 2019
Table of Contents
1. What is Data Science
2. Data Science, AI, and Machine Learning
3. Data Science Methodology
4. Step 1. Business Understanding
5. Step 2. Analytic Approach
6. Step 3. Data Requirements
7. Step 4. Data Collection
8. Step 5. Data Understanding
9. Step 6. Data Preparation
Table of Contents (2)
10. Step 7. Modeling
11. Step 8. Evaluation
12. Step 9. Deployment
13. Step 10. Feedback
What is Data Science
• Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data (Wikipedia, 2019)
• It "unifies statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data collected from the web, smartphones, customers, sensors, and other sources.
What is Data Science
• As a relatively new discipline, various definitions have appeared.
Data Science, Machine Learning, AI
• Artificial Intelligence describes machines that can act or perform tasks resembling those of humans
• Machine Learning describes the process of giving machines the capability to learn
• Data Science uses various techniques from Machine Learning to extract knowledge and insights from data
Data Science Methodology
• An iterative system of methods that guides data scientists on the ideal
approach to solving problems with data science, through a prescribed
sequence of steps.
• Some DS methodologies:
• CRISP-DM: the de facto industry standard for data mining activities
• IBM Data Science Methodology (based on CRISP-DM)
• Microsoft's Team Data Science Process (TDSP)
• Domino’s Data Science Lifecycle
CRISP-DM
• De facto industry standard for data mining activities
IBM Data Science Methodology
• Based on CRISP-DM ( a de facto industry standard for data mining
activities)
Microsoft’s Team Data Science Process
Domino’s Data Science Lifecycle
In general
• Data Science is not a technology-based activity
• but a business (organization) oriented approach
• to solve business problems
• by utilizing data analytics
• It is an iterative process that follows a prescribed sequence; thus it provides structure.
• Iterative means it is a continuous cycle: the model gets trained, evaluated and deployed
• Prescribed sequence means step-by-step
A. From Problem to Approach
• Step 1. Business Understanding
• Step 2. Analytic Approach
Step 1. Business Understanding
• What is the problem that you are trying to solve?
• Not a technical problem but a BUSINESS problem
• For example, if a business owner asks 'How can we reduce the cost of performing an activity?'
• the Data Scientist needs to understand whether the goal is to improve the efficiency of the activity or to increase business profitability.
Step 1. Business Understanding
• Asking the right questions as a Data Scientist starts with
understanding the goal of the business owner in this case.
• The right questions will inform the ideal analytical approach for
solving the problem.

• To understand the business problem we have to look for:
• Background
• Business Objectives: what the business wants to achieve
• Business Success Criteria: how to measure the success of the process
• Develop a glossary of terms: terms agreed upon by both business and technical people
Step 1. Business Understanding (2)
• Then we need to assess the organizational situation
• Assess Situation
• Inventory of Resources
• including key actors (sponsors, key users)
• Requirements, Assumptions, & Constraints
• Risks and Contingencies
• Terminology
• Costs and Benefits
Case 1. Examining Hospital Readmission
• How can we reduce the probability of patients being readmitted to the hospital?
Case 2. Store
• Can we increase sales by rearranging merchandise in the store?
Step 2. Analytic Approach
• How can you use data to answer the (business) question?
• Pick the analytic approach based on the type of question:
• Descriptive
• Current status
• Diagnostic (Statistical Analysis)
• What happened?
• Why is it happening?
• Predictive (Forecasting)
• What if these trends continue?
• What will happen next?
• Prescriptive
• How do we solve it?
Step 2. Analytic Approach (2)
• How can you measure the success criteria of the analytics used to answer the (business) question?
B. From Requirements to Collection
• Step 3. Data Requirements
• Step 4. Data Collection
Step 3. Data Requirements
• What kinds of data should be collected?
• The choice of analytic approach determines the data requirements: the analytic methods to be used require particular data content, formats and representations, guided by domain knowledge.
Step 4. Data Collection
• The data scientist identifies and gathers data resources—structured,
unstructured and semi-structured—that are relevant to the problem
domain. On encountering gaps in data collection, the data scientist
might need to revise the data requirements and collect more data.
C. From Understanding to Preparation
• Step 5. Data Understanding
• Step 6. Data Preparation
Step 5. Data Understanding
• Is the data you collected representative of the problem to be solved?
Step 5. Data Understanding
• Use descriptive statistics and visualization techniques to
• understand data content,
• assess data quality and
• discover initial insights into the data.

• 1. Describing Data
• Format, quantity, and identities of the tables, fields, and other elements
• Does the collected data satisfy the relevant requirements?
Step 5. Data Understanding (2)
• 2. Exploring Data
• Explore the data using queries, visualization and statistics to reveal data characteristics or point to interesting subsets for further examination.
• These include: distribution of key attributes, for example the target attribute of a prediction task; relations between pairs or small numbers of attributes; results of simple aggregations; properties of significant sub-populations; and simple statistical analyses.
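As an illustration, here is a minimal pandas sketch of this kind of exploration; the DataFrame df and its columns (age, income, churned) are hypothetical examples, not data from this module:

```python
import pandas as pd

# Hypothetical customer table with a binary target attribute "churned".
df = pd.DataFrame({
    "age":     [23, 35, 41, 29, 52],
    "income":  [32000, 54000, 61000, 41000, 78000],
    "churned": [1, 0, 0, 1, 0],
})

print(df.describe())                  # simple statistics per attribute
print(df["churned"].value_counts())   # distribution of the target attribute
print(df.corr(numeric_only=True))     # relations between pairs of attributes
```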
Step 5. Data Understanding (3)
• 3. Verifying Data Quality
• Examine the quality of the data, addressing questions such as: Is the data complete (does it cover all the cases required)?
• Is it correct, or does it contain errors? If there are errors, how common are they?
• Are there missing values in the data? If so, how are they represented, where do they occur and how common are they?
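A few of these quality checks can be expressed as pandas one-liners; the data and the non-negative-age rule below are assumed for illustration:

```python
import pandas as pd

# Hypothetical data containing typical quality problems.
df = pd.DataFrame({
    "age":  [23, None, 41, 41, -5],
    "city": ["Jakarta", "Bandung", "Bandung", "Bandung", "Surabaya"],
})

print(df.isna().sum())        # how many missing values per column?
print(df.duplicated().sum())  # how many exact duplicate records?
print((df["age"] < 0).sum())  # assumed domain rule: age must be non-negative
```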
Step 6. Data Preparation
• What additional work is required to manipulate and work with the
data?
Step 6. Data Preparation
• The data preparation stage comprises all activities used to construct
the data set that will be used in the modeling stage.
• These include:
• Selecting data (feature selection)
• Cleansing data
• Constructing (derived) data (feature engineering)
• Combining data from multiple sources, and
• Formatting data
Step 6. Data Preparation: Selecting Data
• 1. Selecting Data
• Decide on the data to be used for analysis.
• Criteria:
• relevance to the data science goals,
• quality, and
• technical constraints such as limits on data volume or data types.
• Data selection covers selection of attributes (columns) as well as selection of
records (rows) in a table.
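A minimal pandas sketch of both kinds of selection; the column names and the age threshold are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "age":         [17, 25, 40],
    "income":      [0, 41000, 62000],
    "churned":     [0, 1, 0],
})

# Attribute (column) selection: keep only features relevant to the goal.
subset = df[["age", "income", "churned"]]

# Record (row) selection: keep only records that meet the analysis criteria.
subset = subset[subset["age"] >= 18]
```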
Step 6. Data Preparation: Cleansing Data
• 2. Cleansing data
• Raise the data quality to the level required by the selected analysis
techniques.
• This may involve selection of clean subsets of the data, the insertion of
suitable defaults or more ambitious techniques such as the estimation of
missing data by modeling.
• Missing data
• Incorrect/ invalid data
• Duplicate data
• Formatting data
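A pandas sketch touching all four concerns; the raw values, the valid age range and the choice of the median as a default are illustrative assumptions:

```python
import pandas as pd

# Hypothetical raw data: a missing age, a duplicate row, an invalid age,
# and inconsistently formatted city names.
raw = pd.DataFrame({
    "age":  [23.0, None, 41.0, 41.0, -5.0],
    "city": ["Jakarta", "bandung ", "Bandung", "Bandung", "Surabaya"],
})

clean = raw.drop_duplicates()                                             # duplicate data
clean = clean[clean["age"].between(0, 120) | clean["age"].isna()].copy()  # invalid data
clean["age"] = clean["age"].fillna(clean["age"].median())                 # missing data: suitable default
clean["city"] = clean["city"].str.strip().str.title()                     # formatting data
```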
Step 6. Data Preparation: Feature Engineering
• 3. Feature engineering: Creating new features from existing ones to
improve model performance.
• Indicator Variable
• Indicator variable from thresholds:
• When studying alcohol preferences of U.S. consumers, if our dataset has an age feature, we can create an indicator variable for age >= 21 to distinguish subjects who were over the legal drinking age.
• Indicator variable from multiple features:
• When predicting real-estate prices with the features n_bedrooms and n_bathrooms, if houses with 2 beds and 2 baths command a premium as rental properties, we can create an indicator variable to flag them.
• Indicator variable for special events:
• When modeling weekly sales of an e-commerce site, we could create two indicator variables for the weeks of Black Friday and Christmas.
• Indicator variable for group of classes:
• from the categorical feature traffic_source, we could create an indicator variable for paid_traffic by
flagging observations with traffic source values of "Facebook Ads" or "Google Adwords".
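All four kinds of indicator variables reduce to short pandas expressions; the toy table below merges the examples above and is entirely hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "age":            [19, 34],
    "n_bedrooms":     [2, 3],
    "n_bathrooms":    [2, 1],
    "week":           ["Black Friday", "ordinary"],
    "traffic_source": ["Facebook Ads", "Organic"],
})

df["over_21"] = (df["age"] >= 21).astype(int)                         # from a threshold
df["premium_rental"] = ((df["n_bedrooms"] == 2) &
                        (df["n_bathrooms"] == 2)).astype(int)         # from multiple features
df["black_friday_week"] = (df["week"] == "Black Friday").astype(int)  # for a special event
df["paid_traffic"] = df["traffic_source"].isin(
    ["Facebook Ads", "Google Adwords"]).astype(int)                   # for a group of classes
```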
Step 6. Data Preparation: Feature Engineering (2)
• Interaction Features
• Sum of two features:
• When predicting revenue based on preliminary sales data, if we have the
features sales_blue_pens and sales_black_pens, we could sum those features if we only care
about overall sales_pens.
• Difference between two features:
• If we have the features house_built_date and house_purchase_date, we can take their difference to create the feature house_age_at_purchase.
• Product of two features:
• When running a pricing test, and we have the feature price and an indicator
variable conversion, we can take their product to create the feature earnings.
• Quotient of two features:
• Given a dataset of marketing campaigns with the features n_clicks and n_impressions, we can divide clicks by impressions to create click_through_rate, allowing comparison across campaigns of different volumes.
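The same four interaction features, sketched in pandas on hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({
    "sales_blue_pens":     [120, 80],
    "sales_black_pens":    [200, 150],
    "house_built_date":    pd.to_datetime(["1990-01-01", "2005-06-15"]),
    "house_purchase_date": pd.to_datetime(["2010-01-01", "2015-06-15"]),
    "price":               [9.99, 14.99],
    "conversion":          [1, 0],
    "n_clicks":            [30, 12],
    "n_impressions":       [1000, 800],
})

df["sales_pens"] = df["sales_blue_pens"] + df["sales_black_pens"]        # sum
df["house_age_at_purchase"] = (df["house_purchase_date"]
                               - df["house_built_date"]).dt.days // 365  # difference (years)
df["earnings"] = df["price"] * df["conversion"]                          # product
df["click_through_rate"] = df["n_clicks"] / df["n_impressions"]          # quotient
```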
Step 6. Data Preparation: Feature Engineering (3)
• Feature Representation
• date and time features
• From the feature purchase_datetime, we can
create purchase_day_of_week, purchase_hour_of_day, purchases_over_last_30_days.
• numeric to categorical mapping
• When we have the feature years_in_school, we might create a new feature grade with classes
such as "Elementary School", "Middle School", and "High School".
• grouping sparse data
• You have a feature with many classes that have low sample counts. You can try grouping
similar classes and then grouping the remaining ones into a single "Other" class.
• creating dummy variables
• Depending on your machine learning implementation, you may need to manually transform
categorical features into dummy variables. You should always do this after grouping sparse
classes.
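A pandas sketch of these representations; the bin edges for grade and the toy values are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "purchase_datetime": pd.to_datetime(["2019-11-29 14:05", "2019-12-25 09:30"]),
    "years_in_school":   [5, 11],
    "traffic_source":    ["Facebook Ads", "Bing Ads"],
})

# Date and time features
df["purchase_day_of_week"] = df["purchase_datetime"].dt.day_name()
df["purchase_hour_of_day"] = df["purchase_datetime"].dt.hour

# Numeric to categorical mapping (assumed bin edges)
df["grade"] = pd.cut(df["years_in_school"], bins=[0, 6, 9, 12],
                     labels=["Elementary School", "Middle School", "High School"])

# Group sparse classes first, then create dummy variables
df["traffic_source"] = df["traffic_source"].replace({"Bing Ads": "Other"})
dummies = pd.get_dummies(df["traffic_source"], prefix="source")
```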
Step 6. Data Preparation: Combining Data
• 4. Combining data from multiple sources
• These are methods whereby information is combined from multiple tables or
records to create new records or values
• Merged data also covers aggregations. Aggregation refers to operations
where new values are computed by summarizing together information from
multiple records and/or tables.
• For example, converting a table of customer purchases where there is one
record for each purchase into a new table where there is one record for each
customer, with fields such as number of purchases, average purchase
amount, percent of orders charged to credit card, percent of items under
promotion, etc.
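The purchase-to-customer aggregation described above can be written as a single pandas groupby; the table and column names are hypothetical:

```python
import pandas as pd

# Hypothetical purchase-level table: one record per purchase.
purchases = pd.DataFrame({
    "customer_id":  [1, 1, 2, 2, 2],
    "amount":       [10.0, 25.0, 5.0, 7.5, 12.5],
    "paid_by_card": [1, 0, 1, 1, 0],
    "on_promotion": [0, 1, 0, 0, 1],
})

# Aggregate to one record per customer.
customers = purchases.groupby("customer_id").agg(
    number_of_purchases=("amount", "size"),
    average_purchase_amount=("amount", "mean"),
    pct_charged_to_card=("paid_by_card", "mean"),
    pct_items_on_promotion=("on_promotion", "mean"),
).reset_index()
```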
Step 6. Data Preparation: Formatting Data
• 5. Formatting data
Formatting transformations refer to primarily syntactic modifications
made to the data that do not change its meaning, but might be
required by the modeling tool.
D. From Modeling to Evaluation
• Step 7. Modeling
• Step 8. Evaluation
Step 7. Modeling
• Modeling is geared toward answering two key questions:
A. What is the purpose of data modeling?
B. What are the characteristics of the process?
Step 7. Modeling
• Starting with the first version of the prepared data set, data scientists
use a training set—historical data in which the outcome of interest is
known—to develop predictive or descriptive models using the
analytic approach already described.
• A descriptive model can tell what
new service a customer may
prefer based on the customer’s
existing preferences, using
recommender systems and
clustering algorithms.
• Predictive modeling, in contrast, can tell a future value or class based on present data; some examples are classification and linear or logistic regression algorithms.
Step 7. Modeling
The modeling process is highly iterative.
• 1. Selecting the analytic modeling technique
• 2. Generating Test Design
• 3. Building Model
Step 7. Modeling (2)
• 1. Selecting the analytic modeling technique or techniques
• Decision Tree
• Artificial Neural Network
• SVM
• Deep Learning
• …

• For the modeling technique(s) chosen, several related assumptions are made, e.g. all attributes have uniform distributions, no missing values are allowed, the class attribute must be symbolic, etc.
Step 7. Modeling (3)
• 2. Generating Test Design
• Generate a procedure or mechanism for testing the model's quality and validity.
• For example, in supervised data mining tasks such as classification, it is common to use error rates as quality measures for data mining models. Therefore, we typically separate the data set into a train and a test set, build the model on the train set, and estimate its quality on the separate test set.
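A minimal scikit-learn sketch of this train/test separation; the built-in Iris data merely stands in for a prepared data set:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # stand-in for the prepared data set

# Hold out 20% of the records as a separate test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```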
Step 7. Modeling (4)
• 3. Building Model
• Set the model parameters
• Run the modeling tool on the prepared data set to create one or more models.
• Describe the resulting model
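Continuing the sketch above: set parameters, run the modeling tool, and describe the result. The decision tree and its max_depth setting are illustrative choices, not prescriptions from the module:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Set some parameters, then run the modeling tool on the prepared data set.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Describe the resulting model.
print("depth:", model.get_depth(), "leaves:", model.get_n_leaves())
```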
Step 8. Evaluation
• The data scientist evaluates the model’s quality and checks whether it
addresses the business problem fully and appropriately.
• Requires the computing of various diagnostic measures—as well as other
outputs, such as tables and graphs—using a testing set for a predictive model.
• Model evaluation is performed during model development and
before the model is deployed. Evaluation allows the quality of the
model to be assessed and it’s also a way to see if it meets the initial
request.
• Interpret the models according to domain knowledge, the data mining success criteria, and the desired test design.
Step 8. Evaluation
• The data scientist judges the success of the application of modeling and discovery techniques more technically,
• while business analysts and domain experts discuss the results in the business context.
• Moreover, this task only considers the models, whereas the evaluation phase also takes into account all other results that were produced in the course of the project.
Step 8. Evaluation
• A model evaluation has two main phases:
• The Diagnostic Measures phase:
• concerned with the actual performance of the model, given a test data set
• The Statistical Significance phase:
• concerned with how true or confident the model's predictions or descriptions are.
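A sketch of the Diagnostic Measures phase, reusing the illustrative decision tree built in Step 7:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Diagnostic measures: the actual performance of the model on a test data set.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```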
E. From Deployment to Feedback
• Step 9. Deployment
• Step 10. Feedback
Step 9. Deployment
• Can we put the model into practice?
Step 9. Deployment
• The data science model may present a solution, but the key to making that solution relevant and useful in solving the initial problem is to get the relevant stakeholders acquainted with the tool produced.
• Requires effective communication skills for onboarding.
• The model may be deployed to a limited number of stakeholders
initially or to a test environment to build up confidence in applying it
for use across the board.
• The model must be relatively intuitive to use, and staff members who may be responsible for applying the model to similar problems must be trained. It is important to document problems that may arise at this stage.
Step 9. Deployment
1. Plan deployment
2. Plan monitoring and maintenance
3. Produce the project report
Step 10. Feedback
• Once deployed, feedback from the users will be used to refine the
model and assess it for performance and impact. This will continue
for as long as the solution is required.
Step 10. Feedback
• By collecting results from the implemented model, the organization
gets feedback on the model’s performance and observes how it
affects its deployment environment.
• Analyzing this feedback enables the data scientist to refine the
model, increasing its accuracy and thus its usefulness.
• This often-overlooked stage can yield substantial additional benefits if undertaken as part of the overall process.
