DTS Module: Data Science Methodology
• Assess Situation
• Inventory of Resources
• including key actors (sponsors, key users)
• Requirements, Assumptions, & Constraints
• Risks and Contingencies
• Terminology
• Costs and benefits
Case 1. Examining Hospital Readmission
Step 5. Data Understanding (1)
• 1. Describing Data
• Format, quantity, and identities of the tables, fields, and other data elements
• Does the acquired data satisfy the relevant requirements?
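• To make this concrete, a minimal sketch (not part of the module material; the readmissions table and its columns are hypothetical) of describing a data set with pandas:
```python
# Minimal sketch: describing a hypothetical hospital-readmissions table.
import pandas as pd

readmissions = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "age": [64, 71, 58],
    "length_of_stay": [3, 7, 2],
    "readmitted_30d": [0, 1, 0],
})

print(readmissions.shape)   # quantity: number of records and fields
print(readmissions.dtypes)  # format: data type of each field
readmissions.info()         # identities of the fields plus non-null counts
```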
Step 5. Data Understanding (2)
• 2. Exploring Data
• Explore the data using queries, visualization, and statistics to reveal data characteristics or to identify interesting subsets for further examination.
• These include: the distribution of key attributes, for example the target attribute of a prediction task; relations between pairs or small numbers of attributes; results of simple aggregations; properties of significant sub-populations; and simple statistical analyses.
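• A minimal sketch of these exploration steps on the same kind of hypothetical table, using pandas summaries in place of plots:
```python
# Minimal sketch: exploring a hypothetical readmissions table with simple statistics.
import pandas as pd

readmissions = pd.DataFrame({
    "age": [64, 71, 58, 80, 45],
    "length_of_stay": [3, 7, 2, 10, 1],
    "readmitted_30d": [0, 1, 0, 1, 0],
})

print(readmissions.describe())                        # distribution of key attributes
print(readmissions["readmitted_30d"].value_counts())  # distribution of the target attribute
print(readmissions.corr())                            # relations between pairs of attributes
print(readmissions.groupby("readmitted_30d")["length_of_stay"].mean())  # a simple aggregation
```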
Step 5. Data Understanding (3)
• 3. Verifying Data Quality
• Examine the quality of the data, addressing questions such as: Is the data complete (does it cover all the cases required)?
• Is it correct, or does it contain errors? If there are errors, how common are they?
• Are there missing values in the data? If so, how are they represented, where do they occur, and how common are they?
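• A minimal sketch of such quality checks (the columns and the validity rule are illustrative):
```python
# Minimal sketch: basic data-quality checks with pandas.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "patient_id": [101, 102, 102, 104],
    "age": [64, np.nan, np.nan, -1],  # contains missing and invalid values
})

print(df.isna().sum())                           # how common are missing values, per column
print(df.duplicated(subset="patient_id").sum())  # duplicate records
print((df["age"] < 0).sum())                     # simple validity rule: age must be non-negative
```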
Step 6. Data Preparation
• What additional work is required to manipulate and work with the
data?
Step 6. Data Preparation
• The data preparation stage comprises all activities used to construct
the data set that will be used in the modeling stage.
• These include
• Selecting data (feature selection)
• Cleansing data
• Constructing (derived) data (feature engineering)
• Combining data from multiple sources, and
• Formatting data
Step 6. Data Preparation: Selecting Data
• 1. Selecting Data
• Decide on the data to be used for analysis.
• Criteria:
• relevance to the data science goals,
• quality, and
• technical constraints such as limits on data volume or data types.
• Data selection covers selection of attributes (columns) as well as selection of
records (rows) in a table.
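• A minimal sketch of attribute and record selection (the column names and the filtering rule are hypothetical):
```python
# Minimal sketch: selecting columns relevant to the goal and filtering rows.
import pandas as pd

df = pd.DataFrame({
    "patient_id": [101, 102, 103],
    "age": [64, 17, 58],
    "free_text_notes": ["...", "...", "..."],
    "readmitted_30d": [0, 1, 0],
})

# Attribute (column) selection: drop a field not relevant to the modeling goal.
selected = df[["patient_id", "age", "readmitted_30d"]]

# Record (row) selection: keep only adult patients, e.g. to respect a scope constraint.
selected = selected[selected["age"] >= 18]
print(selected)
```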
Step 6. Data Preparation: Cleansing Data
• 2. Cleansing data
• Raise the data quality to the level required by the selected analysis
techniques.
• This may involve selection of clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modeling.
• Missing data
• Incorrect/invalid data
• Duplicate data
• Formatting data
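• A minimal sketch covering these cleansing steps (the default values and coding rules below are illustrative, not prescribed by the module):
```python
# Minimal sketch: cleansing a small hypothetical table.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "patient_id": [101, 102, 102, 104],
    "age": [64, np.nan, np.nan, -1],
    "gender": ["F", "M", "M", "f"],
})

df = df.drop_duplicates(subset="patient_id")      # duplicate data
df["age"] = df["age"].replace(-1, np.nan)         # incorrect/invalid data treated as missing
df["age"] = df["age"].fillna(df["age"].median())  # missing data: insert a suitable default
df["gender"] = df["gender"].str.upper()           # formatting data: consistent coding
print(df)
```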
Step 6. Data Preparation: Feature Engineering
• 3. Feature engineering: Creating new features from existing ones to
improve model performance.
• Indicator Variables
• Indicator variable from thresholds:
• When studying alcohol preferences of U.S. consumers and our dataset has an age feature, we can create an indicator variable for age >= 21 to distinguish subjects who are over the legal drinking age.
• Indicator variable from multiple features:
• When predicting real-estate prices with the features n_bedrooms and n_bathrooms, if houses with 2 beds and 2 baths command a premium as rental properties, we can create an indicator variable to flag them.
• Indicator variable for special events:
• When modeling weekly sales of an e-commerce site, we could create two indicator variables for the weeks of Black Friday and Christmas.
• Indicator variable for group of classes:
• from the categorical feature traffic_source, we could create an indicator variable for paid_traffic by
flagging observations with traffic source values of "Facebook Ads" or "Google Adwords".
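• A minimal sketch of the indicator-variable examples above; the data frames and values are made up for illustration:
```python
# Minimal sketch: creating indicator variables with pandas.
import pandas as pd

# From a threshold: legal drinking age.
consumers = pd.DataFrame({"age": [19, 25, 34]})
consumers["over_21"] = (consumers["age"] >= 21).astype(int)

# From multiple features: 2-bed/2-bath rental premium.
houses = pd.DataFrame({"n_bedrooms": [2, 3], "n_bathrooms": [2, 1]})
houses["premium_rental"] = ((houses["n_bedrooms"] == 2) &
                            (houses["n_bathrooms"] == 2)).astype(int)

# For a group of classes: paid traffic sources.
traffic = pd.DataFrame({"traffic_source": ["Facebook Ads", "Organic", "Google Adwords"]})
traffic["paid_traffic"] = traffic["traffic_source"].isin(
    ["Facebook Ads", "Google Adwords"]).astype(int)
```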
Step 6. Data Preparation: Feature Engineering
• Interaction Features
• Sum of two features:
• When predicting revenue based on preliminary sales data, if we have the
features sales_blue_pens and sales_black_pens, we could sum those features if we only care
about overall sales_pens.
• Difference between two features:
• If we have the features house_built_date and house_purchase_date, we can take their difference to create the feature house_age_at_purchase.
• Product of two features:
• When running a pricing test, and we have the feature price and an indicator
variable conversion, we can take their product to create the feature earnings.
• Quotient of two features:
• Having a dataset of marketing campaigns with the features n_clicks and n_impressions, we
can divide clicks by impressions to create click_through_rate, allowing you to compare across
campaigns of different volume.
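• A minimal sketch of the four interaction features above; all column values are hypothetical:
```python
# Minimal sketch: sum, difference, product, and quotient features.
import pandas as pd

sales = pd.DataFrame({"sales_blue_pens": [10, 4], "sales_black_pens": [5, 6]})
sales["sales_pens"] = sales["sales_blue_pens"] + sales["sales_black_pens"]  # sum

houses = pd.DataFrame({
    "house_built_date": pd.to_datetime(["1990-01-01", "2005-06-15"]),
    "house_purchase_date": pd.to_datetime(["2010-01-01", "2015-06-15"]),
})
houses["house_age_at_purchase"] = (
    houses["house_purchase_date"] - houses["house_built_date"]).dt.days // 365  # difference

pricing = pd.DataFrame({"price": [9.99, 19.99], "conversion": [1, 0]})
pricing["earnings"] = pricing["price"] * pricing["conversion"]  # product

campaigns = pd.DataFrame({"n_clicks": [120, 30], "n_impressions": [2400, 1500]})
campaigns["click_through_rate"] = campaigns["n_clicks"] / campaigns["n_impressions"]  # quotient
```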
Step 6. Data Preparation: Feature Engineering
• Feature Representation
• date and time features
• From the feature purchase_datetime, we can
create purchase_day_of_week, purchase_hour_of_day, purchases_over_last_30_days.
• numeric to categorical mapping
• When we have the feature years_in_school, we might create a new feature grade with classes
such as "Elementary School", "Middle School", and "High School".
• grouping sparse data
• You have a feature with many classes that have low sample counts. You can try grouping
similar classes and then grouping the remaining ones into a single "Other" class.
• creating dummy variables
• Depending on your machine learning implementation, you may need to manually transform
categorical features into dummy variables. You should always do this after grouping sparse
classes.
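• A minimal sketch of these representation changes; the bin edges, class groupings, and column names are illustrative assumptions:
```python
# Minimal sketch: date/time features, binning, grouping sparse classes, dummy variables.
import pandas as pd

orders = pd.DataFrame({"purchase_datetime": pd.to_datetime(
    ["2023-11-24 09:30", "2023-12-25 18:05"])})
orders["purchase_day_of_week"] = orders["purchase_datetime"].dt.dayofweek
orders["purchase_hour_of_day"] = orders["purchase_datetime"].dt.hour

students = pd.DataFrame({"years_in_school": [3, 7, 11]})
students["grade"] = pd.cut(students["years_in_school"], bins=[0, 5, 8, 12],
                           labels=["Elementary School", "Middle School", "High School"])

traffic = pd.DataFrame({"traffic_source": ["Facebook Ads", "Bing", "Yahoo", "Organic"]})
keep = ["Facebook Ads", "Organic"]
traffic["traffic_source"] = traffic["traffic_source"].where(
    traffic["traffic_source"].isin(keep), "Other")             # group sparse classes first

dummies = pd.get_dummies(traffic, columns=["traffic_source"])  # then create dummy variables
print(dummies)
```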
Step 6. Data Preparation: Combining Data
• 4. Combining data from multiple sources
• These are methods whereby information is combined from multiple tables or
records to create new records or values
• Merging data also covers aggregation. Aggregation refers to operations where new values are computed by summarizing information from multiple records and/or tables.
• For example, converting a table of customer purchases where there is one
record for each purchase into a new table where there is one record for each
customer, with fields such as number of purchases, average purchase
amount, percent of orders charged to credit card, percent of items under
promotion, etc.
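• A minimal sketch of this purchase-to-customer aggregation; the purchases table is hypothetical:
```python
# Minimal sketch: one record per purchase rolled up into one record per customer.
import pandas as pd

purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 10.0, 5.0, 15.0],
    "paid_by_credit_card": [1, 0, 1, 1, 0],
    "item_under_promotion": [0, 1, 0, 0, 1],
})

customers = purchases.groupby("customer_id").agg(
    number_of_purchases=("amount", "count"),
    average_purchase_amount=("amount", "mean"),
    pct_credit_card=("paid_by_credit_card", "mean"),
    pct_under_promotion=("item_under_promotion", "mean"),
).reset_index()
print(customers)
```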
Step 6. Data Preparation: Formatting Data
• 5. Formatting data
• Formatting transformations refer to primarily syntactic modifications made to the data that do not change its meaning but might be required by the modeling tool.
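• A minimal sketch of purely syntactic formatting (the renames and recodings below are illustrative):
```python
# Minimal sketch: renaming fields, casting types, and reordering columns
# without changing what the data means.
import pandas as pd

df = pd.DataFrame({"Patient ID": ["101", "102"], "Readmitted?": ["yes", "no"]})

df = df.rename(columns={"Patient ID": "patient_id", "Readmitted?": "readmitted"})
df["patient_id"] = df["patient_id"].astype(int)               # cast string IDs to integers
df["readmitted"] = df["readmitted"].map({"yes": 1, "no": 0})  # recode labels for the modeling tool
df = df[["readmitted", "patient_id"]]                         # reorder columns if the tool requires it
print(df)
```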
D. From Modeling to Evaluation
• Step 7. Modeling
• Step 8. Evaluation
Step 7. Modeling
• Modeling is geared toward answering two key questions:
A. What is the purpose of data modeling?
B. What are the characteristics of the process?
Step 7. Modeling
• Starting with the first version of the prepared data set, data scientists
use a training set—historical data in which the outcome of interest is
known—to develop predictive or descriptive models using the
analytic approach already described.
• A descriptive model can tell what new service a customer may prefer based on the customer's existing preferences, using recommender systems and clustering algorithms.
• Predictive modeling, in contrast, estimates a future value or class based on present data; examples include classification and linear or logistic regression algorithms.
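• A minimal sketch of a predictive model in this spirit, using scikit-learn logistic regression on a tiny hypothetical readmissions training set (the module does not prescribe a specific library or feature set):
```python
# Minimal sketch: training a predictive (classification) model on historical data
# where the outcome of interest is known.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "age": [64, 71, 58, 80, 45, 67, 52, 90],
    "length_of_stay": [3, 7, 2, 10, 1, 6, 2, 12],
    "readmitted_30d": [0, 1, 0, 1, 0, 1, 0, 1],  # known outcome used as the training label
})

X = data[["age", "length_of_stay"]]
y = data["readmitted_30d"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print(model.predict(X_test))        # predicted class for held-out records
print(model.score(X_test, y_test))  # simple accuracy on the held-out set
```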
Step 7. Modeling
The modeling process is highly iterative.
• 1. Selecting Analytic Modeling Technique
• 2. Generating Test Design
• 3. Building Model
Step 7. Modeling (2)
• 1. Selecting an analytic modeling technique or techniques
• Decision tree
• Artificial Neural Network
• SVM
• Deep Learning
• …
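• One way to compare such candidate techniques is to cross-validate each on the same prepared data; the sketch below uses scikit-learn and a synthetic data set purely for illustration:
```python
# Minimal sketch: comparing candidate analytic techniques with cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "artificial neural network": MLPClassifier(max_iter=2000, random_state=0),
    "SVM": SVC(),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```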