6 Real-World Case Studies: Data Science For Business

DATA SCIENCE FOR BUSINESS
6 REAL-WORLD CASE STUDIES

Build 6 Practical Real-world Business Projects and Harness the Power
of Data Science and Machine Learning to solve practical, real-world
business problems.
By Dr. Ryan Ahmed, Ph.D., MBA

SuperDataScience Team
1. UNSUPERVISED LEARNING (CLUSTERING) K-MEANS
A. CONCEPT
• K-means is an unsupervised learning algorithm (clustering).
• K-means works by grouping data points together (clustering) in an unsupervised fashion, that is, without labeled examples.
• The algorithm groups observations with similar attribute values together by measuring the Euclidean distance between points.
[Figure: scatter plots of AGE vs. SAVINGS before and after applying K-MEANS]
B. K-MEANS ALGORITHM STEPS
1. Choose the number of clusters “K”
2. Select K random points that are going to be the initial centroids for each cluster
3. Assign each data point to the nearest centroid; doing so creates “K” clusters
4. Calculate a new centroid for each cluster
5. Reassign each data point to the new closest centroid
6. Go to step 4 and repeat until the assignments stop changing.
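The steps above can be sketched in plain Python. This is a minimal illustration, not the scikit-learn implementation: the function name `kmeans`, the `max_iter` and `seed` parameters, and the toy (age, savings) data are all assumptions made for the example.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on a list of equal-length tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # step 2: K random points as initial centroids
    labels = None
    for _ in range(max_iter):
        # steps 3 and 5: assign every point to its nearest centroid (Euclidean distance)
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:        # assignments stopped changing: converged
            break
        labels = new_labels
        # step 4: move each centroid to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                 # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical (age, savings in $1000s) observations forming two obvious groups
data = [(25, 10), (27, 12), (24, 9), (60, 200), (62, 210), (58, 190)]
centroids, labels = kmeans(data, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; only the grouping matters.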
C. HOW TO CHOOSE OPTIMAL NUMBER OF K?
• Calculate the “Within Cluster Sum of Squares (WCSS)” for various values of K (number of clusters).
• Plot the WCSS vs. K and choose the elbow of the curve as the optimal number of clusters to use.
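Within Cluster Sum of Squares (WCSS), shown here for three clusters with centroids C1, C2 and C3:
$$
\text{WCSS} = \sum_{P_i \in \text{Cluster 1}} \text{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster 2}} \text{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster 3}} \text{distance}(P_i, C_3)^2
$$
[Figure: left, points Pi and centroids C1, C2, C3 on AGE vs. SAVINGS axes; right, WITHIN CLUSTERS SUM OF SQUARES (WCSS) vs. NUMBER OF CLUSTERS “K”, with the elbow marked as the OPTIMAL “K”]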
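The WCSS formula can be evaluated directly. The sketch below assumes the cluster labels and centroids are already known; the helper name `wcss` and the four toy points are hypothetical. It shows how WCSS drops sharply when K goes from 1 to 2 for data with two tight groups.

```python
import math

def wcss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical toy data: two tight groups
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 1.0), (9.0, 3.0)]

# K = 1: a single centroid at the overall mean
wcss_k1 = wcss(points, [0, 0, 0, 0], [(5.0, 2.0)])
# K = 2: one centroid per group
wcss_k2 = wcss(points, [0, 0, 1, 1], [(1.0, 2.0), (9.0, 2.0)])
print(wcss_k1, wcss_k2)
```

Repeating this for K = 1, 2, 3, ... and plotting the values gives the elbow curve.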
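D. HOW TO IMPLEMENT K-MEANS IN SCIKIT-LEARN?
>> from sklearn.cluster import KMeans
>> k = 4 # specify the number of clusters
>> kmeans = KMeans(n_clusters = k)
>> kmeans.fit(X_train)
>> labels = kmeans.labels_ # cluster index assigned to each training sample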
2. TIME SERIES FORECASTING
A. CONCEPT
• Predictive models attempt to forecast future sales based on historical data while taking into account seasonality effects, demand, holidays, promotions, and competition.
• Facebook Prophet is open source software released by Facebook’s Core Data Science team. Prophet is a procedure for forecasting time series data based on an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.
• Prophet implements an additive regression model with four elements:
1. A piecewise linear trend. Prophet automatically detects change points in the data and identifies any change in trend.
2. A yearly seasonal component modeled using Fourier series.
3. A weekly seasonal component.
4. A holiday list that can be manually provided.
• An additive regression model takes the form:
$$
Y = \beta_0 + \sum_{j=1}^{p} f_j(X_j) + \epsilon
$$
The functions $f_j(X_j)$ are unknown smoothing functions fit from the data.
[Figure: SALES vs. TIME, with the PAST (historical data) followed by the forecast FUTURE]
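To make the additive structure concrete, the sketch below sums a linear trend and two Fourier-series seasonal terms. It is only an illustration of the idea, not Prophet's fitting procedure; all coefficients and the function names are hypothetical, and time `t` is assumed to be measured in days.

```python
import math

def fourier_seasonality(t, period, coeffs):
    """Evaluate sum over n of a_n*cos(2*pi*n*t/period) + b_n*sin(2*pi*n*t/period)."""
    return sum(a * math.cos(2 * math.pi * n * t / period) +
               b * math.sin(2 * math.pi * n * t / period)
               for n, (a, b) in enumerate(coeffs, start=1))

def additive_forecast(t):
    """Trend plus weekly plus yearly component (all coefficients are made up)."""
    trend = 100.0 + 0.5 * t
    weekly = fourier_seasonality(t, 7.0, [(5.0, 2.0)])
    yearly = fourier_seasonality(t, 365.25, [(10.0, 0.0), (0.0, 3.0)])
    return trend + weekly + yearly

print(additive_forecast(0), additive_forecast(30))
```

Because each component is added independently, the weekly term repeats every 7 days regardless of the trend.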
B. FACEBOOK PROPHET
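>> from fbprophet import Prophet # in Prophet 1.0 and later: from prophet import Prophet
>> data_df = data_df.rename(columns = {'Date': 'ds', 'Sales': 'y'}) # Prophet expects columns 'ds' and 'y'
>> m = Prophet()
>> m.fit(data_df)
>> future = m.make_future_dataframe(periods = 720) # extend 720 periods (days by default) into the future
>> forecast = m.predict(future)
>> figure = m.plot(forecast)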
3. REGRESSION TASKS
A. CONCEPT
• Regression is used to predict the value of one variable Y based on another variable X.
• X is called the independent variable and Y is called the dependent variable.
• Simple Linear Regression is a statistical model that examines the linear relationship between two variables only.
• Multiple Linear Regression examines the relationship between more than two variables.
• Each independent variable has its own corresponding coefficient.
$$
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n
$$
where y is the dependent variable and x_1, ..., x_n are the independent variables.

B. LEAST SQUARES CONCEPT
• Least squares fitting is a way to find the best fit curve or line for a set of points.
• The sum of the squares of the offsets (residuals) is used to estimate the best fit curve or line.
• For a straight line y = m * x + b, the least squares method is used to obtain the coefficients m (slope) and b (intercept).
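For each data point, the residual is the offset between the estimated (fitted) value and the actual value:
$$
d_i = \hat{y}_i - y_i
$$
The best fit minimizes the sum of squared residuals (MINIMUM (LEAST) SUM OF SQUARES):
$$
\min \sum_i (\hat{y}_i - y_i)^2
$$
[Figure: REVENUE ($) vs. TEMPERATURE (DegC), showing an actual point yi, its fitted value ŷi on the line, and the offset d between them]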
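For a single independent variable, the least squares coefficients have a well-known closed form, sketched below. The function name `fit_line` and the temperature/revenue numbers are hypothetical; the data were chosen to lie exactly on a line so the result is easy to check.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b minimizing sum((m*x + b - y)**2)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # covariance-like and variance-like sums about the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

temperature = [10.0, 20.0, 30.0, 40.0]   # hypothetical DegC values
revenue = [120.0, 220.0, 320.0, 420.0]   # hypothetical $ values, exactly 10*x + 20
m, b = fit_line(temperature, revenue)
print(m, b)
```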
C. IMPLEMENT MULTIPLE LINEAR REGRESSION IN SCIKIT-LEARN
>> from sklearn.linear_model import LinearRegression

>> regressor = LinearRegression(fit_intercept =True)
>> regressor.fit(X_train,y_train)
>> print('Linear Model Coefficient (m): ', regressor.coef_)
>> print('Linear Model Coefficient (b): ', regressor.intercept_)
D. EVALUATE THE MODEL (MAKE PREDICTIONS USING TRAINED MODEL)
>> y_predict = regressor.predict( X_test)

>> y_predict
4. CLASSIFICATION TASKS
4.1. CLASSIFICATION USING LOGISTIC REGRESSION
A. LOGISTIC REGRESSION CONCEPT
• The logistic regression algorithm works by first applying a linear equation to the independent predictors to compute a value.
• This value is then converted into a probability that ranges from 0 to 1.
• Logistic regression can be used as a classification technique by setting a threshold value (e.g., 0.5) to predict binary outputs with two possible values labeled "0" or "1".
• Therefore, the logistic model output can be one of two classes: pass/fail, win/lose, healthy/sick.
Linear equation:
$$
y = b_0 + b_1 x
$$
Apply the sigmoid function:
$$
P(x) = \text{sigmoid}(y) = \frac{1}{1 + e^{-y}} = \frac{1}{1 + e^{-(b_0 + b_1 x)}}
$$
[Figure: LOGISTIC REGRESSION MODEL, an S-shaped curve of PASS/FAIL vs. HOURS OF STUDYING]
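The two stages (linear equation, then sigmoid and threshold) can be traced by hand. In the sketch below the coefficients `b0` and `b1` are hypothetical values chosen for illustration rather than fitted ones, and `predict_pass` is a made-up helper name.

```python
import math

def sigmoid(y):
    """Map any real value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-y))

def predict_pass(hours, b0, b1, threshold=0.5):
    """Return (probability, class) for a given number of study hours."""
    p = sigmoid(b0 + b1 * hours)      # linear equation followed by sigmoid
    return p, int(p >= threshold)     # 1 = pass, 0 = fail

b0, b1 = -4.0, 1.0                    # hypothetical coefficients
for hours in (2, 4, 6):
    print(hours, predict_pass(hours, b0, b1))
```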
B. LOGISTIC REGRESSION IN SCIKIT-LEARN
>> from sklearn.linear_model import LogisticRegression

>> Logistic_Regressor = LogisticRegression(random_state = 0)
>> Logistic_Regressor.fit(X_train, y_train)
4.2. CLASSIFICATION USING RANDOM FOREST CLASSIFIER
A. RANDOM FOREST CLASSIFIER CONCEPT
• Random Forest Classifier is a type of ensemble algorithm.
• It creates a set of decision trees, each from a randomly selected subset of the training set.
• It then combines votes from the different decision trees to decide the final class of the test object.
• It overcomes the issues with single decision trees by reducing the effect of noise.
• It mitigates overfitting by aggregating the predictions of many trees, so that the errors of individual trees tend to cancel out.
[Figure: N decision trees (Tree #1 through Tree #N), each splitting first on Savings > $1M and then on Age > 45. The first two trees output Class #1 and Tree #N outputs Class #0, so the majority vote is Class #1]
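The voting step can be sketched in a few lines. This only illustrates how per-tree outputs are combined; the helper name `majority_vote` is hypothetical, and the votes mirror the three tree outputs in the figure.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

votes = [1, 1, 0]                 # outputs of three trees, as in the figure
final_class = majority_vote(votes)
print(final_class)
```

With an even number of trees a tie is possible; this sketch then returns whichever tied class appeared first in the list.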
B. RANDOM FOREST CLASSIFIER IN SCIKIT-LEARN
>> from sklearn.ensemble import RandomForestClassifier
>> RandomForest = RandomForestClassifier(n_estimators=250) # forest of 250 decision trees
>> RandomForest.fit(X_train, y_train)
5. DEEP LEARNING AND COMPUTER VISION
A. ARTIFICIAL NEURAL NETWORKS CONCEPT
• Artificial Neural Networks (ANNs) are information processing models inspired by the human brain. ANNs are built in a layered fashion where inputs are propagated starting from the input layer through the hidden layers and finally to the output.
[Figure: feedforward network with inputs P1(t), P2(t), ..., PN(t) feeding Layer 1, Layer 2, ..., Layer k; the layers contain N1, N2, ..., Nk neurons with bias terms b, and Layer k produces the outputs a(t)]
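Network training is performed by optimizing the matrix of weights outlined below:
$$
W = \begin{bmatrix}
W_{1,1} & W_{1,2} & \cdots & W_{1,N_1} \\
W_{2,1} & W_{2,2} & \cdots & W_{2,N_1} \\
\vdots & \vdots & \ddots & \vdots \\
W_{m-1,1} & W_{m-1,2} & \cdots & W_{m-1,N_1} \\
W_{m,1} & W_{m,2} & \cdots & W_{m,N_1}
\end{bmatrix}
$$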
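To show how a weight matrix is used when inputs are propagated through a layer, here is a minimal forward pass for one fully connected layer. The indexing convention (`weights[j][i]` connects input i to neuron j), the function names and all numbers are assumptions made for this sketch.

```python
def relu(v):
    """Rectified linear unit: pass positive values, clamp negatives to zero."""
    return max(0.0, v)

def dense_forward(inputs, weights, biases, activation):
    """Compute activation(W @ x + b) for one layer, one neuron per weight row."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

inputs = [1.0, 2.0]
W = [[0.5, -1.0],    # weights into neuron 1 (hypothetical values)
     [1.0, 1.0]]     # weights into neuron 2 (hypothetical values)
b = [0.0, -1.0]
hidden = dense_forward(inputs, W, b, relu)
print(hidden)
```

Training adjusts the entries of `W` and `b` so that the final layer's outputs match the targets.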
B. BUILD AN ANN USING KERAS
>> from tensorflow.keras.models import Sequential
>> from tensorflow.keras.layers import Dense
>> from sklearn.preprocessing import MinMaxScaler
>> model = Sequential()
>> model.add(Dense(25, input_dim=5, activation='relu')) # first hidden layer, 5 input features
>> model.add(Dense(25, activation='relu')) # second hidden layer
>> model.add(Dense(1, activation='linear')) # single linear output for regression
>> model.summary()
C. TRAIN AN ANN USING KERAS
>> model.compile(optimizer='adam', loss='mean_squared_error')
>> epochs_hist = model.fit(X_train, y_train, epochs=20, batch_size=25, validation_split=0.2)
D. EVALUATE THE TRAINED ANN MODEL
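>> import numpy as np
>> # replace input_1 ... input_n with the n feature values of the sample to predict
>> X_Testing = np.array([[input_1, input_2, input_3, ..., input_n]])
>> y_predict = model.predict(X_Testing)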
E. HOW TO BUILD A CONVOLUTIONAL NEURAL NETWORK (CNN)
• CNNs are a type of deep neural network commonly used for image classification.
• CNNs are formed of (1) Convolutional Layers (Kernels and feature detectors), (2) Activation Functions (ReLU), (3) Pooling Layers (Max Pooling or Average Pooling), and (4) Fully Connected Layers (Multi-layer Perceptron Network).
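[Figure: CNN pipeline. The input image passes through a CONVOLUTIONAL LAYER (KERNELS/FEATURE DETECTORS), a POOLING LAYER (POOLING FILTERS, DOWNSAMPLING) and FLATTENING, then through the Input, Hidden and Output layers of a fully connected network. TARGET CLASSES: Airplanes, Cars, Birds, Cats, Deer, Dogs, Frogs, Horses, Ships, Trucks. Inset: ReLU activation, f(y) = y for y > 0 and f(y) = 0 otherwise]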
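The activation and pooling stages can be traced on a small feature map. The sketch below applies ReLU and then 2x2 max pooling with stride 2; the function names and the 4x4 values are hypothetical, and the pooling helper assumes an even height and width.

```python
def relu(feature_map):
    """Apply f(y) = max(0, y) to every element."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2; assumes even height and width."""
    return [[max(feature_map[r][c], feature_map[r][c + 1],
                 feature_map[r + 1][c], feature_map[r + 1][c + 1])
             for c in range(0, len(feature_map[0]), 2)]
            for r in range(0, len(feature_map), 2)]

fm = [[ 1, -2,  3,  0],
      [-1,  5, -3,  2],
      [ 0,  1, -4, -1],
      [ 2, -2, -5, -6]]
pooled = max_pool_2x2(relu(fm))
print(pooled)  # [[5, 3], [2, 0]]
```

Pooling halves each spatial dimension, which is the downsampling shown in the figure.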
F. BUILD A CNN USING KERAS:
>> from keras.models import Sequential
>> from keras.layers import Conv2D, MaxPooling2D, AveragePooling2D, Dense, Flatten, Dropout
>> from keras.optimizers import Adam
>> from keras.callbacks import TensorBoard
>> cnn_model = Sequential()
>> cnn_model.add(Conv2D(filters = 32, kernel_size = (3,3), activation = 'relu', input_shape = (32,32,3)))
>> cnn_model.add(Conv2D(filters = 32, kernel_size = (3,3), activation = 'relu'))
>> cnn_model.add(MaxPooling2D(2,2))
>> cnn_model.add(Dropout(0.3))
>> cnn_model.add(Flatten())
>> cnn_model.add(Dense(units = 512, activation = 'relu'))
>> cnn_model.add(Dense(units = 10, activation = 'softmax'))
6. NATURAL LANGUAGE PROCESSING (TOKENIZATION)
A. CONCEPT
• Tokenization is a common procedure in natural language processing. Tokenization works by splitting a sentence into individual words (tokens). These words are then used to train a machine learning model to perform a certain task.
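A bare-bones version of this idea is sketched below: lowercase the text, split it into word tokens with a regular expression, and count them. The regular expression and the sample sentence are assumptions for illustration; library tokenizers apply their own, different rules.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase the text and return its alphanumeric word tokens in order."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

tokens = tokenize("The food was great, and the service was great too!")
counts = Counter(tokens)          # word -> number of occurrences
print(tokens)
print(counts["great"], counts["the"])
```

The resulting counts are the kind of numeric features a model can be trained on.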
B. TOKENIZATION USING SCIKIT-LEARN
>> from sklearn.feature_extraction.text import CountVectorizer
>> vectorizer = CountVectorizer()
>> # data_df should be an iterable of text documents (for example, a single text column)
>> output = vectorizer.fit_transform(data_df)
7. CLASSIFICATION MODELS ASSESSMENT (CONFUSION MATRIX/CLASSIFICATION REPORT)
A. CONFUSION MATRIX CONCEPT
A confusion matrix is used to describe the performance of a classification model:
• True positives (TP): cases where the classifier predicted TRUE (has disease) and the correct class was TRUE (patient has disease).
• True negatives (TN): cases where the classifier predicted FALSE (no disease) and the correct class was FALSE (patient does not have disease).
• False positives (FP) (Type I error): cases where the classifier predicted TRUE but the correct class was FALSE (patient does not have disease).
• False negatives (FN) (Type II error): cases where the classifier predicted FALSE (patient does not have disease) but the patient actually has the disease.
                         TRUE CLASS
                    +                          −
PREDICTIONS    +    TRUE +                     FALSE + (Type I error)
               −    FALSE − (Type II error)    TRUE −
• Classification Accuracy = (TP + TN) / (TP + TN + FP + FN)
• Misclassification rate (Error Rate) = (FP + FN) / (TP + TN + FP + FN)
• Precision = TP / Total TRUE Predictions = TP / (TP + FP) (when the model predicted the TRUE class, how often did it get it right?)
• Recall = TP / Actual TRUE = TP / (TP + FN) (when the class was actually TRUE, how often did the classifier get it right?)
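These formulas can be checked by counting the four outcomes directly. The sketch below assumes binary labels encoded as 1 (TRUE) and 0 (FALSE); the function name and the eight example labels are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels encoded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # Type I errors
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # Type II errors
    return tp, tn, fp, fn

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical actual classes
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]   # hypothetical predictions
tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = (fp + fn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(tp, tn, fp, fn, accuracy, error_rate, precision, recall)
```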
B. CONFUSION MATRIX IN SKLEARN
>> import seaborn as sns
>> from sklearn.metrics import classification_report, confusion_matrix
>> y_predict_test = classifier.predict(X_test) # classifier is any trained scikit-learn classifier
>> cm = confusion_matrix(y_test, y_predict_test)
>> sns.heatmap(cm, annot=True)
C. CLASSIFICATION REPORT
>> from sklearn.metrics import classification_report
>> print(classification_report(y_test, y_predict_test))
8. REGRESSION MACHINE LEARNING MODELS METRICS
A. MEAN ABSOLUTE ERROR (MAE)
• Mean Absolute Error (MAE) is obtained by calculating the absolute difference between the model predictions and the true (actual) values.
• MAE is a measure of the average magnitude of error generated by the regression model.
• The mean absolute error (MAE) is calculated as follows:
$$
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
$$
• MAE is calculated by following these steps:
1. Calculate the residual for every data point
2. Calculate the absolute value of each residual (to get rid of the sign)
3. Calculate the average of all absolute residuals
• If MAE is zero, this indicates that the model predictions are perfect.
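The three steps translate directly into code. The function name and the sample values below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute residuals |y_i - y_hat_i|."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]     # step 1
    absolute = [abs(r) for r in residuals]                  # step 2
    return sum(absolute) / len(absolute)                    # step 3

y_true = [3.0, 5.0, 2.0, 7.0]   # hypothetical actual values
y_pred = [2.5, 5.0, 4.0, 8.0]   # hypothetical predictions
mae = mean_absolute_error(y_true, y_pred)
print(mae)  # 0.875
```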
B. MEAN SQUARE ERROR (MSE)
• Mean Square Error (MSE) is very similar to the Mean Absolute Error (MAE), but instead of using absolute values, the squares of the differences between the model predictions and the true values are calculated.
• MSE values are generally large compared to the MAE since the residuals are squared.
• In case of data outliers, MSE will become much larger compared to MAE.
• In MSE, the penalty grows quadratically with the size of the error, while in MAE it grows proportionally.
• The MSE is calculated as follows:
$$
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$
• MSE is calculated by following these steps:
1. Calculate the residual for every data point
2. Calculate the squared value of the residuals
3. Calculate the average of all squared residuals
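A short sketch makes the outlier sensitivity visible. The function names and the numbers are hypothetical; the second prediction list differs from the first only in its last value, which acts as the outlier.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared residuals (y_i - y_hat_i)**2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute residuals, for comparison."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 4.0, 8.0]            # hypothetical predictions
y_pred_outlier = [2.5, 5.0, 4.0, 17.0]   # same, but the last prediction is far off

mse = mean_squared_error(y_true, y_pred)
mse_outlier = mean_squared_error(y_true, y_pred_outlier)
mae_outlier = mean_absolute_error(y_true, y_pred_outlier)
print(mse, mse_outlier, mae_outlier)
```

In this example a single bad prediction raises the MSE by a much larger factor than it raises the MAE.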
C. ROOT MEAN SQUARE ERROR (RMSE)
• Root Mean Square Error (RMSE) represents the standard deviation of the residuals (i.e., the differences between the model predictions and the true values).
• RMSE is easier to interpret than MSE because RMSE units match the units of the output.
• RMSE provides an estimate of how widely the residuals are dispersed.
• The RMSE is calculated as follows:
$$
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$
• RMSE is calculated by following these steps:
1. Calculate the residual for every data point
2. Calculate the squared value of the residuals
3. Calculate the average of the squared residuals
4. Obtain the square root of the result
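The steps above can be sketched as follows; the function name and sample values are hypothetical.

```python
import math

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared residual; same units as the output."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]   # steps 1 and 2
    return math.sqrt(sum(squared) / len(squared))              # steps 3 and 4

y_true = [3.0, 5.0, 2.0, 7.0]   # hypothetical actual values
y_pred = [2.5, 5.0, 4.0, 8.0]   # hypothetical predictions
rmse = root_mean_squared_error(y_true, y_pred)
print(rmse)
```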

D. MEAN ABSOLUTE PERCENTAGE ERROR (MAPE)
• MAE values can range from 0 to infinity, which makes the result difficult to interpret relative to the scale of the data.
• Mean Absolute Percentage Error (MAPE) is the equivalent of MAE but expresses the error as a percentage, and therefore overcomes this MAE limitation.
• MAPE is undefined if any actual value is zero (since a division by the actual value is involved).
• The MAPE is calculated as follows:
$$
\text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
$$
E. MEAN PERCENTAGE ERROR (MPE)
• MPE is similar to MAPE but without the absolute value operation.
• MPE is useful for showing whether positive errors or negative errors dominate, since errors of opposite sign cancel out.
• The MPE is calculated as follows:
$$
\text{MPE} = \frac{100\%}{n} \sum_{i=1}^{n} \frac{y_i - \hat{y}_i}{y_i}
$$
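MAPE and MPE can be compared side by side on the same data. In this sketch the function names and values are hypothetical, and raising `ValueError` on a zero actual value is a design choice made here, not something the formulas prescribe.

```python
def percentage_errors(y_true, y_pred):
    """Relative errors (y_i - y_hat_i) / y_i; undefined when an actual value is 0."""
    if any(t == 0 for t in y_true):
        raise ValueError("percentage errors are undefined when an actual value is zero")
    return [(t - p) / t for t, p in zip(y_true, y_pred)]

def mape(y_true, y_pred):
    errors = percentage_errors(y_true, y_pred)
    return 100.0 * sum(abs(e) for e in errors) / len(errors)

def mpe(y_true, y_pred):
    errors = percentage_errors(y_true, y_pred)
    return 100.0 * sum(errors) / len(errors)

y_true = [100.0, 200.0, 50.0]   # hypothetical actual values
y_pred = [110.0, 180.0, 50.0]   # one over-prediction, one under-prediction
print(mape(y_true, y_pred), mpe(y_true, y_pred))
```

Here the over-prediction and the under-prediction are each 10% of the actual value, so they cancel in MPE while MAPE still reports the average error size.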
