Data Science Projects


Email: sashs at gmx dot com

PROJECTS RELATING TO
DATA SCIENCE
 Part I
 Predictive Model in Detail
 Part II
 Portfolio
 Part III
 Energy Efficiency in Building Systems
Part I: Building of a Predictive Model

 Human Activity Recognition using ‘RandomForest’
Predictive Model in Detail

Conceptually…

 Steps in building a predictive model


1. Define the question
2. Define the ideal data set
3. Determine what data you can access
4. Obtain the data
5. Clean the data
6. Exploratory data analysis
7. Statistical prediction/modelling
8. Interpret results
9. Challenge results
10. Synthesize/write up results
Predictive Model in Detail

Problem

 Human Activity Prediction Using Smartphones Data Set
 Samsung Galaxy S II
 30 volunteers wearing the phone on their waist
 Six activities
 WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING
 Sensors
 Accelerometer and Gyroscope

Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Predictive Model in Detail

Dataset

 UCI Machine Learning Repository


 561-feature vector with time and frequency
domain variables, augmented with “subject” and
“activity” => 563
 3-axial linear acceleration
 3-axial angular velocity

Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Predictive Model in Detail

Duplicate Column Names

 R Language
load(".\\samsungData.rda")

is.data.frame(samsungData)
# [1] TRUE

table(duplicated(names(samsungData))) # checking for duplicate headers
# FALSE  TRUE
#   479    84
Predictive Model in Detail

Duplicate Column Names

samsDF <- data.frame(samsungData) # data.frame() makes column names unique (check.names = TRUE)

is.data.frame(samsDF)
# [1] TRUE

table(duplicated(names(samsDF))) # checking for duplicate headers
# FALSE
#   563
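The duplicate-header check above can also be sketched in Python. This is a minimal illustration using only the standard library; the header names are hypothetical placeholders, and the suffix scheme only resembles what R does (R's data.frame() also sanitizes characters, which this sketch does not, and it does not guard against a suffixed name colliding with an existing one).

```python
from collections import Counter


def find_duplicates(names):
    """Return the names that occur more than once, with their counts."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


def make_unique(names):
    """Append .1, .2, ... to repeated names; the first occurrence is kept as-is."""
    seen = Counter()
    unique = []
    for name in names:
        repeat = seen[name]
        unique.append(name if repeat == 0 else f"{name}.{repeat}")
        seen[name] += 1
    return unique


# Hypothetical header fragment with one repeated name
headers = ["feature_a", "feature_b", "feature_a", "subject"]
print(find_duplicates(headers))
print(make_unique(headers))
```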
Predictive Model in Detail

Column Types
table(sapply(samsDF, class))
# character integer numeric
# 1 1 561

which(sapply(samsDF, is.character))
# activity
# 563

which(sapply(samsDF, is.integer))
# subject
# 562
Predictive Model in Detail

Missing Data & Finite Values

dim(samsDF)
# [1] 7352 563

table(complete.cases(samsDF))
# TRUE
# 7352

table(sapply(samsDF[,1:561], is.finite))
# TRUE
# 4124472 #7352*561 = 4124472
Predictive Model in Detail

Balanced Data

table(samsDF$activity)
# laying sitting standing walk walkdown walkup
# 1407 1286 1374 1226 986 1073

sum(table(samsDF$activity))
# [1] 7352

round( table(samsDF$activity)/nrow(samsDF), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
Predictive Model in Detail

Splitting Data
library(caTools)
# Randomly split the data into training and testing sets
set.seed(1000)
split = sample.split(samsDF$activity, SplitRatio = 0.7)

# Split up the data using subset


train = subset(samsDF, split==TRUE)
dim(train)
# [1] 5146 563

round( table(train$activity)/nrow(train), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
Predictive Model in Detail

Test Data

test = subset(samsDF, split==FALSE)


dim(test)
# [1] 2206 563

round( table(test$activity)/nrow(test), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
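The idea behind a stratified 70/30 split can be sketched in Python. This is an illustration only: it uses the class counts reported by table(samsDF$activity), and the function names are hypothetical. It does not reproduce the exact rows that sample.split selects; it only shows why the class proportions stay the same in both subsets.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, ratio=0.7, seed=1000):
    """Return booleans (True = training row), keeping about `ratio` of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    in_train = [False] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratio * len(indices))
        for i in indices[:n_train]:
            in_train[i] = True
    return in_train


def proportions(labels):
    """Class shares rounded to two decimals, like round(table(x)/length(x), 2)."""
    counts = Counter(labels)
    return {k: round(v / len(labels), 2) for k, v in sorted(counts.items())}


# Class counts copied from the table(samsDF$activity) output
counts = {"laying": 1407, "sitting": 1286, "standing": 1374,
          "walk": 1226, "walkdown": 986, "walkup": 1073}
labels = [name for name, n in counts.items() for _ in range(n)]

mask = stratified_split(labels)
train = [label for label, keep in zip(labels, mask) if keep]
test = [label for label, keep in zip(labels, mask) if not keep]

print(len(train), len(test))
print(proportions(train))
print(proportions(test))
```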
Predictive Model in Detail

Random Forest

library(randomForest)
set.seed(415)

trainF = train
trainF[562] = NULL # drop the "subject" column so it is not used as a predictor
dim(trainF)
# [1] 5146 562
Predictive Model in Detail

Determining ntree

fit <- randomForest(as.factor(activity) ~ ., data=trainF,
                    importance=TRUE, ntree=500, do.trace=T)

ntree = 293

Initial Results:
Prediction <- predict(fit, test[1:561])
library(caret)
confusionMatrix(Prediction, as.factor(test[,563])) # reference must be a factor
# Accuracy : 0.9782
# 95% CI : (0.9713, 0.9839)
Predictive Model in Detail

Determining mtry

# mtry : number of variables randomly sampled as candidates at each split

mtry <- tuneRF(trainF[-562], as.factor(trainF$activity), ntreeTry=200,
               stepFactor=1.5, improve=0.01, trace=TRUE, plot=TRUE)

# keep the mtry value with the lowest out-of-bag error
bestm <- mtry[mtry[, 2] == min(mtry[, 2]), 1]
bestm
# [1] 11
Predictive Model in Detail

Building & testing the Model

fitF <- randomForest(as.factor(activity) ~ ., data=trainF,
                     importance=TRUE, ntree=293, mtry=bestm, do.trace=T)

PredictionF <- predict(fitF, test[1:561])

library(caret)
confusionMatrix(PredictionF, as.factor(test[,563])) # reference must be a factor
# Accuracy : 0.9805
# 95% CI : (0.9738, 0.9859)

Relative reduction in error = (0.9805 - 0.9782)/(1 - 0.9782) = 0.1055 (about 10.6% of the initial test error)
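The error-reduction figure can be checked with a few lines of Python. This is a small sketch; the function name is hypothetical, and the two accuracies are the ones reported by confusionMatrix on the slides.

```python
def relative_error_reduction(acc_before, acc_after):
    """Share of the original error rate that the second model removes."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before


reduction = relative_error_reduction(0.9782, 0.9805)
print(f"{reduction:.4f}")
```

Comparing error rates rather than accuracies makes a small gain near the ceiling easier to judge: here the accuracy moves by only 0.0023, but that is roughly a tenth of the errors that were left.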
Predictive Model in Detail

AUC

library(pROC)

ROC1 <- multiclass.roc( test$activity, as.numeric(PredictionF))


auc(ROC1)
# Multi-class area under the curve: 0.9953

Decision Tree by Hand: http://bit.ly/DTree123


Part II: Portfolio

 MapReduce: Apache Weblog


 Visualization: LTV
 Streaming Data Analysis: Speech
 Artificial Neural Network (ANN)
 Water-Sludge interface Detection
MapReduce: Apache Weblog

Source: https://www.maxmind.com/en/home
MapReduce: Apache Weblog

Problem

 Analyze Apache weblog and provide:


 EpochTime (date and time the request was
processed by the server)
 IP Address
 Latitude, Longitude
 URI
 Referer

http://bit.ly/oFraud123
MapReduce: Apache Weblog

Combined Weblog Format

 "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""
 (%h) - IP address of the client (remote host)
 - (%l) - the "hyphen" indicates that the requested information is not available
 (%u) - the "userid" of the person requesting the document
 (%t) - time of the request
 …
 …

Source: https://httpd.apache.org/docs/1.3/logs.html
MapReduce: Apache Weblog

Knowing your customers through Apache Logs

 198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] "GET
/svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1"
200 8768 "http://www.svds.com/rockandroll/" "Mozilla/5.0
(Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 (KHTML,
like Gecko) Chrome/31.0.1650.63 Safari/537.36"

[Figure callouts on the log line: IP address, Date & Time, URI, Referer]
MapReduce: Apache Weblog

Challenges

 Weblog needs to be parsed to extract the required information
 Time is not expressed in “Epoch Time”
 Latitude and Longitude are not readily available
MapReduce: Apache Weblog

Regular Expression & Testing

(\S+) (\S+) (\S+) \[([^:]+:\d+:\d+:\d+) ([^\]]+)\] \"(\S+) \/(.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)

https://regex101.com/
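The regular expression above can be tried against the sample log line with Python's re module. This is a minimal sketch: the pattern and the log line are copied from the slides, while the function name and the dictionary keys are hypothetical labels for the capture groups.

```python
import re

# Pattern from the slide, split over two string literals for readability
LOG_PATTERN = re.compile(
    r'(\S+) (\S+) (\S+) \[([^:]+:\d+:\d+:\d+) ([^\]]+)\] '
    r'\"(\S+) \/(.*?) (\S+)\" (\S+) (\S+) "([^"]*)" "([^"]*)'
)

# Sample line from the "Knowing your customers through Apache Logs" slide
sample = (
    '198.0.200.105 - - [14/Jan/2014:09:36:51 -0800] '
    '"GET /svds.com/rockandroll/js/libs/modernizr-2.6.2.min.js HTTP/1.1" '
    '200 8768 "http://www.svds.com/rockandroll/" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_1) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"'
)


def parse_line(line):
    """Return the fields of one combined-format log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    (ip, _ident, _user, timestamp, offset, method,
     uri, protocol, status, size, referer, agent) = match.groups()
    return {
        "ip": ip, "timestamp": timestamp, "offset": offset,
        "method": method, "uri": uri, "protocol": protocol,
        "status": status, "size": size, "referer": referer, "agent": agent,
    }


fields = parse_line(sample)
print(fields["ip"], fields["timestamp"], fields["offset"])
print(fields["uri"])
print(fields["referer"])
```

Note that the pattern consumes the leading slash of the request path, so the captured URI starts without it.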
MapReduce: Apache Weblog

RegEx Groups

https://regex101.com/
MapReduce: Apache Weblog

EpochTime

import calendar
import time

def convert_time(d, utc):
    # d = "14/Jan/2014:09:36:50"
    # utc = '-0800'
    fmt = '%d/%b/%Y:%H:%M:%S'

    # parse the string and read it as if it were UTC, so the result
    # does not depend on the time zone of the machine running the job
    naive = calendar.timegm(time.strptime(d, fmt))

    # offset in seconds: hours and minutes with the sign of the offset
    sign = -1 if utc.startswith('-') else 1
    digits = utc.lstrip('+-')
    offset = sign * (int(digits[:2]) * 3600 + int(digits[2:]) * 60)

    # logged time = UTC + offset, so UTC = logged time - offset
    return naive - offset
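As a cross-check on the manual offset arithmetic, Python 3's datetime module can parse the numeric offset directly through the %z directive. The sketch below assumes Python 3 and an English locale for the month abbreviation; the function name is hypothetical.

```python
from datetime import datetime


def to_epoch(d, utc):
    """Convert an Apache log timestamp and its UTC offset to epoch seconds."""
    # d looks like "14/Jan/2014:09:36:51" and utc like "-0800"
    stamp = datetime.strptime(f"{d} {utc}", "%d/%b/%Y:%H:%M:%S %z")
    return int(stamp.timestamp())


print(to_epoch("14/Jan/2014:09:36:51", "-0800"))
```

Because the parsed datetime is timezone-aware, the result does not depend on the time zone of the machine that runs the job.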
MapReduce: Apache Weblog

Latitude and Longitude

 Geolite2 from MaxMind


 geolite2.lookup(<IP address>)

 Reducer
 http://bit.ly/ApaMapper

Source: https://www.maxmind.com/en/home
MapReduce: Apache Weblog

Mapper

#!/usr/bin/env python

import sys

# Iterate through every line passed in on stdin
for line in sys.stdin:
    value = line.strip()
    print(value)

http://bit.ly/ApaMapper
MapReduce: Apache Weblog

Hadoop

hadoop jar path/to/hadoop-streaming-0.20.203.0.jar \
    -mapper path/to/mapper.py \
    -reducer path/to/reducer.py \
    -input path/to/input/* \
    -output path/to/output
MapReduce: Apache Weblog

Sample Output

http://bit.ly/oFraud123
MapReduce: Apache Weblog

Impact

Helps to Detect Online Fraud and Locate Online Visitors
Visualization: LTV
Visualization: LTV

Background

 Gamers sign up each day and become part of a cohort
 LTV (customer lifetime value) is computed for up to 30 days
Visualization: LTV

Problem

 Use Tableau to:


 Compute LTV
 Compute weighted LTV
Visualization: LTV

Challenges

 Tableau is relatively new


 LTV computation was not readily available
 Given dataset is irregular:
Visualization: LTV

Computed LTV
Visualization: LTV

Weighted LTV
Visualization: LTV

Impact

Customer LTV
>
Cost of customer Acquisition (CAC)

 CAC
 $10 engagement -> 5 new users -> these users
acquire 15 more users at no cost
 CAC = $10/(5+15) = $0.50
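The CAC arithmetic above can be written as a short Python sketch. The function names are hypothetical, the figures are the ones from the slide, and the LTV value in the last line is a made-up number used only to show the comparison.

```python
def customer_acquisition_cost(spend, paid_users, referred_users=0):
    """Spend divided by all users gained, including those referred at no cost."""
    return spend / (paid_users + referred_users)


def acquisition_pays_off(ltv, cac):
    """The slide's criterion: customer LTV should exceed CAC."""
    return ltv > cac


cac = customer_acquisition_cost(10.0, paid_users=5, referred_users=15)
print(cac)
# Hypothetical LTV of $1.20 per user, for illustration only
print(acquisition_pays_off(1.20, cac))
```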
Streaming Data: Speech

https://angel.co/freeaccent
Streaming Data: Speech

Language Learning over a Chat session

https://angel.co/freeaccent
Streaming Data: Speech

Problem

 Learn a foreign language from a native speaker
 Student and Tutor are separated
 Use computing device and internet

https://angel.co/freeaccent
Streaming Data: Speech

Challenges

 Collect the speech data off the web


 Record: start record, stop record
 Upload
 Preprocessing speech data
 End point detection
 Noise
 Extracting Accent Score frame by frame
 Populating on the web page on demand

https://angel.co/freeaccent
Streaming Data: Speech

Technology Stack

 Collect the speech data off the web


 Html5, JavaScript, PHP
 Preprocessing speech data
 Energy based algo, MFCC
 Extracting Accent Score frame by frame
 Proprietary algo
 Populating on the web page on demand
 AJAX
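The "energy based algo" for end point detection is not spelled out on the slides. A common form of the idea is sketched below in Python: split the signal into frames, compute the mean energy of each frame, and treat the first and last frames above a threshold as the end points. The frame length, threshold, and test signal are all assumptions made for illustration, not the values used in the project.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of the active region, or None if silent."""
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Synthetic test signal: silence, a 440 Hz tone sampled at 8 kHz, silence
silence = [0.0] * 800
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 8000) for n in range(1600)]
signal = silence + tone + silence

print(detect_endpoints(signal))
```

A fixed threshold only works on clean recordings; with background noise the threshold would normally be set relative to the measured noise floor.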

https://angel.co/freeaccent
Streaming Data: Speech

User Interface

https://angel.co/freeaccent
Streaming Data: Speech

Impact

 Measurement tool
 Motivational: helps to set goal
 Customer retention

https://angel.co/freeaccent
Artificial Neural Network (ANN)
ANN

McCulloch Pitts (MP) Neuron

Source: https://appliedgo.net/perceptron/
ANN

Diagram of the MP neuron


ANN

Equation of the MP neuron

Source: http://dms1.irb.hr/tutorial/tut_nnets_short.php
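The equation on this slide is an image that is not reproduced here. A common textbook form of a threshold neuron is: output 1 when the weighted sum of the inputs reaches a threshold, otherwise 0. The Python sketch below assumes that form; the weights and thresholds are illustrative choices.

```python
def mp_neuron(inputs, weights, threshold):
    """Fire (1) when the weighted input sum reaches the threshold, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation >= threshold else 0


# With unit weights, a threshold of 2 behaves like AND and a threshold of 1 like OR
for a in (0, 1):
    for b in (0, 1):
        and_out = mp_neuron([a, b], [1, 1], threshold=2)
        or_out = mp_neuron([a, b], [1, 1], threshold=1)
        print(a, b, and_out, or_out)
```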
ANN

Multi-Layer Perceptron

 Fully interconnected

Source: http://dms1.irb.hr/tutorial/tut_nnets_short.php
ANN

Optimization function

 Rumelhart et al. – Gradient Descent (Generalized Delta Rule)
ANN

Challenges

 Saturation at Initialization
 Known solutions:
 Small initial weights
 Hyperbolic Tangent Function instead of Sigmoidal
 Other challenges relating to speech
processing
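Why the two known solutions help can be seen from the derivatives of the activation functions. The gradient of the logistic sigmoid is s(x)(1 - s(x)), which peaks at 0.25 and shrinks toward zero for large |x|; the gradient of tanh is 1 - tanh(x)^2, which peaks at 1. The Python sketch below prints both for a few pre-activation values; the chosen values are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


# Small pre-activations (small initial weights) keep the gradient usable;
# large ones push the unit into the flat, saturated part of the curve.
for x in (0.0, 0.5, 5.0, 10.0):
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  tanh'={tanh_grad(x):.6f}")
```

Since the weight update in gradient descent is proportional to these derivatives, a saturated unit barely learns, which is why large initial weights slow training.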

http://bit.ly/my_pubs
ANN

Hyperbolic Tangent Function


ANN

Saturation at Initialization

http://bit.ly/modANN
ANN

Introduced (N)

where (N)  1

http://bit.ly/modANN
ANN

Impact

 Training time was significantly reduced


 3 layers – not needed
 (N) - empirical
Water-Sludge interface Detection

 Thames Water Authority – Deephams Station, Enfield
Water-Sludge interface Detection

Problem

 Replace Turbidity meter


 Piezo-electric transducer to detect water-
sludge interface
 Measure water depth in a final stage settling
tank
Water-Sludge interface Detection

Piezo-electric Transducer

Receiver

Transmitter
Water-Sludge interface Detection

Final Stage Settling Tank


Water-Sludge interface Detection

Pulsed Sinusoidal Signal

 Period of pulse 27.5 ms


Water-Sludge interface Detection

Collecting Data

 Envelope Detection and Amplification


Water-Sludge interface Detection

Data Visualization

 Average of the reverberated signal by the pulse period

[Figure: averaged signal over one pulse period, with labels for leakage, reverberation, a 3.68 ms interval, and the bottom of the tank]
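Averaging by the pulse period can be sketched in Python as follows: cut the recording into consecutive windows of one pulse period and average them sample by sample, so that the repeating echo is reinforced while uncorrelated noise cancels. The echo shape and noise values below are made-up numbers for illustration, and the function name is hypothetical.

```python
def average_by_period(samples, period_len):
    """Average consecutive windows of `period_len` samples, sample by sample."""
    n_periods = len(samples) // period_len
    if n_periods == 0:
        raise ValueError("need at least one full pulse period")
    return [
        sum(samples[p * period_len + i] for p in range(n_periods)) / n_periods
        for i in range(period_len)
    ]


# Made-up echo shape for one pulse period, repeated with a different
# noise offset in each period (the offsets sum to zero)
echo = [0.0, 1.0, 0.2, 0.0, 0.6, 0.0]
noise = [0.3, -0.3, 0.1, -0.1]
recording = [value + n for n in noise for value in echo]

averaged = average_by_period(recording, len(echo))
print([round(v, 3) for v in averaged])
```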
Water-Sludge interface Detection

Computing the Water Depth

 Speed of sound in water ~1.5 × 10^3 m/s

Round-trip distance = 1.5 × 10^3 m/s × 3.68 ms
= 5.52 m

Depth of water = 5.52 m / 2 (the pulse travels down and back)
= 2.76 m
= 9.05 ft
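The depth calculation can be reproduced in Python. This is a small sketch with a hypothetical function name; the speed of sound and the 3.68 ms echo time are the values from the slides, and the depth is half the round-trip distance because the pulse travels to the bottom and back.

```python
METERS_PER_FOOT = 0.3048


def depth_from_echo(round_trip_s, speed_m_per_s=1.5e3):
    """Depth in meters from the round-trip time of the echo."""
    return speed_m_per_s * round_trip_s / 2.0


depth_m = depth_from_echo(3.68e-3)
depth_ft = depth_m / METERS_PER_FOOT
print(f"{depth_m:.2f} m, {depth_ft:.2f} ft")
```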
Water-Sludge interface Detection

Impact

 Proof of concept was successful


 Won a contract to develop an instrument
Water-Sludge interface Detection

Addition of Internet?

 IoT
 On a computer or a device
Part III

 Energy Efficiency in Building Systems


Energy Efficiency in Building Systems

Powerwall by Tesla

Powerwall
Energy Efficiency in Building Systems

Solar Tubes and Walls


Energy Efficiency in Building Systems

Sun Shades

UC Davis West Village is the largest planned “zero net energy” community
Energy Efficiency in Building Systems

Net Zero Homes

 New homes to be net-zero energy by 2020


 California Public Utilities Commission (CPUC) and
 California Energy Commission (CEC)

Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020
Energy Efficiency in Building Systems

Related Work

 http://bit.ly/EnergyEff123

Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020
