Data Science Projects
PROJECTS RELATING TO DATA SCIENCE
Part I
Predictive Model in Detail
Part II
Portfolio
Part III
Energy Efficiency in Building Systems
Part I: Building of a Predictive Model
Conceptually…
Problem
Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
Predictive Model in Detail
Dataset
Source: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones
R Language
load(".\\samsungData.rda")
is.data.frame(samsungData)
# [1] TRUE
table(duplicated(names(samsungData))) # checking for duplicate headers
# FALSE TRUE
# 479 84
Column Types
table(sapply(samsDF, class))
# character integer numeric
# 1 1 561
which(sapply(samsDF, is.character))
# activity
# 563
which(sapply(samsDF, is.integer))
# subject
# 562
dim(samsDF)
# [1] 7352 563
table(complete.cases(samsDF))
# TRUE
# 7352
table(sapply(samsDF[,1:561], is.finite))
# TRUE
# 4124472   (7352 * 561 = 4124472, so every feature value is finite)
Balanced Data
table(samsDF$activity)
# laying sitting standing walk walkdown walkup
# 1407 1286 1374 1226 986 1073
sum(table(samsDF$activity))
# [1] 7352
round( table(samsDF$activity)/nrow(samsDF), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
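As a cross-check on the proportions above, the same arithmetic can be sketched in Python. The counts are copied from the table output; the variable names are illustrative only.

```python
# Class counts copied from the table(samsDF$activity) output above.
counts = {
    "laying": 1407, "sitting": 1286, "standing": 1374,
    "walk": 1226, "walkdown": 986, "walkup": 1073,
}

total = sum(counts.values())  # total number of observations
# Share of each activity, rounded to two decimals as in the R output.
proportions = {name: round(n / total, 2) for name, n in counts.items()}

print(total)
for name, share in proportions.items():
    print(f"{name:9s} {share:.2f}")
```

No class falls below 13% or rises above 19%, which is why the data is described as balanced.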
Splitting Data
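library(caTools)
# Randomly split the data into training and testing sets,
# keeping the activity proportions the same in both
set.seed(1000)
split = sample.split(samsDF$activity, SplitRatio = 0.7)
train = subset(samsDF, split == TRUE)   # about 70% of the rows
test = subset(samsDF, split == FALSE)   # remaining rows
round( table(train$activity)/nrow(train), 2)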
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
Test Data
round( table(test$activity)/nrow(test), 2)
# laying sitting standing walk walkdown walkup
# 0.19 0.17 0.19 0.17 0.13 0.15
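The idea behind a label-preserving split can be sketched in plain Python. This is a minimal illustration, not a port of caTools::sample.split: the function name, the per-class rounding rule, and the seed handling are assumptions, and Python's random generator will not reproduce the row selection made in R.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratio=0.7, seed=1000):
    """Return a list of booleans (True = training row) that keeps each
    class at roughly `ratio` of its original size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    in_train = [False] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratio * len(indices))  # per-class training size
        for index in indices[:n_train]:
            in_train[index] = True
    return in_train

# Rebuild a label vector from the class counts reported earlier.
counts = {"laying": 1407, "sitting": 1286, "standing": 1374,
          "walk": 1226, "walkdown": 986, "walkup": 1073}
labels = [name for name, n in counts.items() for _ in range(n)]

mask = stratified_split(labels)
n_train = sum(mask)
n_test = len(labels) - n_train
print(n_train, n_test)
```

Because each class is sampled separately, the class shares in the training and test sets match the shares in the full data up to rounding.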
Random Forest
library(randomForest)
set.seed(415)
trainF = train
trainF[562] = NULL  # drop column 562 (subject) so it is not used as a predictor
dim(trainF)
# [1] 5146 562
Determining ntree
ntree = 293
Initial Results:
# fit is the randomForest model trained on trainF; the fitting call is not shown
Prediction <- predict(fit, test[1:561])
library(caret)
confusionMatrix(Prediction , test[,563])
# Accuracy : 0.9782
# 95% CI : (0.9713, 0.9839)
Determining mtry
library(caret)
confusionMatrix(PredictionF , test[,563])
# Accuracy : 0.9805
# 95% CI : (0.9738, 0.9859)
Relative reduction in error = (0.9805 - 0.9782)/(1 - 0.9782) = 0.1055 (about 10.6% of the initial error)
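The error-reduction figure can be checked with a few lines of Python. The function name is illustrative; the two accuracies are the values reported by confusionMatrix above.

```python
def relative_error_reduction(acc_before, acc_after):
    """Fraction of the initial error removed when accuracy improves."""
    error_before = 1.0 - acc_before
    error_after = 1.0 - acc_after
    return (error_before - error_after) / error_before

reduction = relative_error_reduction(0.9782, 0.9805)
print(round(reduction, 4))
```

Expressing the gain relative to the remaining error makes a small change in accuracy (0.9782 to 0.9805) easier to judge when the baseline is already high.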
AUC
library(pROC)
Source: https://www.maxmind.com/en/home
MapReduce: Apache Weblog
Problem
http://bit.ly/oFraud123
Source: https://httpd.apache.org/docs/1.3/logs.html
URI Referer
Challenges
https://regex101.com/
RegEx Groups
https://regex101.com/
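A minimal sketch of how regex groups can pull fields out of an Apache access-log line. The pattern, the group names, and the sample line (written in the style of the Apache log documentation) are assumptions for illustration, not the expression used in this project.

```python
import re

# Named groups for the Combined Log Format: the Common Log Format fields
# followed by the optional quoted Referer and User-Agent strings.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]\s]+) (?P<zone>[+-]\d{4})\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of captured fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

sample = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" '
          '"Mozilla/4.08 [en] (Win98; I ;Nav)"')
fields = parse_line(sample)
print(fields["host"], fields["time"], fields["zone"], fields["referer"])
```

Capturing the timestamp and the UTC offset as separate groups keeps the later epoch conversion simple. Requests that contain escaped quotes would need a more careful pattern.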
EpochTime
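import time

fmt = '%d/%b/%Y:%H:%M:%S'

# Function name is illustrative; the original def line is not shown.
# d: timestamp string, e.g. '10/Oct/2000:13:55:36'; utc: offset string, e.g. '-0700'
def epoch_time(d, utc):
    utci = int(utc)
    # Parse the string with fmt and convert to seconds.
    # Note: time.mktime interprets the parsed time in the machine's local timezone.
    epot = time.mktime(time.strptime(d, fmt))
    epod = (abs(utci) % 100)/60.0 + (abs(utci) // 100)  # offset minutes converted to hours + whole hours
    if utc.isdigit():  # offset string with no sign character
        epf = epot + epod*3600
    else:  # offset string carrying a sign character
        epf = epot - epod*3600
    return int(epf)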
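For comparison, a timezone-aware alternative can be sketched with the standard datetime module. This is not the project's code: it assumes the timestamp and its offset arrive as one string, relies on the %z directive to read the offset, and returns the UTC epoch regardless of the machine's local timezone. The %b month names depend on the active locale.

```python
from datetime import datetime, timezone

APACHE_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # %z parses offsets such as -0700

def to_epoch(timestamp):
    """Convert '10/Oct/2000:13:55:36 -0700' to seconds since the Unix epoch."""
    parsed = datetime.strptime(timestamp, APACHE_TIME_FORMAT)  # timezone-aware
    return int(parsed.timestamp())

# The same instant written with two different offsets maps to one epoch value.
a = to_epoch("10/Oct/2000:13:55:36 -0700")
b = to_epoch("10/Oct/2000:20:55:36 +0000")
utc_reference = int(
    datetime(2000, 10, 10, 20, 55, 36, tzinfo=timezone.utc).timestamp()
)
print(a == b == utc_reference)
```

Normalizing every log line to one epoch scale lets records from servers in different time zones be sorted and compared directly.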
Reducer
http://bit.ly/ApaMapper
Source: https://www.maxmind.com/en/home
Mapper
#!/usr/bin/env python
import sys
print value
http://bit.ly/ApaMapper
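The mapper and reducer roles in a streaming job can be sketched as below. This is an illustration only: the choice of key (the first field of each line, the client host), the function names, and the sample lines are assumptions, and the linked mapper may emit different keys and values. In a real Hadoop Streaming job each step would read sys.stdin and print to standard output.

```python
from itertools import groupby

def map_lines(lines):
    """Mapper step: emit 'key<TAB>1' for every log line.
    The key is the first field (client host), an assumed choice."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"

def reduce_pairs(pairs):
    """Reducer step: sum the counts of consecutive identical keys.
    Hadoop Streaming delivers mapper output to the reducer sorted by key."""
    split_pairs = (pair.split("\t") for pair in pairs)
    for key, group in groupby(split_pairs, key=lambda kv: kv[0]):
        yield key, sum(int(count) for _, count in group)

# Hypothetical sample lines; sorted() stands in for Hadoop's shuffle phase.
sample_log = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a HTTP/1.0" 200 10',
    '10.0.0.2 - - [10/Oct/2000:13:55:37 -0700] "GET /b HTTP/1.0" 200 20',
    '10.0.0.1 - - [10/Oct/2000:13:55:38 -0700] "GET /c HTTP/1.0" 404 0',
]
hits = dict(reduce_pairs(sorted(map_lines(sample_log))))
print(hits)
```

Keeping the mapper stateless and letting the sort group identical keys is what allows the work to be spread across many nodes.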
Hadoop
Sample Output
http://bit.ly/oFraud123
Impact
Background
Problem
Challenges
Computed LTV
Visualization: LTV
Weighted LTV
Visualization: LTV
Impact
Customer LTV > Cost of Customer Acquisition (CAC)
CAC
$10 engagement -> 5 new users -> these users acquire 15 more users at no cost
CAC = $10/(5+15) = $0.50
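The CAC arithmetic can be written as a small Python function. The function and parameter names are illustrative; the numbers are the ones from the example above.

```python
def customer_acquisition_cost(spend, paid_users, referred_users):
    """Spend divided by all users gained, including those referred at no cost."""
    return spend / (paid_users + referred_users)

cac = customer_acquisition_cost(spend=10.0, paid_users=5, referred_users=15)
print(f"CAC = ${cac:.2f}")
```

Counting the referred users in the denominator is what lowers the cost per user from $2.00 (10/5) to $0.50.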
Streaming Data: Speech
https://angel.co/freeaccent
Problem
Challenges
Technology Stack
User Interface
Impact
Measurement tool
Motivational: helps to set goal
Customer retention
Artificial Neural Network (ANN)
ANN
Source: https://appliedgo.net/perceptron/
Source: http://dms1.irb.hr/tutorial/tut_nnets_short.php
Multi-Layer Perceptron
Fully interconnected
Source: http://dms1.irb.hr/tutorial/tut_nnets_short.php
Optimization function
Challenges
Saturation at Initialization
Known solutions:
Small initial weights
Hyperbolic Tangent Function instead of Sigmoidal
Other challenges relating to speech processing
http://bit.ly/my_pubs
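Saturation at initialization can be illustrated numerically. The sketch below assumes a single neuron with one input fixed at 1.0 and compares the activation gradients for a small and a large initial weight. It shows the two known remedies listed above and is not the modified method introduced in this work.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_grad(x):
    """Derivative of tanh; its maximum is 1.0 at x = 0."""
    return 1.0 - math.tanh(x) ** 2

# Pre-activation z = w * x for a single input x = 1 and two weight scales.
for weight in (0.1, 10.0):
    z = weight * 1.0
    print(f"w={weight:5.1f}  sigmoid'={sigmoid_grad(z):.6f}  tanh'={tanh_grad(z):.6f}")
```

With the large weight both gradients are close to zero, so weight updates stall. With the small weight the neuron stays near the steep part of the curve, where the tanh gradient is about four times the sigmoid gradient.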
Saturation at Initialization
http://bit.ly/modANN
Introduced (N)
where (N) 1
http://bit.ly/modANN
Impact
Problem
Piezo-electric Transducer
Receiver
Transmitter
Water-Sludge interface Detection
Collecting Data
Data Visualization
[Figure: received signal annotated with Leakage, Reverberation, and Bottom of the Tank (3.68 ms)]
Round-trip distance = 1.5 × 10³ m/s × 3.68 ms = 5.52 m
Depth of water = 5.52 m / 2 = 2.76 m ≈ 9.06 ft
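The depth calculation can be sketched in Python. The speed of sound and the echo time are the values shown above; the function and constant names are illustrative, and the halving step assumes the pulse travels to the bottom of the tank and back.

```python
SPEED_OF_SOUND_WATER = 1.5e3   # m/s, the value used in the calculation above
METERS_PER_FOOT = 0.3048       # exact definition of the international foot

def depth_from_echo(round_trip_seconds, speed=SPEED_OF_SOUND_WATER):
    """Half of the round-trip path is the one-way depth."""
    round_trip_distance = speed * round_trip_seconds
    return round_trip_distance / 2.0

depth_m = depth_from_echo(3.68e-3)
depth_ft = depth_m / METERS_PER_FOOT
print(f"{depth_m:.2f} m, {depth_ft:.2f} ft")
```

The same function applies to an echo from the water-sludge interface: a shorter echo time gives the depth of the interface rather than the bottom of the tank.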
Impact
Addition of Internet?
IoT
On a computer or a device
Part III
Powerwall by Tesla
Powerwall
Energy Efficiency in Building Systems
Sun Shades
UC Davis West Village is the largest planned “zero net energy” community
Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020
Related Work
http://bit.ly/EnergyEff123
Source: www.greentechmedia.com/articles/read/California-Wants-All-New-Homes-to-be-Net-Zero-in-2020