Distracted Driver Detection Using Deep Learning Methods
2 Introduction
3 Literature Review
4 Dataset Description
5 Proposed Method
5.1 Preprocessing
5.2 Training:
6 Architecture
6.1 Convolutional Neural Network
6.2 Inception-v3 Architecture
6.3 Residual Network(ResNet-50)
6.4 Xception Net
7 Activation Function
8 Loss function
8.1 Mean Squared Error:
8.2 Likelihood Loss:
8.3 Log Loss (Cross Entropy Loss):
9 Regularization
10 What’s Next?
11 Results
12 Conclusion
13 References
1 Abstract
A study by researchers at the Indian Institute of Technology Bombay (IITB) has shown how using
mobile phones while driving can distract drivers and affect their ability to deal with dangerous
situations. The results of the study showed that both calling and texting while driving degrade a
driver's performance. According to a report by the World Health Organization, road accidents are the
ninth leading cause of death. In India, a country where a road accident happens every moment and a
life is lost at regular intervals, an earlier report revealed that 31% of the drivers who used a
mobile phone while driving met with accidents. In the last decade alone, India lost 1.3 million
people to road crashes, and another 5.3 million were permanently disabled. Considering the
seriousness of the issue, we present a novel algorithm to detect distracted driving. We employ
Convolutional Neural Networks (CNNs) to identify and classify distracted driving from images taken
by a camera fixed at the dashboard of a four-wheeler vehicle. We trained the model on the ResNet50
and Inceptionv3 architectures and obtained an accuracy of and respectively. Our model not only
predicts distracted driving but also promotes awareness of safe driving among drivers by giving
them alerts.
2 Introduction
According to the World Health Organization (WHO) report, 1.35 million people worldwide die in traffic
accidents each year. That is nearly 3,700 people dying on the world’s roads every day. According
to the National Crime Records Bureau and the Ministry of Transport and Highway, 1214 road crashes
occur every day in India and 377 people die on average. Road accidents are also the leading cause of
death among people aged between 5 and 29 years. The report also shows that one of the most common
causes of the increasing number of traffic accidents every year is driver distraction. The use
of a mobile phone while driving is widespread among young and novice drivers, adding further to
the already high risk of crash and death among these groups. Drivers’ reaction times have
also been shown to be 50% slower with telephone use than without.
According to the NHTSA, distracted driving can be defined as “any activity that diverts attention
of the driver from the task of driving” and can be classified into manual, visual and cognitive
distractions. Examples of cognitive distractions are daydreaming and being lost in thought. Manual
distractions include talking or texting on mobile phones, eating, drinking, talking to passengers
in the vehicle, etc., and an example of visual distraction is sleepiness.
The situation in India is worse. While India has just 1% of the world’s vehicles, it accounts for
over 10% of global road crash fatalities. In its report, the SaveLIFE Foundation highlights that 47%
of people receive calls on their mobile phone while driving and 60% of them do not stop at a safe
location before answering calls.
Nowadays, an increasing number of modern vehicles have Advanced Driver Assistance Systems (ADAS)
such as stability control, traction control, lane departure warning, adaptive cruise control and
anti-lock brakes. But even the latest autonomous vehicles are not fully autonomous: they require
the driver to stay undistracted and be ready to take control of the steering wheel in an emergency.
Detection of the driver's attention is therefore a very important system to integrate into ADAS
technology. If the vehicle detects the driver’s divided attention and warns the driver against it,
traffic accidents can be reduced by a significant margin.
The focus of this project is detecting driver distraction using deep learning methods. Several
convolutional neural network approaches are applied. We have used different architectures and
different activation and loss functions to maximize the accuracy. We address 10 types of driver
distraction in this project, described in Dataset Description. The project is focused on real-time
detection of distractions. It is currently limited to the software approach; hardware integration
is the next step.
3 Literature Review
In this section we review the relevant and significant works in the literature on detecting
distracted driving. In 2011, Zhang proposed a method using the Hidden Conditional Random Fields
model, based on face, mouth, and hand features, to detect the use of the mobile phone [43], creating
a dataset using a camera. Li explored the eye movement patterns of drivers during rear-end collision
avoidance under driver distraction induced by mobile phones, using the driving simulator of Beijing
Jiaotong University (BJTU), which was projected through a 300-degree front/peripheral field of view
at a resolution of 1400 × 1050 pixels, and a head-mounted eye-tracking system, the SensoMotoric
Instruments Eye Tracking Glasses (SMI ETG), to collect eye movement data. Klaner studied the
relationship between the performance of secondary tasks, including the use of mobile phones, and
the risk of crashes and near-crashes, measured through various sensors such as accelerometers,
cameras and global positioning systems, among others.
Atiquzzaman developed an algorithm to detect driving behaviour in two distracting tasks: texting,
and eating or drinking. Using a simulator, subjects performed a series of tasks while driving in
the simulation; then, 10 features related to vehicle acceleration, steering angle, speed, etc. were
used as input. Three data mining techniques were then applied: linear discriminant analysis (LDA),
logistic regression (LR) and support vector machine (SVM).
José María Celaya-Padilla presented a method to detect drivers distracted by their cellphone. A
convolutional neural network is employed with a camera fitted at the ceiling to capture the images.
The CNN is built on the Inception V3 neural network and trained to detect “texting and driving”
subjects. The final CNN was trained and validated on a dataset of 85,401 images, achieving an area
under the curve (AUC) of 0.891 on the training set, an AUC of 0.86 on a blind test and a sensitivity
of 0.97 on the blind test.
In 2017, Abouelnaga created a new dataset similar to State Farm's dataset for identifying driver
distraction [2]. The authors proposed a solution using a weighted ensemble of five different
Convolutional Neural Networks. The system achieved good classification accuracy, but it is too
complex for real-time detection. Baheti addressed this issue of complexity, reduced the number of
parameters significantly and achieved an accuracy of 95.54%.
4 Dataset Description
The dataset was obtained from the State Farm dataset available on Kaggle. There are 10 classes in
total, described in the figure. The training dataset consists of 15000 images, with 1500 images
in each class. The validation dataset consists of 3000 images, with 300 images in each class.
The dataset includes ten classes:
class 0: Driving safely.
class 1: Texting using right hand.
class 2: Talking on mobile phones with the right hand.
class 3: Texting using left hand.
class 4: Talking on mobile phones with the left hand.
class 5: Adjusting the radio.
class 6: Eating or drinking.
class 7: Turning back.
class 8: Hair or makeup.
class 9: Talking to the passenger.
The split between training and cross-validation data is in the ratio 80:20.
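
An 80:20 split that keeps every class equally represented can be sketched as below. This is only an
illustration under stated assumptions: the helper name `stratified_split`, the random seed and the
toy label array are hypothetical and do not come from the project code.

```python
import numpy as np

def stratified_split(labels, train_fraction=0.8, seed=0):
    """Return (train_idx, val_idx) so that each class is split in the same ratio."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, val_idx = [], []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)   # positions of this class
        rng.shuffle(idx)                      # randomise order within the class
        cut = int(round(train_fraction * idx.size))
        train_idx.extend(idx[:cut])
        val_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(val_idx)

# Toy labels (assumption): 10 classes with 50 images each.
labels = np.repeat(np.arange(10), 50)
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
```

Splitting class by class avoids a validation set that happens to under-represent one of the ten
behaviours.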
5 Proposed Method
We propose a novel method for identifying distracted driving using convolutional neural networks.
The method is shown in the figure above. The implementation steps are given below:
1. The camera mounted on the dashboard gives us live images of the driver.
3. The trained model predicts the expected class (whether the driver is distracted or not) using
trained weights.
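
The prediction step can be sketched as follows, assuming the trained model returns one probability
per class for each frame. The function `interpret_prediction`, the constant `SAFE_CLASS` and the
example probabilities are hypothetical; the class names follow the list in Dataset Description, and
the alert rule (any class other than safe driving) is an assumption made for illustration.

```python
import numpy as np

CLASS_NAMES = [
    "Driving safely",
    "Texting using right hand",
    "Talking on mobile phones with the right hand",
    "Texting using left hand",
    "Talking on mobile phones with the left hand",
    "Adjusting the radio",
    "Eating or drinking",
    "Turning back",
    "Hair or makeup",
    "Talking to the passenger",
]
SAFE_CLASS = 0  # class 0 is the only non-distracted class

def interpret_prediction(probabilities):
    """Map a 10-element probability vector to (class id, label, alert flag)."""
    probabilities = np.asarray(probabilities, dtype=float)
    if probabilities.shape != (len(CLASS_NAMES),):
        raise ValueError("expected one probability per class")
    class_id = int(np.argmax(probabilities))       # most likely class
    return class_id, CLASS_NAMES[class_id], class_id != SAFE_CLASS

# Made-up model output for a single frame.
example = np.array([0.05, 0.70, 0.05, 0.02, 0.03, 0.05, 0.04, 0.02, 0.02, 0.02])
class_id, label, alert = interpret_prediction(example)
print(class_id, label, alert)
```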
5.1 Preprocessing
1. The dataset originally consisted of images of size 480 × 640. The images are resized to
224 × 224 because training on large images is computationally very expensive.
2. The images are converted to grayscale, so that each pixel is a single value ranging from 0 to
255. This decreases computational complexity.
3. Images to Array: The images are then converted into numpy arrays to pass them on to the
model. We took 1500 images from each class into the train array and 300 images from each
class into the test array. This is a supervised learning algorithm: in supervised learning we give
the network the inputs and the expected results. The model generates an error by calculating the
difference between the predicted value and the expected value. The change in weights is
proportional to the error produced.
4. Shuffling the array: We shuffle the array so that the network is not trained on clusters of
the same kind of data, which makes the training process and the weight updates more uniform.
5. Vectorization: The train and test arrays are vectorized so that they can be passed on to the
network.
7. One hot encoding: The y_test and y_train vectors are then encoded. For example, if y_train
is 4 then it is encoded as 0000100000.
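
The preprocessing steps above can be sketched with NumPy alone. This is a minimal illustration, not
the project's code: the helper names, the luminance weights used for grayscale conversion, the
nearest-neighbour resizing and the synthetic images are all assumptions.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an (H, W, 3) uint8 image to (H, W) using luminance weights."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb.astype(np.float64) @ weights

def resize_nearest(img, out_h=224, out_w=224):
    """Resize by picking the nearest source pixel for each output pixel."""
    in_h, in_w = img.shape[:2]
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows[:, None], cols]

def shuffle_together(x, y, seed=0):
    """Apply the same random permutation to images and labels."""
    order = np.random.default_rng(seed).permutation(len(x))
    return x[order], y[order]

def one_hot(labels, num_classes=10):
    """Encode integer labels as rows with a single 1, e.g. 4 -> 0000100000."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Synthetic stand-ins for four 480 x 640 colour frames and their class labels.
rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(4, 480, 640, 3), dtype=np.uint8)
raw_labels = np.array([4, 0, 9, 4])

x = np.stack([resize_nearest(to_grayscale(img)) for img in raw])
x = x[..., None]                       # add a channel axis: (N, 224, 224, 1)
x, y_idx = shuffle_together(x, raw_labels)
y = one_hot(y_idx)
print(x.shape, y.shape)
```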
5.2 Training:
The training vector and the cross-validation vector are then passed to the network. We used three
architectures to train the model: ResNet50, Inceptionv3 and Xception Net. The Inception CNN was
chosen because such a CNN can be exported to low-cost hardware such as the Raspberry Pi and may be
deployed on Android platforms; moreover, the CNN design acts as multiple convolution filters
applied to the same input, whose results are then concatenated and passed forward. Inception V3 is
based on a pattern recognition network and is designed to use minimal amounts of image
pre-processing. Each of the proposed CNN layers reinforces key features; the first layer detects
edges, the second tackles the overall design, and so on.
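
The idea of applying several convolution filters to the same input and concatenating the results
can be sketched as follows. This is a deliberately simplified illustration: the function names are
hypothetical, the filters are random, and a real Inception-v3 module additionally contains pooling
branches and 1 × 1 reductions that are omitted here.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Cross-correlate a 2-D image with an odd-sized square kernel.

    Zero padding keeps the output the same size as the input.
    """
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)
    out = np.zeros(image.shape, dtype=np.float64)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

def inception_like_block(image, kernels):
    """Run every kernel on the same input and stack the results as channels."""
    branches = [conv2d_same(image, kernel) for kernel in kernels]
    return np.stack(branches, axis=-1)

rng = np.random.default_rng(0)
image = rng.random((8, 8))
kernels = [rng.standard_normal((size, size)) for size in (1, 3, 5)]
features = inception_like_block(image, kernels)
print(features.shape)  # one output channel per kernel size
```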
6 Architecture
6.1 Convolutional Neural Network
A Convolutional Neural Network (ConvNet/CNN) is a deep learning algorithm which can take in an
input image, assign importance (learnable weights and biases) to various aspects/objects in the
image and differentiate one from the other. The pre-processing required by a ConvNet is much lower
than for other classification algorithms. While in primitive methods filters are hand-engineered,
with enough training ConvNets can learn these filters/characteristics. The main idea behind a
Convolutional Neural Network is to learn the feature mapping of an image and exploit it to build
more nuanced feature mappings.
The architecture of a ConvNet is analogous to the connectivity pattern of neurons in the human
brain and was inspired by the organization of the visual cortex. Individual neurons respond to
stimuli only in a restricted region of the visual field known as the receptive field. A collection
of such fields overlaps to cover the whole visual area.
Why ConvNets over Feed-Forward Neural Nets?
An image is nothing but a matrix of pixel values, so why not simply flatten the image (for example
a 4x4 image matrix into a 16x1 vector) and feed it to a multilayer perceptron for classification?
For very basic binary images, this technique may show an average precision score when predicting
classes, but it would have almost no accuracy on complex images with pixel dependencies throughout.
A ConvNet can successfully capture the spatial and temporal dependencies in an image through the
application of relevant filters. The architecture fits the image dataset better because of the
reduction in the number of parameters involved and the reusability of weights. In other words, the
network can be trained to understand the complexity of the image better.
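
The reduction in the number of parameters can be made concrete with a small count. The figures
below assume a 224 × 224 grayscale input, a layer width of 64 and 3 × 3 kernels; these sizes and
the helper names are illustrative assumptions, not the configuration used in this project.

```python
def dense_params(input_size, units):
    """Weights plus biases of a fully connected layer on a flattened input."""
    return input_size * units + units

def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus biases of a convolution layer; each filter is reused at every position."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

flattened_size = 224 * 224 * 1                      # a 224 x 224 grayscale image as a vector
dense_count = dense_params(flattened_size, units=64)
conv_count = conv_params(3, 3, in_channels=1, filters=64)
print(dense_count, conv_count)
```

Because the same small kernel is slid over the whole image, the convolution layer's parameter count
does not depend on the image size at all.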
6.3 Residual Network(ResNet-50)
ResNet, short for Residual Network, is a classic neural network used as a backbone for many
computer vision tasks. This model was the winner of the ImageNet challenge in 2015. The major
breakthrough with ResNet was that it enabled us to train very deep neural networks with 150+ layers
successfully. Before ResNet, training very deep neural networks was difficult because of the
problem of vanishing gradients.
AlexNet, the winner of ImageNet 2012 and the model that kick-started the focus on deep learning,
had just 8 layers, the VGG network had 19, Inception (GoogLeNet) had 22 layers and ResNet-152 had
152 layers.
However, increasing network depth does not work by just stacking layers together. Deep networks
are difficult to train because of the well-known vanishing gradient problem: as the gradient is
back-propagated to earlier layers, repeated multiplication may make the gradient extremely small.
As a result, as the network goes deeper, its performance gets saturated or even starts degrading
rapidly. The ResNet-50 model consists of 5 stages, each with a convolution block and an identity
block. Every convolution block has 3 convolution layers and every identity block also has 3
convolution layers. ResNet-50 has more than 23 million trainable parameters.
Skip Connection
ResNet introduced the concept of the skip connection. The diagram below illustrates it. The figure
on the left stacks convolution layers one after the other. On the right we stack convolution layers
as before, but we now also add the original input to the output of the convolution block. This is
called a skip connection.
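
A skip connection can be sketched in a few lines. For brevity this illustration uses small dense
matrices in place of convolution layers; the function names and the toy weights are assumptions.
With all weights set to zero, the plain block outputs zeros while the residual block passes its
(non-negative) input through unchanged, which is why an identity mapping is easy for it to learn.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def plain_block(x, w1, w2):
    """Two stacked layers: the output depends only on the learned weights."""
    return relu(relu(x @ w1) @ w2)

def residual_block(x, w1, w2):
    """Same two layers, but the original input is added back before the final ReLU."""
    return relu(relu(x @ w1) @ w2 + x)

x = np.array([[1.0, 2.0, 3.0]])
zero_weights = np.zeros((3, 3))

plain_out = plain_block(x, zero_weights, zero_weights)
residual_out = residual_block(x, zero_weights, zero_weights)
print(plain_out, residual_out)
```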
7 Activation Function
While building a neural network, it is important to choose which activation function to use in
the hidden layers as well as at the output layer of the network. So what is an activation function,
and why use one? An activation function decides whether a neuron should be activated or not, that
is, whether the information the neuron is receiving is relevant or should be ignored. It does so by
computing the weighted sum of the inputs, adding a bias, and applying the activation to the result:
$$Y = \mathrm{Activation}\Big(\sum (\mathrm{Weights} \times \mathrm{Input}) + \mathrm{Bias}\Big)$$
The purpose of the activation function is to introduce non-linearity into the output of a neuron.
Why do we need non-linear activation functions? A neural network without an activation function is
essentially just a linear regression model. The activation function applies a non-linear
transformation to the input, making the network capable of learning and performing more complex
tasks. In this project we have used the Rectified Linear Unit (ReLU) activation function.
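
The neuron formula and the role of the non-linearity can be illustrated with a short sketch. The
input values, weights and matrix shapes are made up for the example; the second part shows that two
layers without an activation collapse into a single linear map.

```python
import numpy as np

def relu(z):
    """Rectified Linear Unit: passes positive values, zeroes out negative ones."""
    return np.maximum(0.0, z)

def neuron(inputs, weights, bias, activation=relu):
    """Y = Activation(sum(weights * inputs) + bias)."""
    return activation(np.dot(weights, inputs) + bias)

inputs = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -1.0, 0.25])
y_pos = neuron(inputs, weights, bias=1.0)    # weighted sum + bias is positive
y_neg = neuron(inputs, weights, bias=-1.0)   # weighted sum + bias is negative, so ReLU gives 0

# Without an activation, stacking two layers is still a single linear map.
rng = np.random.default_rng(0)
w1 = rng.standard_normal((4, 3))
w2 = rng.standard_normal((2, 4))
stacked = w2 @ (w1 @ inputs)
collapsed = (w2 @ w1) @ inputs
print(y_pos, y_neg, np.allclose(stacked, collapsed))
```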
8 Loss function
A loss function is a method of evaluating how well your algorithm models your dataset. If your
predictions are totally off, your loss function will output a higher number. If they are pretty
good, it will output a lower number. As you change pieces of your algorithm to try to improve your
model, your loss function will tell you whether you are getting anywhere.
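
As an illustration of "higher when the predictions are off, lower when they are good", the sketch
below computes the categorical cross-entropy (log loss) for one-hot targets. The function name, the
clipping constant and the two example predictions are assumptions made for this example.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean categorical cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)            # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_pred), axis=1)))

# One sample whose true class is 4.
y_true = np.zeros((1, 10))
y_true[0, 4] = 1.0

good = np.full((1, 10), 0.01)
good[0, 4] = 0.91                                 # confident and correct
bad = np.full((1, 10), 0.01)
bad[0, 0] = 0.91                                  # confident and wrong

good_loss = cross_entropy(y_true, good)
bad_loss = cross_entropy(y_true, bad)
print(good_loss, bad_loss)
```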
9 Regularization
Any modification we make to a learning algorithm that is intended to reduce its generalization
error but not its training error is called regularization. Keeping the model simple enough by using
regularization techniques allows the network to generalize well to data points it has not seen
before.
1. Weight Penalty L1 and L2: Weight penalty is a standard way of regularization, widely used
in training other model types as well. It relies strongly on the implicit assumption that a model
with small weights is somehow simpler than a model with large weights. The penalties try to keep
the weights small or non-existent (zero) unless there are big gradients to counteract them, which
also makes models more interpretable. The L2 norm penalizes the squared value of the weight (which
also explains the “2” in the name) and tends to drive all the weights to smaller values. The L1
norm penalizes the absolute value of the weight (a V-shaped function) and tends to drive some
weights to exactly zero (introducing sparsity in the model), while allowing some weights to be big.
2. Dropout regularization: Dropout involves going over all the layers in a neural network and
setting the probability of keeping each node. Whether a given node is kept is decided at random;
you only choose the threshold, a value that determines the probability that a node is kept. For
example, if you set the threshold to 0.7, then there is a probability of 30% that a node will be
removed from the network. This results in a much smaller and simpler neural network, as shown
below.
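
Both regularizers can be sketched in NumPy. The function names, the penalty strength `lam` and the
toy arrays are assumptions. The dropout function rescales the surviving activations by
1 / keep_prob ("inverted dropout"), a common convention that is assumed here and not stated in the
text above.

```python
import numpy as np

def l1_penalty(weights, lam):
    """lam times the sum of absolute weights; encourages exact zeros."""
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    """lam times the sum of squared weights; shrinks all weights."""
    return lam * np.sum(weights ** 2)

def dropout(activations, keep_prob, rng):
    """Keep each node with probability keep_prob and rescale the survivors."""
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob, mask

weights = np.array([0.5, -2.0, 0.0])
l1 = l1_penalty(weights, lam=0.1)
l2 = l2_penalty(weights, lam=0.1)

rng = np.random.default_rng(0)
activations = np.ones(10_000)
dropped, mask = dropout(activations, keep_prob=0.7, rng=rng)
print(l1, l2, mask.mean())   # roughly 70% of the nodes are kept
```

The penalty value is added to the training loss, while dropout is applied only during training and
switched off at prediction time.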
10 What’s Next?
This project focuses only on the software part of the distracted driver detection system. The
system has the potential to be implemented in real cars, which could significantly reduce the
traffic accidents caused by driving distractions. The next step is therefore to integrate the
software with an electronic hardware module that is compatible with the ADAS system of automated
cars. The integrated system will be connected to the car’s main microprocessor. Whenever driver
distraction is detected by the uploaded trained model, it will send a signal to the microprocessor,
and the processor will send a warning signal to the driver. Currently, the similarities in the
postures of different behaviours result in incorrect classifications. The behaviours with the most
misclassifications are ‘drinking’, ‘hair and makeup’ and ‘texting on the phone – left’. In the
future, we need to solve this misclassification issue.
11 Results
Modifications to a model should also take the dataset into account. In histopathology, the
available dataset is usually limited and meagre, so data augmentation plays a big role in this kind
of situation: new data are generated from the existing data. Another important problem in
histopathology is color consistency, since variations in the level of illumination must be
compensated for. These problems are reduced by the color normalization method.
12 Conclusion
In this paper, a system for detecting distracted drivers based on convolutional neural networks
was designed. The weights were initialized randomly. Training was done with two kinds of network:
the first was Inception Net and the second was the Residual Network (ResNet50). The training
accuracy attained by these two networks is shown in the table.
In an image classification task, the size of the salient feature can vary considerably within the
image frame, so deciding on a fixed kernel size is rather difficult. Larger kernels are preferred
for more global features that are distributed over a large area of the image, while smaller kernels
give good results in detecting area-specific features that are distributed across the image frame.
For effective recognition of such variable-sized features, we need kernels of different sizes. That
is what Inception provides.
Neural networks are not good at finding a simpler mapping when one exists. When the output is
simply equal to the input, the weights should be 1 and the biases should be 0, but backpropagation
in neural networks makes this mapping complex instead of simply setting the weights to 1. This
happens because of the vanishing gradient problem: as we go deeper in a CNN, the derivative
back-propagated to the initial layers becomes almost insignificant in value. ResNet addresses this
issue by introducing “shortcut connections”.
13 References
1. Ronneberger O., Fischer P., Brox T. (2015) U-Net: Convolutional Networks for Biomedical
Image Segmentation. In: Navab N., Hornegger J., Wells W., Frangi A. (eds) Medical Image
Computing and Computer-Assisted Intervention – MICCAI 2015. Lecture Notes in Computer
Science, vol 9351. Springer, Cham
2. Wang Z., Ji S. (2018) Smoothed Dilated Convolutions for Improved Dense Prediction. In:
Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data
Mining, pp. 2486-2495