Industrial Training Report
Bachelor of Technology
in
Department of Electronics & Communication Engineering
CERTIFICATE
Acknowledgement
There are always some key personalities whose roles are vital to the successful completion of any work. It would not have been possible to complete this work without the kind support and help of many individuals and organizations. I would like to extend my sincere thanks to all of them.
I am highly indebted to my mentors Mrs. Pooja Choudhary, Assistant Professor, and Ms. Gloria Joseph, Assistant Professor, for their guidance and constant supervision, for providing the necessary information regarding the industrial training seminar, and for their support in completing the Industrial Training. I would also like to thank Mr. Lalit Kumar Lata, Faculty, Department of Electronics and Communication, SKIT M & G, Jaipur, for his kind support and guidance in completing my Industrial Training successfully. He helped me throughout the training, and his excellent guidance has been instrumental in making it a success.
I would like to thank Prof. (Dr.) Mukesh Arora, Professor & Head, Department of Electronics and Communication, SKIT M & G, Jaipur, for providing me the opportunity to undertake this training, along with the consistent direction, adequate means, and support to pursue it.
Finally, earnest and sincere thanks to all the Faculty members of Electronics and Communication
Department, SKIT M & G, Jaipur for their direct and indirect support in the completion of this
industrial training.
Last but not least, I sincerely express my deepest gratitude to my family for their wholehearted support and encouragement to take up this course. In addition, a very special thanks to my colleagues and friends for their support.
Saransh Sharma
17ESKEC064
TABLE OF CONTENTS
Certificate ii
Acknowledgement iv
Table of Contents v
List of Figures vii
PART A/ Course I
Chapter 1: Introduction………………….………………………………………………………2
1.1 Introduction……………………………………………………………………………………2
1.2 Why Python?………………………………………………………………………………….2
1.3 Characteristics of Python……………………………………………………………………..2
1.4 Local Environment Setup……………………………………………………………………..3
Chapter 2: Strings………………………………………………………………………………..4
2.1 A String is a Sequence…….……………………………………………………………………4
2.2 Getting the length of a string………………………………………………………………….4
2.3 Traversal through a string with a loop…………………………………………………………
LIST OF FIGURES
Part A
Chapter 1
INTRODUCTION TO PYTHON
1.1 Introduction
Python is available on a wide variety of platforms, including Linux and Mac OS X. Let's understand how to set up our Python environment on Windows.
Open a terminal window and type "python" to find out if it is already installed and, if so, which version is installed.
At the command prompt, type path %path%;C:\Python and press Enter.
Note: C:\Python is the path of the Python directory.
Chapter 2
STRINGS
The second statement extracts the character at index position 1 from the fruit variable and
assigns it to the letter variable. The expression in brackets is called an index. The index indicates
which character in the sequence you want (hence the name).
2.1.1 String Indexes
You can use any expression, including variables and operators, as an index, but the value of the index has to be an integer. Otherwise you get a TypeError.
To get the last letter of a string, you might be tempted to try something like this:
The reason for the IndexError is that there is no letter in “banana” with the index 6. Since we
started counting at zero, the six letters are numbered 0 to 5.
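The statements described above can be sketched as follows (a short illustrative example):

```python
fruit = 'banana'
letter = fruit[1]   # the expression in brackets is the index
print(letter)       # prints 'a'; counting starts at zero
# fruit[6] would raise an IndexError: the six letters are numbered 0 to 5
```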
>>> last = fruit[len(fruit)-1]
>>> print(last)
a
Alternatively, you can use negative indices, which count backward from the end of the string.
The expression fruit[-1] yields the last letter, fruit[-2] yields the second to last, and so on.
index = 0
while index < len(fruit):
    letter = fruit[index]
    print(letter)
    index = index + 1
This loop traverses the string and displays each letter on a line by itself. The loop condition
is index < len(fruit), so when index is equal to the length of the string, the condition is false, and
the body of the loop is not executed. The last character accessed is the one with the
index len(fruit)-1, which is the last character in the string. Another way to write a traversal is
with a for loop:
Each time through the loop, the next character in the string is assigned to the variable char. The
loop continues until no characters are left.
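For example, the for-loop traversal looks like this:

```python
fruit = 'banana'
for char in fruit:
    print(char)   # one letter per line, same output as the while loop
```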
The slice operator [n:m] returns the part of the string from the n-th character to the m-th character, including the first but excluding the last.
If you omit the first index (before the colon), the slice starts at the beginning of the string. If you
omit the second index, the slice goes to the end of the string:
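A brief sketch of the slice operator (using an illustrative string):

```python
s = 'Monty Python'
print(s[0:5])   # Monty
print(s[6:12])  # Python
print(s[:3])    # Mon        -> first index omitted, slice starts at the beginning
print(s[3:])    # ty Python  -> second index omitted, slice goes to the end
```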
The reason for the error is that strings are immutable, which means you can’t change an existing
string.
The best you can do is create a new string that is a variation on the original:
This example concatenates a new first letter onto a slice of greeting. It has no effect on the
original string.
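A sketch of the example described (greeting as the assumed variable):

```python
greeting = 'Hello, world!'
# greeting[0] = 'J' would raise a TypeError, because strings are immutable
new_greeting = 'J' + greeting[1:]   # new first letter + a slice of greeting
print(new_greeting)                 # Jello, world!
print(greeting)                     # Hello, world! -> the original is unchanged
```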
The following loop traverses a string and counts the number of times the letter 'a' appears:

word = 'banana'
count = 0
for letter in word:
    if letter == 'a':
        count = count + 1
print(count)
2.7 The in operator
The word in is a boolean operator that takes two strings and returns True if the first appears as a
substring in the second:
if word == 'banana':
    print('All right, bananas.')
Other comparison operations are useful for putting words in alphabetical order:
Python has a set of built-in methods that you can use on strings.
Note: All string methods return new values. They do not change the original string.
A method is like a function, but it runs "on" an object. If the variable s is a string, then the code s.lower() runs the lower() method on that string object and returns the result (this idea of a method running on an object is one of the basic ideas that make up Object-Oriented Programming, OOP). Here are some of the most common string methods:
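A few of these methods in action (an illustrative sketch):

```python
s = 'Hello World'
print(s.lower())                     # hello world
print(s.upper())                     # HELLO WORLD
print(s.replace('World', 'Python'))  # Hello Python
print(s.find('World'))               # 6 -> index where the substring starts
print('  spam  '.strip())            # spam -> surrounding whitespace removed
print(s)                             # Hello World -> the original is unchanged
```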
Chapter 3
LISTS
The first example is a list of four integers. The second is a list of three strings. The elements of a
list don’t have to be the same type. The following list contains a string, a float, an integer, and
(lo!) another list:
A list within another list is nested. A list that contains no elements is called an empty list; you
can create one with empty brackets, []. As you might expect, you can assign list values to
variables:
>>> numbers = [17, 123]
>>> numbers[1] = 5
>>> print(numbers)
[17, 5]
If an index has a negative value, it counts backward from the end of the list.
If you omit the first index, the slice starts at the beginning. If you omit the second, the slice goes
to the end. So if you omit both, the slice is a copy of the whole list.
>>> t[:]
['a', 'b', 'c', 'd', 'e', 'f']
Since lists are mutable, it is often useful to make a copy before performing operations that fold,
spindle, or mutilate lists.
A slice operator on the left side of an assignment can update multiple elements:
Most list methods are void; they modify the list and return None. If you accidentally write t =
t.sort(), you will be disappointed with the result.
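A sketch of this pitfall:

```python
t = ['d', 'c', 'e', 'b', 'a']
t.sort()      # modifies t in place and returns None
print(t)      # ['a', 'b', 'c', 'd', 'e']
x = t.sort()  # the mistake described above: x is now None, not a sorted list
print(x)      # None
```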
3.5 Deleting elements
There are several ways to delete elements from a list. If you know the index of the element you
want, you can use pop:
pop modifies the list and returns the element that was removed. If you don’t provide an index, it
deletes and returns the last element.
If you don’t need the removed value, you can use the del operator:
If you know the element you want to remove (but not the index), you can use remove:
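The three deletion idioms can be sketched side by side:

```python
t = ['a', 'b', 'c']
x = t.pop(1)     # removes by index and returns the element
print(t, x)      # ['a', 'c'] b

t = ['a', 'b', 'c']
del t[1]         # removes by index, returns nothing
print(t)         # ['a', 'c']

t = ['a', 'b', 'c']
t.remove('b')    # removes by value
print(t)         # ['a', 'c']
```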
3.6 Lists and functions
There are a number of built-in functions that can be used on lists that allow you to quickly look
through a list without writing your own loops:
The sum() function only works when the list elements are numbers. The other functions (max(), len(), etc.) work with lists of strings and other types that are comparable.
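For example (an illustrative list of numbers):

```python
nums = [3, 41, 12, 9, 74, 15]
print(len(nums))  # 6
print(max(nums))  # 74
print(min(nums))  # 3
print(sum(nums))  # 154
print(sum(nums) / len(nums))  # the average
```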
We could rewrite an earlier program that computed the average of a list of numbers entered by the user so that it uses a list.
First, the program to compute an average without a list:
total = 0
count = 0
while True:
    inp = input('Enter a number: ')
    if inp == 'done': break
    value = float(inp)
    total = total + value
    count = count + 1
average = total / count
print('Average:', average)
In this program, we have count and total variables to keep the number and running total of the
user’s numbers as we repeatedly prompt the user for a number.
We could simply remember each number as the user entered it and use built-in functions to
compute the sum and count at the end.
numlist = list()
while True:
    inp = input('Enter a number: ')
    if inp == 'done': break
    value = float(inp)
    numlist.append(value)
average = sum(numlist) / len(numlist)
print('Average:', average)
We make an empty list before the loop starts, and then each time we have a number, we append
it to the list. At the end of the program, we simply compute the sum of the numbers in the list
and divide it by the count of the numbers in the list to come up with the average.
>>> s = 'spam'
>>> t = list(s)
>>> print(t)
['s', 'p', 'a', 'm']
Because list is the name of a built-in function, you should avoid using it as a variable name. I
also avoid the letter “l” because it looks too much like the number “1”. So that’s why I use “t”.
The list function breaks a string into individual letters. If you want to break a string into words,
you can use the split method:
>>> s = 'pining for the fjords'
>>> t = s.split()
>>> print(t)
['pining', 'for', 'the', 'fjords']
>>> print(t[2])
the
Once you have used split to break the string into a list of words, you can use the index operator
(square bracket) to look at a particular word in the list.
You can call split with an optional argument called a delimiter that specifies which characters to
use as word boundaries. The following example uses a hyphen as a delimiter:
>>> s = 'spam-spam-spam'
>>> delimiter = '-'
>>> s.split(delimiter)
['spam', 'spam', 'spam']
join is the inverse of split. It takes a list of strings and concatenates the elements. join is a string
method, so you have to invoke it on the delimiter and pass the list as a parameter:
In this case the delimiter is a space character, so join puts a space between words. To concatenate
strings without spaces, you can use the empty string, "", as a delimiter.
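For example:

```python
t = ['pining', 'for', 'the', 'fjords']
delimiter = ' '
s = delimiter.join(t)
print(s)           # pining for the fjords
print(''.join(t))  # piningforthefjords -> empty-string delimiter, no spaces
```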
3.9 List Comprehensions
List comprehensions provide a concise way to create lists. Common applications are to make
new lists where each element is the result of some operations applied to each member of another
sequence or iterable, or to create a subsequence of those elements that satisfy a certain condition.
A list comprehension consists of brackets containing an expression followed by a for clause, then
zero or more for or if clauses. The result will be a new list resulting from evaluating the
expression in the context of the for and if clauses which follow it. For example, this listcomp
combines the elements of two lists if they are not equal:
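A sketch of such a listcomp:

```python
combs = [(x, y) for x in [1, 2, 3] for y in [3, 1, 4] if x != y]
print(combs)  # [(1, 3), (1, 4), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)]
```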
Chapter 4
DICTIONARIES
A dictionary is like a list, but more general. In a list, the index positions have to be integers; in a
dictionary, the indices can be (almost) any type.
You can think of a dictionary as a mapping between a set of indices (which are called keys) and a set of values. Each key maps to a value. The association of a key and a value is called a key-value pair or sometimes an item.
As an example, we’ll build a dictionary that maps from English to Spanish words, so the keys
and the values are all strings.
The function dict creates a new dictionary with no items. Because dict is the name of a built-in
function, you should avoid using it as a variable name.
The curly brackets, {}, represent an empty dictionary. To add items to the dictionary, you can use square brackets:

>>> eng2sp = dict()
>>> eng2sp['one'] = 'uno'

This line creates an item that maps from the key 'one' to the value 'uno'. If we print the dictionary again, we see a key-value pair with a colon between the key and value:
>>> print(eng2sp)
{'one': 'uno'}
This output format is also an input format. For example, you can create a new dictionary with
three items. But if you print eng2sp, you might be surprised:
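For example (note that on Python 3.7 and later, dictionaries preserve insertion order, so the reordering described next is mainly seen on older versions):

```python
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
print(eng2sp)
```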
The order of the key-value pairs may not be the same. In fact, if you type the same example on your computer, you might get a different result: before Python 3.7, the order of items in a dictionary was unpredictable (modern versions preserve insertion order).
But that’s not a problem because the elements of a dictionary are never indexed with integer
indices. Instead, you use the keys to look up the corresponding values:
>>> print(eng2sp['two'])
'dos'
The key 'two' always maps to the value “dos” so the order of the items doesn’t matter.
If the key isn’t in the dictionary, you get an exception:
>>> print(eng2sp['four'])
KeyError: 'four'
>>> len(eng2sp)
3
The in operator works on dictionaries; it tells you whether something appears as a key in the
dictionary (appearing as a value is not good enough).
To see whether something appears as a value in a dictionary, you can use the method values,
which returns the values as a type that can be converted to a list, and then use the in operator:
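For example:

```python
eng2sp = {'one': 'uno', 'two': 'dos', 'three': 'tres'}
print('one' in eng2sp)                 # True  -> 'in' checks the keys
print('uno' in eng2sp)                 # False -> values are not checked
print('uno' in list(eng2sp.values()))  # True
```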
The in operator uses different algorithms for lists and dictionaries. For lists, it uses a linear
search algorithm. As the list gets longer, the search time gets longer in direct proportion to the
length of the list. For dictionaries, Python uses an algorithm called a hash table that has a
remarkable property: the in operator takes about the same amount of time no matter how many
items there are in a dictionary.
4.2 Dictionary as a set of counters
Suppose you are given a string and you want to count how many times each letter appears. There
are several ways you could do it:
1. You could create 26 variables, one for each letter of the alphabet. Then you could
traverse the string and, for each character, increment the corresponding counter, probably
using a chained conditional.
2. You could create a list with 26 elements. Then you could convert each character to a
number (using the built-in function ord), use the number as an index into the list, and
increment the appropriate counter.
3. You could create a dictionary with characters as keys and counters as the corresponding
values. The first time you see a character, you would add an item to the dictionary. After
that you would increment the value of an existing item.
Each of these options performs the same computation, but each of them implements that
computation in a different way.
An implementation is a way of performing a computation; some implementations are better than
others. For example, an advantage of the dictionary implementation is that we don’t have to
know ahead of time which letters appear in the string and we only have to make room for the
letters that do appear.
Here is what the code might look like:
word = 'brontosaurus'
d = dict()
for c in word:
    if c not in d:
        d[c] = 1
    else:
        d[c] = d[c] + 1
print(d)
We are effectively computing a histogram, which is a statistical term for a set of counters (or
frequencies).
The for loop traverses the string. Each time through the loop, if the character c is not in the
dictionary, we create a new item with key c and the initial value 1 (since we have seen this letter
once). If c is already in the dictionary we increment d[c].
Here’s the output of the program:

{'b': 1, 'r': 2, 'o': 2, 'n': 1, 't': 1, 's': 2, 'a': 1, 'u': 2}
The histogram indicates that the letters “a” and “b” appear once; “o” appears twice, and so on.
Dictionaries have a method called get that takes a key and a default value. If the key appears in
the dictionary, get returns the corresponding value; otherwise it returns the default value. For
example:
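For example (using an illustrative counts dictionary):

```python
counts = {'chuck': 1, 'annie': 42, 'jan': 100}
print(counts.get('jan', 0))  # 100 -> key present, its value is returned
print(counts.get('tim', 0))  # 0   -> key absent, the default is returned
```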
We can use get to write our histogram loop more concisely. Because the get method
automatically handles the case where a key is not in a dictionary, we can reduce four lines down
to one and eliminate the if statement.
word = 'brontosaurus'
d = dict()
for c in word:
    d[c] = d.get(c, 0) + 1
print(d)
The use of the get method to simplify this counting loop ends up being a very commonly used
“idiom” in Python and we will use it many times in the rest of the book. So you should take a
moment and compare the loop using the if statement and in operator with the loop using
the get method. They do exactly the same thing, but one is more succinct.
We will write a Python program to read through the lines of the file, break each line into a list of
words, and then loop through each of the words in the line and count each word using a
dictionary.
You will see that we have two for loops. The outer loop is reading the lines of the file and the
inner loop is iterating through each of the words on that particular line. This is an example of a
pattern called nested loops because one of the loops is the outer loop and the other loop is
the inner loop.
Because the inner loop executes all of its iterations each time the outer loop makes a single
iteration, we think of the inner loop as iterating “more quickly” and the outer loop as iterating
more slowly.
The combination of the two nested loops ensures that we will count every word on every line of
the input file.
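The nested-loop counting described above can be sketched as follows (count_words is an illustrative helper name, not from the original program):

```python
def count_words(lines):
    """Count every word on every line using two nested loops."""
    counts = dict()
    for line in lines:            # outer loop: one line of the file at a time
        words = line.split()      # break the line into a list of words
        for word in words:        # inner loop: each word on this line
            counts[word] = counts.get(word, 0) + 1
    return counts

# A file object is itself an iterable of lines, so the full program is roughly:
#   fname = input('Enter the file name: ')
#   with open(fname) as fhand:
#       print(count_words(fhand))
```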
python count1.py
Enter the file name: romeo.txt
{'and': 3, 'envious': 1, 'already': 1, 'fair': 1,
'is': 3, 'through': 1, 'pale': 1, 'yonder': 1,
'what': 1, 'sun': 2, 'Who': 1, 'But': 1, 'moon': 1,
'window': 1, 'sick': 1, 'east': 1, 'breaks': 1,
'grief': 1, 'with': 1, 'light': 1, 'It': 1, 'Arise': 1,
'kill': 1, 'the': 3, 'soft': 1, 'Juliet': 1}
It is a bit inconvenient to look through the dictionary to find the most common words and their counts, so we need to add some more Python code to produce more helpful output.
jan 100
chuck 1
annie 42
The for loop iterates through the keys of the dictionary, so we must use the index operator to
retrieve the corresponding value for each key. Here’s what the output looks like:
jan 100
annie 42
First you see the list of keys in unsorted order that we get from the keys method. Then we see the
key-value pairs in order from the for loop.
Chapter 5
TUPLES AND SETS
5.1 Tuples
To create a tuple with a single element, you have to include the final comma:
>>> t1 = ('a',)
>>> type(t1)
<class 'tuple'>
Without the comma Python treats ('a') as an expression with a string in parentheses that evaluates
to a string:
>>> t2 = ('a')
>>> type(t2)
<class 'str'>
Another way to construct a tuple is the built-in function tuple. With no argument, it creates an
empty tuple:
>>> t = tuple()
>>> print(t)
()
If the argument is a sequence (string, list, or tuple), the result of the call to tuple is a tuple with
the elements of the sequence:
>>> t = tuple('lupins')
>>> print(t)
('l', 'u', 'p', 'i', 'n', 's')
Because tuple is the name of a constructor, you should avoid using it as a variable name.
Most list operators also work on tuples. The bracket operator indexes an element, and the slice operator selects a range:

>>> t = ('a', 'b', 'c', 'd', 'e')
>>> print(t[1:3])
('b', 'c')
But if you try to modify one of the elements of the tuple, you get an error:
You can’t modify the elements of a tuple, but you can replace one tuple with another:
Decorate a sequence by building a list of tuples with one or more sort keys preceding the elements from the sequence,
Sort the list of tuples using the Python built-in sort, and
Undecorate by extracting the sorted elements of the sequence.
For example, suppose you have a list of words and you want to sort them from longest to
shortest:
The first loop builds a list of tuples, where each tuple is a word preceded by its length.
sort compares the first element, length, first, and only considers the second element to break ties.
The keyword argument reverse=True tells sort to go in decreasing order.
The second loop traverses the list of tuples and builds a list of words in descending order of
length. The four-character words are sorted in reverse alphabetical order, so “what” appears
before “soft” in the following list. The output of the program is as follows:
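A sketch of the decorate-sort-undecorate program (the word list is illustrative, echoing the romeo.txt example):

```python
txt = 'but soft what light in yonder window breaks'
words = txt.split()

t = list()
for word in words:      # decorate: prepend the sort key (the length)
    t.append((len(word), word))

t.sort(reverse=True)    # sort: longest first; ties broken by the word itself

res = list()
for length, word in t:  # undecorate: extract the words, now sorted
    res.append(word)
print(res)
# ['yonder', 'window', 'breaks', 'light', 'what', 'soft', 'but', 'in']
```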
As you should expect from a dictionary, the items are in no particular order.
However, since the list of tuples is a list, and tuples are comparable, we can now sort the list of
tuples. Converting a dictionary to a list of tuples is a way for us to output the contents of a
dictionary sorted by key:
The new list is sorted in ascending alphabetical order by the key value.
This loop has two iteration variables because items returns a list of tuples and key, val is a tuple
assignment that successively iterates through each of the key-value pairs in the dictionary.
For each iteration through the loop, both key and value are advanced to the next key-value pair in
the dictionary (still in hash order).
The output of this loop is:
10 a
22 c
1 b
>>> d = {'a':10, 'b':1, 'c':22}
>>> l = list()
>>> for key, val in d.items() :
... l.append( (val, key) )
>>> l
[(10, 'a'), (22, 'c'), (1, 'b')]
>>> l.sort(reverse=True)
>>> l
[(22, 'c'), (10, 'a'), (1, 'b')]
>>>
directory[last,first] = number
The expression in brackets is a tuple. We could use tuple assignment in a for loop to traverse this
dictionary.
This loop traverses the keys in directory, which are tuples. It assigns the elements of each tuple
to last and first, then prints the name and corresponding telephone number.
5.2 Sets
In Python, a set is an unordered collection of elements that is iterable, mutable, and has no duplicate elements. The order of elements in a set is undefined, though it may consist of various elements.
The major advantage of using a set, as opposed to a list, is that it has a highly optimized method for
checking whether a specific element is contained in the set.
5.2.1 Creating a Set
Sets can be created by passing an iterable object or a sequence to the built-in set() function, or by placing comma-separated elements inside curly braces.
Note: a set cannot have mutable elements, such as lists, sets, or dictionaries, as its elements.
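Both creation styles, sketched briefly:

```python
s1 = set(['a', 'b', 'c', 'a'])  # from an iterable: duplicates are removed
s2 = {1, 2, 3, 2}               # curly-brace literal, comma-separated
print(s1)                       # a 3-element set
print(2 in s2)                  # True -> the fast membership test
```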
Chapter 6
HANDS-ON PROJECTS
Now that we are done with the Python data structures, this section discusses two interesting hands-on projects, at an introductory and an intermediate level respectively.
6.1.1 Working
Here are the steps to create a chatbot in Python from scratch:
1. Load the data file
The data file is in JSON format, so we use the json package to parse the JSON file into Python.
2. Preprocess data
When working with text data, we need to perform various preprocessing steps on the data before we build a machine learning or deep learning model. Tokenizing is the most basic and first thing you can do with text data. Tokenizing is the process of breaking the whole text into small parts, like words.
Here we iterate through the patterns and tokenize the sentence using nltk.word_tokenize()
function and append each word in the words list. We also create a list of classes for our tags.
Now we will lemmatize each word and remove duplicate words from the list. Lemmatizing is the process of converting a word into its lemma form. We then create a pickle file to store the Python objects, which we will use while predicting.
4. Build the model
We have our training data ready; now we will build a deep neural network that has 3 layers, using the Keras sequential API. After training the model for 200 epochs, we achieved 100% accuracy on our model. Let us save the model as ‘chatbot_model.h5’.
We will load the trained model and then use a graphical user interface that will display the predicted response from the bot. The model will only tell us the class the input belongs to, so we will implement some functions which identify the class and then retrieve a random response from the list of responses. Again we import the necessary packages and load the ‘words.pkl’ and ‘classes.pkl’ pickle files which we created when we trained our model:
To predict the class, we will need to provide input in the same way as we did while training. So
we will create some functions that will perform text preprocessing and then predict the class.
Now we will code a graphical user interface. For this, we use the Tkinter library, which already comes with Python. We will take the input message from the user and then use the helper functions we have created to get the response from the bot and display it on the GUI. Here is the full source code for the GUI.
6. Run the chatbot
To run the chatbot, we have two main files: train_chatbot.py and chatgui.py.
python train_chatbot.py
If we don’t see any error during training, we have successfully created the model. Then to run
the app, we run the second file.
python chatgui.py
The program will open up a GUI window within a few seconds. With the GUI you can easily
chat with the bot.
6.1.2 Summary
In this Python data science project, we learned about chatbots and implemented an accurate deep learning version of a chatbot in Python. You can customize the data according to business requirements and train the chatbot with great accuracy. Chatbots are used everywhere, and many businesses are looking forward to implementing bots in their workflows.
Part B
(Python for Data Science)
Chapter 1
Introduction
1.1 Data Science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining, machine learning and big data.
Data science is a "concept to unify statistics, data analysis and their related methods" in
order to "understand and analyze actual phenomena" with data. It uses techniques and theories
drawn from many fields within the context of mathematics, statistics, computer science, domain
knowledge and information science. Turing award winner Jim Gray imagined data science as a
"fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and
asserted that "everything about science is changing because of the impact of information
technology" and the data deluge.
1.2 Foundations
Data science is an interdisciplinary field focused on extracting knowledge from data sets,
which are typically large (see big data). The field encompasses analysis, preparing data for
analysis, and presenting findings to inform high-level decisions in an organization. As such, it
incorporates skills from computer science, mathematics, statistics, information visualization,
graphic design, complex systems, communication and business. Statistician Nathan Yau,
drawing on Ben Fry, also links data science to human-computer interaction: users should be able
to intuitively control and explore data. In 2015, the American Statistical Association identified
database management, statistics and machine learning, and distributed and parallel systems as the
three emerging foundational professional communities.
Many statisticians, including Nate Silver, have argued that data science is not a
new field, but rather another name for statistics. Others argue that data science is distinct
from statistics because it focuses on problems and techniques unique to digital data.
Vasant Dhar writes that statistics emphasizes quantitative data and description. In
contrast, data science deals with quantitative and qualitative data (e.g. images) and
emphasizes prediction and action. Andrew Gelman of Columbia University and data
scientist Vincent Granville have described statistics as a nonessential part of data science.
Stanford professor David Donoho writes that data science is not distinguished from
statistics by the size of datasets or use of computing, and that many graduate programs
misleadingly advertise their analytics and statistics training as the essence of a data
science program. He describes data science as an applied field growing out of traditional
statistics. In summary, data science can therefore be described as an applied branch of statistics.
1.3 Impact of Data Science
Big data is very quickly becoming a vital tool for businesses and companies of all sizes.
[29] The availability and interpretation of big data has altered the business models of old
industries and enabled the creation of new ones.[29] Data-driven businesses are worth $1.2
trillion collectively in 2020, an increase from $333 billion in the year 2015.[30] Data scientists
are responsible for breaking down big data into usable information and creating software and
algorithms that help companies and organizations determine optimal operations.[30] As big data
continues to have a major impact on the world, data science does as well due to the close
relationship between the two.
Chapter 2
Technologies and Methods
2.1.3 Frameworks
TensorFlow is a framework for creating machine learning models developed by
Google.
Pytorch is another framework for machine learning developed by Facebook.
Jupyter Notebook is an interactive web interface for Python that allows faster
experimentation.
Apache Hadoop is a software framework that is used to process data over large
distributed systems.
Chapter 3
Linear Regression
3.1 Definition
In statistics, linear regression is a linear approach to modelling the relationship between a
scalar response and one or more explanatory variables (also known as dependent and
independent variables). The case of one explanatory variable is called simple linear regression;
for more than one, the process is called multiple linear regression. This term is distinct from
multivariate linear regression, where multiple correlated dependent variables are predicted, rather
than a single scalar variable.
3.1.1 Use
Linear regression has many practical uses. Most applications fall into one of the
following two broad categories:
If the goal is prediction, forecasting, or error reduction,
linear regression can be used to fit a predictive model to an observed data set of
values of the response and explanatory variables. After developing such a
model, if additional values of the explanatory variables are collected without an
accompanying response value, the fitted model can be used to make a
prediction of the response.
If the goal is to explain variation in the response variable that can be attributed
to variation in the explanatory variables, linear regression analysis can be
applied to quantify the strength of the relationship between the response and the
explanatory variables, and in particular to determine whether some explanatory
variables may have no linear relationship with the response at all, or to identify
which subsets of explanatory variables may contain redundant information
about the response.
3.2 Mathematical Approach
Simple and multiple linear regression
Fig. 3.1
Multiple linear regression is a generalization of simple linear regression to the case of more
than one independent variable, and a special case of general linear models, restricted to one
dependent variable. The basic model for multiple linear regression is
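In a common notation (a sketch; the symbols below, with p predictors and n observations, are assumed rather than taken from the original figure):

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \ldots, n
```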
In the more general multivariate linear regression, there is one equation of the above form
for each of m > 1 dependent variables that share the same set of explanatory variables and
hence are estimated simultaneously with each other:
for all observations indexed as i = 1, ... , n and for all dependent variables indexed as j =
1, ... , m.
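This system can be sketched in the same assumed notation, one equation for each dependent variable j:

```latex
y_{ij} = \beta_{0j} + \beta_{1j} x_{i1} + \beta_{2j} x_{i2} + \cdots + \beta_{pj} x_{ip} + \varepsilon_{ij}
```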
Nearly all real-world regression models involve multiple predictors, and basic descriptions
of linear regression are often phrased in terms of the multiple regression model. Note,
however, that in these cases the response variable y is still a scalar. Another
term, multivariate linear regression, refers to cases where y is a vector, i.e., the same
as general linear regression.
Bayesian linear regression applies the framework of Bayesian statistics to linear regression.
(See also Bayesian multivariate linear regression.) In particular, the regression coefficients β
are assumed to be random variables with a specified prior distribution. The prior distribution
can bias the solutions for the regression coefficients, in a way similar to (but more general
than) ridge regression or lasso regression. In addition, the Bayesian estimation process
produces not a single point estimate for the "best" values of the regression coefficients but an
entire posterior distribution, completely describing the uncertainty surrounding the quantity.
This can be used to estimate the "best" coefficients using the mean, mode, median, any
quantile (see quantile regression), or any other function of the posterior distribution.
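As a hedged illustration of this idea, the sketch below fits scikit-learn's BayesianRidge estimator (assumed available; the data is synthetic) and shows how a prediction comes with a posterior standard deviation rather than only a point estimate:

```python
# A minimal sketch of Bayesian linear regression using scikit-learn's
# BayesianRidge (assumed available); the data here is synthetic.
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)

model = BayesianRidge()
model.fit(X, y)

# Beyond a single point estimate, the posterior also yields a predictive
# standard deviation for each prediction.
mean, std = model.predict(X[:5], return_std=True)
print(model.coef_)  # posterior mean of the coefficients, close to [3.0, -1.5]
```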
Quantile regression focuses on the conditional quantiles of y given X rather than the
conditional mean of y given X. Linear quantile regression models a particular conditional
quantile, for example the conditional median, as a linear function β^T x of the predictors.
Mixed models are widely used to analyze linear regression relationships involving dependent
data when the dependencies have a known structure. Common applications of mixed models
include analysis of data involving repeated measurements, such as longitudinal data, or data
obtained from cluster sampling. They are generally fit as parametric models, using maximum
likelihood or Bayesian estimation. In the case where the errors are modeled as normal
random variables, there is a close connection between mixed models and generalized least
squares.[18] Fixed effects estimation is an alternative approach to analyzing this type of data.
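As an illustrative sketch (synthetic repeated-measures data; the statsmodels library is assumed to be available), a random-intercept mixed model can be fit by maximum likelihood as follows:

```python
# Mixed-model sketch: repeated measurements per subject, with a random
# intercept capturing within-subject dependence (synthetic data).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n_subjects, n_obs = 30, 10
subject = np.repeat(np.arange(n_subjects), n_obs)
x = rng.normal(size=n_subjects * n_obs)
subject_effect = rng.normal(scale=1.0, size=n_subjects)[subject]
y = 2.0 + 0.5 * x + subject_effect + rng.normal(scale=0.3, size=n_subjects * n_obs)

data = pd.DataFrame({"y": y, "x": x, "subject": subject})
fit = smf.mixedlm("y ~ x", data, groups=data["subject"]).fit()
print(fit.params["x"])  # fixed-effect slope, close to the true value 0.5
```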
Principal component regression (PCR)[7][8] is used when the number of predictor variables
is large, or when strong correlations exist among the predictor variables. This two-stage
procedure first reduces the predictor variables using principal component analysis then uses
the reduced variables in an OLS regression fit. While it often works well in practice, there is
no general theoretical reason that the most informative linear function of the predictor
variables should lie among the dominant principal components of the multivariate
distribution of the predictor variables. Partial least squares regression is an extension of
the PCR method that does not suffer from this deficiency.
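The two-stage PCR procedure can be sketched as a pipeline, here on synthetic data with strongly correlated predictors (scikit-learn assumed available):

```python
# PCR sketch: PCA reduces correlated predictors, then OLS is fit on the
# retained components (synthetic data).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(2)
z = rng.normal(size=(500, 1))
# Ten predictors that are noisy copies of one latent factor: strongly correlated.
X = z + 0.1 * rng.normal(size=(500, 10))
y = z[:, 0] + 0.1 * rng.normal(size=500)

pcr = make_pipeline(PCA(n_components=2), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))  # R^2 is high here: one component captures the factor
```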
Least-angle regression[6] is an estimation procedure for linear regression models that was
developed to handle high-dimensional covariate vectors, potentially with more covariates
than observations.
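A brief sketch of least-angle regression on synthetic data with more covariates than observations, using scikit-learn's Lars estimator (assumed available):

```python
# Least-angle regression sketch: p = 100 covariates, only n = 50 observations,
# with a single truly informative covariate (synthetic data).
import numpy as np
from sklearn.linear_model import Lars

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 100))
y = 4.0 * X[:, 0] + rng.normal(scale=0.1, size=50)

fit = Lars(n_nonzero_coefs=5).fit(X, y)
print(np.flatnonzero(fit.coef_))  # the informative covariate (index 0) is selected
```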
The Theil–Sen estimator is a simple robust estimation technique that chooses the slope of the
fit line to be the median of the slopes of the lines through pairs of sample points. It has
similar statistical efficiency properties to simple linear regression but is much less sensitive to
outliers.[19]
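Because the definition is so simple, the Theil–Sen slope can be sketched from scratch; the example below uses synthetic data with one gross outlier:

```python
# Theil-Sen sketch: the slope estimate is the median of the slopes through
# all pairs of sample points, which makes it robust to outliers.
import numpy as np
from itertools import combinations

def theil_sen_slope(x, y):
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i, j in combinations(range(len(x)), 2)
              if x[j] != x[i]]
    return np.median(slopes)

x = np.arange(20, dtype=float)
y = 2.0 * x + 1.0
y[3] = 100.0  # a gross outlier
print(theil_sen_slope(x, y))  # still 2.0 despite the outlier
```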
Other robust estimation techniques, including the α-trimmed mean approach, and L-, M-,
S-, and R-estimators, have been introduced.
Chapter 4
4.1 Applications
4.1.1 Finance
The capital asset pricing model uses linear regression as well as the concept of
beta for analyzing and quantifying the systematic risk of an investment. This comes directly
from the beta coefficient of the linear regression model that relates the return on the investment
to the return on all risky assets.
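As a hedged sketch with synthetic returns, beta can be estimated as the slope of the least-squares line relating the asset's excess returns to the market's:

```python
# CAPM beta sketch: the OLS slope of asset returns on market returns equals
# cov(asset, market) / var(market). Returns here are synthetic.
import numpy as np

rng = np.random.default_rng(3)
market = rng.normal(0.001, 0.01, size=1000)        # market excess returns
asset = 1.3 * market + rng.normal(0, 0.005, 1000)  # true beta of 1.3 plus noise

beta = np.cov(asset, market)[0, 1] / np.var(market, ddof=1)
print(beta)  # close to 1.3
```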
4.1.2 Epidemiology
Early evidence relating tobacco smoking to mortality and morbidity came from
observational studies employing regression analysis. In order to reduce spurious
correlations when analyzing observational data, researchers usually include several
variables in their regression models in addition to the variable of primary interest. For
example, in a regression model in which cigarette smoking is the independent variable of
primary interest and the dependent variable is lifespan measured in years, researchers
might include education and income as additional independent variables, to ensure that
any observed effect of smoking on lifespan is not due to those other socio-economic
factors. However, it is never possible to include all possible confounding variables in an
empirical analysis. For example, a hypothetical gene might increase mortality and also
cause people to smoke more. For this reason, randomized controlled trials are often able
to generate more compelling evidence of causal relationships than can be obtained using
regression analyses of observational data. When controlled experiments are not feasible,
variants of regression analysis such as instrumental variables regression may be used to
attempt to estimate causal relationships from observational data.
4.1.3 Machine Learning
Linear regression plays an important role in the subfield of artificial intelligence
known as machine learning. The linear regression algorithm is one of the fundamental
supervised machine-learning algorithms, owing to its relative simplicity and well-known
properties.
4.1.4 Health Care
In the health-care industry, data science is making great leaps. The various
industries in health-care making use of data science are:
i. Medical Image Analysis
In medical image analysis, data science has created a strong sphere of
influence for analyzing medical images such as X-rays, MRIs, and CT scans.
Previously, doctors and medical examiners would have to manually search for
clues in the medical images. However, with the advancements in computing
technologies and surge in data, it is possible to create machines that can
automatically detect flaws in the imagery. Data Scientists have created powerful
image recognition tools that allow doctors to have an in-depth understanding of
complex medical imagery.
ii. Drug Discovery
Another important field making use of data science is drug discovery. In drug
discovery, new candidate medicines are formulated. Drug Discovery is a tedious and
often complex process. Data Science can help us to simplify this process and provide us
with an early insight into the success rate of the newly discovered drug. With Machine
Learning, we can also analyze several combinations of drugs and their effect on different
gene structure to predict the outcome.
iii. Predictive Modeling
With the advancements in predictive modeling, data scientists can help to predict
the outcome of disease given the historical data of the patients. Data Science has enabled
practitioners to analyze the data, make correlations between the variables of the data and
also provide insights to doctors and medical practitioners.
Chapter 5
5.1 Hands-on Project (Stock Price Prediction)
Predicting how the stock market will perform is one of the most difficult forecasting
tasks. Many factors are involved – physical versus psychological drivers, rational and
irrational behaviour, and so on. Together, these make share prices volatile and very difficult
to predict with a high degree of accuracy. We will implement a mix of machine learning
techniques to predict a company's future stock price, starting with simple approaches such as
averaging and linear regression, and then moving on to an advanced technique, LSTM.
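Before the LSTM, the simplest of the "averaging" baselines mentioned above can be sketched as follows (synthetic closing prices; a hypothetical 60-day window):

```python
# Moving-average baseline sketch: predict the next close as the mean of the
# previous 60 closes (synthetic price series, hypothetical window size).
import numpy as np

prices = np.linspace(100, 120, 300) + np.sin(np.arange(300))  # synthetic closes
window = 60
preds = [prices[i - window:i].mean() for i in range(window, len(prices))]
rmse = np.sqrt(np.mean((prices[window:] - np.array(preds)) ** 2))
print(rmse)  # error of the moving-average baseline on this series
```

Such a baseline mostly lags the trend, which is why the later LSTM model is worth the extra complexity.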
5.1.1 Fundamental Analysis and Technical Analysis
Fundamental Analysis involves analyzing the company’s future profitability on the
basis of its current business environment and financial performance.
Technical Analysis, on the other hand, includes reading the charts and using
statistical figures to identify the trends in the stock market.
We will first load the dataset and define the target variable for the problem:
The profit or loss calculation is usually determined by the closing price of a stock for the
day, hence we will consider the closing price as the target variable. Let’s plot the target variable
to understand how it’s shaping up in our data:
import pandas as pd
import matplotlib.pyplot as plt
#setting the date as the index
df['Date'] = pd.to_datetime(df.Date, format='%Y-%m-%d')
df.index = df['Date']
#plot the closing price history
plt.figure(figsize=(16,8))
plt.plot(df['Close'], label='Close Price history')
The input gate: adds new information to the cell state.
The forget gate: removes information that is no longer required by the model.
The output gate: selects the information from the cell state to be shown as output.
Implementation
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from keras.models import Sequential
from keras.layers import Dense, LSTM
#creating a dataframe holding only Date and Close
data = df.sort_index(ascending=True, axis=0)
new_data = pd.DataFrame(index=range(0, len(df)), columns=['Date', 'Close'])
for i in range(0, len(data)):
    new_data['Date'][i] = data['Date'][i]
    new_data['Close'][i] = data['Close'][i]
#setting index
new_data.index = new_data.Date
new_data.drop('Date', axis=1, inplace=True)
#creating train and validation sets
dataset = new_data.values
train = dataset[0:987,:]
valid = dataset[987:,:]
#scaling the data and building 60-step input windows
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(dataset)
x_train, y_train = [], []
for i in range(60, len(train)):
    x_train.append(scaled_data[i-60:i, 0])
    y_train.append(scaled_data[i, 0])
x_train, y_train = np.array(x_train), np.array(y_train)
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], 1))
#create and fit the LSTM network
model = Sequential()
model.add(LSTM(units=50, return_sequences=True, input_shape=(x_train.shape[1], 1)))
model.add(LSTM(units=50))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train, y_train, epochs=1, batch_size=1, verbose=2)
#predicting 246 values, using the past 60 from the train data
inputs = new_data[len(new_data) - len(valid) - 60:].values
inputs = inputs.reshape(-1, 1)
inputs = scaler.transform(inputs)
X_test = []
for i in range(60, inputs.shape[0]):
    X_test.append(inputs[i-60:i, 0])
X_test = np.array(X_test)
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
closing_price = model.predict(X_test)
closing_price = scaler.inverse_transform(closing_price)
Results
rms=np.sqrt(np.mean(np.power((valid-closing_price),2)))
rms
11.772259608962642
#for plotting (copying avoids pandas' SettingWithCopyWarning)
train = new_data[:987]
valid = new_data[987:].copy()
valid['Predictions'] = closing_price
plt.plot(train['Close'])
plt.plot(valid[['Close','Predictions']])
Fig. 5.2 Prediction Output Visualization
Chapter 6
Conclusion
6.1 Conclusion
Neither the analytical engineering nor the exploration should be omitted when
considering the application of data science methods to solve a business problem. Omitting the
engineering aspect usually makes it much less likely that the results of mining data will actually
solve the business problem. Omitting an understanding of the process as one of exploration and
discovery often keeps an organization from putting the right management, incentives, and
investments in place for the project to succeed.
On the purpose of Data Science, we conclude that Data Scientists are the backbone of
data-intensive companies. Their role is to extract, preprocess and analyze data so that companies
can make better decisions. Different companies have their own requirements and use data
accordingly. In the end, the goal of a Data Scientist is to help businesses grow. With the decisions
and insights provided, companies can adopt appropriate strategies and customize themselves for
an enhanced customer experience.