Internship Report
Academic year
(2022-2023)
Government polytechnic
INTERNSHIP REPORT UNDER
SUBJECT OF
Diploma. SEMESTER-III
Submitted by :-
Dharmesh.S.Vaish
BrainyBeam Technologies was founded with a vision to address growing businesses' needs of
reducing the time to market and cost effectiveness required to develop and maintain unique
and customized web and mobile solutions. We are uniquely and strategically positioned to
partner with startups and leading brands to help them expand their business and offer the most
effective and cost-efficient solutions that provide revenues and value to their business needs.
Vision
To become the most trusted and preferred offshore IT solutions partner for Startups, SMBs and
Enterprises through innovation and technology leadership. Understanding your ambitious
vision, honing in on its essence, creating a design strategy, and knowing how to technically
execute it is what we do best. Our promise? The integrity of your vision will be maintained and
we'll enhance it to best reach your target customers. With our primary focus on creating
amazing user experiences, we'll help you understand the tradeoffs, prioritize features, and
distill valuable functionality. It's an art form we care about getting right.
Joining Letter
Completion Certificate
Jainish Shah
ACKNOWLEDGEMENT
I would like to express my deepest gratitude to all those who made it possible for me
to complete this internship. I owe special thanks to our Assistant
Professor, Prof. Shweta Rajput, whose stimulating suggestions and
encouragement helped me to coordinate the internship, especially in drafting this report.
Furthermore, I would like to acknowledge with much appreciation the crucial role of the
Head of Department, Dr. Avani Vasant, who gave permission to use all the required
equipment and material to complete the task. Last but not least, many thanks go
to the teachers, my friends and my family, who invested their full effort in guiding me
towards the goal.
I also appreciate the guidance given by the developer at BrainyBeam, Mr Raj, as well as the
internship panels, who advised and guided me at every stage
of the internship.
Abstract
Data science and analysis play a significant role today across almost
every industry, for example finance, e-commerce, business,
education and government.
Organizations now analyse the behaviour and interests of their customers from
every angle so that they can take decisions in the customers' favour. Data is analysed with
programming languages such as Python, which is one of the most versatile
languages available.
Netflix is a well-known data-driven business that reached the top by analysing
the interests of its customers. Key terms used in data
science are: Data Visualization, Anaconda Jupyter Notebook, Exploratory Data
Analysis, Machine Learning, Data Wrangling, and Evaluation using the
scikit-surprise library.
||| DAY - 1
BASIC INTRODUCTION AND DOMAIN KNOWLEDGE
The work flow of the whole internship was explained, and some basic
domain knowledge was discussed.
Introduction about Field
i. Discussed basic points about Python, how Python works, and the advantages
of Python for working in data science.
ii. Explained how to install and run Python, Jupyter Notebook
and other useful tools.
AIM: Build a Python program which takes students and their
subject marks as input and gives the total marks obtained by each student.
Program:
n = int(input("Enter the number of students "))
for i in range(n):
    name = input("Enter the Name:")
    sub = int(input("No. of subjects "))
    total = 0  # reset the total for every student
    for j in range(sub):
        marks = int(input("Enter marks"))
        total = total + marks
    print(name, total)
||| DAY - 2
AIM: List the commonly used methods of list, set,
tuple and dictionary, with their rules
Data Types in Python
1. str: A string is a sequence of characters. In Python a string is immutable:
its characters and length cannot be changed after creation, so string methods
return new strings.
2. Numbers: int, float and complex are the numeric data types. int stores whole
numbers of unlimited size, float stores floating point (real) values and complex
stores complex numbers. Python 3 has no separate long type.
3. Lists are built-in data types in Python that are used to store multiple items in a
single variable. The data is stored in [ ].
4. Sets are also used to store multiple items in a single variable. A set has no
order and no index. The data is stored in { }.
5. Tuples: Like lists, tuples are ordered; unlike lists, tuples
are immutable. The data is stored in ( ).
6. Dictionary: Stores values as key: value pairs. Ordered, changeable (mutable), and does not
allow duplicate keys.
LIST:
Example: a = ['Jai', 'ni', 'sh']
Lists are built-in data types in Python that are used to store multiple items in a single
variable. A list keeps its items in order, the items in the
list are changeable (mutable), and a list allows duplicate values.
LIST Methods:
- .append(x) : adds an item to the end of the list
- .insert(i, x) : inserts an item at a given position
- .remove(x) : removes the first item from the list whose value is equal to x
- .copy() : returns a shallow copy of the list
- .count(x) : returns the number of elements with the specified value
- .reverse() : reverses the list in place
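The list methods above can be tried out on the example list. This is a minimal sketch; the extra values 'x' and the second 'sh' are only there for illustration.

```python
# Start from the list used in the example above.
a = ['Jai', 'ni', 'sh']

a.append('sh')        # add an item to the end
a.insert(1, 'x')      # insert 'x' at index 1
a.remove('x')         # remove the first item equal to 'x'
b = a.copy()          # shallow copy; changing b does not change a
b.reverse()           # reverse b in place

print(a)              # ['Jai', 'ni', 'sh', 'sh']
print(a.count('sh'))  # 2
print(b)              # ['sh', 'sh', 'ni', 'Jai']
```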
SET
- Sets are also used to store multiple items in a single variable.
- A set has no order and no index.
- The items of a set cannot be changed in place once the set is created (they must be
immutable values), but items can be added and removed.
- Duplicate values are not allowed in a set.
Sets Methods:
a. add(): adds element to a set
b. discard(): Removes an Element from The Set
c. isdisjoint(): Checks Disjoint Sets
d. issubset(): Checks if a Set is Subset of Another Set
e. union(): Returns the union of sets
f. update(): Add elements to the set
g. clear(): remove all elements from a set
CODE.
# set of vowels
vowels = {'a', 'e', 'i', 'u'}
# discard 'o' (no error is raised if the element is not in the set)
vowels.discard("o")
print(vowels)
# isdisjoint()
A = {1, 2, 3, 4}
B = {5, 6, 7}
C = {4, 5, 6}
print('Are A and B disjoint?', A.isdisjoint(B))
print('Are A and C disjoint?', A.isdisjoint(C))
#issubset()
A1 = {1, 2, 3}
B1 = {1, 2, 3, 4, 5}
print(A1.issubset(B1))
#union
A2 = {'a', 'c', 'd'}
B2 = {'c', 'd', 2 }
print('A U B =', A2.union(B2))
#update
A3 = {'a', 'b'}
B3 = {1, 2, 3}
A3.update(B3)  # update() changes A3 in place and returns None
print('A =', A3)
#clear
vowels.clear()
print('Vowels (after clear):', vowels)
TUPLE
- Stores multiple items in one variable.
- Like lists, tuples are ordered; unlike lists and sets, tuples are immutable.
- Tuples also allow duplicates.
Tuples Methods:
a. .count( ): Returns the number of times a specified value occurs in a tuple.
b. .index( ): Searches the tuple for a specified value and returns the position of where it
was found.
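A short sketch of the two tuple methods, together with the immutability rule; the tuple contents are made up for illustration.

```python
t = ('a', 'b', 'a', 'c')

print(t.count('a'))   # 2
print(t.index('c'))   # 3

# Tuples are immutable, so item assignment raises a TypeError.
try:
    t[0] = 'z'
except TypeError as err:
    message = str(err)
    print('TypeError:', message)
```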
Dictionaries
- Store values as key: value pairs
- Ordered (since Python 3.7), changeable (mutable), and do not allow duplicate keys
Code.
#get()
person = {'name': 'Jainish', 'age': 21}
print('Name: ', person.get('name'))
print('Age: ', person.get('age'))
#items()
print(person.items())
#keys
print(person.keys())
#setdefault()
age = person.setdefault('age')
print('person = ',person)
print('Age = ',age)
#values()
print(person.values())
#clear()
person.clear()
print(person)
||| DAY - 3
AIM:
1) Random module functions with explanation
2) Build password generator program containing numbers, Alphabets
and special characters.
3) Write a note about NLP, NLU, and NLG with examples.
4) Perform text to speech examples using gtts.
Program:
CODE:
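import random as rand
x = rand.randrange(5)
print(x) # random integer from 0 to 4
rand.seed(20) # fix the seed so that the following results are repeatable
print(rand.random()) # random float in the range [0.0, 1.0)
y = rand.randint(1,50)
print(y) # random integer from 1 to 50, both included
b = rand.uniform(1.0,5.0)
print(b) # random float between 1.0 and 5.0
print(rand.getstate()) # internal state of the random number generator
print(rand.randrange(3, 9)) # random integer from 3 to 8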
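The password generator from aim 2 can be sketched with the same random module. This is one possible approach, not necessarily the program written during the internship: the function name generate_password and the default length are assumptions. For real passwords the standard secrets module is the safer choice, since random is not meant for security purposes.

```python
import random
import string

def generate_password(length=12):
    """Return a random password with at least one letter, digit and symbol."""
    if length < 3:
        raise ValueError("length must be at least 3")
    # Take one character from each required group first.
    chars = [
        random.choice(string.ascii_letters),
        random.choice(string.digits),
        random.choice(string.punctuation),
    ]
    # Fill the remaining positions from the combined pool.
    pool = string.ascii_letters + string.digits + string.punctuation
    chars += [random.choice(pool) for _ in range(length - 3)]
    random.shuffle(chars)  # avoid a predictable group order
    return "".join(chars)

print(generate_password(12))
```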
NLU – Natural Language Understanding is the branch of NLP that transforms
human language into a machine-readable format. It allows computers to understand
commands without the formalized syntax of computer languages, and to
communicate back to humans in their own languages.
NLG – Natural Language Generation is a subfield of AI: software that automatically
transforms data into plain-English content. With the right data in the right format, an NLG
system can automatically turn numbers in a spreadsheet into data-driven narratives, or even
use associations between words to create partially or fully machine-written text.
gTTS Functions:
a. .get_bodies() : Get request bodies sent to the TTS API.
b. .save(): Do the TTS API request and write results to file.
c. .write_to_fp() : Do the TTS API request(s) and write bytes to a file-like object.
d. lang : The language of the speech, passed as a parameter when the gTTS object is created (for example lang='en').
Output
voice1.mp3
Similarly, there are many more libraries for working with audio, such as pyaudio for recording.
Here is a small implementation using pyaudio.
import pyaudio
import wave
FORMAT = pyaudio.paInt16
CHANNELS = 2
RATE = 44100
CHUNK = 1024
RECORD_SECONDS = 5
WAVE_OUTPUT_FILENAME = "file.wav"
audio = pyaudio.PyAudio()
# start Recording
stream = audio.open(format=FORMAT, channels=CHANNELS,
rate=RATE, input=True,
frames_per_buffer=CHUNK)
print("recording...")
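frames = []
# read the audio in chunks for RECORD_SECONDS seconds
for _ in range(int(RATE / CHUNK * RECORD_SECONDS)):
    frames.append(stream.read(CHUNK))
# stop Recording
stream.stop_stream()
stream.close()
sample_width = audio.get_sample_size(FORMAT)
audio.terminate()
# write the recorded frames to a WAV file
with wave.open(WAVE_OUTPUT_FILENAME, "wb") as wf:
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(sample_width)
    wf.setframerate(RATE)
    wf.writeframes(b"".join(frames))
So the program records the voice for five seconds and stores it as file.wav in the folder you are working in.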
||| DAY - 4
AIM:
1) List out 5 methods of pandas and numpy with output.
2) Reshape(-1,1) explanation
3) Linear regression working with mathematical equation.
4) Sales prediction with csv
Day 4: On day 4, we were taught the numpy and pandas methods that are most
important for data science tasks, such as importing files and creating
dataframes. Linear regression was then introduced as the first algorithm for building a model and predicting
data.
Program:
1) List out 5 methods of pandas and numpy with output.
NUMPY:
import numpy as np
a=np.array([[1,2,3,4,5,6],[7,8,9,10,11,12]])
a
print(a.ndim)  # 2 (ndim is an attribute that gives the number of dimensions of an array)
print(np.std(a))  # standard deviation
print(np.var(a))  # variance
Pandas:
import pandas as pd
df = pd.read_csv('Data1.csv')
df
a= df[df.language == 'python'][['rating','userid']]
df['rating']*2
ls_of_ls = [[1, 2, 3], [4, 5, 6]]  # example list of lists (assumed data)
df2 = pd.DataFrame(ls_of_ls)
df2
d = {'a':[1,2,3,4,5],'b':[4,5,6,7,8]}
df2 = pd.DataFrame(d)
df2
df.info()
The groupby() function is used to split the data into groups based on some criteria. pandas objects
can be split on any of their axes.
df.groupby("column_name")
Task 1: Why do we pass lists (or arrays) rather than single values when converting
a data structure into a data-frame?
Code:
import numpy as np
import pandas as pd
d = {'a':1,'b':2}
# pd.DataFrame(d) raises ValueError: "If using all scalar values, you must pass
# an index".
# The reason is that we are providing only scalar values here,
# so pandas cannot decide how many rows the data-frame should have.
# Passing an index fixes it:
df1 = pd.DataFrame(d, index=[0])
print(df1)
# Or we can wrap the values in lists, as below
d1 = {'a':[1,2],'b':[4,5]}
df2 = pd.DataFrame(d1)
df2
z = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
print(z.shape)
z1 = z.reshape(-1,1)
print(z1.shape)
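import numpy as np
from sklearn.linear_model import LinearRegression
# example training data (assumed; the original values are not shown): y = 2x + 1
x = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([3, 5, 7, 9, 11])
testing = np.array([8,9])
model = LinearRegression()
model.fit(x,y)
print(model.predict(testing.reshape(-1,1)))
m = model.coef_
print(m)
print(model.intercept_)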
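For aim 3, simple linear regression fits the line y = m*x + c. With one input variable the least-squares solution has a closed form: m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2) and c = y_mean - m * x_mean. The sketch below computes these by hand on made-up data (y = 2x + 1); the function name fit_line is an assumption, and scikit-learn solves the same least-squares problem with a more general routine.

```python
def fit_line(xs, ys):
    """Return slope m and intercept c of the least-squares line y = m*x + c."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    c = y_mean - m * x_mean
    return m, c

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]   # made-up data lying exactly on y = 2x + 1
m, c = fit_line(xs, ys)
print(m, c)                            # 2.0 1.0
print([m * x + c for x in [8, 9]])     # [17.0, 19.0]
```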
CODE
import numpy as np
import pandas as pd
df1 = pd.read_csv('sales.csv')
df1
data1 = df1.groupby(['month']).mean()
x = np.array(data1.index)
y = np.array(data1['sales'])
testing = np.array([17,12])
||| DAY - 5
AIM :
1) Try to clean train.csv data
2) decision tree explanation
3) random state working
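import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.read_csv('train.csv')
le = LabelEncoder()
# encode the text column Sex as numbers
df['Gender']= le.fit_transform(df['Sex'])
df.info()
# data format
df.describe()
df.describe(include=['O'])
df.info()
# Taking care of missing data: NaN values in the Age column are interpolated and
# missing Fare values are filled with the median
df['Age'] = df['Age'].interpolate()
df["Fare"] = df["Fare"].fillna(df["Fare"].median())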
Mathematics behind the Decision tree algorithm: Before going to Information Gain, we
first have to understand entropy.
Entropy: Entropy is a measure of the impurity, disorder, or uncertainty in a set of
examples.
Purpose of Entropy:
Entropy controls how a Decision Tree decides to split the data. It affects how a Decision
Tree draws its boundaries.
For a two-class problem, entropy values range from 0 to 1. The lower the entropy of a node, the purer the node and the more its prediction can be trusted.
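Entropy is computed as H = -sum(p_i * log2(p_i)), where p_i is the fraction of examples in class i. A minimal sketch with made-up labels (the function name entropy is an assumption):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

print(entropy([1, 1, 1, 1]))   # 0.0, a pure node
print(entropy([0, 0, 1, 1]))   # 1.0, an evenly mixed two-class node
print(entropy([0, 1, 1, 1]))   # between 0 and 1
```

A decision tree prefers the split whose child nodes have the lowest weighted entropy, which is the same as the highest information gain.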
||| DAY - 6
AIM:
On day 6 of the internship we were taught more data
cleaning operations, and then the decision tree algorithm and its parameters
to better understand the algorithm. We were then assigned 4
tasks.
1) Explain 5 parameters used in decision tree model
2) Mention data cleaning methods and its working.
3) Make a diagram explaining decision tree parameters with
titanic dataset with equation.
4) Obtain 90% accuracy from the dataset.
b. min_samples_split
• An internal node will have further splits (also called children).
• min_samples_split specifies the minimum number of samples required to
split an internal node.
• We can either specify number to denote the minimum number or a fraction to
denote the percentage of samples in an internal node.
c. min_samples_leaf
• A leaf node is a node without any children.
• min_samples_leaf is the minimum number of the samples required to be at
a leaf node.
• This parameter is similar to min_samples_split; however, it describes
the minimum number of samples at the leaves, the base of the tree.
d. max_features
• max_features represent the number of features to consider when looking
for the best split.
e. criterion
• criterion is an optional string parameter, default = “gini”.
• It is the function used to measure the quality of a split. Supported criteria are
“gini” and “entropy”.
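The “gini” criterion refers to Gini impurity, G = 1 - sum(p_i^2), where p_i is the fraction of samples in class i. A small sketch with made-up labels (the function name gini is an assumption):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini([1, 1, 1, 1]))   # 0.0, a pure node
print(gini([0, 0, 1, 1]))   # 0.5, the maximum for two classes
```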
a) Check for Missing Values – to detect missing values across
different array dtypes, pandas provides the functions isnull()
and notnull().
Example.
c) Drop Missing Values – to simply exclude the missing values, use the
dropna function along with the axis argument. By default, axis = 0 (along
rows), which means that if any value within a row is NA then the whole
row is excluded.
A particular book can have only one date of publication. Therefore, we need to do
the following:
• Remove the extra dates in square brackets, wherever present: 1879 [1878]
• Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54
• Completely remove the dates we are not certain about and replace them
with NumPy’s NaN: [1897?]
• Convert the string nan to NumPy’s NaN value
regex = r'^(\d{4})'
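The pattern keeps only a four-digit year at the start of the string, which covers all four cleaning rules at once. The sketch below applies it to the sample values from the list above using only the standard library; the helper name clean_date is an assumption, and math.nan stands in for NumPy's NaN (both are the same floating point value). In pandas the same pattern can be applied to a whole column.

```python
import math
import re

regex = r'^(\d{4})'

def clean_date(raw):
    """Return the leading four-digit year as a float, or NaN if there is none."""
    match = re.match(regex, str(raw))
    return float(match.group(1)) if match else math.nan

samples = ['1879 [1878]', '1860-63', '1839, 38-54', '[1897?]', 'nan']
cleaned = [clean_date(s) for s in samples]
print(cleaned)   # [1879.0, 1860.0, 1839.0, nan, nan]
```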
import numpy as np
import pandas as pd
train_df = pd.read_csv('train.csv')  # assumed file name for the Titanic training data
# number and percentage of missing values per column
total = train_df.isnull().sum().sort_values(ascending=False)
percent_1 = train_df.isnull().sum()/train_df.isnull().count()*100
percent_2 = (round(percent_1, 1)).sort_values(ascending=False)
missing_data = pd.concat([total, percent_2], axis=1, keys=['Total', '%'])
missing_data.head(5)
train_df.columns.values
train_df = train_df.drop(['PassengerId'],axis=1)
### Dealing with the missing data
import re
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
data = [train_df]
data = [train_df]
train_df["Age"].isnull().sum()
common_value = 'S'
data = [train_df]
# fill missing Embarked values with the most common port
for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna(common_value)
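from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
x = train_df.drop(["Survived","Name"],axis=1)
y = train_df["Survived"]
# hold out part of the data for testing (the split settings are assumed)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
decision_tree = DecisionTreeClassifier()
decision_tree.fit(x_train,y_train)
y_pred = decision_tree.predict(x_test)
# accuracy on the training data, in percent
acc_decision_tree = round(decision_tree.score(x_train,y_train) *100,2)
print(acc_decision_tree)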
||| DAY - 7
Task List:
1) Find polarity of unique products using apply with Function.
2) Use frequency distribution and pos tag from nltk.
3) Make list of meta characters and make the pattern for email and
phone number.
The range of polarity is from -1 to 1(negative to positive) and will tell us if the
text contains positive or negative feedback.
The polarity of words is retrieved from the package pattern and the sentence
polarity is calculated using: Sum of polarity of all the words in a sentence
divided by the total number of words in the sentence.
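The averaging rule described above can be sketched as follows. The word polarities in WORD_POLARITY are invented for illustration; a real library ships its own lexicon and may apply extra rules (for example for negation), so its numbers will differ.

```python
# Hypothetical word polarities in the range -1 to 1 (not taken from any library).
WORD_POLARITY = {'good': 0.7, 'great': 0.8, 'bad': -0.7, 'terrible': -1.0}

def sentence_polarity(sentence):
    """Sum of the word polarities divided by the number of words."""
    words = sentence.lower().split()
    if not words:
        return 0.0
    return sum(WORD_POLARITY.get(w, 0.0) for w in words) / len(words)

score = sentence_polarity('great fabric but bad fit')
print(score)   # slightly positive: (0.8 - 0.7) / 5
```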
import pandas as pd
import numpy as np
import nltk
from textblob import TextBlob
df = pd.read_csv("cloths-rating.csv")
def sentiment_calc(text):
    # return the polarity of the text, or None if it cannot be processed
    try:
        return TextBlob(str(text)).sentiment.polarity
    except Exception:
        return None
df['sentiment'] = df['grouped'].apply(sentiment_calc)
import nltk
text="The titular threat of The Blob has always struck me as the ultimate movie ...
…
rampant."
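print(text)
# split the text into word tokens (needs the NLTK tokenizer data to be downloaded)
tokens = nltk.word_tokenize(text)
# count how often each token occurs
freq = nltk.FreqDist()
for i in tokens:
    freq[i]+=1
freq
# the five most common tokens
cw =freq.most_common(5)
cw
# part-of-speech tag for every token
pos=nltk.pos_tag(tokens)
pos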
c) Make list of meta characters and make the pattern for email and
phone number.
import re
# raw strings keep the backslashes in the pattern intact
reg = r'^(\w|\.|\_|\-)+[@](\w|\_|\-|\.)+[.]\w{2,3}$'
em = input("enter email")
if re.search(reg, em):
    print("Nice its a valid email id, enjoy")
else:
    print("So sorry not a valid email id")
import re
# ten digits, the first one between 6 and 9
regex = r'^[6-9]\d{9}$'
phone = input("enter number")
if re.search(regex, phone):
    print("Valid Phone")
else:
    print("Invalid Phone")
||| DAY - 8
Task: Explain TF-IDF with example.
t — term (word)
d — document (set of words)
N — count of corpus
corpus — the total document set
Term Frequency (tf): gives us the frequency of the word in each document in the corpus. It is
the ratio of number of times the word appears in a document compared to the total number of
words in that document. It increases as the number of occurrences of that word within the
document increases. Each document has its own tf.
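Term frequency as defined above can be computed by hand. The sketch below also includes the textbook inverse document frequency, idf(t) = log(N / df(t)), where df(t) is the number of documents containing t; this formula and the function names are assumptions added for illustration. scikit-learn's TfidfVectorizer uses a smoothed idf and normalizes each row by default, so its numbers will not match these.

```python
import math
from collections import Counter

corpus = ['what is your name', 'where do you live',
          'do you live in surat', 'what is your latname']

def term_frequency(document):
    """tf: count of each word divided by the number of words in the document."""
    words = document.lower().split()
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def inverse_document_frequency(term, documents):
    """Textbook idf: log(N / number of documents that contain the term)."""
    df = sum(1 for doc in documents if term in doc.lower().split())
    return math.log(len(documents) / df) if df else 0.0

tf = term_frequency(corpus[2])
idf = inverse_document_frequency('live', corpus)
print(tf['live'])         # 0.2, one of five words
print(idf)                # log(4 / 2)
print(tf['live'] * idf)   # tf-idf of 'live' in the third sentence
```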
EXAMPLE.
from sklearn.feature_extraction.text import TfidfVectorizer
sentences=['what is your name','where do you live','do you live in surat','what is your latname']
vectors=TfidfVectorizer()
vectors.fit(sentences)
transform=vectors.transform(sentences)
print(transform)
print(vectors.vocabulary_)
transform.shape
import pandas as pd
d=[1,2,3,4,5]
g=[6,7,8,9,10]
a=pd.Series(d)
b=pd.Series(g)
a+b
# get_feature_names_out() replaces get_feature_names() in recent scikit-learn versions
df=pd.DataFrame(vectors.fit_transform(sentences).toarray(),columns=vectors.get_feature_names_out())
vectors=TfidfVectorizer(binary=True,min_df=2,max_df=3)
vectors.fit(sentences)
transform=vectors.transform(sentences)
print(transform)
print(vectors.vocabulary_)
df=pd.DataFrame(vectors.fit_transform(sentences).toarray(),columns=vectors.get_feature_names_out())
||| DAY - 9
Task: Explain three techniques of stemming.
Stemming is the process of reducing inflected words to their root form (stem), so that a
group of related words is mapped to the same stem, even if the stem itself is not a valid word in the language.
SNOWBALL Stemming:
Compared to the Porter Stemmer, the Snowball Stemmer can also map non-English words.
Since it supports other languages, the Snowball Stemmer can be called a multi-lingual
stemmer. The Snowball stemmers are also imported from the nltk package. This stemmer is
based on a programming language called ‘Snowball’ that processes small strings, and it is
one of the most widely used stemmers. Many of the changes in the Snowball stemmer were made because
of issues noticed with the Porter stemmer. There is about a 5% difference between the way
Snowball stems and the way Porter stems.
LANCASTER Stemming:
The Lancaster Stemmer is more aggressive and dynamic than the other two
stemmers. It is very fast, but the algorithm can give confusing results when dealing
with small words, and it is not as efficient as the Snowball Stemmer. The Lancaster
stemmer stores its rules externally and uses an iterative algorithm.
PORTER:
It is not too complex, and development on it is frozen. It is a good basic
stemmer to start with, but it is not advised for production or complex applications. It is based
on the idea that the suffixes in the English language are made up of a combination of smaller
and simpler suffixes. This stemmer is known for its speed and simplicity. The main
applications of the Porter Stemmer include data mining and information retrieval. However, its
applications are limited to English words. A group of related words is mapped on to the
same stem, and the output stem is not necessarily a meaningful word. The algorithm is
fairly lengthy and is known to be the oldest of these stemmers.
||| DAY - 10
Task:
1) Explain collaborative and content based filtering with example.
2) Explain cosine similarity with equation
3) Explain RMSE and MSE with mathematical equation.
Users will have a table of the items they chose or liked, with their ratings.
Based on the similarities, a prediction can be made of what the user might like,
based on what similar users did.
The list is filtered and matched to users who used the same items, for comparison
and recommendations.
Collaborative algorithms use “User Behaviour” for recommending items. They
exploit the behaviour of other users and items in terms of transaction history, ratings,
selection and purchase information. Other users' behaviour and preferences over the
items are used to recommend items to new users. In this case, the features of the items
are not known.
Cosine similarity measures the similarity between two vectors of an inner product space. It is
measured by the cosine of the angle between two vectors and determines whether two vectors
are pointing in roughly the same direction. It is often used to measure document similarity in
text analysis.
Cosine similarity can be used to compare documents or, say,
to rank documents with respect to a given vector of query words. Let x and y be two
vectors for comparison. Using the cosine measure as a similarity function, we have
$$\mathrm{sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
where x · y is the dot product of the two vectors and ||x|| is the Euclidean norm of x.
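Cosine similarity is the dot product of the two vectors divided by the product of their Euclidean norms. A minimal sketch in plain Python with made-up vectors (scikit-learn provides cosine_similarity for whole matrices):

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
print(same_direction)   # approximately 1, the vectors point the same way
print(orthogonal)       # 0.0, the vectors are perpendicular
```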
Mean Square Error (MSE) is an error metric for judging the accuracy and error rate of
a machine learning algorithm on a regression problem. MSE is a risk function
that gives the average squared difference between the predicted and
the actual values of a feature or variable.
RMSE is an acronym for Root Mean Square Error, which is the square root of the
value obtained from the Mean Square Error function. Using RMSE, we can easily express
the difference between the estimated and actual values of a parameter of the model, in the same units as the values themselves.
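In equation form, MSE = (1/n) * sum((y_i - p_i)^2) and RMSE = sqrt(MSE), where y_i is the actual value, p_i the predicted value and n the number of samples. A small sketch with made-up numbers:

```python
import math

def mse(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the mean square error."""
    return math.sqrt(mse(actual, predicted))

actual = [3, 5, 2, 7]
predicted = [2, 5, 4, 8]
print(mse(actual, predicted))    # (1 + 0 + 4 + 1) / 4 = 1.5
print(rmse(actual, predicted))   # square root of 1.5
```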
||| DAY - 11
Task: Perform recommendation (based on rating) with any dataset
In this task, we were asked to perform recommendation based on rating with
any dataset. I was provided with the Amazon Electronics Rating Dataset, whose attributes
(columns) are reviewerID, asin, reviewerName, helpful, reviewText, overall, summary,
unixReviewTime and reviewTime. We mainly consider reviewerID, asin,
overall, reviewText and summary for the recommendation and sentiment analysis.
The first step is to import the libraries, which is the starting point for any exploratory
or sentiment analysis.
Then we import the dataset, which is a json file downloaded
from the link http://jmcauley.ucsd.edu/data/amazon/.
The json was imported and decoded to convert it from json format to csv format. The sample dataset
is shown below:
Sample review Dataset:
"reviewerID": "A2SUAM1J3GNN3B",
"asin": "0000013714",
"reviewText": "I bought this for my husband who plays the piano. He is having a
wonderful time playing these old hymns. The music is at times hard to read because we
think the book was published for singing from more than playing from. Great purchase
though!",
"overall": 5.0,
"unixReviewTime": 1252800000,
||| DAY - 12
Task: Find the key from the dictionary containing 1-5 ratings as keys and 40
values.
dict1 = {'1':[-5,-4,-3.75,-3,-2.5,-2.25,-2,-1.5,-1.25,-1,-0.75,-0.5,-0.25],'2':[-0.24,0.25,0.5,0.75,1] ,'3': [1.01,1.25,1.5,2],'4': [2.01,2.25,2.5,3],'5':[3.01,3.75,4,5]}
def fun(val):
    # return the first rating key that has a value greater than or equal to val
    for i in dict1:
        for j in dict1[i]:
            if val <= j:
                return i
||| DAY - 13
Task: Perform recommendation (based on Rating) with any dataset
import pandas as pd
import numpy as np
from textblob import TextBlob
from sklearn.metrics.pairwise import cosine_similarity
LOADING THE CSV FILE
df = pd.read_csv("cloths-rating.csv")
df.head()
Find Sentiment on Text(Reviews)
def sentiment_calc(text):
    try:
        return TextBlob(str(text)).sentiment.polarity
    except Exception:
        return None
df['sentiment'] = df['Text'].apply(sentiment_calc)
df
def fun(val):
    # return the first rating key that has a value greater than or equal to val
    for i in dict1:
        for j in dict1[i]:
            if val <= j:
                return i
from scipy.sparse import csr_matrix
# store the pivot table as a sparse matrix
df_pivot_matrix = csr_matrix(df_pivot.values)
print(df_pivot_matrix)
model_knn.fit(df_pivot_matrix)
data_dict={}
for i in range(0, len(similarity.flatten())): # loop over the similarity array
    if i == 0:
        print('Recommendations for {0}:\n'.format(df_pivot.index[query_index]))
    else:
        data_dict[str(df_pivot.index[indices.flatten()[i]])] = float(similarity.flatten()[i])
        print(f'{df_pivot.index[indices.flatten()[i]]}, with similarity distance = {similarity.flatten()[i]:.20f}')
print(data_dict)
There is a very slight difference between the recommendation using rating and the recommendation
using review.
||| DAY - 14
Task: Explain surprise package and its working
Scikit-Surprise is an easy-to-use Python scikit for recommender systems; another example of
a Python scikit is Scikit-learn, which has many useful estimators. The singular value
decomposition (SVD) algorithm shown here uses gradient descent to minimize the
squared error between the predicted rating and the actual rating, eventually arriving at the best model.
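The idea of learning factors by gradient descent can be sketched in plain Python. This is a simplified illustration, not Surprise's own code: it predicts a rating as the global mean plus the dot product of a user vector and an item vector, and leaves out the user and item bias terms that Surprise's SVD also learns. The ratings, the function names and all hyperparameters are made up.

```python
import random

def factorize(ratings, n_factors=2, n_epochs=500, lr=0.02, reg=0.02, seed=0):
    """Learn user and item factor vectors from (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random starting values for every factor.
    p = {u: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.uniform(-0.1, 0.1) for _ in range(n_factors)] for i in items}
    mean = sum(r for _, _, r in ratings) / len(ratings)
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mean + sum(a * b for a, b in zip(p[u], q[i])))
            # Gradient step on the regularized squared error.
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return mean, p, q

def predict(model, user, item):
    mean, p, q = model
    return mean + sum(a * b for a, b in zip(p[user], q[item]))

def squared_error(model, ratings):
    return sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)

ratings = [('u1', 'i1', 5), ('u1', 'i2', 3), ('u2', 'i1', 4),
           ('u2', 'i3', 1), ('u3', 'i2', 2), ('u3', 'i3', 1)]
before = squared_error(factorize(ratings, n_epochs=0), ratings)
after = squared_error(factorize(ratings), ratings)
print(before, after)   # the training error drops as the factors are learned
```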