Python Assignment 3

This document discusses text preprocessing and feature extraction techniques in natural language processing. It loads a corpus of text documents, applies bag-of-words and TF-IDF algorithms to extract features and calculate feature weights. Specifically, it uses sklearn's CountVectorizer and TfidfVectorizer to transform text into numerical vectors and calculate IDF weights. It then prints the extracted features and computed IDF values. Additionally, it implements a custom IDF calculation for comparison purposes.
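For reference, with its default smooth_idf=True setting, sklearn's TfidfVectorizer computes the IDF of a term t over n documents as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where df(t) is the number of documents containing t. As a worked check against the Task 1 output below: 'document' occurs in 3 of the 4 corpus documents, so idf('document') = ln(5/4) + 1 ≈ 1.2231, which matches the printed value 1.22314355.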


In [68]: #TASK 1

# import all required libraries
import pandas as pd
import math
import numpy as np
from scipy import sparse
from scipy.stats import uniform
from sklearn.feature_extraction.text import TfidfVectorizer

# input data
corpus = ['this is the first document',
          'this document is the second document',
          'and this is the third one',
          'is this the first document']

# use the fit method to compute the Bag of Words vocabulary
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
skl_output = vectorizer.transform(corpus)
bow = vectorizer.get_feature_names()  # get_feature_names_out() in newer sklearn
print(bow)

IDF_reference = vectorizer.idf_
print(IDF_reference)
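skl_output is computed above but never inspected. As a small illustrative addition (not part of the original notebook), the TF-IDF row for the first document can be viewed like this:

# illustrative: dense TF-IDF vector of the first document;
# its columns line up with the feature names printed above
print(skl_output[0].toarray())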

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(corpus)

matrix = CountVectorizer()
matrix.fit(corpus)  # after this statement, matrix has built the vocabulary with all the unique words

# transform() should be called only after fit()
# to convert the sentences into numerical vectors, we call transform()
# the first feature name corresponds to the first column in the transformed matrix,
# the 2nd feature name corresponds to the 2nd column, and so on
print(matrix.transform(corpus).toarray())
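The column order above comes from the fitted vocabulary. As an illustrative aside (not in the original), the word-to-column mapping can be printed from the vectorizer's vocabulary_ attribute:

# illustrative: CountVectorizer exposes vocabulary_, a dict mapping word -> column index
for word, col in sorted(matrix.vocabulary_.items(), key=lambda kv: kv[1]):
    print(col, word)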

# Here we print the sklearn TfidfVectorizer idf values after applying the fit method.
# After using the fit function on the corpus, the vocabulary has 9 words in it,
# and each has its own idf value.

#compute IDF using custom method

for i in range(len(bow)):
    Y = 0
    word = bow[i]
    for j in range(len(corpus)):
        tokens = corpus[j].split()
        if word in tokens:
            #print(word)
            #print(tokens)
            Y = Y + 1
    X = len(corpus)
    XY = math.log((1 + X) / (1 + Y))
    IDF_custom = XY + 1
    print(IDF_custom)
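As a quick sanity check (an illustrative addition, not part of the original assignment), the custom values can be compared against sklearn's idf_ array in one shot:

# illustrative: recompute the custom IDFs into a list and compare with sklearn's
custom = [math.log((1 + len(corpus)) / (1 + sum(1 for doc in corpus if w in doc.split()))) + 1
          for w in bow]
print(np.allclose(custom, IDF_reference))  # expected: True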

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
[1.91629073 1.22314355 1.51082562 1.         1.91629073 1.91629073
 1.         1.91629073 1.        ]
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
1.916290731874155
1.2231435513142097
1.5108256237659907
1.0
1.916290731874155
1.916290731874155
1.0
1.916290731874155
1.0

In [15]: #TASK 2

import pickle
import numpy as np

with open(r"E:\Applied_AI\Assignments\cleaned_strings", "rb") as f:
    data = pickle.load(f)

# printing the length of the corpus loaded
print("Number of documents in data = ", len(data))

# collect all unique words using the fit and transform functions
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(data)
skl_output = vectorizer.transform(data)
bow = vectorizer.get_feature_names()

# compute IDF
IDF = vectorizer.idf_

# sort IDF in descending order
sorted_IDF = np.sort(IDF)
required_IDF = sorted_IDF[::-1]

# print the top IDF values (note: the slice [0:49] yields 49 values, not 50)
print(required_IDF[0:49])

Number of documents in data =  746
[6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918
 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918
 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918
 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918
 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918
 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918 6.922918
 6.922918]
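Sorting the bare values discards which words they belong to. As an illustrative extension (not in the original), np.argsort on the same IDF array recovers the words attached to the largest values:

# illustrative: pair the largest IDF values with their words
top = np.argsort(IDF)[::-1][:10]  # indices of the 10 largest IDF values
for idx in top:
    print(bow[idx], IDF[idx])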

