
Lab 1: Word Embedding and CNN Based Sentiment Classification
SHINE-MING WU SCHOOL OF INTELLIGENT ENGINEERING
Autumn 2024

Task 1 Word embedding

Prerequisites
1. You need to install the gensim package:

   pip install --upgrade gensim

Q1 Train a word2vec model using Gensim [code]

Hint: you could refer to Q1 in the .ipynb.

1. Here, we will use the Text8 corpus and train a word2vec model with Gensim. The
gensim.downloader module is an API for downloading, getting information about, and
loading corpora/datasets; the notebook shows how to train a Word2Vec model.
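As a starting point, here is a minimal sketch of loading Text8 and training Word2Vec with Gensim; the hyperparameters and the save path are illustrative assumptions, not values required by the lab.

   import gensim.downloader as api
   from gensim.models import Word2Vec

   # Download and load the Text8 corpus (an iterable of tokenized sentences).
   corpus = api.load("text8")

   # Train a Word2Vec model; these hyperparameters are illustrative defaults.
   model = Word2Vec(sentences=corpus, vector_size=100, window=5,
                    min_count=5, workers=4)

   # Save the model for the later questions (the file name is an assumption).
   model.save("text8_word2vec.model")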

Q2 Find top 10 similar words [code].

Hint: you could refer to Q2 in the .ipynb.

1. Using the Word2Vec model obtained in Q1, find the top 10 most similar words for each
of the following words: [cat apple student happy quickly]

2. Find the top 10 most similar words for the same words, but in contrast to item 1, here
you will use a model produced by GloVe (see the sketch below). [cat apple student
happy quickly]
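A minimal sketch of both queries, assuming model is the Word2Vec model trained in Q1; the pretrained model name glove-wiki-gigaword-100 is an assumption, and any GloVe model from the gensim downloader works the same way.

   import gensim.downloader as api

   # Pretrained GloVe vectors via the gensim downloader (name is an assumption).
   glove = api.load("glove-wiki-gigaword-100")

   for word in ["cat", "apple", "student", "happy", "quickly"]:
       print(word, model.wv.most_similar(word, topn=10))  # Word2Vec from Q1
       print(word, glove.most_similar(word, topn=10))     # GloVe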

Q3 Analogies with Word Vectors [code + written]

Hint: you could refer to Q3 in the .ipynb.


1. Analogies with word vectors: solve the following analogies [code] (a sketch follows this list)
man : king :: woman : ?
tokyo : japan :: bangkok : ?
helsinki : finland :: paris : ?
argentina : spanish :: egypt : ?
bear : cub :: deer : ?
bee : hive :: bat : ?
banana : yellow :: milk : ?
groom : bride :: husband : ?
nephew : niece :: son : ?

2. Find a bad example of word analogy [code + written]
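Analogies of the form a : b :: c : ? are solved with vector arithmetic, b - a + c. A minimal sketch using the GloVe vectors loaded in Q2:

   # man : king :: woman : ?  ->  king - man + woman
   print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
   # For GloVe vectors this typically returns [('queen', ...)].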

Q4 Plot word embeddings [code]


Hint: you could refer to Q4 in the .ipynb.

1. Visualize word embeddings. You are required to use t-SNE/PCA to reduce the
high-dimensional word vectors to two dimensions and plot them on a graph (a sketch
follows). Visualize these words: [he she his her female male woman man women men
father mother sister brother boy girl housekeeper mechanic carpenter dancer engineer
chief lawyer developer physician driver librarian nurse doctor cashier secretary
prince]
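A minimal PCA sketch using the GloVe vectors loaded in Q2; t-SNE works the same way via sklearn.manifold.TSNE.

   import matplotlib.pyplot as plt
   from sklearn.decomposition import PCA

   words = ["he", "she", "his", "her", "female", "male", "woman", "man",
            "women", "men", "father", "mother", "sister", "brother", "boy",
            "girl", "housekeeper", "mechanic", "carpenter", "dancer",
            "engineer", "chief", "lawyer", "developer", "physician", "driver",
            "librarian", "nurse", "doctor", "cashier", "secretary", "prince"]

   # Reduce the word vectors to two dimensions.
   points = PCA(n_components=2).fit_transform([glove[w] for w in words])

   # Scatter the 2D points and label each one with its word.
   plt.figure(figsize=(10, 8))
   plt.scatter(points[:, 0], points[:, 1])
   for (x, y), w in zip(points, words):
       plt.annotate(w, (x, y))
   plt.show()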

Q5 Guided Analysis of Bias in Word Vectors [written]


Hint: you could refer to Q5 in the .ipynb.

1. It is important to be cognizant of the biases (gender, race, sexual orientation, etc.)
implicit in our word embeddings. Bias can be dangerous because it can reinforce
stereotypes through applications that employ these models.
Run the cell of Q5 in the .ipynb to examine (a) which terms are most similar to "girl"
and "toy" and most dissimilar to "boy", and (b) which terms are most similar to "boy"
and "toy" and most dissimilar to "girl". Point out the difference between the list of
female-associated words and the list of male-associated words, and explain how it
reflects gender bias. (The queries are of the kind sketched below.)
Note: answer in words rather than code.
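For reference, the Q5 cell runs queries of roughly this form (a sketch, assuming the GloVe vectors loaded in Q2):

   # (a) terms most similar to "girl" and "toy" and most dissimilar to "boy"
   print(glove.most_similar(positive=["girl", "toy"], negative=["boy"]))
   # (b) terms most similar to "boy" and "toy" and most dissimilar to "girl"
   print(glove.most_similar(positive=["boy", "toy"], negative=["girl"]))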

Figure 1: Illustration of a CNN architecture for text classification. We depict three filter
region sizes: 2, 3 and 4, each of which has 2 filters. Filters perform convolutions on the
sentence matrix and generate (variable-length) feature maps; 1D max pooling is performed
over each map, i.e., the largest number from each feature map is recorded. Thus a univariate
feature vector is generated from all six maps, and these 6 features are concatenated to form
a feature vector for the penultimate layer. The final softmax layer then receives this feature
vector as input and uses it to classify the sentence; here we assume binary classification and
hence depict two possible output states. Image Source: [1]

Task 2 CNN Based Sentiment Classification

Prerequisites
1. You need to install the keras and tensorflow packages:

   pip install tensorflow

   or:

   conda install tensorflow

2. Make sure you have the following file(s): lab1_task2, including:

   lab1_task2/
       lab1_task2_skeleton.py
       data_helper.py
       data/
           test.csv
           train.csv
       glove_50d.txt

Q1 Write code to perform the following tasks: [code]

1. Use a text CNN to finish the sentiment classification task. Note that everyone should
implement a unified model architecture (check Section 2 in Task 2) rather than designing
a new one.

2. Tutorial for CNN

Suppose the number of words in each document is seq_len, and each word is associated
with an embedding whose dimension is emb_dim. If the length of the documents varies, we
need to pad them so that every document has the same length.
A CNN mainly contains two kinds of layers, namely the convolutional layer and the pooling
layer. Specifically, for text classification we will use Conv1D and GlobalMaxPooling1D from
keras.layers.

• Conv1D(filters, kernel_size, strides, activation, ...)

  filters: Integer, the number of output filters in the convolution.
  kernel_size: Integer, the length of the 1D convolution window.
  strides: Integer, the stride length of the convolution.
  activation: String, the activation function to use.
  ...: other arguments; in this lab we leave them at their defaults. For more details,
  please read https://keras.io/layers/convolutional/.

  Input shape
  3D tensor with shape: (batch_size, seq_len, emb_dim)
  Output shape
  3D tensor with shape: (batch_size, new_steps, filters)
  where new_steps = floor((seq_len - kernel_size) / strides) + 1, which reduces to
  seq_len - kernel_size + 1 when strides = 1.
  Example
  The red filters in Figure 1 should be implemented as:
  Conv1D(filters=2, kernel_size=4, strides=1, activation='relu')
  The green filters should be implemented as:
  Conv1D(filters=2, kernel_size=3, strides=1, activation='relu')

• GlobalMaxPooling1D(...)

  ...: other arguments; in this lab we leave them at their defaults. For more details,
  please read https://keras.io/layers/pooling/.

  Input shape
  3D tensor with shape: (batch_size, new_steps, filters)
  Output shape
  2D tensor with shape: (batch_size, filters)
  Example
  The max pooling on the outputs of the red/green/yellow filters in Figure 1 should be
  implemented as:
  GlobalMaxPooling1D()
  Note that this is the global max pooling operation for 1D temporal data; see
  https://keras.io/api/layers/pooling_layers/global_max_pooling1d/.

• Dense(...)

  Dense implements the operation output = activation(dot(input, kernel) + bias), where
  activation is the element-wise activation function passed as the activation argument,
  kernel is a weights matrix created by the layer, and bias is a bias vector created by
  the layer (only applicable if use_bias is True). These are all attributes of Dense; see
  https://keras.io/api/layers/core_layers/dense/.

  Input shape
  N-D tensor with shape: (batch_size, ..., input_dim). The most common situation is a
  2D input with shape (batch_size, input_dim).
  Output shape
  N-D tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape
  (batch_size, input_dim), the output would have shape (batch_size, units).
  Example
  The final outputs in Figure 1 should be implemented as:
  Dense(units=2, activation='softmax')
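Putting the three layers together, here is a minimal Keras sketch of the Figure 1 architecture. The values of seq_len and emb_dim and the compile settings are illustrative assumptions; the real input pipeline comes from data_helper.py.

   from tensorflow.keras import layers, models

   seq_len, emb_dim = 100, 50   # hypothetical values; use the ones from your data

   inputs = layers.Input(shape=(seq_len, emb_dim))

   # One Conv1D + GlobalMaxPooling1D branch per filter region size; Figure 1
   # uses region sizes 2, 3 and 4 with 2 filters each.
   branches = []
   for kernel_size in (2, 3, 4):
       conv = layers.Conv1D(filters=2, kernel_size=kernel_size,
                            strides=1, activation='relu')(inputs)
       branches.append(layers.GlobalMaxPooling1D()(conv))

   features = layers.Concatenate()(branches)   # the 6 concatenated features
   outputs = layers.Dense(units=2, activation='softmax')(features)

   model = models.Model(inputs, outputs)
   model.compile(optimizer='adam', loss='categorical_crossentropy',
                 metrics=['accuracy'])
   model.summary()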

Submission

Please submit the files, including program output, written answers, and Python scripts, to
Blackboard. After you finish the assignments, make sure you include the following header
information at the beginning of your code:

# author: Your name
# student id: Your student ID

Copy all the program output into a text file named StudentID_StudentName_lab1_output.txt,
answer the written questions in a text file named StudentID_StudentName_lab1_writtenanswer.txt,
and submit a zipped Python script solution named StudentID_StudentName_lab1.zip containing
all the Python scripts and the aforementioned answer files to Blackboard.

If you want onsite grading during the lab, you can ask the TA to grade your lab submission by
showing your code and outputs. Note that, even with onsite grading, you still need to submit
the files to Blackboard to keep an electronic record of your assignment.

Submission deadline: 8 PM, October 28, 2024.

References
[1] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners' guide to)
convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820,
2015.
