NLP Lab1
Sentiment Classification
SHINE-MING WU SCHOOL OF INTELLIGENT ENGINEERING
Autumn 2024
Prerequisites
1. You need to install the gensim package:
pip install --upgrade gensim
1. Here, we will use the Text8 corpus and train a word2vec model using Gensim. Gensim's downloader module is an API for downloading, getting information about, and loading corpora/datasets; this question shows how to train a Word2Vec model.
2. Using the Word2Vec model obtained in Q1, find the top 10 most similar words for each of the following words: [cat apple student happy quickly]
3. Find the top 10 most similar words for the same list of words, but in contrast to Q2, use the model produced by GloVe. [cat apple student happy quickly]
4. Visualize word embeddings. You are required to use TSNE/PCA to reduce the high-dimensional word vectors to two dimensions and plot them on a graph. Visualize these words: [he she his her female male woman man women men father mother sister brother boy girl housekeeper mechanic carpenter dancer engineer chief lawyer developer physician driver librarian nurse doctor cashier secretary prince]
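A minimal sketch of the 2-D reduction and plot. Random stand-in vectors and a subset of the word list are used so the snippet runs on its own; in the lab, swap in `vecs = np.array([model.wv[w] for w in words])` with the full list. `sklearn.manifold.TSNE` can replace PCA on the same matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

words = ["he", "she", "his", "her", "man", "woman"]  # subset for illustration

# Stand-in 100-d vectors; in the lab, take them from your trained model.
rng = np.random.default_rng(0)
vecs = rng.normal(size=(len(words), 100))

# Reduce to two dimensions for plotting.
coords = PCA(n_components=2).fit_transform(vecs)  # shape (n_words, 2)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.savefig("embeddings_2d.png")
```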
5. It's important to be cognizant of the biases (gender, race, sexual orientation, etc.) implicit in our word embeddings. Bias can be dangerous because it can reinforce stereotypes through applications that employ these models.
Run the cell of Q5 in the .ipynb to examine (a) which terms are most similar to "girl" and "toy" and most dissimilar to "boy", and (b) which terms are most similar to "boy" and "toy" and most dissimilar to "girl". Point out the difference between the list of female-associated words and the list of male-associated words, and explain how it reflects gender bias.
Note: answer in words rather than with code.
Figure 1: Illustration of a CNN architecture for text classification. We depict three filter
region sizes: 2, 3 and 4, each of which has 2 filters. Filters perform convolutions on the
sentence matrix and generate (variable-length) feature maps; 1D max pooling is performed
over each map, i.e., the largest number from each feature map is recorded. Thus a univariate
feature vector is generated from all six maps, and these 6 features are concatenated to form
a feature vector for the penultimate layer. The final softmax layer then receives this feature
vector as input and uses it to classify the sentence; here we assume binary classification and
hence depict two possible output states. Image Source: [1]
Task 2: CNN-Based Sentiment Classification
Prerequisites
1. You need to install the keras and tensorflow packages:
pip install tensorflow
2. Or, with conda:
conda install tensorflow
1. Use a text CNN to finish the sentiment classification task. Note that everyone should implement a unified model architecture (see Section 2 in Task 2) rather than designing a new one.
Suppose the number of words in each document is seq_len, and each word is associated with an embedding of dimension emb_dim. If document lengths vary, we need to pad so that every document has the same length.
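A minimal sketch of what padding does (Keras provides `pad_sequences` for this; note it pads at the front by default, while this illustration pads at the end for clarity):

```python
import numpy as np

def pad(sequences, seq_len, pad_id=0):
    """Pad or truncate each token-id sequence to seq_len so the
    batch forms a rectangular (n_docs, seq_len) array."""
    out = np.full((len(sequences), seq_len), pad_id, dtype=int)
    for i, seq in enumerate(sequences):
        trunc = seq[:seq_len]          # truncate docs longer than seq_len
        out[i, :len(trunc)] = trunc    # copy; the rest stays pad_id
    return out

docs = [[3, 7, 2], [5, 1, 4, 9, 8], [6]]
print(pad(docs, seq_len=4))
# Each row now has length 4: short docs padded with 0, long docs truncated.
```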
A CNN mainly contains two kinds of layers, namely the convolutional layer and the pooling layer. Specifically, in text classification we will use Conv1D and GlobalMaxPooling1D from keras.layers.
• Conv1D(...)
For more details, please read https://keras.io/layers/convolutional/.
Input shape
3D tensor with shape: (batch_size, seq_len, emb_dim)
Output shape
3D tensor with shape: (batch_size, new_steps, filters)
where new_steps = (seq_len - kernel_size) / stride + 1 (i.e., seq_len - kernel_size + 1 when stride = 1)
Example
The red filters in Figure 1 should be implemented as:
Conv1D(filters=2, kernel_size=4, strides=1, activation='relu')
The green filters should be implemented as:
Conv1D(filters=2, kernel_size=3, strides=1, activation='relu')
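A quick numeric sanity check of the output-length formula for stride 1, using numpy's 'valid' correlation (which, like Conv1D without padding, slides a length-k window over a length-n sequence):

```python
import numpy as np

seq_len, kernel_size = 7, 3
x = np.arange(seq_len, dtype=float)   # a length-7 "sentence" of scalars
kernel = np.ones(kernel_size)         # a length-3 filter

# 'valid' mode keeps only fully overlapping window positions.
out = np.correlate(x, kernel, mode="valid")
print(len(out))  # 5 == seq_len - kernel_size + 1
```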
• GlobalMaxPooling1D(...)
...: other arguments. In this lab, we just use the defaults. For more details, please read https://keras.io/layers/pooling/.
Input shape
3D tensor with shape: (batch_size, new_steps, filters)
Output shape
2D tensor with shape: (batch_size, filters)
Example
The max pooling on the outputs of red/green/yellow filters in Figure 1 should be
implemented as:
GlobalMaxPooling1D()
Note that GlobalMaxPooling1D is a global max pooling operation for 1D temporal data. See https://keras.io/api/layers/pooling_layers/global_max_pooling1d/
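What GlobalMaxPooling1D computes can be sketched in plain numpy: take the maximum over the time axis, turning (batch_size, new_steps, filters) into (batch_size, filters):

```python
import numpy as np

# One batch element, new_steps=3 positions, filters=2 feature maps.
feature_maps = np.array([[[0.1, 0.9],
                          [0.7, 0.2],
                          [0.4, 0.5]]])  # shape (1, 3, 2)

# Global max pooling: keep the single largest value from each feature map.
pooled = feature_maps.max(axis=1)
print(pooled)  # [[0.7 0.9]]
```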
• Dense(...)
Dense implements the operation: output = activation(dot(input, kernel) + bias), where activation is the element-wise activation function passed as the activation argument, kernel is a weights matrix created by the layer, and bias is a bias vector created by the layer (only applicable if use_bias is True). These are all attributes of Dense. https://keras.io/api/layers/core_layers/dense/
Input shape
N-D tensor with shape: (batch_size, ..., input_dim). The most common situation is a 2D input with shape (batch_size, input_dim).
Output shape
N-D tensor with shape: (batch_size, ..., units). For instance, for a 2D input with shape (batch_size, input_dim), the output would have shape (batch_size, units).
Example
The final outputs in Figure 1 should be implemented as:
Dense(units=2, activation=’softmax’)
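Putting the pieces together, the Figure 1 architecture can be sketched in Keras as below, assuming filter region sizes 2/3/4 with 2 filters each as in the figure; vocab_size, emb_dim, and seq_len are placeholder values, not lab-mandated ones.

```python
from tensorflow.keras import layers, Model

vocab_size, emb_dim, seq_len = 10000, 100, 50

inputs = layers.Input(shape=(seq_len,))
x = layers.Embedding(vocab_size, emb_dim)(inputs)  # (batch, seq_len, emb_dim)

# One Conv1D + global max pooling branch per filter region size.
branches = []
for kernel_size in (2, 3, 4):
    c = layers.Conv1D(filters=2, kernel_size=kernel_size,
                      strides=1, activation="relu")(x)
    branches.append(layers.GlobalMaxPooling1D()(c))  # (batch, 2) each

# Concatenate the 6 pooled features, then classify with softmax.
concat = layers.Concatenate()(branches)              # (batch, 6)
outputs = layers.Dense(units=2, activation="softmax")(concat)

model = Model(inputs, outputs)
model.summary()
```

Compile with a cross-entropy loss (e.g. `model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])`) before calling `fit` on the padded document matrix.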
Submission
Please submit the files, including program output, written answers, and Python scripts, to Blackboard. After you finish the assignments, make sure you include the following header information at the beginning of your code:
# author: Your name
# student id: Your student ID
If you want onsite grading during the lab, you can ask a TA to grade your lab submission by showing your code and outputs. Note that even with onsite grading, you still need to submit the files to Blackboard to keep an electronic record of your assignment.
References
[1] Ye Zhang and Byron Wallace. A sensitivity analysis of (and practitioners’ guide to) con-
volutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820,
2015.