CSC421/2516 Winter 2019 Programming Assignment 1

Programming Assignment 1: Learning Distributed Word Representations
Due Date: Jan. 31, 2018
Based on an assignment by George Dahl

Submission: You must submit two files through MarkUs [1]: a PDF file containing your writeup,
titled a1-writeup.pdf, and your code file language_model.py. Your writeup must be typed.

The programming assignments are individual work. See the Course Information handout [2] for
detailed policies.

You should attempt all questions for this assignment. Most of them can be answered at least
partially even if you were unable to finish earlier questions. If you think your computational results
are incorrect, please say so; that may help you get partial credit.

Introduction
In this assignment we will make neural networks learn about words. One way to do this is to train
a network that takes a sequence of words as input and learns to predict the word that comes next.
This assignment will ask you to implement the backpropagation computations for a neural
language model and then run some experiments to analyze the learned representation. You only
have to write about 5 lines of code, but each line will require you to
think very carefully. You will need to derive the updates mathematically, and then implement them
using matrix and vector operations in NumPy.

Part 1: Network architecture (2pts)

In this assignment, we will train a neural language model like the one we covered in lecture. It
receives as input 3 consecutive words, and its aim is to predict a distribution over the next word
(the target word). We train the model using the cross-entropy criterion, which is equivalent to
maximizing the probability it assigns to the targets in the training set. Hopefully it will also learn
to make sensible predictions for sequences it hasn’t seen before.
The model architecture is as follows:
[1] https://markus.teach.cs.toronto.edu/csc421-2019-01
[2] http://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/syllabus.pdf


The network consists of an input layer, embedding layer, hidden layer and output layer. The
input consists of a sequence of 3 consecutive words, given as integer valued indices. (I.e., the 250
words in our dictionary are arbitrarily assigned integer values from 0 to 249.) The embedding
layer maps each word to its corresponding vector representation. This layer has 3 × D units, where
D is the embedding dimension, and it essentially functions as a lookup table. We share the same
lookup table between all 3 positions, i.e. we don’t learn a separate word embedding for each context
position. The embedding layer is connected to the hidden layer, which uses a logistic nonlinearity.
The hidden layer in turn is connected to the output layer. The output layer is a softmax over the
250 words.
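The forward computation described above can be sketched in NumPy as follows. This is a minimal illustration, not the starter code's implementation: the function name `forward` and the random initialization are assumptions, and the parameter shapes follow the description given for the Params class later in this handout.

```python
import numpy as np

def forward(inputs, word_embedding_weights, embed_to_hid_weights, hid_bias,
            hid_to_output_weights, output_bias):
    """inputs is a (B, 3) integer array; returns (B, NV) softmax probabilities."""
    B = inputs.shape[0]
    # Embedding lookup: (B, 3) indices -> (B, 3, D) vectors -> (B, 3D) rows.
    # The same table is used for all 3 context positions.
    embedding_layer = word_embedding_weights[inputs].reshape(B, -1)
    # Hidden layer with a logistic nonlinearity.
    hidden_in = embedding_layer @ embed_to_hid_weights.T + hid_bias
    hidden_layer = 1.0 / (1.0 + np.exp(-hidden_in))
    # Output layer: softmax over the vocabulary. Subtracting the row maximum
    # leaves the result unchanged and avoids overflow in exp.
    z = hidden_layer @ hid_to_output_weights.T + output_bias
    z = z - z.max(axis=1, keepdims=True)
    y = np.exp(z)
    return y / y.sum(axis=1, keepdims=True)

# Hypothetical random parameters, only to exercise the shapes.
rng = np.random.default_rng(0)
NV, D, NH, B = 250, 16, 128, 4
probs = forward(
    rng.integers(0, NV, size=(B, 3)),
    0.01 * rng.standard_normal((NV, D)),
    0.01 * rng.standard_normal((NH, 3 * D)),
    np.zeros(NH),
    0.01 * rng.standard_normal((NV, NH)),
    np.zeros(NV),
)
print(probs.shape)
```

Each row of `probs` is a distribution over the 250 words for one training case.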
As a warm-up, please answer the following questions, each worth 1 point out of 10.

1. As above, assume we have 250 words in the dictionary and use the previous 3 words as inputs.
Suppose we use a 16-dimensional word embedding and a hidden layer with 128 units. The
trainable parameters of the model consist of 3 weight matrices and 2 sets of biases. What
is the total number of trainable parameters in the model? Which part of the model has the
largest number of trainable parameters?

2. Another method for predicting the next word is an n-gram model, which was mentioned in
Lecture 7. If we wanted to use an n-gram model with the same context length as our network,
we’d need to store the counts of all possible 4-grams. If we stored all the counts explicitly,
how many entries would this table have?
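For comparison with the neural model, here is a minimal sketch of a count-based 4-gram predictor on a toy token list. The helper names and the toy sentence are hypothetical. Note that this sketch stores only the contexts that actually occur, which is a different design from the explicit table of all possible 4-grams that question 2 asks about.

```python
from collections import Counter, defaultdict

def count_ngrams(tokens, n=4):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def next_word_distribution(counts, context):
    """Relative frequencies of the words seen after the given context."""
    following = counts.get(tuple(context))
    if not following:
        return {}  # unseen context: the counts alone give no prediction
    total = sum(following.values())
    return {word: c / total for word, c in following.items()}

# Toy data, invented for illustration only.
tokens = "he is the best . he is the one . she is the best .".split()
counts = count_ngrams(tokens)
dist = next_word_distribution(counts, ["he", "is", "the"])
print(dist)
```

An unseen context returns an empty dictionary, which illustrates why a pure count model cannot generalize to 4-grams that never appeared in the data.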


Starter code and data

Download and extract the archive from the course web page
http://www.cs.toronto.edu/~rgrosse/courses/csc421_2019/assignments/a1-code.zip.
Look at the file raw_sentences.txt. It contains the sentences that we will be using for this
assignment. These sentences are fairly simple ones and cover a vocabulary of only 250 words.
We have already extracted the 4-grams from this dataset and divided them into training,
validation, and test sets. To inspect this data, run the following within IPython:

import pickle
with open('data.pk', 'rb') as f:
    data = pickle.load(f)

Now data is a Python dict which contains the vocabulary, as well as the inputs and targets
for all three splits of the data. data['vocab'] is a list of the 250 words in the dictionary;
data['vocab'][0] is the word with index 0, and so on. data['train_inputs'] is a 372,500 × 3
matrix where each row gives the indices of the 3 context words for one of the 372,500 training
cases. data['train_targets'] is a vector giving the index of the target word for each training
case. The validation and test sets are handled analogously.
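To read a training case as words rather than indices, something like the following sketch may help. The helper `decode_case` and the tiny stand-in arrays are hypothetical. With the real data you would pass data['vocab'], data['train_inputs'] and data['train_targets'] instead.

```python
import numpy as np

def decode_case(vocab, inputs, targets, i):
    """Return the 3 context words and the target word for training case i."""
    context = [vocab[j] for j in inputs[i]]
    return context, vocab[targets[i]]

# Tiny stand-in for the real data, invented for illustration only.
vocab = ['all', 'day', 'we', 'play', '.']
train_inputs = np.array([[2, 3, 0], [3, 0, 1]])
train_targets = np.array([1, 4])

context, target = decode_case(vocab, train_inputs, train_targets, 0)
print(context, '->', target)
```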
Now look at the file language_model.py, which contains the starter code for the assignment.
Even though you only have to modify two specific locations in the code, you may want to read
through this code before starting the assignment. There are three classes defined in this file:

• Params, a class which represents the trainable parameters of the network, i.e. all of the edges
in the above figure. This class has five fields:

– word_embedding_weights, a matrix of size NV × D, where NV is the number of words
in the vocabulary and D is the embedding dimension.
– embed_to_hid_weights, a matrix of size NH × 3D, where NH is the number of hidden
units. The first D columns represent connections from the embedding of the first context
word, the next D columns for the second context word, and so on.
– hid_bias, a vector of length NH
– hid_to_output_weights, a matrix of size NV × NH
– output_bias, a vector of length NV

• Activations, a class which represents the activations of all the network’s units (i.e. the
boxes in the above figure) on a batch of data. This has three fields:

– embedding_layer, a B × 3D matrix (where B is the batch size), representing
the activations for the embedding layer on all the cases in a batch. The first D columns
represent the embeddings for the first context word, and so on.
– hidden_layer, a B × NH matrix representing the hidden layer activations for a batch
– output_layer, a B × NV matrix representing the output layer activations for a batch

• Model, a class for the language model itself, which includes a lot of routines related to training
and making predictions. These methods will be discussed in later sections.

Part 2: Training the model (4pts)

In this part of the assignment, you will implement a method which computes the gradient using
backpropagation. To start you out, the Model class contains several important methods used in
training:

• compute_activations computes the activations of all units on a given input batch

• compute_loss computes the total cross-entropy loss on a mini-batch

• evaluate computes the average cross-entropy loss for a given set of inputs and targets
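The distinction between the total loss on a mini-batch and the average loss over a set of cases can be sketched as below. This is an illustration under assumptions, not the starter code: the function names are hypothetical, the targets are assumed to be integer word indices, and output_layer is assumed to hold softmax probabilities.

```python
import numpy as np

def total_cross_entropy(output_layer, targets):
    """Sum over cases of -log(probability assigned to the target word)."""
    B = output_layer.shape[0]
    # Pick out, for each row, the probability of that row's target.
    target_probs = output_layer[np.arange(B), targets]
    return -np.sum(np.log(target_probs))

def average_cross_entropy(output_layer, targets):
    """Total cross-entropy divided by the number of cases."""
    return total_cross_entropy(output_layer, targets) / output_layer.shape[0]

# Toy probabilities over a 3-word vocabulary, invented for illustration.
output_layer = np.array([[0.5, 0.25, 0.25],
                         [0.1, 0.8, 0.1]])
targets = np.array([0, 1])
total = total_cross_entropy(output_layer, targets)
average = average_cross_entropy(output_layer, targets)
print(total, average)
```

The average is the easier quantity to compare across data sets of different sizes, such as the training and validation sets.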

You will need to complete the implementation of two additional methods which are needed for
training:

• compute_loss_derivative computes the derivative of the loss function with respect to the
output layer inputs. In other words, if C is the cost function, and the softmax computation
is

$$y_i = \frac{e^{z_i}}{\sum_j e^{z_j}},$$

this function should compute a B × NV matrix where the entries correspond to the partial
derivatives ∂C/∂z_i.

• back_propagate is the function which computes the gradient of the loss with respect to model
parameters using backpropagation. It uses the derivatives computed by compute_loss_derivative.
Some parts are already filled in for you, but you need to compute the matrices of derivatives
for embed_to_hid_weights, hid_bias, hid_to_output_weights, and output_bias. These
matrices have the same sizes as the parameter matrices (see previous section).

In order to implement backpropagation efficiently, you need to express the computations in terms
of matrix operations, rather than for loops. You should first work through the derivatives with pencil
and paper. Start by applying the chain rule to compute the derivatives with respect to individual units,
weights, and biases. Next, take the formulas you’ve derived, and express them in matrix form. You
should be able to express all of the required computations using only matrix multiplication, matrix
transpose, and elementwise operations, with no for loops! If you want inspiration, read through the
code for Model.compute_activations and try to understand how the matrix operations correspond
to the computations performed by all the units in the network.
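As a small illustration of how per-unit formulas turn into matrix operations, the sketch below computes the inputs to a layer of units twice: once with explicit loops over cases, units and incoming connections, and once with a single matrix expression. It covers only a forward computation with hypothetical names and sizes. The gradient formulas are left for you to derive.

```python
import numpy as np

def layer_inputs_loop(X, W, b):
    """z[n, i] = b[i] + sum_j W[i, j] * X[n, j], written unit by unit."""
    B, n_in = X.shape
    n_out = W.shape[0]
    z = np.zeros((B, n_out))
    for n in range(B):              # each case in the batch
        for i in range(n_out):      # each unit in the layer
            total = b[i]
            for j in range(n_in):   # each incoming connection
                total += W[i, j] * X[n, j]
            z[n, i] = total
    return z

def layer_inputs_matrix(X, W, b):
    """The same computation as one matrix product plus a broadcast bias."""
    return X @ W.T + b

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 6))   # B x n_in
W = rng.standard_normal((4, 6))   # n_out x n_in
b = rng.standard_normal(4)
print(np.allclose(layer_inputs_loop(X, W, b), layer_inputs_matrix(X, W, b)))
```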


To make your life easier, we have provided the routine checking.check_gradients, which
checks your gradients using finite differences. You should make sure this check passes before
continuing with the assignment.
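The idea behind a finite-difference gradient check can be sketched as follows. This is a generic illustration on a simple function whose gradient is known. It is not the code of checking.check_gradients, whose internals are not shown in this handout, and the helper names are hypothetical.

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of scalar f at array x."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        original = x[idx]
        x[idx] = original + eps
        f_plus = f(x)
        x[idx] = original - eps
        f_minus = f(x)
        x[idx] = original  # restore the entry before moving on
        grad[idx] = (f_plus - f_minus) / (2 * eps)
    return grad

def relative_error(a, b):
    """Largest absolute difference, scaled by the largest magnitude involved."""
    scale = max(np.abs(a).max(), np.abs(b).max(), 1e-12)
    return np.abs(a - b).max() / scale

# Test function f(W) = 0.5 * sum(W^2), whose exact gradient is W itself.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
estimate = numerical_gradient(lambda w: 0.5 * np.sum(w ** 2), W)
err = relative_error(estimate, W)
print(err)
```

A small relative error suggests the analytic gradient is right. A large one usually points to a sign error, a missing transpose, or a missing sum over the batch.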
Once you’ve implemented the gradient computation, you’ll need to train the model. The
function train in language_model.py implements the main training procedure. It takes two arguments:
• embedding_dim: The number of dimensions in the distributed representation.

• num_hid: The number of hidden units

For example, execute the following:
model = language_model.train(16, 128)
As the model trains, the script prints out some numbers that tell you how well the training is going.
It shows:
• The cross entropy on the last 100 mini-batches of the training set. This is shown after every
100 mini-batches.

• The cross entropy on the entire validation set every 1000 mini-batches of training.
At the end of training, this function shows the cross entropies on the training, validation and test
sets. It will return a Model instance.
To convince us that you have correctly implemented the gradient computations, please include
the following with your assignment submission:
• You will submit language_model.py through MarkUs. You do not need to modify any of
the code except the parts we asked you to implement.

• In your writeup, include the output of the function checking.print_gradients. This prints
out part of the gradients for a partially trained network which we have provided, and we
will check them against the correct outputs. Important: make sure to give the output of
checking.print_gradients, not checking.check_gradients.
This is worth 4 points: 1 for the loss derivatives, 1 for the bias gradients, and 2 for the weight
gradients. Since we gave you a gradient checker, you have no excuse for not getting full points on
this part.

Part 3: Analysis (4pts)

In this part, you will analyze the representation learned by the network. You should first train a
model with a 16-dimensional embedding and 128 hidden units, as discussed in the previous section;
you’ll use this trained model for the remainder of this section. Important: if you’ve made any fixes
to your gradient code, you must reload the language_model module and then re-run the training
procedure. Python does not reload modules automatically, and you don’t want to accidentally
analyze an old version of your model.
These methods of the Model class can be used for analyzing the model after the training is
done.

• tsne_plot creates a 2-dimensional embedding of the distributed representation space using
an algorithm called t-SNE. (You don’t need to know what this is for the assignment, but we
may cover it later in the course.) Nearby points in this 2-D space are meant to correspond to
nearby points in the 16-D space. From the learned model, you can create pictures that look
like this:

• display_nearest_words lists the words whose embedding vectors are nearest to the given
word

• word_distance computes the distance between the embeddings of two words

• predict_next_word shows the possible next words the model considers most likely, along
with their probabilities
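To make the notion of nearby embeddings concrete, here is a minimal sketch of a distance and nearest-neighbour lookup over an embedding matrix. It assumes Euclidean distance, which this handout does not confirm is the metric used by the starter code. The function names mirror the methods above only loosely, and the toy vocabulary and vectors are invented.

```python
import numpy as np

def word_distance(embeddings, vocab, word1, word2):
    """Euclidean distance between the embedding rows of two words."""
    i, j = vocab.index(word1), vocab.index(word2)
    return float(np.linalg.norm(embeddings[i] - embeddings[j]))

def nearest_words(embeddings, vocab, word, k=2):
    """The k words whose embeddings are closest to that of the given word."""
    i = vocab.index(word)
    dists = np.linalg.norm(embeddings - embeddings[i], axis=1)
    order = [j for j in np.argsort(dists) if j != i][:k]
    return [(vocab[j], float(dists[j])) for j in order]

# Toy 2-D embeddings, invented for illustration only.
vocab = ['he', 'she', 'city', 'town']
embeddings = np.array([[0.0, 1.0],
                       [0.1, 1.0],
                       [3.0, 0.0],
                       [3.0, 0.5]])
print(nearest_words(embeddings, vocab, 'he', k=1))
print(word_distance(embeddings, vocab, 'city', 'town'))
```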

Using these methods, please answer the following questions, each of which is worth 1 point.


1. Pick three words from the vocabulary that go well together (for example, ‘government of united’,
‘city of new’, ‘life in the’, ‘he is the’ etc.). Use the model to predict the next word.
Does the model give sensible predictions? Try to find an example where it makes a plausible
prediction even though the 4-gram wasn’t present in the dataset (raw_sentences.txt). To
help you out, the function language_model.find_occurrences lists the words that appear
after a given 3-gram in the training set.

2. Plot the 2-dimensional visualization using the method Model.tsne_plot. Look at the plot
and find a few clusters of related words. What do the words in each cluster have in common?
(You don’t need to include the plot with your submission.)

3. Are the words ‘new’ and ‘york’ close together in the learned representation? Why or why
not?

4. Which pair of words is closer together in the learned representation: (‘government’, ‘political’),
or (‘government’, ‘university’)? Why do you think this is?

What you have to submit

For reference, here is everything you need to hand in. See the top of this handout for submission
directions.

• A PDF file titled a1-writeup.pdf containing the following:

– Both questions from Part 1

– The output of checking.print_gradients()
– Answers to all four questions from Part 3
– Optional: Which part of this assignment did you find the most valuable? The most
difficult and/or frustrating?

• Your code file language_model.py

This assignment is graded out of 10 points: 2 for Part 1, 4 for Part 2, and 4 for Part 3.
