Web GPU
Web GPU
Abdul Dakkak
!1
Overview
IMPACT "
History"
Architecture"
Some data"
Lessons"
Future
01
!2
Quick Demo
!3
IMPACT
Objective
Create a programming
environment for the Coursera
course"
01
!5
Previous System
New System
01
!7
Architecture
!8
Library+!
User Code
Life of a
Worker
Program
Submission
Library+!
User Code
Library+!
User Code
Worker
Worker
Web!
Server
DB
Coursera
!1010
Detailed Architecture
!11
Made it public,
so got some help
from students
Correlated and
source of major
problems (happens
on Cyclone and AWS)
Source of Bugs
Incorrect or
inefficient
queries
!12
Sandboxed
non-root
Library+!
User Code
Library+!
User Code
Library+!
User Code
Worker
Worker
Worker
Web!
Server
DB
Security
Sandbox users code"
non-root
Coursera
!1013
How to Scale?
Web!
Server
DB
Worker
Web!
Server
Worker
!1014
Scale
What works for 2 people may not work for 1000 people"
What works for 1000 people may not work when all are logged in
at the same time"
What works for 1000 people on day 1 may not work for 1000
people on day 50
!15
!18
Lessons Learned
You will feel discouraged and moody for the rest of the day"
You will answer the same thing over and over and over again"
People will send you their code and ask you to debug it"
They will feel like you are not doing your job if you tell them no
!20
Data
!21
Data Collected
Google Analytics"
CPU/GPU information"
Thousands of Visitors
!23
Grades
!26
939 actually started the course (FIXME: this figure is not correct)"
Website Down
MP1 Due
MP2 Due
People Submit at
the Last Minute
Grade Timeline
01
Students Work
Program save timeline
!30
!
!
// Student-submitted tiled matrix-multiply kernel (slide excerpt).
// NOTE(review): the statements below are OUT OF ORDER — the slide
// extraction interleaved two columns of the original listing.
// `cellRow`, `cellCol` and `Cvalue` are defined in a part of the
// kernel that is not visible here; do not compile this fragment as-is.
#define BLOCK_SIZE 16
__global__ void MatMulKernel(float *DA, float *DB, float *DC, int Ah, int Aw,
int AwTiles, int Bh, int Bw) {
// Block row and column
// NOTE(review): the two guard `if`s belong after the index setup in
// the original code; `return` lines were split from their conditions.
if (cellRow >= Ah)
int blockRow = blockIdx.y;
return;
int blockCol = blockIdx.x;
if (cellCol >= Bw)
return;
// Thread row and column within Csub
int row = threadIdx.y;
// Final store of the accumulated cell into the output matrix C.
DC[cellRow * Bw + cellCol] = Cvalue;
int col = threadIdx.x;
}
!
!
!
!
//
//
//
//
!
!
// NOTE(review): interior fragment of the tiled MatMulKernel shown
// earlier on this slide — the loop over A's width in BLOCK_SIZE tiles.
// The loop body and closing brace are outside this excerpt.
#pragma unroll
for (int m = 0; m < AwTiles; ++m) {
// Shared memory used to store Asub and Bsub respectively
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
__shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
!
!
!
!
!!
//
// CITE: Vasily Volkov, UC Berkeley. Interesting approach to using
// caching a bit differently... had to give it a try...
//
// Device helper for the SGEMM inner loop: accumulate one scaled
// 16-element slice of B into the per-thread accumulator, i.e.
// c[i] += a * b[i] for i = 0..15. The fixed-trip loop below is fully
// unrolled by the compiler, matching the original hand-unrolled
// sequence instruction-for-instruction.
__device__ void saxpy(float a, float *b, float *c) {
#pragma unroll
  for (int i = 0; i < 16; ++i) {
    c[i] += a * b[i];
  }
}
!
!
!
!
!
!
// NOTE(review): fragment of the SGEMM rank-1-update main loop
// (Volkov-style, see citation above). The slide extraction split most
// calls across lines: each `saxpy(a[k], &bs[m][0],` pairs with one of
// the orphaned `c);` lines below it, interleaved with the `a[k] =
// A[k * lda];` prefetches. The third group (bs[8]..bs[11]) survived
// intact and shows the intended per-statement shape:
//   saxpy(a[k], &bs[m][0], c);  a[k] = A[k * lda];
A += 4 * lda;
saxpy(a[0], &bs[0][0],
a[0] = A[0 * lda];
saxpy(a[1], &bs[1][0],
a[1] = A[1 * lda];
saxpy(a[2], &bs[2][0],
a[2] = A[2 * lda];
saxpy(a[3], &bs[3][0],
a[3] = A[3 * lda];
c);
c);
c);
c);
A += 4 * lda;
saxpy(a[0], &bs[4][0],
a[0] = A[0 * lda];
saxpy(a[1], &bs[5][0],
a[1] = A[1 * lda];
saxpy(a[2], &bs[6][0],
a[2] = A[2 * lda];
saxpy(a[3], &bs[7][0],
a[3] = A[3 * lda];
c);
c);
c);
c);
A += 4 * lda;
// Intact group: update with bs rows 8..11, then prefetch next A values.
saxpy(a[0], &bs[8][0], c);
a[0] = A[0 * lda];
saxpy(a[1], &bs[9][0], c);
a[1] = A[1 * lda];
saxpy(a[2], &bs[10][0], c);
a[2] = A[2 * lda];
saxpy(a[3], &bs[11][0], c);
a[3] = A[3 * lda];
A += 4 * lda;
// Final group of the 16-step unroll: bs rows 12..15, no prefetch after.
saxpy(a[0], &bs[12][0],
saxpy(a[1], &bs[13][0],
saxpy(a[2], &bs[14][0],
saxpy(a[3], &bs[15][0],
c);
c);
c);
c);
SGEMM Implementation
!
!
// NOTE(review): excerpt of the optimised SGEMM kernel (Volkov-style).
// Lines are interleaved by the two-column slide extraction: the
// do-while tail (`B += 16; ... __syncthreads(); } while (B < Blast);`)
// and the C write-back loop are mixed into the index declarations.
// `Blast` and the accumulator `c[]` are declared in a part of the
// kernel not shown here; do not compile this fragment as-is.
__global__ void optimisedDLA(const float *A, int lda, const float *B, int ldb,
B += 16;
float *C, int ldc, int k) {
__syncthreads();
const int inx = threadIdx.x;
}
while (B < Blast);
const int iny = threadIdx.y;
// Each block computes a 64x16 tile of C.
const int ibx = blockIdx.x * 64;
for (int i = 0; i < 16; i++, C += ldc)
const int iby = blockIdx.y * 16;
C[0] = c[i];
// Flat thread id within the 16x4 thread block.
const int id = inx + iny * 16;
}
.
A += ibx + id;
!
!
!
!
!
!33
Analysis Opportunities
Source of Data
Lessons Learned
!37
Current Work
!38
Current Work
Questions?
!40