
Lecture: Manycore GPU Architectures

and Programming, Part 1

CSCE 569 Parallel Computing


Department of Computer Science and Engineering
Yonghong Yan
yanyh@cse.sc.edu
https://passlab.github.io/CSCE569/

1
Manycore GPU Architectures and
Programming: Outline
• Introduction
– GPU architectures, GPGPUs, and CUDA
• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
– Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
– OpenACC and OpenMP
2
Computer Graphics
GPU: Graphics Processing Unit

3
Graphics Processing Unit (GPU)

Image: http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html

4
Graphics Processing Unit (GPU)
• Enriching user visual experience
• Delivering energy-efficient computing
• Unlocking potentials of complex apps
• Enabling deeper scientific discovery

5
What is GPU Today?
• It is a processor optimized for 2D/3D graphics, video, visual
computing, and display.
• It is highly parallel, highly multithreaded multiprocessor
optimized for visual computing.
• It provides real-time visual interaction with computed
objects via graphics, images, and video.
• It serves as both a programmable graphics processor and a
scalable parallel computing platform.
– Heterogeneous systems: combine a GPU with a CPU

• It is called a manycore processor
6
Graphics Processing Units (GPUs): Brief History
Timeline, 1970–2010 (earliest to latest):
• Atari 8-bit computer text/graphics chip
• IBM PC Professional Graphics Controller card
• S3 graphics cards – single-chip 2D accelerator
• PlayStation
• Hardware-accelerated 3D graphics
• OpenGL and DirectX graphics APIs
• GPUs with programmable shading – Nvidia GeForce 3 (2001) with
  programmable shading
• General-purpose computing on graphics processing units (GPGPU)
• GPU Computing

Source of information: http://en.wikipedia.org/wiki/Graphics_Processing_Unit
7
NVIDIA Products
• NVIDIA Corp. is the leader in GPUs for HPC
  – Established in 1993 by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem
• We will concentrate on NVIDIA GPUs
  – Others: AMD, ARM, etc.
• Product timeline (1993–2010):
  – NV1 (1995), GeForce 1 (1999), GeForce 2 series, GeForce FX series
  – GeForce 8 series, including the GeForce 8800 (G80), NVIDIA's first
    GPU with general-purpose processors
  – GeForce 200 series (GTX 260/275/280/285/295), GeForce 400 series
    (GTX 460/465/470/475/480/485), Quadro
  – Tesla HPC products: C870, S870, C1060, S1070, C2050, …
    • The Tesla C2050 (Fermi) GPU has 448 thread processors
  – GPU architectures: Fermi, Kepler (2011), Maxwell (2013)
http://en.wikipedia.org/wiki/GeForce
8
GPU Architecture Revolution
• Unified Scalar Shader Architecture

• Highly Data Parallel Stream Processing

Image: http://www.ntu.edu.sg/home/ehchua/programming/opengl/CG_BasicsTheory.html

An Introduction to Modern GPU Architecture, Ashu Rege, NVIDIA Director of Developer Technology
9
ftp://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf
GPUs with Dedicated Pipelines
-- late 1990s-early 2000s
• Graphics chips generally had a pipeline structure with
  individual stages performing specialized operations, finally
  leading to loading the frame buffer for display.
• Individual stages may have access to graphics memory for
  storing intermediate computed data.

Figure: pipeline of fixed stages – input stage, vertex shader stage,
geometry shader stage, rasterizer stage, and pixel shading stage – with
graphics memory and the frame buffer accessible along the way.
10
Specialized Pipeline Architecture

GeForce 6 Series Architecture (2004-5)
From GPU Gems 2

11
Graphics Logical Pipeline

Graphics logical pipeline. Programmable graphics shader stages are blue, and fixed-function blocks are
white. Copyright © 2009 Elsevier, Inc. All rights reserved.

Processor per function, each could be vector
→ Unbalanced and inefficient utilization

12
Unified Shader
• Optimal utilization in unified architecture

FIGURE A.2.4 Logical pipeline mapped to physical processors. The programmable shader stages execute on the
array of unified processors, and the logical graphics pipeline dataflow recirculates through the processors. Copyright ©
2009 Elsevier, Inc. All rights reserved.
13
Unified Shader Architecture

FIGURE A.2.5 Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14
streaming multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA
GeForce 8800. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has
eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a
shared memory. Copyright © 2009 Elsevier, Inc. All rights reserved.
14
Streaming Processing
To be efficient, GPUs must have high throughput, i.e., process
millions of pixels in a single frame, but they may have high latency.

• “Latency is a time delay between the moment something is
  initiated, and the moment one of its effects begins or
  becomes detectable”
  – For example, the delay between a request for a texture read and
    the moment the texture data is returned
• Throughput is the amount of work done in a given amount
  of time
  – CPUs are low-latency, low-throughput processors
  – GPUs are high-latency, high-throughput processors
15
Streaming Processing to Enable Massive
Parallelism
• Given a (typically large) set of data (a “stream”)
• Run the same series of operations (a “kernel” or “shader”) on
  all of the data (SIMD)

• GPUs use various optimizations to improve throughput:
  • Some on-chip memory and local caches to reduce bandwidth to
    external memory
  • Batch groups of threads to minimize incoherent memory access
    – Bad access patterns will lead to higher latency and/or thread stalls
  • Eliminate unnecessary operations by exiting or killing threads
16
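As a concrete, illustrative sketch of this streaming idea, the CUDA kernel
below – previewing syntax that is introduced properly later in this lecture –
applies the same operation, y[i] = a*x[i] + y[i], to every element of the
stream, with one lightweight thread per element (the kernel and array names
are hypothetical):

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n)                                      // guard threads beyond the end of the data
        y[i] = a * x[i] + y[i];
}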
GPU Computing – The Basic Idea
• Use GPU for more than just generating graphics
– The computational resources are there, but they are
  underutilized most of the time

– The ironic fact: it took about 20 years (1980s/90s – 2007) to
  realize that a GPU that can do graphics well should also do
  image processing well

17
GPU Performance Gains Over CPU

http://docs.nvidia.com/cuda/cuda-c-programming-guide

18
GPU Performance Gains Over CPU

19
Parallelism in CPUs v. GPUs
• Multi-/many-core CPUs use task parallelism
  – MIMD, i.e., multiple tasks map to multiple threads
  – Tasks run different instructions
  – 10s of relatively heavyweight threads run on 10s of cores
  – Each thread managed and scheduled explicitly
  – Each thread has to be individually programmed (MPMD)

• Manycore GPUs use data parallelism
  – SIMD model (Single Instruction Multiple Data)
  – Same instruction on different data
  – 10,000s of lightweight threads on 100s of cores
  – Threads are managed and scheduled by hardware
  – Programming done for batches of threads (e.g. one pixel
    shader per group of pixels, or draw call)
20
GPU Computing – Offloading Computation

• The GPU is connected to the CPU by a reasonably fast bus
  (8 GB/s is typical today): PCIe

• Terminology
– Host: The CPU and its memory (host memory)
– Device: The GPU and its memory (device memory)
21
Simple Processing Flow

PCI Bus

1. Copy input data from CPU memory to GPU memory

22
Simple Processing Flow

PCI Bus

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance

23
Simple Processing Flow

PCI Bus

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory

24
Offloading Computation
#include <stdlib.h>
#include <algorithm>
using namespace std;   // for fill_n

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn: executed on the GPU by many threads
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

int main(void) {
    int *in, *out;       // host copies
    int *d_in, *d_out;   // device copies
    int size = (N + 2*RADIUS) * sizeof(int);

    // serial code on the host:
    // Alloc space for host copies and setup values
    in = (int *)malloc(size);  fill_ints(in, N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel execution on the GPU:
    // Launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // serial code on the host:
    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
25
Programming for NVIDIA GPUs

http://docs.nvidia.com/cuda/cuda-c-programming-guide/
26
CUDA (Compute Unified Device Architecture)
Both an architecture and programming model
• Architecture and execution model
  – Introduced by NVIDIA in 2007
  – Getting the highest possible execution performance requires an
    understanding of the hardware architecture
• Programming model
  – Small set of extensions to C
  – Enables GPUs to execute programs written in C
  – Within C programs, call SIMT “kernel” routines that are
    executed on the GPU
• Hello world introduction today
– More in later lectures
27
CUDA Thread Hierarchy
stencil_1d<<<N/BLOCK_SIZE,BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

• The grid and block in the <<<...>>> launch configuration can each
  be 1, 2 or 3 dimensional
• Allows flexibility and efficiency in processing 1-D, 2-D,
  and 3-D data on the GPU
• Linked to the internal organization of the hardware
• Threads in one block execute together
28
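A minimal sketch of 1-D and 2-D launch configurations using the CUDA dim3
type (the kernel names, sizes, and device pointers here are hypothetical,
for illustration only):

// 1-D: enough 256-thread blocks to cover N elements
dim3 block1(256);
dim3 grid1((N + block1.x - 1) / block1.x);
vector_kernel<<<grid1, block1>>>(d_data, N);

// 2-D: 16x16-thread blocks tiling a width x height image
dim3 block2(16, 16);
dim3 grid2((width + block2.x - 1) / block2.x, (height + block2.y - 1) / block2.y);
image_kernel<<<grid2, block2>>>(d_image, width, height);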
Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

• Standard C that runs on the host
• The NVIDIA compiler (nvcc) can be used to compile
  programs with no device code
• Try on Bridges, using interactive mode
• Or on your own computer that has an NVIDIA GPU
  – You need to install the CUDA SDK and the NVIDIA
    graphics driver

Output:
$ nvcc hello.cu
$ ./a.out
Hello World!
$

29
Hello World! with Device Code

#include <stdio.h>

__global__ void hellokernel() {
    printf("Hello World!\n");
}

int main(void){
    int num_threads = 1;
    int num_blocks = 1;
    hellokernel<<<num_blocks,num_threads>>>();
    cudaDeviceSynchronize();
    return 0;
}

§ Two new syntactic elements…

Output:
$ nvcc hello.cu
$ ./a.out
Hello World!
$
30
GPU code examples and try on Bridges

• GPU code examples:
  – https://passlab.github.io/CSCE569/resources/gpu_code_examples
  – You can download them yourself or copy from my home folder on Bridges
• Bridges instructions:
  – https://passlab.github.io/CSCE569/resources/HardwareSoftware.html#interactive

• On Bridges:
  – interact -gpu
  – module load gcc/5.3.0 cuda/8.0 opencv/3.2.0
  – cp -r ~yan/gpu_code_examples ~
  – cd gpu_code_examples
  – nvcc hello-1.cu -o hello-1
  – ./hello-1
  – nvcc hello-2.cu -o hello-2
  – ./hello-2
31
Hello World! with Device Code
__global__ void hellokernel(void)

• CUDA C/C++ keyword __global__ indicates a function that:
  – Runs on the device
  – Is called from host code

• nvcc separates source code into host and device components
  – Device functions (e.g. hellokernel()) processed by the NVIDIA
    compiler
  – Host functions (e.g. main()) processed by the standard host
    compiler
    • gcc, cl.exe
32
Hello World! with Device Code
hellokernel<<<num_blocks,num_threads>>>();

• Triple angle brackets mark a call from host code to device code
  – Also called a “kernel launch”
  – The <<< ... >>> parameters specify the thread dimensionality
• That’s all that is required to execute a function on the GPU!

33
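One common companion pattern (not shown on the slide, but useful in
practice): kernel launches are asynchronous, so errors are usually checked
explicitly after the launch. A minimal sketch:

hellokernel<<<num_blocks,num_threads>>>();
cudaError_t err = cudaGetLastError();      // reports launch/configuration errors
if (err != cudaSuccess)
    printf("Launch failed: %s\n", cudaGetErrorString(err));
cudaDeviceSynchronize();                   // waits for the kernel; surfaces execution errors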
Hello World! with Device Code

#include <stdio.h>

__device__ const char *STR = "Hello World!";
const char STR_LENGTH = 12;

__global__ void hellokernel(){
    printf("%c", STR[threadIdx.x % STR_LENGTH]);
}

int main(void){
    int num_threads = STR_LENGTH;
    int num_blocks = 1;
    hellokernel<<<num_blocks,num_threads>>>();
    cudaDeviceSynchronize();
    return 0;
}

Output:
$ nvcc hello.cu
$ ./a.out
Hello World!
$
34
Hello World! with Device Code

#include <stdio.h>

__device__ const char *STR = "Hello World!";   // __device__: identify device-only data
const char STR_LENGTH = 12;

__global__ void hellokernel(){
    printf("%c", STR[threadIdx.x % STR_LENGTH]);   // threadIdx.x: the thread ID
}

int main(void){
    int num_threads = STR_LENGTH;
    int num_blocks = 2;
    hellokernel<<<num_blocks,num_threads>>>();
    cudaDeviceSynchronize();
    return 0;
}

Each thread only prints one character
35
Manycore GPU Architectures and Programming

• GPU architectures, graphics and GPGPUs


• GPU Execution model
• CUDA Programming model
• Working with Memory in CUDA
– Global memory, shared and constant memory
• Streams and concurrency
• CUDA instruction intrinsics and libraries
• Performance, profiling, debugging, and error handling
• Directive-based high-level programming model
– OpenACC and OpenMP

36
GPU Execution Model
• The GPU is a physically separate processor from the CPU
– Discrete vs. Integrated
• The GPU Execution Model offers different abstractions from
the CPU to match the change in architecture

PCI Bus

37
The Simplest Model: Single-Threaded
• Single-threaded Execution Model
– Exclusive access to all variables
– Guaranteed in-order execution of loads and stores
– Guaranteed in-order execution of arithmetic instructions

• Also the most common execution model, and the simplest for
  programmers to conceptualize and optimize

Single-Threaded

39
CPU SPMD Multi-Threading
• Single-Program, Multiple-Data (SPMD) model
– Makes the same in-order guarantees within each thread
– Says little or nothing about inter-thread behaviour or exclusive
variable access without explicit inter-thread synchronization
SPMD
Synchronize

40
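For reference, a minimal SPMD sketch in C with OpenMP (a model covered later
in this course): every thread runs the same program on its own share of the
data, and nothing is guaranteed about the other threads until the explicit
barrier (the function and array names are illustrative):

#include <omp.h>

void scale(float *x, int n, float a) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        for (int i = tid; i < n; i += nthreads)   // each thread's share of the work
            x[i] *= a;
        #pragma omp barrier                       // explicit inter-thread synchronization
    }
}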
GPU Multi-Threading
• Uses the Single-Instruction, Multiple-Thread model
– Many threads execute the same instructions in lock-step
– Implicit synchronization after every instruction (think vector
parallelism)

SIMT

41
GPU Multi-Threading
• In SIMT, all threads share instructions but operate on their
own private registers, allowing threads to store thread-local
state

SIMT

42
GPU Multi-Threading
• SIMT threads can be “disabled” when they need to execute
  instructions different from others in their group
• This improves the flexibility of the SIMT model, relative to
  similar vector-parallel models (SIMD)

Example: two threads in one SIMT group, one with a = 4, b = 3 and the
other with a = 3, b = 4, execute the same branch; in each branch the
threads that did not take it are disabled:

if (a > b) {
    max = a;   // threads with a <= b are disabled here
} else {
    max = b;   // threads with a > b are disabled here
}

43
GPU Multi-Threading
• GPUs execute many groups of SIMT threads in parallel
– Each executes instructions independent of the others

SIMT Group 0

SIMT Group 1

44
Execution Model to Hardware
• How does this
execution model
map down to actual
GPU hardware?

• NVIDIA GPUs consist of many streaming multiprocessors (SMs)

45
Execution Model to Hardware
• NVIDIA GPU Streaming Multiprocessors (SMs) are analogous
  to CPU cores
– Single computational unit
– Think of an SM as a single
vector processor
– Composed of multiple CUDA
“cores”, load/store units,
special function units (sin,
cosine, etc.)
– Each CUDA core contains
integer and floating-point
arithmetic logic units
46
Execution Model to Hardware
• GPUs can execute multiple SIMT groups on each SM
– For example: on NVIDIA GPUs a SIMT group is 32 threads; each
  Kepler SM has 192 CUDA cores → simultaneous execution of 6 SIMT
  groups on an SM

• SMs can support more concurrent SIMT groups than core count
would suggest
– Each thread persistently stores its own state in a private register set
– Many SIMT groups will spend time blocked on I/O, not actively
computing
– Keeping blocked SIMT groups scheduled on an SM would waste
cores
– Groups can be swapped in and out without worrying about losing
state

47
Execution Model to Hardware
• This leads to a nested thread hierarchy on GPUs:
  a single thread → a SIMT group → SIMT groups that execute together
  on the same SM → SIMT groups that concurrently run on the same GPU

48
GPU Memory Model
• Now that we understand how abstract threads of execution are
  mapped to the GPU:
  – How do those threads store and retrieve data?
  – What rules are there about memory consistency?
  – How can we efficiently use GPU memory?

Figure: memory hierarchy – SIMT thread groups on a GPU, SIMT thread groups
on an SM, and a single SIMT thread group; each thread has registers and
local memory, each SM has on-chip shared memory, and the whole GPU has
global, constant, and texture memory.

49
GPU Memory Model
• There are many levels and types of GPU memory, each of
which has special characteristics that make it useful
– Size
– Latency
– Bandwidth
– Readable and/or Writable
– Optimal Access Patterns
– Accessibility by threads in the same SIMT group, SM, GPU

• Later lectures will go into detail on each type of GPU memory

50
GPU Memory Model
• For now, we focus on two memory types: on-chip shared memory
  and registers
  – These memory types affect the GPU execution model

• Each SM has a limited set of registers; each thread receives
  its own private set of registers

• Each SM has a limited amount of Shared Memory; all SIMT groups
  on an SM share that Shared Memory

51
GPU Memory Model
• Shared Memory and Registers are therefore limited
  – Per-SM resources which can impact how many threads can
    execute on an SM

• For example: consider an imaginary SM that supports executing
  1,024 threads concurrently (32 SIMT groups of 32 threads)
  – Suppose that SM has a total of 16,384 registers
  – Suppose each thread in an application requires 64 registers to
    execute
  – Even though we can theoretically support 1,024 threads, we
    can only simultaneously store state for 16,384 registers / 64
    registers per thread = 256 threads

52
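The same per-SM resource reasoning can be done at runtime with the CUDA
occupancy API (available since CUDA 6.5). A small sketch, where mykernel and
blockSize are hypothetical; the runtime accounts for registers and shared
memory when reporting how many blocks of that kernel fit on one SM:

int blockSize = 256;
int maxBlocksPerSM = 0;
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, mykernel,
                                              blockSize, 0 /* dynamic shared memory */);
printf("Up to %d resident threads per SM\n", maxBlocksPerSM * blockSize);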
GPU Communication
• Communicating between the host and GPU is a piece of
added complexity, relative to homogeneous programming
models

• Generally, the CPU and GPU have physically and logically
  separate address spaces (though this is changing)

PCIe Bus

53
GPU Communication
• Data transfer from CPU to GPU over the PCI bus adds
– Conceptual complexity
– Performance overhead

Communication Medium     Latency                                  Bandwidth
On-Chip Shared Memory    A few clock cycles                       Thousands of GB/s
GPU Memory               Hundreds of clock cycles                 Hundreds of GB/s
PCI Bus                  Hundreds to thousands of clock cycles    Tens of GB/s

54
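To see these costs for yourself, one simple approach is to time a transfer
with CUDA events. A small sketch (h_buf, d_buf, and size are assumed to have
been allocated and initialized already):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
cudaMemcpy(d_buf, h_buf, size, cudaMemcpyHostToDevice);   // host-to-device over PCIe
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("H2D: %.3f ms, %.2f GB/s\n", ms, (size / 1.0e9) / (ms / 1.0e3));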
GPU Communication
• As a result, computation-communication overlap is a
common technique in GPU programming
– Asynchrony is a first-class citizen of most GPU programming
frameworks

Figure: a timeline in which GPU compute kernels (Compute) overlap with
PCIe transfers (Copy).

55
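A sketch of what this overlap looks like with CUDA streams (the kernel name,
chunk size, and buffers are hypothetical; truly asynchronous copies also
require pinned host memory allocated with cudaMallocHost):

for (int i = 0; i < nchunks; i++) {
    int off = i * chunk;
    // copy-in, kernel, and copy-out for one chunk go into one stream,
    // so different chunks can overlap with each other
    cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                    cudaMemcpyHostToDevice, streams[i]);
    process<<<chunk / 256, 256, 0, streams[i]>>>(d_in + off, d_out + off, chunk);
    cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                    cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize();   // wait for all streams to finish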
GPU Execution Model
• GPUs introduce a new conceptual model for programmers
used to CPU single- and multi-threaded programming

• While the concepts are different, they are no more complex
  than those you would need to learn to extract optimal
  performance from CPU architectures

• GPUs offer programmers more control over how their
  workloads map to hardware, which makes the results of
  optimizing applications more predictable

56
References
1. The sections on Introducing the CUDA Execution Model,
Understanding the Nature of Warp Execution, and Exposing
Parallelism in Chapter 3 of Professional CUDA C Programming
2. Michael Wolfe. Understanding the CUDA Data Parallel Threading
Model. https://www.pgroup.com/lit/articles/insider/v2n1a5.htm
3. Will Ramey. Introduction to CUDA Platform.
   http://developer.download.nvidia.com/compute/developertrainingmaterials/presentations/general/Why_GPU_Computing.pptx
4. Timo Stich. Fermi Hardware & Performance Tips.
   http://theinf2.informatik.uni-jena.de/theinf2_multimedia/Website_downloads/NVIDIA_Fermi_Perf_Jena_2011.pdf

57
