Introduction To CUDA Computing
Mike Clark, NVIDIA
Developer Technology Group
Outline

Today:
- Motivation
- GPU Architecture
- Three ways to accelerate applications

Tomorrow:
- QUDA: QCD on GPUs
Why GPU Computing?
[Two charts, 2003-2010: peak GFlops/sec (single and double precision) and memory bandwidth in GBytes/sec, NVIDIA GPUs versus x86 CPUs. The Tesla 20-series (ECC off) pulls far ahead of 3 GHz Nehalem and Westmere on both metrics.]
Stunning Graphics Realism: Lush, Rich Worlds
(image © id Software)
[Demo: N-body simulation, GPU versus CPU.]
Low Latency or High Throughput?

CPU:
- Optimized for low-latency access to cached data sets
- Control logic for out-of-order and speculative execution

GPU:
- Optimized for data-parallel, throughput computation
- Architecture tolerant of memory latency
- More transistors dedicated to computation
Small Changes, Big Speed-up
Application code is split: the compute-intensive functions move to the GPU, while the rest of the sequential code continues to run on the CPU.

Representative speed-ups:

146X   Medical Imaging      U of Utah
36X    Molecular Dynamics   U of Illinois, Urbana
18X    Video Transcoding    Elemental Tech
50X    Matlab Computing     AccelerEyes
100X   Astrophysics         RIKEN
[Chart: successive NVIDIA GPU architectures Tesla, Fermi, and Kepler plotted on a steeply rising curve, roughly 2 up to 14 on the vertical axis.]
GPU Architecture – Fermi

[Die diagram: Host I/F and GigaThread engine, six DRAM interfaces, a shared L2 cache, and an array of Streaming Multiprocessors.]

Streaming Multiprocessors (SMs):
- Perform the actual computations
- Each SM has its own control units, registers, execution pipelines, and caches

ECC on/off option for Quadro and Tesla products

GPU Architecture – Fermi: Streaming Multiprocessor (SM)

[SM diagram: instruction cache, dual schedulers, register file, CUDA cores delivering 32 fp32 ops/clock, and a uniform cache.]
Kepler

[Side-by-side SM diagrams, Fermi versus Kepler. Fermi SM: instruction cache; two warp schedulers, each with two dispatch units; register file; 32 CUDA cores (each with a dispatch port, operand collector, ALU, and result queue); 16 load/store units; 4 special function units; interconnect network; 64 KB configurable shared memory / L1 cache; uniform cache. Kepler SM: instruction cache; four warp schedulers with eight dispatch units; a 65,536 x 32-bit register file; a much larger array of cores plus LD/ST and SFU units; interconnect network; 64 KB shared memory / L1 cache; uniform cache.]
3 Ways to Accelerate Applications

Applications can be accelerated via:
- Libraries
- OpenACC Directives
- Programming Languages

GPU-accelerated libraries include:
- ArrayFire Matrix Computations
- Sparse Linear Algebra for CUDA
- IMSL Library
- Building-block Algorithms for CUDA
- C++ STL Features
3 Steps to CUDA-accelerated application

    // Deallocate device vectors
    cublasFree(d_x);
    cublasFree(d_y);
    cublasShutdown();

Drop-In Acceleration (Step 2)

[The slide repeats the same cleanup fragment; a sketch of the full three-step flow follows below.]
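Only the cleanup fragment survived extraction. For orientation, here is a minimal sketch of the whole three-step flow using the legacy cuBLAS v1 API that the slide's calls come from; the vector length, the alpha value, and the cublasAlloc/cublasSaxpy calls are illustrative assumptions, not taken from the original slide.

    #include <cublas.h>  /* legacy cuBLAS v1 API, matching the slide's calls */

    void gpu_saxpy(int n, float a, float *x, float *y)
    {
        float *d_x, *d_y;

        /* Step 1: initialize cuBLAS and allocate device vectors */
        cublasInit();
        cublasAlloc(n, sizeof(float), (void**)&d_x);
        cublasAlloc(n, sizeof(float), (void**)&d_y);

        /* Step 2: copy inputs to the GPU and make the drop-in BLAS call */
        cublasSetVector(n, sizeof(float), x, 1, d_x, 1);
        cublasSetVector(n, sizeof(float), y, 1, d_y, 1);
        cublasSaxpy(n, a, d_x, 1, d_y, 1);
        cublasGetVector(n, sizeof(float), d_y, 1, y, 1);

        /* Step 3: deallocate device vectors and shut down */
        cublasFree(d_x);
        cublasFree(d_y);
        cublasShutdown();
    }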
Explore the CUDA (Libraries) Ecosystem

developer.nvidia.com/cuda-tools-ecosystem
3 Ways to Accelerate Applications

- Libraries
- OpenACC Directives
- Programming Languages
OpenACC: Open Programming Standard for Parallel Computing

Compiler directives are added to your original Fortran or C code.

"OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan."
-- Buddy Bland, Titan Project Director, Oak Ridge National Lab

OpenACC: The Standard for GPU Directives
Fragment of the OpenACC example (end of the iteration loop):

      iter = iter + 1
      err = 0._fp_kind
    end do
    !$acc end data    ! close off data region, copy data back
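The fragment above is the tail of a Fortran data region. For a self-contained illustration of the same pattern, here is a minimal OpenACC SAXPY in C (my own sketch, not from the slides): the data clauses keep the arrays resident on the device, and the loop is offloaded.

    /* Minimal OpenACC sketch (assumption, not slide code): data clauses
       manage device residency; the loop body runs on the GPU. */
    void saxpy_acc(int n, float a, float *x, float *y)
    {
        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                y[i] = a * x[i] + y[i];
        }   /* end of data region: y is copied back to the host here */
    }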
Directives: Easy & Powerful

- Real-Time Object Detection (Global Manufacturer of Navigation Systems)
- Valuation of Stock Portfolios using Monte Carlo (Global Technology Consulting Company)
- Interaction of Solvents and Biomolecules (University of Texas at San Antonio)

www.nvidia.com/gpudirectives
3 Ways to Accelerate Applications

- Libraries
- OpenACC Directives
- Programming Languages
Language bindings (excerpt):
- C: OpenACC, CUDA C
- C#: GPU.NET
CUDA C

Standard C Code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }

Parallel CUDA C Code:

    __global__
    void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
http://developer.nvidia.com/cuda-toolkit
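A __global__ kernel like saxpy_parallel is launched over a grid of thread blocks, one thread per element. A hedged host-side launch sketch (the block size of 256 is an illustrative choice, and d_x/d_y are assumed device pointers prepared with cudaMalloc/cudaMemcpy):

    int n = 1 << 20;                                  // 1M elements
    int blockSize = 256;                              // illustrative block size
    int gridSize = (n + blockSize - 1) / blockSize;   // round up to cover all i
    saxpy_parallel<<<gridSize, blockSize>>>(n, 2.0f, d_x, d_y);
    cudaDeviceSynchronize();                          // wait for completion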
CUDA C++: Develop Generic Parallel Code

CUDA C++ features:
- Templates
- Operator overloading
- Functors (function objects)
- Device-side new/delete
- More...

    template <typename T, typename Oper>
    __global__ void kernel(T *output, int n) {
      Oper op(3.7);
      output = new T[n];  // dynamic allocation
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n)
        output[i] = op(i);  // apply functor
    }
http://developer.nvidia.com/cuda-toolkit
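To make the template concrete, here is a hedged sketch of a functor and launch that could instantiate the kernel above; the functor ScaleBy and all launch parameters are illustrative assumptions, not from the slide (note the kernel constructs Oper with 3.7, so the functor needs a matching one-argument constructor):

    // Illustrative functor (assumption): scales the index by a stored factor.
    struct ScaleBy {
        float s;
        __device__ ScaleBy(float s) : s(s) {}
        __device__ float operator()(int i) const { return s * i; }
    };

    // Instantiates kernel<float, ScaleBy>; 4 blocks x 256 threads cover n = 1024.
    // As on the slide, the kernel replaces `output` via device-side new, so the
    // pointer passed in is a placeholder here.
    float *d_out = 0;
    kernel<float, ScaleBy><<<4, 256>>>(d_out, 1024);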
Rapid Parallel C++ Development
http://developer.nvidia.com/thrust or http://thrust.googlecode.com
CUDA Fortran

Program the GPU using Fortran, a key language for HPC. Simple language extensions provide:
- Kernel functions
- Thread / block IDs
- Device & data management

    module mymodule
    contains
      attributes(global) subroutine saxpy(n, a, x, y)
        real :: x(:), y(:), a
        integer :: n, i
        attributes(value) :: a, n
        i = threadIdx%x + (blockIdx%x-1)*blockDim%x
        if (i <= n) y(i) = a*x(i) + y(i)
      end subroutine saxpy
    end module mymodule
Further language bindings:
- Python: PyCUDA
- C# / .NET: GPU.NET
- Numerical analytics: Mathematica

Get Started Today

These languages are supported on all CUDA-capable GPUs.
You might already have a CUDA-capable GPU in your laptop or desktop PC!

PyCUDA (Python): http://mathema.tician.de/software/pycuda
Mathematica: http://www.wolfram.com/mathematica/new-in-8/cuda-and-opencl-support/
Six Ways to SAXPY

Programming languages for GPU computing, each illustrated with single-precision Alpha X Plus Y (SAXPY): y = a*x + y.

OpenACC (C):

    ...
    // Perform SAXPY on 1M elements
    saxpy(1<<20, 2.0, x, y);
    ...

OpenACC (Fortran):

    ...
    ! Perform SAXPY on 1M elements
    call saxpy(2**20, 2.0, x_d, y_d)
    ...

http://developer.nvidia.com/openacc or http://openacc.org
CUBLAS Library

Serial BLAS Code:

    int N = 1<<20;
    ...
    // Use your choice of BLAS library
    ...

Parallel cuBLAS Code:

    int N = 1<<20;
    cublasInit();
    cublasSetVector(N, sizeof(x[0]), x, 1, d_x, 1);
    cublasSetVector(N, sizeof(y[0]), y, 1, d_y, 1);
    ...
    cublasShutdown();
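The elided middle of the parallel version is where the actual BLAS call lands. With the legacy v1 API used here, it would plausibly read (an assumption, mirroring the earlier sketch):

    cublasSaxpy(N, 2.0f, d_x, 1, d_y, 1);   /* presumed drop-in SAXPY call, a = 2.0 */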
CUDA C:

Standard C:

    void saxpy(int n, float a, float *x, float *y)
    {
      for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
    }

Parallel CUDA C:

    __global__
    void saxpy(int n, float a, float *x, float *y)
    {
      int i = blockIdx.x*blockDim.x + threadIdx.x;
      if (i < n) y[i] = a*x[i] + y[i];
    }
http://developer.nvidia.com/cuda-toolkit
Thrust C++ Template Library

Serial C++ Code with STL and Boost:

    ...

Parallel C++ Code:

    ...
    thrust::device_vector<float> d_x = x;
    thrust::device_vector<float> d_y = y;
    ...

www.boost.org/libs/lambda    http://thrust.github.com
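Only the device_vector declarations survived extraction. A hedged reconstruction of a complete Thrust SAXPY (the saxpy_functor and the thrust::transform call are my own sketch, not verbatim slide code):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    // Functor applying y = a*x + y elementwise (illustrative reconstruction).
    struct saxpy_functor {
        float a;
        saxpy_functor(float a) : a(a) {}
        __host__ __device__ float operator()(float x, float y) const {
            return a * x + y;
        }
    };

    int main() {
        thrust::host_vector<float> x(1 << 20, 1.0f), y(1 << 20, 2.0f);

        thrust::device_vector<float> d_x = x;   // copy inputs to the GPU
        thrust::device_vector<float> d_y = y;

        // d_y = 2.0f * d_x + d_y, computed on the device
        thrust::transform(d_x.begin(), d_x.end(), d_y.begin(),
                          d_y.begin(), saxpy_functor(2.0f));
        return 0;
    }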
CUDA Fortran

Standard Fortran:

    module mymodule
    contains
      subroutine saxpy(n, a, x, y)
        real :: x(:), y(:), a
        integer :: n, i
        do i = 1, n
          y(i) = a*x(i) + y(i)
        enddo
      end subroutine saxpy
    end module mymodule

Parallel CUDA Fortran:

    module mymodule
    contains
      attributes(global) subroutine saxpy(n, a, x, y)
        real :: x(:), y(:), a
        integer :: n, i
        attributes(value) :: a, n
        i = threadIdx%x + (blockIdx%x-1)*blockDim%x
        if (i <= n) y(i) = a*x(i) + y(i)
      end subroutine saxpy
    end module mymodule
http://developer.nvidia.com/cuda-fortran
Python

Standard Python:

    def saxpy(a, x, y):
        return [a * xi + yi
                for xi, yi in zip(x, y)]

    cpu_result = saxpy(2.0, x, y)

Copperhead: Parallel Python:

    @cu
    def saxpy(a, x, y):
        return [a * xi + yi
                for xi, yi in zip(x, y)]

    with places.gpu0:
        gpu_result = saxpy(2.0, x, y)

    with places.openmp:
        cpu_result = saxpy(2.0, x, y)

http://numpy.scipy.org    http://copperhead.github.com
Enabling Endless Ways to SAXPY