
4/16/09

L17: Lessons from Particle System Implementations

CS6963

Administrative
•  Still missing some design reviews
   -  Please email to me slides from presentation
   -  And updates to reports
   -  By Thursday, Apr 16, 5PM
•  Grading
   -  Lab2 problem 1 graded, problem 2 under construction
   -  Return exams by Friday AM
•  Upcoming cross-cutting systems seminar, Monday, April 20,
   12:15-1:30PM, LCR: “Technology Drivers for Multicore Architectures,”
   Rajeev Balasubramonian, Mary Hall, Ganesh Gopalakrishnan, John Regehr
•  Final Reports on projects
   -  Poster session April 29 with dry run April 27
   -  Also, submit written document and software by May 6
   -  Invite your friends! I’ll invite faculty, NVIDIA, graduate
      students, application owners, ..

Particle Systems
•  MPM/GIMP
•  Particle animation and other special effects
•  Monte-Carlo transport simulation
•  Fluid dynamics
•  Plasma simulations
•  What are the performance/implementation challenges?
   -  Global synchronization
   -  Global memory access costs (how to reduce)
   -  Copy to/from host overlapped with computation
•  Many of these issues arise in other projects
   -  E.g., overlapping host copies with computation in image mosaicing

Sources for Today’s Lecture
•  A particle system simulation in the CUDA Software Developer Kit, called particles
•  Implementation description in /Developer/CUDA/projects/particles/doc/particles.pdf
•  Possibly related presentation in
   http://www.nvidia.com/content/cudazone/download/Advanced_CUDA_Training_NVISION08.pdf
   (this presentation also talks about finite differencing and molecular dynamics)
•  Asynchronous copies in the CUDA Software Developer Kit example called asyncAPI


Relevant Lessons from Particle Simulation
1.  Global synchronization using atomic operations
2.  Asynchronous copy from host to GPU
3.  Use of shared memory to cache particle data
4.  Use of texture cache to accelerate particle lookup
5.  OpenGL rendering

1. Global Synchronization
•  Concept:
   -  We need to perform some computations on particles, and others on grid cells
   -  Existing MPM/GIMP provides a mapping from particles to the grid nodes to which they contribute
   -  We would like an inverse mapping, from grid cells to the particles that contribute to their result
•  Strategy:
   -  Decompose the threads so that each computes results at a particle
   -  Use global synchronization to construct an inverse mapping from grid cells to particles (see the example code on the next slide)
   -  Primitive: atomicAdd

Example Code to Build Inverse Mapping

// gridPos represents a grid cell in 3-d space; index is the index of
// the particle; gridCells is the data structure in global memory for
// the inverse mapping.
__device__ void addParticleToCell(int3 gridPos, uint index,
                                  uint* gridCounters, uint* gridCells)
{
    // calculate grid hash
    uint gridHash = calcGridHash(gridPos);

    // increment cell counter using atomics; atomicAdd returns how many
    // particles had already been added to this cell
    int counter = atomicAdd(&gridCounters[gridHash], 1);
    counter = min(counter, params.maxParticlesPerCell-1);

    // write particle index into this cell (very uncoalesced!)
    gridCells[gridHash*params.maxParticlesPerCell + counter] = index;
}

What this does: builds up gridCells as an array limited by the maximum
number of particles per cell. A sketch of a calling kernel follows below.
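For context, a minimal sketch of how this routine might be driven (not from the SDK): one thread per particle, gridCounters zeroed (e.g., with cudaMemset) before each rebuild, and a hypothetical helper calcParticleGridPos that maps a particle position to its 3-d cell:

__global__ void buildInverseMapping(float4* pos, int numParticles,
                                    uint* gridCounters, uint* gridCells)
{
    uint index = blockIdx.x*blockDim.x + threadIdx.x;
    if (index >= numParticles) return;

    // calcParticleGridPos (hypothetical) maps a position to its grid cell
    int3 gridPos = calcParticleGridPos(pos[index]);
    addParticleToCell(gridPos, index, gridCounters, gridCells);
}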

2. Asynchronous Copy To/From Host
•  Warning: I have not tried this, and could not find a lot of information on it.
•  Concept:
   -  Memory bandwidth can be a limiting factor on GPUs
   -  Sometimes computation cost is dominated by copy cost
   -  But for some computations, data can be “tiled” and computation on tiles can proceed in parallel (some of our projects)
   -  Can we be computing on one tile while copying another?
•  Strategy:
   -  Use page-locked memory on the host, and asynchronous copies
   -  Primitive: cudaMemcpyAsync
   -  Synchronize with cudaThreadSynchronize()


Copying from Host to Device
•  cudaMemcpy(dst, src, nBytes, direction)
   -  Can only go as fast as the PCI-e bus, and is not eligible for asynchronous data transfer
•  cudaMallocHost(…): page-locked host memory
   -  Use this in place of standard malloc(…) on the host
   -  Prevents the OS from paging host memory
   -  Allows PCI-e DMA to run at full speed
•  Asynchronous data transfer
   -  Requires page-locked host memory

Example of Asynchronous Data Transfer

cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst1, src1, size, dir, stream1);
kernel<<<grid, block, 0, stream1>>>(…);
cudaMemcpyAsync(dst2, src2, size, dir, stream2);
kernel<<<grid, block, 0, stream2>>>(…);

src1 and src2 must have been allocated using cudaMallocHost.
stream1 and stream2 identify the streams associated with each asynchronous
call (note the 4th “parameter” to the kernel invocations).
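Filling in what the fragment above leaves out, a minimal end-to-end sketch, assuming float buffers of size bytes and a kernel named kernel; the allocations and the final synchronization are the added pieces:

float *src1, *src2, *dst1, *dst2;
cudaMallocHost((void**)&src1, size);   // page-locked host memory
cudaMallocHost((void**)&src2, size);
cudaMalloc((void**)&dst1, size);       // device destinations
cudaMalloc((void**)&dst2, size);

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

// the copy in stream2 can overlap with the kernel running in stream1
cudaMemcpyAsync(dst1, src1, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream1>>>(dst1);
cudaMemcpyAsync(dst2, src2, size, cudaMemcpyHostToDevice, stream2);
kernel<<<grid, block, 0, stream2>>>(dst2);

cudaThreadSynchronize();               // wait for both streams to complete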

Particle Data has some Reuse
•  Two ideas:
   -  Cache particle data in shared memory (3.)
   -  Cache particle data in texture cache (4.)

Code from Oster presentation
•  Newtonian mechanics on point masses:

struct particleStruct{
    float3 pos;
    float3 vel;
    float3 force;
};

pos = pos + vel*dt
vel = vel + force/mass*dt
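As a concrete illustration (not from the original slides), this Euler update might look like the following as a kernel, one thread per particle; the component-wise arithmetic avoids depending on the SDK’s float3 operator overloads:

__global__ void integrate(particleStruct* P, float dt, float mass, int n)
{
    int idx = threadIdx.x + blockIdx.x*blockDim.x;
    if (idx >= n) return;

    particleStruct p = P[idx];
    // pos = pos + vel*dt
    p.pos.x += p.vel.x*dt;  p.pos.y += p.vel.y*dt;  p.pos.z += p.vel.z*dt;
    // vel = vel + force/mass*dt
    p.vel.x += p.force.x/mass*dt;  p.vel.y += p.force.y/mass*dt;  p.vel.z += p.force.z/mass*dt;
    P[idx] = p;
}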


3. Cache Particle Data in Shared Memory

// One thread per particle. The float3 arithmetic assumes the vector
// operators from the SDK’s cutil_math.h (or equivalent).
__shared__ float3 s_pos[N_THREADS];
__shared__ float3 s_vel[N_THREADS];
__shared__ float3 s_force[N_THREADS];

int tx = threadIdx.x;
int idx = threadIdx.x + blockIdx.x*blockDim.x;
s_pos[tx] = P[idx].pos;
s_vel[tx] = P[idx].vel;
s_force[tx] = P[idx].force;
__syncthreads();
s_pos[tx] = s_pos[tx] + s_vel[tx]*dt;
s_vel[tx] = s_vel[tx] + s_force[tx]/mass*dt;
P[idx].pos = s_pos[tx];
P[idx].vel = s_vel[tx];

4. Use Texture Cache for Read-Only Data
•  Texture memory is a special section of device global memory
   -  Read only
   -  Cached by spatial locality (1D, 2D, 3D)
•  Can achieve high performance
   -  If there is reuse within the thread block, so accesses hit the cache
   -  Useful to eliminate the cost of uncoalesced global memory access
•  Requires special mechanisms for defining a texture, and for accessing it

Using Textures: from Finite Difference Example
•  Declare a texture ref
   texture<float, 2, …> fTex;
•  Bind f to the texture ref via an array
   cudaMallocArray(&fArray, …);
   cudaMemcpy2DToArray(fArray, f, …);
   cudaBindTextureToArray(fTex, fArray, …);
•  Access with texture fetch functions
   float v = tex2D(fTex, x, y);   // in place of f[x][y]
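A fuller version of that fragment, as a sketch with the elided arguments filled in under assumed names (a host array f of width x height floats); note the width argument of cudaMemcpy2DToArray is in bytes:

texture<float, 2, cudaReadModeElementType> fTex;

cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
cudaArray* fArray;
cudaMallocArray(&fArray, &desc, width, height);
cudaMemcpy2DToArray(fArray, 0, 0, f, width*sizeof(float),
                    width*sizeof(float), height, cudaMemcpyHostToDevice);
cudaBindTextureToArray(fTex, fArray, desc);

// inside a kernel, in place of f[x][y]:
float v = tex2D(fTex, x, y);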
Use of Textures in Particle Simulation
•  A macro determines whether textures are used
a. Declaration of texture references in particles_kernel.cu

#if USE_TEX
// textures for particle position and velocity
texture<float4, 1, cudaReadModeElementType> oldPosTex;
texture<float4, 1, cudaReadModeElementType> oldVelTex;
texture<uint2, 1, cudaReadModeElementType> particleHashTex;
texture<uint, 1, cudaReadModeElementType> cellStartTex;
texture<uint, 1, cudaReadModeElementType> gridCountersTex;
texture<uint, 1, cudaReadModeElementType> gridCellsTex;
#endif


Use of Textures in Particle Simulation
b. Bind/unbind textures right before kernel invocation

#if USE_TEX
CUDA_SAFE_CALL(cudaBindTexture(0, oldPosTex, oldPos,
    numBodies*sizeof(float4)));
CUDA_SAFE_CALL(cudaBindTexture(0, oldVelTex, oldVel,
    numBodies*sizeof(float4)));
#endif

reorderDataAndFindCellStartD<<< numBlocks, numThreads >>>(
    (uint2 *) particleHash, (float4 *) oldPos, (float4 *) oldVel,
    (float4 *) sortedPos, (float4 *) sortedVel, (uint *) cellStart);

#if USE_TEX
CUDA_SAFE_CALL(cudaUnbindTexture(oldPosTex));
CUDA_SAFE_CALL(cudaUnbindTexture(oldVelTex));
#endif

Use of Textures in Particle Simulation
c. Texture fetch (hidden in a macro)

#if USE_TEX
#define FETCH(t, i) tex1Dfetch(t##Tex, i)
#else
#define FETCH(t, i) t[i]
#endif

•  Here’s an access in particles_kernel.cu:

float4 pos = FETCH(oldPos, index);

5. OpenGL Rendering
•  OpenGL buffer objects can be mapped into the CUDA address space and then used as global memory
   -  Vertex buffer objects
   -  Pixel buffer objects
•  Allows direct visualization of data from computation
   -  No device-to-host transfer
   -  Data stays in device memory: very fast compute / viz
   -  Automatic DMA from Tesla to Quadro (via host for now)
•  Data can be accessed from the kernel like any other global data (in device memory)

OpenGL Interoperability
•  Register a buffer object with CUDA
   -  cudaGLRegisterBufferObject(GLuint buffObj);
   -  OpenGL can use a registered buffer only as a source
   -  Unregister the buffer prior to rendering to it by OpenGL
•  Map the buffer object to CUDA memory
   -  cudaGLMapBufferObject(void** devPtr, GLuint buffObj);
   -  Returns an address in global memory; the buffer must be registered prior to mapping
•  Launch a CUDA kernel to process the buffer
•  Unmap the buffer object prior to use by OpenGL
   -  cudaGLUnmapBufferObject(GLuint buffObj);
•  Use the buffer object in OpenGL code
•  Unregister the buffer object
   -  cudaGLUnregisterBufferObject(GLuint buffObj);
   -  Optional: needed if the buffer is a render target
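Put together, the sequence might look like the following sketch (not taken from the SDK), using the CUDA 2.x-era interop API above; vbo is assumed to be an existing OpenGL buffer object holding n float4 positions, and advectParticles is a hypothetical kernel:

cudaGLRegisterBufferObject(vbo);             // register once, up front

float4* dPos;
cudaGLMapBufferObject((void**)&dPos, vbo);   // map into CUDA address space
advectParticles<<<numBlocks, numThreads>>>(dPos, n);
cudaGLUnmapBufferObject(vbo);                // unmap before OpenGL uses it

// ... render from vbo with OpenGL ...

cudaGLUnregisterBufferObject(vbo);           // unregister when done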


Final Project Presentation
•  Dry run on April 27
   -  Easels, tape and poster board provided
   -  Tape a set of Powerpoint slides to a standard 2’x3’ poster, or bring your own poster.
•  Final Report on Projects due May 6
   -  Submit code
   -  And a written document, roughly 10 pages, based on the earlier submission.
   -  In addition to the original proposal, include:
      -  Project Plan and How Decomposed (from DR)
      -  Description of CUDA implementation
      -  Performance Measurement
      -  Related Work (from DR)

Final Remaining Lectures
•  This one: Particle Systems
•  April 20: Sorting
•  April 22:
   -  ?
   -  Would like to talk about dynamic scheduling?
   -  If nothing else, the following paper:
      “Efficient Computation of Sum-products on GPUs Through Software-Managed
      Cache,” M. Silberstein, A. Schuster, D. Geiger, A. Patney, J. Owens, ICS 2008.
      http://www.cs.technion.ac.il/~marks/docs/SumProductPaper.pdf
