C++ AMP Errata: Chapter 1: Vectorization (Page 8)
The following is a list of the errata for the book. If you find other errors in the book text, please
report them on the O'Reilly Errata page. For issues related to the code samples, please file an issue on
CodePlex.
The developer is responsible for writing a loop that is parallelizable, of course, and this is the truly hard part of
the job. For example, this loop is not parallelizable in its current form:
    for (int i = 1; i <= n; ++i)
        a[i] = a[i - 1] + b[i];
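One way to see why the loop resists parallelization is to run its iterations in a different order and compare results. The sketch below is plain C++ on the CPU, not C++ AMP; the function names run_forward and run_reversed are hypothetical and chosen only for this illustration. If the iterations were independent, both orders would produce the same array.

```cpp
#include <cassert>
#include <vector>

// The loop from the text. Iteration i reads a[i - 1], which iteration i - 1
// has just written, so each iteration depends on the one before it.
// a and b hold n + 1 elements each; a[0] is the seed value.
std::vector<int> run_forward(std::vector<int> a, const std::vector<int>& b)
{
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size()) - 1;
    for (int i = 1; i <= n; ++i)
        a[i] = a[i - 1] + b[i];
    return a;
}

// The same loop body with the iterations visited in reverse order.
// Each iteration now reads a[i - 1] before it has been updated.
std::vector<int> run_reversed(std::vector<int> a, const std::vector<int>& b)
{
    assert(a.size() == b.size());
    const int n = static_cast<int>(a.size()) - 1;
    for (int i = n; i >= 1; --i)
        a[i] = a[i - 1] + b[i];
    return a;
}
```

With a = {1, 0, 0, 0} and b = {0, 1, 2, 3}, run_forward returns {1, 2, 4, 7} while run_reversed returns {1, 2, 2, 3}. The result depends on the order of the iterations, which is exactly what a parallel schedule cannot guarantee.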
This difference in execution time is due to the writes to outData being uncoalesced. Although the reads from
inData on each thread are from adjacent memory addresses, the writes to outData from consecutive threads
occur on different rows.
inData              outData
 1  2  3  4          1  5  9 13
 5  6  7  8          2  6 10 14
 9 10 11 12          3  7 11 15
13 14 15 16          4  8 12 16
This explains the big difference in performance. The threads (numbered in the diagram above) are writing to
memory locations that are not adjacent. Remember that C++ AMP, like C and C++, stores multi-dimensional data
in row-major order, so the shaded row represents coalesced memory access while the column in outData is an
uncoalesced access.
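The row-major argument can be made concrete by computing the linear offsets that consecutive threads touch. The following is a minimal CPU-side sketch in plain C++, assuming a square matrix and a simple transpose in which thread (row, col) reads inData(row, col) and writes outData(col, row). The helper names row_major_offset, read_offsets and write_offsets are hypothetical and are not part of C++ AMP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear offset of element (row, col) in a row-major matrix with `width` columns.
std::size_t row_major_offset(std::size_t row, std::size_t col, std::size_t width)
{
    return row * width + col;
}

// Offsets read from inData by the consecutive threads (row, 0), (row, 1), ...
std::vector<std::size_t> read_offsets(std::size_t row, std::size_t width)
{
    assert(row < width);  // square matrix assumed
    std::vector<std::size_t> offsets;
    for (std::size_t col = 0; col < width; ++col)
        offsets.push_back(row_major_offset(row, col, width));
    return offsets;
}

// Offsets written to outData by the same threads: thread (row, col)
// writes element (col, row), so neighbouring threads land on different rows.
std::vector<std::size_t> write_offsets(std::size_t row, std::size_t width)
{
    assert(row < width);  // square matrix assumed
    std::vector<std::size_t> offsets;
    for (std::size_t col = 0; col < width; ++col)
        offsets.push_back(row_major_offset(col, row, width));
    return offsets;
}
```

For the 4 x 4 matrices in the diagram, the threads of row 0 read offsets 0, 1, 2, 3 (adjacent, so coalesced) but write offsets 0, 4, 8, 12 (a stride of one full row, so uncoalesced).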
It's actually possible to use tile static memory to mitigate this by adding an additional set of copies, as shown
in the following example:
... Code sample is unchanged. See the original text.
Here the kernel is divided into two phases using the familiar tiled kernel pattern introduced in Chapter 4. In the
first part of the kernel each thread in the tile copies coalesced data from inData in global memory into tile static
localData and transposes it during the copy to tile static memory. After the barrier—which ensures that all threads
have finished the copy—the data is written in a coalesced way back to global memory. Tile static memory has a
much higher bandwidth and smaller interface width than global memory, so the penalty for uncoalesced memory
accesses is far less. By transferring the matrix elements by means of tile static memory and doing the transpose
there, uncoalesced writes to global memory can be eliminated. The diagram shows four tiles (numbered in bold),
each with four threads. The memory accesses for the threads in tile 2 are shown shaded. It clearly shows that the
writes to outData by threads 1 and 2 in tile 2 are now coalesced.
[Diagram: four tiles (numbered 1 to 4 in bold), each with four threads (numbered 1 to 4). The memory accesses for the threads in tile 2 are shaded.]
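The two-phase pattern described above can be imitated on the CPU to check the index arithmetic. The sketch below is not the book's kernel and does not use C++ AMP. It assumes a square n x n row-major matrix whose size is a multiple of the tile size, uses an ordinary vector in place of the tile static localData array, and lets the end of the first loop nest play the role of the barrier. The function name tiled_transpose is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU imitation of a tiled transpose. (ty, tx) identifies a tile and
// (ly, lx) a thread within it; lx varies fastest, like consecutive threads.
std::vector<int> tiled_transpose(const std::vector<int>& in, std::size_t n, std::size_t tile)
{
    assert(tile > 0 && n % tile == 0 && in.size() == n * n);
    std::vector<int> out(n * n);
    std::vector<int> local(tile * tile);  // stands in for tile static localData

    for (std::size_t ty = 0; ty < n / tile; ++ty) {
        for (std::size_t tx = 0; tx < n / tile; ++tx) {
            // Phase 1: consecutive lx read consecutive addresses of `in`
            // (coalesced) and store the value transposed in `local`.
            for (std::size_t ly = 0; ly < tile; ++ly)
                for (std::size_t lx = 0; lx < tile; ++lx)
                    local[lx * tile + ly] = in[(ty * tile + ly) * n + tx * tile + lx];

            // A real kernel needs a tile barrier here so that every thread
            // has finished phase 1 before any thread starts phase 2.

            // Phase 2: consecutive lx write consecutive addresses of `out`
            // (coalesced); the tile lands at the mirrored tile position.
            for (std::size_t ly = 0; ly < tile; ++ly)
                for (std::size_t lx = 0; lx < tile; ++lx)
                    out[(tx * tile + ly) * n + ty * tile + lx] = local[ly * tile + lx];
        }
    }
    return out;
}
```

Calling tiled_transpose on the 4 x 4 inData values 1 to 16 with a tile size of 2 yields 1 5 9 13, 2 6 10 14, 3 7 11 15, 4 8 12 16, matching outData in the earlier diagram. The only strided accesses are the writes to local in phase 1, which correspond to tile static memory, where the text notes the penalty is far smaller.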
Since publication of the book, Microsoft has released the Platform Update for Windows 7. This update enables
both debugger support and the WARP accelerator on Windows 7 and Windows Server 2008 R2. The following
blog posts detail how these features are now enabled.