2.1. Spatial Data Reuse Using Two-Dimensional Multiplier-Accumulator (2-D MAC) Array Structure
This subsection presents how data reuse can be achieved by using parallel processing with the 2-D MAC array structure. The data reuse obtained with the 2-D MAC array can be considered a spatial gain because the data reuse capability is determined by the 2-D MAC array size [15,16].
To analyze the spatial data reuse using a 2-D MAC array structure, we take a simple example, as shown in Figure 1 [22]. Figure 1 illustrates a 2-D MAC array consisting of 16 × 128 MAC units, each of which performs a convolution between the input data and a weight. For simplicity, but without loss of generality, it can be assumed that the number of channels for both the input data and each weight is set to 128, with the number of weights being 16.
For the convolutional operation at each MAC to occur correctly, the memory control shown in Figure 1 is provided in such a way that both the input data and weights are fetched from the global buffer and transferred into the MAC array in accordance with the preset operational order. In other words, the key operation of the memory control is to generate a data sequence in accordance with the preset operational order; this is accomplished by correctly generating the addresses of the global buffer. Once the input data and weights are transferred from the global buffer into the MAC array, each MAC operator performs a multiplication, and the final output feature is generated by accumulating the multiplication results.
The MAC unit in the ith row and jth column of the 2-D array performs the convolution between the input data and the ith weight, where both the input data and the weight correspond to the jth channel. The operation at each MAC is carried out for each pixel of the input data within the corresponding convolutional window.
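The column-channel, row-weight mapping described above can be illustrated with a short sketch. This is an illustrative model, not the authors' hardware: rows index the 16 weights, columns index the 128 channels, and each MAC (i, j) multiplies the channel-j input sample by the channel-j sample of weight i. Note that the same channel-j input value feeds every row of column j, which is precisely the spatial reuse being described.

```python
import numpy as np

# Illustrative model of the 2-D MAC array mapping (rows = 16 weights,
# columns = 128 channels); the array size and channel counts follow the
# example in the text, not any particular hardware implementation.
NUM_WEIGHTS, NUM_CHANNELS = 16, 128

rng = np.random.default_rng(0)
x = rng.standard_normal(NUM_CHANNELS)                 # one input pixel, all channels
w = rng.standard_normal((NUM_WEIGHTS, NUM_CHANNELS))  # 16 weights, 128 channels each

# One cycle of the array: MAC (i, j) forms products[i, j].  The input
# value x[j] is broadcast down column j, i.e., reused by all 16 rows.
products = w * x
outputs = products.sum(axis=1)  # accumulate each row's channel products

# Reference: plain per-weight dot products give the same result.
assert np.allclose(outputs, w @ x)
```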
Because each column of the 2-D MAC array represents the corresponding channel for both the input data and weights, the jth-channel input data are repeatedly used for the convolution with each of the 16 weights in the jth column, where j runs from 1 to 128, as in the example shown in Figure 1. In other words, the input data can be reused for each of the 16 weights in each column. In particular, data reuse can be accomplished for the 128-channel input data at each column for as many weights as there are rows in the given 2-D MAC structure: 16 in the example of Figure 1.
As discussed above, the data reuse provided by the 2-D MAC array is available only within each column of the array, meaning that the data reuse effect is equivalent to that of a 1-D MAC array with the same number of rows.
To expand the data reuse, the procedure of computing the convolution at each MAC should be modified in such a way that data reuse can be provided for each row, as well as for each column. To allow data reuse for a given weight along each row, the weight should remain the same in each row. The weight value at each row is taken from the corresponding channel because the weight should be convolved with the input data of the same channel. For the weight to remain the same in each row, each of the 128 columns must represent a single channel. To accomplish this, instead of performing row-wise parallel processing with each of the 128-channel data, parallel processing should be performed with the different pixel input data of the same channel at each row of the 2-D MAC array. Then, the data reuse for the input data (as well as that for the weight) can be accomplished for both the columns and rows of the 2-D MAC array.
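The modified mapping above can be sketched in the same illustrative style. Here each of the 128 columns holds one channel, the 16 rows of a column hold 16 different pixels of that channel, and a single weight value per channel is broadcast down its column, so the weight stays the same along each row and is reused 16 times; the names and sizes below follow the running example, not a specific implementation.

```python
import numpy as np

# Illustrative model of the modified mapping: columns = 128 channels,
# rows = 16 different pixels of the same channel.  One weight value per
# channel is shared by all 16 rows of its column, so the weight is
# reused 16x while 16 output positions are computed in parallel.
ROWS, CHANNELS = 16, 128

rng = np.random.default_rng(1)
pixels = rng.standard_normal((ROWS, CHANNELS))  # pixel p of channel c at [p, c]
w_pix = rng.standard_normal(CHANNELS)           # one weight pixel per channel

partial = pixels * w_pix       # weight broadcast along the rows (reuse)
psums = partial.sum(axis=1)    # accumulate channels per output position

assert np.allclose(psums, pixels @ w_pix)
```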
Involving the input data of all the different pixels in each row, however, also increases the width of the output feature pixels. This would result in a considerable increase in the buffer size required for storing the partial sums corresponding to each of the output feature pixels [15], potentially imposing a serious limit on hardware implementation [23,24]. Unless this problem is resolved, the data reuse factor cannot be set to a sufficiently high value in any of the conventional methods [17,18,19,20,21,22]. Furthermore, as mentioned earlier in this subsection, the spatial data reuse factor is fixed by the size of the 2-D MAC array. In Section 3, we present a new technique that allows the data reuse factor to be set arbitrarily by introducing a local register file whose size can be freely chosen.
2.3. Global Buffer Access Pattern
As summarized in Section 2.1 and Section 2.2, spatial data reuse can be obtained in accordance with the given 2-D MAC array structure, whereas temporal data reuse is applicable regardless of the MAC array structure. Because the two different types of data reuse methods are independent of each other, they can be implemented together on a given 2-D MAC array. More specifically, it was demonstrated in [22] that both spatial data reuse and temporal data reuse can be exploited for the input data and weights, respectively, when implementing a convolutional accelerator with a 2-D MAC array. As mentioned earlier, however, the width of the output feature pixels increases as either the spatial or temporal data reuse factor is increased.
In this subsection, we analyze the global buffer read pattern when both the spatial and temporal data reuses are simultaneously exploited. The objective is to find the input data pixels commonly used for convolutional operations with different weight pixels. Using the analysis given in this subsection, we suggest a novel method of exploiting both spatial and temporal data reuse, in which the latter allows for the reuse of both the input data and weight values. Notably, temporal data reuse was allowed only for the weight pixels in previous works [17,18,19,20,21,22].
In principle, each convolutional operation consists of two steps: first, the input data pixels are multiplied by the corresponding weight pixels; then, the multiplication results are summed. As a result of this operation, a corresponding output feature pixel is generated, and the operation is repeated over the entire set of input data pixels. To apply temporal data reuse, however, a number of input data pixels are first multiplied by a given weight pixel, and this operation is repeated for every weight pixel. The output feature pixels in this case cannot be obtained until the multiplication between every weight pixel and each of the input data pixels is completed. The number of input data pixels processed with each weight pixel, i.e., the temporal data reuse factor, is predetermined, as discussed in Section 2.2; it is set to 16 in our implementation, as discussed in Section 3 and Section 4.
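The temporal-reuse ordering described above can be sketched in one dimension for brevity. This is a minimal model, assuming a reuse factor of 16 and an illustrative three-pixel weight row: each weight pixel is fetched once and multiplied with all 16 input pixels before the next weight pixel is fetched, and the output pixels are complete only after every weight pixel has been applied.

```python
# Minimal 1-D sketch of temporal data reuse (illustrative values):
# each weight pixel is reused P times before the next is fetched,
# with partial sums held until all weight pixels have been applied.
P = 16                      # temporal data reuse factor
K = 3                       # weight pixels in one row (1-D for brevity)
x = list(range(P + K - 1))  # the 18 input pixels one block row needs
w = [2.0, -1.0, 0.5]        # illustrative weight values

psum = [0.0] * P            # one partial sum per output pixel
for k in range(K):          # weight pixel w[k] is reused P times
    for p in range(P):
        psum[p] += x[p + k] * w[k]

# Reference: direct valid convolution gives the same 16 outputs.
ref = [sum(x[p + k] * w[k] for k in range(K)) for p in range(P)]
assert psum == ref
```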
It can be assumed that each of the 16 weights shown in Figure 1 consists of nine (= 3 × 3) pixels, as shown on the right-hand side of Figure 2. To exploit temporal data reuse with a reuse factor of 16, 16 input data pixels should be read from the global buffer to be convolved with the corresponding weight pixel.
Figure 2 shows how the input data should be read to provide temporal data reuse with a reuse factor of 16. It can be observed that each set of 16 input data pixels, {(0,0), (0,1), …, (0,15)}, {(0,1), (0,2), …, (0,16)}, and {(0,2), (0,3), …, (0,17)}, is convolved with the corresponding weight pixel, (0,0), (0,1), and (0,2), during the first, second, and third time periods, respectively. In other words, to read each set of 16 input data pixels from the global buffer during the corresponding period, the IIC generates the addresses of the corresponding data.
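The read sequence just described can be sketched as a small address generator. This is an illustrative model of the access pattern, not the authors' IIC logic: for the kth weight pixel of a row, the 16 input pixels at columns k through k + 15 of that row are read from the global buffer.

```python
# Illustrative sketch of the per-row read sequence: for weight pixel k,
# read input columns k .. k+15 (reuse factor 16, kernel width 3).
def block_read_addresses(reuse_factor=16, kernel_width=3):
    """Yield (weight_pixel_index, list_of_input_columns) for one block row."""
    for k in range(kernel_width):
        yield k, list(range(k, k + reuse_factor))

seq = dict(block_read_addresses())
assert seq[0] == list(range(0, 16))   # {(0,0), ..., (0,15)}
assert seq[1] == list(range(1, 17))   # {(0,1), ..., (0,16)}
assert seq[2] == list(range(2, 18))   # {(0,2), ..., (0,17)}
```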
In fact, Figure 2 explicitly shows which 16 of the 3 × 18 input data pixels in a block are to be read from the global buffer, and with which weight pixel they are to be convolved, during each time period. The operational procedure shown in Figure 2 represents the convolution between each of the 16 input data pixels and a corresponding weight pixel for producing 16 output feature pixels. In other words, the operational procedure shown in Figure 2 corresponds only to the convolution for a single block of 3 × 18 input data pixels to produce 16 output feature pixels. Here, we define the term input data block to denote the set of input data pixels needed to generate 16 output feature pixels. In general, for the convolutional operations shown in Figure 2, one input data block includes K × (P + K − 1) pixels, where P denotes the temporal data reuse factor and the weight dimension is K × K.
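The block-size expression can be checked directly. Assuming the running example's values (reuse factor P = 16, weight dimension K = 3), one block spans K rows by P + K − 1 columns:

```python
# Check of the block-size expression: one input data block spans
# K rows by (P + K - 1) columns, i.e., K * (P + K - 1) pixels.
def block_pixels(P, K):
    return K * (P + K - 1)

# With P = 16 and a 3 x 3 weight, the block is 3 x 18 = 54 pixels.
assert block_pixels(P=16, K=3) == 54
```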
Let us consider input data of arbitrary size, for example, the input data pixels shown in Figure 3. The objective is to find the input data read patterns among the different blocks. Out of the input data pixels shown in Figure 3, we take three example blocks. The first block, shown at the top of Figure 3, is a 3 × 18 block consisting of the three rows {(0,0), (0,1), …, (0,17)}, {(1,0), (1,1), …, (1,17)}, and {(2,0), (2,1), …, (2,17)}. The second block, shown at the center of Figure 3, consists of the three rows {(0,16), (0,17), …, (0,20), (1,0), (1,1), …, (1,14)}, {(1,16), (1,17), …, (1,20), (2,0), (2,1), …, (2,14)}, and {(2,16), (2,17), …, (2,20), (3,0), (3,1), …, (3,14)}. The third block, shown at the bottom of Figure 3, consists of the three rows {(1,13), (1,14), …, (1,20), (2,0), (2,1), …, (2,11)}, {(2,13), (2,14), …, (2,20), (3,0), (3,1), …, (3,11)}, and {(3,13), (3,14), …, (3,20), (4,0), (4,1), …, (4,11)}.
Now, let us take a closer look at the convolutional operations corresponding to each of the three blocks of input data pixels. The input data pixels shown on the right-hand side of Figure 3 are convolved with the first-row weight pixels. Our explanation here is given only for the convolution with the first-row weight pixels because the convolutional operations corresponding to the second- and third-row weight pixels are exactly the same as those for the first-row weight pixels.
First, when the block has an entirely rectangular shape, as in the case of the first 3 × 18 block shown at the top of Figure 3, the input data read pattern is such that 15 of the 16 input data pixels are used in common for the convolution with two consecutive weight pixels in the same row. In particular, after the 16 input data pixels {(0,0), (0,1), …, (0,15)} have been read from the global buffer and processed for the convolution with weight pixel (0,0), the input data pixels {(0,1), (0,2), …, (0,16)} should be read from the global buffer for the convolution with weight pixel (0,1). Consequently, 15 of the 16 input data pixels are read repeatedly from the global buffer. This global buffer read pattern is repeated for every weight pixel in each row.
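The saving implied by this overlap is easy to quantify. Assuming the running example's values (reuse factor 16, kernel width 3), a sketch of the per-block-row read counts with and without exploiting the 15-of-16 overlap:

```python
# Global-buffer reads for one block row, with and without exploiting
# the 15-of-16 overlap between consecutive weight pixels in a row.
P, K = 16, 3                  # temporal reuse factor, kernel width

naive_reads = K * P           # re-read all 16 pixels for each weight pixel
reused_reads = P + (K - 1)    # 16 reads up front, then 1 new pixel per step

assert naive_reads == 48
assert reused_reads == 18     # exactly the block-row width

# Consecutive read sets indeed share 15 of their 16 pixels.
assert len(set(range(0, 16)) & set(range(1, 17))) == 15
```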
In contrast, for the convolution of the second or third block of input data pixels, shown at the center and bottom of Figure 3, respectively, the input data read pattern is quite different. Figure 3 shows which input data pixels are used in common for the convolution with two consecutive weight pixels in the three different cases. Although the global buffer read patterns vary significantly among the three cases, many input data pixels are used in common for the convolution with two consecutive weight pixels. In the following section, we present how to minimize global buffer access by exploiting this analysis of the global buffer read pattern.
Figure 4 shows the entire convolutional operation comprising all of the input data blocks. It is assumed that the dimensions of the entire input dataset are such that every input data block takes an entirely rectangular shape, as in the first block shown at the top of Figure 3. This condition is satisfied if the width of the input data pixels is 16i + 18 for i = 0, 1, 2, … when the temporal data reuse factor is 16 and the dimensions of the weight pixels are 3 × 3. In general, the condition for each input data block to be of the type of the first block shown at the top of Figure 3 is satisfied if the width of the output feature pixels is a multiple of the temporal data reuse factor.
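This width condition can be verified with a one-line check. Assuming a 3 × 3 weight (valid convolution, so the output row is 2 pixels narrower than the input) and a temporal reuse factor of 16, an input width of 16i + 18 yields an output width of 16i + 16, a multiple of the reuse factor:

```python
# Check of the rectangular-block condition: for a 3 x 3 weight, valid
# convolution makes the output row (kernel_width - 1) pixels narrower
# than the input row.
def output_width(input_width, kernel_width=3):
    return input_width - (kernel_width - 1)

# Input widths 18, 34, 50, 66 give output widths 16, 32, 48, 64,
# all multiples of the reuse factor 16.
for i in range(4):
    assert output_width(16 * i + 18) % 16 == 0
```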
Table 1 summarizes all the indices used in the convolutional operations discussed herein.
Thus far, we have observed how each of the input data pixels in a given block should be read from the global buffer for convolution with the corresponding weight pixel. From this observation, it has been found that some of the input data pixels are used repeatedly for the convolution with the next weight pixel, meaning that these overlapping input data pixels can be reused such that they do not have to be read again from the global buffer. However, it has also been found that the pattern of the repeated input data pixels varies depending on which input data block is to be convolved with the present weight pixel.
In addition to the analysis of the global buffer read pattern within a given block of input data pixels, the read pattern across blocks can also be observed. In other words, we want to find which of the input data pixels read for the convolutional operations of one block can be reused for the convolutional operations of the next block. In Figure 3, the input data pixels inside the yellow box, for example, are used for both the first and second blocks. Similarly, the input data pixels inside the blue box are used for both the second and third blocks. In other words, the six input data pixels located in the last two columns of every input data block are always identical to those located in the first two columns of the next block, indicating that they are read repeatedly for the convolutional operations of the two blocks.
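The six-pixel interblock overlap follows from the block geometry. Assuming the running example's values (reuse factor 16, 3 × 3 weight), consecutive rectangular blocks of width 18 start 16 columns apart, so they share their last and first K − 1 = 2 columns:

```python
# Interblock overlap for rectangular blocks: consecutive blocks of
# width P + K - 1 start P columns apart, sharing K - 1 columns.
P, K = 16, 3
block_width = P + K - 1                       # 18 columns per block

block0_cols = set(range(0, block_width))      # columns 0 .. 17
block1_cols = set(range(P, P + block_width))  # next block starts at 16

shared = block0_cols & block1_cols
assert shared == {16, 17}                     # the last/first two columns
assert K * len(shared) == 6                   # 3 rows x 2 columns = 6 pixels
```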
In addition to the six input data pixels, it can be observed from Figure 3 that quite a few other groups of input data pixels are used in common for the convolutional operations of the two different blocks of input data pixels. For example, the input data pixels inside the purple and pink boxes are used in common for both the first and second blocks and both the second and third blocks, respectively.
However, the pattern of repeated input data pixels between the first and second blocks is not the same as that between the second and third blocks, as denoted in purple and pink, respectively, in Figure 3. In other words, although some input data pixels are used in common by two different blocks, the pattern of the repeatedly used input data pixels may vary from block to block. In particular, it is impossible to exploit the commonly used input data pixels for data reuse unless their locations in each of the two consecutive blocks are fixed.
Nevertheless, if we can exploit the repeated use of input data pixels among the interblock convolutional operations, we can reduce the global buffer access for the interblock operations, as well as for the convolutions within a given block of input data pixels. In the following section, we present a novel procedure for rearranging the convolutional operations such that a group of input data pixels used repeatedly appears with a fixed regularity. By doing so, we can significantly reduce the global buffer access required for reading the input data pixels.