Graph application workloads are dominated by random memory accesses with poor locality. To tackle the irregular and sparse nature of graph computation, ReRAM-based Processing-in-Memory (PIM) architectures have been proposed recently. Most of these ReRAM architecture designs focus on mapping graph computations into a set of multiply-and-accumulate (MAC) operations. ReRAMs also offer a key advantage in reducing memory latency between cores and memory by allowing for PIM. However, when implemented on a ReRAM-based manycore architecture, graph applications still pose two key challenges: significant storage requirements (particularly due to wasted zero-cell storage) and a significant amount of on-chip traffic. To address these two challenges, in this article we propose the design of a 3D NoC-enabled ReRAM-based manycore architecture. Our proposed architecture incorporates a novel crossbar-aware node reordering scheme to reduce ReRAM storage requirements, and its 3D NoC-enabled design reduces on-chip communication latency. Our architecture outperforms the state-of-the-art in ReRAM-based graph acceleration by up to 5× in performance while consuming up to 10.3× less energy for a range of graph inputs and workloads.
1 Introduction
Graphs have become ubiquitous in several data-driven applications and machine learning workflows, as they offer an effective way to model networked behavior in both the natural world and human-engineered systems. However, with steep increases in both the volume of observable data and the diversity in applications, scalable processing for graph workloads on emerging manycore platforms remains a challenge. While CPU- and GPU-based manycore platforms continue to be used for executing graph applications, poor locality in graph structures and irregular data access patterns pose significant challenges. Skewed vertex degree distributions of real-world graphs make it nearly impossible to maintain high locality in graph structures, causing repeated accesses to vertex neighborhoods or random walk traversals to incur a high volume of cache misses. Furthermore, the deep memory hierarchies in conventional manycore architectures (such as CPUs and GPUs) exacerbate the cost of data movement [1].
Resistive random-access memory (ReRAM)-based Processing-in-Memory (PIM) modules offer an effective way to address the high memory bandwidth requirement of graph analytics by integrating the computing logic in the memory. The ReRAM crossbars can store the adjacency matrix of a graph, and the computation in most graph primitives can be decomposed into multiply-and-accumulate (MAC) operations, which are supported by ReRAM. However, most real-world graphs are sparse, i.e., with far fewer nonzero cells than zero cells, causing significant wasted storage across the ReRAM crossbars (as only nonzeros contribute to meaningful computation). One way to reduce storage as well as improve locality in the distribution of nonzeros is through vertex (re)ordering [2]. By assigning similar ranks to vertices that are also neighbors in the graph, reordering techniques can effectively cluster the nonzero cells along the main diagonal of the adjacency matrix. While this increased density of nonzero cells can reduce wasted storage on ReRAMs, current vertex reordering schemes are not fully equipped to capitalize on this potential as they do not consider the crossbar structure of ReRAMs [3]. Second, current ReRAM-based approaches [2, 3] do not support an efficient communication backbone between ReRAM-based processing elements (PEs). Graph computations frequently feature irregular memory accesses, including long-range traffic between PEs, which can degrade overall performance and energy efficiency.
In this article, we address the above limitations of ReRAM-based graph acceleration by presenting the design of an efficient 3D Network-on-Chip (NoC)-enabled ReRAM manycore accelerator for graph analytics. The main contributions are as follows:
(1) (Software-level). To improve performance and reduce storage for ReRAM-based graph applications, we propose an efficient crossbar-aware vertex reordering-based approach.
(2) (Hardware-level). To reduce communication latency for irregular graph workloads, we present the design of a 3D NoC architecture that optimizes ReRAM block placement on the manycore platform.
(3) (Evaluation). We present a thorough evaluation of our proposed architecture on various real-world graph inputs using different graph operations—namely, PageRank, Single Source Shortest Path (SSSP), Connected Components (CC), BFS, and Triangle counting. Our proposed framework significantly outperforms existing state-of-the-art ReRAM-based graph accelerators both in terms of execution time and energy consumption.
2 Background and Related Work
Designing specialized manycore architectures for graph analytics has been an area of active research in recent years. Though CPU- and GPU-based manycore platforms have been used, the data movement due to irregular memory accesses limits their performance and energy efficiency. One possible remedy is to modify the organization of caches and partition them into multiple planar layers in a 3D structure to improve the cache hit rate [4]. DRAM-based Hybrid Memory Cube (HMC) designs are another way to enhance the performance of graph accelerators [5, 24, 25]. However, the deep memory hierarchies in these architectures degrade the overall performance.
The performance of most current ReRAM-based accelerators is limited by the sparsity and lack of locality in graph structures [6, 7]. Vertex reordering techniques can help by clustering the non-zero elements of the graph adjacency matrix [2, 3]. Yet, in almost all existing ReRAM-based graph accelerators, either the reordering technique is unaware of the crossbar structure, or the crossbar-bounded mapping does not fully exploit the benefit introduced by the clustering of non-zero entries.
Another factor influencing performance is the cost of data movement. An efficient communication backbone for inter-PE exchanges is critical; however, existing ReRAM-based graph accelerators do not support such efficient and scalable on-chip communication [3]. An optimized placement of the PEs together with a suitable NoC design has been shown to significantly improve the overall latency and energy efficiency, including for graph analytics [8]. However, these NoC architectures do not consider ReRAM-based PEs. A 3D NoC-based ReRAM architecture has been proposed for training graph neural networks (GNNs), which involve dense weight matrices [15]; these dense computations make GNN workloads different from the sparse graph workloads considered here. Hence, in this article, we bridge the gap in the state-of-the-art of ReRAM-based graph accelerators by designing a crossbar-aware vertex reordering scheme (software-level) complemented with an optimized NoC architecture (hardware-level) to achieve high performance and energy efficiency. We postulate that optimizing solely at either the software level or the hardware level is inadequate, as the gains achieved at one layer can be lost in the other if it is left unoptimized. In contrast, our software-hardware design is better positioned to generate significant performance gains because of its complementary nature, i.e., reducing data movement and storage requirements in software while reducing communication latency in hardware.
3 Vertex Reordering
Preliminaries: Graph computations involve traversing the sparse adjacency matrix corresponding to the input graph. Since only the nonzero values of the matrix contribute to work, reducing the storage of zeros becomes an important consideration. One way to achieve this is to rearrange the rows and columns of the adjacency matrix such that the nonzero cells are "clustered" in only some regions of the matrix, so that the vast remaining sections of the matrix, which contain only zero cells, need not be stored.
Vertex (re)ordering is an effective way to perform such clustering [9]. Given an input graph \(G = (V,E)\) with n vertices (in V) and m edges (in E), the goal is to compute a linear ordering \(\varPi : V \to [1,n]\) that assigns a rank \(\varPi(i)\) to every vertex \(i \in V\), such that the average linear gap \(|\varPi(i) - \varPi(j)|\) between any two neighbors \((i,j) \in E\) is minimized. We refer to the original input ordering as the graph's natural ordering (\(\varPi(i) = i\) for each \(i \in V\)). The process of taking a natural ordering and producing a different vertex ordering is referred to as "reordering". Several heuristics have been proposed to generate reorderings [10], ranging from light-weight (e.g., degree-based) to more heavy-weight (window- and partitioning-based) schemes [9]. However, most existing node reordering algorithms are designed assuming a more traditional parallel platform (multicores, cluster computing) and remain oblivious to the ReRAM crossbar structure. Two recently proposed ReRAM-based graph accelerators (GraphSAR [3] and Spara [2]) leveraged vertex reordering techniques, which helped them outperform several well-known prior designs (e.g., GraphR [6] and HyVE [7]), making them appropriate baselines to consider.
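As a concrete illustration of this objective, the following Python sketch (ours, not part of any of the cited implementations) computes the average linear gap of a candidate ordering; reordering heuristics aim to make this value small.

```python
# Minimal sketch of the reordering objective described above (illustrative only).
# Given an edge list and a candidate ordering pi (vertex -> rank), compute the
# average linear gap between the ranks of adjacent vertices.

def average_linear_gap(edges, pi):
    """edges: iterable of (i, j) pairs; pi: dict or list mapping vertex -> rank."""
    gaps = [abs(pi[i] - pi[j]) for (i, j) in edges]
    return sum(gaps) / len(gaps)

edges = [(0, 1), (0, 5), (1, 2), (4, 5)]
natural = {v: v for v in range(6)}           # natural ordering: pi(i) = i
print(average_linear_gap(edges, natural))    # 2.0
```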
GraphSAR [3] proposes a vertex reordering technique where the rank of each vertex is assigned in incremental order depending on its position in the original graph input file. More specifically, while loading the original edge list, an index is assigned to each new vertex starting from 0. For example, if vertices 1 and 3 form the first edge listed in the input file, then these two vertices are renumbered as 0 and 1, respectively (i.e., 1 → 0 and 3 → 1). Subsequently, any vertex encountered for the first time is assigned the next unallocated rank in an incremental fashion. This implies that the vertex reordering depends on the order in which the list of edges is provided at input. Figures 1(a) and (b) show the adjacency matrices of the original graph and the graph reordered by GraphSAR. Considering a ReRAM crossbar of size 2 × 2 (for illustration purposes only), we can compare the number of active blocks of the original and reordered graphs in Figures 1(a) and (b): GraphSAR reduces the number of active blocks from 12 in the natural ordering to 11. This reordering scheme is oblivious to the underlying crossbar configuration.
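For reference, a minimal sketch of this first-encounter renumbering (our reconstruction from the description above, not GraphSAR's code) is:

```python
# First-encounter renumbering as described for GraphSAR [3]: while streaming
# the edge list, each vertex seen for the first time receives the next
# unallocated rank. The result therefore depends on the edge order.

def first_encounter_reorder(edge_list):
    rank = {}
    for (u, v) in edge_list:
        for x in (u, v):
            if x not in rank:
                rank[x] = len(rank)
    return rank

print(first_encounter_reorder([(1, 3), (3, 7), (1, 7)]))  # {1: 0, 3: 1, 7: 2}
```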
Fig. 1. The adjacency matrices for (a) original (Natural) and reordered graph for (b) GraphSAR, (c) Spara and (d) CARE.
Spara [2] uses different graph formats (compressed sparse row (CSR) and compressed sparse column (CSC)) to determine the ranks for the destination and source vertices of an edge. Starting from an initial vertex, it searches for its destination vertex set using the CSR-formatted graph. Next, each node in the destination vertex set is analyzed one by one to obtain a new source vertex set based on the CSC representation. Based on that source vertex set, it then finds the new destination vertex set, repeating until it reaches a bounded threshold that directly depends on the crossbar size. Figure 1(c) shows the adjacency matrix of the reordered graph using Spara. Considering a threshold of two for illustration purposes, we can see from Figure 1(c) that the number of active blocks in the reordered graph is nine, whereas it is twelve for the original graph. Hence, clustering the edges by reordering reduces the number of active crossbars.
However, this crossbar-aware feature does not fully exploit the advantage introduced by the clustering, leading to suboptimal use of ReRAM-based architectures. Moreover, both Spara and GraphSAR rely on sequentially processing the vertices to determine the new vertex labels, making both algorithms inherently sequential. Hence, we present a new crossbar-aware vertex reordering scheme called CARE that improves the clustering of the adjacency matrix and thus reduces the total number of "active blocks."
Crossbar-Aware Vertex Reordering (CARE) Algorithm: A matrix block of size X × X is considered active if it contains at least one non-zero cell. The objective of the CARE algorithm is to minimize the total number of active blocks (via reordering of rows and columns), which in turn reduces the execution time, storage requirement, and power consumption.
Terminology: For a given adjacency matrix A, a row panel of size l starting at row r is a slice of A that includes all rows from r to r + l − 1. Let col_seg(j, r, l) denote a column segment of a given column j of length l starting at row r, i.e., the contiguous slice A[r:r + l − 1, j]. A column segment col_seg(j, r, l) is considered "active" if at least one of its cells is non-zero. Similarly, a 2D block of matrix A is considered "active" if at least one of its cells is non-zero. Let \(active\_col(p)\) represent the set of column IDs of all the non-zero cells in row p. We define the similarity of two rows, p and q, as the Jaccard similarity of their active columns, i.e.,
\[ sim(p,q) = \frac{| active\_col(p) \cap active\_col(q) |}{| active\_col(p) \cup active\_col(q) |}. \]
The CARE algorithm is based on the following main ideas: (i) for a crossbar of size X, the number of active blocks is positively correlated with the number of active column segments; (ii) grouping rows with a high similarity can reduce the total number of active column segments; and (iii) empty 2D blocks within a row panel can be safely ignored.
Row ordering: Building on these ideas, CARE first reorders rows so that rows with high similarity are placed together and then reorders the columns to minimize the total number of active 2D blocks. First, rows are reordered so that similar rows are assigned contiguous row ids. Jaccard similarity can be used to group and reorder the rows; however, such an approach is expensive (\(O({n}^2\sigma )\), where n is the number of rows and \(\sigma\) the average number of nonzeros per row). Alternatively, a light-weight approach is to sort the rows based on their number of non-zeros; intuitively, vertex (row) pairs that share a high Jaccard similarity must also have similar degrees (a necessary but not sufficient condition). Once the rows are reordered, the set of rows is partitioned into row panels of size X. Figure 2 depicts phase 1 of our algorithm: Figure 2(a) shows the original adjacency matrix A, Figure 2(b) shows the state of A after row sorting, and Figure 2(c) shows the conceptual view of A after the X-way row panel split.
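A minimal Python sketch of this first phase, using the degree-based proxy described above (the helper names are ours, for illustration only), is:

```python
# Sketch of CARE phase 1 under the description above (illustrative, not the
# authors' code). Rows are sorted by their number of nonzeros, a cheap proxy
# for Jaccard similarity, then split into row panels of height X.

def jaccard(row_p, row_q):
    """Similarity of two rows given their sets of active (nonzero) column ids."""
    return len(row_p & row_q) / len(row_p | row_q)

def care_row_order(active_cols, X):
    """active_cols: dict row_id -> set of column ids with nonzeros."""
    order = sorted(active_cols, key=lambda r: len(active_cols[r]), reverse=True)
    panels = [order[i:i + X] for i in range(0, len(order), X)]
    return order, panels

rows = {0: {1, 4}, 1: {0}, 2: {1, 3, 4}, 3: {2}}
order, panels = care_row_order(rows, X=2)
print(order, panels)   # [2, 0, 1, 3] [[2, 0], [1, 3]]
```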
Fig. 2. Overview of CARE reordering algorithm (X represents crossbar size).
Column ordering: The second step of the approach is to reorder the columns. While existing approaches reorder whole columns, we reorder the column segments within each row panel (without explicitly renumbering the column ids), allowing for a better clustering of nonzero cells. For each row panel, we find the list of active column segments. The active column segments are then reordered such that the first active segment is placed in column 0, the second in column 1, and so on. In other words, all the active columns are grouped together and moved to the left side, leaving the non-active columns grouped together on the right side. A separate array per row panel is used to record the original column ids (metadata). Figure 2(c) shows the state after reordering, where the blue boxes represent the active blocks. Clustering the subgraphs with active elements into the left columns allows the inactive blocks on the right side of the adjacency matrix to be discarded. As the ReRAM crossbars store only the active blocks, this reduces the storage requirement. Moreover, the locality improvement by the CARE reordering scheme brings a vertex closer to its neighboring vertices and thus decreases the on-chip traffic. Figure 1(d) shows the adjacency matrix of the reordered graph using CARE; here, the value of X is two for illustration purposes. We can see from Figures 1(a), (b), (c), and (d) that the number of active blocks is 12, 11, and 9 for the natural, GraphSAR, and Spara orderings, respectively, whereas it is 7 for CARE. However, though CARE substantially reduces the number of active blocks, inter-PE communication is significant when irregular graph workloads are mapped onto a ReRAM-based manycore architecture. Any ReRAM-based architecture must be divided into multiple ReRAM tiles with bounded crossbar size; hence, inter-PE traffic is inevitable. Therefore, to reduce communication latency for irregular graph workloads, we present the design of a 3D NoC that optimizes ReRAM-based PE placement on the manycore platform in the next section.
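The column-segment packing within one row panel can be sketched as follows (again illustrative code with hypothetical helper names); only the packed active segments occupy crossbars, and a per-panel metadata array preserves the original column ids:

```python
# Sketch of CARE phase 2 per the description above: within each row panel,
# active column segments are packed to the left, and a per-panel metadata
# array remembers the original column id of each packed segment. Only
# ceil(#active_segments / X) blocks per panel remain "active".
import math

def pack_panel(active_cols, panel, X):
    """Return (metadata, n_active_blocks) for one row panel of height <= X."""
    active_segments = sorted(set().union(*(active_cols[r] for r in panel)))
    metadata = active_segments               # packed position -> original column id
    n_active_blocks = math.ceil(len(active_segments) / X)
    return metadata, n_active_blocks

rows = {0: {1, 4}, 1: {0}, 2: {1, 3, 4}, 3: {2}}
meta, blocks = pack_panel(rows, panel=[2, 0], X=2)
print(meta, blocks)   # [1, 3, 4] 2  -> columns 1, 3, 4 are active; two 2x2 blocks
```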
4 Overall Architecture
In this section, we present the key attributes of our proposed architecture, including the ReRAM-based tile (Section 4.1) and the NoC (Section 4.2).
4.1 Tiled Architecture
In ReRAM-based accelerators, the adjacency matrix of the input graph is stored across the ReRAM cells, and graph computations are decomposed into a set of MAC operations that are performed based on Ohm's law and Kirchhoff's current law. By applying a voltage to a word-line and sensing the resultant current along a bit-line, the product of the input voltage and the cell conductance is computed; the sum is obtained through current summation over the bit-lines. Each row computes a product by streaming in the multiplicand via the word-line Digital-to-Analog Converter (DAC). The overall system consists of multiple ReRAM PEs, where each PE contains several ReRAM tiles. Each ReRAM tile is composed of several crossbars and the associated peripherals [15].
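The following toy snippet illustrates the crossbar dot-product principle in the digital domain (it is not a device or circuit model): each cell contributes a current proportional to the product of its conductance and the applied word-line voltage, and each bit-line accumulates these currents.

```python
# Toy illustration of the crossbar MAC principle described above: each cell's
# current is V * G (Ohm's law), and the currents on a bit-line sum
# (Kirchhoff's current law), yielding one dot product per column.

def crossbar_mac(voltages, conductances):
    """voltages: length-R input vector; conductances: R x C cell matrix."""
    n_cols = len(conductances[0])
    return [sum(voltages[r] * conductances[r][c] for r in range(len(voltages)))
            for c in range(n_cols)]

print(crossbar_mac([1.0, 0.5], [[2.0, 0.0],
                                [4.0, 1.0]]))   # [4.0, 0.5]
```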
We use a simple strategy to map each active block (blue boxes in Figure 2(c)) to ReRAM tiles: each active block is assigned a sequential id S, which determines the unique tile (i, j) that stores it, so that active blocks are distributed across the tiles in order of their ids.
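As an illustration only (the exact mapping formula is an assumption on our part), a sequential row-major assignment of block ids onto a T × T tile grid could look like:

```python
# Illustrative sketch of a sequential block-to-tile mapping; the row-major
# formula below is an assumption, not the exact mapping of the design.
# Active block S is placed at tile (i, j) with i = (S // T) % T and j = S % T,
# wrapping around once S >= T * T.

def block_to_tile(S, T):
    """Map sequential active-block id S onto a T x T tile grid, row-major."""
    i = (S // T) % T
    j = S % T
    return i, j

print([block_to_tile(S, T=4) for S in range(6)])
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1)]
```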
When irregular graph workloads are mapped onto a ReRAM-based manycore architecture, inter-PE communication is significant. To analyze the effects of inter-PE communication, we consider three graph applications, viz., PageRank, SSSP, and CC. The six datasets considered in this work (Table 1) are taken from the Stanford Network Analysis Platform [18] and the Network Repository [19].
Table 1. Input Statistics of the Graph Datasets Used in Our Experiments

Input graph (label)          No. vertices    No. edges
musae_Github (GH)            37,699          289,003
gemsec-Deezer (DZ)           41,773          125,826
road_luxembourg-osm (RM)     114,598         119,667
com-Orkut (OR)               2,937,612       20,959,854
socfb-A-anon (FB)            3,097,165       23,667,394
soc-LiveJournal1 (LJ)        4,847,571       68,993,773
Figures 3(a), 4(a), and 5(a) show the normalized time needed for inter-PE communication with the natural ordering and the CARE reordering scheme for PageRank, SSSP, and CC, respectively. It is evident from Figures 3(a), 4(a), and 5(a) that the locality improvement by CARE achieves a significant reduction (25.2× to 76.1×) in on-chip communication time compared to the natural ordering, except for the RM dataset (5.3×). The savings are least for RM because it is a road network with a uniform degree distribution, and consequently there is relatively less locality to be gained through reordering relative to the natural ordering. All other inputs (GH, DZ, OR, FB, and LJ), which have power-law degree distribution characteristics, demonstrate larger savings with the CARE ordering. Though CARE reduces the overall communication cost compared to the natural ordering, a significant amount of inter-PE data traffic remains. Figures 3(b), 4(b), and 5(b) show the contribution of inter-PE communication to the total processing time for a 2D Mesh NoC-based manycore architecture incorporating CARE. We can see from Figures 3(b), 4(b), and 5(b) that even after applying CARE, the contribution of inter-PE communication to the total execution time is high (63.4% to 89.7%) for all the datasets except RM (16.6%). This motivates the need for an efficient NoC for inter-PE communication (even with CARE).
Fig. 3. PageRank communication analysis: (a) Factor of increase in communication time for Natural ordering relative to CARE. (b) Contribution of communication to the total execution time for CARE.
Fig. 4. SSSP communication analysis: (a) Factor of increase in communication time for Natural ordering relative to CARE. (b) Contribution of communication to the total execution time for CARE.
Fig. 5. CC communication analysis: (a) Factor of increase in communication time for Natural ordering relative to CARE. (b) Contribution of communication to the total execution time for CARE.
Traffic and network topology: Most graph workloads follow the Gather-Apply-Scatter (GAS) model, where processing each vertex includes (a) gathering values from incoming neighbors, (b) generating a new value, and (c) scattering it to all outgoing neighbors [3]. Hence, graph operations on ReRAM-based accelerators are expected to predominantly give rise to many-to-few traffic patterns. This many-to-few traffic pattern involves long-range communication, which degrades overall performance, and conventional 2D Mesh NoC architectures are not suitable for this kind of traffic [11]. It has already been shown that, either by inserting long-range shortcuts in a regular Mesh to induce small-world effects or by adopting power-law-based small-world connectivity, we can achieve significant performance gain and lower energy dissipation compared to traditional multi-hop Mesh networks [11]. Therefore, we design a small-world network-based NoC (SWNoC) where the links between routers are established following a power-law distribution for the graph applications under consideration. However, when a small-world network is implemented in a 2D structure, there will be multiple physically long wires connecting widely separated PEs, which gives rise to high timing and energy overheads. When a small-world NoC is instead implemented using 3D integration, PEs that would be far apart in a 2D structure can be placed on different planar dies and connected using vertical links. Figure 6 illustrates the overall 3D SWNoC-enabled ReRAM-based manycore architecture. As shown in Figure 6, relatively long planar links can be converted to short vertical links by placing the communicating PEs on two different planar dies, which reduces the timing and energy costs [11]. Therefore, in this work, we design a 3D SWNoC to enhance the overall performance.
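A sketch of one possible power-law-based shortcut insertion is shown below; this is an illustrative construction under our own assumptions (e.g., the exponent alpha and the sampling procedure), not the exact procedure used to generate the SWNoC links.

```python
# Illustrative power-law-based small-world link insertion: beyond local mesh
# links, a shortcut between routers a and b is added with probability
# proportional to dist(a, b) ** (-alpha).
import random

def small_world_links(coords, n_extra, alpha=2.0, seed=0):
    """coords: list of (x, y, z) router positions; returns n_extra shortcut links."""
    rng = random.Random(seed)
    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(coords[a], coords[b]))
    # candidate shortcuts: all non-adjacent router pairs
    pairs = [(a, b) for a in range(len(coords)) for b in range(a + 1, len(coords))
             if dist(a, b) > 1]
    weights = [dist(a, b) ** (-alpha) for a, b in pairs]
    return rng.choices(pairs, weights=weights, k=n_extra)

coords = [(x, y, z) for z in range(2) for y in range(4) for x in range(4)]
print(small_world_links(coords, n_extra=3))
```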
Fig. 6. Illustration of the overall architecture.
Placement: Due to massive data parallelism, ReRAM-based manycore architectures are typically optimized to achieve high throughput. Reducing the average hop count reduces latency, making PEs available for more computation and thereby improving the throughput of computation. Additionally, load balancing across the NoC is used to further enhance throughput [12]. Minimizing the standard deviation of the hop count achieves load balancing by reducing congestion along the various paths. Hence, we compare designs (θ) with different PE and link placements via the degree of achievable load balancing in the NoC, i.e., using the mean M(θ) and standard deviation SD(θ) of the hop count, given by
\[ M(\theta ) = \frac{1}{C(C-1)}\sum_{i=1}^{C}\sum_{j \ne i} h_{ij}, \qquad SD(\theta ) = \sqrt{\frac{1}{C(C-1)}\sum_{i=1}^{C}\sum_{j \ne i} \big(h_{ij} - M(\theta )\big)^{2}}, \]
where a design θ places the C PEs and L links of the overall architecture, and \(h_{ij}\) is the number of hops from PE i to PE j. Therefore, designing the optimized SWNoC boils down to a multi-objective optimization (MOO) problem where both M(θ) and SD(θ) are minimized to maximize the achievable throughput. We can represent the MOO formulation as follows:
\[ P^{*} = \mathop{\arg\min}_{\theta}\ \big\{ M(\theta ),\ SD(\theta ) \big\}, \]
where \({P}^*\) is the set of Pareto-optimal designs. We choose the design \({\theta}^*\) from the set of Pareto-optimal designs for which the throughput is maximum. The optimization problem is solved using the popular simulated annealing (SA)-based multi-objective heuristic AMOSA [13], as it can find a high-quality solution with optimized placement of PEs and links in a reasonable time.
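The two objectives and the Pareto filtering step can be sketched as follows (AMOSA itself, which searches the placement space, is not reproduced here; the code is illustrative only):

```python
# Sketch of the placement objectives above: mean and standard deviation of the
# hop count of one candidate design, plus a simple Pareto filter over candidates.
import statistics

def hop_stats(hops):
    """hops: dict (i, j) -> hop count between PE i and PE j for one design."""
    values = list(hops.values())
    return statistics.mean(values), statistics.pstdev(values)

def pareto_front(designs):
    """designs: list of (M, SD) tuples; keep only the non-dominated ones."""
    front = []
    for d in designs:
        if not any(o[0] <= d[0] and o[1] <= d[1] and o != d for o in designs):
            front.append(d)
    return front

print(pareto_front([(3.0, 1.2), (2.5, 1.5), (3.5, 1.0), (3.2, 1.4)]))
# [(3.0, 1.2), (2.5, 1.5), (3.5, 1.0)]   (3.2, 1.4) is dominated by (3.0, 1.2)
```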
5 Experiment Results
Experimental Setup: We use NVSim [17] in conjunction with BookSim [16] to evaluate the performance of the proposed ReRAM-based manycore architecture. We leverage BookSim [16] for implementing the different NoC architectures considered in this work. In the proposed architecture, each PE has four tiles. Each tile contains 96 crossbars (128 × 128) and associated peripheral circuits such as ADCs, DACs, and so on, along with eDRAM. The capacity of the eDRAM is 36 MB in our proposed design. The values of LRS and HRS are 14.7 KΩ and 167 KΩ, respectively. Here, we assume ReRAM cells that can store 2 bits each. Each PE takes up 0.37 mm² of area [14]. The architecture requires multiple such ReRAM PEs for storage as well as computation to accommodate the large sizes of input graphs. For implementing the 2D Mesh, 1,024 PEs are arranged in a 32 × 32 grid pattern. Considering a 20 × 20 mm die, the length of each inter-router link is 0.625 mm. The overall system runs at a clock frequency of 2.5 GHz; at this clock frequency, a 0.625 mm link can be traversed in one cycle. In our proposed 3D SWNoC architecture, the 1,024 PEs are equally partitioned into four planar layers. Each layer is 10 × 10 mm (keeping the same total area as the 2D system). Within each layer, 256 PEs are placed in a 16 × 16 grid pattern. In the SWNoC architecture, there are planar links longer than 0.625 mm; these longer links are divided into multiple pipelined stages, where each stage is 0.625 mm long, and hence multiple cycles are necessary to traverse them. All the vertical links connecting the planar layers are traversed in one cycle. BookSim determines the overall NoC latency. We use the PE and memory characteristics along with the total NoC latency in NVSim to determine the overall energy consumption and execution time.

We evaluate the performance of the manycore architecture incorporating CARE with respect to two state-of-the-art ReRAM-based graph accelerators, GraphSAR [3] and Spara [2]. We choose GraphSAR and Spara as they are state-of-the-art architectures that outperform previously developed techniques such as GraphR [6] and HyVE [7]. More specifically, GraphSAR achieves, on average, 4.43× energy reduction and 1.85× speedup with respect to GraphR, and 1.29× speedup and 2.18× energy reduction compared with HyVE. Spara, in turn, outperforms GraphR and GraphSAR by 8.21× and 5.01× in terms of performance, and by 8.97× and 5.68× in terms of energy savings, respectively. As GraphSAR and Spara have already demonstrated comparative performance analyses with respect to other state-of-the-art counterparts, we refrain from repeating those results in this article for brevity. Table 1 shows all the inputs used for the full-system performance analysis. Table 2 shows the specifications of the proposed 3D manycore ReRAM architecture. For thermal evaluation, we model the overall architecture in the 3D-ICE simulator based on various parameters (e.g., layer thickness, thermal conductivity) as listed in [20].
Table 2. Specifications of the Proposed 3D Manycore ReRAM Architecture

No. of planar layers         4
No. of total PEs             1,024
Area of each PE              0.37 mm²
Area of each planar layer    10 × 10 mm²
Clock frequency              2.5 GHz
Value of LRS                 14.7 KΩ
Value of HRS                 167 KΩ
Capacity of eDRAM            36 MB
ReRAM cell size              2 bits per cell
5.1 Selection of Crossbar Size
While storing the graph in crossbars, the adjacency matrix is decomposed into multiple non-overlapping \(N \times N\) segments to map onto \(N \times N\) ReRAM crossbars. Current graph PIM architectures use relatively small crossbars (\(8 \times 8\)) to reduce the storage of zeros [3]. However, this also negatively impacts area and power, as those terms are dominated by the peripheral circuits [14]. To reduce area and power, as well as to minimize the overall number of required ReRAM crossbars, a larger crossbar size becomes more desirable, and experimentation is needed to evaluate this tradeoff. In other words, when considering the total area, power, and zero storage, there are two choices: (a) a smaller size implies a greater number of crossbars and fewer stored zeros, and (b) a larger size implies fewer crossbars and more stored zeros. We conducted an experiment to evaluate this tradeoff with multiple inputs. Figure 7 shows the normalized area, power, and zero storage when varying the crossbar size from \(8 \times 8\) to \(256 \times 256\) for all the graph datasets considered in this work. We can see that the area and power continuously decrease with increasing crossbar size. However, beyond \(128 \times 128\), both area and power show saturating trends, while the zero storage keeps increasing. Consequently, we select the \(128 \times 128\) crossbar configuration as our default for all our experiments. This also implies setting the value of parameter X in CARE reordering (Figure 2) to 128.
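The storage side of this tradeoff can be quantified with a simple sketch (ours, for illustration): for a candidate crossbar size N, count the active N × N blocks of the adjacency matrix and the fraction of stored cells that are zero.

```python
# Sketch of the storage side of the crossbar-size tradeoff discussed above:
# number of active N x N blocks and the fraction of stored cells that are zero,
# for a candidate crossbar size N.

def zero_storage(nonzeros, N):
    """nonzeros: iterable of (row, col) coordinates of nonzero cells."""
    nnz = 0
    active = set()
    for (r, c) in nonzeros:
        nnz += 1
        active.add((r // N, c // N))
    stored_cells = len(active) * N * N
    return len(active), 1.0 - nnz / stored_cells

nnz = [(0, 1), (0, 4), (2, 3), (5, 5)]
for N in (2, 4, 8):
    print(N, zero_storage(nnz, N))   # fewer blocks but more zeros as N grows
```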
Fig. 7. Area-Power-Zero storage tradeoffs for different crossbar configurations.
5.2 Performance of Reordering Scheme
Due to the sparsity of most real-world graph datasets, vertex reordering schemes help reduce the number of active blocks (i.e., matrix blocks with at least one non-zero element). Hence, we compare the number of active blocks generated using CARE to those of GraphSAR, Spara, and the natural ordering. Figure 8(a) shows that all the reordering schemes (CARE, GraphSAR, and Spara) outperform the natural ordering. Furthermore, we observe that CARE significantly outperforms GraphSAR and Spara, by up to 23.8× and 18.3×, respectively. Note that the storage improvements achieved expectedly vary with the inputs, as they are tied to the structural organization of the underlying graphs. For instance, CARE reduces the number of active blocks for RM by 9.5× compared to the natural ordering, whereas the gains range from 27.7× to 58.4× for the other social media datasets (GH, DZ, OR, FB, and LJ) considered in this work. The reduction in the number of active blocks for RM is much lower than for the other datasets with power-law characteristics because the RM dataset's natural ordering already has good locality to start with; hence, there is less room for improving locality toward the goal of reducing the number of active blocks.
Fig. 8. (a) Normalized no. of active blocks (relative to CARE) for Spara, GraphSAR, and natural, (b) Normalized reordering time of CARE, GraphSAR, and Spara.
Next, to assess the cost of the reordering time (preprocessing), we compared the reordering times for CARE, GraphSAR, and Spara (Figure 8(b)). In all the cases, the CARE reordering times were the smallest, with the other two schemes taking considerably longer (up to over 10\(\times\) more in some cases). In summary, these experiments demonstrate that CARE reordering outperforms both GraphSAR and Spara on both zero storage as well as preprocessing cost.
5.3 Full System Performance Evaluation
Using CARE as the chosen reordering scheme, we analyzed the hop count distribution of the proposed 3D SWNoC and compared it with 2D and 3D Mesh-based designs. For evaluation purposes, we refer to traffic that spans more than three 2D Mesh hops as "long-range". Figures 9(a), 10(a), and 11(a) show the percentage of long-range traffic for the 2D Mesh, 3D Mesh, and 3D SWNoC for PageRank, SSSP, and CC, respectively, with the GH input as an example; we observe similar characteristics for the other datasets as well. The results show that long-range traffic for the 3D SWNoC is 40% of the total traffic, compared to 57% and 47.3% for the 2D Mesh and 3D Mesh, respectively. The 3D structure helps reduce long-range traffic due to the presence of short vertical links. As mentioned earlier, PEs that are separated by a long distance on a 2D Mesh can be placed in different planar layers and connected through vertical links; as vertical links are much shorter than their planar counterparts, they enable one-hop data exchange. Hence, it is evident from Figures 9(a), 10(a), and 11(a) that the traffic within three hops increases and the amount of long-range traffic is reduced. Figures 9(b), 10(b), and 11(b) show the normalized execution time for the three NoC architectures running PageRank, SSSP, and CC; the 3D SWNoC achieves the lowest execution time, and the improvement is input-dependent. For instance, more savings are achieved on the power-law graphs with the 3D SWNoC (26% to 32.5%) than with RM (17%). As mentioned above, the savings are least for RM because it is a road network with a uniform degree distribution.
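For clarity, the long-range metric used in this analysis can be expressed as a small sketch (PE ids are assumed to be laid out row-major on the 32 × 32 mesh; the code is illustrative):

```python
# Long-range traffic metric used above: a transfer is "long-range" if source
# and destination are more than three Manhattan hops apart on the 2D mesh.

def mesh_hops(src, dst, width=32):
    """Manhattan hop count between two PEs on a width x width 2D mesh (row-major ids)."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)

def long_range_fraction(transfers, threshold=3):
    """transfers: (src_pe, dst_pe) pairs observed in a traffic trace."""
    return sum(mesh_hops(s, d) > threshold for s, d in transfers) / len(transfers)

print(long_range_fraction([(0, 1), (0, 2), (0, 40), (5, 1000)]))  # 0.5
```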
Fig. 9. (a) Percentage of total traffic that is long-range, under different NoC architectures for PageRank with GH, (b) Normalized execution time w.r.t 2D Mesh for PageRank.
Fig. 10. (a) Percentage of total traffic that is long-range, under different NoC architectures for SSSP with GH, (b) Normalized execution time w.r.t 2D Mesh for SSSP.
Fig. 11. (a) Percentage of total traffic that is long-range, under different NoC architectures for CC with GH, (b) Normalized execution time w.r.t 2D Mesh for CC.
In Figures 12 and 13, we compare the speedups in full-system execution time achieved by the proposed manycore architecture using CARE and Spara against GraphSAR (baseline). To ensure a fair and direct comparison, we ran the graph applications with the datasets reordered by the Spara and GraphSAR reordering schemes on our proposed architecture. The full-system execution time includes the computation time, the inter-PE communication time, and the data transfer time from the host. Note that we report the preprocessing cost separately from this analysis because the reordering process is executed once offline and the reordered graph is then used multiple times for various applications (e.g., SSSP, PageRank, CC); hence, the preprocessing cost is amortized over multiple runs of the accelerator. The same strategy is adopted in other related works such as GraphSAR and Spara. For comparison, the preprocessing times (in seconds) of CARE, GraphSAR, and Spara are reported in Figure 8(b). It should also be noted that reordering not only speeds up graph computation on ReRAM but also reduces the storage requirement. For example, the CARE reordering takes 89.2× more time than the processing time of PageRank with LJ on the ReRAM-based manycore system. However, by paying this preprocessing cost once, CARE reduces the on-chip storage requirement by 58.4× and achieves 84.2× speedup compared to the natural ordering (i.e., without any preprocessing). This speedup holds for every subsequent execution of the application on the ReRAM-based system; thus, the one-time offline preprocessing cost is repaid many times over.
Fig. 12. Speedup of Spara and CARE in total execution time compared to GraphSAR for (a) PageRank, (b) SSSP and (c) CC.
Fig. 13. Speedup of Spara and CARE in total execution time compared to GraphSAR for (a) BFS, (b) Triangle counting.
We can see from Figures 12 and 13 that the SWNoC-enabled manycore architecture incorporating CARE achieves the highest speedup (2.87× to 49.4×) compared to GraphSAR. This is clearly superior to the speedups achieved by Spara, which range from 1.3× to 16.3×. Another key observation is that the speedup gains realized by the proposed architecture with CARE are higher for the larger datasets (where it matters more); e.g., CARE achieves its peak speedups (41× to 49.4×) for the largest input tested (LJ: 4.8M vertices and 68.9M edges). This is because CARE significantly reduces the number of active blocks and on-chip data movement, thereby also reducing the time taken for data transfer, on-chip traffic, and computation. Furthermore, if we were to exclude the contribution of inter-PE on-chip communication from the total execution time, the speedup of CARE with respect to GraphSAR would reduce to between 1.29× and 14×, relative to the 2.87× to 49.4× observed when communication time is included. In other words, a large portion of the achievable performance gain comes from reducing on-chip communication, which reinforces the necessity of designing an efficient NoC.
Figures 14(a)–(c) compare the full-system energy consumption of CARE with that of GraphSAR and Spara for PageRank, SSSP, and CC, respectively. Similarly, Figures 15(a) and (b) show the comparison for BFS and Triangle Counting, respectively. Figures 14 and 15 show that CARE consumes 1.3× to 16.3× less energy than GraphSAR and 1.1× to 10.3× less energy than Spara for the inputs tested. As mentioned above, this is because CARE is more efficient in reducing the number of active blocks and on-chip data movement. Also, due to the high energy efficiency of the ReRAM-based PEs, the peak temperature of the system remains below 85°C for all the configurations tested.
Fig. 14. Normalized energy of (a) PageRank, (b) SSSP and (c) CC for CARE, Spara and GraphSAR.
Fig. 15. Normalized energy of (a) BFS, (b) Triangle Counting for CARE, Spara and GraphSAR.
6 Conclusion
In this article, we presented a 3D manycore ReRAM architecture well suited for accelerating graph applications. We introduced a crossbar-aware node reordering scheme called CARE that reduces the storage requirement and on-chip traffic volume on ReRAM. However, even after applying CARE, the contribution of inter-PE communication to the total execution time remains high for all the datasets (63.4% to 89.7%) except RM (16.6%), which motivates the need for an efficient NoC for inter-PE communication. To reduce on-chip communication latency, we presented an optimized 3D SWNoC architecture that reduces the network latency incurred by the long-range traffic in graph workloads. The combination of CARE reordering in software and the SWNoC in hardware yields drastic reductions in both runtime and energy cost, consistently outperforming two state-of-the-art ReRAM-based graph accelerators. We have also demonstrated that the speedup and energy improvements of our proposed architecture vary with the datasets: for social network inputs, both vertex reordering and the NoC architecture contribute noticeably to the overall performance and energy improvement, whereas the gains are comparatively smaller for the road map input. This dichotomy shows that the input characteristics can have a pronounced impact on what can be achieved through the combination of CARE reordering in software and the proposed 3D SWNoC-based manycore ReRAM architecture.
References
[1] A. Kalyanaraman and P. Pande. 2019. A brief survey of algorithms, architectures, and challenges toward extreme-scale graph analytics. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE). 1307–1312.
[2] L. Zheng, J. Zhao, Y. Huang, Q. Wang, Z. Zeng, J. Xue, X. Liao, and H. Jin. 2020. Spara: An energy-efficient ReRAM-based accelerator for sparse graph analytics applications. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). 696–707.
[3] G. Dai, T. Huang, Y. Wang, H. Yang, and J. Wawrzynek. 2019. GraphSAR: A sparsity-aware processing-in-memory architecture for large-scale graph processing on ReRAMs. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC). 120–126.
[4] A. A. Maashri, G. Sun, X. Dong, V. Narayanan, and Y. Xie. 2009. 3D GPU architecture using cache stacking: Performance, cost, power and thermal analysis. In Proceedings of the IEEE International Conference on Computer Design (ICCD). 254–259.
[5] M. M. Ozdal, S. Yesil, T. Kim, A. Ayupov, J. Greth, S. Burns, and O. Ozturk. 2016. Energy efficient architecture for graph analytics accelerators. In Proceedings of the International Symposium on Computer Architecture (ISCA). 166–177.
[6] L. Song, Y. Zhuo, X. Qian, H. Li, and Y. Chen. 2018. GraphR: Accelerating graph processing using ReRAM. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA). 531–543.
[7] T. Huang, G. Dai, Y. Wang, and H. Yang. 2018. HyVE: Hybrid vertex-edge memory hierarchy for energy-efficient graph processing. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE). 973–978.
[8] K. Duraisamy, H. Lu, P. P. Pande, and A. Kalyanaraman. 2017. Accelerating graph community detection with approximate updates via an energy-efficient NoC. In Proceedings of the 54th ACM/IEEE Design Automation Conference (DAC). 1–6.
[9] R. Barik, M. Minutoli, M. Halappanavar, N. R. Tallent, and A. Kalyanaraman. 2020. Vertex reordering for real-world graphs and applications: An empirical evaluation. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). 240–251.
[10] I. Safro, D. Ron, and A. Brandt. 2009. Multilevel algorithms for linear ordering problems. ACM Journal of Experimental Algorithmics 13, Article 1.4 (2009).
[11] S. Das, J. R. Doppa, P. P. Pande, and K. Chakrabarty. 2017. Design-space exploration and optimization of an energy-efficient and reliable 3-D small-world network-on-chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 36, 5 (2017), 719–732.
[12] B. K. Joardar, R. G. Kim, J. R. Doppa, P. P. Pande, D. Marculescu, and R. Marculescu. 2019. Learning-based application-agnostic 3D NoC design for heterogeneous manycore systems. IEEE Transactions on Computers 68, 6 (2019), 852–866.
[13] S. Bandyopadhyay, S. Saha, U. Maulik, and K. Deb. 2008. A simulated annealing-based multiobjective optimization algorithm: AMOSA. IEEE Transactions on Evolutionary Computation 12, 3 (2008), 269–283.
[14] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA). 14–26.
[15] A. I. Arka, B. K. Joardar, J. R. Doppa, and P. P. Pande. 2021. ReGraphX: NoC-enabled 3D heterogeneous ReRAM architecture for training graph neural networks. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE). 1667–1672.
[16] N. Jiang, D. U. Becker, G. Michelogiannakis, J. Balfour, B. Towles, D. E. Shaw, J. Kim, and W. Dally. 2013. A detailed and flexible cycle-accurate network-on-chip simulator. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 86–96.
[17] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi. 2014. NVSim: A circuit-level performance, energy, and area model for emerging non-volatile memory. In Emerging Memory Technologies. Springer, 15–50.
[20] A. Sridhar, A. Vincenzi, M. Ruggiero, T. Brunschwiler, and D. Atienza. 2010. 3D-ICE: Fast compact transient thermal modeling for 3D ICs with inter-tier liquid cooling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD). 463–470.
[21] W. Lee, J. Park, J. Shin, and J. Woo. 2012. Varistor-type bidirectional switch (JMAX > 107 A/cm2, selectivity ∼104) for 3D bipolar resistive memory arrays. In Proceedings of the Symposium on VLSI Technology. 37–38.
[22] L. Zhang, S. Cosemans, D. Wouters, and G. Groeseneken. 2015. On the optimal ON/OFF resistance ratio for resistive switching element in one-selector one-resistor crosspoint arrays. IEEE Electron Device Letters 36, 6 (2015), 570–572.
[23] T. Schultz, R. Jha, M. Casto, and B. Dupaix. 2020. Vulnerabilities and reliability of ReRAM based PUFs and memory logic. IEEE Transactions on Reliability 69, 2 (2020), 690–698.
[24] D. Choudhury, A. S. Rajam, A. Kalyanaraman, and P. Pande. 2022. High-performance and energy-efficient 3D manycore GPU architecture for accelerating graph analytics. ACM Journal on Emerging Technologies in Computing Systems 18, 1 (2022), Article 18, 1–19.
[25] D. Choudhury, R. Barik, A. S. Rajam, A. Kalyanaraman, and P. Pande. 2022. Software/hardware co-design of 3D NoC-based GPU architectures for accelerated graph computations. ACM Transactions on Design Automation of Electronic Systems 27, 6 (2022), 1–22.