research-article
Open access

Designing on-chip networks for throughput accelerators

Published: 16 September 2013

    Abstract

    As the number of cores and threads in throughput accelerators such as Graphics Processing Units (GPUs) increases, so does the importance of on-chip interconnection network design. This article explores throughput-effective Networks-on-Chip (NoCs) for future compute accelerators that employ Bulk-Synchronous Parallel (BSP) programming models such as CUDA and OpenCL. A hardware optimization is “throughput effective” if it improves parallel application-level performance per unit chip area. We evaluate the performance of future-looking workloads using detailed closed-loop simulations modeling compute nodes, the NoC, and the DRAM memory system. We start from a mesh design with bisection bandwidth balanced to off-chip demand. Accelerator workloads tend to demand high off-chip memory bandwidth, which results in a many-to-few traffic pattern when coupled with the expected technology constraint of slow growth in pins-per-chip. Leveraging these observations, we reduce NoC area by proposing a “checkerboard” NoC which alternates between conventional full routers and half routers with limited connectivity. Next, we show that increasing network terminal bandwidth at the nodes connected to DRAM controllers alleviates a significant fraction of the remaining imbalance resulting from the many-to-few traffic pattern. Furthermore, we propose a “double checkerboard inverted” NoC organization which takes advantage of channel slicing to reduce area while maintaining the performance improvements of the aforementioned techniques. This organization also has a simpler routing mechanism and improves average application throughput per unit area by 24.3%.
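
    As a rough illustration of two ideas in the abstract, the following Python sketch lays out a checkerboard of full and half routers on a mesh and computes a relative change in performance per unit area. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the choice of which parity of (row + column) holds a half router, and the numbers in the usage lines are all hypothetical and do not come from the article.

```python
from typing import List


def checkerboard(rows: int, cols: int) -> List[List[str]]:
    """Label each mesh node 'F' (full router) or 'H' (half router).

    Assumption: routers alternate with the parity of (row + col), like the
    squares of a checkerboard. Which parity is 'half' is an arbitrary choice.
    """
    return [["F" if (r + c) % 2 == 0 else "H" for c in range(cols)]
            for r in range(rows)]


def throughput_per_area_gain(perf_base: float, area_base: float,
                             perf_new: float, area_new: float) -> float:
    """Relative change in performance per unit area versus a baseline."""
    if perf_base <= 0 or area_base <= 0 or area_new <= 0:
        raise ValueError("baseline performance and both areas must be positive")
    return (perf_new / area_new) / (perf_base / area_base) - 1.0


# Hypothetical 4x4 mesh, printed one row per line.
grid = checkerboard(4, 4)
for row in grid:
    print(" ".join(row))

# Hypothetical numbers (not from the article): equal performance, 10% less area.
gain = throughput_per_area_gain(1.0, 1.0, 1.0, 0.9)
print(f"{gain:.1%}")
```

    Under this definition a design can be throughput effective either by raising application performance at the same area or by keeping performance while shrinking area, which is the direction the checkerboard organization takes.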

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 10, Issue 3
    September 2013
    310 pages
    ISSN:1544-3566
    EISSN:1544-3973
    DOI:10.1145/2509420
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 16 September 2013
    Accepted: 01 June 2013
    Revised: 01 April 2013
    Received: 01 May 2011
    Published in TACO Volume 10, Issue 3

    Author Tags

    1. Bulk-synchronous parallel
    2. GPGPU
    3. NoC
    4. throughput accelerator

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Article Metrics

    • Downloads (Last 12 months): 96
    • Downloads (Last 6 weeks): 13
    Reflects downloads up to 11 Aug 2024

    Cited By

    • (2022) Criticality-aware priority to accelerate GPU memory access. The Journal of Supercomputing 79:1 (188-213). DOI: 10.1007/s11227-022-04657-3. Online publication date: 6-Jul-2022
    • (2021) AI Tax. ACM Transactions on Computer Systems 37:1-4 (1-32). DOI: 10.1145/3440689. Online publication date: 26-Mar-2021
    • (2021) Highly Concurrent Latency-tolerant Register Files for GPUs. ACM Transactions on Computer Systems 37:1-4 (1-36). DOI: 10.1145/3419973. Online publication date: 4-Jan-2021
    • (2021) LARA: Locality-aware resource allocation to improve GPU memory-access time. The Journal of Supercomputing 77:12 (14438-14460). DOI: 10.1007/s11227-021-03854-w. Online publication date: 1-Dec-2021
    • (2019) BARAN. ACM Transactions on Parallel Computing 5:3 (1-29). DOI: 10.1145/3294049. Online publication date: 22-Jan-2019
    • (2018) LTRF. ACM SIGPLAN Notices 53:2 (489-502). DOI: 10.1145/3296957.3173211. Online publication date: 19-Mar-2018
    • (2018) LTRF. Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (489-502). DOI: 10.1145/3173162.3173211. Online publication date: 19-Mar-2018
    • (2017) BiNoCHS. Proceedings of the Eleventh IEEE/ACM International Symposium on Networks-on-Chip (1-8). DOI: 10.1145/3130218.3130222. Online publication date: 19-Oct-2017
    • (2015) 10x10. ACM SIGARCH Computer Architecture News 43:3 (2-9). DOI: 10.1145/2856113.2856115. Online publication date: 8-Dec-2015
