DOI: 10.1145/2155620.2155630
research-article

Towards the ideal on-chip fabric for 1-to-many and many-to-1 communication

Published: 03 December 2011

Abstract

The prevalence of multicore architectures has accentuated the need for scalable cache coherence solutions. Many of the proposed designs use a mix of 1-to-1, 1-to-many (1-to-M), and many-to-1 (M-to-1) communication to maintain data coherence and consistency. The on-chip network is the communication backbone that needs to handle all these flows efficiently to allow these protocols to scale. However, most research in on-chip networks has focused on optimizing only 1-to-1 traffic. There has been some recent work addressing 1-to-M traffic by proposing the forking of multicast packets within the network at routers, but these techniques incur high packet delays and power penalties. There has been little research in addressing M-to-1 traffic.
We propose two in-network techniques, Flow Across Network Over Uncongested Trees (FANOUT) and Flow AggregatioN In-Network (FANIN), which perform efficient 1-to-M forking and M-to-1 aggregation, respectively, such that packets incur only single-cycle delays at most routers along their path, thus approaching an ideal network (one that incurs only wire delay/energy). Full-system simulations on a 64-core CMP with SPLASH-2 and PARSEC benchmarks show that FANOUT and FANIN together reduce runtime by 14.9% and network energy by 40.2%, on average, compared to state-of-the-art networks, operating at just 1% and 9.6% above the runtime and energy of an ideal network.
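To make the motivation concrete, the sketch below counts link traversals for 1-to-M and M-to-1 traffic with and without in-network forking or aggregation. It is an illustrative model only, not the FANOUT or FANIN mechanism itself: the 8x8 mesh topology, the X-then-Y dimension-order routing, and all function names are assumptions introduced here, and the aggregation case idealistically assumes that responses always merge wherever their paths meet.

```python
from typing import Iterable, List, Set, Tuple

Node = Tuple[int, int]    # (x, y) coordinate of a router in the mesh (assumed topology)
Link = Tuple[Node, Node]  # directed link between two adjacent routers


def xy_path(src: Node, dst: Node) -> List[Link]:
    """Links crossed by dimension-order routing: X first, then Y (an assumption)."""
    links: List[Link] = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def separate_unicasts(src: Node, dsts: Iterable[Node]) -> int:
    """1-to-M sent as M independent packets: every path is paid for in full."""
    return sum(len(xy_path(src, d)) for d in dsts)


def forked_multicast(src: Node, dsts: Iterable[Node]) -> int:
    """1-to-M with forking at routers: each link of the tree is crossed once."""
    used: Set[Link] = set()
    for d in dsts:
        used.update(xy_path(src, d))
    return len(used)


def separate_responses(srcs: Iterable[Node], dst: Node) -> int:
    """M-to-1 sent as M independent packets (for example, acknowledgements)."""
    return sum(len(xy_path(s, dst)) for s in srcs)


def aggregated_responses(srcs: Iterable[Node], dst: Node) -> int:
    """M-to-1 with idealized merging: one packet per link of the reverse tree."""
    used: Set[Link] = set()
    for s in srcs:
        used.update(xy_path(s, dst))
    return len(used)


if __name__ == "__main__":
    root: Node = (0, 0)
    others = [(x, y) for x in range(8) for y in range(8) if (x, y) != root]
    print(separate_unicasts(root, others), forked_multicast(root, others))
    print(separate_responses(others, root), aggregated_responses(others, root))
```

Under these assumptions, a broadcast from corner router (0, 0) to the other 63 routers costs 448 link traversals as separate unicasts but only 63 when packets fork along the tree, and the same counts hold for 63 responses converging on (0, 0). The model counts link traversals only; it says nothing about router pipeline delay, which is what the single-cycle claim in the abstract addresses.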

Published In

MICRO-44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture
December 2011
519 pages
ISBN: 9781450310536
DOI: 10.1145/2155620
Conference Chair: Carlo Galuzzi
General Chair: Luigi Carro
Program Chairs: Andreas Moshovos, Milos Prvulovic
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate 484 of 2,242 submissions, 22%
