OmniXtend
Open Source Cache-coherence over Ethernet
Dr. Zvonimir Bandic
Next Gen Platforms Technologies, Western Digital
Chairman, CHIPS Alliance
Agenda
‣ Who Are We?
‣ OmniXtend Details
‣ Next Steps
CHIPS Alliance: Who Are We?
What Is CHIPS Alliance?
▸ An organization which develops and hosts:
– Open source hardware code (IP cores) -> think open source CPUs
– Open source software design tools -> the fastest growing component
– Interconnect IP (PHY and logical protocols) -> attracts the most interest
▸ A barrier-free environment for collaboration:
– A standards-organization framework for collaboration and development
– A legal framework: the Apache v2 license
▸ Shared resources ($ and time) which lower the cost of hardware development:
– For IP and tools
CHIPS Alliance: Who Are We?
Contributors: Wilson Snyder, Olof Kindgren
Workgroups
‣ Tools-WG: Verilator, FuseSoC, cocotb-verilator
‣ Cores-WG: SweRV Core™
‣ Chisel-WG: Rocket SoC, AI accelerator
‣ Interconnect-WG: OmniXtend™, TileLink 2.0, AIB (Chiplets)
OmniXtend Details
Why OmniXtend?
▸ The processor acts as a control point within the datacenter, limiting customer flexibility
▸ Memory is blocked behind the processor
– Limited number of DIMMs per socket
– Limited CPU memory address space
– Limited access to the fast memory bus
▸ Analytics and machine learning are driven by accelerators, which have limited access to the coherency bus
– Access to future fast-I/O storage attach points may also be constrained
[Diagram: main memory hangs off the CPU (with its L1 cache); the GPU, FPGA, ML accelerator, and other devices can reach it only through the CPU's I/O interface]
OmniXtend vs. Other Memory-centric Concepts
Memory fabric may mean different things to different people. Existing approaches carry a context-switch cost comparable to the memory access latency, and require software/kernel support and/or rewriting of applications:
‣ A page fault trap leading to an RDMA request (incurs context switch and SW overhead)
‣ Global address translation management in SW, leading to LD/ST across the global memory fabric
[Diagrams: (1) a CPU with cache and DRAM reaching the fabric through a NIC, with SW-driven DMA/RDMA; (2) a CPU with cache reaching the fabric through SW-managed translation tables and LD/ST]
This is OmniXtend: no rewriting of software, and it scales like the algorithm.
‣ The coherence protocol is scaled out, with global page management and no context switching
‣ No impact on the application system call interface (only boot changes are needed)
[Diagram: a CPU with cache attached directly to the fabric; the cache coherence protocol and page tables (PT) run over the fabric itself]
OmniXtend — Open Unified Memory Fabric
Memory is the center of the architecture
[Diagram: the memory fabric sits at the center, connecting RISC-V CPUs, other CPUs, GPUs, FPGAs, and AI accelerators, with a conventional network fabric alongside]
OmniXtend Details
▸ OmniXtend is based on TileLink
– TileLink is an open, coherent bus used to connect cores with memory
▸ OmniXtend encapsulates TileLink messages and serializes them over Ethernet
Example OmniXtend Implementation
[Diagram: a multicore node with four L1 caches and a shared L2 cache behind an internal cache coherence switch; a coherence manager attaches DRAM and peripherals, and a cache coherence serializer feeds an 802.3 PHY, carrying Ethernet with cache coherency (OmniXtend)]
Benefits of OmniXtend
▸ Already implemented in FPGAs
▸ Enables new data-centric architectures and decouples compute from memory
▸ Completely unleashes memory from the CPU
▸ No need to rewrite application software
▸ The only completely open cache-coherent fabric standard
▸ Based on low-cost Ethernet
OmniXtend System Block Diagram Example
The only Ethernet-based fabric that supports cache coherency and is open. [Diagram: see the full block diagram in the backup slides.]
OmniXtend Architecture Overview
[Diagram: on the RISC-V node side, data packets enter the switch pipeline through a parser, pass through Match-Action Units 0 through 11, and exit through a deparser. Each Match-Action Unit (MAU) receives a Packet Header Vector (PHV), looks up a key in its match table, and applies the matching actions with their parameters.]
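To make the pipeline concrete, here is a toy C sketch of one match-action stage: a key extracted by the parser is looked up in a table, and the bound action runs with its control-plane parameters. This is only an illustration of the concept; the real behavior is written in P4 and compiled for the Tofino switch, and the EtherType value below is a placeholder.

```c
#include <stdint.h>
#include <stddef.h>

/* One table entry: a key to match on, plus the action and parameters
 * that the control plane bound to that key. */
typedef void (*action_fn)(uint8_t *pkt, const void *params);

struct mau_entry {
    uint16_t    key;      /* e.g. an EtherType extracted by the parser */
    action_fn   action;   /* action to run on a hit */
    const void *params;   /* parameters installed by the control plane */
};

/* Example action: a real one would set the egress port from params. */
static void set_egress_port(uint8_t *pkt, const void *params)
{
    (void)pkt;
    (void)params;
}

/* Example table with one entry matching a placeholder EtherType. */
static struct mau_entry example_table[] = {
    { 0xAAAA /* placeholder, not the real OmniXtend value */,
      set_egress_port, NULL },
};

/* One pipeline stage: exact-match lookup, then run the bound action. */
void mau_stage(const struct mau_entry *table, size_t n,
               uint16_t key, uint8_t *pkt)
{
    for (size_t i = 0; i < n; i++) {
        if (table[i].key == key) {
            table[i].action(pkt, table[i].params);
            return;
        }
    }
    /* No match: fall through to the stage's default action. */
}
```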
TLoE Frame Structure (fields in wire order):
‣ Ethernet Preamble/SFD (8 bytes)
‣ Ethernet MAC Header (14 bytes)
‣ TLoE Frame Header (8 bytes)
‣ TileLink messages 1 through m
‣ Padding (P×8 bytes)
‣ TLoE Frame Mask (8 bytes)
‣ Ethernet FCS (4 bytes)
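The same layout can be captured as C structs. This is a minimal sketch based only on the field sizes above: the bit-level contents of the 8-byte TLoE frame header (sequence numbers, acks, credits) are defined by the OmniXtend specification and treated as opaque here, and the preamble/SFD and FCS are handled by the MAC/PHY rather than by software.

```c
#include <stdint.h>

/* Standard 14-byte Ethernet MAC header. */
struct eth_header {
    uint8_t  dst_mac[6];
    uint8_t  src_mac[6];
    uint16_t ethertype;        /* identifies TLoE/OmniXtend traffic */
} __attribute__((packed));

/* TLoE payload carried after the MAC header. */
struct tloe_payload {
    uint64_t frame_header;     /* 8-byte TLoE frame header, opaque here */
    uint64_t body[];           /* m serialized TileLink messages, then
                                  P*8 bytes of padding, then the 8-byte
                                  TLoE frame mask */
} __attribute__((packed));
/* Preamble/SFD (8 bytes) and FCS (4 bytes) live below this layer. */
```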
FPGA “Real World” Measurements
▸ RISC-V SoC with OmniXtend running in an FPGA
▸ Tofino switch programmed with P4 code to support OmniXtend
[Photos: the FPGA boards and the Tofino switch used for the measurements]
FPGA Time Measurements
▸ 16-CPU (FPGA) system & Tofino switch
– The CPUs run at 100 MHz in the FPGA
[Chart: OmniXtend latency measured on the 100 MHz FPGA system]
Software Support
▸ Kernel-level memory management changes
▸ Option 1: Single kernel instance
– All nodes controlled under a single kernel instance
– NUMA SMP-like system
– Small-scale systems
– Expose the nodes' memory as NUMA nodes
▸ Option 2: Independent kernel instances
– Independent kernel instance per node
– Large-scale systems
– Applications can share memory through an FS-like interface (memory-mapped files); see the sketch below
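As a rough illustration of Option 2, the following C sketch maps a region of fabric-shared memory into an application. The device path is hypothetical; the slides only say that shared memory is exposed through an FS-like interface as memory-mapped files.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Hypothetical path for the FS-like shared-memory interface. */
    int fd = open("/dev/omnixtend/shared0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 1UL << 20;                      /* map a 1 MiB window */
    uint64_t *shm = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
    if (shm == MAP_FAILED) { perror("mmap"); return 1; }

    /* Plain loads and stores; coherence across nodes is the fabric's job. */
    shm[0] = 42;

    munmap(shm, len);
    close(fd);
    return 0;
}
```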
[Diagrams: in the single-kernel model, physical nodes 0 and 1 (four CPUs and local memory each) are joined by the OmniXtend fabric into one logical system whose memories appear as NUMA nodes 0 and 1; in the independent-kernels model, each node runs its own kernel and the nodes share a memory region over the fabric. Full-size versions appear in the backup slides.]
Software Support
▸ Single kernel image prototype implemented
– Mainly OpenSBI modifications; very few kernel changes needed
– Device tree description of memory
– Boots on FPGA prototype hardware with up to 4 nodes
▸ Independent-kernels implementation under study
– Interface and implementation of shared-memory setup and access control
▸ QEMU-based OmniXtend emulation under development
– Facilitates software development and verification
– Also allows access to physical compute nodes
[Diagram: inside the QEMU process, guest reads/writes to cacheable memory go straight to the memory backend, while an address checker routes remote accesses either through an LRU cache (remote cacheable memory requests) or as non-cacheable reads/writes to an MMIO device; the emulated OmniXtend device implements the OmniXtend protocol and, alongside virtio net and disk devices, accesses Ethernet packets on the hardware PHY using a raw socket]
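The raw-socket access mentioned in the diagram can be done with a standard Linux AF_PACKET socket. A minimal sketch follows; the function name is illustrative, running it requires CAP_NET_RAW, and how the QEMU prototype actually opens its socket is an implementation detail of that code.

```c
#include <arpa/inet.h>         /* htons */
#include <linux/if_ether.h>    /* ETH_P_ALL */
#include <linux/if_packet.h>   /* struct sockaddr_ll */
#include <net/if.h>            /* if_nametoindex */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a raw socket bound to one interface; recv()/send() then move
 * whole Ethernet frames, including the 14-byte MAC header. */
int open_raw_ethernet(const char *ifname)
{
    int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (fd < 0) { perror("socket"); return -1; }

    struct sockaddr_ll sll;
    memset(&sll, 0, sizeof(sll));
    sll.sll_family   = AF_PACKET;
    sll.sll_protocol = htons(ETH_P_ALL);
    sll.sll_ifindex  = if_nametoindex(ifname);

    if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
        perror("bind");
        close(fd);
        return -1;
    }
    return fd;
}
```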
Next Steps
OmniXtend Reference Design
▸ A memory fabric innovation platform
▸ Standardize a RISC-V coherency bus leveraging OmniXtend
Open Source Collaboration to Drive Development
▸ CHIPS Alliance is open to all organizations
▸ OmniXtend: well-positioned for growth

OmniXtend Reference Design
▸ Allegro design files are available now in CHIPS Alliance: https://github.com/chipsalliance/omnixtend
Summary
▸ OmniXtend is an open, unified memory fabric
▸ Joint workgroups with RISC-V International to standardize on TileLink 2.0 and OmniXtend for multicore systems
▸ Adopt the technology in your next SoC
See more: www.chipsalliance.org
Thank you.

Backup
OmniXtend Architecture Overview (backup)
[Full-size version of the RISC-V node and switch-pipeline diagram from the Architecture Overview slide]
CHIPS Alliance – Organizational Structure
▸ Board of Directors (elected): Zvonimir Bandic (Chairman), Richard Ho (Vice-chairman), Xiaoning Qi, Dave Ditzel, Yunsup Lee, David Kehlet, Prof. Borivoje Nikolic
▸ Interim Director: Ted Marena
▸ Technology: Henry Cook (Technical Committee), workgroup chairs, project maintainers 1–3; future staff: SW Engineer 2, Verif. Engineer 1
▸ Advocacy + Outreach (visibility): Michael Gielda (Outreach Committee), community manager (agency), Linux Foundation events
▸ Growth + Operations: Brian Warner (Linux Foundation Operations), Linux Foundation Finance/Operations, Linux Foundation Legal
Single Kernel Model
[Diagram: physical nodes 0 and 1, each with CPUs 0–3 and local memory, joined by the OmniXtend fabric into one logical system in which Memory 0 (CPUs 0–3) and Memory 1 (CPUs 4–7) appear as NUMA nodes 0 and 1]

Independent Kernel Model
[Diagram: physical nodes 0 and 1, each running its own kernel on CPUs 0–3 with local memory, sharing a memory region over the OmniXtend fabric]
OmniXtend System Block Diagram Example
The only Ethernet-based fabric that supports cache coherency and is open.
[Diagram: two multicore nodes, each with eight L1 caches and a shared L2 behind an internal cache coherence switch, a coherence manager with DRAM (one node also with NVM main memory), and a cache coherence serializer driving an 802.3 PHY; an ML accelerator and an NVM device with their own 802.3 PHYs join them through a programmable (P4) Tofino™ switch carrying Ethernet with cache coherency]
OmniXtend vs. Other Protocols

Interface  | Physical I/O | Connection            | Standard | Coherence? | Reference Design
CCIX       | PCIe         | Point to Point        | Yes      | No         | No
Gen-Z      | Custom       | P2P, Switched, Fabric | Yes      | No         | Yes
CXL        | PCIe         | Point to Point        | Yes      | Partial    | No
OpenCAPI   | Custom       | Point to Point        | Open     | Partial    | Yes
OmniXtend  | Ethernet     | Fabric, P2P           | Open     | Yes        | Yes
Editor's Notes

On Next Steps: So how do we make amazing data-centric architectures happen? We believe it is possible via open standards. RISC-V affords us a unique opportunity to create an open standard memory coherency bus for heterogeneous computing architectures. Today's existing CPUs all have closed coherency buses; we can lead the way with an open one. The CHIPS Alliance organization has been discussing this along with RISC-V International. We are asking you to join us in creating a standard for a RISC-V coherency bus, and we think OmniXtend is a great foundation to build upon: it has a reference design board and has been shown to perform well in initial system build-outs. Contribute to our efforts in creating an open standard which will enable data-centric solutions to thrive.