Cisco NVMe Fundamentals
Cisco Webex Teams
Questions? Use Cisco Webex Teams (formerly Cisco Spark) to chat with the speaker after the session.
How:
1. Open the Cisco Events Mobile App
2. Find your desired session in the "Session Scheduler"
3. Click "Join the Discussion"
4. Install Webex Teams or go directly to the team space
5. Enter messages/questions in the team space
cs.co/ciscolivebot#BRKDCN-2494
Agenda
• Introduction
• Storage Concepts
• What is NVMe™
• NVMe Operations
• The Anatomy of NVM Subsystems
• Queuing and Queue Pairs
• Additional Resources
What This Presentation Is… and Is Not
• What it is:
• A technology conversation
• Deep dive (We’re going in, Jim!)
• What it is not:
• A product conversation
• Comprehensive and exhaustive
• A discussion of best practices (see, e.g., BRKDCN-2729, Networking Impact of NVMe over Fabrics)
Goals
Prerequisites
• Helpful to know…
• Basic PCIe semantics
• Some storage networking
Note: Screenshot Warning!
What We Will Not Cover (In Detail)
• NVMe-MI (Management)
• NVMe-KV (Key Value)
• Protocol Data Protection features
• Advances in NVMe features
• RDMA verbs
• Fibre Channel exchanges
• New form factor designs
Who is NVM Express?
About NVM Express (The Technology)
• NVM Express (NVMe™) is an open collection of standards and information to fully expose the benefits of non-volatile memory in all types of computing environments, from mobile to data center
• NVMe™ is designed from the ground up to deliver high-bandwidth, low-latency storage access for current and future NVM technologies
The Basics of Storage and Memory
The Anatomy of Storage
• There is a "sweet spot" for storage
• Depends on the workload and application type
• No "one-size-fits-all"
• Understanding "where" the solution fits is critical to understanding "how" to put it together
• Trade-offs between 3 specific forces: "You get, at best, 2 out of 3"
• NVMe targets this sweet spot
Storage Solutions - Where Do NVMe and NVMe-oF Fit?
• Different types of storage apply in different places
• NVMe is PCIe-based
• Local storage, in-server
• PCIe extensions (HBAs and switches) extend NVMe outside the server
• NVMe-oF extends NVMe access beyond the server, across network fabrics
But First… SCSI
• SCSI is the command set used in traditional storage
• It is the basis for most storage used in the data center
• It is the obvious starting point for working with Flash storage
• These commands are transported via:
• Fibre Channel
• Infiniband
• iSCSI (duh!)
• SAS, SATA
• Works great for data that can’t be accessed in parallel (like disk drives that rotate)
• Any latency in protocol acknowledgement is far less than rotational head seek time
Evolution from Disk Drives to SSDs
The Flash Conundrum
• Flash
• Requires far fewer commands than SCSI provides
• Does not rotate (no rotational latency, which exposes the latency of a one-command/one-queue system)
• Thrives on random (non-linear) access
• Both read and write
• Nearly all Flash storage systems have historically used SCSI for access
• But they don’t have to!
So, NVMe…
• Specification for SSD access via PCI Express (PCIe), initially flash media
• Designed to scale to any type of Non Volatile Memory, including Storage Class Memory
• Design target: high parallelism and low latency SSD access
• Does not rely on SCSI (SAS/FC) or ATA (SATA) interfaces: New host drivers & I/O stacks
• Common interface for Enterprise & Client drives/systems: Reuse & leverage engineering investments
• New modern command set
• 64-byte commands (vs. typical 16 bytes for SCSI)
• Administrative vs. I/O command separation (control path vs. data path)
• Small set of commands: Small fast host and storage implementations
• Standards development by the NVM Express working group
• NVMe is the de facto solution for NAND and post-NAND SSDs from all SSD suppliers
• Full NVMe support in all major OSes (Linux, Windows, ESXi, etc.)
• Learn more at nvmexpress.org
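To make the "64-byte commands" point concrete, here is an illustrative C layout of a Submission Queue Entry. Field names follow the common command format in the NVMe base specification, but this is a sketch, not production driver code:

```c
#include <stdint.h>

/* Illustrative layout of a 64-byte NVMe Submission Queue Entry (SQE).
 * Field names follow the common command format in the NVMe base spec;
 * real drivers use their own packed definitions. */
struct nvme_sqe {
    uint8_t  opcode;     /* command opcode (admin or I/O)               */
    uint8_t  flags;      /* fused-operation bits, PRP vs. SGL selector  */
    uint16_t cid;        /* Command Id, echoed back in the completion   */
    uint32_t nsid;       /* namespace the command operates on           */
    uint64_t rsvd;       /* command dwords 2-3 (reserved)               */
    uint64_t mptr;       /* metadata pointer                            */
    uint64_t prp1;       /* data pointer: PRP entry 1 (or SGL)          */
    uint64_t prp2;       /* data pointer: PRP entry 2 (or SGL)          */
    uint32_t cdw10;      /* command-specific dwords 10-15, e.g.         */
    uint32_t cdw11;      /* starting LBA and block count for a Read     */
    uint32_t cdw12;
    uint32_t cdw13;
    uint32_t cdw14;
    uint32_t cdw15;
};

_Static_assert(sizeof(struct nvme_sqe) == 64, "SQE must be 64 bytes");
```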
What is NVMe?
NVMe Operations
What’s Special about NVMe?
(Slower → Faster)
• HDD + SCSI/SAS: varying-speed conveyor belts carrying data blocks (faster belts = lower seek time & latency); a tracked, single-arm pick & place robot executing 1 command at a time, 1 queue
• Flash + SCSI/SAS: all data blocks available at the same seek time & latency; a single-arm pick & place robot executing 1 command at a time, 1 queue
• Flash + NVMe/PCIe: all data blocks available at the same seek time & latency; a pick & place robot with 1000s of arms, all processing & executing commands simultaneously, with high queue depth of commands
Technical Basics
• 2 key components: Host and NVM Subsystem (a.k.a. storage target)
• "Host" is best thought of as the CPU and NVMe I/O driver for our purposes
• "NVM Subsystem" has a component - the NVMe Controller - which does the communication work
• Memory-based deep queues (up to 64K commands per queue, up to 64K queues)
• Streamlined and simple command set (13 commands; only 3 required)
• Command completion interface optimized for success (the common case)
• NVMe Controller: the SSD element that processes NVMe commands
The Anatomy of NVM Subsystems
What You Need - NVM Subsystem
• Architectural Elements:
• Fabric Ports
• NVMe Controllers
• NVMe Namespaces
(Diagram: an NVMe subsystem with two Fabric Ports, two NVMe Controllers, and Namespaces identified by NSIDs)
Fabric Ports
(Diagram: Fabric Ports on the NVMe subsystem)
NVMe Controller
• Admin Queue for configuration
• Scalable number of IO Queues
(Diagram: NVMe controllers inside the subsystem sit between the Fabric Ports and the NVM Media)
NVMe Namespaces and NVM Media
• Namespace Reservations
(Diagram: namespaces and NVM media within the NVMe subsystem)
NVMe Subsystem Implementations
NVMe PCIe SSD Implementation (single Subsystem/Controller)
Queuing and Queue Pairs
In This Section…
• NVMe Host/Controller Communications
• Command Submission and Completion
• NVMe Multi-Queue Model
• Command Data Transfers
• NVMe communications over multiple fabric transports
NVMe Host/Controller Communications
NVMe Multi-Queue Interface
(Diagram: the NVMe controller exposing multiple Submission/Completion Queue pairs to the host)
Queues Scale With Controllers
• Each Host/Controller pair has an independent set of NVMe queues
(Diagram: the NVMe Host Driver creates queues per CPU core - Core 0 through Core N-1 - for each controller)
NVMe Commands and Completions
• NVMe Commands are sent by the Host to the Controller in Submission Queue Entries (SQEs)
• Separate Admin and IO Commands
• Three mandatory IO Commands
• Two added fabric-only Commands
• Commands may complete out of order
• NVMe Completions are sent by the Controller to the Host in Completion Queue Entries (CQEs)
• Command Id identifies the completed command
• SQ Head Ptr indicates the consumed SQE slots that are available for posting new SQEs
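The completion side is equally compact. Here is an illustrative layout of the 16-byte Completion Queue Entry, showing where the Command Id and SQ Head Ptr mentioned above live (a sketch, not a driver definition):

```c
#include <stdint.h>

/* Illustrative layout of a 16-byte NVMe Completion Queue Entry (CQE).
 * Field names follow the completion format in the NVMe base spec. */
struct nvme_cqe {
    uint32_t result;    /* command-specific result (dword 0)              */
    uint32_t rsvd;      /* dword 1 (reserved)                             */
    uint16_t sq_head;   /* SQ Head Ptr: SQE slots the controller consumed */
    uint16_t sq_id;     /* which Submission Queue the command came from   */
    uint16_t cid;       /* Command Id of the command that completed       */
    uint16_t status;    /* Phase Tag (bit 0) plus the status field        */
};

_Static_assert(sizeof(struct nvme_cqe) == 16, "CQE must be 16 bytes");
```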
NVMe Generic Queuing Operational Model
1. Host Driver enqueues the SQE into the SQ
2. NVMe Controller dequeues the SQE
   (Data transfer, if applicable, happens here)
3. NVMe Controller enqueues the CQE into the CQ
4. Host Driver dequeues the CQE
This queuing functionality is always present, regardless of the underlying transport (a minimal host-side sketch follows).
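As a mental model only, the four steps map onto a plain ring-buffer producer/consumer pattern. The sketch below shows the host side under simplifying assumptions (single queue pair, fixed depth, no doorbells or interrupts, invented type and function names):

```c
#include <stdint.h>
#include <string.h>

#define QUEUE_DEPTH 64               /* real queues can be up to 64K deep */

struct nvme_sqe { uint8_t bytes[64]; };   /* 64-byte command (opaque here) */
struct nvme_cqe { uint8_t bytes[16]; };   /* 16-byte completion            */

/* Hypothetical queue pair: one Submission Queue and one Completion Queue. */
struct queue_pair {
    struct nvme_sqe sq[QUEUE_DEPTH];
    struct nvme_cqe cq[QUEUE_DEPTH];
    uint16_t sq_tail;    /* host writes new SQEs here (step 1)            */
    uint16_t sq_head;    /* advanced as the controller consumes SQEs      */
    uint16_t cq_head;    /* host reads new CQEs from here (step 4)        */
};

/* Step 1: host enqueues an SQE at the SQ tail. */
static int submit_command(struct queue_pair *qp, const struct nvme_sqe *cmd)
{
    uint16_t next = (qp->sq_tail + 1) % QUEUE_DEPTH;
    if (next == qp->sq_head)
        return -1;                               /* SQ is full            */
    memcpy(&qp->sq[qp->sq_tail], cmd, sizeof(*cmd));
    qp->sq_tail = next;
    return 0;
}

/* Step 4: host dequeues a CQE from the CQ head. Steps 2 and 3 happen on
 * the controller side; a real driver also checks the CQE Phase Tag to
 * know that a new entry is actually present. */
static void reap_completion(struct queue_pair *qp, struct nvme_cqe *out)
{
    memcpy(out, &qp->cq[qp->cq_head], sizeof(*out));
    qp->cq_head = (qp->cq_head + 1) % QUEUE_DEPTH;
}
```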
NVMe Queuing on Memory (PCIe)
1. Host Driver enqueues the SQE into the host-resident (in-memory) SQ
2. Host Driver notifies the controller with a PCIe Write to the SQ Doorbell register
3. NVMe Controller fetches the SQE from the memory SQ (PCIe Read, with the SQE returned in the PCIe Read Response) and dequeues it
4. NVMe Controller enqueues the CQE by writing it to the host-resident CQ (PCIe Write)
5. Host Driver dequeues the CQE and updates the CQ Head Doorbell register
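Step 2 is nothing more than a 32-bit MMIO write. The sketch below is illustrative: it assumes BAR0 has already been mapped to nvme_regs and that the doorbell stride has been read from the CAP register; the offset arithmetic follows the NVMe base specification.

```c
#include <stdint.h>

/* Hypothetical pointer to the controller's BAR0 register space, mapped
 * elsewhere during PCIe enumeration (not shown). */
static volatile uint8_t *nvme_regs;

/* Doorbell stride in bytes: 4 << CAP.DSTRD, read from the CAP register. */
static uint32_t db_stride = 4;

/* Doorbell registers start at offset 0x1000: SQ y Tail is entry 2y,
 * CQ y Head is entry 2y + 1 (per the NVMe base spec). */
static void ring_sq_tail_doorbell(uint16_t qid, uint16_t new_tail)
{
    volatile uint32_t *db =
        (volatile uint32_t *)(nvme_regs + 0x1000 + (2u * qid) * db_stride);
    *db = new_tail;   /* PCIe Write: tells the controller new SQEs are ready */
}

static void ring_cq_head_doorbell(uint16_t qid, uint16_t new_head)
{
    volatile uint32_t *db =
        (volatile uint32_t *)(nvme_regs + 0x1000 + (2u * qid + 1) * db_stride);
    *db = new_head;   /* releases consumed CQ slots back to the controller */
}
```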
What is NVMe™ over Fabrics (NVMe-oF™)?
What is NVMe over Fabrics (NVMe-oF™)?
• Built on the common NVMe architecture with additional definitions to support message-based NVMe operations
• Expansion of NVMe semantics to remote storage
• Retains NVMe efficiency and performance over network fabrics
• Eliminates unnecessary protocol translations
• Standardization of NVMe over a range of fabric types
• RDMA (RoCE, iWARP, InfiniBand™), Fibre Channel, and TCP
What’s Special About NVMe over Fabrics?
Maintaining Consistency
• Recall: Host
NVMe Host Driver
• Multi-queue model
CPU Core 0 CPU Core n…
• Multipathing capabilities built-in
• Optimized NVMe System
• Architecture is the same,
regardless of transport
• Extends efficiencies across Transport-Dependent Interfaces
fabric Memory PCIe Registers Fabric Capsule Operations
NVMe controller
#CLMEL © 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public
NVMe Multi-Queue Scaling
• Queue pairs scale across hosts and controllers
(Diagram: the NVMe Host Driver's queue pairs extend to multiple NVMe controllers)
NVMe and NVMe-oF Models
• NVMe is a Memory-Mapped, PCIe Model
• Fabrics is message-based, shared memory is optional
• In-capsule data transfer is always message-based
(Diagram: NVMe Transports)
Key Differences Between NVMe and NVMe-oF
• One-to-one mapping between I/O Submission Queues and I/O Completion Queues
• NVMe-oF does not support multiple I/O Submission Queues being mapped to a single I/O Completion Queue
• NVMe over Fabrics does not define an interrupt mechanism that allows a controller to generate a host interrupt
• It is the responsibility of the host fabric interface (e.g., Host Bus Adapter) to generate host interrupts
• NVMe over Fabrics Queues are created using the Connect Fabrics command
• Replaces PCIe queue creation commands
• If metadata is supported, it may only be transferred as a contiguous part of the logical block
• NVMe over Fabrics does not support transferring metadata from a separate buffer
• NVMe over Fabrics does not support PRPs but requires use of SGLs for Admin, I/O, and Fabrics commands
• This differs from NVMe over PCIe where SGLs are not supported for Admin commands and are optional for I/O commands
• NVMe over Fabrics does not support Completion Queue flow control
• This requires that the host ensures there are available Completion Queue slots before submitting new commands
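That last point means the host must do its own bookkeeping: never have more commands outstanding than there are free Completion Queue slots. A minimal illustration of that accounting, with names invented for this sketch:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-queue-pair accounting: with no CQ flow control in
 * NVMe-oF, the host must never have more commands outstanding than it
 * has free Completion Queue slots to receive the completions into. */
struct qp_accounting {
    uint16_t cq_depth;     /* slots in the Completion Queue              */
    uint16_t outstanding;  /* commands submitted but not yet completed   */
};

static bool can_submit(const struct qp_accounting *a)
{
    return a->outstanding < a->cq_depth;  /* a free CQ slot is guaranteed */
}

static void on_submit(struct qp_accounting *a)   { a->outstanding++; }
static void on_complete(struct qp_accounting *a) { a->outstanding--; }
```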
NVMe over Fabrics Capsules
• NVMe over Fabrics Command Capsule
• Encapsulates the NVMe SQE
• May contain additional Scatter Gather Lists (SGLs) or Command Data
• Transport-agnostic Capsule format
(Diagram: the Fabric Command Capsule carries the 64-byte SQE - Command Id, OpCode, NSID, Buffer Address (PRP/SGL), Command Parameters - followed by optional additional SGL(s) or Command Data)
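One illustrative way to picture a command capsule in C: a fixed 64-byte SQE followed by a variable-length tail that can carry additional SGL descriptors or in-capsule data. The type and helper below are invented for this sketch, not taken from the specification:

```c
#include <stddef.h>
#include <stdint.h>

struct nvme_sqe { uint8_t bytes[64]; };   /* the encapsulated 64-byte SQE */

/* Hypothetical command capsule: a fixed SQE followed by a variable-length
 * region for additional SGL descriptors or in-capsule command data. */
struct nvmeof_cmd_capsule {
    struct nvme_sqe sqe;      /* encapsulated Submission Queue Entry        */
    uint8_t         extra[];  /* optional additional SGL(s) or command data */
};

/* Size of a capsule carrying 'data_len' bytes of in-capsule data. */
static size_t capsule_size(size_t data_len)
{
    return sizeof(struct nvme_sqe) + data_len;
}
```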
NVMe-oF Encapsulation Process
(Diagram: the NVMe SQE and its PRP/SGL are wrapped in an NVMe-oF Capsule with a Transport SGL; the Capsule is then carried by the chosen transport - NVMe/RDMA places it on an RDMA queue pair, NVMe/TCP prepends an NVMe/TCP header and sends it over the TCP/IP byte stream, and NVMe/FC wraps it in an FCP/NVMe header inside a Fibre Channel Exchange of one or more frames)
NVMe-oF Queuing Interface to Transports
1. Host Driver encapsulates the SQE into an NVMe-oF Command Capsule
2. Transport sends the Capsule across the network/fabric as a transport-dependent message
3. Fabric transport at the target receives the Capsule and enqueues the SQE into the remote NVMe SQ
4. NVMe Controller dequeues and processes the SQE, then enqueues the CQE
5. Target transport encapsulates the CQE into a Response Capsule and sends it back as a transport-dependent message
6. Host Driver receives the Response Capsule and dequeues the CQE
NVMe Data Transfers
NVMe Command Data Transfers
• The SQE's Buffer Address field (PRP or SGL) describes where the command's data lives
• Physical Region Page (PRP): a physical memory page address, used by the Memory transport (PCIe)
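For the PCIe memory transport, a PRP-described buffer that spans at most two memory pages can be expressed with just the two PRP fields in the SQE. A simplified helper, assuming 4 KiB memory pages and an already-known physical/DMA address (names invented for this sketch):

```c
#include <stdint.h>

#define MEM_PAGE_SIZE 4096u   /* assumes CC.MPS selects 4 KiB memory pages */

/* Hypothetical helper: fill the SQE's two PRP fields for a buffer that
 * spans at most two memory pages. Larger transfers need a PRP list
 * (PRP2 then points to the list), which is not shown here. 'buf_phys'
 * is the buffer's physical/DMA address. */
static void set_prps(uint64_t *prp1, uint64_t *prp2,
                     uint64_t buf_phys, uint32_t len)
{
    uint32_t first_page_bytes =
        MEM_PAGE_SIZE - (uint32_t)(buf_phys % MEM_PAGE_SIZE);

    *prp1 = buf_phys;                        /* first page, may be offset */
    if (len <= first_page_bytes)
        *prp2 = 0;                           /* everything fits in page 1 */
    else
        *prp2 = buf_phys + first_page_bytes; /* second page, page-aligned */
}
```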
NVMe-oF Data Transfers (Memory + Messages)
• Command and Response Capsules are transferred using messages
(Diagram: capsules move as messages between the NVMe Host Driver and NVMe controllers; data is transferred to/from a Host Memory Buffer)
Benefits of RDMA
• Bypass of the system software stack components that process network traffic
• For user applications, RDMA bypasses the kernel altogether
• For kernel applications, RDMA bypasses the OS stack and the system drivers
• Direct data placement of data from one machine (real or virtual) to another machine, without copies
• Increased bandwidth while lowering latency, jitter, and CPU utilization
(Diagram: user-space IO library and kernel OS-stack paths versus the RDMA path straight to the NIC's transport/network and Ethernet layers over PCIe)
Queues, Capsules, and More Queues
Example of Host Write To Remote Target
• NVMe Host Driver encapsulates the NVMe Submission Queue Entry (including data) into a fabric-neutral Command Capsule and passes it to the NVMe RDMA Transport
• Capsules are placed in the Host RNIC RDMA Send Queue and become an RDMA_SEND payload
• Target RNIC at a Fabric Port receives the Capsule in an RDMA Receive Queue
• RNIC places the Capsule SQE and data into target host memory
• RNIC signals the RDMA Receive Completion to the target's NVMe RDMA Transport
• Target processes the NVMe Command and Data
• Target encapsulates the NVMe Completion Entry into a fabric-neutral Response Capsule and passes it to the NVMe RDMA Transport
Source: SNIA
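For the RDMA transports, "placed in the Host RNIC RDMA Send Queue" ultimately means posting a send work request whose payload is the capsule. Below is a heavily simplified libibverbs sketch; queue-pair setup, memory registration, and error handling are omitted, and the function name and the assumption that the capsule already sits in a registered region are mine:

```c
#include <stdint.h>
#include <stddef.h>
#include <infiniband/verbs.h>

/* Post an NVMe-oF command capsule as an RDMA SEND on an existing,
 * connected queue pair. The capsule must already live in a registered
 * memory region whose local key is 'lkey'. Illustrative only. */
static int post_capsule_send(struct ibv_qp *qp,
                             void *capsule, uint32_t len, uint32_t lkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)capsule,
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = (uintptr_t)capsule,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,       /* capsule becomes the RDMA_SEND payload */
        .send_flags = IBV_SEND_SIGNALED, /* ask for a send completion             */
    };
    struct ibv_send_wr *bad_wr = NULL;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```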
NVMe Multi-Queue Host Interface Map to RDMA Queue-Pair
Standard (local) NVMe Model
• NVMe Submission and Completion Queues are aligned to CPU cores
• No inter-CPU software locks
• Per-CQ MSI-X interrupts enable source-core interrupt steering
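The "aligned to CPU cores" point is what removes cross-CPU locking: each core submits only to its own SQ/CQ pair. A toy host-side illustration (the qps array and queue_pair type are invented for this sketch):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical setup: one queue pair per CPU core, created at init time. */
struct queue_pair;                  /* per-core SQ/CQ pair (as sketched earlier) */
extern struct queue_pair *qps[];    /* indexed by CPU core number                */

/* Pick the submitting core's own queue pair. No lock is needed because
 * no other core ever touches this core's SQ/CQ, and the matching CQ's
 * MSI-X interrupt can be steered back to the same core. */
static struct queue_pair *queue_for_this_cpu(void)
{
    int core = sched_getcpu();      /* which CPU core is submitting */
    return qps[core >= 0 ? core : 0];
}
```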
Varieties of RDMA-based NVMe-oF
• RoCE is based on InfiniBand transport over Ethernet
• RoCEv2 enhances RoCE with a UDP header (not TCP) and Internet routability
• Uses InfiniBand transport on top of Ethernet
• iWARP is layered on top of TCP/IP
• Offloaded TCP/IP flow control and management
• Both iWARP and RoCE (and InfiniBand) support verbs
• NVMe-oF using Verbs can run on top of either transport
NVMe over Fabrics and RDMA
• InfiniBand
• RoCE v1 (generally not used)
• RoCE v2 (most popular vendor option)
• iWARP
Compatibility Considerations
Choose one:
• iWARP and RoCE are software-compatible if written to the RDMA Verbs
• iWARP and RoCE both require rNICs
• iWARP and RoCE cannot talk RDMA to each other because of L3/L4 differences
• iWARP adapters can talk RDMA only to iWARP adapters
• RoCE adapters can talk RDMA only to RoCE adapters
No mix and match!
Key Takeaways
Things to remember about RDMA-based NVMe-oF
• NVMe-oF requires the low network latency that RDMA can provide
• RDMA reduces latency, improves CPU utilization
• NVMe-oF supports RDMA verbs transparently
• No changes to applications required
• NVMe-oF maps NVMe queues to RDMA queue pairs
• RoCE and iWARP are software compatible (via Verbs) but do not interoperate because their transports are different
• RoCE and iWARP
• Different vendors and ecosystem
• Different network infrastructure requirements
FC-NVMe: Fibre Channel-based NVMe-oF
Fibre Channel Protocol
• Fibre Channel has layers, just like other networking stacks
• FC-4 is the Protocol Interface layer, where upper-layer protocols (such as SCSI and NVMe) are mapped onto Fibre Channel
What Is FCP?
• What’s the difference between FCP and
“FCP”?
• FCP is a data transfer protocol that carries
other upper-level transport protocols
(e.g., FICON, SCSI, NVMe)
• Historically FCP meant SCSI FCP, but
other protocols exist
• NVMe “hooks” into FCP
• Seamless transport of NVMe traffic
• Allows high performance HBA’s to work
with FC-NVMe
FCP Mapping
• The NVMe Command/Response capsules, and for some commands the data transfer, are directly mapped into FCP Information Units (IUs)
• An NVMe I/O operation is directly mapped to a Fibre Channel Exchange
FC-NVMe Information Units (IUs)
Zero Copy
• Zero-copy allows data to be delivered to the user application with minimal copies
• RDMA is a semantic that encourages more efficient data handling, but you don't need it to get efficiency
• FC had zero-copy years before there was RDMA
• Data is DMA'd straight from the HBA into buffers passed to the user
• The difference between RDMA and FC is the APIs
• RDMA does a lot more to enforce a zero-copy mechanism, but RDMA is not required to get zero-copy
FCP Transactions (similar to RDMA)
• For Read: FCP_DATA from the Target
• For Write: data flows from the Initiator to the Target
(Diagram: IO Read and IO Write exchanges between the NVMe Initiator and NVMe Target)
RDMA Transactions
• For Read: RDMA Write (Target places data into Initiator memory)
• For Write: RDMA Read with RDMA Read Response (Target pulls data from Initiator memory)
(Diagram: IO Read and IO Write exchanges between the NVMe-oF Initiator and NVMe-oF Target)
FC-NVMe: More Than The Protocol
• Dedicated Storage Network
• Run NVMe and SCSI Side-by-Side
• Robust and battle-hardened discovery and name service
• Zoning and Security
• Integrated Qualification and Support
TCP-based NVMe-oF
NVMe-TCP
• NVMe™ block storage protocol over standard TCP/IP transport
• Enables disaggregation of NVMe™ SSDs without changes to networking infrastructure
• Independently scale storage & compute to maximize resource utilization and optimize for specific workload requirements
• Maintains the NVMe™ model: sub-systems, controllers, namespaces, admin queues, data queues
(Diagram: host-side and controller-side transport abstractions, with Fibre Channel, RDMA, and TCP as interchangeable transports beneath the NVMe host software)
Why NVMe™/TCP?
• Ubiquitous - runs on everything everywhere…
• Well understood - TCP is probably the most common transport
• High performance - TCP delivers excellent performance scalability
• Well suited for large scale deployments and longer distances
• Actively developed - maintenance and enhancements are developed by major players
• Inherently supports in-transit encryption
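As a concrete picture of what rides on that TCP connection: every NVMe/TCP PDU begins with a small common header. The sketch below follows the 8-byte common header layout described in the NVMe/TCP transport specification; it is illustrative rather than a driver definition.

```c
#include <stdint.h>

/* NVMe/TCP PDU common header (8 bytes). Every PDU on the connection
 * (ICReq, CapsuleCmd, C2HData, ...) starts with this, followed by a
 * PDU-specific header and, optionally, data. Illustrative sketch only. */
struct nvme_tcp_common_hdr {
    uint8_t  pdu_type;  /* which PDU this is (e.g., a command capsule)   */
    uint8_t  flags;     /* e.g., whether header/data digests are present */
    uint8_t  hlen;      /* length of the PDU header                      */
    uint8_t  pdo;       /* PDU data offset: where in-PDU data begins     */
    uint32_t plen;      /* total PDU length: header + data + digests     */
};

_Static_assert(sizeof(struct nvme_tcp_common_hdr) == 8,
               "common header must be 8 bytes");
```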
NVMe/TCP PDUs and Capsules Segmentation
NVMe/TCP Controller Association
(Diagram: association established between the NVMe-oF Host and the NVMe-oF Controller)
NVMe-TCP Data Path Usage
• Enables NVMe-oF I/O operations in existing IP datacenter environments
• Software-only NVMe Host Driver with NVMe-TCP transport
What’s New in NVMe
1.3
Source: NVM Express, Inc.
NVMe Status and Feature Roadmap
(Diagram: NVM Express and NVMe-oF specification roadmap by quarter, 2014-2019)
• Out-of-band management
• Device discovery
• Health & temp monitoring
• Firmware update
• In-band mechanism
• Storage device extension
Summary
or, How I Learned to Stop Worrying and Love the Key Takeaways
• NVMe and NVMe-oF:
• Treat storage like memory, just with permanence
• Built from the ground up to support a consistent model for NVM interfaces, even across network fabrics
• No translation to or from another protocol like SCSI (in firmware/software)
• The inherent parallelism of NVMe's multiple I/O Queues is exposed to the host
• NVMe commands and structures are transferred end-to-end, and the architecture is maintained across a range of fabric types
Q&A
Complete Your Online Session Evaluation
• Give us your feedback and receive a complimentary Cisco Live 2019 Power Bank after completing the overall event evaluation and 5 session evaluations.
• All evaluations can be completed via the Cisco Live Melbourne Mobile App.
• Don't forget: Cisco Live sessions will be available for viewing on demand after the event at: https://ciscolive.cisco.com/on-demand-library/
Thank you