4. [Diagram of the Linux NVMe stack: Application → Virtual Filesystem → Block Layer → NVMe Core (IOCTL, Multipath); transports: PCIe (PCI stack) and NVMe over Fabrics (RDMA stack, FC stack, TCP stack, loop); LightNVM Open-Channel stack; NVM Express subsystem target]
5. NVMe Multipathing
• NVMe 1.1+ introduced “subsystems”
• A collection of controllers that can access common namespace storage
• Controllers may be reachable through different paths and may connect to different hosts
• Uses for multipath:
• Aggregate bandwidth over multiple
connections
• Redundancy/failover
• Locality
6. NVMe Multipathing
• Asymmetric Namespace Access (ANA) ratified and publicly available
• Allows controllers to report their access state for each namespace:
Optimized
Non-optimized
Inaccessible
Persistent Loss
• The logical equivalent of SCSI ALUA
7. NVMe Multipathing
Multipathing with Device-Mapper
• Individual device nodes for each disk path:
# ls /dev/{nvme,dm}*
/dev/dm-0 /dev/nvme0n1 /dev/nvme1n1
• A kernel stacking block driver, agnostic to the
underlying protocols
• The multipathd userspace component provides
path management
• Lots of user knobs to tune behavior:
• Optimal, failover, round-robin path selection
• Measurable software overhead in the IO
path (~5 µs)
[Diagram: DM-Multipath stacked on the block layer above the NVMe driver, with multipathd managing paths from userspace]
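As a sketch of those knobs, a minimal /etc/multipath.conf fragment might look like the following (the policy values are standard multipath-tools options, but the choices shown here are illustrative, not a recommendation):

```conf
defaults {
    # Path selection policy: "round-robin 0", "service-time 0",
    # or "queue-length 0"
    path_selector        "round-robin 0"

    # "multibus" aggregates bandwidth over all paths;
    # "failover" keeps one active path for redundancy
    path_grouping_policy multibus

    # Return to the preferred path as soon as it recovers
    failback             immediate
}
```

multipathd rereads this configuration at startup or on `multipathd reconfigure`.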
8. NVMe Multipathing
Multipathing with native NVMe
• Paths managed by the nvme driver
CONFIG_NVME_MULTIPATH=y
• Built-in ANA support
• Single visible /dev/ entry for each shared
namespace
• Kernel block stack support:
• New ‘disk’ flag: GENHD_FL_HIDDEN
• Optimized recursive IO request handling
• Negligible IO path software overhead
• Pre-ANA, nvme defaults to “failover”
• Future: Add round-robin for path bandwidth
aggregation
[Diagram: IO flows from the block layer through NVMe multipath in the NVMe driver to each target path]
9. NVMe Host Code
● Total: 15k LoC for host driver
● Common code is likely to take on more of
the roles that individual drivers
currently provide
● Much protocol-specific work is
offloaded to the respective transport
subsystems
[Diagram: NVMe host code — a common core shared by the PCIe, RDMA, FC, and LightNVM drivers]
10. NVMe Target Code
[Diagram: NVMe target code — a common core shared by the FC, FC-Loop, Loop, and RDMA drivers]
• Total: 11k LoC for all target drivers
• Initially intended to expose NVMe SSDs
in passthrough mode
• That idea was not embraced by the Linux
community, so the target uses generic block
devices instead
• New in 4.19: file-backed NVMe targets!
• Loop targets provide easy testing
• Setup is through ‘configfs’
• ‘nvmetcli’ provides convenient target
management
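For illustration, a loop.json in roughly nvmetcli's save format — the subsystem NQN matches the slide 12 example, but the device path, port id, and attribute set here are placeholders, so treat this as a sketch rather than a verbatim schema:

```json
{
  "ports": [
    {
      "addr": { "trtype": "loop" },
      "portid": 1,
      "subsystems": [ "testnqn" ]
    }
  ],
  "subsystems": [
    {
      "nqn": "testnqn",
      "attr": { "allow_any_host": "1" },
      "namespaces": [
        {
          "nsid": 1,
          "enable": 1,
          "device": { "path": "/dev/nvme0n1" }
        }
      ]
    }
  ]
}
```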
12. NVMe Target Loop:
# nvmetcli restore loop.json
nvmet: adding nsid 1 to subsystem testnqn
nvmet: adding nsid 2 to subsystem testnqn
# nvme connect -t loop -n testnqn
nvmet: creating controller 1 for subsystem testnqn for NQN nqn.2014-08.org.nvmexpress:uuid:af21599e-…
nvme nvme2: ANA group 1: optimized.
nvme nvme2: creating 112 I/O queues.
nvme nvme2: new ctrl: "testnqn"
# nvme list
Node SN Model Namespace Usage Format FW Rev
------------- ------------------- -------------------- --------- -------------------------- --------------- --------
/dev/nvme0n1 CVFT50850022400GGN INTEL SSDPE2MD400G4 1 400.09 GB / 400.09 GB 512 B + 0 B 8DV101J0
/dev/nvme1n1 CVMD4215002W1P6DGN INTEL SSDPEDME012T4 1 1.20 TB / 1.20 TB 512 B + 0 B 8DV101B0
/dev/nvme2n1 9786d12e2da33ab0 Linux 1 200.00 MB / 200.00 MB 4 KiB + 0 B 4.18.0+
/dev/nvme2n2 9786d12e2da33ab0 Linux 2 400.09 GB / 400.09 GB 512 B + 0 B 4.18.0+
13. RDMA NVMe Target Write Data Flow
● The target driver stages data
destined for the storage end
device in System RAM
● The initiator side then sends a
command to the storage
controller
● The controller DMA reads the
data in and commits to non-
volatile memory
● Can we eliminate the staging
buffer?
[Diagram: the RDMA NIC DMA-writes incoming data to a staging buffer in system RAM; the NVMe controller then DMA-reads it across the switch and commits it to media]
14. RDMA Copy Offload: Peer to Peer (P2P)
● NVMe has an optional feature
where the controller provides a
memory region for staging write
data
● The RDMA NIC may bypass the host
entirely and DMA write the data
directly into the controller’s staging
area
● Linux status:
○ On version 4 as of August 2018
○ Need more hardware to verify
limitations: may need to blacklist
some switches and avoid root ports
under some conditions
[Diagram: the RDMA NIC DMA-writes data across the switch directly into the NVMe controller’s memory region, bypassing system RAM]
16. Write Streams
● IO hints create opportunities for
improved endurance, performance, and
latency, and for reduced garbage collection
(GC) and write amplification factor (WAF)
● The host may provide different
stream identifiers based on data
characteristics
● An enlightened NVMe Controller
can optimize NAND block allocation
from those characteristic hints
[Diagram: NAND block layout without streams vs. with streams]
17. Write Streams
● Enable nvme streams with kernel parameter: nvme_core.streams=1
● Checking if your device supports streams (nvme-cli)
# nvme id-ctrl -H /dev/nvme0 | grep Directives
[5:5] : 1 Directives Supported
● Write hints inform the relative expected lifetime of writes on a given inode
(F_SET_RW_HINT) or an open file description (F_SET_FILE_RW_HINT):
fcntl([file descriptor], [cmd], [hint])
● Supported hints:
RWH_WRITE_LIFE_NOT_SET, RWH_WRITE_LIFE_NONE
RWH_WRITE_LIFE_SHORT, RWH_WRITE_LIFE_MEDIUM
RWH_WRITE_LIFE_LONG, RWH_WRITE_LIFE_EXTREME
● Testing with fio:
write_hint=[none|short|medium|long|extreme]
18. Beyond NAND: Next Generation Media
Scalable Resistive Memory Element
● Orders of magnitude lower
latency than the fastest flash
memory
● Various families of this
technology have been prototyped
● Currently available: Intel and Micron
3D XPoint
○ Intel-branded Optane SSDs
● The low latency presents
interesting software challenges
19. IO Polling
[Timeline: host submits IO and context-switches to sleep; the device executes the command and posts an interrupt; another context switch precedes IO completion]
• Typical IO is asynchronous: task sleeps while device is processing a command,
incurring context switch (CS) software overhead
• An idle processor may go into deep sleep states, further increasing latency
20. IO Polling
[Timeline: host submits IO and polls while the device executes the command; completion is observed with no context switches]
• IO Polling, though, removes all context switch overhead, reducing latency
• But at the expense of consuming 100% of CPU cycles
• So … probably not a good fit for use with higher latency devices!
21. IO Hybrid Polling
[Timeline: host submits IO, takes a timed sleep (with context switches), then polls just before the expected completion]
• Linux provides an option to use a timed sleep, relinquishing CPU cycles while still
getting the polling benefit
• The kernel learns what sleep time to use for different transfer sizes
22. IO Polling Configuration
● Enable through sysfs
○ /sys/block/<blk>/queue/io_poll
● Configuring polling:
○ /sys/block/<blk>/queue/io_poll_delay
-1 Classic pure polling
0 Hybrid Learned Polling
>0 Fixed delay in nanoseconds
23. Using IO Polling
● Use the recently created syscalls preadv2 and pwritev2:
ssize_t preadv2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
ssize_t pwritev2(int fd, const struct iovec *iov, int iovcnt,
off_t offset, int flags);
● Set “flags” to RWF_HIPRI
● Testing with ‘fio’:
ioengine=pvsync2
hipri
● Tests demonstrate latency reductions of 2–10 microseconds!
● Much software overhead is removed, but this is still not utilizing the media to
its capabilities! Must get the PCIe interface out of the way!
25. Persistent Memory
• Media availability is still limited
• But software development can’t wait for hardware!
• Testing with emulation:
• Kernel system ram:
memmap=16G!4G
• qemu file-backed ‘nvdimm’:
-object memory-backend-file,
id=mem1,share,mem-path=/home/images/nvdimm-1,
size=16G,align=128M
-device nvdimm,memdev=mem1,id=nv1,
label-size=2M
26. Using the block stack with PMEM
[Diagram: Application → Filesystem → Block layer → PMEM driver → NVDIMMs]
• Applications, filesystems, and
other block storage service work
with persistent memory in this
mode without modification
• Implemented as a RAM disk
driver, but with specific
implementations of memcpy for
persistent memory
[Diagram annotations: applications use the standard filesystem API (e.g. read/write); the driver reads with memcpy_mcsafe and writes with memcpy_flushcache]
28. Persistent Memory: Block Translation Table
BTT: Provides atomic sector updates
• Similar implementation to raw mode, but with a logical-to-physical indirection
layer
• For filesystems and apps that can’t tolerate torn sectors
• Introduces <1% capacity overhead and <10% performance overhead
"dev":"namespace0.0",
"mode":"sector",
"size":17162027008,
"uuid":"f1a0d4ea-e880-4efb-87ca-0883dd6ee153",
"blockdev":"pmem0s"
29. Using the block stack with PMEM
While access is quite fast in either mode, it STILL has too much
software overhead to use the media to its capabilities!
31. Persistent Memory: DAX
[Diagram: the application mmaps a file on a DAX filesystem; the PMEM driver resolves page faults, after which loads and stores go directly to the NVDIMMs]
• Application storage access may bypass kernel
entirely, directly reading and writing to media
within application memory space
• Kernel only needs to provide persistent handles
(region namespaces and files) and initialize
memory mapping
• Map either through filesystem (recommended) or
direct to device
• struct page metadata: stored in System RAM or on the NVDIMM?
• Consumes 1/64th of the available capacity at the
chosen location
• NVDIMMs are still slower than RAM …
• New challenges for applications:
atomic/transactional object updates
32. DAX aware applications
• Application memory accesses are staged
in volatile CPU caches
• Memory changes must be robust
to unexpected power loss
• Need ACID-like semantics
• A library and reference implementation for
handling persistent memory in application
space is freely provided by PMDK
(pmem.io)
[Diagram: a MOV travels from the application through the CPU caches (L1/L2/L3) to the NVDIMM write pending queue (WPQ)]
33. Persistent Memory: FS DAX
• Filesystem support for direct access, handles naming, permissions
and file semantics
• In-kernel support provided by XFS and EXT4
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":16909336576,
"uuid":"5f5efe29-3ece-4444-ad94-7a9eac363620",
"blockdev":"pmem0"
34. Persistent Memory: Device DAX
Use if your applications need direct control over data placement on the DIMMs
"dev":"namespace0.0",
"mode":"devdax",
"map":"dev",
"size":16909336576,
"uuid":"d812941b-6de6-48b1-82bc-5ce6c87cbccd",
"chardev":"dax0.0"