Updates on storage standards

By Jake Edge
April 1, 2025

LSFMM+BPF

As he has in some previous editions of the Linux Storage, Filesystem, Memory-Management, and BPF Summit (LSFMM+BPF), Fred Knight gave an update on the status of various storage standards this year. In it, he looked at changes to the NVM Express (NVMe) standards in some detail. He also updated attendees on the fairly small changes that have come to the SCSI (T10) and ATA (T13) standards over the last few years.

He began with a bit of NVMe history. In May 2021, the 2.0 base specification was released, splitting several pieces of the standard out into their own specifications: three command-set specifications, three transport specifications, and the management-interface specification were all released alongside the base specification at the same time. Over the next three years, several revisions (2.0a through 2.0e) were released that contained "corrections and clarifications" without any new features.

[Fred Knight]

In August 2024, NVMe 2.1 was released, which incorporated those revisions and added new features. Several of the command-set specifications were revised in the process and two more were added: the computational programs command set and, going along with it, the subsystem local memory command set. The transport and management-interface specifications were revised, as well. A new boot specification was added to describe how NVMe would interface with EFI and the operating systems in order to boot over the network.

The 2.2 base specification had just been ratified earlier in March, Knight said. It had updates for several command-set specifications, the PCIe transport specification, and the boot specification.

Since the time period from August to March was not all that long, he said, there are not a lot of changes in 2.2, just a few approved technical proposals (TPs) and clarifications. TP 4194 for the base specification was aimed at better describing the scatter-gather list (SGL) feature; "the goal of this TP was not to change anything" but to make some of the assumptions about the interpretation of the existing specification more concrete. In particular, the validation of entries in an SGL only needs to be done for those entries that are actually used in the transfer, not all of the entries on the list. In addition, the errors that should be returned for various conditions are now fully described; "there could have been some confusion on how people interpreted which error is returned".
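For readers who have not looked at the feature, an SGL describes a transfer as a chain of 16-byte descriptors. Below is a sketch of the generic descriptor layout in C, based on the public base specification; the struct and constant names are paraphrased for illustration rather than copied from any real header.

    #include <stdint.h>

    /* The generic 16-byte NVMe SGL descriptor, as described in the
     * base specification; names are illustrative. */
    struct nvme_sgl_desc {
        uint64_t addr;     /* bytes 0-7: buffer (or next segment) address */
        uint32_t length;   /* bytes 8-11: length in bytes */
        uint8_t  rsvd[3];  /* bytes 12-14: reserved */
        uint8_t  type;     /* byte 15: type (bits 7:4), subtype (bits 3:0) */
    };

    /* Descriptor types from the specification */
    #define NVME_SGL_DATA_BLOCK    0x0  /* a plain data buffer */
    #define NVME_SGL_BIT_BUCKET    0x1  /* data to be discarded */
    #define NVME_SGL_SEGMENT       0x2  /* points at another descriptor list */
    #define NVME_SGL_LAST_SEGMENT  0x3  /* points at the final list */
    #define NVME_SGL_KEYED_DATA    0x4  /* keyed data block (RDMA transports) */

Under TP 4194's reading, a controller walking such a chain only needs to validate the descriptors it actually consumes to satisfy the command's transfer length; entries beyond that point can simply be ignored.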

Engineering change notices (ECNs) tend to be clarifications to the specifications, while the TPs are functional changes, he said. The SGL changes added "should" and "shall" language, so those were considered functional changes. Several clarifications to the base specification came with the adoption of ECN 123. The clarification on the interaction between reservations and security send/receive is one that he thought might be of particular interest. The process for reading the boot partition has also changed to eliminate some timing-related problems with some devices.

In the zoned-namespace (ZNS) specification and several others, there is a change to specify all of the opcodes in a command set. Earlier, two commands ended up with the same opcode in the subsystem-local-memory (SLM) specification, which required breaking compatibility by changing the SLM opcode. In order to avoid that kind of problem in the future, ECN 125 changes all of the command sets to list all of their opcodes so there can be no mistaken duplication. In the PCIe transport specification, the eye-opening-measurement feature, which evaluates the signal quality on the bus and was added in the 2.1 specification, was changed for 2.2 to expand the eye data length (EDLEN) field from 16 to 32 bits. The boot specification was clarified, as part of ECN 123, to ensure that all of the universally unique IDs (UUIDs) used are specified in all lower case.
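As a rough illustration of what listing every opcode buys, the snippet below gathers a command set's opcodes into a single table and scans it for duplicates; the values are well-known NVM and ZNS I/O opcodes from the public specifications, shown here only as examples rather than as a complete table.

    #include <stdio.h>

    struct opcode_entry {
        unsigned char op;
        const char *name;
    };

    /* Illustrative, incomplete table of NVM/ZNS I/O opcodes. */
    static const struct opcode_entry io_opcodes[] = {
        { 0x00, "Flush" },
        { 0x01, "Write" },
        { 0x02, "Read" },
        { 0x09, "Dataset Management" },
        { 0x19, "Copy" },
        { 0x79, "Zone Management Send" },
        { 0x7a, "Zone Management Receive" },
        { 0x7d, "Zone Append" },
    };

    int main(void)
    {
        size_t n = sizeof(io_opcodes) / sizeof(io_opcodes[0]);

        /* A duplicate here is exactly the kind of collision that
         * forced the incompatible SLM opcode change. */
        for (size_t i = 0; i < n; i++)
            for (size_t j = i + 1; j < n; j++)
                if (io_opcodes[i].op == io_opcodes[j].op)
                    printf("duplicate opcode 0x%02x: %s vs. %s\n",
                           io_opcodes[i].op, io_opcodes[i].name,
                           io_opcodes[j].name);
        return 0;
    }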

The T10 committee has been making a few changes to the SCSI specifications, Knight said. There has been work on the command duration limit (CDL) feature and on depopulation. The addition of DMTF security protocols and data models (SPDM) support is perhaps the most interesting part, he said. It will allow authentication between the host and the device, "so you can establish a secure channel and know that you're talking to the right device, not an impostor". The same set of features is being worked on for the ATA command-set (ACS) standards, he said, though there is also some work being done on resetting all of the write pointers at once for zoned-storage devices.

Knight closed by noting that the increased frequency of NVMe releases is intentional. More than three years elapsed between 2.0 and 2.1, so there were a lot of changes, which required a great deal of work for the community to put together. The committee has therefore decided to release smaller, more frequent versions, "so that they're easier to digest, easier to understand, and easier to follow". The 2.2 release is the first of these, and the hope is that more will be coming soon.


Index entries for this article
Conference: Storage, Filesystem, Memory-Management and BPF Summit/2025



offload to ssd? neat

Posted Apr 1, 2025 21:33 UTC (Tue) by dankamongmen (subscriber, #35141) [Link] (8 responses)

https://nvmexpress.org/wp-content/uploads/NVM-Express-Com...

this is pretty cool! i'd worked with some computational memory experimental hardware, but hadn't seen any kind of standard for offloading computation to the memory hierarchy. anyone know devices supporting this?

offload to ssd? neat

Posted Apr 1, 2025 22:06 UTC (Tue) by willy (subscriber, #9762) [Link] (5 responses)

When I was at Intel, it was a fairly common request from customers: "Can we offload $feature to the SSD? There's this great paper from $ResearchGroup showing improvements".

The reply from our SSD group was always the same: We have designed our SSD to fit in a certain power/performance/cost envelope. We don't have "spare cycles" on the drive's CPU to process the data. Indeed, we go out of our way to avoid touching the user data with the drive's CPU.

I don't expect this effort to go anywhere unless something has fundamentally changed. Certainly not on consumer devices. Maybe you'll find a research device, or cloud vendors will offer it as part of their virtualized storage devices.

offload to ssd? neat

Posted Apr 1, 2025 22:36 UTC (Tue) by andresfreund (subscriber, #69562) [Link]

Funnily enough, I've been repeatedly in the reverse position. Various storage vendors trying to convince us (various postgres services companies) that we really need to offload parts of postgres storage to their fancy new drives.

offload to ssd? neat

Posted Apr 2, 2025 3:40 UTC (Wed) by kpmckay (subscriber, #134608) [Link]

I tend to agree with the SSD group. In a $/GB or Watts/GB dogfight, it's hard to justify spending extra die area on something without well understood value. Even if there is a $feature that's a net positive for some use case, where's the 2nd or 3rd source going to come from and will $feature behave the same way across vendors? I think that there are a handful of compute functions that make sense to do within a storage/NVMe controller, but they have to be essentially invisible to applications. Nobody thinks of encryption as a "computational storage" function, but I think it's a good example of a widely deployed compute function in storage devices that makes sense. IMO, DPU-like devices are probably the right place to do any real heavy lifting with storage offloads because their resources/functions are amortized/applied over a number of drives, those drives can come from multiple vendors, and they're not necessarily bound to the block device abstraction.

Device → device copy offload?

Posted Apr 3, 2025 7:25 UTC (Thu) by DemiMarie (subscriber, #164188) [Link] (1 responses)

What about offloading device to device copies? The device already needs to be able to do such copies for GC.

Device → device copy offload?

Posted Apr 3, 2025 13:31 UTC (Thu) by willy (subscriber, #9762) [Link]

I assume you mean intra-device copying (as opposed to one device sending data to another device, which is functionality that exists).

Funnily, it's a completely different operation from the device's point of view. The GC operation copies the data block intact and updates the FTL so that lookups of LBA 45678 now point to the new location on flash. An offloaded copy needs to read in the data block, decrypt, update the tags, encrypt, write it out and update the LBA. That's because both the encryption and tag verification use the LBA as the seed, not the location on the flash.

This is why I was never able to get the REMAP command into NVMe. It looks cheap from the host point of view, but it's very expensive for the drive. It saves PCIe bandwidth, but that's not generally the limiting factor.
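A sketch of the asymmetry willy describes, in C-flavored prototypes; every type and helper here (flash_copy_raw(), decrypt_with_tweak(), and so on) is invented for illustration and corresponds to no real firmware interface.

    #include <stdint.h>

    #define SECTOR_SIZE 4096

    /* Hypothetical stand-ins for drive-firmware internals. */
    typedef uint64_t flash_addr_t;
    struct ftl;

    flash_addr_t ftl_lookup(struct ftl *ftl, uint32_t lba);
    void ftl_update(struct ftl *ftl, uint32_t lba, flash_addr_t loc);
    void flash_copy_raw(flash_addr_t src, flash_addr_t dst);
    void flash_read(flash_addr_t src, uint8_t *buf);
    flash_addr_t flash_write(const uint8_t *buf);
    void decrypt_with_tweak(uint8_t *buf, uint32_t lba);
    void encrypt_with_tweak(uint8_t *buf, uint32_t lba);
    void verify_tag(const uint8_t *buf, uint32_t lba);
    void retag(uint8_t *buf, uint32_t lba);

    /* GC-style move: the LBA does not change, so the ciphertext is
     * copied untouched and only the mapping table is updated. */
    void gc_move(struct ftl *ftl, uint32_t lba, flash_addr_t new_loc)
    {
        flash_copy_raw(ftl_lookup(ftl, lba), new_loc);
        ftl_update(ftl, lba, new_loc);
    }

    /* Host-visible copy to a new LBA: everything seeded by the LBA
     * (encryption tweak, integrity tag) must be redone per block. */
    void offload_copy(struct ftl *ftl, uint32_t src_lba, uint32_t dst_lba)
    {
        uint8_t buf[SECTOR_SIZE];

        flash_read(ftl_lookup(ftl, src_lba), buf);
        decrypt_with_tweak(buf, src_lba);  /* tweak seeded by source LBA */
        verify_tag(buf, src_lba);          /* tag also seeded by the LBA */
        retag(buf, dst_lba);
        encrypt_with_tweak(buf, dst_lba);
        ftl_update(ftl, dst_lba, flash_write(buf));
    }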

offload to ssd? neat

Posted Apr 3, 2025 20:14 UTC (Thu) by kbusch (subscriber, #171715) [Link]

You don't really need the computation and storage to coexist on the same device. I have some gpu type devices that look remarkably like nvme, but they use a vendor specific command set (and don't have flash). Not sure how closely you're tracking nvme driver happenings, but the uring_cmd support with the "nvme-generics" (ex: /dev/ng0n1) created some interesting ways to leverage the protocol. For some extra spice, add device direct io queues (Stephen Bates' lsfmm talk), and you can get peer-to-peer communication among many devices all talking nvme.
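For anyone who has not tried the passthrough interface, here is a minimal sketch that issues one NVMe read through uring_cmd against an nvme-generic device. It assumes /dev/ng0n1 exists, that the namespace ID is 1, and a 4096-byte LBA format, with error handling trimmed, so treat it as a starting point rather than production code.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <liburing.h>
    #include <linux/nvme_ioctl.h>

    int main(void)
    {
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        struct nvme_uring_cmd *cmd;
        void *buf;
        int fd = open("/dev/ng0n1", O_RDONLY);

        if (fd < 0) { perror("open /dev/ng0n1"); return 1; }

        /* Passthrough needs the big-SQE/big-CQE ring format. */
        if (io_uring_queue_init(8, &ring,
                                IORING_SETUP_SQE128 | IORING_SETUP_CQE32))
            return 1;
        if (posix_memalign(&buf, 4096, 4096))
            return 1;

        sqe = io_uring_get_sqe(&ring);
        memset(sqe, 0, 2 * sizeof(*sqe));  /* clear the full 128-byte SQE */
        sqe->opcode = IORING_OP_URING_CMD;
        sqe->fd = fd;
        sqe->cmd_op = NVME_URING_CMD_IO;

        /* The NVMe command itself lives in the SQE's command area. */
        cmd = (struct nvme_uring_cmd *)sqe->cmd;
        cmd->opcode = 0x02;         /* NVM Read */
        cmd->nsid = 1;              /* assumed namespace ID */
        cmd->addr = (unsigned long)buf;
        cmd->data_len = 4096;
        cmd->cdw10 = 0;             /* starting LBA, low 32 bits */
        cmd->cdw11 = 0;             /* starting LBA, high 32 bits */
        cmd->cdw12 = 0;             /* number of LBAs, zero-based */

        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);
        printf("read completed, res = %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
    }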

offload to ssd? neat

Posted Apr 2, 2025 10:35 UTC (Wed) by kurogane (subscriber, #83248) [Link] (1 responses)

I was very excited about this class of SSD say about 4 years ago. For the main database engine types, it's the only thing in any credible research that can deliver the next 10x performance improvement. And reduce latency volatility. The volatility point is especially important: the more optimized a software-only db engine gets, the more horribly the gears get jammed when the I/O channels become saturated.

But when I tried to get my hands on one of them, no luck. I had some great phone calls and was promised access to a datacenter with some SSDs that had come off the factory line implementing some of the earlier specs discussed in the article. At the time, I represented a company with hundreds of customers of our own, too. But they never came through.

A comment from the SSD sales guy later was that they were mainly aiming for hyperscaler orders. But that strategy depends on hyperscalers buying into something that will _reduce_ revenue in their DBaaS services and oblige them to provision more network IO around users' database servers.

Why offload to SSD?

Posted Apr 6, 2025 19:37 UTC (Sun) by DemiMarie (subscriber, #164188) [Link]

Why is this such a huge performance win? Reducing round-trips?

Is SCSI dead?

Posted Apr 3, 2025 7:30 UTC (Thu) by DemiMarie (subscriber, #164188) [Link] (6 responses)

Is SCSI useful nowadays for anything other than legacy? Does it offer useful features NVMe does not?

Is SCSI dead?

Posted Apr 3, 2025 19:39 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link] (5 responses)

SAS spinning rust drives are pretty common.

Is SCSI dead?

Posted Apr 3, 2025 19:47 UTC (Thu) by zdzichu (subscriber, #17118) [Link] (4 responses)

Yeah, but nvme got rotational drive support in 6.13. So what good is SCSI now (apart from connecting scanners and tape changers)?

Is SCSI dead?

Posted Apr 3, 2025 19:56 UTC (Thu) by kbusch (subscriber, #171715) [Link]

We just need the HDD vendors to ship rotational NVMe! Linux is ready for these, but I haven't seen any beyond demos and samples.

Is SCSI dead?

Posted Apr 6, 2025 19:49 UTC (Sun) by DemiMarie (subscriber, #164188) [Link] (2 responses)

USB Attached SCSI and Universal Flash Storage use SCSI, but I'm unsure what the advantage of using SCSI in these fields is.

Is SCSI dead?

Posted Apr 7, 2025 4:30 UTC (Mon) by cladisch (✭ supporter ✭, #50193) [Link] (1 responses)

All other storage standards (ATAPI, SATA, SAS, USB MSC) are just mechanisms to move SCSI commands.

Is SCSI dead?

Posted Apr 8, 2025 9:35 UTC (Tue) by farnz (subscriber, #17727) [Link]

SATA's not just a transport for SCSI commands; for SATA HDDs and SSDs, there's a specific SATA command set, separate to SCSI.

ATAPI over SATA is, however, just a transport for SCSI commands.


Copyright © 2025, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds