Explicit block device plugging

Posted Apr 29, 2011 7:40 UTC (Fri) by dlang (guest, #313)
In reply to: Explicit block device plugging by dlang
Parent article: Explicit block device plugging

or possibly a better way of putting it

assume that each disk action takes 15ms and data arrives every 10ms

with plugging that holds a lone block for up to 15ms, or until a second item arrives, you write

2 blocks starting at 10ms finishing at 25ms
2 blocks starting at 30ms finishing at 45ms
2 blocks starting at 50ms finishing at 65ms
2 blocks starting at 70ms finishing at 85ms
etc

without plugging you write

1 block starting at 0ms finishing at 15ms
1 block starting at 15ms finishing at 30ms
2 blocks starting at 30ms finishing at 45ms
1 block starting at 45ms finishing at 60ms
2 blocks starting at 60ms finishing at 75ms
1 block starting at 75ms finishing at 90ms
etc

does this really make a difference? yes, in the second case the disk is busy continuously rather than sitting idle for 5ms between operations, but does that matter?

say the data arrives twice as fast (every 5ms)

with plugging
2 blocks starting at 5ms finishing at 20ms
3 blocks starting at 20ms finishing at 35ms
3 blocks starting at 35ms finishing at 50ms

without plugging
1 block starting at 0ms finishing at 15ms
3 blocks starting at 15ms finishing at 30ms
3 blocks starting at 30ms finishing at 45ms

where is the gain?
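
(here is a toy simulation of the model above, in case anyone wants to play with the numbers - the 15ms/10ms parameters are the made-up ones from this example, nothing measured, and the "plugged" policy simply holds a lone block while a second one is due within the window)

    #include <stdio.h>

    #define SEEK     15   /* ms per disk operation, regardless of size */
    #define ARRIVAL  10   /* ms between incoming blocks */
    #define LAST    100   /* time of the final arrival */

    static void simulate(int plugged)
    {
        int next = 0;     /* arrival time of the next block */
        int queued = 0;   /* blocks waiting to be written */
        int now = 0;

        printf("%s plugging:\n", plugged ? "with" : "without");
        while (next <= LAST || queued) {
            /* absorb every block that has arrived by 'now' */
            while (next <= now && next <= LAST) {
                queued++;
                next += ARRIVAL;
            }
            if (!queued) {          /* disk idle: wait for a block */
                now = next;
                continue;
            }
            /* plugged: hold a lone block while another is due soon */
            if (plugged && queued == 1 && next <= LAST &&
                next <= now + SEEK) {
                now = next;
                continue;
            }
            printf("  %d block(s) starting at %dms finishing at %dms\n",
                   queued, now, now + SEEK);
            now += SEEK;
            queued = 0;
        }
    }

    int main(void)
    {
        simulate(1);
        simulate(0);
        return 0;
    }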



Explicit block device plugging

Posted Apr 29, 2011 9:29 UTC (Fri) by neilbrown (subscriber, #359)

While I don't disagree with your logic, I do disagree with its relevance.

In the linux kernel, plugging is not timer based.
(There was a timer in the previous implementation, but it was only a last-ditch unplug in case there were bugs: slow is better than frozen).

In the old code a device would plug whenever it wanted to, which was typically when a new request arrived for an empty queue. It would then unplug as soon as some thread started waiting for a request on that device to complete. I think it would also unplug explicitly in some cases after submitting lots of writes that were expected to be synchronous, but I'm not 100% certain.

So in the read case, for example, a read syscall would submit a request to read a page, then another request to read the next page (because it was an 8K read), then maybe a few more requests to read ahead some more pages, then wait for that first read to complete. Waiting for the read-ahead requests maybe isn't critical, but waiting for that second page would reduce latency. Now to be fair, if the two pages were adjacent on the disk they would probably have been combined into a single request before being submitted, and if they aren't, then maybe keeping them together isn't so important. But as soon as you get 3 non-adjacent pages in the read, there is a real possible gain from sorting before starting IO.
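
Roughly, as a sketch (the helper names here are invented, not the real kernel functions of the time):

    /* submission side: plug when a request lands on an empty queue */
    void old_submit(struct request_queue *q, struct request *rq)
    {
        if (queue_is_empty(q))         /* hypothetical helper */
            plug_device(q);            /* hold back dispatch */
        enqueue_and_merge(q, rq);      /* sort/merge while plugged */
    }

    /* completion side: the first waiter pulls the plug */
    void old_wait_for(struct request *rq)
    {
        unplug_device(rq->q);          /* start real dispatch */
        wait_for_rq_completion(rq);    /* e.g. waiting on that first page */
    }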

The new plugging code is quite different. The unplug happens when the thread submitting requests has finished submitting a bunch of requests. It is explicit rather than using the heuristic of 'unplug when someone waits' (hence the title of the article). This means it happens a little bit sooner - there is never any timer delay at all.

Rather than thinking of it as 'plugging', it is probably best to think of it as early aggregation. hch has suggested that this be made even more explicit: i.e., the thread generates a collection of related requests (quite possibly several files' worth of writes in the write-back case) and submits them all to the device at once. Not only does this clearly give a good opportunity to sort requests - more importantly, it means we only take the per-device lock once for a large number of requests. If multiple threads are writing to a device concurrently, this reduces lock contention, making it useful even when the device queue is fairly full (when normal plugging would not apply at all).
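
From the submitting thread's side, the new interface looks roughly like this (a simplified kernel sketch; the actual submission and error handling are omitted):

    #include <linux/blkdev.h>

    void writeback_batch(void)
    {
        struct blk_plug plug;

        blk_start_plug(&plug);    /* requests now accumulate per-task */
        /* ... submit one bio per page or chunk here; adjacent ones
         * merge, non-adjacent ones can be sorted before dispatch ... */
        blk_finish_plug(&plug);   /* hand the whole batch to the device */
    }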

The equivalent logic in a 'syslogd' style program would be to simply always service read requests before write requests.

So when a log message comes in, it is queued to be written.
Before you actually write it, though, you check whether another log message is ready to be read from some incoming socket. If it is, you read it and queue it. You only write when there is nothing to be read, or your queue is full.
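
In code, that policy might look something like this (a userspace sketch with invented queue helpers, not taken from any real syslogd):

    #include <poll.h>

    /* hypothetical queue helpers: */
    extern int  queue_empty(void);
    extern int  queue_full(void);
    extern void read_and_enqueue(int fd);
    extern void flush_queue(void);

    void service_loop(int log_sock)
    {
        for (;;) {
            struct pollfd pfd = { .fd = log_sock, .events = POLLIN };
            int timeout = queue_empty() ? -1 : 0;  /* block only when idle */

            if (poll(&pfd, 1, timeout) > 0 && !queue_full())
                read_and_enqueue(log_sock); /* service reads first */
            else if (!queue_empty())
                flush_queue();       /* nothing readable, or queue full:
                                      * now do the writes */
        }
    }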

I agree that having a timed unplug event doesn't make much sense.

Explicit block device plugging

Posted Apr 29, 2011 14:57 UTC (Fri) by dlang (guest, #313)

note that in my example, the timer was only used to indicate the max amount of time to wait for the next item to be submitted.

how can the kernel know when the application has finished submitting a bunch of requests?

or is it that the application submits one request, but something in the kernel is breaking it into a bunch of requests that all get submitted at once, and plugging is an attempt to allow the kernel to recombine them? (but that doesn't match your comment about sorting 3 non-adjacent requests being a win - how can one action by an application generate 3 non-adjacent requests?)

I'm obviously missing something here.

if the application is doing multiple read/write commands, I don't see how the kernel can possibly know how soon the next activity will be submitted after each command is run.

if the application is doing something with a single command, it seems like the problem is that it shouldn't be broken up to begin with, so there would be no need to plug to try to combine them

Explicit block device plugging

Posted Apr 29, 2011 22:58 UTC (Fri) by neilbrown (subscriber, #359)

The actual times between plug and unplug are typically microseconds (I suspect). The old timeout was set at 3 milliseconds and that was very slow. It is almost nothing compared to device IO times.

Actions of the application and requests to devices are fairly well disconnected thanks to the page cache. An app writes to the page cache and the page cache doesn't even think about writing to the device for typically 30 seconds. Of course if the app calls fsync, that expedites things. So a partial answer to "how can the kernel know when the application has finished submitting a bunch of requests" is "the application calls 'fsync' - if it cares".
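
In userspace terms (a trivial sketch; fd and the buffers are assumed):

    /* these writes just dirty the page cache and return */
    write(fd, buf1, len1);
    write(fd, buf2, len2);

    /* this is the "I have finished submitting, and I care" signal
     * that expedites the writeback described above */
    fsync(fd);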

On the read side, the page cache performs read-ahead so that hopefully every read request can be served from cache - and certainly the device gets large read requests even if the app is making lots of small read requests.

Also, the kernel does break things into a bunch of requests which then need to be sorted. If a file is not contiguous on disk, then you need at least one request for each separate chunk. Plugging allows this sorting to happen before the first request is started.

There is a good reason why the page cache submits lots of individual requests rather than a single list with lots of requests. Every request requires an allocation. When memory gets tight (which affects writes more than reads), it could be that I cannot allocate memory for another request until the previous ones have been submitted and completed. So we submit the requests individually, but combine them at a level a little lower down, and 'unplug' that queue either when all have been submitted or when the thread 'schedules' - which it will typically only do if it blocks on a memory allocation.
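
So the pattern is roughly this (a simplified sketch - the page iterator and bio helper are invented, though blk_start_plug/blk_finish_plug and the schedule-time flush are as described):

    struct blk_plug plug;
    struct bio *bio;

    blk_start_plug(&plug);
    for_each_dirty_page(page) {        /* hypothetical iterator */
        bio = bio_for_page(page);      /* hypothetical; may block when
                                        * memory is tight - and blocking
                                        * schedules, which flushes the
                                        * plugged requests for us */
        submit_bio(WRITE, bio);        /* one smallish request at a time */
    }
    blk_finish_plug(&plug);            /* normal case: flush the batch */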

So there are two distinct things here that could get confused.

Firstly there is the page cache which deliberately delays writes and expedites reads to allow large requests independent of the request size used by the application.

Then there is the fact that the page cache sends smallish requests to the device, but tends to send a lot in quick succession. These need to be combined when possible, but also flushed as soon as there is any sign of any complication. This last is what "plugging" does.

