Lock-free, MPSC-queue based, fast-path serializing Observer #1235
RxJava-pull-requests #1136 SUCCESS
One additional node. It is possible the JVM will eliminate the
Could the return value just be added to the accept method without a performance hit, instead of having the awkwardly named, almost identical accept2 method?
Your research on this is awesome. I look forward to playing with this queue implementation and seeing its performance compared to alternatives. However, this slows down the normal non-contended use cases. I don't think this applies in this case, because we are NOT exchanging events between threads; we are actively trying not to do that. A high-performance queue (particularly of the lock-free kind) will make great sense, however, in places such as
I suggest we get this queue and counter into
Here is my branch where I merged this PR and moved the code into
By the way, I'll spend more time testing the contended cases. Don't close this PR, I'm not done reviewing it all.
The JVM can remove a synchronized block if it detects single-threaded use; I can't compete with that by using atomics. This is a tradeoff scenario: get very fast synchronous behavior, or get double the throughput in the contended case. If we value the fast synchronous behavior more, then there is no need to go atomic; just make sure we don't use wait/notify.
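To make the tradeoff concrete, here is a minimal sketch (illustrative names, not code from this PR) of the two counter styles being compared: a synchronized increment, which the JVM can bias or elide entirely under single-threaded use, versus a field-updater increment, which always pays for a hardware read-modify-write:

```java
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

// Hypothetical illustration of the synchronized-vs-atomics tradeoff.
// Both counters are functionally equivalent; only their cost profiles differ.
public class CounterTradeoff {
    static final class SyncCounter {
        private int value;
        // The JVM may bias/elide this lock when only one thread ever enters it.
        synchronized int increment() { return ++value; }
    }

    static final class AtomicCounter {
        volatile int value;
        static final AtomicIntegerFieldUpdater<AtomicCounter> VALUE =
                AtomicIntegerFieldUpdater.newUpdater(AtomicCounter.class, "value");
        // Always an atomic RMW instruction, even with zero contention.
        int increment() { return VALUE.incrementAndGet(this); }
    }

    public static int run(int n) {
        SyncCounter s = new SyncCounter();
        AtomicCounter a = new AtomicCounter();
        for (int i = 0; i < n; i++) { s.increment(); a.increment(); }
        return a.value; // both counters end at n
    }
}
```

Under contention the picture inverts: the atomic version keeps scaling while the lock serializes all producers, which is exactly the fork in the road being discussed.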
I know that; we determined it in #1190. If we're going to change the decision on what tradeoff to make, then we need to go back and revisit why we made the decision we did. The expected common case is when there is no contention. I think we need a broader set of use cases to determine which direction the tradeoff should be made.
This was automatically closed as I merged the queue implementation. We still need to finish discussing the SerializedObserver part. |
I have my doubts now. Some simple benchmarks show improvements; other JMH benchmarks don't. Without industrial use cases, I think synchronized constructs are enough, because of either low contention or single-threaded use. In the latter, JVM optimizations will always win. For example, I tried double-checked locking with Composite and barely got 10 MOps/s. The current version gives 1.3 GOps/s if run within a 100k loop long enough.
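For reference, this is the shape of the double-checked locking experiment mentioned above, as a hedged sketch (illustrative names, not the actual Composite code). The volatile read on the fast path is the cost that kept throughput low in that trial:

```java
// A minimal sketch of the double-checked locking pattern: the first check
// avoids the lock on the hot path, but still pays for a volatile read.
public class LazyHolder {
    private volatile Object value;

    public Object get() {
        Object v = value;            // first (unsynchronized) check: volatile read
        if (v == null) {
            synchronized (this) {
                v = value;           // second check, now under the lock
                if (v == null) {
                    v = new Object();
                    value = v;       // volatile write publishes the instance
                }
            }
        }
        return v;
    }
}
```

A plain synchronized accessor, by contrast, can be fully elided by the JVM in single-threaded use, which is why it can look dramatically faster in tight-loop benchmarks.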
I think if in doubt, let's just not outsmart the JVM :) |
Here is an ad hoc conversation on Twitter with some experts on this: https://twitter.com/benjchristensen/status/472362740510494721 I'd like to improve our perf testing and have cases for uncontended, normal occasional contention, and highly contended, and test them in a few places. I'll do the upgrade of JMH today so we have the latest code. Once I have those in place I'll put it out there for review and guidance.
For what it is worth, I find the state-machine approach leads to hard-to-read code, so if there is no obvious win, I'd go for simple.
I've rewritten the SerializedObserver to improve performance. It is now lock-free, uses a multiple-producer-single-consumer queue based on Netty's implementation, and employs fast-path logic for single-threaded write-through.

Benchmarked by measuring how fast 500k integers can get through it when running 1-8 producers at the same time. For the single-threaded case, master gives about 18 MOps/s; this implementation gives ~36 MOps/s (it would be ~16 MOps/s on the slow path). For more than 2 producers, master gives ~5.5 MOps/s and this gives ~11.5 MOps/s. For 2 producers, i.e., 1 producer and 1 consumer, master gives ~4.5 MOps/s and this gives ~8.5 MOps/s.
The two new classes, PaddedAtomicInteger and MpscPaddedQueue, will come in handy with other lock-free structures such as Schedulers, etc. We may consider adding back the rx.util or some other sub-package to store these helper classes: they don't need to be part of the public API but can be left public to enable cross-package access internally.

Things I learned during the implementation:

- sun.misc.Unsafe can add 8-10% more throughput. To avoid platform issues, I stayed with the FieldUpdaters.
- getAndIncrement and decrementAndGet are intrinsified in Java 8 and are compiled to a single x86 instruction, which generally outperforms any CAS loop. The same is true for getAndSet.
- Padding the tail in the MpscPaddedQueue again helps separate producers thrashing on the tail from the consumer reading the head. Without it, the throughput would decrease by about ~1.5 MOps/s.
- The queue makes sense in operators such as merge and co., which serialize multiple sources. Note, however, that if the source speed isn't as high as in the benchmark, this implementation still provides less latency than the alternatives, because the fast path would most likely be open if the source emission is interleaved.
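For readers unfamiliar with the queue shape being ported, here is a hedged, simplified sketch of a Vyukov-style MPSC linked queue in the spirit of the Netty-derived MpscPaddedQueue (no padding, plain AtomicReference instead of field updaters): producers publish with a single getAndSet on the tail, the intrinsified instruction mentioned above, and the single consumer reads the head without any CAS.

```java
import java.util.concurrent.atomic.AtomicReference;

// Simplified multi-producer single-consumer linked queue.
// Producers: one atomic swap of the tail, no CAS retry loop.
// Consumer: plain reads/writes on head; it must be a single thread.
public class MpscLinkedQueue<T> {
    static final class Node<T> {
        final T value;
        volatile Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> tail;
    private Node<T> head; // accessed by the single consumer only

    public MpscLinkedQueue() {
        Node<T> stub = new Node<>(null); // sentinel; holds no value
        tail = new AtomicReference<>(stub);
        head = stub;
    }

    /** Multi-producer safe. */
    public void offer(T value) {
        Node<T> n = new Node<>(value);
        Node<T> prev = tail.getAndSet(n); // claim our slot in one swap
        prev.next = n;                    // link; consumer sees it via volatile next
    }

    /** Single-consumer only. Returns null when nothing is visible yet. */
    public T poll() {
        Node<T> next = head.next;
        if (next == null) return null;
        head = next;           // old head node becomes garbage
        return next.value;
    }
}
```

The real implementation additionally pads the tail onto its own cache line so producer swaps don't invalidate the line the consumer's head reads from, which is the ~1.5 MOps/s difference noted above.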