r/rust


Multithreaded unsafe access to a `Vec`

I have two threads, one that is regularly updating a Vec<u8> and one that is reading it.

The important point here is: it doesn't matter if the reading thread reads from the vector while it's in the middle of being updated. In other words, if the writing thread gets scheduled off partway through updating that vector, it's totally fine: the values read will still be useful, even if the update was only half applied.

Right now, Rust is forcing me to use a Mutex or RwLock to avoid this kind of partial update. How can I relax this restriction?

u/thisismyfavoritename

You're at risk that a write (e.g. a push) causes a memory reallocation, in which case your reads could dangle.

u/dkopgerpgdolfg

"Some" kind of synchronization is necessary for the hardware (and language-level UB too), otherwise there are several ways it goes wrong that are worse than what you describe.

Can the Vec become longer/shorter, or are you just changing existing u8 to other values? Some specific access pattern, size, frequency, ...?

There's no particular guarantee that the writes are even atomic, i.e. you could end up changing some bits in both threads, leading to completely wrong data (unlikely, but not impossible thanks to undefined behaviour, right?)

u/dkopgerpgdolfg

Yes, I know this.

And while it might be unlikely for one single u8, in general it can and does happen. With e.g. something larger that straddles a cache-line boundary, ending up with neither the old nor the new state is quite likely.

Yeah, sorry, I figured you'd know; the comment was for OP as a follow-on.

u/Disastrous_Bike1926

Sounds like you want atomics. And if the size is known and fixed, maybe an array.

The solution will be very different if the data can grow or shrink over its lifetime, versus remaining a fixed size. I don’t think anything will get you out of using a lock if the size can change.

u/bskceuk

You could consider something like https://docs.rs/left-right/latest/left_right/

Not sure if that is highly used, but anyway the oversimplified idea is that it would have 2 vectors. The reader would read from one and the writer would mutate the other and safely sync the versions when the reader isn’t reading.

u/Wicpar

It is never safe to write while reading: in the best case you'll have stale data in one thread but not the other (if it's sitting in cache), and in the worst case garbled values because only half of one of your entries was committed to main RAM. You need some sync. It could be an array of atomics, a mutex or an RwLock; there are also crates for multithreaded collections that are optimized for large amounts of concurrency.

u/devraj7

It is never safe to write while read

That's an overly broad statement that is easily proven wrong.

Imagine a location provider and the requirement is that the reader needs to know where the location is RIGHT NOW.

In other words, every time the provider publishes a new location, all past locations are obsolete and should be discarded.

Edit: Downvoted for expressing my requirements... Why is it so hard for downvoters to entertain a scenario they've never encountered? Oh well.

u/dkopgerpgdolfg

How does this prove anything? You're just describing some business logic.

Unless the value is an atomic, there is no guarantee that the value read is any of the values that were ever written. It could be a garbage mix of old and new values.

u/devraj7

And that's totally okay for my scenario: a mix of new values and old values is just fine.

It's not guaranteed to be either an old or new value. If the old value is 5 and the new value is 7, you could read 42. At that point, you just have a bad RNG. You need some kind of sync.

The way that some databases handle your scenario is to have a write-ahead log. In other words, you never overwrite the old data, you only append to it. That way, you can read the old data without any sync while the new data is being written. The pointer to the head of the log will still need to be an atomic.

Are you sure that's ok? I'm not just talking about reading 2 separate u8s and getting one old and one new value.

You could actually read a u8 value that was never written.

If your code can genuinely cope with that, then you can use unsafe pointer reads/writes to let the compiler know to trust you.

u/kyraa_x

no, this is never what you want, since the compiler can, and will, compile your code into something different from what you wrote.

u/Inevitable-Cicada603

Okay. But if you’re reading mid write there’s no guarantee that you’re getting correct data. It could be garbled.

u/war-armadillo

With all due respect, I believe that you're not fully understanding what the commenters are saying. The issue is not that it will read potentially stale data (which you've already said is fine). It's that it causes undefined behavior in which case all bets are off, including but not limited to causing a segmentation fault.

The solution (already mentioned elsewhere) would be to use a Vec<AtomicU8> with relaxed ordering, which achieves what you want without causing any performance penalty or undefined behavior (and doesn't use unsafe).

It seems to me like you've already made up your mind about using unsafe without actually testing or benchmarking the proposed solutions.

u/devraj7

I've actually never used unsafe in many years of Rust and I'd really rather not. I've heard a few good suggestions in this thread which I'll try.


It depends. If your threads access the Vec through a shared reference, then you can just use Vec<AtomicU8>. Judging by the statement that it's OK to read during updates, consistency is not a requirement, so you can use relaxed ordering. Be advised, though, that memory ordering is quite a non-trivial topic, so you may want to enforce some additional guarantees.

In case you want to do mutating operations on the Vec itself (mainly operations that can change its size), there is really no way around using either a lock or some alternative data structure. Mutating the Vec itself can lead to use-after-free because of storage reallocation.
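A minimal sketch of that (untested; assumes the length is fixed up front and that a mix of old and new bytes is acceptable):

use std::sync::atomic::{AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    // Fixed-size buffer of atomics, shared between the writer and the reader.
    let buf: Arc<Vec<AtomicU8>> = Arc::new((0..1024).map(|_| AtomicU8::new(0)).collect());

    let writer_buf = Arc::clone(&buf);
    let writer = thread::spawn(move || {
        for (i, slot) in writer_buf.iter().enumerate() {
            // Relaxed: no ordering guarantees, but each byte is written atomically.
            slot.store(i as u8, Ordering::Relaxed);
        }
    });

    // The reader may observe a mix of old and new bytes, but never garbage and never UB.
    let snapshot: Vec<u8> = buf.iter().map(|slot| slot.load(Ordering::Relaxed)).collect();
    println!("{:?}", &snapshot[..8]);

    writer.join().unwrap();
}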

u/juanfnavarror

It might actually be more performant to RwLock the Vec, and simpler to implement.

u/SkiFire13

RwLock requires an atomic write even for read operations; it's pretty difficult for that to end up being faster.


Use an array or slice of AtomicU8s, and share that between the threads, if your byte vector doesn't grow. Otherwise, you probably should use Mutex or Rwlock. You might also have some luck searching crates.io for crates providing safe implementations of data structures made for your exact use case (There are none that I know of).

Edit: (after looking at some of the other comments)

Never, ever, share any data that some thread has mutable (in the Rust sense, i.e. exclusive) access to with another thread without some kind of synchronisation mechanism.

It's not as simple as "Oh, but I can just pass in a pointer and read from it. I don't care if the data is partially written." Memory is complicated, and the compiler and your CPU, which theoretically have full awareness of its intricacies, may do anything to optimise your code unless something restricts them, such as the things the compiler suggested to you.

For example, that pointer you pass in might not point to where you think it does. Each write you perform might turn into several writes, to several disjoint locations, in different kinds of memory, or into no write at all. What you're trying to do is immediate undefined behavior (please read up a bit more on what that means); it will not produce anything useful, only completely random problems, anywhere in your program, when you least expect them, that are far worse than just "partially updated values".

u/XMLHttpWTF

look into using a lock-free ring buffer

u/Chadshinshin32

For mutating the current values while another thread is reading them, you can just use a Vec<AtomicU8>.

For adding new values, you can use something like a triple buffer, or use arc_swap and copy-and-swap whenever you need to reallocate.

For the latter method, you could make your own Vec(wrapped in an Arc) with an AtomicUsize length that the writer bumps when pushing new elements. When you need to reallocate, the writer allocates a new Arc<CustomVec>, copies the elements over, and swaps the old Arc<CustomVec> with the newly created one. The reader will deallocate the old one if it's currently holding onto it.
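A rough sketch of the atomic-length part, assuming a single writer and a capacity fixed up front (growing past it would still need the Arc swap described above; the names are made up for illustration):

use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};
use std::sync::Arc;

// Append-only byte buffer with a fixed capacity; only `len` changes after construction.
struct AppendBuf {
    data: Box<[AtomicU8]>,
    len: AtomicUsize,
}

impl AppendBuf {
    fn with_capacity(cap: usize) -> Arc<Self> {
        Arc::new(Self {
            data: (0..cap).map(|_| AtomicU8::new(0)).collect(),
            len: AtomicUsize::new(0),
        })
    }

    // Single writer: store the byte first, then publish the new length with Release,
    // so any reader that observes the new length also observes the byte.
    fn push(&self, byte: u8) {
        let i = self.len.load(Ordering::Relaxed);
        assert!(i < self.data.len(), "capacity exhausted");
        self.data[i].store(byte, Ordering::Relaxed);
        self.len.store(i + 1, Ordering::Release);
    }

    // Reader: Acquire pairs with the Release store in `push`.
    fn snapshot(&self) -> Vec<u8> {
        let n = self.len.load(Ordering::Acquire);
        self.data[..n].iter().map(|b| b.load(Ordering::Relaxed)).collect()
    }
}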

u/FreeKill101

Sharing access to an actual Vec like this is never okay, because the writer can trigger a reallocation.

But can you just use something like &Box<[AtomicU8]>? Then you can read and write via shared references.

u/Darksonn

Use Vec<AtomicU8>. Relaxed reads and writes compile down to completely ordinary reads and writes in machine code, so there's no atomic overhead from that.

what you want is a channel

std mpsc channel docs

u/devraj7

That was my first implementation, but they are not an option in this case because the consumer falls behind the producer very quickly.

In other words... lag that quickly gets out of hand.

On top of that, all the reader cares about is reading the latest state of the vector, and each new write obsoletes the previous write, so channels are the wrong approach for this.

u/juanfnavarror

RwLock is good enough; it protects you from a race condition. Remember that vectors get reallocated when they reach capacity, which invalidates a read that happens during a write, possibly even causing a segmentation fault! For that reason it is undefined behavior to read while you write. Only go after unsafe solutions if you actually see a performance issue. You should probably learn safe Rust first!

Also look at watch channels

u/devraj7

Watch channels look very promising, thank you!

u/sonthonaxrk

I love how people come up with complex multithreading problems and the solution is always a simple construct.


ah, if writes rewrite the whole thing, and you want the reader to just get the latest update

then you want a synchronization mechanism, like a mutex

u/devraj7

Mutexes turn out to be expensive.

In other words, if the provider is writing "1 2 3 4 5", the mutex will only let my reader read once all five of these values have been written.

But I'm totally fine if the reader reads "1 2 3 <old value> <old value>".

u/prolapsesinjudgement

Not sure if it's your answer, but it sounds quite similar to https://lib.rs/crates/left-right - a nifty data structure that allows reading an old version while writes are pushed to an alternate copy, and then shuffled.

The nice thing about that is it's safe (well, "proven" i believe. I imagine it uses unsafe lol).

See the attached video in that crate for more details. Jon does a great job explaining it


oh i see, that's an odd behavior

i guess you could have a vec of mutexes, instead of a mutex of a vec

this would let you write to each individual element on its own while not blocking the whole vec

you'd probably also need to wrap the vec in an arc to share it between threads

the thing is, how would the reader know when the new values stop and the old values start?

if the writer is in the middle of writing a value, trying to read it will return garbage

you need some synchronization so that the reader never tries to read invalid data (that the writer is in the process of writing)

if the writer always rewrites the whole vec at once, and this process isn't slow (like, waiting on IO to get the next value to write), then just wrap the vec in a mutex and handle the synchronized reads. it won't be that slow

if writing is a long process, you'll need to break it up to allow the reader to get the data in between the writing part
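for reference, a minimal sketch of the plain mutex version (assuming the writer rewrites the whole vec in one go):

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let shared = Arc::new(Mutex::new(vec![0u8; 5]));

    let writer = {
        let shared = Arc::clone(&shared);
        thread::spawn(move || {
            // the whole update happens under the lock
            let mut v = shared.lock().unwrap();
            v.copy_from_slice(&[1, 2, 3, 4, 5]);
        })
    };

    // the reader sees either all old values or all new values, never a mix
    println!("{:?}", shared.lock().unwrap());

    writer.join().unwrap();
}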


I think you're looking for the pattern known in the Linux kernel as Read-Copy-Update (RCU); it's based on atomics and reads are lock-free.

u/Individual_Place_532

If you are using tokio, check out tokio::sync::watch::channel; it is for exactly this behavior.

all the reader cares about is reading the latest state of the vector

Then use a channel that doesn't have a buffer, i.e. only allows reading the latest value, for example Tokio's watch channel.
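A minimal sketch with Tokio's watch channel (the setup here is just illustrative):

use tokio::sync::watch;

#[tokio::main]
async fn main() {
    // The channel always holds exactly one value: the latest snapshot.
    let (tx, mut rx) = watch::channel(vec![0u8; 5]);

    // Reader task: only ever sees the most recent snapshot, never a partial one.
    let reader = tokio::spawn(async move {
        rx.changed().await.unwrap();
        println!("{:?}", *rx.borrow_and_update());
    });

    // Writer: each send replaces the previous value wholesale.
    tx.send(vec![1, 2, 3, 4, 5]).unwrap();

    reader.await.unwrap();
}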

u/null_reference_user

If you really want to have no synchronization at all for performance reasons, I recommend first not using Vec and instead using something like a Box<[u8]>, and second, whatever unsafe code you end up having to write, test it with Miri.

You might think your unsafe code is sound. I come from a C background and was sure my code that manually used pointers had no issues and worked properly, yet when I ran it with Miri it raised some peculiar issues (what in the name of fuck is a "stacked borrow"?) that could (didn't, but could) result in undefined behavior.

u/dkopgerpgdolfg

I recommend reading about things like e.g. "data race". Skipping synchronization entirely for performance reasons is not a valid choice.

And if Miri complains about something, and assuming Miri has no bug, then it "is" UB. No "could", no "wasn't". And it doesn't matter that you haven't seen anything go wrong; UB doesn't mean that it always must go wrong.

u/LiesArentFunny

If you're guaranteed not to be reading and writing the same element of the vector simultaneously: vec.as_mut_ptr() returns a pointer to the first element, then use unsafe { *ptr.offset(i) } for reads and unsafe { *ptr.offset(i) = val } for writes. Don't access the vec in any other way while you're holding a pointer returned by that operation and you won't violate aliasing rules. Do test with Miri.

If not, use Vec<AtomicU8> or use synchronization. An unsynchronized read through a pointer is a promise to the compiler that nothing is writing to that piece of memory, not a way to get an arbitrary value back. The compiler will take that promise and rely on it when performing optimizations that will make your program not do at all what you think it does.

The first one is very fragile for many reasons and one should not do it this way. Fortunately, it is not straightforwardly implementable due to raw pointers not being Send and Vec::as_mut_ptr requiring &mut self.

In this case, there are really no reasons not to use vec of atomics.

u/andwass

In this case, there are really no reasons not to use vec of atomics.

Doubly so on x86_64; relaxed atomic loads/stores use literally the same instructions as non-atomic loads/stores.

That's described here: https://llvm.org/docs/Atomics.html

u/LiesArentFunny

I believe in answering questions, not hiding information because I think the asker isn't smart enough to use it responsibly.

Pointers can be moved across threads simply by telling the compiler that's allowed (unsafe impl Send/Sync on a struct they're in), or you can send the Vec across threads and call as_mut_ptr() twice, since that method guarantees it doesn't create a reference to the underlying slice.


The problem here is having a non-exclusive mut reference to the same Vec, which is undefined behavior. The limitation itself is easy to overcome by using pointers that you cast to a (mut) reference as needed, but be really mindful of undefined behavior.

It is not undefined behaviour to have several non-exclusive mut references to the same Vec.

Mut references are always exclusive, what the hell do you mean?

If you have a vector of length 10

One thread modifies the first 5 elements, one thread modifies the last 5 elements.

You end up with two mutable references yet no data races or undefined behaviour.

It's not about the language, it's about semantics.

Mutable references are not always exclusive. If that were the case, one would have a lot of trouble providing the Rust ecosystem with thread-safe data structures like mutexes or SPSC queues.

This is a use case for the unsafe keyword.

I then propose we use the term "pointers" instead of "references", because we're mixing up the term "reference" between its concept in Rust (where mutable references are always exclusive) and its concept in general.

u/DisasterReasonable34

In case you do not append elements to the vector, you can use AtomicU8 with Ordering::Relaxed instead of u8 as the vector element. On ARM, for sure, loads/stores to these compile to the same instructions as regular loads/stores, and you avoid the locking overhead.

&[AtomicU8] and Ordering::Relaxed accesses. A non-atomic read that races a non-atomic write is just undefined behavior. Most likely this will just compile to a simple read without any extra memory synchronization, but it prevents things like spuriously duplicated reads causing issues.

Could you shard the vector and use individual mutexes so that the whole vector isn't ever locked by a reader or writer at any given moment?
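Something like this, maybe (a rough sketch; the chunk size is arbitrary, and whether it actually beats a single RwLock would need benchmarking):

use std::sync::{Arc, Mutex};

const CHUNK: usize = 64;

// Each chunk gets its own lock, so the writer only ever blocks readers
// of the chunk it is currently updating.
fn sharded(len: usize) -> Arc<Vec<Mutex<Vec<u8>>>> {
    Arc::new(
        (0..len)
            .step_by(CHUNK)
            .map(|start| Mutex::new(vec![0u8; CHUNK.min(len - start)]))
            .collect(),
    )
}

fn write_byte(shards: &[Mutex<Vec<u8>>], index: usize, value: u8) {
    let mut chunk = shards[index / CHUNK].lock().unwrap();
    chunk[index % CHUNK] = value;
}

fn read_byte(shards: &[Mutex<Vec<u8>>], index: usize) -> u8 {
    let chunk = shards[index / CHUNK].lock().unwrap();
    chunk[index % CHUNK]
}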

u/tylian

Here's your footgun:

Get a raw pointer to the vector:

let vec_ptr: *mut Vec<u8> = &mut vec;

Pass that to your threads, and then use unsafe to turn it back into a reference.

let vec = unsafe { &mut *vec_ptr };
vec[i] = i as u8;

// or

let vec = unsafe { &*vec_ptr };
println!("{:?}", vec);

Have fun.

For a more serious answer, I'd consider u/bskceuk's suggestion, using something like this: https://lib.rs/crates/left-right

u/LiesArentFunny

Do not do this. It results in undefined behavior if you access two different parts of the vector simultaneously.

You need a ptr to the data, vec.as_mut_ptr(), not a ptr to the vec.

u/tylian

Yeah. My code is entirely unsafe and unsound and should never be used. It was a joke answer, because the OP refuses to listen to us when we say what they want will result in UB.

Vec::as_mut_ptr would help, if OP didn't want to access it for reading and writing at the same time across multiple threads.


Claiming that it doesn’t matter if the reading thread operates at the same time as the writing thread is demonstrating either a misunderstanding of what that means, or that your reading thread is doing nothing with the data and probably shouldn’t even exist.

u/Kilobyte22

Please note that you cannot do this at all with types where not every possible bit combination is valid (like bool, char or str), or you will have UB, which can lead to crashes, code not running that should, code running that shouldn't, or the compiler doing optimisations which are completely broken. As soon as you encounter UB you might as well randomise all the instructions of your program.

Even when every bit pattern is valid, you will still get corrupted data. Before doing any of this I highly recommend reading the Nomicon, even though it isn't up to date.

Generally: before using unsafe, always ask yourself (and potentially others): is there any way I can solve this problem without unsafe?

Oh, and here's an example of what undefined behaviour means (I'm using C here, but the same applies to Rust; the code example would just be more complex):

#include <stdio.h>

int main(void) {
    int a;  /* uninitialized: reading it is undefined behaviour */
    if (a == a) {
        puts("This code will likely never run and is in fact probably optimised away");
    }
}