Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall filtering #14

Open
shlomopongratz opened this issue Aug 28, 2017 · 4 comments
Open

syscall filtering #14

shlomopongratz opened this issue Aug 28, 2017 · 4 comments

Comments

@shlomopongratz
Copy link

Hi,
I have read the syscall filtering code and I have two comments.

One, In windows we have up to two SSDT (from 4 possible one bits 12-13) each of wich may have up to 4K entries (bits 0-11) and in Linux we have much less. So we may have up to 8K possible syscalls on the worst case. Now you have declared the hash table to be of size 4K entries (12 bits) so why not just use an 8K table and remove the hash table overhead.

Two, I see that you protect the insertion and deletion with mutex but you don't protect the "find" shouldn't you use RW lock, or better RCU. I also suggest using spinlock instead of mutex as mutex can cruse context switch.
Best regards,

S.P.

@Wenzel
Copy link
Member

Wenzel commented Aug 31, 2017

Hi @shlomopongartz ,

thanks for your review, it's very appreciated !

first implementation using a table

One, In windows we have up to two SSDT (from 4 possible one bits 12-13) each of wich may have up to 4K entries (bits 0-11) and in Linux we have much less. So we may have up to 8K possible syscalls on the worst case. Now you have declared the hash table to be of size 4K entries (12 bits) so why not just use an 8K table and remove the hash table overhead.

My first implementation was an array of integers: KVM-VMI/kvm@9c3e588#diff-e4915d9c60ed6f8fec205d6349ae7ea3R21

And at each syscall, i had to find the syscall in this array: KVM-VMI/kvm@9c3e588#diff-4ab90127bb787cca2b557234276fbe47R162

So the more syscalls you had, the slower it gets to process a new syscall through the filter.

An 8K table

What you suggest is a big 8K table for every possible syscall
How do you index the syscall in the table to find out if it is filtered or not ?

I guess that you define an offset for Windows and Linux maybe ? Windows has no offset, starts at 0, and Linux starts at 4K (right after Windows table).

-> This forces us to add more Windows / Linux / OS logic in the kernel, and i wanted the implementation to be as simple and OS agnostic as possible.

Why i choose an hastable ?

At the beginning, what i wanted was a simple set structure.
It's easy, you just have a set interface, where you can test if a syscall is already in the set, and add a new syscall.

But i couldn't find any. (maybe could you, if you know the kernel better ?)

So i fallback on the hashtable to get a constant O(1) access time, even if there is hash function overhead.

Another reason is that i assume that the user only wants to monitor a small subset of all the syscalls available. (and it is the main use case for everyone using the filter)
That's why i designed this 12 bits, 4K hashtable.

-> The best solution here would be to have a set that automatically resizes itself.
If you come up with this, i would be very happy to review it ! ;)

mutexes

Two, I see that you protect the insertion and deletion with mutex but you don't protect the "find" shouldn't you use RW lock, or better RCU. I also suggest using spinlock instead of mutex as mutex can cruse context switch.

It's correct, i should protect the find also !
If there is a modification of the filters while a find is ongoing, the find will be broken.

-> i will lock into the differences between a spinlocks and mutextes

Can you explain a bit about the RCU lock ? (instead of RW ?)

Thanks.

@shlomopongratz
Copy link
Author

Hi,

I don't think you need to have different code for Linux and Windows guests as both pass the system call number in the RAX register and if you use a array of booleans then you are done. If you want you can pass a mask from user space which makes the kernel code generic, that is the kernel code will always mask the RAX value with the mask to obtain the syscall number.

RCU is a little bit overkill as the updates are not done in the data path so RW locks are just fine.
A good explanation on RCU can be found in the folder Documentation/RCU/ but again in second thought it is an overkill.

Best regards,

S.P>

@noxdafox
Copy link

noxdafox commented Sep 4, 2017

Regarding the use of a simple array vs a hash table you have quite a valid point. Yet I realized a corner case to be considered which is affecting both the implementations.

What would happen if the guest would be running a HW assisted virtualized guest itself?
What would happen if the guest (windows 10 anniversary update) would be running some software on the Linux subsystem?

To pose the question in a more generic way: how do we tackle cases in which the guest OS might map more than a SysCall to the same ID? This is something the kernel can solve by simply looking at the context. In our case it might be a bit more complex though.

@Wenzel
Copy link
Member

Wenzel commented Sep 4, 2017

I fixed nitro_find_syscall by surrounding the search with mutexes.

Also, I will keep the hashtable implementation for now because it simplifies how we process an event.

With an 8K table, we would have to maintain a way to differentiate between Linux or Windows guest, and i don't see right now how this can be done easily.

@noxdafox
Supporting nested virtualization in Nitro is not on our scope for now. (but it could be, and if someone makes a PR, i will be happy to review it)

Regarding the Windows 10 Linux subsystem:
https://blogs.msdn.microsoft.com/wsl/2016/06/08/wsl-system-calls/

Briefly: the Linux binary makes a syscall getdents, and the NT kernel forwards it to the lxss.sys driver.

-> In Nitro, we would intercept it, but treat it as a Windows syscall unfortunately, since the logic that differentiate between the two is done after the syscall, in the kernel.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants