syscall filtering #14

shlomopongratz · 2017-08-28T07:28:00Z

Hi,
I have read the syscall filtering code and I have two comments.

One: in Windows we have up to two SSDTs (out of four possible, selected by bits 12-13), each of which may have up to 4K entries (bits 0-11), and in Linux we have far fewer. So we may have up to 8K possible syscalls in the worst case. You have declared the hash table with 4K entries (12 bits), so why not just use an 8K table and remove the hash table overhead?
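
To make that bit layout concrete, here is a minimal C sketch. The helper names (`ssdt_table`, `ssdt_entry`, `flat_index`) are hypothetical and do not come from the Nitro code; the only assumption is the layout stated above (bits 0-11 select the entry, bits 12-13 select the table, and at most two tables are in use).

```c
#include <stdint.h>

#define SSDT_ENTRY_BITS 12u
#define SSDT_ENTRY_MASK ((1u << SSDT_ENTRY_BITS) - 1u)  /* 0xfff: bits 0-11 */
#define SSDT_TABLE_MASK 0x3u                            /* bits 12-13 */
#define FLAT_SLOTS      (2u << SSDT_ENTRY_BITS)         /* 2 tables x 4K = 8192 */

/* Which service table the syscall number refers to (0..3). */
static unsigned ssdt_table(uint64_t rax)
{
    return (unsigned)(rax >> SSDT_ENTRY_BITS) & SSDT_TABLE_MASK;
}

/* Index of the entry inside that table (0..4095). */
static unsigned ssdt_entry(uint64_t rax)
{
    return (unsigned)(rax & SSDT_ENTRY_MASK);
}

/* Slot in a flat 8K table: keeps the low 13 bits, i.e. tables 0 and 1 only. */
static unsigned flat_index(uint64_t rax)
{
    return (unsigned)(rax & (FLAT_SLOTS - 1u));
}
```

With this layout, a syscall number such as 0x1007 would be entry 7 of table 1, and its flat slot is simply 0x1007.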

Two: I see that you protect insertion and deletion with a mutex, but you don't protect the "find". Shouldn't you use an RW lock, or better, RCU? I also suggest using a spinlock instead of a mutex, as a mutex can cause a context switch.
Best regards,

S.P.

Wenzel · 2017-08-31T08:39:05Z

Hi @shlomopongratz,

Thanks for your review, it's much appreciated!

first implementation using a table

> One: in Windows we have up to two SSDTs (out of four possible, selected by bits 12-13), each of which may have up to 4K entries (bits 0-11), and in Linux we have far fewer. So we may have up to 8K possible syscalls in the worst case. You have declared the hash table with 4K entries (12 bits), so why not just use an 8K table and remove the hash table overhead?

My first implementation was an array of integers: KVM-VMI/kvm@9c3e588#diff-e4915d9c60ed6f8fec205d6349ae7ea3R21

And at each syscall, I had to search this array for the syscall: KVM-VMI/kvm@9c3e588#diff-4ab90127bb787cca2b557234276fbe47R162

So the more syscalls you filtered, the slower it got to process each new syscall through the filter.

An 8K table

What you suggest is a big 8K table covering every possible syscall.
How do you index the syscall in the table to find out whether it is filtered or not?

I guess you would define an offset per OS? Windows has no offset and starts at 0, and Linux starts at 4K (right after the Windows table).

-> This forces us to add more Windows / Linux / OS-specific logic in the kernel, and I wanted the implementation to be as simple and OS-agnostic as possible.

Why did I choose a hashtable?

At the beginning, what I wanted was a simple set structure.
It's easy: you just have a set interface where you can test whether a syscall is already in the set, and add a new syscall.

But I couldn't find any. (Maybe you can, if you know the kernel better?)

So I fell back on the hashtable to get constant O(1) access time, even though there is hash function overhead.

Another reason is that I assume the user only wants to monitor a small subset of all the available syscalls (this is the main use case for everyone using the filter).
That's why I designed this 12-bit, 4K-entry hashtable.
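
For readers who want to see what such a set interface could look like, here is a minimal user-space sketch in C with 4K buckets (12 bits). It is not the Nitro implementation and does not use the kernel's hashtable API: the names (`syscall_set`, `set_add`, `set_contains`, `set_remove`) and the multiplicative hash are assumptions chosen for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define SET_BITS    12u
#define SET_BUCKETS (1u << SET_BITS)   /* 4096 buckets */

struct set_node {
    uint64_t nr;                       /* syscall number */
    struct set_node *next;             /* next node in the same bucket */
};

struct syscall_set {
    struct set_node *buckets[SET_BUCKETS];
};

/* Multiplicative hash (an arbitrary choice): keep the top 12 bits. */
static unsigned set_hash(uint64_t nr)
{
    return (unsigned)((nr * 0x9E3779B97F4A7C15ull) >> (64 - SET_BITS));
}

static bool set_contains(const struct syscall_set *s, uint64_t nr)
{
    const struct set_node *n;

    for (n = s->buckets[set_hash(nr)]; n != NULL; n = n->next)
        if (n->nr == nr)
            return true;
    return false;
}

/* Returns false only if memory allocation fails. */
static bool set_add(struct syscall_set *s, uint64_t nr)
{
    unsigned h = set_hash(nr);
    struct set_node *n;

    if (set_contains(s, nr))
        return true;                   /* already present, nothing to do */
    n = malloc(sizeof(*n));
    if (n == NULL)
        return false;
    n->nr = nr;
    n->next = s->buckets[h];           /* push at the head of the bucket */
    s->buckets[h] = n;
    return true;
}

static void set_remove(struct syscall_set *s, uint64_t nr)
{
    struct set_node **pp = &s->buckets[set_hash(nr)];

    while (*pp != NULL) {
        if ((*pp)->nr == nr) {
            struct set_node *dead = *pp;
            *pp = dead->next;          /* unlink, then free */
            free(dead);
            return;
        }
        pp = &(*pp)->next;
    }
}
```

With only a handful of monitored syscalls, most buckets stay empty and a lookup touches at most a short chain, which is where the expected O(1) behaviour comes from.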

-> The best solution here would be a set that automatically resizes itself.
If you come up with this, I would be very happy to review it! ;)

mutexes

> Two: I see that you protect insertion and deletion with a mutex, but you don't protect the "find". Shouldn't you use an RW lock, or better, RCU? I also suggest using a spinlock instead of a mutex, as a mutex can cause a context switch.

That's correct, I should protect the find as well!
If the filters are modified while a find is ongoing, the find will be broken.

-> I will look into the differences between spinlocks and mutexes.

Can you explain a bit about RCU (as opposed to an RW lock)?

Thanks.

shlomopongratz · 2017-08-31T10:25:32Z

Hi,

I don't think you need different code for Linux and Windows guests, as both pass the system call number in the RAX register; if you use an array of booleans, you are done. If you want, you can pass a mask from user space, which keeps the kernel code generic: the kernel code will always mask the RAX value with that mask to obtain the syscall number.
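
A minimal sketch of this idea in plain C, assuming an 8K-slot table and a mask handed over by user space. The struct and function names (`syscall_filter`, `filter_add`, `filter_test`, ...) are hypothetical, and the table is stored as a bitmap rather than an array of booleans to keep it at 1 KB; nothing here is taken from the Nitro code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FILTER_SLOTS 8192u   /* assumption: 2 tables x 4K entries */

struct syscall_filter {
    uint64_t mask;                      /* supplied by user space */
    uint8_t  bits[FILTER_SLOTS / 8];    /* one bit per syscall number */
};

static void filter_init(struct syscall_filter *f, uint64_t mask)
{
    memset(f->bits, 0, sizeof(f->bits));
    /* Clamp the mask so that an index can never leave the table. */
    f->mask = mask & (FILTER_SLOTS - 1u);
}

/* The only "OS logic": mask RAX to obtain the table index. */
static size_t filter_index(const struct syscall_filter *f, uint64_t rax)
{
    return (size_t)(rax & f->mask);
}

static void filter_add(struct syscall_filter *f, uint64_t rax)
{
    size_t idx = filter_index(f, rax);
    f->bits[idx >> 3] |= (uint8_t)(1u << (idx & 7u));
}

static void filter_remove(struct syscall_filter *f, uint64_t rax)
{
    size_t idx = filter_index(f, rax);
    f->bits[idx >> 3] &= (uint8_t)~(1u << (idx & 7u));
}

static bool filter_test(const struct syscall_filter *f, uint64_t rax)
{
    size_t idx = filter_index(f, rax);
    return (f->bits[idx >> 3] >> (idx & 7u)) & 1u;
}
```

The lookup is a mask, a shift and a bit test, with no hashing and no per-OS branch; the trade-off is that two RAX values that agree under the mask share one slot.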

RCU is a bit of overkill, as the updates are not done in the data path, so RW locks are just fine.
A good explanation of RCU can be found in the Documentation/RCU/ folder, but again, on second thought it is overkill.

Best regards,

S.P.

noxdafox · 2017-09-04T10:32:11Z

Regarding the use of a simple array vs a hash table, you have quite a valid point. Yet I realized there is a corner case to consider, which affects both implementations.

What would happen if the guest were itself running a HW-assisted virtualized guest?
What would happen if the guest (Windows 10 Anniversary Update) were running some software on the Linux subsystem?

To pose the question more generically: how do we tackle cases in which the guest OS might map more than one syscall to the same ID? This is something the kernel can solve by simply looking at the context. In our case it might be a bit more complex, though.

Wenzel · 2017-09-04T14:29:35Z

I fixed nitro_find_syscall by taking the mutex around the search.

Also, I will keep the hashtable implementation for now because it simplifies how we process an event.

With an 8K table, we would have to maintain a way to differentiate between Linux and Windows guests, and I don't see right now how this can be done easily.

@noxdafox
Supporting nested virtualization in Nitro is not in our scope for now (but it could be, and if someone makes a PR, I will be happy to review it).

Regarding the Windows 10 Linux subsystem:
https://blogs.msdn.microsoft.com/wsl/2016/06/08/wsl-system-calls/

Briefly: the Linux binary makes a getdents syscall, and the NT kernel forwards it to the lxss.sys driver.

-> In Nitro, we would intercept it but unfortunately treat it as a Windows syscall, since the logic that differentiates between the two runs after the syscall, in the kernel.
