Cannot be done properly on a pure syscall basis at this point.
A default whitelist is almost certainly either too restrictive, in
which case the user has to adjust the policy manually anyway and the
default is of little use, or too permissive.
A blacklist has to play catch-up with new kernel versions. This may
be improved upon by blocking all unknown (too new) syscall
numbers. However, given that we drop caps and set no_new_privs,
it's debatable how much we gain from a blacklist anyway.
So it is best to leave it to the user. We also need to allow checking
syscall args in order to make it easier to build policies. Perhaps get
inspiration from pledge() in OpenBSD.
Classify syscalls into groups. The list is up to date for Linux 5.15,
with #ifndef fallbacks generated for syscalls introduced since 5.10.
Because of this, only x86_64 is supported at this point.
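
A minimal sketch of what such fallbacks and a group table could look
like. The group flags, struct and function names are hypothetical and
only illustrate the idea; the three syscall numbers are the x86_64
values of syscalls added between 5.11 and 5.15.

```c
#include <stddef.h>
#include <sys/syscall.h>

/* Fallbacks for building against older kernel headers (x86_64 numbers). */
#ifndef __NR_epoll_pwait2
#define __NR_epoll_pwait2 441              /* Linux 5.11 */
#endif
#ifndef __NR_landlock_create_ruleset
#define __NR_landlock_create_ruleset 444   /* Linux 5.13 */
#endif
#ifndef __NR_process_mrelease
#define __NR_process_mrelease 448          /* Linux 5.15 */
#endif

/* Hypothetical group flags, not the library's actual names. */
enum {
    GROUP_STDIO   = 1u << 0,
    GROUP_FS      = 1u << 1,
    GROUP_PROCESS = 1u << 2,
    GROUP_SANDBOX = 1u << 3,
};

struct syscall_group_entry {
    long nr;
    unsigned int groups;
};

static const struct syscall_group_entry group_table[] = {
    { __NR_read,                    GROUP_STDIO },
    { __NR_write,                   GROUP_STDIO },
    { __NR_epoll_pwait2,            GROUP_STDIO },
    { __NR_openat,                  GROUP_FS },
    { __NR_process_mrelease,        GROUP_PROCESS },
    { __NR_landlock_create_ruleset, GROUP_SANDBOX },
};

/* Returns the group mask for a syscall number, or 0 if unclassified. */
static unsigned int groups_of(long nr)
{
    for (size_t i = 0; i < sizeof(group_table) / sizeof(group_table[0]); i++) {
        if (group_table[i].nr == nr)
            return group_table[i].groups;
    }
    return 0;
}
```

A policy could then be assembled from group masks instead of individual
syscall numbers.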
Switch from blacklisting to a default whitelist.
Refactor the test logic. Seccomp tests whose process may get
killed now run in their own subprocess.
All test functions now return 0 on success. Therefore,
the shell script can be simplified.
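
The subprocess approach can be sketched roughly as follows. The helper
and stub names are hypothetical; the stub raises SIGSYS itself to stand
in for a real seccomp violation, which terminates the process with that
signal.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Runs fn() in a forked child.
 * Returns 0 if the child was terminated by expected_signal,
 * 1 if it ended any other way, and -1 if fork/waitpid failed.
 */
static int run_expect_signal(int (*fn)(void), int expected_signal)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: we only get to exit normally if nothing killed us. */
        _exit(fn() == 0 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == expected_signal)
        return 0;
    return 1;
}

/* Stand-in for a test body that triggers a seccomp kill. */
static int stub_violation(void)
{
    signal(SIGSYS, SIG_DFL);
    raise(SIGSYS);
    return 0;
}

/* Stand-in for a test body that does nothing forbidden. */
static int stub_harmless(void)
{
    return 0;
}
```

Since the helper follows the same "0 on success" convention, a test
expecting a kill can return its result directly.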
Instead of having a blacklist and a whitelist, we now allow
setting a policy that is evaluated as a chain.
This adds qssb_append_syscalls_policy().
Furthermore, add a feature to decide per syscall which action to take.
This now allows returning an error instead of just killing the process.
In the future, it may also allow us to optimize/shrink the BPF filter.
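
To illustrate the chain idea, here is a small userspace model. It is
not the signature or implementation of qssb_append_syscalls_policy();
all names below are hypothetical, and it assumes that the first
matching entry in the chain decides the action.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum policy_action { ACTION_ALLOW, ACTION_ERRNO, ACTION_KILL };

struct syscall_policy {
    long *syscalls;
    size_t count;
    enum policy_action action;
    struct syscall_policy *next;
};

struct policy_chain {
    struct syscall_policy *head;
    struct syscall_policy *tail;
    enum policy_action default_action;
};

/* Appends a copy of the given syscalls with one action. Returns 0 on success. */
static int chain_append(struct policy_chain *chain, enum policy_action action,
                        const long *syscalls, size_t count)
{
    assert(chain != NULL && syscalls != NULL && count > 0);
    struct syscall_policy *p = calloc(1, sizeof(*p));
    if (p == NULL)
        return -1;
    p->syscalls = malloc(count * sizeof(*p->syscalls));
    if (p->syscalls == NULL) {
        free(p);
        return -1;
    }
    memcpy(p->syscalls, syscalls, count * sizeof(*p->syscalls));
    p->count = count;
    p->action = action;
    if (chain->tail != NULL)
        chain->tail->next = p;
    else
        chain->head = p;
    chain->tail = p;
    return 0;
}

/* First matching entry wins; unmatched syscalls get the default action. */
static enum policy_action chain_evaluate(const struct policy_chain *chain, long nr)
{
    for (const struct syscall_policy *p = chain->head; p != NULL; p = p->next) {
        for (size_t i = 0; i < p->count; i++) {
            if (p->syscalls[i] == nr)
                return p->action;
        }
    }
    return chain->default_action;
}
```

With a default action of "kill", appending an "allow" entry followed by
an "errno" entry gives whitelist behaviour plus a soft failure for
selected syscalls.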
The arch field is the same for x86_64 and x32, thus checking it
is not enough.
Simply using x32 system calls would allow a bypass. Thus,
we must check whether the system call number has __X32_SYSCALL_BIT set.
This is of course a lazy solution; we could also add the
same system call number + __X32_SYSCALL_BIT to our black/whitelists.
For now, however, this will do.
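
A sketch of how such a check might look as a classic BPF fragment. This
is an illustration under assumptions, not the filter from this commit:
the array and helper names are hypothetical, and the final "allow"
instruction is only a placeholder for the actual policy.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Defined by the x86 kernel headers; fallback for other build hosts. */
#ifndef __X32_SYSCALL_BIT
#define __X32_SYSCALL_BIT 0x40000000
#endif

static const struct sock_filter x32_check[] = {
    /* Load the syscall number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* nr >= __X32_SYSCALL_BIT: fall through to the kill, else skip it. */
    BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, __X32_SYSCALL_BIT, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* Placeholder: the real policy instructions would follow here. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* The same test in plain C, using the unsigned comparison BPF performs. */
static int is_x32_syscall(uint32_t nr)
{
    return nr >= (uint32_t)__X32_SYSCALL_BIT;
}
```

This rejects every x32 call outright, which matches the "lazy" approach
described above.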
The filter was missing this check for arch, allowing bypasses
by using the calling conventions of other architectures.
A trivial example is calling the x86 execve() from an x86_64 process.
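
The arch check could look roughly like this at the start of the filter.
Again a sketch with a hypothetical array name and a placeholder in
place of the real policy instructions.

```c
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static const struct sock_filter arch_check[] = {
    /* Load the architecture the syscall was made with. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* Native x86_64: skip the kill. Anything else: fall through to it. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* Placeholder: the syscall number checks would follow here. */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```

Without this, syscall numbers are compared against the x86_64 table
even when the call was made via the 32-bit entry point, where the same
number means a different syscall.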
The purpose of these new functions is to make it simpler for users
to add new syscalls to the whitelist and blacklist.
The current approach uses a user-supplied pointer, which however
was difficult to manage with "no_fs", which may add system calls
to the blacklist. Then we must resize arrays, and suddenly
it's our job to free them.
As a bonus, implementing them here allows easier data structure
changes and decreases the chances that users of this API
do something wrong, like forgetting the -1 at the end, etc.
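
One way to picture this is a library-owned growable list, sketched
below. The struct and function names are hypothetical and not the
functions added by this commit. Because the length is tracked
explicitly, no -1 terminator is needed, and allocation and freeing stay
on the library side.

```c
#include <assert.h>
#include <stdlib.h>

struct syscall_list {
    long *nrs;
    size_t len;
    size_t cap;
};

/* Appends one syscall number, growing the array as needed. 0 on success. */
static int syscall_list_append(struct syscall_list *list, long nr)
{
    assert(list != NULL);
    if (list->len == list->cap) {
        size_t new_cap = list->cap ? list->cap * 2 : 8;
        long *grown = realloc(list->nrs, new_cap * sizeof(*grown));
        if (grown == NULL)
            return -1;
        list->nrs = grown;
        list->cap = new_cap;
    }
    list->nrs[list->len++] = nr;
    return 0;
}

/* Releases the storage and resets the list to its empty state. */
static void syscall_list_free(struct syscall_list *list)
{
    free(list->nrs);
    list->nrs = NULL;
    list->len = 0;
    list->cap = 0;
}
```

Internal features such as "no_fs" can then append to the same list
without touching memory owned by the caller.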
Landlock can handle write access without it implying read access,
in contrast to the existing bind mounts solution. Hence, remove
ALLOW_READ from ALLOW_WRITE bitmask.