Cgroups limit how much you can do. Namespaces limit how much you can see.
Linux containers are based on cgroups and namespaces and can be privileged or unprivileged. A privileged container is one without the “User Namespace” implying it has direct visibility into all users of the underlying host. The User Namespace allows remapping the user identities in the container so even if the process thinks it is running as root, it is not.
$ sudo su # echo 2 > /sys/fs/cgroup/pids/parent/pids.max # echo $$ > /sys/fs/cgroup/pids/parent/cgroups.procs // put current pid # cat /sys/fs/cgroup/pids/parent/pids.current 2 # (echo "foobar" | cat ) bash: for retry: No child processes
Link to the 122 page paper on linux containers security here. Includes this quote on linux kernel attacks.
“Kernel vulnerabilities can take various forms, from information leaks and Denial of Service (DoS) risks to privilege escalation and arbitrary code execution. Of the roughly 400 Linux system calls, a number have contained privilege escalation vulnerabilities, as recently as of 2016 with keyctl(2). Over the years this included, but is not limited to: futex(2), vmsplice(2), mremap(2), unmap(2), do_brk(2), splice(2), and modify_ldt(2). In addition to system calls, old or obscure networking code including but not limited
to SCTP, IPX, ATM, AppleTalk, X.25, DECNet, CANBUS, Econet and NETLINK has contributed to a great number of privilege escalation vulnerabilities through various use cases or socket options. Finally, the “perf” subsystem, used for performance monitoring, has historically contained a number of issues, such as perf_swevent_init (CVE-2013-2094).”
Which makes the case for seccomp, as containers both privileged and unprivileged can lead to bad things –
““Containers will always (by design) share the same kernel as the host. Therefore, any vulnerabilities in the kernel interface, unless the container is forbidden the use of that interface (i.e. using seccomp)”- LXC Security Documentation by Serge Hallyn, Canonical”
The paper has several links on restricting access, including grsecurity, SELinux, App Armor and firejail. A brief comparison of the first three is here. SELinux has a powerful access control mechanism – it attaches labels to all files, processes and objects; however it is complex and often people end up making things too permissive, instead of taking advantage of available controls. AppArmor works by labeling files by pathname and applying policies to the pathname – it is recommended with SUSE/OpenSUSE, not CentOS. Grsecurity policies are described here, its ACLs support process–based resource restrictions, including memory/cpu/files open, etc.