Easy mitigation for container escape / CVE-2022-0185 Linux kernel vulnerability

I recently wrote an article on using Checkov to secure the Kubernetes YAML for deploying a single NGINX web application. Out of Checkov’s 1,000+ infrastructure as code (IaC) policies, I only needed to address a handful to secure my deployment. Even then, it can be a challenge to explain accurately why each one matters when it comes to applying infrastructure best practices and mitigating security risks.

To indulgently quote my own blog on CKV_K8S_31, which ensures a seccomp profile is applied to the deployment: “Seccomp is a Linux security profile that prohibits the use of certain system calls and should be associated with your deployment but is often left undefined. The byproduct of leaving it is that it will run containers with seccomp set to ‘unconfined,’ which means the container has the capability to run a rather dangerous breadth of system calls.”

One of those system calls is:

unshare

Or more specifically, in this use case, the unshare command that wraps it:

unshare [--map-root-user | -r]

To see why this matters, consider CVE-2022-0185, a recent vulnerability in the Linux kernel. The disclosure here is well worth reading if you want to really geek out on it. The tl;dr for Kubernetes users is that a buffer overflow (yes, that old gem) was introduced in a kernel commit around February 2019 and has therefore shipped in Ubuntu releases built on kernels since then. It means an “attacker may freely write data out-of-bounds.” The disclosing parties went on to show that “functional LPE exploits against Ubuntu 20.04 and container escape exploits against Google’s hardened COS” were possible.

LPE = Local Privilege Escalation
COS = Container-Optimized OS

CAP_SYS_ADMIN

Here’s the caveat, and it’s a big one. You need to have the CAP_SYS_ADMIN capability enabled for this to work. So we’re all good, right? This would only be present by default in a privileged pod or if explicitly added to a pod. In this modern age of security, most security professionals would not allow that and certainly would explicitly drop that capability. Or perhaps not if we consider the findings from our Helm security research.

NOTE: CAP_SYS_ADMIN flies in the face of least privilege as it represents all privileges. “CAP_SYS_ADMIN has become the new root. If the goal of capabilities is to limit the power of privileged programs to be less than root, then once we give a program CAP_SYS_ADMIN the game is more or less over.” – From LWN.net

There is a catch that extends the danger of this exploit to pods that are neither privileged nor granted the all-powerful CAP_SYS_ADMIN. To quote again from the disclosure: “the permission only needs to be granted in the current namespace. An unprivileged user can use unshare(CLONE_NEWNS|CLONE_NEWUSER) to enter a namespace with the CAP_SYS_ADMIN permission, and then proceed with exploitation to root the system.”

The unshare shortcut shown above, --map-root-user, creates a new user namespace in which the current effective user and group IDs are mapped to the superuser UID and GID (the equivalent of --map-user=0 --map-group=0) and then starts a shell inside it. Presto, we have CAP_SYS_ADMIN in our namespace, opening the door to execute our container breakout.

You can test it yourself in a KinD cluster. Simply execute a kubectl run using the default ubuntu:20.04 container. Then install libcap-ng-utils so you can check your capabilities with pscap (another nice hacker trick):

root@ubutest2:/# apt-get update
root@ubutest2:/# apt-get install libcap-ng-utils
…
Setting up libcap-ng-utils (0.7.9-2.1build1) ...
root@ubutest2:/# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap

Now run the unshare command from the disclosure:

root@ubutest2:/# unshare -r
# pscap -a
ppid  pid   name        command           capabilities
0     1     root        bash              chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap
1     265   root        sh                full

If root with full capabilities sounds bad, that’s because it is.
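
If installing libcap-ng-utils is not an option, the same check can be approximated by reading the effective capability mask straight from /proc. The following is a minimal sketch and not part of the original disclosure: has_cap is a hypothetical helper name, and it assumes CAP_SYS_ADMIN is bit 21, as defined in linux/capability.h.

```shell
# Sketch: test a single capability bit in a hex mask such as the CapEff
# value found in /proc/<pid>/status.

CAP_SYS_ADMIN=21  # bit index of CAP_SYS_ADMIN in linux/capability.h

# has_cap HEXMASK BIT -> exit status 0 if the bit is set, 1 otherwise
has_cap() {
  [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Effective capabilities of the current shell; empty if /proc is unavailable.
cap_eff=$(awk '/^CapEff:/ {print $2}' /proc/self/status 2>/dev/null || true)

if [ -n "$cap_eff" ] && has_cap "$cap_eff" "$CAP_SYS_ADMIN"; then
  echo "CAP_SYS_ADMIN is present in this namespace"
else
  echo "CAP_SYS_ADMIN is not present (or /proc could not be read)"
fi
```

Run in the test pod before and after unshare -r, this should report the capability as absent and then present, in line with the pscap output above.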

Seccomp is important!

In addition to unshare, the default seccomp profile blocks over 40 critical system calls that attackers absolutely love, like:

  • mount (host filesystems),
  • ptrace (watch everything),
  • reboot (the host!),
  • setns (change Linux namespace), and
  • quotactl (mess with disk quotas).
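
Whether a filter is actually in force for a given container can be read from the Seccomp field in /proc/self/status, where 0 means seccomp is disabled, 1 means strict mode, and 2 means a filter such as the runtime's default profile is applied. A small sketch, with seccomp_mode_name as a hypothetical helper:

```shell
# Sketch: report the seccomp mode of the current process.

# seccomp_mode_name MODE -> readable name for the value of the Seccomp: field
seccomp_mode_name() {
  case "$1" in
    0) echo "disabled (unconfined)" ;;
    1) echo "strict" ;;
    2) echo "filter" ;;
    *) echo "unknown" ;;
  esac
}

# The trailing colon in the pattern skips the separate Seccomp_filters: line.
mode=$(awk '/^Seccomp:/ {print $2}' /proc/self/status 2>/dev/null || true)
echo "seccomp mode: $(seccomp_mode_name "$mode")"
```

In a pod with no profile defined, this would be expected to report the disabled mode, which is the “unconfined” state described earlier.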

In the past, an excuse for not having this defence in depth was that deployments would add a security context like:

        securityContext:
          readOnlyRootFilesystem: true
          allowPrivilegeEscalation: false
          capabilities:
            drop:
              - ALL

No CAP_SYS_ADMIN means none of those system calls would be allowed anyway, until somebody finds a way to get it back inside a namespace after deployment, which is exactly what unshare does here.

A seccomp profile can be added to your Kubernetes manifest as easily as:

  template:
    metadata:
      creationTimestamp: null
      labels:
        app: myapplication
      annotations:
        seccomp.security.alpha.kubernetes.io/pod: "docker/default"
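
The annotation above is the older, alpha-era way of requesting a profile. Since Kubernetes 1.19 the same intent can be expressed with the seccompProfile field of securityContext, and newer clusters may no longer honor the annotation. The sketch below writes such a manifest to a file. The file name, container name, and image are placeholders and are not taken from the original deployment.

```shell
# Sketch: a Deployment that requests the container runtime's default
# seccomp profile through securityContext instead of the alpha annotation.
cat > seccomp-demo.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapplication
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapplication
  template:
    metadata:
      labels:
        app: myapplication
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault   # use the runtime's default profile
      containers:
        - name: web              # placeholder container name
          image: nginx           # placeholder image
EOF

# Apply it to a cluster of your choice, for example:
#   kubectl apply -f seccomp-demo.yaml
```

Setting the field at the pod level applies it to every container in the pod. It can also be set per container if only some workloads need it.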

How to mitigate this vulnerability

First, this is a great lesson in choosing base operating systems for our images. Ubuntu is both familiar and capable. The security flipside is that it is packed with many Linux commands that are surplus to requirements for most applications running in containers and that amount to a built-in toolkit for attackers. Seccomp limits the damage by blocking the system calls such tools rely on, but it is better not to ship the tools in the first place. Using a minimal container OS like Alpine instead of Ubuntu whenever possible is a best practice that leaves an attacker with fewer tools at their disposal.

Second, Checkov to the rescue!

With these three built-in Checkov policies, you’ll be covered:

  • CKV_K8S_37 is a great start as that will check if all capabilities are dropped.
  • CKV_K8S_39 will ensure that CAP_SYS_ADMIN is not added again. But, as we saw above, this is not enough.
  • CKV_K8S_31 will ensure that you have a seccomp profile installed by default in your deployment manifest.

Thankfully, running Checkov with its default list of policies against any Kubernetes manifests already includes all of these checks, so if you’re already using Checkov and acting on its findings, you’ve already mitigated this CVE. If you’re not already using it, it’s free, so you know what to do.
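
As a rough illustration, a scan restricted to the three policies discussed here might look like the following. It assumes Checkov is installed (for example with pip) and that the manifest lives in a hypothetical deployment.yaml. Dropping --check runs the full default policy set instead.

```shell
# Sketch: scan one Kubernetes manifest for the three checks covered above.
MANIFEST="deployment.yaml"   # hypothetical path to your manifest
CHECKS="CKV_K8S_31,CKV_K8S_37,CKV_K8S_39"

if ! command -v checkov >/dev/null 2>&1; then
  echo "checkov not found; try: pip install checkov" >&2
elif [ ! -f "$MANIFEST" ]; then
  echo "manifest $MANIFEST not found" >&2
else
  # Limit the run to the Kubernetes framework and the selected check IDs.
  checkov -f "$MANIFEST" --framework kubernetes --check "$CHECKS"
fi
```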