rkt/rkt

Investigating user namespaces

Open

#986 opened on Jun 3, 2015

Labels: depends-on/external, help wanted, kind/question, technology/userns

Description

In this issue we try to investigate user namespaces support in rkt.

For the moment unprivileged user namespaces are out of scope.

Summary of tasks:

  • kernel uidshifts at mount time #1057
    • proof of concept for tmpfs
    • the rest of the implementation
  • uid shift with chown without kernel support #1027 #1250
    • functional test: files belong to the correct user/group inside and outside the container
    • functional test: root directory belongs to the correct user/group (with RW access) #1531
  • dynamic uid-locking scheme #1090

Problems

1) CAP_SYS_ADMIN and cgroup filesystems:

rkt pods have CAP_SYS_ADMIN, which may reduce rkt's security: containers could remount cgroups in read-write mode and, since cgroups are not namespaced, change cgroup settings for services on the host. Getting rid of CAP_SYS_ADMIN is difficult with the current architecture. Files in the cgroup filesystems are writable only by root (in the system slice) or by a specific user (in user slices). If the root user and the other users from the host are not mapped in the user namespace of the container, this becomes a non-issue. Please see the issue "stage1: rkt pods should not be given CAP_SYS_ADMIN": https://github.com/coreos/rkt/issues/576

TODO: check that the cgroup filesystem access rights behave as described above when running in a user namespace.

2) Separation and isolation

2.1) Capabilities:

User-namespace-aware kernel interfaces are guarded by an ns_capable() check against the current userns. We may take advantage of this isolation to give some capabilities to container X and other capabilities to container Y, while these capabilities are not effective on the host, in other words in the init_userns. This makes it possible to allow some file system capabilities in a specific container for its internal operations without affecting other containers or the host.

Please note that some kernel interfaces still use the plain old capable() check; if that check succeeds, the capabilities are effective globally.

2.2) Per-user limits:

Each user on the system has its own “struct user_struct” to count resources (processes, signals, etc.). When several pods run as the same user, they share these counters, so an operation on one pod may affect the other pods. To improve separation and add an extra layer of resource isolation, we may use user namespaces and assign one range of global kuid_t to pod X and another range to pod Y. This may improve the situation and prevent some pods from DoS'ing each other.

Please note that kuid_t is not uid_t. kuid_t is the global kernel UID used to identify a process's credentials.
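The distinction can be illustrated with a small user-space model of the translation from a uid seen inside a namespace to the global uid. This is only a sketch: the `extent` type and `toHostUID` helper are hypothetical names, the extents mirror the "inside outside count" lines of /proc/<pid>/uid_map, and the ranges are illustrative values.

```go
package main

import "fmt"

// extent models one line of /proc/<pid>/uid_map: "inside outside count".
// The type and field names are hypothetical, chosen for this sketch.
type extent struct {
	inside, outside, count uint32
}

// toHostUID translates a uid as seen inside the namespace into the
// global (host-side) uid. The boolean is false when the uid is unmapped.
func toHostUID(m []extent, uid uint32) (uint32, bool) {
	for _, e := range m {
		if uid >= e.inside && uid-e.inside < e.count {
			return e.outside + (uid - e.inside), true
		}
	}
	return 0, false
}

func main() {
	// Illustrative ranges: two pods with disjoint global ranges.
	podX := []extent{{inside: 0, outside: 200000, count: 1000}}
	podY := []extent{{inside: 0, outside: 201000, count: 1000}}

	// uid 0 inside each pod corresponds to a different global uid,
	// so the two pods do not share per-user accounting.
	x, _ := toHostUID(podX, 0)
	y, _ := toHostUID(podY, 0)
	fmt.Println("pod X root:", x)
	fmt.Println("pod Y root:", y)

	// A uid outside the mapped range has no global counterpart.
	_, ok := toHostUID(podX, 1000)
	fmt.Println("uid 1000 mapped in pod X:", ok)
}
```

With disjoint ranges, "root" in pod X and "root" in pod Y are different global users, which is what gives each pod its own per-user resource counters.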

[Figure: userns_hierarchy]

What we are trying to do:

1) User namespace mapping:

rkt will use only 1 level of user namespace. The schema above is just to illustrate the general concept of user namespaces.

Each running pod will be assigned a range of uids. For example, pod1 will have uids 200000 to 200999 (mapped to 0-999) and pod2 will have 201000 to 201999 (mapped to 0-999). rkt should not reuse uids assigned to other system users (e.g. don’t reuse the www-data, geoclue or sshd users!). rkt will be assigned a big range of uids that it is allowed to use in /etc/subuid (see the manpages subuid(5) and useradd(8)). Example: rkt:200000:65536.
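As a rough sketch of the arithmetic involved, the following Go program parses the example /etc/subuid entry and carves per-pod ranges out of it. The helper names (`parseSubuidLine`, `podRange`, `uidMapLine`) and the fixed size of 1000 uids per pod are assumptions made for illustration, not rkt code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// IDRange is a contiguous block of host uids.
type IDRange struct {
	Start uint32
	Count uint32
}

// parseSubuidLine parses one "name:start:count" entry, the format
// described in subuid(5).
func parseSubuidLine(line string) (string, IDRange, error) {
	f := strings.Split(strings.TrimSpace(line), ":")
	if len(f) != 3 {
		return "", IDRange{}, fmt.Errorf("malformed subuid entry %q", line)
	}
	start, err := strconv.ParseUint(f[1], 10, 32)
	if err != nil {
		return "", IDRange{}, err
	}
	count, err := strconv.ParseUint(f[2], 10, 32)
	if err != nil {
		return "", IDRange{}, err
	}
	return f[0], IDRange{Start: uint32(start), Count: uint32(count)}, nil
}

// podRange returns the index-th slice of perPod uids inside the global
// range. Hypothetical helper: it assumes every pod gets the same size.
func podRange(global IDRange, index, perPod uint32) (IDRange, error) {
	if perPod == 0 || index >= global.Count/perPod {
		return IDRange{}, fmt.Errorf("slot %d does not fit in %d uids", index, global.Count)
	}
	return IDRange{Start: global.Start + index*perPod, Count: perPod}, nil
}

// uidMapLine formats the range in /proc/<pid>/uid_map syntax
// ("inside outside count"), mapping the range onto uids 0..Count-1.
func uidMapLine(r IDRange) string {
	return fmt.Sprintf("0 %d %d", r.Start, r.Count)
}

func main() {
	_, global, err := parseSubuidLine("rkt:200000:65536")
	if err != nil {
		panic(err)
	}
	for i := uint32(0); i < 2; i++ {
		r, err := podRange(global, i, 1000)
		if err != nil {
			panic(err)
		}
		fmt.Printf("pod%d uid_map: %s\n", i+1, uidMapLine(r))
	}
}
```

With the example entry, a per-pod size of 1000 leaves room for 65 pods; a different per-pod size would change that count accordingly.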

2) User namespace locking:

Since rkt does not have a central daemon to assign uids within the global range allowed by /etc/subuid, we will need some rkt-specific locking to prevent multiple running pods from reusing the same uids. Example: pod1 will lock on /run/rkt/uid-locks/uid-200000; pod2 will lock on /run/rkt/uid-locks/uid-201000. Or maybe something smarter to express the range.

Challenges / things to check:

1) uid-shifts for rootfs:

1.1) At extract time:

The same ACI could be used in several running pods. The rootfs trees are cached in the CAS (Content Addressable Storage) in /var/lib/rkt/cas/ with specific uid/gid owners, and changing them all afterwards (recursive chown) is too costly. We could instead shift uids on the fly at extract time, which would not be so costly.

1.2) At mount time:

When a rootfs tree is used by overlayfs, we would need some vfs_uid= shifting option. This was already mentioned in the LWN article "UID/GID identity and filesystems" http://lwn.net/Articles/637431/ (see also https://lists.linux-foundation.org/pipermail/containers/2014-June/034630.html).

Volume bind-mounting: similar issue.

2) Network namespace:

On Linux, a network namespace belongs to the user namespace under which it was created, see field user_ns in struct net: http://lxr.free-electrons.com/source/include/net/net_namespace.h#L44.

rkt currently creates the network namespace in network plugins before systemd-nspawn is launched. If the user namespace is to be created by systemd-nspawn, the network namespace will belong to the host user namespace rather than to the pod user namespace. AFAIU, this has unwanted consequences for access to /sys/class/net/. See also https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=87a8ebd637dafc255070f503909a053cf0d98d3f

3) Bind mounting:

Bind mounting a socket file from the host to the container? E.g. sdnotify socket (https://bugs.freedesktop.org/show_bug.cgi?id=89844).

4) Passing file descriptors:

Passing fds in the other direction from the container to the host? E.g. the journal dirfd that @krnowak is working on https://github.com/coreos/rkt/issues/947.

Model

Our model may touch several layers, from rkt and nspawn down to the kernel.

1) kernel: uidshift on the rootfs

As noted in the Challenges section, we may add a uidshift mount option, not only for containers but also to implement dynamic uids for services. This may allow running unprivileged daemons with dynamically assigned uids without those uids leaking into the persistent file system.

Implementation:

1.1) A generic vfs mount option?

mount(source, target, "bind", MS_BIND|MS_REMOUNT, vfs_uidshifts)

1.2) An overlayfs mount option or a completely new overlayfs-like fs

Some file systems like NFS are already doing some mapping.

TODO: continue the investigation.

Thanks to @alban for his help on the subject.
