Overview
The term “rootless containers” refers to the ability of an unprivileged user to create, run, and otherwise manage containers. It also covers the variety of tooling around containers that can be run as an unprivileged user. The goal of this website is to catalogue all of the open questions and unresolved issues that need to be solved to bring us closer to rootless containers “for all”.
“Unprivileged user” in this context refers to a user who does not have any administrative rights, and is “not in the good graces of the administrator” (in other words, they do not have the ability to ask for more privileges to be granted to them, or for software packages to be installed).
Objectives
The objectives of this project are to allow an unprivileged user to:
- (O0) Create, run, and manage containers on their local machine.
- (O1) Create, modify, distribute, and extract container images on their local machine so they may run said container images through (O0).
- (O2) Create, manage, and use a container orchestrator on either their local machine (effectively a single-node cluster) or on a set of machines they would ordinarily be able to communicate with and run unprivileged programs on.
While security is not a primary objective of this project (the goal is user enablement), the intention is that any kernel-level improvements required to facilitate this project will involve collaboration with the Linux Container Hardening Project. We hope that the end result is that rootless containers will be secure as well as useful, thus allowing enterprises to support this usage of containers.
Status
This is a living document, and is maintained by humans. If you wish to add more entries to this list, or have an update on a particular section, please open a pull request.
(O0) Runtime
- Support rootless containers in a standardised container runtime.
- Support emulating system calls that are not allowed in regular rootless containers (e.g. setgroups(2), seteuid(2), chown(2)…), i.e. get apt and yum to work.
- Support checkpointing and restoring containers in a standard container runtime. This is separate from the “standardised container runtime” requirement because it is an auxiliary feature, and is a more generic problem.
- Support rootless containers in an orchestration engine’s container runtime. “Container runtime” in this context contains more features and tasks than the technically precise definition would include.
- Write some documentation for users about how they can use rootless containers. It would be really nice if we had some analogue of Jessie’s awesome dockerfiles repo but using rootless containers (she actually uses them as well but some more descriptive documentation would really help).
The de-facto implementation of the Open Container Initiative’s runtime
specification, runc, supports rootless containers out of the
box. The main restrictions are the following:
- cgroups are not supported in the general case, because on the kernel side, unprivileged subtree management has not been implemented. There is a new nsdelegate mount option for the host cgroup mount (which allows for sub-tree delegation of cgroupv2 when you create a cgroup namespace), but since runc supports neither cgroupv2 nor cgroup namespaces this feature is not useful to us at the moment. However, runc has recently gained the ability to opportunistically use cgroups if the system was configured to allow users to use them (see lxcfs’s PAM module for an example of how such a system can be configured).
- criu supports unprivileged dumping (checkpoint) but restore is not supported and the checkpointing aspects have not been well tested in the context of rootless containers.
- Network namespaces aren’t used in most cases because currently unprivileged network namespaces are not very useful (in the standard usage of containers): a new network namespace doesn’t have any network interfaces (other than lo). The standard way of linking network namespaces (veth bridges) requires privileges in the host, which are not available in rootless containers. Some ideas for solving this are to implement unprivileged veth bridges in the kernel, or to implement some sort of unprivileged userspace veth bridge implementation with TAP. The jury is out on this one at the moment, so we just use the host’s network namespace and call it a day.
- Some system calls such as setgroups(2), seteuid(2), and chown(2) are known not to work in rootless containers, unless multiple UID/GID mappings are configured with SUID utility binaries (newuidmap(1), newgidmap(1)). Programs such as apt and yum are known to require such system calls to work.
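The system call restriction follows from how user namespace ID mappings work: an ID inside the container is only usable if it falls inside a mapped range. The following is a minimal sketch of that lookup, assuming the three-column format of /proc/[pid]/uid_map (first ID inside the namespace, first ID outside, range length). The function names and the sample mappings are illustrative only and are not taken from any runtime.

```python
from typing import List, Optional, Tuple

# (first ID inside the namespace, first ID outside, range length)
Mapping = Tuple[int, int, int]


def parse_id_map(text: str) -> List[Mapping]:
    """Parse the contents of a uid_map or gid_map file."""
    mappings = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        inside, outside, length = (int(field) for field in fields)
        mappings.append((inside, outside, length))
    return mappings


def to_host_id(mappings: List[Mapping], container_id: int) -> Optional[int]:
    """Return the host ID for container_id, or None if it is unmapped."""
    for inside, outside, length in mappings:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None


# Single mapping: only container root (0) is mapped, here to host UID 1000.
single = parse_id_map("0 1000 1\n")
# Several mappings, of the kind newuidmap(1) can set up from a subordinate range.
multi = parse_id_map("0 1000 1\n1 100000 65536\n")

if __name__ == "__main__":
    for uid in (0, 33):
        print(uid, to_host_id(single, uid), to_host_id(multi, uid))
```

With the single mapping, any ID other than 0 has no host counterpart, which is why a package manager that tries to chown(2) files to a system user cannot succeed without the additional mappings.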
While setgroups(2) and seteuid(2) are only “temporary” for the process
which executes them, syscalls like chown(2) and mknod(2) actually
modify persistent storage and thus rootless container runtimes may wish
to persist this data in an interoperable fashion. For this purpose we define
the user.rootlesscontainers xattr attribute.
This is also useful for emulating “real root” with tools like PRoot
or remainroot. runROOTLESS provides a forked
version of PRoot that implements the user.rootlesscontainers xattr
attribute.
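To illustrate the idea, here is a rough sketch of how a tool might record an emulated chown(2) in the user.rootlesscontainers attribute rather than changing the real owner. The payload encoding below (two little-endian 32-bit integers) is a hypothetical stand-in chosen for brevity and should not be read as the serialisation that actual tools use. The helper names are also made up, and os.setxattr is only available on Linux with a filesystem that permits user extended attributes.

```python
import errno
import os
import struct
from typing import Optional, Tuple

XATTR_NAME = "user.rootlesscontainers"  # attribute name from the text above
# HYPOTHETICAL payload layout: uid and gid as little-endian uint32 values.
_PAYLOAD = struct.Struct("<II")


def encode_owner(uid: int, gid: int) -> bytes:
    """Serialise the emulated owner into the (hypothetical) payload."""
    return _PAYLOAD.pack(uid, gid)


def decode_owner(payload: bytes) -> Tuple[int, int]:
    """Inverse of encode_owner."""
    uid, gid = _PAYLOAD.unpack(payload)
    return uid, gid


def emulated_chown(path: str, uid: int, gid: int) -> None:
    """Record the requested owner instead of calling chown(2)."""
    os.setxattr(path, XATTR_NAME, encode_owner(uid, gid))


def emulated_owner(path: str) -> Optional[Tuple[int, int]]:
    """Return the recorded owner, or None if nothing was recorded."""
    try:
        return decode_owner(os.getxattr(path, XATTR_NAME))
    except OSError as exc:
        if exc.errno in (errno.ENODATA, errno.ENOTSUP):
            return None
        raise
```

A syscall emulator would call something like emulated_chown when it intercepts chown(2), and answer later stat(2) calls from emulated_owner, so that the recorded ownership survives across processes and tools.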
(O1) Images
- Add support to a tool for creating and modifying standardised container images as an unprivileged user.
- Add support to a tool for extracting and manipulating a root filesystem extracted from a standardised container image as an unprivileged user.
- Add support to a tool for distributing standardised container images as an unprivileged user. Note that currently the Open Container Initiative does not define a distribution format, so “distribution” refers to effectively doing rsync or something similar. Once distribution is standardised, this requirement will be extended.
- Add support to a tool for building standardised container images as an unprivileged user.
- Add support to a storage driver implementation to operate as an unprivileged user. This is separate from “extract and manipulate” because the naive implementation (just dump everything to a directory) works in principle but is not as efficient as using filesystem features to reduce the footprint of extracted images.
These requirements are implemented by several different tools that each solve a piece of the puzzle. These tools can then be combined to create a “full-fledged” implementation of an image store or handler.
Distribution is the simplest problem to solve in this respect, and is
accomplished by skopeo. In principle an Open Container Initiative
image can just be copied with rsync, but skopeo also has several features
related to conversion between different image formats that make it incredibly
useful for a variety of other tasks. skopeo does not require privileges for
any of this core functionality (and in fact, is normally used as an
unprivileged user).
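A plain file copy is enough in principle because the on-disk image layout is content-addressed. The sketch below assumes a layout directory in which blobs are stored under blobs/sha256/ and named after the SHA-256 digest of their content. It copies such a directory (shutil.copytree stands in for rsync) and then checks every blob against its name. The function names are hypothetical and nothing here requires privileges.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def copy_layout(src: str, dst: str) -> None:
    """Copy an image layout directory; a local stand-in for rsync."""
    shutil.copytree(src, dst)


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large layers are not read into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def corrupt_blobs(layout: str) -> List[str]:
    """Return the names of sha256 blobs whose content does not match their name."""
    blob_dir = Path(layout) / "blobs" / "sha256"
    if not blob_dir.is_dir():
        return []
    return [
        blob.name
        for blob in sorted(blob_dir.iterdir())
        if blob.is_file() and sha256_file(blob) != blob.name
    ]
```

After a copy, an empty list from corrupt_blobs means every blob arrived intact; any name it returns identifies a blob that was truncated or modified in transit.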
Extraction, modification, and creation are implemented by umoci. The
main issue with these operations is that the filesystem imposes many
limitations on an unprivileged user, since there are a wide variety of
operations that are not permitted (for security reasons). As a result,
umoci’s rootless support comes with some caveats:
- While umoci will recreate the precise (or as close as possible) root filesystem state of the container image as a privileged user, as an unprivileged user certain hacks are required. Examples include creating dummy files in place of device nodes that an unprivileged user cannot create, and forcing everything to be owned by the current unprivileged user. Consequently, the result of an image extraction may be counter-intuitive, but a lot of work has gone into reducing the weirdness of these hacks. In addition, umoci repack is implemented in such a way that it should not (in normal usage) result in an image that has umoci’s hacks baked into it (if you run an image modified by an unprivileged user as a privileged user it should work normally). However, umoci supports the usage of the previously mentioned user.rootlesscontainers xattr attribute, which means that programs that accept user.rootlesscontainers attributes will be able to interoperate correctly.
- Because of how discretionary access control works, umoci may have to modify the root filesystem of an extracted image when doing “read-only” operations on the extracted image. This means that root filesystems on read-only media (while umoci is operating on them) may run into issues.
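The extraction hacks described above can be sketched roughly as follows. This is a simplified illustration written with Python's tarfile module, not umoci's actual code: the function name is hypothetical, only directories, regular files, and device nodes are handled, and a real extractor would also have to reject entry names that escape the destination directory.

```python
import os
import tarfile


def rootless_extract(archive: tarfile.TarFile, dest: str) -> None:
    """Unpack directories, regular files and device nodes without privileges."""
    for member in archive:
        # NOTE: no path sanitisation is done here; this is only a sketch.
        target = os.path.join(dest, member.name)
        if member.isdir():
            os.makedirs(target, exist_ok=True)
        elif member.ischr() or member.isblk():
            # Creating device nodes needs privileges, so leave an empty
            # regular file in place of the node.
            os.makedirs(os.path.dirname(target), exist_ok=True)
            open(target, "wb").close()
        elif member.isreg():
            os.makedirs(os.path.dirname(target), exist_ok=True)
            source = archive.extractfile(member)
            with open(target, "wb") as out:
                out.write(source.read())
            # Permission bits can still be applied by the owner.
            os.chmod(target, member.mode & 0o777)
        # member.uid and member.gid are deliberately ignored, so every
        # extracted entry ends up owned by the calling user.
```

The placeholder files and the discarded ownership are exactly the information that has to be tracked elsewhere if a later repack is to produce an image without these hacks baked in.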
Building of images is implemented by orca-build, which is a
proof-of-concept wrapper around runc, umoci, and skopeo. There are
several other container builder projects which may implement this feature in
the future, but right now orca-build is fast enough and is quite minimal. It
also has the additional feature of being compatible with the Dockerfile
format for specifying build steps.
(O2) Orchestration
Orchestration is currently the biggest unknown. It’s not quite clear what the complete list of tasks is that would be necessary to make a container orchestrator run with the majority of its functionality working as an unprivileged user.
- Implement a way for a container networking interface plugin to run as an unprivileged user. This may involve just creating alternative plugins and swapping out the privileged versions.
There has not been much significant progress on these tasks. However, Cloud Foundry have implemented experimental rootless container support. The downside to their implementation is that it still requires privileged setup steps, and a privileged networking binary during the container lifecycle (thus not making it truly rootless, but it’s a great first step).
Prior Art
There has been some work in this space before, but (to our knowledge) nothing this ambitious. The following is a list of projects that have some kind of rootless container support.
- bubblewrap (formerly xdg-app) implements parts of a container runtime as an unprivileged user, but to our knowledge it doesn’t implement enough of the Open Container Initiative standard (even with bwrap-oci) to be a full container runtime.
- lxc has implemented quite a substantial portion of rootless containers, but their implementation requires privileged helper binaries to set up cgroups and network interfaces. This is a reasonable compromise for a container runtime (and runc’s solution was to just ignore those use cases), and lxc does have knobs to disable the use of those binaries, but it is still more of a hassle to use lxc in a way that would be required by this project.
- binctr was a proof-of-concept written by Jessie Frazelle and was the inspiration for the upstream rootless container work in runc. It has the interesting additional feature of creating a single static binary that contains both the container runtime and the root filesystem of the container.
- singularity was developed in parallel to the runc implementation of rootless containers. It is a runtime that was developed for HPC requirements, and so has quite a few odd design decisions based on those requirements. However, one point of note is that in its default configuration it has several setuid binaries that it uses for core operations (such as mounting the loopback device used as the rootfs), which make it impractical for the use cases rootless containers were meant to solve. It appears that singularity does support user namespaces, but it’s not clear whether a fully unprivileged setup is commonly used or has enough features to contend with runc’s rootless containers.
FAQ
Who are we?
At the moment, a party of one. But if this sounds cool to you, feel free to join me! I’m on Mastodon, and can be reached at ([email protected]).