Demystifying Docker Internals: Linux Namespaces, Cgroups, and Container Isolation
An exhaustive kernel-level masterclass on how Docker containers achieve isolation — exploring Linux namespaces, control groups (cgroups), Union File Systems (OverlayFS), the container runtime stack, and the security model that separates containers from virtual machines.
A Docker container is not a virtual machine. This is one of the most important conceptual clarifications in modern DevOps education. A virtual machine runs a complete guest operating system on top of a hypervisor, with its own kernel, device drivers, and system libraries — all duplicated from the host. A container, by contrast, shares the host Linux kernel and uses a collection of kernel primitives to create the illusion of isolation: its own process tree, network stack, filesystem view, and resource limits.
Understanding this distinction is critical because it informs every architectural decision you make with containers, from security hardening and resource allocation to debugging kernel-level failures. When a container crashes or misbehaves unexpectedly, the root cause can often be traced to one of these kernel primitives: a cgroup limit being exceeded, a capability restriction or seccomp filter blocking a system call, or a namespace hiding a resource the process expected to see.
In this masterclass, Professor Pixel will walk you through the exact Linux kernel features Docker uses to build containers from scratch: Namespaces (process isolation), Control Groups / cgroups (resource limits), OverlayFS (layered filesystems), the Container Runtime Interface (CRI), and the capability and seccomp security model that governs what system calls a container process is permitted to make.
1. Containers vs Virtual Machines: A Kernel Perspective
The architectural difference between a container and a VM can be stated precisely in terms of kernel sharing:
| Layer | Virtual Machine | Docker Container |
|---|---|---|
| Kernel | Guest kernel (separate, full OS) | Host kernel (shared via syscalls) |
| Isolation Mechanism | Hypervisor hardware virtualization (VT-x, AMD-V) | Linux namespaces + cgroups |
| Startup Time | 10–60 seconds (kernel boot) | 100–500 ms (process fork) |
| Memory Overhead | 512 MB – 2 GB per guest OS | Megabytes (shared kernel) |
| Security Boundary | Hardware-level hypervisor boundary | Kernel namespaces + seccomp + capabilities |
Both containers and VMs provide isolation, but the strength of the boundary differs: a VM escape requires exploiting the hypervisor (rare), while a container escape requires exploiting a bug in the shared kernel, a misconfigured mount, or an over-privileged capability (more frequently documented). This is why privileged containers (--privileged) are highly dangerous in production environments: the namespaces still exist, but the flag grants every capability, disables the default seccomp filter, and exposes host devices, so the namespace boundaries no longer contain a root process.
2. Linux Namespaces: The Foundation of Process Isolation
Linux Namespaces are a kernel feature that wraps certain global resources in an abstraction layer so that processes inside a namespace see a private copy of the resource, independent from the host. Docker uses seven distinct namespace types to create container isolation:
*Diagram: The seven Linux namespace types used by Docker for container isolation.*
2.1 PID Namespace
The PID namespace gives a container its own process ID number space. The first process started in a new PID namespace gets PID 1, regardless of what PID the host kernel assigned to it. This means a container's init process (e.g., Nginx or Node.js) sees itself as PID 1 and believes it is the system's root process. From the host's perspective, the same process might be PID 8472.
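PID isolation prevents a container process from seeing, signaling, or killing processes belonging to the host or other containers. Running kill -9 1 inside a container targets the container's own init process (PID 1 in its namespace), never the host's init process. The kernel also protects a namespace's PID 1 from the other processes in that namespace: a signal sent from inside is delivered only if PID 1 has installed a handler for it, so SIGKILL and SIGSTOP sent from inside the container are ignored. Only a process in an ancestor namespace, such as the host running docker kill, can force-kill the container's init process.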
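Namespace membership can be inspected directly: Linux exposes one symbolic link per namespace under /proc/<pid>/ns/, and two processes share a PID namespace exactly when the inode numbers in those links match. The following is a minimal Python sketch for Linux; the helper names are illustrative and not part of any Docker API.

```python
import os
import re

def parse_ns_link(link):
    """Split a link target such as 'pid:[4026531836]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", link)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return match.group(1), int(match.group(2))

def current_namespace(ns_type="pid"):
    """Return (type, inode) for this process, or None where /proc is unavailable."""
    try:
        return parse_ns_link(os.readlink(f"/proc/self/ns/{ns_type}"))
    except OSError:
        return None  # not Linux, or /proc is not mounted

if __name__ == "__main__":
    # Inside a container the PID is small (often 1); the host sees a different one.
    print("PID as seen from inside this namespace:", os.getpid())
    print("PID namespace:", current_namespace("pid"))
```

Running the script once on the host and once inside a container should print different inode numbers for the PID namespace.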
2.2 Network Namespace
Each container receives its own network stack: private network interfaces, IP addresses, routing tables, firewall rules (iptables), and socket tables. The container's eth0 interface is one end of a virtual Ethernet pair (veth pair); the other end is attached to a Linux bridge (docker0) on the host. On this default bridge, containers can make outbound connections through NAT and can reach each other by IP address, but nothing outside the host can reach a container until a port is published with -p. Name-based discovery between containers requires a user-defined Docker network.
2.3 Mount Namespace
The mount namespace gives each container its own view of the filesystem mount table. When Docker starts a container, it sets up a layered filesystem (OverlayFS, described in section 4) and mounts it as the root filesystem for the container's mount namespace. The container cannot see any of the host's other mount points unless explicitly bind-mounted with -v or --mount.
2.4 UTS, IPC, and User Namespaces
The UTS namespace isolates the container's hostname, so each container can have its own hostname independent of the host machine. The IPC namespace isolates System V IPC resources (message queues, semaphores, shared memory segments) and POSIX message queues, so containers cannot share memory with each other unintentionally. The User namespace maps container UIDs to host UIDs, allowing a process to run as root inside the container (UID 0) while being mapped to an unprivileged UID on the host. This mapping is the key security feature behind rootless containers. In a standard Docker installation it is not enabled by default and must be configured, for example with the daemon's userns-remap option.
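The UID translation performed by a user namespace is recorded in /proc/<pid>/uid_map, one range per line in the form "inside outside length". The sketch below, in Python, shows how such a map turns a container UID into a host UID. The mapping values are an assumed example, not something Docker guarantees.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map lines of the form 'inside outside length'."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries

def to_host_id(entries, container_id):
    """Translate an ID seen inside the namespace to the ID the host sees."""
    for inside, outside, length in entries:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None  # no mapping exists for this ID

# Assumed example: container IDs 0..65535 map to host IDs 100000..165535.
example = parse_id_map("0 100000 65536\n")
print(to_host_id(example, 0))     # container root -> 100000
print(to_host_id(example, 1000))  # ordinary container user -> 101000
```

With this mapping, a process that is root inside the container holds only the privileges of host UID 100000 outside it.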
3. Control Groups (cgroups): Resource Limits and Accounting
Namespaces provide isolation of what a process can see. Control Groups (cgroups) provide limits on what a process can consume. Without cgroups, a single container could exhaust the host's entire CPU, RAM, or I/O bandwidth, causing starvation for all other containers and the host itself.
3.1 cgroup Subsystems
Docker attaches each container to a cgroup hierarchy with the following resource controllers:
- cpu / cpuacct: Limits CPU time allocation. --cpus=2.0 sets a container's quota to 2 CPU cores' worth of processing time per scheduling period (default 100 ms). Internally, on a cgroups v1 host, Docker writes to /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us.
- memory: Limits RAM usage. --memory=512m sets the hard limit. When a container's processes exceed this limit, the kernel OOM (Out-of-Memory) killer terminates a process inside the cgroup, typically the largest consumer. A rough indicator of memory pressure is the ratio $P_{\text{oom}} = \frac{M_{\text{used}}}{M_{\text{limit}}}$: when it approaches 1.0, an OOM kill is imminent.
- blkio: Limits read/write throughput and IOPS for block devices. Prevents a single container from saturating disk I/O, protecting database containers from noisy neighbor applications.
- pids: Limits the number of processes a container can create. Prevents fork bombs: a malicious or buggy process that calls fork() in a tight loop, exhausting the host's PID table and crashing the system.
- net_cls / net_prio: Tags network packets with class IDs for traffic shaping, allowing QoS policies to be applied to container network traffic.
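The arithmetic behind the cpu and memory limits above can be sketched in a few lines of Python. The function names are illustrative. The 100 ms period and the binary size suffixes follow the text; Docker performs the real conversion internally.

```python
CFS_PERIOD_US = 100_000  # default scheduling period: 100 ms, in microseconds

UNITS = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def cpus_to_quota(cpus, period_us=CFS_PERIOD_US):
    """CPU time in microseconds the cgroup may consume per period."""
    return int(round(cpus * period_us))

def parse_memory(value):
    """Convert a size such as '512m' into bytes (binary multiples)."""
    value = value.strip().lower()
    if value and value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)  # plain byte count

def oom_pressure(used_bytes, limit_bytes):
    """The ratio P_oom = M_used / M_limit from the text."""
    return used_bytes / limit_bytes

limit = parse_memory("512m")
print(cpus_to_quota(2.0))                 # 200000
print(limit)                              # 536870912
print(oom_pressure(limit // 2, limit))    # 0.5
```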
3.2 cgroups v1 vs v2
Most current Linux distributions boot with cgroups v2 by default, and Docker has supported it since version 20.10. cgroups v2 provides a unified hierarchy (all controllers under one tree) in place of cgroups v1's per-controller trees. It also offers more consistent memory accounting (including page cache), an improved I/O cost model, and safe delegation of subtrees to unprivileged users, which rootless containers rely on.
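The CPU limit that results from the CFS settings is the ratio of quota to period, $C_{\text{limit}} = \frac{\text{cpu.cfs\_quota\_us}}{\text{cpu.cfs\_period\_us}}$, and Docker's --cpus flag accepts values up to $N_{\text{cores}}$, where $N_{\text{cores}}$ is the number of CPU cores available on the host. A container with cpu.cfs_quota_us = 200000 and cpu.cfs_period_us = 100000 gets a CPU quota of 2.0 cores.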
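The quota-to-period ratio can be read back under either cgroup version. On v1 the two values live in separate files (cpu.cfs_quota_us and cpu.cfs_period_us); on v2 they share a single cpu.max file. The Python sketch below assumes the conventional /sys/fs/cgroup mount point and uses illustrative function names.

```python
from pathlib import Path

def cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 on a unified (v2) hierarchy, otherwise 1."""
    return 2 if (Path(root) / "cgroup.controllers").exists() else 1

def cores_from_v1(quota_us, period_us):
    """v1: quota / period; a quota of -1 means no limit."""
    return None if quota_us < 0 else quota_us / period_us

def cores_from_v2(cpu_max):
    """v2: cpu.max holds '<quota> <period>', or 'max <period>' when unlimited."""
    quota, period = cpu_max.split()
    return None if quota == "max" else int(quota) / int(period)

print(cores_from_v1(200000, 100000))    # 2.0
print(cores_from_v2("200000 100000"))   # 2.0
print(cores_from_v2("max 100000"))      # None (unlimited)
```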
4. OverlayFS: Layered Union Filesystems
A Docker image is not a monolithic disk image. It is a stack of read-only layers, each representing one RUN, COPY, or ADD instruction in the Dockerfile. This architecture is implemented by OverlayFS (Overlay Filesystem), the default storage driver in modern Docker installations.
*Diagram: Docker image layers stacked by OverlayFS, with a thin R/W container layer on top.*
OverlayFS merges these layers into a single unified directory. Three directories define the view, and a fourth is used internally:
- lowerdir: The read-only stack of image layers, merged together. Processes can read files from any lower layer.
- upperdir: The writable layer private to this running container. Any file modification (create, delete, update) goes here, not into the lower read-only layers.
- merged: The unified view of lowerdir + upperdir presented to container processes as their root filesystem (/).
- workdir: An empty scratch directory on the same filesystem as upperdir, which OverlayFS uses to prepare files before moving them into the upperdir.
The Copy-on-Write (CoW) mechanism governs writes. When a container modifies a file that exists in a lower layer (e.g., editing /etc/nginx/nginx.conf), OverlayFS copies the entire file up to the upperdir on first write, then applies the modification. Subsequent writes to the same file in the same container session go directly to the upperdir without additional copying. This means container startup is fast (no duplication of image layers) and image layer storage is shared between all containers based on the same image.
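The copy-up behavior described above can be modeled with plain dictionaries. The following Python toy is not how the kernel implements OverlayFS. It only illustrates the lookup order (upperdir first, then lower layers) and the fact that the first write copies the whole file while later writes do not.

```python
class OverlayModel:
    """Toy model: each layer is a dict mapping path -> file contents."""

    def __init__(self, lower_layers):
        self.lower = list(lower_layers)  # read-only layers, topmost first
        self.upper = {}                  # writable container layer
        self.copy_ups = 0                # number of files copied up so far

    def read(self, path):
        if path in self.upper:           # upperdir wins over any lower layer
            return self.upper[path]
        for layer in self.lower:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        """Modify a file; only the first write triggers a copy-up."""
        if path not in self.upper:
            self.upper[path] = self.read(path)  # copy the entire file up
            self.copy_ups += 1
        self.upper[path] += data

base = {"/etc/nginx/nginx.conf": "worker_processes 1;\n"}
fs = OverlayModel([base])
fs.append("/etc/nginx/nginx.conf", "# tuned\n")
fs.append("/etc/nginx/nginx.conf", "# again\n")
print(fs.copy_ups)                               # 1
print(base["/etc/nginx/nginx.conf"] == "worker_processes 1;\n")  # True
```

Because the lower layer is never modified, any number of containers can share it safely.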
A critical developer pitfall: when a container is removed (docker rm), the upperdir is deleted along with all changes made during the container's lifetime. To persist data, you must use Docker Volumes or bind mounts, which bypass OverlayFS entirely and write directly to the host filesystem.
5. The Container Runtime Stack: containerd, runc, and the OCI Specification
Docker is not a single monolithic binary. It is a layered stack of components, each with a defined responsibility:
*Diagram: Docker's runtime stack from CLI to kernel-level process creation.*
- Docker CLI: The user-facing command tool. Sends REST API calls to the Docker daemon over a Unix socket (/var/run/docker.sock).
- dockerd (Docker Daemon): Manages the high-level Docker API: building images, managing networks, handling volumes, and delegating container lifecycle to containerd.
- containerd: An industry-standard container runtime that manages the full lifecycle of container images and containers. Kubernetes uses containerd directly via the CRI (Container Runtime Interface), bypassing dockerd entirely.
- containerd-shim: A lightweight per-container supervisor process that remains alive after runc exits. It holds the container's stdio file descriptors and reports container exit status to containerd without requiring containerd itself to stay as the container's parent process.
- runc: The low-level OCI (Open Container Initiative) runtime that performs the actual kernel calls: creating namespaces with clone(2), configuring cgroups, setting up OverlayFS mounts, dropping capabilities, applying seccomp filters, and then exec-ing the container's entrypoint binary.
The Open Container Initiative (OCI) defines the Image Specification (how container images are structured as JSON manifests and content-addressed layers) and the Runtime Specification (the JSON configuration that runc consumes to create a container). This standardization means any OCI-compliant runtime (runc, crun, kata-containers) can run any OCI-compliant image.
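To make the Runtime Specification concrete, the Python sketch below builds a heavily trimmed config.json of the kind an OCI runtime consumes. The field names follow the OCI runtime specification, but the values (hostname, environment, rootfs path) are assumptions for illustration, and a real configuration, such as the one generated by runc spec, contains many more fields.

```python
import json

def minimal_oci_config(args, hostname="demo"):
    """Return a trimmed OCI runtime configuration as a Python dict."""
    return {
        "ociVersion": "1.0.2",
        "process": {
            "user": {"uid": 0, "gid": 0},
            "args": list(args),            # entrypoint the runtime will exec
            "cwd": "/",
            "env": ["PATH=/usr/bin:/bin"],
        },
        "root": {"path": "rootfs", "readonly": True},
        "hostname": hostname,
        "linux": {
            # One entry per namespace the runtime should create.
            "namespaces": [
                {"type": t} for t in ("pid", "network", "ipc", "uts", "mount")
            ],
        },
    }

config = minimal_oci_config(["/bin/sh"])
print(json.dumps(config, indent=2))
```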
6. Dockerfile Optimization: Layer Caching and Build Context
Every Dockerfile instruction that modifies the filesystem creates a new OverlayFS layer with a content-addressable SHA256 hash. The Docker build cache stores these hashes, and on subsequent builds, Docker reuses any layer whose instruction and input content have not changed. Understanding the cache invalidation rules is essential for keeping CI/CD build pipelines fast.
6.1 Cache Invalidation Rules
Docker's cache check works as follows: starting from the first instruction, Docker compares each instruction against the cached layer that was built from the same parent layer. For RUN, the comparison uses the instruction text; for COPY and ADD, it also depends on the files being copied. If they match, the cached layer is reused and Docker moves on to the next instruction. At the first mismatch, the cache is invalidated for that layer and every subsequent layer. This means the order of instructions directly controls build performance.
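The "first mismatch invalidates everything after it" rule falls out naturally if each layer's cache key is derived from its own instruction plus its parent's key. The Python sketch below models that chaining. It is a simplification and not Docker's actual key format, and the "#src=..." suffix is a made-up stand-in for the checksum of the copied files.

```python
import hashlib

def layer_keys(instructions):
    """Chain hashes: each key depends on the instruction and on its parent key."""
    keys, parent = [], ""
    for instruction in instructions:
        parent = hashlib.sha256((parent + "\n" + instruction).encode()).hexdigest()
        keys.append(parent)
    return keys

def cache_hits(previous, current):
    """For each instruction of the new build: can the old layer be reused?"""
    old, new = layer_keys(previous), layer_keys(current)
    return [i < len(old) and old[i] == key for i, key in enumerate(new)]

# Dependencies installed before the source is copied: only the last layer rebuilds.
good_v1 = ["FROM node:20", "COPY package.json .", "RUN npm ci", "COPY . . #src=v1"]
good_v2 = ["FROM node:20", "COPY package.json .", "RUN npm ci", "COPY . . #src=v2"]
print(cache_hits(good_v1, good_v2))  # [True, True, True, False]

# Source copied first: the dependency install is rebuilt on every code change.
bad_v1 = ["FROM node:20", "COPY . . #src=v1", "RUN npm ci"]
bad_v2 = ["FROM node:20", "COPY . . #src=v2", "RUN npm ci"]
print(cache_hits(bad_v1, bad_v2))    # [True, False, False]
```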
6.2 Multi-Stage Builds for Smaller Images
A common mistake is shipping build toolchains (compilers, build systems, test frameworks) inside production images. Multi-stage builds solve this by using an intermediate build stage that compiles the application, then copying only the compiled artifact into a minimal runtime image. For example, a Go service can be compiled in a full Go builder stage and the resulting binary copied into a distroless runtime image.
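A minimal sketch of such a multi-stage build is shown below as a Dockerfile held in a Python string, together with a small helper that lists the stages. The image tags, paths, and the presence of a Go module at the root of the build context are assumptions for illustration.

```python
DOCKERFILE = """\
# Stage 1: full Go toolchain, used only for compiling
FROM golang:1.22 AS builder
WORKDIR /src
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app .

# Stage 2: minimal runtime image; only this stage is shipped
FROM gcr.io/distroless/static-debian12
COPY --from=builder /out/app /app
ENTRYPOINT ["/app"]
"""

def stages(dockerfile):
    """Return (base image, stage name or None) for every FROM line."""
    result = []
    for line in dockerfile.splitlines():
        parts = line.split()
        if parts and parts[0].upper() == "FROM":
            named = len(parts) >= 4 and parts[2].upper() == "AS"
            result.append((parts[1], parts[3] if named else None))
    return result

for image, name in stages(DOCKERFILE):
    print(image, "->", name or "(final stage)")
```

Only the last stage becomes the shipped image. The builder stage exists solely so that COPY --from=builder can pull the compiled binary out of it.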
The final image contains only the compiled Go binary and the distroless base (approximately 5 MB), compared to a Go builder image (around 800 MB). This reduces the attack surface of the deployed image by eliminating build tools, package managers, and shell utilities. Without a shell in the image, an attacker who achieves command injection has no /bin/sh to run and far fewer tools to work with, although vulnerabilities in the application itself remain exploitable.
6.3 .dockerignore: Controlling Build Context
When running docker build, Docker sends the entire build context directory to the Docker daemon before reading the Dockerfile. Without a .dockerignore file, large directories like node_modules/, .git/, and build artifacts are included in this context transfer, causing unnecessary network overhead and potentially leaking secrets into the build cache.
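The effect of a .dockerignore file can be approximated in Python with the standard fnmatch module. This is a simplified model: Docker's real matching rules are richer (for example, they support ** wildcards and ! exceptions), and the file list and patterns here are hypothetical.

```python
from fnmatch import fnmatch

def is_ignored(path, patterns):
    """True if the path, or any directory above it, matches an ignore pattern."""
    parts = path.split("/")
    prefixes = ["/".join(parts[: i + 1]) for i in range(len(parts))]
    return any(
        fnmatch(prefix, pattern.rstrip("/"))
        for prefix in prefixes
        for pattern in patterns
    )

def filter_context(paths, patterns):
    """Return the paths that would still be sent as build context."""
    return [path for path in paths if not is_ignored(path, patterns)]

context = [
    "Dockerfile",
    "src/app.js",
    "node_modules/left-pad/index.js",
    ".git/config",
    ".env",
]
ignore = ["node_modules/", ".git/", ".env", "*.log"]
print(filter_context(context, ignore))  # ['Dockerfile', 'src/app.js']
```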
7. Security Model: Capabilities and Seccomp
Docker's security model beyond namespaces and cgroups consists of two additional mechanisms: Linux Capabilities and Seccomp (Secure Computing Mode).
7.1 Linux Capabilities
The traditional Unix model is binary: either a process runs as root (UID 0, all privileges) or as a non-root user (limited). Linux Capabilities break root's privileges into 41 distinct capabilities that can be individually granted or revoked. Docker drops most capabilities from containers by default and grants only those necessary for common application workloads:
- CAP_NET_BIND_SERVICE: allows binding to ports below 1024 (e.g., port 80).
- CAP_CHOWN: allows changing file ownership.
- CAP_SETUID / CAP_SETGID: allows changing process user/group identity.
Dangerous capabilities like CAP_SYS_ADMIN (broad system administration powers), CAP_SYS_PTRACE (process tracing), and CAP_NET_ADMIN (network configuration) are dropped. Running a container with --privileged re-grants all capabilities, effectively disabling the capability model entirely.
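A process's effective capability set is visible as a hexadecimal bitmask on the CapEff line of /proc/<pid>/status. The Python sketch below decodes such a mask for the capabilities named in this section. The bit numbers come from linux/capability.h. The sample mask is one commonly seen inside a default Docker container, but it should be treated as an assumption, since the exact value depends on the Docker version and configuration.

```python
CAP_BITS = {  # subset of capability bit numbers, for illustration only
    "CAP_CHOWN": 0,
    "CAP_SETGID": 6,
    "CAP_SETUID": 7,
    "CAP_NET_BIND_SERVICE": 10,
    "CAP_NET_ADMIN": 12,
    "CAP_SYS_PTRACE": 19,
    "CAP_SYS_ADMIN": 21,
}

def decode_caps(mask_hex):
    """Return the names (from CAP_BITS) whose bit is set in the mask."""
    mask = int(mask_hex, 16)
    return {name for name, bit in CAP_BITS.items() if mask & (1 << bit)}

def effective_caps(status_text):
    """Decode the CapEff line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return decode_caps(line.split()[1])
    return set()

sample_status = "Name:\tnginx\nCapEff:\t00000000a80425fb\n"  # assumed sample
granted = effective_caps(sample_status)
print(sorted(granted))
print("CAP_SYS_ADMIN" in granted)  # False
```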
7.2 Seccomp Filters
Seccomp allows Docker to restrict which Linux system calls (syscalls) a container process can invoke. Docker ships a default seccomp profile that blocks ~44 dangerous system calls, including ptrace (process inspection), keyctl (kernel key management), unshare (namespace creation), and reboot. When a blocked syscall is attempted, the kernel returns EPERM to the calling process, preventing the exploit from succeeding.
Custom seccomp profiles are specified as JSON and passed with --security-opt seccomp=/path/to/profile.json. High-security environments often maintain custom profiles that allow only the exact syscalls their specific application uses, following the principle of least privilege.
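A deny-by-default profile of this kind can be generated programmatically. The Python sketch below emits JSON in the shape Docker's seccomp profiles use (defaultAction, architectures, syscalls). The syscall list is hypothetical and far too short for a real application, which would need to be traced first to discover what it actually calls.

```python
import json

def allowlist_profile(syscalls):
    """Build a seccomp profile that permits only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",        # anything not listed fails
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(set(syscalls)), "action": "SCMP_ACT_ALLOW"},
        ],
    }

# Hypothetical allowlist for illustration only.
profile = allowlist_profile(["read", "write", "exit_group", "futex", "epoll_wait"])
print(json.dumps(profile, indent=2))
```

The printed JSON can be saved to a file and passed to Docker through the --security-opt seccomp= option described above.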
8. Resource Usage Comparison
*Chart: Memory overhead and startup time of virtual machines versus Docker containers running equivalent workloads.*
9. Frequently Asked Questions (FAQ)
Q1: Why are containers faster to start than virtual machines?
Containers share the host kernel and do not need to boot a guest operating system. Starting a container is essentially a kernel clone(2) system call with namespace flags, plus OverlayFS mount setup — completing in milliseconds. A VM must POST, load a bootloader, unpack and initialize a guest kernel, and mount its own filesystems, which takes seconds to minutes.
Q2: What happens when a container exceeds its memory limit?
The Linux kernel's OOM (Out-of-Memory) killer selects a process inside the cgroup to terminate. Docker reports this as a container exit with code 137 (128 + 9, where 9 is the signal number of SIGKILL). To debug OOM kills: check docker inspect --format='{{.State.OOMKilled}}' on the container and review dmesg for OOM kernel log entries.
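The "128 + signal number" convention can be decoded with Python's standard signal module. This is a small illustrative helper, not part of any Docker tooling. Note that exit code 137 only shows that the process received SIGKILL; the OOMKilled flag is what confirms the cause.

```python
import signal

def describe_exit(code):
    """Interpret a container exit code using the 128 + signal convention."""
    if code > 128:
        number = code - 128
        try:
            name = signal.Signals(number).name
        except ValueError:
            name = f"signal {number}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"exited with status {code}"

print(describe_exit(137))  # killed by SIGKILL
print(describe_exit(143))  # killed by SIGTERM
print(describe_exit(1))    # exited with status 1
```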
Q3: How does Docker networking work at the kernel level?
Docker creates a virtual Ethernet pair (veth pair) for each container: one end (eth0) goes into the container's network namespace, the other end stays on the host and is attached to the docker0 Linux bridge. IP packets from the container travel through the bridge, hit iptables NAT rules that Docker configures, and are then routed to the host's external interface — appearing to come from the host's IP address.
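The routing decision described in this answer can be modeled in Python with the standard ipaddress module. This is a toy model, not the kernel's packet path. The 172.17.0.0/16 subnet is the usual docker0 default, and the host address is a documentation-range placeholder.

```python
import ipaddress

BRIDGE_SUBNET = ipaddress.ip_network("172.17.0.0/16")  # assumed docker0 subnet
HOST_IP = ipaddress.ip_address("203.0.113.10")          # hypothetical host address

def route(src, dst):
    """Return (path taken, source address as the destination sees it)."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    if dst_ip in BRIDGE_SUBNET:
        return "bridge", src_ip       # container to container: no NAT
    if src_ip in BRIDGE_SUBNET:
        return "masquerade", HOST_IP  # leaving the host: source is rewritten
    return "direct", src_ip

print(route("172.17.0.2", "172.17.0.3"))
print(route("172.17.0.2", "8.8.8.8"))
```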
Q4: What is the difference between runc and containerd?
runc is the low-level OCI runtime that directly creates namespaces, sets up cgroups, and exec's the container process. It exits after the container starts. containerd is the higher-level daemon that manages the container's full lifecycle (start, stop, pause, resume, image pulling, snapshot management), calling runc as a subprocess when creating containers.
Q5: What is a Docker layer and why does layer order matter?
Each Dockerfile instruction (RUN, COPY, ADD) that modifies the filesystem creates a new OverlayFS layer. Layer order matters because each layer is built on top of the layers before it, so its cache entry is valid only if every earlier layer is unchanged. Placing frequently changing steps (like copying application code) early in the Dockerfile forces all subsequent layers to rebuild on every code change, eliminating cache reuse. Best practice: copy dependency manifests first, install dependencies (rarely changes), then copy application code (frequently changes).
Q6: What does --privileged actually do and why is it dangerous?
The --privileged flag grants the container all Linux capabilities, disables seccomp filtering, and exposes all host devices under /dev. A process inside a privileged container can escape to the host by mounting the host's root block device or loading kernel modules. It should be avoided in production environments and replaced with specific capability grants (--cap-add) and device mappings (--device).
Q7: How do Docker volumes differ from bind mounts?
Docker volumes are managed by the Docker daemon and stored in /var/lib/docker/volumes/. They are portable, can use volume driver plugins for networked storage, and are the recommended way to persist container data. Bind mounts directly link a host directory path into the container, giving the container access to host filesystem paths. Bind mounts are simpler but less portable and can expose sensitive host filesystem areas.
Q8: What is a fork bomb and how does Docker prevent it?
A fork bomb is a malicious process (e.g., the classic :(){ :|:& };: shell function) that recursively spawns child processes until the system's PID table is exhausted, freezing the entire machine. Docker prevents this using the pids cgroup controller: --pids-limit 100 restricts the total number of processes a container can create to 100. When the limit is reached, further fork(2) calls return EAGAIN.
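The behavior of the pids controller can be modeled in Python as a simple counter. This is a toy simulation, not a real cgroup. It only shows that once the limit is reached every further fork attempt fails with EAGAIN, so the process count never exceeds the limit.

```python
import errno

class PidsCgroup:
    """Toy model of the pids controller for a single container."""

    def __init__(self, limit):
        self.limit = limit
        self.current = 1  # the container's init process

    def fork(self):
        if self.current >= self.limit:
            raise OSError(errno.EAGAIN, "Resource temporarily unavailable")
        self.current += 1

cgroup = PidsCgroup(limit=100)
failed = 0
for _ in range(1000):          # a runaway loop trying to fork 1000 times
    try:
        cgroup.fork()
    except OSError as exc:
        if exc.errno == errno.EAGAIN:
            failed += 1
print(cgroup.current, failed)  # 100 901
```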