containers are not magic. containers are a small pile of linux primitives wearing a trench coat. you can build one in an afternoon if you don't care about correctness, security, performance, or your friendships.
i didn't care about any of those things, so i built one.
the four ingredients
- namespaces — for isolation (pid, mount, net, uts, ipc, user)
- cgroups v2 — for resource limits (cpu, mem, io)
- an overlay filesystem — for the rootfs
- vibes — for everything else
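to make the cgroups v2 ingredient concrete, here is a rough sketch of what "resource limits" means on disk. it is not lifted from my runtime. it assumes cgroup2 is mounted at /sys/fs/cgroup, that you can write there, and that the memory and cpu controllers are enabled in the parent's cgroup.subtree_control. the group name 'sadrunc' and the helper names limitFiles and applyLimits are made up for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// limitFiles returns the cgroup v2 control files to write and their contents.
// memBytes goes to memory.max; quotaUs and periodUs go to cpu.max in the
// kernel's "<quota> <period>" format (both in microseconds).
func limitFiles(memBytes int64, quotaUs, periodUs int) map[string]string {
	return map[string]string{
		"memory.max": strconv.FormatInt(memBytes, 10),
		"cpu.max":    fmt.Sprintf("%d %d", quotaUs, periodUs),
	}
}

// applyLimits creates the cgroup directory, writes each limit file, and then
// moves pid into the group by writing it to cgroup.procs.
func applyLimits(cgroupDir string, pid int, files map[string]string) error {
	if err := os.MkdirAll(cgroupDir, 0o755); err != nil {
		return err
	}
	for name, value := range files {
		path := filepath.Join(cgroupDir, name)
		if err := os.WriteFile(path, []byte(value), 0o644); err != nil {
			return fmt.Errorf("write %s: %w", name, err)
		}
	}
	procs := filepath.Join(cgroupDir, "cgroup.procs")
	return os.WriteFile(procs, []byte(strconv.Itoa(pid)), 0o644)
}

func main() {
	// 256 MiB of memory and half a cpu (50ms of every 100ms).
	files := limitFiles(256<<20, 50000, 100000)
	for name, value := range files {
		fmt.Printf("%s = %s\n", name, value)
	}
	// hypothetical group name; only touch the real hierarchy when asked to.
	if len(os.Args) > 1 && os.Args[1] == "apply" {
		if err := applyLimits("/sys/fs/cgroup/sadrunc", os.Getpid(), files); err != nil {
			fmt.Fprintln(os.Stderr, "apply failed:", err)
			os.Exit(1)
		}
	}
}
```

io limits are left out on purpose: io.max is keyed by a block device's major:minor number, so there is no one-size-fits-all value to write.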
the entire fork-and-exec
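// needs os, os/exec, syscall; must() is assumed to panic on a non-nil error
// re-exec ourselves as "child" inside a fresh set of namespaces
cmd := exec.Command("/proc/self/exe", append([]string{"child"}, args...)...)
cmd.SysProcAttr = &syscall.SysProcAttr{
	Cloneflags: syscall.CLONE_NEWUTS |
		syscall.CLONE_NEWPID |
		syscall.CLONE_NEWNS |
		syscall.CLONE_NEWNET |
		syscall.CLONE_NEWIPC |
		syscall.CLONE_NEWUSER,
	// without a mapping the child runs as the overflow uid (usually 65534)
	// and loses its capabilities on exec, so the mounts in the child fail
	UidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getuid(), Size: 1}},
	GidMappings: []syscall.SysProcIDMap{{ContainerID: 0, HostID: os.Getgid(), Size: 1}},
}
cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
must(cmd.Run())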
in the child you set the hostname to something embarrassing (mine is 'localhorst'), pivot_root into your overlay, mount a fresh /proc inside the new root so it reflects the new pid namespace rather than the host's, and exec the user's command. that's it. that is the whole thing. you have invented a sad runc.
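here is one possible ordering of those child-side steps, as a hedged sketch rather than my actual code. it is linux-only, and it assumes the process is root inside its namespaces, that the kernel will let it mount overlayfs there (real root, or a reasonably recent kernel for the unprivileged case), and that the image contains a /proc directory. the function names, the .old_root directory, and every path in main are invented for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// overlayOptions builds the data string for an overlayfs mount.
func overlayOptions(lower, upper, work string) string {
	return fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", lower, upper, work)
}

// setupChild is meant to run inside the new namespaces. On success it never
// returns, because syscall.Exec replaces the process with the user's command.
func setupChild(lower, upper, work, merged, hostname string, argv []string) error {
	if err := syscall.Sethostname([]byte(hostname)); err != nil {
		return fmt.Errorf("sethostname: %w", err)
	}
	// Stop mount events from propagating back to the host; pivot_root also
	// refuses to run while the current root mount is shared.
	if err := syscall.Mount("", "/", "", syscall.MS_REC|syscall.MS_PRIVATE, ""); err != nil {
		return fmt.Errorf("make / private: %w", err)
	}
	opts := overlayOptions(lower, upper, work)
	if err := syscall.Mount("overlay", merged, "overlay", 0, opts); err != nil {
		return fmt.Errorf("mount overlay: %w", err)
	}
	// pivot_root needs a directory under the new root to park the old root in.
	oldRoot := filepath.Join(merged, ".old_root")
	if err := os.MkdirAll(oldRoot, 0o700); err != nil {
		return err
	}
	if err := syscall.PivotRoot(merged, oldRoot); err != nil {
		return fmt.Errorf("pivot_root: %w", err)
	}
	if err := os.Chdir("/"); err != nil {
		return err
	}
	// Mount /proc inside the new root so it shows the new pid namespace.
	if err := syscall.Mount("proc", "/proc", "proc", 0, ""); err != nil {
		return fmt.Errorf("mount /proc: %w", err)
	}
	// Detach the old root so the host filesystem is no longer reachable.
	if err := syscall.Unmount("/.old_root", syscall.MNT_DETACH); err != nil {
		return fmt.Errorf("unmount old root: %w", err)
	}
	if err := os.Remove("/.old_root"); err != nil {
		return err
	}
	// syscall.Exec does no PATH lookup, so argv[0] must be a path in the image.
	return syscall.Exec(argv[0], argv, os.Environ())
}

func main() {
	// usage, from inside the namespaces:
	//   prog child <lower> <upper> <work> <merged> <cmd> [args...]
	if len(os.Args) > 6 && os.Args[1] == "child" {
		err := setupChild(os.Args[2], os.Args[3], os.Args[4], os.Args[5],
			"localhorst", os.Args[6:])
		fmt.Fprintln(os.Stderr, "child setup failed:", err)
		os.Exit(1)
	}
	// With no arguments, just show what the overlay mount data looks like.
	fmt.Println(overlayOptions("/images/base", "/run/c1/upper", "/run/c1/work"))
}
```

the lower directory is the read-only image, the upper directory collects the container's writes, and the work directory is scratch space overlayfs insists on having on the same filesystem as the upper one.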
do not use this in production. do not use this in staging. do not let it within ten meters of an audit. i love you.
what's actually hard
cgroups v2 was a delight after years of fighting v1's hierarchies. user namespaces remain a special kind of nightmare: uid mappings, capability semantics, the small but persistent voice in the back of your head whispering 'are you sure'. networking, of course, is its own continent. veth pairs and a bridge get you a link and basically nothing else; routes, iptables rules, dns, all of it has to be built by hand. give up early and shell out to slirp4netns, which does user-mode networking for a network namespace without needing root. don't be a hero.
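for the "shell out to slirp4netns" part, a minimal sketch, assuming the slirp4netns binary is on your PATH and that you know the pid of a process living in the container's network namespace. the helper names slirpArgs and startSlirp and the device name tap0 are my own choices for this example; the flags are the ones slirp4netns documents.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// slirpArgs builds the slirp4netns argument list for the network namespace
// that pid lives in. tap is the name of the tap device to create inside it.
func slirpArgs(pid int, tap string) []string {
	return []string{
		"--configure",             // bring the tap interface up with an address
		"--mtu=65520",             // large mtu, as in the slirp4netns docs
		"--disable-host-loopback", // block connections to the host's 127.0.0.1
		strconv.Itoa(pid),
		tap,
	}
}

// startSlirp launches slirp4netns in the background and returns the running
// command so the caller can Wait on it or kill it when the container exits.
func startSlirp(pid int, tap string) (*exec.Cmd, error) {
	cmd := exec.Command("slirp4netns", slirpArgs(pid, tap)...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return cmd, nil
}

func main() {
	// usage: prog run <pid>; with no arguments, only print the command line.
	if len(os.Args) > 2 && os.Args[1] == "run" {
		pid, err := strconv.Atoi(os.Args[2])
		if err != nil {
			fmt.Fprintln(os.Stderr, "bad pid:", err)
			os.Exit(1)
		}
		cmd, err := startSlirp(pid, "tap0")
		if err != nil {
			fmt.Fprintln(os.Stderr, "start slirp4netns:", err)
			os.Exit(1)
		}
		if err := cmd.Wait(); err != nil {
			fmt.Fprintln(os.Stderr, "slirp4netns exited:", err)
			os.Exit(1)
		}
		return
	}
	fmt.Println("slirp4netns", slirpArgs(os.Getpid(), "tap0"))
}
```

to wire this into a runtime, the parent has to launch the container with cmd.Start() instead of cmd.Run(), so that it has a live child pid to hand to slirp4netns before it waits on the container.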