Dive into K8s: Hand-Rolling CNI with Kind, From CrashLoop to VXLAN Packet Capture
This weekend, I built a “networkless” cluster from scratch on Manjaro using kind, manually debugged missing kernel modules and CNI plugins, and finally witnessed the VXLAN encapsulation process with tcpdump. This post documents the entire journey.
In the cloud-native world, CNI (Container Network Interface) is often a blind spot for developers. We’re used to kubectl apply -f flannel.yaml for one-click setup, rarely exploring what’s happening underneath.
For engineers pivoting to AI Infra in particular, high-performance networking is the lifeblood of distributed training: if you don't even know where overlay network overhead comes from, you can't begin to think about RDMA or eBPF optimizations.
1. Environment Setup: Manufacturing the “Crime Scene”
To really understand it, I didn't use the default configuration. Instead, I deliberately disabled Kind's default CNI, simulating a "bare" cluster that has a skeleton but no nerves.
Environment: Manjaro Linux + Docker + Kind
kind-config.yaml:
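I won't reproduce the exact file, but a minimal version looks roughly like this (the pod subnet matches Flannel's default, and the two-node topology matches the experiment below):

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true      # skip kindnetd: a cluster with no CNI at all
  podSubnet: "10.244.0.0/16"   # Flannel's default pod CIDR
nodes:
  - role: control-plane
  - role: worker
```

The cluster is then created with kind create cluster --config kind-config.yaml.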
After the cluster started, the nodes were NotReady, as expected.
Running ip addr inside the nodes shows only lo and eth0, with no cni0 bridge or flannel.1 VXLAN interface. At this point, Kubernetes is like a patient in a vegetative state: the heartbeat is there (Kubelet is running) but it can't move (Pods can't communicate).
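For reference, this is roughly how I poked at the nodes (kind-control-plane is the default node name for a cluster named kind):

```bash
kubectl get nodes                        # both nodes report NotReady
docker exec kind-control-plane ip addr   # only lo and eth0; no cni0, no flannel.1
```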
2. Step-by-Step Debug Journey
I chose to install the classic Flannel plugin to “activate” networking, but the process wasn’t smooth.
1. Ghost Pods and Namespace Trap
After running the install command, I habitually checked the kube-system namespace: it was empty.
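The install and the check looked roughly like this (the manifest URL is the one documented by the flannel-io project; verify the current one in their README):

```bash
# Install Flannel
kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

# Old habit: look in kube-system -- nothing relevant there
kubectl get pods -n kube-system

# The DaemonSet and its Pods actually live in their own namespace
kubectl get pods -A | grep flannel     # -> kube-flannel
```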
Investigation: Checking the DaemonSet showed the DESIRED replica count was normal. It turns out that newer Flannel releases, for isolation, deploy into their own kube-flannel namespace.
Lesson: When troubleshooting missing resources, always use kubectl get pods -A.
2. Kernel Rejection: Missing br_netfilter
I found the Pods, but they were all in CrashLoopBackOff. The logs revealed the core error:
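Paraphrasing rather than quoting the log: Flannel failed its startup check because /proc/sys/net/bridge/bridge-nf-call-iptables didn't exist on the node, i.e. the br_netfilter module wasn't loaded. The logs can be pulled along these lines (the label selector is the one the stock manifest uses; adjust if yours differs):

```bash
kubectl get pods -n kube-flannel
# Read the logs of the crashing container (add --previous if it already restarted)
kubectl logs -n kube-flannel -l app=flannel
```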
Deep Dive:
This is the classic conflict between the Linux kernel and K8s networking. A Linux bridge operates at Layer 2 (data link) by default, bypassing iptables, but K8s Services (ClusterIP) rely heavily on iptables for NAT. The br_netfilter module's job is to bridge that gap, forcing traffic that crosses a bridge to also pass through iptables.
Solution (run on the Manjaro host):
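A sketch of the fix; the first two lines are the essential part, the third just makes it survive a reboot:

```bash
# Load the bridge netfilter module; /proc/sys/net/bridge/* appears once it's in
sudo modprobe br_netfilter

# Make sure traffic crossing a Linux bridge is handed to iptables
sudo sysctl -w net.bridge.bridge-nf-call-iptables=1

# Optional: load the module automatically on boot
echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf
```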
Since Kind nodes are containers sharing the host kernel, the module becomes available inside them as soon as it is loaded on the host.
3. Missing Construction Crew: CNI Chaining Failure
Flannel was finally Running, but my test Pod (Nginx) was stuck in ContainerCreating.
kubectl describe pod revealed a new problem:
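Paraphrasing the event rather than quoting it: sandbox creation failed because the CNI runtime couldn't find the bridge plugin under /opt/cni/bin. Checked roughly like this (nginx stands in for whatever the test Pod is called):

```bash
kubectl describe pod nginx
# Events (paraphrased): failed to create the Pod sandbox because the
# "bridge" plugin could not be found in /opt/cni/bin
```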
Deep Dive:
This involves CNI chaining. Flannel is just the "project manager": it handles subnet allocation (IPAM) and route synchronization, while the actual heavy lifting (creating the cni0 bridge, wiring up veth pairs) is delegated to the bridge plugin from the standard CNI plugins package. Kind's minimal node image doesn't ship these base binaries.
Solution: We need to manually “airdrop” these plugins.
- Download the official cni-plugins-linux-amd64 release package.
- Copy the extracted bridge, loopback, etc. binaries into the Kind node's /opt/cni/bin/ directory.
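A sketch of the airdrop. The release version is an assumption (pick the current one from the containernetworking/plugins releases page), and the node names are the defaults for a cluster literally named kind:

```bash
# Download and unpack the standard CNI plugin binaries
VERSION=v1.5.1
curl -LO "https://github.com/containernetworking/plugins/releases/download/${VERSION}/cni-plugins-linux-amd64-${VERSION}.tgz"
mkdir -p cni-bin
tar -xzf "cni-plugins-linux-amd64-${VERSION}.tgz" -C cni-bin

# Copy the binaries Flannel delegates to into every Kind node
for node in kind-control-plane kind-worker; do
  docker exec "${node}" mkdir -p /opt/cni/bin
  for plugin in bridge loopback host-local portmap; do
    docker cp "cni-bin/${plugin}" "${node}:/opt/cni/bin/"
  done
done
```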
After this, all Pods turned green (Running). Checking inside the nodes, subnet.env had been generated successfully and the network was connected.
3. Ultimate Dissection: Tcpdump on the Overlay Network
Getting the network connected is just the beginning. As someone heading toward AI Infra, I had to see what the packets actually look like, so I captured traffic on the control-plane node to observe cross-node communication.
Experiment Topology:
- Client: Netshoot Pod (on Control-plane node)
- Server: Nginx Pod (on Worker node)
Capture Command: Listen on the node's NIC for UDP port 8472 (Flannel's default VXLAN port on Linux; the IANA-assigned port is 4789).
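The capture looked like this (eth0 is the node's interface inside the Kind container; this assumes tcpdump is available on the node image or has been installed into it):

```bash
# On the control-plane node: watch the VXLAN-encapsulated traffic
docker exec kind-control-plane tcpdump -i eth0 -nn -v udp port 8472
```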
Capture Result Analysis:
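I won't paste the raw tcpdump lines; schematically, each captured frame decodes into two nested packets. The addresses below are illustrative (Kind's Docker network and Flannel's pod CIDR defaults):

```
Outer (Underlay) : IP 172.18.0.x -> 172.18.0.y, UDP dst port 8472, length 134
  VXLAN header   : VNI 1 (the flannel.1 device)
  Inner (Overlay): IP 10.244.a.b -> 10.244.c.d, ICMP echo request/reply, length 84
```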
Hardcore Conclusions:
With the -v flag, the "packet-in-packet" structure is crystal clear.
- Outer layer (Underlay): UDP communication between node IPs, target port 8472.
- Inner layer (Overlay): ICMP communication between Pod IPs.
- Performance Tax: The outer packet is 134 bytes, the inner packet 84 bytes. That's 50 bytes of encapsulation overhead per packet.
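As a sanity check on those numbers, the standard VXLAN-over-IPv4 framing accounts for exactly the difference:

```
Outer IPv4 header       20 bytes
Outer UDP header         8 bytes
VXLAN header             8 bytes
Inner Ethernet header   14 bytes
--------------------------------
Total                   50 bytes  (= 134 - 84)
```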
4. Summary
This hands-on exercise didn't just fix a CrashLoop; more importantly, it quantified the cost of an overlay network.
For web workloads, 50 bytes is negligible. But in large-model AI training (e.g., AllReduce), massive gradient synchronization is extremely latency-sensitive, and the encapsulation/decapsulation behind those 50 bytes, plus the extra per-packet CPU work, can become the bottleneck that caps GPU cluster performance.
This is why high-end AI Infra often abandons VXLAN and explores HostNetwork, MacVLAN, or even eBPF-based Cilium, in pursuit of near-zero-overhead networking.
Next Step: I'll replace Flannel with Cilium and try using Hubble to visualize the network topology.