
Dive into K8s: Hand-Rolling CNI with Kind, From CrashLoop to VXLAN Packet Capture

This weekend, I built a “networkless” cluster from scratch on Manjaro using kind, manually debugged missing kernel modules and CNI plugins, and finally witnessed the VXLAN encapsulation process with tcpdump. This post documents the entire journey.

In the cloud-native world, CNI (Container Network Interface) is often a blind spot for developers. We’re used to kubectl apply -f flannel.yaml for one-click setup, rarely exploring what’s happening underneath.

Especially for engineers pivoting to AI Infra, high-performance networking is the lifeblood of distributed training. If you don’t even know where the Overlay network overhead comes from, you can’t begin to think about RDMA or eBPF optimizations.

1. Environment Setup: Manufacturing the “Crime Scene”

To gain a deeper understanding, I didn’t use the default configuration. Instead, I forcibly disabled Kind’s default CNI, simulating a “bare” cluster: a skeleton with no nerves.

Environment: Manjaro Linux + Docker + Kind

kind-config.yaml:

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true # Key: disable default network, simulate bare metal
nodes:
- role: control-plane
- role: worker # Multi-node for observing cross-node communication
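
Creating the cluster is then a single command; the cluster name defaults to kind, which is why the node containers used later are called kind-control-plane and kind-worker:

kind create cluster --config kind-config.yaml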

After the cluster starts, the nodes are NotReady, as expected. Running ip addr inside the nodes shows only lo and eth0, with no cni0 or flannel.1 bridge interfaces. At this point, Kubernetes is like a patient in a vegetative state: the heartbeat is there (kubelet is running), but the body can’t move (Pods can’t communicate).
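
A quick way to confirm these symptoms (the docker exec container name assumes the default cluster name, kind):

kubectl get nodes                   # both nodes report NotReady
docker exec kind-worker ip addr     # only lo and eth0, no cni0 or flannel.1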

2. Step-by-Step Debug Journey

I chose to install the classic Flannel plugin to “activate” networking, but the process wasn’t smooth.
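
For reference, the install is the usual manifest apply; the URL below is the one the flannel-io project documents for this, so treat the exact path as an assumption that may change between releases:

kubectl apply -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml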

1. Ghost Pods and the Namespace Trap

After running the install command, I habitually checked the kube-system namespace, and it was empty.

kubectl get pods -n kube-system
# Output: No resources found.

Investigation: Checking the DaemonSet showed the DESIRED replica count was normal. It turns out that newer Flannel releases, for better isolation, have moved into their own kube-flannel namespace.

Lesson: When troubleshooting missing resources, always use kubectl get pods -A.
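
A quick pair of checks that would have located the “ghost” Pods immediately:

kubectl get pods -A | grep -i flannel   # search every namespace, not just kube-system
kubectl get daemonset -A                # the DaemonSet listing reveals its namespace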

2. Kernel Rejection: Missing br_netfilter

I found the Pods, but they were all in CrashLoopBackOff. The logs revealed the core error:

Failed to check br_netfilter: stat /proc/sys/net/bridge/bridge-nf-call-iptables: no such file or directory

Deep Dive: This is the classic conflict between the Linux kernel and K8s networking. A Linux bridge operates at Layer 2 (data link) by default, bypassing iptables. But K8s Services (ClusterIP) rely heavily on iptables for NAT. The br_netfilter module’s job is to bridge that gap, forcing traffic that crosses the bridge to go through iptables processing.

Solution (run on host Manjaro):

# Load the kernel module
sudo modprobe br_netfilter
# Enable the bridge-netfilter sysctl so bridged traffic is processed by iptables
echo 1 | sudo tee /proc/sys/net/bridge/bridge-nf-call-iptables
# Recreate the Flannel Pods so they pick up the change
kubectl delete pod -n kube-flannel --all

Since Kind nodes are containers that share the host kernel, the module becomes available inside the nodes as soon as it is loaded on the host.
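
To make this survive a host reboot, a sketch using the standard systemd drop-in paths (the file names here are my own choice):

# Load br_netfilter at boot
echo br_netfilter | sudo tee /etc/modules-load.d/br_netfilter.conf
# Persist the sysctl
echo 'net.bridge.bridge-nf-call-iptables = 1' | sudo tee /etc/sysctl.d/99-bridge-nf.conf
sudo sysctl --system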

3. Missing Construction Crew: CNI Chaining Failure

Flannel was finally Running, but my test Pod (Nginx) was stuck in ContainerCreating. kubectl describe pod revealed a new problem:

failed to find plugin "bridge" in path [/opt/cni/bin]

Deep Dive: This involves CNI chaining. Flannel is just the “project manager”, responsible for subnet allocation (IPAM) and route synchronization. The actual heavy lifting (creating the cni0 bridge, wiring up veth pairs) is done by the bridge plugin from the standard CNI plugins library. Kind’s minimal node image doesn’t include these base binaries.
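
You can see the chaining in the CNI config that Flannel writes into each node; the output shown in comments below is trimmed and only illustrative of its shape:

docker exec kind-worker cat /etc/cni/net.d/10-flannel.conflist
# {
#   "name": "cbr0",
#   "plugins": [
#     { "type": "flannel", "delegate": { "isDefaultGateway": true } },   <- delegates to the bridge plugin
#     { "type": "portmap", "capabilities": { "portMappings": true } }
#   ]
# }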

Solution: We need to manually “airdrop” these plugins.

  1. Download the official cni-plugins-linux-amd64 release package.
  2. Copy the extracted bridge, loopback, and other binaries into the Kind node’s /opt/cni/bin/ directory.
docker cp cni-plugins/. kind-worker:/opt/cni/bin/
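
Putting the two steps together, a rough sketch (the release version is an assumption, so pick the latest one, and copy to every node, not just the worker):

curl -LO https://github.com/containernetworking/plugins/releases/download/v1.5.1/cni-plugins-linux-amd64-v1.5.1.tgz
mkdir -p cni-plugins && tar -xzf cni-plugins-linux-amd64-v1.5.1.tgz -C cni-plugins
for node in kind-control-plane kind-worker; do
  docker cp cni-plugins/. "$node":/opt/cni/bin/
done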

After this, all Pods turned green (Running). Checking inside the nodes, subnet.env had been generated successfully and the network was connected.
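
One way to confirm this from the host (the subnet.env path is Flannel’s default):

docker exec kind-worker cat /run/flannel/subnet.env   # FLANNEL_NETWORK / FLANNEL_SUBNET values
docker exec kind-worker ip addr show flannel.1        # the VXLAN interface now exists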

3. Ultimate Dissection: Tcpdump on Overlay Network

A connected network is only the beginning. As someone aiming at AI Infra, I had to see what the packets actually look like. I captured packets on the control-plane node to observe cross-node traffic.

Experiment Topology:

  • Client: Netshoot Pod (on Control-plane node)
  • Server: Nginx Pod (on Worker node)
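
A minimal way to pin the two Pods to the right nodes; the Pod names and the control-plane toleration are illustrative choices, a sketch rather than an exact manifest:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: client
spec:
  nodeName: kind-control-plane   # pin to the control-plane node
  tolerations:
  - operator: Exists             # tolerate the control-plane taint
  containers:
  - name: netshoot
    image: nicolaka/netshoot
    command: ["sleep", "infinity"]
---
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  nodeName: kind-worker          # pin to the worker node
  containers:
  - name: nginx
    image: nginx
EOF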

Capture Command: Listen on the node’s eth0 for UDP port 8472 (the standard VXLAN port).

# Use nsenter on the host to enter the node container's network namespace for the capture (Pro Tip!)
PID=$(docker inspect --format '{{.State.Pid}}' kind-control-plane)
sudo nsenter -t $PID -n tcpdump -i eth0 port 8472 -n -v
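
In another terminal, the cross-node traffic is generated by pinging the Nginx Pod from the netshoot Pod (use whatever Pod IP kubectl get pods -o wide reports; 10.244.1.5 matches the capture below):

kubectl exec client -- ping -c 3 10.244.1.5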

Capture Result Analysis:

# Outer layer (The Envelope)
IP 172.29.0.2.59431 > 172.29.0.3.8472: OTV, flags [I] (0x08), overlay 0, instance 1 ... length 134

# Inner layer (The Letter)
IP 10.244.0.3 > 10.244.1.5: ICMP echo request ... length 64

Hardcore Conclusions: With the -v flag, the “packet-in-packet” structure is crystal clear.

  1. Outer layer (Underlay): UDP communication between node IPs, destination port 8472.
  2. Inner layer (Overlay): ICMP communication between Pod IPs.
  3. Performance Tax: The outer packet is 134 bytes, the inner packet 84 bytes, which means 50 bytes of encapsulation overhead per packet (see the breakdown after this list).
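
Those 50 bytes line up with standard VXLAN-over-IPv4 framing:

  • Inner packet: 20 B IP header + 8 B ICMP header + 56 B payload = 84 B
  • Encapsulation: 14 B inner Ethernet + 8 B VXLAN header + 8 B UDP header + 20 B outer IP header = 50 B
  • Outer packet: 84 B + 50 B = 134 B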

4. Summary

This hands-on exercise didn’t just fix CrashLoops; more importantly, it quantified the cost of an overlay network.

For web scenarios, 50 bytes is negligible. But in AI large model training (e.g., AllReduce), massive gradient sync is extremely latency-sensitive. These 50 bytes of encapsulation/decapsulation overhead plus CPU context switching could be the bottleneck limiting GPU cluster performance.

This is why in high-end AI Infra, we often abandon VXLAN and explore HostNetwork, MacVLAN, or even eBPF-based Cilium solutions, pursuing zero-overhead networking.

Next Step: I’ll replace Flannel with Cilium and try using Hubble to visualize the network topology.