Dive into K8s 02: Cilium Replaces Flannel — Goodbye VXLAN Tax, Hello eBPF Native Networking

In the last post, we hand-rolled Flannel in a Kind cluster and measured the 50-byte per-packet VXLAN encapsulation overhead with tcpdump. This time, we replace Flannel with Cilium, leverage eBPF to eliminate that “network tax” entirely, and use Hubble for traffic visualization.

1. Why Switch to Cilium?

1.1 Revisiting VXLAN’s Cost

In the previous post, we captured:

Outer UDP packet: 134 bytes
Inner ICMP packet: 84 bytes
Encapsulation overhead: 50 bytes (37%!)

For typical web apps, 50 bytes is negligible. But for large-model AI training:

  • AllReduce operations sync gradients across multiple GPUs
  • Each iteration can generate GB-level network traffic
  • Encap/decap CPU overhead consumes precious compute resources
  • Latency jitter means the slowest card drags down the entire training step

1.2 Cilium’s Secret Weapon: eBPF

Cilium uses eBPF (Extended Berkeley Packet Filter) to handle packets directly in kernel space, without going through traditional iptables or VXLAN encapsulation:

| Feature | Flannel (VXLAN) | Cilium (eBPF) |
| --- | --- | --- |
| Encapsulation overhead | 50 bytes/packet | 0 (direct routing) |
| NAT implementation | iptables (userspace-managed rules) | eBPF (in-kernel) |
| Network policies | Not implemented (needs a separate policy engine) | Native support |
| Observability | Needs external tools | Hubble built-in |
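These claims are easy to poke at yourself once Cilium is running (section 3). A couple of hedged examples, assuming the cilium-cli defaults (agent DaemonSet named cilium in kube-system; in the 1.14 series the in-pod binary is still called cilium):

# One BPF map entry per local Pod, keyed by Pod IP
kubectl -n kube-system exec ds/cilium -- cilium bpf endpoint list

# The eBPF load-balancing tables that stand in for kube-proxy's iptables rules
kubectl -n kube-system exec ds/cilium -- cilium bpf lb list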

2. Environment Prep: Clean Flannel Residue

2.1 Delete Flannel

# Delete Flannel DaemonSet and config
kubectl delete -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

# Wait for flannel pods to fully delete
kubectl get pods -n kube-flannel -w
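
A quick sanity check that the deletion actually finished (the namespace name matches the manifest above):

# Should eventually return "NotFound"
kubectl get namespace kube-flannel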

2.2 Clean Node Network Residue

This step is critical: leftover Flannel interfaces, CNI configs, and iptables rules will conflict with Cilium:

# Execute on each Kind node
for node in kind-control-plane kind-worker; do
  docker exec $node bash -c "
    # Delete flannel bridges
    ip link delete cni0 2>/dev/null || true
    ip link delete flannel.1 2>/dev/null || true
    # Clean CNI config
    rm -rf /etc/cni/net.d/*
    # Clean iptables rules
    iptables -F -t nat
    iptables -F -t filter
  "
done

War Story: The first time around I skipped this step; Cilium installed fine, but Pod networking was chaos. cilium status reported “BPF NodePort: Disabled” and Pods couldn’t communicate. I spent hours in the logs before discovering that the cni0 bridge hadn’t been cleaned up properly.
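
The takeaway: before installing Cilium, verify the old interfaces and CNI configs are really gone. A minimal check (node names match the cleanup loop above):

# Both commands should print nothing if the cleanup succeeded
for node in kind-control-plane kind-worker; do
  docker exec $node ip link show | grep -E 'cni0|flannel' || true
  docker exec $node ls -A /etc/cni/net.d/
done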

3. Installing Cilium

3.1 Using Cilium CLI

# Install Cilium CLI
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --fail --remote-name-all \
  https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin
rm cilium-linux-amd64.tar.gz

# Install Cilium (native routing mode, disable encapsulation)
cilium install --version 1.14.5 \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set kubeProxyReplacement=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

Key parameters explained (the check after this list confirms they took effect):

  • routingMode=native: Use direct routing instead of encapsulation
  • autoDirectNodeRoutes=true: Each node installs routes to the other nodes’ Pod CIDRs, which is what makes native routing work on Kind’s flat Docker network
  • kubeProxyReplacement=true: Completely replace kube-proxy with eBPF
  • hubble.relay.enabled=true: Enable Hubble observability
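
A quick way to confirm the settings landed (these cilium-cli and agent commands exist in the 1.14-era tooling; exact output wording varies by version):

# ConfigMap values as Cilium recorded them
cilium config view | grep -E 'routing-mode|auto-direct-node-routes'

# kube-proxy replacement status reported by the agent
kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement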

3.2 Wait and Verify

# Wait for Cilium ready
cilium status --wait

# Run connectivity test
cilium connectivity test

Success output:

All 46 tests (325 actions) successful, 2 tests skipped, 1 scenario skipped.
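
The connectivity test leaves its workloads behind (in the cilium-cli versions I have used they land in a cilium-test namespace); clean them up before moving on:

kubectl delete namespace cilium-test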

4. eBPF Packet Capture: Witnessing Zero Encapsulation

4.1 Deploy Test Pods

# Deploy a client Pod and an nginx server Pod (for a cross-node test they should land on different nodes)
kubectl run client --image=nicolaka/netshoot --command -- sleep infinity
kubectl run server --image=nginx

# Wait for ready
kubectl wait --for=condition=Ready pod/client pod/server
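
Section 4.2 filters on TCP and section 4.3 captures ICMP, so generate a bit of both from the client first (and re-run the ping while tcpdump is capturing in 4.3). A small sketch; the server IP shown in the captures below will differ on your cluster, so look it up instead of hard-coding it:

# Resolve the server Pod IP, then send ICMP and TCP traffic to it
SERVER_IP=$(kubectl get pod server -o jsonpath='{.status.podIP}')
kubectl exec client -- ping -c 3 $SERVER_IP
kubectl exec client -- curl -s -o /dev/null -w '%{http_code}\n' http://$SERVER_IP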

4.2 Observe Traffic with Hubble

# Forward the Hubble Relay port so the hubble CLI can reach it
cilium hubble port-forward &

# Observe traffic
hubble observe --pod client --protocol TCP
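
Note that hubble here is a separate client binary from the cilium CLI. If it isn’t installed yet, it can be pulled down the same way as the cilium CLI above (this follows the same documented pattern and assumes linux/amd64):

HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --fail --remote-name-all \
  https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz
sudo tar xzvfC hubble-linux-amd64.tar.gz /usr/local/bin
rm hubble-linux-amd64.tar.gz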

4.3 Compare Capture Results

In Cilium’s Native Routing mode:

# Capture on node
PID=$(docker inspect --format '{{.State.Pid}}' kind-control-plane)
sudo nsenter -t $PID -n tcpdump -i eth0 host 10.244.1.5 -n

Capture result:

IP 10.244.0.3 > 10.244.1.5: ICMP echo request, id 1, seq 1, length 64
IP 10.244.1.5 > 10.244.0.3: ICMP echo reply, id 1, seq 1, length 64

Key Findings:

  • No UDP encapsulation! We see Pod-IP-to-Pod-IP ICMP packets directly
  • The packet is only 84 bytes on the wire, versus 134 bytes with Flannel’s VXLAN
  • Encapsulation overhead: 0 bytes
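
Where did the VXLAN tunnel go? With autoDirectNodeRoutes=true, each node simply carries a route to the other node’s Pod CIDR via that node’s IP, which you can see in the node’s routing table (CIDRs and addresses will differ per cluster):

# Routes to the remote Pod CIDR point straight at the peer node's IP,
# with no flannel.1 / VXLAN device in the path
docker exec kind-control-plane ip route | grep 10.244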

5. Hubble Visualization: Network Topology at a Glance

5.1 Launch Hubble UI

cilium hubble ui

Then open http://localhost:12000 in your browser:

What Hubble UI shows:

  • Real-time traffic relationship graph
  • Protocol, port, bytes for each connection
  • Network policy hit status
  • DNS query tracing

5.2 Hubble CLI in Action

# View all dropped packets (handy for debugging network policy issues)
hubble observe --verdict DROPPED

# Trace all connections for specific Pod
hubble observe --pod kube-system/coredns --follow

# Export to JSON for analysis
hubble observe --output json > network-flows.json

6. Performance Comparison Data

A simple iperf3 test on the Kind cluster:

| Metric | Flannel (VXLAN) | Cilium (Native) | Improvement |
| --- | --- | --- | --- |
| Throughput | 8.2 Gbps | 9.4 Gbps | +15% |
| Latency (P99) | 0.42 ms | 0.31 ms | -26% |
| CPU usage | 12% | 5% | -58% |

Note: Test environment is limited; real production improvements may be greater.
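
For reference, a minimal sketch of how this kind of test can be reproduced. The Pod names are illustrative, nicolaka/netshoot ships iperf3, and placement is left to the scheduler, so check that client and server actually land on different nodes:

# Start an iperf3 server Pod, then run a 30-second client against it
kubectl run iperf-server --image=nicolaka/netshoot --command -- iperf3 -s
kubectl wait --for=condition=Ready pod/iperf-server
SERVER_IP=$(kubectl get pod iperf-server -o jsonpath='{.status.podIP}')
kubectl run iperf-client --image=nicolaka/netshoot --restart=Never \
  --command -- iperf3 -c $SERVER_IP -t 30
kubectl logs -f iperf-client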

7. Implications for AI Infra

7.1 Why Does AI Training Need Cilium More?

  1. AllReduce traffic characteristics: Lots of small packets, bursty traffic, latency-sensitive
  2. Ring/Tree AllReduce topologies make every node a traffic hotspot
  3. GPU time is precious: Network wait = GPU idle = burning money

7.2 Going Further: RDMA and GPUDirect

Cilium is just the first step. True high-end AI Infra needs:

  • RDMA over Converged Ethernet (RoCE): Bypass kernel, direct memory access
  • GPUDirect RDMA: GPU directly reads/writes remote GPU memory

Cilium’s native routing mode paves the way for these technologies by removing the constraints of an overlay network.

8. Summary

| Step | Takeaway |
| --- | --- |
| Clean Flannel | Deep understanding of CNI plugin architecture |
| Install Cilium | Master eBPF network mode configuration |
| Compare captures | Quantify elimination of encapsulation overhead |
| Hubble visualization | Gain production-grade network observability |

Next Plan: Deploy PyTorch distributed training on a multi-node cluster and compare the actual training-throughput impact of Flannel vs. Cilium.

