Dive into K8s 02: Cilium Replaces Flannel — Goodbye VXLAN Tax, Hello eBPF Native Networking

In the last post, we hand-rolled Flannel in a Kind cluster and measured the 50-byte per-packet VXLAN encapsulation overhead with tcpdump. This time, we replace Flannel with Cilium, leverage eBPF to eliminate that “network tax” entirely, and use Hubble for traffic visualization.

1. Why Switch to Cilium?

1.1 Revisiting VXLAN’s Cost

In the previous post, we captured:

Outer UDP packet: 134 bytes
Inner ICMP packet: 84 bytes
Encapsulation overhead: 50 bytes (37%!)

For typical web apps, 50 bytes is negligible. But for large-model AI training:

  • AllReduce operations sync gradients across multiple GPUs
  • Each iteration can generate GB-level network traffic
  • Encap/decap CPU overhead consumes precious compute resources
  • Latency jitter means the slowest card drags down the entire training step

1.2 Cilium’s Secret Weapon: eBPF

Cilium uses eBPF (Extended Berkeley Packet Filter) to handle packets directly in kernel space, without going through traditional iptables or VXLAN encapsulation:

| Feature | Flannel (VXLAN) | Cilium (eBPF) |
| --- | --- | --- |
| Encapsulation overhead | 50 bytes/packet | 0 (direct routing) |
| NAT implementation | iptables (userspace-managed rules) | eBPF (in-kernel) |
| Network policies | Not implemented (needs a separate policy engine) | Native support |
| Observability | Needs external tools | Hubble built-in |
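These claims are easy to poke at yourself once Cilium is running (section 3). A couple of hedged examples, assuming the cilium-cli defaults (agent DaemonSet named cilium in kube-system; in the 1.14 series the in-pod binary is still called cilium):

# One BPF map entry per local Pod, keyed by Pod IP
kubectl -n kube-system exec ds/cilium -- cilium bpf endpoint list

# The eBPF load-balancing tables that stand in for kube-proxy's iptables rules
kubectl -n kube-system exec ds/cilium -- cilium bpf lb list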

2. Environment Prep: Clean Flannel Residue

2.1 Delete Flannel

# Delete Flannel DaemonSet and config
kubectl delete -f https://github.com/flannel-io/flannel/releases/latest/download/kube-flannel.yml

# Wait for flannel pods to fully delete
kubectl get pods -n kube-flannel -w
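
A quick sanity check that the deletion actually finished (the namespace name matches the manifest above):

# Should eventually return "NotFound"
kubectl get namespace kube-flannel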

2.2 Clean Node Network Residue

This step is critical: leftover Flannel interfaces, CNI configs, and iptables rules will conflict with Cilium:

# Execute on each Kind node
for node in kind-control-plane kind-worker; do
  docker exec $node bash -c "
    # Delete flannel bridges
    ip link delete cni0 2>/dev/null || true
    ip link delete flannel.1 2>/dev/null || true
    # Clean CNI config
    rm -rf /etc/cni/net.d/*
    # Clean iptables rules
    iptables -F -t nat
    iptables -F -t filter
  "
done

War Story: The first time around I skipped this step; Cilium installed fine, but Pod networking was chaos. cilium status reported “BPF NodePort: Disabled” and Pods couldn’t communicate. I spent hours in the logs before discovering that the cni0 bridge hadn’t been cleaned up properly.
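
The takeaway: before installing Cilium, verify the old interfaces and CNI configs are really gone. A minimal check (node names match the cleanup loop above):

# Both commands should print nothing if the cleanup succeeded
for node in kind-control-plane kind-worker; do
  docker exec $node ip link show | grep -E 'cni0|flannel' || true
  docker exec $node ls -A /etc/cni/net.d/
done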

3. Installing Cilium

3.1 Using Cilium CLI

# Install Cilium CLI
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
curl -L --fail --remote-name-all \
  https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-amd64.tar.gz
sudo tar xzvfC cilium-linux-amd64.tar.gz /usr/local/bin
rm cilium-linux-amd64.tar.gz

# Install Cilium (native routing mode, disable encapsulation)
cilium install --version 1.14.5 \
  --set routingMode=native \
  --set autoDirectNodeRoutes=true \
  --set kubeProxyReplacement=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true

Key parameters explained (the check after this list confirms they took effect):

  • routingMode=native: Use direct routing instead of encapsulation
  • autoDirectNodeRoutes=true: Each node installs routes to the other nodes’ Pod CIDRs, which is what makes native routing work on Kind’s flat Docker network
  • kubeProxyReplacement=true: Completely replace kube-proxy with eBPF
  • hubble.relay.enabled=true: Enable Hubble observability
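
A quick way to confirm the settings landed (these cilium-cli and agent commands exist in the 1.14-era tooling; exact output wording varies by version):

# ConfigMap values as Cilium recorded them
cilium config view | grep -E 'routing-mode|auto-direct-node-routes'

# kube-proxy replacement status reported by the agent
kubectl -n kube-system exec ds/cilium -- cilium status | grep KubeProxyReplacement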

3.2 Wait and Verify

# Wait for Cilium ready
cilium status --wait

# Run connectivity test
cilium connectivity test

Success output:

All 46 tests (325 actions) successful, 2 tests skipped, 1 scenario skipped.
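
The connectivity test leaves its workloads behind (in the cilium-cli versions I have used they land in a cilium-test namespace); clean them up before moving on:

kubectl delete namespace cilium-test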

4. eBPF Packet Capture: Witnessing Zero Encapsulation

4.1 Deploy Test Pods

# Deploy a client Pod and an nginx server Pod (for a cross-node test they should land on different nodes)
kubectl run client --image=nicolaka/netshoot --command -- sleep infinity
kubectl run server --image=nginx

# Wait for ready
kubectl wait --for=condition=Ready pod/client pod/server
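
Section 4.2 filters on TCP and section 4.3 captures ICMP, so generate a bit of both from the client first (and re-run the ping while tcpdump is capturing in 4.3). A small sketch; the server IP shown in the captures below will differ on your cluster, so look it up instead of hard-coding it:

# Resolve the server Pod IP, then send ICMP and TCP traffic to it
SERVER_IP=$(kubectl get pod server -o jsonpath='{.status.podIP}')
kubectl exec client -- ping -c 3 $SERVER_IP
kubectl exec client -- curl -s -o /dev/null -w '%{http_code}\n' http://$SERVER_IP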

4.2 Observe Traffic with Hubble

# Forward the Hubble Relay port so the hubble CLI can reach it
cilium hubble port-forward &

# Observe traffic
hubble observe --pod client --protocol TCP
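
Note that hubble here is a separate client binary from the cilium CLI. If it isn’t installed yet, it can be pulled down the same way as the cilium CLI above (this follows the same documented pattern and assumes linux/amd64):

HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
curl -L --fail --remote-name-all \
  https://github.com/cilium/hubble/releases/download/${HUBBLE_VERSION}/hubble-linux-amd64.tar.gz
sudo tar xzvfC hubble-linux-amd64.tar.gz /usr/local/bin
rm hubble-linux-amd64.tar.gz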

4.3 Compare Capture Results

In Cilium’s Native Routing mode:

# Capture on node
PID=$(docker inspect --format '{{.State.Pid}}' kind-control-plane)
sudo nsenter -t $PID -n tcpdump -i eth0 host 10.244.1.5 -n

Capture result:

IP 10.244.0.3 > 10.244.1.5: ICMP echo request, id 1, seq 1, length 64
IP 10.244.1.5 > 10.244.0.3: ICMP echo reply, id 1, seq 1, length 64

Key Findings:

  • No UDP encapsulation! We see Pod-IP-to-Pod-IP ICMP packets directly
  • The packet is only 84 bytes on the wire, versus 134 bytes with Flannel’s VXLAN
  • Encapsulation overhead: 0 bytes
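
Where did the VXLAN tunnel go? With autoDirectNodeRoutes=true, each node simply carries a route to the other node’s Pod CIDR via that node’s IP, which you can see in the node’s routing table (CIDRs and addresses will differ per cluster):

# Routes to the remote Pod CIDR point straight at the peer node's IP,
# with no flannel.1 / VXLAN device in the path
docker exec kind-control-plane ip route | grep 10.244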

5. Hubble Visualization: Network Topology at a Glance

5.1 Launch Hubble UI

cilium hubble ui

Then open http://localhost:12000 in your browser:

What Hubble UI shows:

  • Real-time traffic relationship graph
  • Protocol, port, bytes for each connection
  • Network policy hit status
  • DNS query tracing

5.2 Hubble CLI in Action

# View all dropped packets (handy for debugging network policy issues)
hubble observe --verdict DROPPED

# Trace all connections for specific Pod
hubble observe --pod kube-system/coredns --follow

# Export to JSON for analysis
hubble observe --output json > network-flows.json

6. Performance Comparison Data

A simple iperf3 test on the Kind cluster:

| Metric | Flannel (VXLAN) | Cilium (Native) | Improvement |
| --- | --- | --- | --- |
| Throughput | 8.2 Gbps | 9.4 Gbps | +15% |
| Latency (P99) | 0.42 ms | 0.31 ms | -26% |
| CPU usage | 12% | 5% | -58% |

Note: Test environment is limited; real production improvements may be greater.
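
For reference, a minimal sketch of how this kind of test can be reproduced. The Pod names are illustrative, nicolaka/netshoot ships iperf3, and placement is left to the scheduler, so check that client and server actually land on different nodes:

# Start an iperf3 server Pod, then run a 30-second client against it
kubectl run iperf-server --image=nicolaka/netshoot --command -- iperf3 -s
kubectl wait --for=condition=Ready pod/iperf-server
SERVER_IP=$(kubectl get pod iperf-server -o jsonpath='{.status.podIP}')
kubectl run iperf-client --image=nicolaka/netshoot --restart=Never \
  --command -- iperf3 -c $SERVER_IP -t 30
kubectl logs -f iperf-client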

7. Implications for AI Infra

7.1 Why Does AI Training Need Cilium More?

  1. AllReduce traffic characteristics: Lots of small packets, bursty traffic, latency-sensitive
  2. Ring/Tree AllReduce topologies make every node a traffic hotspot
  3. GPU time is precious: Network wait = GPU idle = burning money

7.2 Going Further: RDMA and GPUDirect

Cilium is just the first step. True high-end AI Infra needs:

  • RDMA over Converged Ethernet (RoCE): Bypass kernel, direct memory access
  • GPUDirect RDMA: GPU directly reads/writes remote GPU memory

Cilium’s native routing mode paves the way for these technologies by removing the constraints of an overlay network.

8. Summary

| Step | Takeaway |
| --- | --- |
| Clean Flannel | Deep understanding of CNI plugin architecture |
| Install Cilium | Master eBPF network mode configuration |
| Compare captures | Quantify elimination of encapsulation overhead |
| Hubble visualization | Gain production-grade network observability |

Next Plan: Deploy PyTorch distributed training on a multi-node cluster and compare the actual training-throughput impact of Flannel vs. Cilium.

