One night at 2 AM, Kafka message lag alerts fired. Investigation revealed a Zookeeper cluster split-brain — two nodes both thought they were the Leader. This post documents the full debugging and recovery process.
1. Incident Symptoms
1.1 Alert Messages
```
[CRITICAL] Kafka consumer lag > 100000
[WARN] Kafka broker 1 lost connection to zk
[WARN] Kafka broker 2 lost connection to zk
```
1.2 Initial Check
```bash
# Check Kafka status
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups

# Result: many consumers showing "UNKNOWN"
```
Kafka relies on Zookeeper for metadata management and Leader election; if ZK is unavailable, the Kafka cluster is effectively down with it.
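Before touching ZK itself, it is worth confirming what the brokers actually see. A minimal sketch (the hosts are the ones from the topology in section 2.1; `zookeeper-shell.sh` ships with the Kafka distribution):

```bash
# Can we reach each ZK node on the client port, and is a Kafka controller registered?
for zk in 192.168.1.1 192.168.1.2 192.168.1.3; do
  printf '%s ruok -> ' "$zk"
  echo ruok | nc -w 2 "$zk" 2181 || printf 'unreachable'
  echo
done

# Kafka's bundled ZK client; /controller holds the id of the current controller broker
zookeeper-shell.sh 192.168.1.1:2181 get /controller
```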
2. Zookeeper State Investigation
2.1 Cluster Topology
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    ZK-1     │     │    ZK-2     │     │    ZK-3     │
│ 192.168.1.1 │     │ 192.168.1.2 │     │ 192.168.1.3 │
└─────────────┘     └─────────────┘     └─────────────┘
       ↑                   ↑                   ↑
       └──────── Kafka Brokers connect ────────┘
```
2.2 Check Each Node’s Status
```bash
# Normally this should show 1 leader + 2 followers
for host in 192.168.1.{1,2,3}; do
  echo "=== $host ==="
  echo stat | nc $host 2181 | grep Mode
done
```
Abnormal output:
```
=== 192.168.1.1 ===
Mode: leader
=== 192.168.1.2 ===
Mode: leader        ← Two Leaders! Split-brain!
=== 192.168.1.3 ===
Error: Connection refused
```
2.3 Check Logs
```bash
# ZK-1 logs
tail -100 /var/log/zookeeper/zookeeper.log

# Found:
WARN  [QuorumPeer] - Cannot open channel to 3 at election address /192.168.1.3:3888
ERROR [QuorumPeer] - Unexpected exception causing shutdown while sock still open
```
```bash
# ZK-3 logs: the process itself was dead
systemctl status zookeeper
# Active: failed (Result: exit-code)
```
3. Root Cause Analysis
3.1 What is Split-Brain?
Normal Zookeeper elections require Quorum (majority) agreement:
```
3-node cluster: Quorum = 2
5-node cluster: Quorum = 3
```
Split-brain scenario:
```
Before network partition:
  ZK-1(Leader) ←→ ZK-2(Follower) ←→ ZK-3(Follower)

After network partition:
  Partition A: ZK-1, ZK-2        Partition B: ZK-3

  ZK-1: "I still have 2 votes, I stay Leader"
  ZK-3: "Can't hear the Leader's heartbeat, starting an election"
        → Only 1 vote, can't elect a new Leader

Correct behavior: ZK-3 has no Quorum, so it stops serving requests instead of becoming a second Leader
```
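This is exactly the situation a watchdog should catch: more than one node answering "leader" at the same time. A minimal sketch of such a check, assuming `stat` is in the four-letter-word whitelist (see section 5.3):

```bash
#!/usr/bin/env bash
# Count how many ensemble members currently report Mode: leader.
hosts=(192.168.1.1 192.168.1.2 192.168.1.3)
leaders=0
for h in "${hosts[@]}"; do
  mode=$(echo stat | nc -w 2 "$h" 2181 | awk '/^Mode:/ {print $2}')
  echo "$h -> ${mode:-unreachable}"
  if [ "$mode" = "leader" ]; then
    leaders=$((leaders + 1))
  fi
done
if [ "$leaders" -ne 1 ]; then
  echo "ALERT: $leaders leader(s) detected, expected exactly 1" >&2
  exit 1
fi
```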
But this incident was different…
3.2 Root Cause
Check why ZK-3 died:
```bash
# System logs
journalctl -u zookeeper --since "2021-10-14 02:00"

# Found:
Out of memory: Kill process 12345 (java) score 800
```
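Since one node had already been OOM-killed silently, it is worth sweeping the whole ensemble for the same pattern. A quick sketch, assuming SSH access and journald as in the commands above:

```bash
# Look for kernel OOM-killer activity on every ZK host since the incident window
for host in 192.168.1.{1,2,3}; do
  echo "=== $host ==="
  ssh "$host" 'journalctl -k --since "2021-10-14 00:00" | grep -i "out of memory" || echo "no OOM kills found"'
done
```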
The truth: ZK-3 was killed by the OOM Killer, not taken down by a network partition!
Timeline reconstruction:
```
02:15  ZK-3 killed by the OOM Killer
02:16  ZK-1 and ZK-2 notice ZK-3 is gone
02:17  ZK-1 remains Leader (2 nodes still form a Quorum)
02:30  Network hiccup between ZK-1 and ZK-2
02:31  ZK-2 loses ZK-1's heartbeat, assumes the Leader is dead
02:32  ZK-2 starts an election... but there is no Quorum!
02:33  ZK-1 reconnects, but ZK-2 already thinks it is the Leader
```
Key issue: ZK-2 didn't properly step down after the network recovered.
4. Emergency Recovery
4.1 Restore Healthy Cluster
```bash
# 1. Stop all ZK nodes
for host in 192.168.1.{1,2,3}; do
  ssh $host "systemctl stop zookeeper"
done

# 2. Check data consistency
for host in 192.168.1.{1,2}; do
  ssh $host "cat /var/lib/zookeeper/version-2/currentEpoch"
done
# Ensure the epochs are consistent

# 3. Restore ZK-3 first
ssh 192.168.1.3 "systemctl start zookeeper"
# Wait ~10s for startup

# 4. Start the other nodes sequentially
ssh 192.168.1.1 "systemctl start zookeeper"
sleep 5
ssh 192.168.1.2 "systemctl start zookeeper"
```
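Fixed sleeps work, but gating each start on a health check is less error-prone. A hedged variant of steps 3 and 4, assuming `ruok` is whitelisted (note that `ruok` only confirms the process answers on the client port, not that it has joined the quorum; a stricter gate could grep `Mode` from `stat`):

```bash
# Start a node and wait until it responds to ruok before starting the next one
start_and_wait() {
  local host=$1
  ssh "$host" "systemctl start zookeeper"
  for _ in $(seq 1 30); do
    if [ "$(echo ruok | nc -w 2 "$host" 2181)" = "imok" ]; then
      echo "$host is up"
      return 0
    fi
    sleep 2
  done
  echo "$host did not come up in time" >&2
  return 1
}

start_and_wait 192.168.1.3   # the node that was down rejoins first
start_and_wait 192.168.1.1
start_and_wait 192.168.1.2
```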
4.2 Verify Recovery
```bash
# Check cluster status
for host in 192.168.1.{1,2,3}; do
  echo "=== $host ==="
  echo stat | nc $host 2181 | grep Mode
done

# Expected output:
=== 192.168.1.1 ===
Mode: follower
=== 192.168.1.2 ===
Mode: follower
=== 192.168.1.3 ===
Mode: leader

# Check Kafka recovery
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
```
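Consumer lag can take a while to drain even after ZK and the brokers are healthy again. A rough watch loop, assuming the default `kafka-consumer-groups.sh --describe` output where LAG is the sixth column:

```bash
# Poll total lag across all groups until it falls below a threshold
while true; do
  lag=$(kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
          --describe --all-groups 2>/dev/null \
        | awk '$6 ~ /^[0-9]+$/ {sum += $6} END {print sum + 0}')
  echo "$(date '+%H:%M:%S') total lag: $lag"
  [ "$lag" -lt 1000 ] && break
  sleep 30
done
```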
5. Prevention Measures
5.1 Memory Configuration
```bash
# JVM heap: /etc/zookeeper/java.env
export JVMFLAGS="-Xmx2g -Xms2g"

# System level: /etc/sysctl.conf
vm.swappiness=1        # Avoid swapping the ZK heap out

# Lower the OOM-kill priority of the running ZK process
echo -1000 > /proc/$(pgrep -f zookeeper)/oom_score_adj
```
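One caveat: writing to /proc/&lt;pid&gt;/oom_score_adj only lasts until the process restarts. To make the protection persistent, a systemd drop-in is the cleaner option (a sketch; the unit name zookeeper.service is an assumption, adjust it to your install):

```bash
# Persist the OOM protection across restarts via a systemd drop-in
mkdir -p /etc/systemd/system/zookeeper.service.d
cat > /etc/systemd/system/zookeeper.service.d/override.conf <<'EOF'
[Service]
# Strongly deprioritize the ZK process for the kernel OOM killer
OOMScoreAdjust=-1000
EOF
systemctl daemon-reload
systemctl restart zookeeper
```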
5.2 Monitoring Alerts
```yaml
# Prometheus alert rules
groups:
  - name: zookeeper
    rules:
      - alert: ZookeeperDown
        expr: up{job="zookeeper"} == 0
        for: 1m
        labels:
          severity: critical

      - alert: ZookeeperNoLeader
        expr: sum(zk_server_leader) == 0
        for: 1m
        labels:
          severity: critical

      - alert: ZookeeperTooManyLeaders
        expr: sum(zk_server_leader) > 1
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Split-brain! Multiple Leaders detected"
```
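These rules assume a `zk_server_leader` metric already exists. If no exporter is in place, one low-tech way to produce it is to translate `mntr` output into node_exporter textfile-collector metrics (a sketch; the textfile directory path is an assumption and must match your node_exporter configuration):

```bash
#!/usr/bin/env bash
# Cron this on every ZK host: turn `mntr` output into a Prometheus gauge.
TEXTFILE_DIR=/var/lib/node_exporter/textfile_collector   # assumption: node_exporter textfile collector dir
state=$(echo mntr | nc -w 2 localhost 2181 | awk '/^zk_server_state/ {print $2}')
leader=0
if [ "$state" = "leader" ]; then
  leader=1
fi
# Write atomically so node_exporter never reads a half-written file
cat > "$TEXTFILE_DIR/zookeeper.prom.$$" <<EOF
# HELP zk_server_leader 1 if this node currently reports itself as leader
# TYPE zk_server_leader gauge
zk_server_leader $leader
EOF
mv "$TEXTFILE_DIR/zookeeper.prom.$$" "$TEXTFILE_DIR/zookeeper.prom"
```

ZooKeeper 3.6+ can also expose Prometheus metrics natively through its built-in metrics provider, which removes the need for a script like this.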
5.3 Cluster Configuration Tuning
```properties
# zoo.cfg

# Heartbeat interval (ms)
tickTime=2000
# Ticks a Follower may take to connect to and sync with the Leader at startup
initLimit=10
# Ticks a Follower may fall behind the Leader before it is dropped
syncLimit=5
# Four-letter-word whitelist (for monitoring)
4lw.commands.whitelist=stat,ruok,mntr,envi
# Don't let the Leader serve client requests, so it can focus on coordination
leaderServes=no
```
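For reference, these settings sit alongside the ensemble member list, which is also where the 2888 (quorum) and 3888 (election) ports checked in section 5.4 come from. The myid values below are assumptions matching the topology in section 2.1:

```properties
# zoo.cfg (continued): server.<myid>=<host>:<quorum port>:<election port>
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888
```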
5.4 Network Configuration
```bash
# Ensure ZK node-to-node latency is low (well under 100ms)
ping -c 10 192.168.1.2
# rtt min/avg/max = 0.5/0.8/1.2 ms ✓

# Check firewall: client port (2181), quorum port (2888), election port (3888)
for port in 2181 2888 3888; do
  echo "Port $port:"
  nc -zv 192.168.1.2 $port
done
```
6. Understanding ZAB Protocol
6.1 Election Process
```
Phase 1: Leader Election
  - Each node votes for itself
  - Collect votes from the other nodes
  - Compare (epoch, zxid, myid)
  - Switch the vote to the better candidate
  - The node that gathers Quorum votes wins

Phase 2: Discovery
  - Leader collects the latest zxid from the Followers
  - Determines which data needs to be synced

Phase 3: Synchronization
  - Leader syncs data to the Followers
  - All Followers reach a consistent state

Phase 4: Broadcast
  - Leader starts accepting client requests
  - Updates are broadcast with a 2PC-like propose/ack/commit protocol
```
6.2 Why Quorum?
```
3 nodes: Quorum = 2
5 nodes: Quorum = 3
7 nodes: Quorum = 4

Formula: Quorum = floor(N/2) + 1

Any two Quorums must intersect
→ Guarantees data consistency
→ Prevents split-brain
```
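The floor division is why even node counts buy nothing: 3 and 4 nodes both tolerate exactly one failure. A two-line check of the formula:

```bash
# Quorum = floor(N/2) + 1; fault tolerance = N - Quorum
for n in 3 4 5 6 7; do
  echo "$n nodes -> quorum $(( n / 2 + 1 )), tolerates $(( n - (n / 2 + 1) )) failure(s)"
done
```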
6.3 Best Practices: Node Count
| Nodes | Fault Tolerance | Recommended Use |
|---|---|---|
| 1 | 0 | Dev only |
| 3 | 1 | Small production |
| 5 | 2 | Medium/large production |
| 7 | 3 | Extreme HA requirements |
7. Troubleshooting Toolbox
7.1 Four-Letter Commands
```bash
# Check node status
echo stat | nc localhost 2181

# Health check
echo ruok | nc localhost 2181
# Returns "imok"

# Metrics
echo mntr | nc localhost 2181
# zk_version 3.6.3
# zk_server_state leader
# zk_num_alive_connections 10
# zk_outstanding_requests 0

# Environment info
echo envi | nc localhost 2181
```
7.2 zkCli Operations
```bash
# Connect to the cluster
zkCli.sh -server 192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181

# View Kafka metadata (inside the zkCli shell)
ls /brokers/ids
get /controller

# Check ACLs
getAcl /kafka
```
7.3 Log Analysis Tips
```bash
# Find election-related logs
grep -E "LOOKING|LEADING|FOLLOWING|Election" zookeeper.log

# Find connection issues
grep "Cannot open channel" zookeeper.log

# Find session timeouts
grep "Session expired" zookeeper.log
```
8. Summary
| Step | Action |
|---|---|
| Detect | Monitoring alerts + Kafka anomalies |
| Locate | Four-letter commands to check each node’s Mode |
| Analyze | ZK logs + system logs to find root cause |
| Fix | Ordered restart, prioritize Quorum recovery |
| Prevent | Monitoring + memory tuning + network checks |
Biggest lessons:
- Zookeeper is memory-sensitive — configure JVM heap properly
- Set OOM protection so ZK process isn’t easily killed
- Monitor Leader count — alert if more than 1
- Practice failure recovery procedures regularly