Postmortem: Debugging a Zookeeper Split-Brain in Production

One night at 2 AM, Kafka message lag alerts fired. Investigation revealed a Zookeeper cluster split-brain — two nodes both thought they were the Leader. This post documents the full debugging and recovery process.

1. Incident Symptoms

1.1 Alert Messages

[CRITICAL] Kafka consumer lag > 100000
[WARN] Kafka broker 1 lost connection to zk
[WARN] Kafka broker 2 lost connection to zk

1.2 Initial Check

# Check Kafka status
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --all-groups

# Result: Many consumers showing "UNKNOWN"

Kafka relies on Zookeeper for metadata management and controller election. If ZK is unavailable, brokers cannot elect a controller or update partition leadership, so the cluster is effectively dead.
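
As a first sanity check, it helps to confirm which ZK ensemble the brokers actually point at. A minimal sketch, assuming a typical config location (adjust the path to your installation):

# Path is an assumption; server.properties may live elsewhere on your brokers
grep -E '^zookeeper\.connect' /etc/kafka/server.properties
# zookeeper.connect=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181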

2. Zookeeper State Investigation

2.1 Cluster Topology

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    ZK-1     │    │    ZK-2     │    │    ZK-3     │
│ 192.168.1.1 │    │ 192.168.1.2 │    │ 192.168.1.3 │
└─────────────┘    └─────────────┘    └─────────────┘
      ↑                   ↑                   ↑
      └─────── Kafka Brokers connect ─────────┘

2.2 Check Each Node’s Status

# Normally should have 1 leader + 2 followers
for host in 192.168.1.{1,2,3}; do
  echo "=== $host ==="
  echo stat | nc $host 2181 | grep Mode
done

Abnormal output:

=== 192.168.1.1 ===
Mode: leader
=== 192.168.1.2 ===
Mode: leader      ← Two Leaders! Split-brain!
=== 192.168.1.3 ===
Error: Connection refused

2.3 Check Logs

# ZK-1 logs
tail -100 /var/log/zookeeper/zookeeper.log

# Found
WARN  [QuorumPeer] - Cannot open channel to 3 at election address /192.168.1.3:3888
ERROR [QuorumPeer] - Unexpected exception causing shutdown while sock still open

# ZK-3 logs
# Process was dead
systemctl status zookeeper
# Active: failed (Result: exit-code)

3. Root Cause Analysis

3.1 What is Split-Brain?

Normal Zookeeper elections require Quorum (majority) agreement:

3-node cluster: Quorum = 2
5-node cluster: Quorum = 3

Split-brain scenario:

Before network partition:
  ZK-1(Leader) ←→ ZK-2(Follower) ←→ ZK-3(Follower)

After network partition:
  Partition A: ZK-1, ZK-2     Partition B: ZK-3
  
  ZK-1: "I have 2 votes, I'm Leader"
  ZK-3: "Can't hear Leader heartbeat, starting election"
       → Only 1 vote, can't elect new Leader

Correct behavior: ZK-3 cannot reach Quorum, so it stops serving requests (reads only if read-only mode is enabled)

But this incident was different…

3.2 Root Cause

Check why ZK-3 died:

# System logs
journalctl -u zookeeper --since "2021-10-14 02:00"

# Found
Out of memory: Kill process 12345 (java) score 800

Truth: ZK-3 was killed by OOM Killer, not network partition!
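
The kernel log confirms OOM kills independently of the service log (exact message wording varies by kernel version):

# Look for OOM Killer activity around the incident window
dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
journalctl -k --since "2021-10-14 02:00" | grep -i oom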

Timeline reconstruction:

02:15 ZK-3 killed by OOM
02:16 ZK-1 and ZK-2 notice ZK-3 is gone
02:17 ZK-1 remains Leader (2 nodes still have Quorum)
02:30 Network hiccup between ZK-1 and ZK-2
02:31 ZK-2 loses ZK-1's heartbeat, assumes Leader is dead
02:32 ZK-2 starts election... but no Quorum!
02:33 ZK-1 reconnects, but ZK-2 already thinks it's Leader

Key issue: ZK-2 didn’t properly step down after network recovered.

4. Emergency Recovery

4.1 Restore Healthy Cluster

# 1. Stop all ZK nodes
for host in 192.168.1.{1,2,3}; do
  ssh $host "systemctl stop zookeeper"
done

# 2. Check data consistency
for host in 192.168.1.{1,2}; do
  ssh $host "cat /var/lib/zookeeper/version-2/currentEpoch"
done
# Ensure epochs are consistent

# 3. Restore ZK-3
ssh 192.168.1.3 "systemctl start zookeeper"
# Wait 10s for startup

# 4. Start other nodes sequentially
ssh 192.168.1.1 "systemctl start zookeeper"
sleep 5
ssh 192.168.1.2 "systemctl start zookeeper"
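
To make the sequential start less error-prone, a small helper can wait for each node to answer before the next one is started. A sketch, assuming ruok is whitelisted (see 5.3); wait_for_zk is a hypothetical helper name:

# Poll a node until it answers "imok" (hypothetical helper)
wait_for_zk() {
  local host=$1
  until [ "$(echo ruok | nc -w 2 $host 2181)" = "imok" ]; do
    echo "waiting for $host..."
    sleep 2
  done
}

ssh 192.168.1.3 "systemctl start zookeeper" && wait_for_zk 192.168.1.3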

4.2 Verify Recovery

# Check cluster status
for host in 192.168.1.{1,2,3}; do
  echo "=== $host ==="
  echo stat | nc $host 2181 | grep Mode
done

# Expected output
=== 192.168.1.1 ===
Mode: follower
=== 192.168.1.2 ===
Mode: follower
=== 192.168.1.3 ===
Mode: leader

# Check Kafka recovery
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
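
Beyond listing consumer groups, checking for under-replicated partitions gives a better signal that the brokers have re-registered and partition leadership has settled:

# Should return no rows once the cluster has stabilized
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions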

5. Prevention Measures

5.1 Memory Configuration

# /etc/zookeeper/java.env
export JVMFLAGS="-Xmx2g -Xms2g"

# System level
# /etc/sysctl.conf
vm.swappiness=1  # Avoid swap usage

# Protect the ZK process from the OOM Killer (-1000 = never selected)
# Note: this does not persist across restarts; pgrep may match several PIDs
for pid in $(pgrep -f zookeeper); do
  echo -1000 > /proc/$pid/oom_score_adj
done
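
Writes to /proc do not survive a restart, so a systemd drop-in is the more durable way to set the OOM score. A sketch, assuming ZK runs under a systemd unit named zookeeper:

# /etc/systemd/system/zookeeper.service.d/override.conf
[Service]
OOMScoreAdjust=-1000

# Apply the override
systemctl daemon-reload && systemctl restart zookeeper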

5.2 Monitoring Alerts

# Prometheus alert rules
groups:
- name: zookeeper
  rules:
  - alert: ZookeeperDown
    expr: up{job="zookeeper"} == 0
    for: 1m
    labels:
      severity: critical
      
  - alert: ZookeeperNoLeader
    expr: zk_server_leader == 0
    for: 1m
    labels:
      severity: critical
      
  - alert: ZookeeperTooManyLeaders
    expr: sum(zk_server_leader) > 1
    for: 30s
    labels:
      severity: critical
    annotations:
      summary: "Split-brain! Multiple Leaders detected"
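
These rules assume an exporter that exposes zk_server_leader-style metrics. As an alternative, ZooKeeper 3.6+ ships a built-in Prometheus metrics provider (metric names differ, so adjust the expressions to whatever your /metrics endpoint actually exposes):

# zoo.cfg — enable the built-in Prometheus metrics provider (ZK 3.6+)
metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider
metricsProvider.httpPort=7000
# Metrics are then scraped from http://<zk-host>:7000/metrics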

5.3 Cluster Configuration Tuning

# zoo.cfg

# Heartbeat interval (ms)
tickTime=2000

# Ticks for Leader to wait for Follower connections
initLimit=10

# Ticks for Leader-Follower sync
syncLimit=5

# 4-letter word whitelist (for monitoring)
4lw.commands.whitelist=stat,ruok,mntr,envi

# Leader does not serve client requests (dedicated to coordination only)
leaderServes=no
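
After editing zoo.cfg, it is worth verifying what each node actually loaded. The conf four-letter command prints the effective configuration (add conf to the whitelist above first):

echo conf | nc localhost 2181
# clientPort=2181
# tickTime=2000
# initLimit=10
# ...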

5.4 Network Configuration

# Ensure ZK node-to-node latency < 100ms
ping -c 10 192.168.1.2
# rtt min/avg/max = 0.5/0.8/1.2 ms ✓

# Check firewall
for port in 2181 2888 3888; do
  echo "Port $port:"
  nc -zv 192.168.1.2 $port
done

6. Understanding ZAB Protocol

6.1 Election Process

Phase 1: Leader Election
  - Each node votes for itself
  - Collect votes from other nodes
  - Compare (epoch, zxid, myid)
  - Switch to optimal candidate
  - Node with Quorum votes wins

Phase 2: Discovery
  - Leader collects latest zxid from Followers
  - Determines data to sync

Phase 3: Synchronization
  - Leader syncs data to Followers
  - All Followers reach consensus

Phase 4: Broadcast
  - Leader starts accepting client requests
  - Uses 2PC protocol to broadcast updates
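
The comparison in Phase 1 is effectively a tuple comparison: higher epoch wins, then higher zxid, then higher myid. A toy sketch of just that rule (not ZooKeeper's actual code):

# Toy vote comparison: vote = (epoch, zxid, myid), the higher tuple wins
vote_wins() {  # args: e1 z1 m1  e2 z2 m2; exit 0 if the first vote wins
  local e1=$1 z1=$2 m1=$3 e2=$4 z2=$5 m2=$6
  if   (( e1 != e2 )); then (( e1 > e2 ))
  elif (( z1 != z2 )); then (( z1 > z2 ))
  else                      (( m1 > m2 ))
  fi
}

vote_wins 5 200 1  5 180 3 && echo "vote 1 wins"   # newer data beats higher myid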

6.2 Why Quorum?

3 nodes, Quorum = 2
5 nodes, Quorum = 3
7 nodes, Quorum = 4

Formula: Quorum = floor(N/2) + 1

Any two Quorums must intersect
→ Guarantees data consistency
→ Prevents split-brain
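
As a quick sanity check, the quorum size is just integer division:

for N in 3 5 7; do echo "$N nodes -> quorum $(( N / 2 + 1 ))"; done
# 3 nodes -> quorum 2
# 5 nodes -> quorum 3
# 7 nodes -> quorum 4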

6.3 Best Practices: Node Count

Nodes   Fault Tolerance   Recommended Use
1       0                 Dev only
3       1                 Small production
5       2                 Medium/large production
7       3                 Extreme HA requirements

7. Debugging Toolkit

7.1 Four-Letter Commands

# Check node status
echo stat | nc localhost 2181

# Health check
echo ruok | nc localhost 2181
# Returns "imok"

# Metrics
echo mntr | nc localhost 2181
# zk_version  3.6.3
# zk_server_state leader
# zk_num_alive_connections 10
# zk_outstanding_requests 0

# Environment info
echo envi | nc localhost 2181
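
These commands combine into a one-shot split-brain check across the ensemble (a sketch; assumes mntr is whitelisted):

# Count how many nodes claim to be leader; anything other than 1 is a problem
leaders=0
for host in 192.168.1.{1,2,3}; do
  state=$(echo mntr | nc -w 2 $host 2181 | awk '/zk_server_state/ {print $2}')
  echo "$host: ${state:-unreachable}"
  [ "$state" = "leader" ] && leaders=$((leaders + 1))
done
echo "leaders: $leaders"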

7.2 zkCli Operations

# Connect to cluster
zkCli.sh -server 192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181

# View metadata
ls /brokers/ids
get /controller

# Check ACL
getAcl /kafka

7.3 Log Analysis Tips

# Find election-related logs
grep -E "LOOKING|LEADING|FOLLOWING|Election" zookeeper.log

# Find connection issues
grep "Cannot open channel" zookeeper.log

# Find session timeouts
grep "Session expired" zookeeper.log
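
For cross-node incidents like this one, it helps to merge the election-related lines from every node into one rough timeline. A sketch, assuming log lines start with a timestamp and the log path from section 2.3:

for host in 192.168.1.{1,2,3}; do
  ssh $host "grep -E 'LOOKING|LEADING|FOLLOWING|Election' /var/log/zookeeper/zookeeper.log" \
    | sed "s/^/$host /"
done | sort -k2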

8. Summary

Step      Action
Detect    Monitoring alerts + Kafka anomalies
Locate    Four-letter commands to check each node’s Mode
Analyze   ZK logs + system logs to find root cause
Fix       Ordered restart, prioritize Quorum recovery
Prevent   Monitoring + memory tuning + network checks

Biggest lessons:

  1. Zookeeper is memory-sensitive — configure JVM heap properly
  2. Set OOM protection so ZK process isn’t easily killed
  3. Monitor Leader count — alert if more than 1
  4. Practice failure recovery procedures regularly