Go Lock Performance Traps: From sync.Mutex to Spinlock Tuning

A seemingly simple concurrent map access collapses under high concurrency, with pprof showing close to 90% of CPU time spent acquiring and releasing locks. This post documents the investigation and compares the performance of different locking strategies.

1. The Problem

Try It Yourself

Want to run the benchmarks locally? Clone the repo and give it a try:

git clone https://github.com/uzqw/uzqw-blog-labs.git
cd uzqw-blog-labs/250402-golang-interview
make bench

1.1 Background

In the NAS system’s disk status cache module, we use a global Map to store disk health info:

type DiskCache struct {
    mu    sync.RWMutex
    disks map[string]*DiskStatus
}

func (c *DiskCache) Get(id string) *DiskStatus {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.disks[id]
}

func (c *DiskCache) Update(id string, status *DiskStatus) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.disks[id] = status
}

This looks standard, but on a 32-core machine, once concurrent read requests hit 100K QPS, response time shot up from 0.1 ms to 50 ms.

1.2 pprof Analysis

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

The flame graph shows:

runtime.lock2 (45%)
  └── sync.(*RWMutex).RLock
runtime.unlock2 (42%)
  └── sync.(*RWMutex).RUnlock

87% of CPU time spent on lock contention!

2. Deep Dive into sync.Mutex

2.1 Go’s Hybrid Lock Strategy

Go’s sync.Mutex is neither a pure spinlock nor a pure blocking lock; it is a hybrid:

// src/sync/mutex.go (simplified)
func (m *Mutex) Lock() {
    // Fast path: grab the unlocked mutex with a single CAS
    if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
        return
    }
    m.lockSlow()
}

func (m *Mutex) lockSlow() {
    var waitStartTime int64 // used by starvation mode (omitted here)
    var starving bool
    var awoke bool
    var iter int // spin count

    for {
        old := m.state
        // Spin phase: briefly busy-wait, hoping the holder releases soon
        if canSpin(iter) {
            if !awoke && atomic.CompareAndSwapInt32(&m.state, old, old|mutexWoken) {
                awoke = true
            }
            runtime_doSpin() // executes PAUSE instructions
            iter++
            continue
        }

        // Block phase: spinning failed, park the goroutine on the semaphore
        runtime_SemacquireMutex(&m.sema)
        // On wakeup, loop and try to acquire again (starvation handling omitted)
    }
}

2.2 Spin Conditions

func canSpin(iter int) bool {
    // 1. Multi-core CPU (single-core spin is pointless)
    // 2. GOMAXPROCS > 1
    // 3. Spin count < 4
    // 4. At least one P is running
    return iter < active_spin && 
           runtime_canSpin(iter)
}

Go spins at most 4 times, and each spin executes roughly 30 PAUSE instructions. If the lock still can’t be acquired, the goroutine is parked on the mutex’s semaphore wait queue.

2.3 When RWMutex Helps (and When It Doesn’t)

For read-heavy workloads, RWMutex can outperform Mutex:

// Read-heavy benchmark (24 cores)
BenchmarkMutexRead-24       24505413    142.1 ns/op
BenchmarkRWMutexRead-24     80448913     45.4 ns/op  // 3x faster for reads!

However, under write-heavy or mixed workloads, every RLock/RUnlock still performs atomic updates on the shared readerCount field, so cache-line contention erodes much of the advantage.
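
For reference, the numbers above come from read-heavy benchmarks of roughly this shape; a minimal sketch, assuming the DiskStatus type from section 1.1 (the exact harness lives in the repo linked in section 1):

// Minimal read-heavy benchmark sketch (assumes: import "sync", "testing";
// DiskStatus as defined in section 1.1).
func BenchmarkRWMutexRead(b *testing.B) {
    var mu sync.RWMutex
    disks := map[string]*DiskStatus{"sda": {}}
    b.RunParallel(func(pb *testing.PB) {
        for pb.Next() {
            mu.RLock()
            _ = disks["sda"]
            mu.RUnlock()
        }
    })
}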

3. Solution Comparison

3.1 Option 1: Sharded Locks

const ShardCount = 32

type ShardedCache struct {
    shards [ShardCount]struct {
        mu    sync.RWMutex
        disks map[string]*DiskStatus
    }
}

func (c *ShardedCache) getShard(id string) int {
    h := fnv.New32a()
    h.Write([]byte(id))
    return int(h.Sum32()) % ShardCount
}

func (c *ShardedCache) Get(id string) *DiskStatus {
    shard := &c.shards[c.getShard(id)]
    shard.mu.RLock()
    defer shard.mu.RUnlock()
    return shard.disks[id]
}

Benchmark:

BenchmarkShardedRead-24    173875340    20.5 ns/op  // 7x faster than Mutex!
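
The write path is symmetric; a minimal sketch of an Update method (not shown above, so the details are assumed):

// Hypothetical write path for ShardedCache: only the target shard is locked,
// so writes to different shards never contend with each other.
func (c *ShardedCache) Update(id string, status *DiskStatus) {
    shard := &c.shards[c.getShard(id)]
    shard.mu.Lock()
    defer shard.mu.Unlock()
    if shard.disks == nil { // lazily initialize the shard's map
        shard.disks = make(map[string]*DiskStatus)
    }
    shard.disks[id] = status
}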

3.2 Option 2: sync.Map

type SyncMapCache struct {
    disks sync.Map
}

func (c *SyncMapCache) Get(id string) *DiskStatus {
    v, ok := c.disks.Load(id)
    if !ok {
        return nil
    }
    return v.(*DiskStatus)
}

Benchmark:

BenchmarkSyncMapRead-24    343054987    10.2 ns/op  // Excellent for reads

sync.Map shines for “write once, then read many” patterns; for frequently updated keys it falls behind sharded locks.
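
For completeness, the corresponding write path is a single Store call; a minimal sketch:

// Hypothetical write path for SyncMapCache. Store updates existing keys with an
// atomic entry swap, but inserting new keys goes through sync.Map's
// mutex-protected dirty map, which is why frequent inserts hurt.
func (c *SyncMapCache) Update(id string, status *DiskStatus) {
    c.disks.Store(id, status)
}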

3.3 Option 3: Spinlock + Minimal Critical Section

When the critical section is short enough (just reading a pointer out of a map), a pure spinlock might be faster:

type SpinLockCache struct {
    lock  int32
    disks map[string]*DiskStatus
}

func (c *SpinLockCache) Get(id string) *DiskStatus {
    // Spin acquire
    for !atomic.CompareAndSwapInt32(&c.lock, 0, 1) {
        runtime.Gosched()  // Yield CPU to avoid starvation
    }
    // Very short critical section
    status := c.disks[id]
    atomic.StoreInt32(&c.lock, 0)
    return status
}

Benchmark:

BenchmarkSpinLockRead-24    27466878    134.6 ns/op  // Not always fastest!

But this approach is risky: if the critical section gets even slightly longer, performance collapses.
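
If you do go down this road, wrapping the spin state in a small sync.Locker-style type at least allows defer, so a panic inside the critical section cannot leave the lock held forever. A sketch, not from the original code:

// Minimal spinlock satisfying sync.Locker (assumes: import "runtime", "sync/atomic").
type SpinLock struct {
    state int32 // 0 = unlocked, 1 = locked
}

func (s *SpinLock) Lock() {
    for !atomic.CompareAndSwapInt32(&s.state, 0, 1) {
        runtime.Gosched() // yield so the lock holder can run and release
    }
}

func (s *SpinLock) Unlock() {
    atomic.StoreInt32(&s.state, 0)
}

// Usage: lock.Lock(); defer lock.Unlock() around the (very short) critical section.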

3.4 Option 4: Copy-on-Write

Ultimate solution when reads vastly outnumber writes:

type COWCache struct {
    mu    sync.Mutex   // serializes writers; readers never touch it
    disks atomic.Value // stores an immutable map[string]*DiskStatus
}

func (c *COWCache) Get(id string) *DiskStatus {
    m, _ := c.disks.Load().(map[string]*DiskStatus)
    return m[id] // Read is completely lock-free!
}

func (c *COWCache) Update(id string, status *DiskStatus) {
    // atomic.Value.CompareAndSwap cannot compare map values (maps aren't
    // comparable), so writers are serialized with a mutex instead.
    c.mu.Lock()
    defer c.mu.Unlock()
    old, _ := c.disks.Load().(map[string]*DiskStatus)
    // Copy the entire map (this is what makes writes slow)
    next := make(map[string]*DiskStatus, len(old)+1)
    for k, v := range old {
        next[k] = v
    }
    next[id] = status
    c.disks.Store(next) // publish the new map atomically
}

Benchmark:

BenchmarkCOWRead-24     363207240      9.8 ns/op   // Read is ultra fast
BenchmarkCOWWrite-24       724800   4938.0 ns/op   // Write is slow (map copy)

Use case: Config caches, routing tables — “read lots, write rarely” data.

4. Performance Summary

Solution        Read (ns/op)   Write (ns/op)   Use Case
sync.Mutex      142            402             General purpose
sync.RWMutex    45             401             Read-heavy workloads
Sharded locks   21             57              High-concurrency reads/writes
sync.Map        10             98              Write once, read often
Spinlock        135            353             Risky, not recommended
Copy-on-Write   10             4938            Read lots, write rarely
Hybrid          21             56              Best of both worlds

5. Final Solution

For disk status cache, we adopted sharded locks + COW hybrid:

type HybridCache struct {
    // Hot data: sharded lock protection (21 ns read, 56 ns write)
    hot [32]struct {
        mu   sync.RWMutex
        data map[string]*DiskStatus
    }
    // Cold data: COW (10 ns read, but 4938 ns write - only for rarely-updated data)
    cold atomic.Value
}

Why not just use sync.Map? While sync.Map achieves 10 ns reads and 98 ns writes, it’s optimized for “write once, read many” patterns. Our disk cache has frequent updates to hot data (health checks every second), making sharded locks a better fit for the hot path.
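
The post does not show the hybrid’s methods; a hypothetical read path, checking the hot tier first and falling back to the cold copy-on-write map, might look like this (shard selection reuses the FNV hash from section 3.1; everything beyond the struct above is assumed):

// Hypothetical HybridCache read path (assumes: import "hash/fnv").
func (c *HybridCache) Get(id string) *DiskStatus {
    // 1. Hot tier: per-shard RWMutex; an RLock on a single shard is rarely contended.
    h := fnv.New32a()
    h.Write([]byte(id))
    shard := &c.hot[int(h.Sum32())%len(c.hot)]
    shard.mu.RLock()
    s, ok := shard.data[id]
    shard.mu.RUnlock()
    if ok {
        return s
    }
    // 2. Cold tier: lock-free read from the copy-on-write map.
    cold, _ := c.cold.Load().(map[string]*DiskStatus)
    return cold[id]
}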

When to use this pattern:

  • Hot data with frequent reads AND writes → sharded locks (21/56 ns)
  • Cold data with rare writes → COW (10/4938 ns)
  • If your writes are infrequent across all data → consider sync.Map instead

Result: Reduced P99 latency from 50ms to sub-millisecond by eliminating lock contention on hot paths.

6. Debugging Toolchain

# 1. CPU Profile
go tool pprof http://localhost:6060/debug/pprof/profile

# 2. Lock contention analysis
go tool pprof http://localhost:6060/debug/pprof/mutex

# 3. Block analysis
go tool pprof http://localhost:6060/debug/pprof/block

# 4. View goroutine status
curl http://localhost:6060/debug/pprof/goroutine?debug=2

# 5. Race detection
go test -race ./...
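
Note that the mutex and block profiles stay empty unless sampling is enabled in the target process; a minimal setup sketch (the listen address matches the pprof URLs above):

package main

import (
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* handlers on DefaultServeMux
    "runtime"
)

func main() {
    // A rate of 1 records every event; production services usually sample less.
    runtime.SetMutexProfileFraction(1) // enable the mutex contention profile
    runtime.SetBlockProfileRate(1)     // enable the blocking profile
    go func() {
        // Expose /debug/pprof/* on the address used in the commands above.
        _ = http.ListenAndServe("localhost:6060", nil)
    }()
    select {} // stand-in for the real service's main loop
}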

7. Summary

Principle                   Explanation
Shorten critical sections   Only do essential work while holding the lock
Distribute contention       Sharded locks reduce the single point of contention
Separate reads and writes   COW makes reads lock-free
Avoid false sharing         Watch cache-line alignment (see the padding sketch below)
Measure first               Use pprof to locate the bottleneck, don’t guess
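
On the false-sharing point, the usual trick is to pad per-shard state so neighboring shards never share a cache line. A sketch, assuming 64-byte cache lines (true for most x86-64 CPUs); the padded types are illustrative, not from the post:

// Illustrative padded shard: the trailing pad keeps each shard's RWMutex on its
// own cache line, so readers of one shard don't invalidate a neighbor's line
// (assumes: import "sync").
type paddedShard struct {
    mu    sync.RWMutex
    disks map[string]*DiskStatus
    _     [64]byte // crude padding; cache-line-size-aware padding is also an option
}

type PaddedShardedCache struct {
    shards [32]paddedShard
}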

Core lesson: in high-concurrency scenarios, choosing a lock isn’t just about knowing how to use it; you have to tune the strategy to your actual access patterns.