A seemingly simple concurrent map access collapsed under high concurrency: pprof showed nearly 90% of CPU time spent acquiring and releasing locks. This post documents the investigation and compares the performance of different locking strategies.
1. The Problem
Try It Yourself
Want to run the benchmarks locally? Clone the repo and give it a try:
```bash
git clone https://github.com/uzqw/uzqw-blog-labs.git
cd uzqw-blog-labs/250402-golang-interview
make bench
```
1.1 Background
In the NAS system’s disk status cache module, we use a global Map to store disk health info:
```go
type DiskCache struct {
	mu    sync.RWMutex
	disks map[string]*DiskStatus
}

func (c *DiskCache) Get(id string) *DiskStatus {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.disks[id]
}

func (c *DiskCache) Update(id string, status *DiskStatus) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.disks[id] = status
}
```
Looks standard, but on a 32-core machine, when concurrent read requests hit 100K QPS, response time shot from 0.1ms to 50ms.
1.2 pprof Analysis
```bash
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
```
Flame graph shows:
```text
runtime.lock2 (45%)
  └── sync.(*RWMutex).RLock
runtime.unlock2 (42%)
  └── sync.(*RWMutex).RUnlock
```
87% of CPU time spent on lock contention!
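For reproducibility, here is a minimal sketch of the wiring a service needs so these profiles (and the mutex/block profiles used later in this post) have data to show. The port matches the commands used throughout; the rest is standard net/http/pprof boilerplate, not code from the NAS service:

```go
package main

import (
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"runtime"
)

func main() {
	// Without these, the mutex and block profiles stay empty.
	runtime.SetMutexProfileFraction(1) // sample every mutex contention event
	runtime.SetBlockProfileRate(1)     // sample every blocking event

	// Serve pprof on the port used by the `go tool pprof` commands in this post.
	go func() {
		_ = http.ListenAndServe("localhost:6060", nil)
	}()

	// ... start the actual service here ...
	select {}
}
```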
2. Deep Dive into sync.Mutex
2.1 Go’s Hybrid Lock Strategy
Go’s sync.Mutex is neither a pure spinlock nor a pure blocking lock; it’s a hybrid:
```go
// src/sync/mutex.go (simplified)
func (m *Mutex) Lock() {
	// Fast path: try to acquire the lock directly.
	if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
		return
	}
	m.lockSlow()
}

func (m *Mutex) lockSlow() {
	var waitStartTime int64 // used by the starvation handling omitted below
	var starving bool
	var awoke bool
	var iter int // spin count
	old := m.state
	for {
		// Spin phase: briefly spin-wait for the lock to be released.
		if canSpin(iter) {
			// Set mutexWoken so Unlock doesn't wake other blocked goroutines.
			if !awoke && atomic.CompareAndSwapInt32(&m.state, old, old|mutexWoken) {
				awoke = true
			}
			runtime_doSpin() // executes PAUSE instructions
			iter++
			old = m.state
			continue
		}
		// Block phase: spinning failed, park the goroutine on the semaphore.
		runtime_SemacquireMutex(&m.sema)
		// ... wake-up, starvation mode, and retry handling omitted ...
	}
}
```
2.2 Spin Conditions
```go
func canSpin(iter int) bool {
	// Spinning is only worthwhile when:
	// 1. The CPU is multi-core (spinning on a single core is pointless)
	// 2. GOMAXPROCS > 1
	// 3. The spin count is still below 4
	// 4. At least one P is running
	return iter < active_spin && runtime_canSpin(iter)
}
```
Go spins at most 4 times, each spin executing roughly 30 PAUSE instructions. If the lock still can't be acquired, the goroutine is parked on the mutex's semaphore wait queue.
2.3 When RWMutex Helps (and When It Doesn’t)
For read-heavy workloads, RWMutex can outperform Mutex:
```text
// Read-heavy benchmark (24 cores)
BenchmarkMutexRead-24      24505413    142.1 ns/op
BenchmarkRWMutexRead-24    80448913     45.4 ns/op   // 3x faster for reads!
```
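For context, a read-heavy benchmark along these lines produces numbers of that shape (a sketch that assumes the `DiskCache` type from section 1.1 lives in the same package and that `DiskStatus` is a plain struct; the harness in the linked repo may differ). `b.RunParallel` runs the loop body from one goroutine per GOMAXPROCS by default:

```go
package cache // assumed package name

import (
	"strconv"
	"testing"
)

// Read-heavy benchmark against the DiskCache from section 1.1.
func BenchmarkRWMutexRead(b *testing.B) {
	c := &DiskCache{disks: make(map[string]*DiskStatus)}
	for i := 0; i < 1024; i++ {
		c.disks["disk-"+strconv.Itoa(i)] = &DiskStatus{}
	}
	b.ResetTimer()
	b.RunParallel(func(pb *testing.PB) {
		i := 0
		for pb.Next() {
			_ = c.Get("disk-" + strconv.Itoa(i%1024))
			i++
		}
	})
}
```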
However, under write-heavy or mixed workloads, writers block all readers, and every RLock/RUnlock still performs an atomic update of readerCount, which causes cache-line contention, so RWMutex's advantage shrinks.
3. Solution Comparison
3.1 Option 1: Sharded Locks
```go
const ShardCount = 32

type ShardedCache struct {
	shards [ShardCount]struct {
		mu    sync.RWMutex
		disks map[string]*DiskStatus
	}
}

func (c *ShardedCache) getShard(id string) int {
	h := fnv.New32a()
	h.Write([]byte(id))
	// Take the modulo on the uint32 so the result can never be negative.
	return int(h.Sum32() % ShardCount)
}

func (c *ShardedCache) Get(id string) *DiskStatus {
	shard := &c.shards[c.getShard(id)]
	shard.mu.RLock()
	defer shard.mu.RUnlock()
	return shard.disks[id]
}
```
Benchmark:
```text
BenchmarkShardedRead-24    173875340    20.5 ns/op   // 7x faster than Mutex!
```
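The snippet above omits construction and the write path. A minimal sketch of both (the constructor name is mine, not taken from the repo):

```go
// NewShardedCache is a hypothetical constructor; each shard's map must be
// allocated before first use.
func NewShardedCache() *ShardedCache {
	c := &ShardedCache{}
	for i := range c.shards {
		c.shards[i].disks = make(map[string]*DiskStatus)
	}
	return c
}

// Update takes only the write lock of the shard that owns the key, so
// writes to different shards proceed in parallel.
func (c *ShardedCache) Update(id string, status *DiskStatus) {
	shard := &c.shards[c.getShard(id)]
	shard.mu.Lock()
	defer shard.mu.Unlock()
	shard.disks[id] = status
}
```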
3.2 Option 2: sync.Map
```go
type SyncMapCache struct {
	disks sync.Map
}

func (c *SyncMapCache) Get(id string) *DiskStatus {
	v, ok := c.disks.Load(id)
	if !ok {
		return nil
	}
	return v.(*DiskStatus)
}
```
Benchmark:
```text
BenchmarkSyncMapRead-24    343054987    10.2 ns/op   // Excellent for reads
```
sync.Map shines for "write once, read many" patterns; under frequent updates it falls behind sharded locks (98 ns vs 57 ns per write in the comparison table later in this post).
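For completeness, the write path is a single Store call; the Snapshot helper is my own illustration of Range, not something from the original post:

```go
// Update stores the new status; sync.Map does its own synchronization.
func (c *SyncMapCache) Update(id string, status *DiskStatus) {
	c.disks.Store(id, status)
}

// Snapshot copies the current contents, e.g. for metrics export.
func (c *SyncMapCache) Snapshot() map[string]*DiskStatus {
	out := make(map[string]*DiskStatus)
	c.disks.Range(func(k, v any) bool {
		out[k.(string)] = v.(*DiskStatus)
		return true // keep iterating
	})
	return out
}
```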
3.3 Option 3: Spinlock + Minimal Critical Section
When the critical section is short enough (just reading a pointer), a pure spinlock can be faster:
```go
type SpinLockCache struct {
	lock  int32
	disks map[string]*DiskStatus
}

func (c *SpinLockCache) Get(id string) *DiskStatus {
	// Spin until the CAS acquires the lock.
	for !atomic.CompareAndSwapInt32(&c.lock, 0, 1) {
		runtime.Gosched() // yield the CPU to avoid starvation
	}
	// Keep the critical section as short as possible.
	status := c.disks[id]
	atomic.StoreInt32(&c.lock, 0)
	return status
}
```
Benchmark:
```text
BenchmarkSpinLockRead-24    27466878    134.6 ns/op   // Not always fastest!
```
But this is risky: if the critical section grows even slightly, throughput collapses, because every waiting goroutine burns CPU while it spins.
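To keep the comparison honest, writes have to go through the same lock; a sketch of that path:

```go
// Update uses the same spinlock. Any goroutine holding the lock blocks
// every other reader and writer, which is why longer critical sections
// hurt this design so badly.
func (c *SpinLockCache) Update(id string, status *DiskStatus) {
	for !atomic.CompareAndSwapInt32(&c.lock, 0, 1) {
		runtime.Gosched()
	}
	c.disks[id] = status
	atomic.StoreInt32(&c.lock, 0)
}
```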
3.4 Option 4: Copy-on-Write
Ultimate solution when reads vastly outnumber writes:
```go
type COWCache struct {
	mu    sync.Mutex   // serializes writers only; readers never touch it
	disks atomic.Value // holds a map[string]*DiskStatus
}

func (c *COWCache) Get(id string) *DiskStatus {
	m := c.disks.Load().(map[string]*DiskStatus)
	return m[id] // reads are completely lock-free
}

func (c *COWCache) Update(id string, status *DiskStatus) {
	c.mu.Lock()
	defer c.mu.Unlock()
	old := c.disks.Load().(map[string]*DiskStatus)
	// Copy the entire map; this is what makes writes slow.
	next := make(map[string]*DiskStatus, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	next[id] = status
	// Note: atomic.Value.CompareAndSwap can't be used here, because maps
	// aren't comparable and the comparison would panic at runtime; a small
	// writer mutex plus Store gives the same copy-on-write semantics.
	c.disks.Store(next)
}
```
Benchmark:
```text
BenchmarkCOWRead-24     363207240       9.8 ns/op   // Reads are ultra fast
BenchmarkCOWWrite-24       724800    4938.0 ns/op   // Writes are slow (full map copy)
```
Use case: config caches, routing tables, and other "read lots, write rarely" data.
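One practical detail the snippet skips: the atomic.Value has to be seeded before the first Get, or the type assertion sees a nil interface and panics. A constructor along these lines (the name is mine) handles it:

```go
// NewCOWCache seeds the atomic.Value so Get's type assertion never sees nil.
func NewCOWCache() *COWCache {
	c := &COWCache{}
	c.disks.Store(make(map[string]*DiskStatus))
	return c
}
```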
4. Performance Comparison

| Solution | Read (ns/op) | Write (ns/op) | Use Case |
|---|---|---|---|
| sync.Mutex | 142 | 402 | General purpose |
| sync.RWMutex | 45 | 401 | Read-heavy workloads |
| Sharded locks | 21 | 57 | High-concurrency reads and writes |
| sync.Map | 10 | 98 | Write once, read often |
| Spinlock | 135 | 353 | Risky, not recommended |
| Copy-on-Write | 10 | 4938 | Read lots, write rarely |
| Hybrid | 21 | 56 | Best of both worlds |
5. Final Solution
For the disk status cache, we adopted a sharded-locks + COW hybrid:
```go
type HybridCache struct {
	// Hot data: sharded-lock protection (21 ns read, 56 ns write).
	hot [32]struct {
		mu   sync.RWMutex
		data map[string]*DiskStatus
	}
	// Cold data: COW (10 ns read, but 4938 ns write - only for rarely-updated entries).
	cold atomic.Value
}
```
Why not just use sync.Map? While sync.Map achieves 10 ns reads and 98 ns writes, it’s optimized for “write once, read many” patterns. Our disk cache has frequent updates to hot data (health checks every second), making sharded locks a better fit for the hot path.
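The struct above doesn't show how lookups route between the two tiers. Here is one way it can work (a sketch; the hot-first, cold-fallback policy and the FNV shard hash are my illustration, not necessarily the production logic):

```go
// Get checks the frequently-updated hot tier first, then falls back to the
// rarely-updated cold tier, which is read lock-free via COW.
func (c *HybridCache) Get(id string) *DiskStatus {
	h := fnv.New32a()
	h.Write([]byte(id))
	shard := &c.hot[h.Sum32()%32]

	shard.mu.RLock()
	s, ok := shard.data[id]
	shard.mu.RUnlock()
	if ok {
		return s
	}

	// Cold tier: lock-free COW read; the comma-ok assertion tolerates an
	// unseeded atomic.Value.
	if m, _ := c.cold.Load().(map[string]*DiskStatus); m != nil {
		return m[id]
	}
	return nil
}
```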
When to use this pattern:
- Hot data with frequent reads AND writes → sharded locks (21/56 ns)
- Cold data with rare writes → COW (10/4938 ns)
- If writes are infrequent across all of your data → consider sync.Map instead
Result: Reduced P99 latency from 50ms to sub-millisecond by eliminating lock contention on hot paths.
6. Profiling Toolbox

```bash
# 1. CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile

# 2. Lock contention analysis
go tool pprof http://localhost:6060/debug/pprof/mutex

# 3. Block analysis
go tool pprof http://localhost:6060/debug/pprof/block

# 4. View goroutine status
curl http://localhost:6060/debug/pprof/goroutine?debug=2

# 5. Race detection
go test -race ./...
```
7. Summary
| Principle | Explanation |
|---|---|
| Shorten critical sections | Only do essential work while holding the lock |
| Distribute contention | Sharded locks remove the single point of contention |
| Separate reads from writes | COW makes reads lock-free |
| Avoid false sharing | Watch cache-line alignment |
| Measure first | Use pprof to locate the hotspot instead of guessing |
Core lesson: in high-concurrency scenarios, choosing a lock isn't just about knowing the API; you have to tune the strategy to your actual access patterns.