
Six Pitfalls Implementing HNSW: A Debug Journal from 0% to 98%+ Recall

While implementing the HNSW algorithm for my vector database project vex (initial code AI-assisted, debugging and optimization done by hand), I ran my first Recall test — every single metric was 0%. Not 50%, not 30%. Zero. The HNSW results and BruteForce results had absolutely no overlap. This post documents the entire journey from 0% to 98%+ recall, cross-referencing each bug with the original HNSW paper.

1. HNSW Paper Core Ideas

Before diving into debugging, let’s establish the foundational concepts from the HNSW paper (Malkov & Yashunin, 2018). Every bug we’ll encounter maps directly to a violation of these principles.

1.1 Layered Skip List + Proximity Graph

HNSW is a multi-layer graph. Upper layers are sparse (for long-range jumps), lower layers are dense (for precise search). Search starts from the entry point at the highest layer and greedily descends.

Layer 2:    [A] ──────────────── [B]              (sparse, long hops)
             │                    │
Layer 1:    [A] ── [C] ── [D] ── [B] ── [E]      (medium density)
             │      │      │      │      │
Layer 0:    [A]-[C]-[F]-[D]-[G]-[B]-[E]-[H]-...  (all nodes)

Intuition: like map navigation — first hop on the highway to get near the target city, then switch to local roads, then walk the streets to the exact address.

1.2 Five Key Parameters

| Parameter | Meaning | Paper Recommendation |
|---|---|---|
| $M$ | Max neighbors per node (non-zero layers) | 12-48 |
| $M_0$ | Max neighbors at Layer 0, $M_0 = 2M$ | Paper §4 explicitly requires this |
| $ef_{Construction}$ | Candidate set size during build | 100-400 |
| $ef$ | Candidate set size during search (beam width) | ≥ k, typically 100-800 |
| $m_L$ | Level assignment factor, $m_L = 1/\ln(M)$ | Computed automatically |

1.3 Two Critical Heaps

Algorithm 2 (search) defines two heaps with strict semantics:

C = min-heap (candidates)   -> expand nearest candidate first
W = max-heap (working set)  -> maintain top-ef results, evict worst

Loop:
  1. Pop nearest from C (min-heap pop)
  2. If dist(c) > worst in W -> stop
  3. For each unvisited neighbor e of c:
  4.   If dist(e) < worst in W, or |W| < ef -> add e to C and W
  5.   If |W| > ef -> pop worst from W (max-heap pop)

This is critical — inverting these heap semantics is the direct root cause of 0% recall.

1.4 Algorithm 4: Diversity Heuristic for SelectNeighbors

The most overlooked contribution of the paper. Neighbor selection is not simply picking the M closest — it must consider spatial diversity:

Algorithm 4: SELECT-NEIGHBORS-HEURISTIC(q, C, M)
  R ← ∅ (result set)
  W ← C (working set, sorted by distance)

  while |W| > 0 AND |R| < M:
    e ← extract nearest from W
    if dist(q, e) < dist(e, r) for all r ∈ R:  // e is not "dominated"
      R ← R ∪ {e}
    else:
      discard e  (already covered by an existing neighbor)

Intuition: if you’re placing cell towers, don’t cluster them all in the city center. Spread them across different directions so every area has coverage.

2. First Shock: 0% Recall Across the Board

2.1 Test Framework

I built a Recall@K test. The idea is straightforward: insert the same vectors into both HNSW and BruteForce, query the same vectors, compute overlap:

$$Recall@k = \frac{|HNSW_{top\text{-}k} \cap BruteForce_{top\text{-}k}|}{k}$$

func calculateRecallAtK(hnswResults, groundTruth []vector.SearchResult, k int) float64 {
    truthSet := make(map[string]bool, len(groundTruth))
    for _, r := range groundTruth {
        truthSet[r.Key] = true
    }
    matches := 0
    for _, r := range hnswResults {
        if truthSet[r.Key] {
            matches++
        }
    }
    return float64(matches) / float64(k)
}

Test matrix: 6 configurations (1K/10K vectors × 128/256/512 dimensions).

2.2 Total Failure

=== RUN   TestHNSWRecallAccuracy/1K-128D
    recall@1   = 0.0000 (0.00%)
    recall@5   = 0.0000 (0.00%)
    recall@10  = 0.0000 (0.00%)
--- FAIL

All 6 configurations, all k values: 0.0000.

Debug output revealed the smoking gun:

Query: test-query-0
  HNSW top-5:  [vec-933(-0.22), vec-071(-0.21), vec-445(-0.19), ...]
  BF   top-5:  [vec-236(+0.30), vec-560(+0.29), vec-112(+0.28), ...]
  Overlap: 0/5 (0%)

Two observations:

  1. Zero overlap — not a sorting issue, entirely different vectors
  2. HNSW returned negative similarities (-0.22) while BF returned positive (+0.30)

HNSW was returning the least similar vectors.

3. Bug #1: Distance vs. Similarity Confusion

3.1 Root Cause

The heap comparison logic assumed lower distance = better (like Euclidean), but the actual metric was DotProduct similarity (higher = better).

// Original — returns raw similarity
func distanceBetween(vec1, vec2 []float32) (float32, error) {
    return vector.DotProduct(vec1, vec2) // range [-1, 1], higher = more similar
}

// Heap — assumes lower = better
func (h candidateHeap) Less(i, j int) bool {
    return h[i].distance > h[j].distance  // max-heap: largest at root
}

The heap evicts the largest value — but the largest similarity is the best result. Search was expanding the worst candidates first.

3.2 Fix

Convert similarity to distance by negation:

func distanceBetween(vec1, vec2 []float32) (float32, error) {
    sim, err := vector.DotProduct(vec1, vec2)
    if err != nil {
        return 0, err
    }
    return -sim, nil  // higher similarity -> lower distance (more negative)
}

Result (recall@1): 0% -> ~20% (1K-128D), ~15% (1K-256D), ~2% (10K-128D). Progress, but recall dropped sharply with higher dimensions and more data. Far from ideal.

4. Bug #2: Inverted Candidate Selection Condition

// Wrong: selects candidates with LARGER (worse) distance
if distance > w[0].distance || len(w) < ef { ... }

// Correct: selects candidates with SMALLER (better) distance
if distance < w[0].distance || len(w) < ef { ... }

Result: ~20% -> ~30% (1K-128D), ~15% -> ~25% (1K-256D), ~2% -> ~5% (10K-128D). Better, but 10K was still terrible.

5. Bugs #3-#6: Systematic Audit Against the Paper

At this point, I did a full code review against the paper.

5.1 Bug #3: candidates Heap Semantics (Most Fatal)

Paper: candidates = min-heap, W = max-heap.

Implementation: Both were max-heaps. Every iteration popped the furthest candidate for expansion instead of the nearest.

Fix: Change candidates to min-heap.

5.2 Bug #4: SelectNeighbors Without Diversity Heuristic

// Original — just truncate
func getNeighbors(candidates []*HNSWNode, m int) []*HNSWNode {
    return candidates[:m]  // Picks M closest — fatal mistake
}

Fix: Implement Algorithm 4 with RNG diversity heuristic — discard candidates dominated by already-selected neighbors:

func (h *HNSWIndex) selectNeighborsHeuristic(
    query []float32, candidates []*HNSWNode, m int,
) []*HNSWNode {
    // candidates must already be sorted by ascending distance to query,
    // so iteration order matches "extract nearest from W" in Algorithm 4.
    selected := make([]*HNSWNode, 0, m)
    for _, e := range candidates {
        if len(selected) >= m { break }
        distQE, _ := distanceBetween(query, e.Vector)
        dominated := false
        for _, r := range selected {
            // e is dominated if some already-selected neighbor r is
            // closer to e than the query is: r covers e's direction.
            distER, _ := distanceBetween(e.Vector, r.Vector)
            if distER < distQE { dominated = true; break }
        }
        if !dominated { selected = append(selected, e) }
    }
    return selected
}

5.3 Bug #5: Layer 0 Not Using M₀ = 2M

Paper §4 explicitly states Layer 0 should use double the neighbor count.

5.4 Bug #6: O(M²) Bubble Sort in pruneNeighbors

Replaced with sort.Slice for O(M log M).

5.5 Additional Fixes

  • Global RNG race: assignLevel() used shared rand.Float64(). Fixed with per-index rand.Rand.
  • Delete dangling pointers: Deletion didn’t clean up neighbor references. Fixed with tombstone flags and reverse-reference cleanup.

6. Results After All Six Fixes

1K-128D:   recall@1  = 100.00%   recall@10 = 100.00%   recall@100 = 100.00%
1K-256D:   recall@1  = 100.00%   recall@10 = 100.00%   recall@100 = 100.00%
10K-128D:  recall@1  = 100.00%   recall@10 = 100.00%   recall@100 = 100.00%
10K-256D:  recall@1  = 99.99%    recall@10 = 99.98%    recall@100 = 99.96% 
10K-512D:  recall@1  = 98.39%    recall@10 = 98.80%    recall@100 = 98.21% 

From 0% across the board to effectively 100% on the 128D/256D configurations and ~98% on 10K-512D. The remaining ~2% on high-dimensional, larger-scale data is bounded by the curse of dimensionality and the ef trade-off, which is within the expected range.

Impact breakdown (using 1K-128D recall@1 as the observation metric):

| Bug | Type | Recall Trend | Paper Section |
|---|---|---|---|
| Distance/similarity confusion | Semantic | 0% -> ~20% | Distance space convention |
| Search condition inverted | Logic | ~20% -> ~30% | Algorithm 2 termination |
| candidates heap semantics | Data structure | ~30% -> ~90% | Algorithm 2 (C = min-heap) |
| No SelectNeighbors diversity | Algorithm gap | ~90% -> ~97% | Algorithm 4 |
| Layer 0 missing M₀=2M | Parameter | ~97% -> ~99% | §4 parameter spec |
| Bubble sort in pruneNeighbors | Performance | No recall impact | Engineering |

7. High-Dimensional Performance: Adaptive ef

After all bugs were fixed, 10K-512D took 512+ seconds. The fix: scale ef inversely with dimension.

$$ef_{adaptive} = \frac{ef_{base}}{\sqrt{dim / dim_{ref}}}$$

func (h *HNSWIndex) adaptiveEf(baseEf, dim int) int {
    const refDim = 128
    if dim <= refDim { return baseEf }
    ef := int(float64(baseEf) / math.Sqrt(float64(dim)/refDim))
    if ef < h.M*2 { ef = h.M * 2 }
    return ef
}

| Dim | baseEf | adaptiveEf | Build Time | Avg Recall |
|---|---|---|---|---|
| 128D | 600 | 600 | ~10s | ~100% |
| 256D | 600 | 424 | ~20s | ~99% |
| 512D | 600 | 300 | 87s (was 512s+) | ~98% |

8. Gap to Industrial Implementations

| Aspect | Current | Industrial (hnswlib/Faiss/Weaviate) |
|---|---|---|
| SelectNeighbors | Algorithm 4 RNG heuristic (done) | Same |
| Layer 0 M₀ | 2×M (done, paper §4) | Same |
| Distance calc | Go ASM + AVX2/AVX512 (done) | C/C++ SIMD, 2-4× faster |
| Concurrency | FNV sharded locks (done) | Lock-free insert / finer-grained locks |
| Memory layout | map[string]*Node pointer chasing | Contiguous arrays, cache-friendly |
| Delete | Soft delete + neighbor cleanup (done) | Lazy deletion + graph reconnection |
| Quantization | None | Product Quantization, 4-16× memory reduction |
| Dynamic ef | Dimension-adaptive (done) | Per-query budget adaptive |
| Persistence | Full snapshot | mmap direct load, sub-second recovery |

The main bottlenecks now are memory layout and quantization — Go’s map + pointer structure is CPU cache-unfriendly, and uncompressed high-dimensional vectors consume significant memory. These are the next optimization targets.

9. Debugging Methodology Takeaways

  1. Write the test framework before the algorithm — HNSW’s Search never errors; it always “succeeds” with wrong results. Without Recall@K, you’d never know.

  2. 0% is easier to debug than 50% — it signals a systematic directional error, not random precision loss.

  3. Cross-reference the paper line by line — one flipped comparison operator is the difference between 0% and 92% recall.

  4. Don’t lower thresholds to “pass” tests — I once lowered the required recall from 85% to 1% to make the test green. That’s hiding the bug, not fixing it.

  5. Fix one variable at a time — the six bugs weren’t found simultaneously. Fix one, re-run, observe the recall delta, then move to the next.


Project: github.com/uzqw/vex — A vector database implemented in Go with HNSW index, BruteForce index and snapshot persistence.

Reference: Malkov, Y.A. & Yashunin, D.A. (2018). “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.” IEEE TPAMI.