
Six Pitfalls Implementing HNSW: A Debug Journal from 0% to 98%+ Recall

While implementing the HNSW algorithm for my vector database project vex (initial code AI-assisted, debugging and optimization done by hand), I ran my first Recall test — every single metric was 0%. Not 50%, not 30%. Zero. The HNSW results and BruteForce results had absolutely no overlap. This post documents the entire journey from 0% to 98%+ recall, cross-referencing each bug with the original HNSW paper.

1. HNSW Paper Core Ideas

Before diving into debugging, let’s establish the foundational concepts from the HNSW paper (Malkov & Yashunin, 2018). Every bug we’ll encounter maps directly to a violation of these principles.

1.1 Layered Skip List + Proximity Graph

HNSW is a multi-layer graph. Upper layers are sparse (for long-range jumps), lower layers are dense (for precise search). Search starts from the entry point at the highest layer and greedily descends.

Layer 2:    [A] ──────────────── [B]              (sparse, long hops)
             │                    │
Layer 1:    [A] ── [C] ── [D] ── [B] ── [E]      (medium density)
             │      │      │      │      │
Layer 0:    [A]-[C]-[F]-[D]-[G]-[B]-[E]-[H]-...  (all nodes)

Intuition: like map navigation — first hop on the highway to get near the target city, then switch to local roads, then walk the streets to the exact address.

1.2 Five Key Parameters

| Parameter | Meaning | Paper Recommendation |
|---|---|---|
| $M$ | Max neighbors per node (non-zero layers) | 12-48 |
| $M_0$ | Max neighbors at Layer 0, $M_0 = 2M$ | Paper §4 explicitly requires this |
| $ef_{Construction}$ | Candidate set size during build | 100-400 |
| $ef$ | Candidate set size during search (beam width) | ≥ k, typically 100-800 |
| $m_L$ | Level assignment factor, $m_L = 1/\ln(M)$ | Computed automatically |

1.3 Two Critical Heaps

Algorithm 2 (search) defines two heaps with strict semantics:

C = min-heap (candidates)   -> expand nearest candidate first
W = max-heap (working set)  -> maintain top-ef results, evict worst

Loop:
  1. Pop nearest from C (min-heap pop)
  2. If dist(c) > worst in W -> stop
  3. For each unvisited neighbor e of c:
  4.   If dist(e) < worst in W, or |W| < ef -> add e to C and W
  5.   If |W| > ef -> pop worst from W (max-heap pop)

This is critical — inverting these heap semantics is the direct root cause of 0% recall.

1.4 Algorithm 4: Diversity Heuristic for SelectNeighbors

The most overlooked contribution of the paper. Neighbor selection is not simply picking the M closest — it must consider spatial diversity:

Algorithm 4: SELECT-NEIGHBORS-HEURISTIC(q, C, M)
  R ← ∅ (result set)
  W ← C (working set, sorted by distance)

  while |W| > 0 AND |R| < M:
    e ← extract nearest from W
    if dist(q, e) < dist(e, r) for all r ∈ R:  // e is not "dominated"
      R ← R ∪ {e}
    else:
      discard e  (already covered by an existing neighbor)

Intuition: if you’re placing cell towers, don’t cluster them all in the city center. Spread them across different directions so every area has coverage.

2. First Shock: 0% Recall Across the Board

2.1 Test Framework

I built a Recall@K test. The idea is straightforward: insert the same vectors into both HNSW and BruteForce, query the same vectors, compute overlap:

$$Recall@k = \frac{|HNSW_{top\text{-}k} \cap BruteForce_{top\text{-}k}|}{k}$$

func calculateRecallAtK(hnswResults, groundTruth []vector.SearchResult, k int) float64 {
    truthSet := make(map[string]bool, len(groundTruth))
    for _, r := range groundTruth {
        truthSet[r.Key] = true
    }
    matches := 0
    for _, r := range hnswResults {
        if truthSet[r.Key] {
            matches++
        }
    }
    return float64(matches) / float64(k)
}

Test matrix: 6 configurations (1K/10K vectors × 128/256/512 dimensions).

2.2 Total Failure

=== RUN   TestHNSWRecallAccuracy/1K-128D
    recall@1   = 0.0000 (0.00%)
    recall@5   = 0.0000 (0.00%)
    recall@10  = 0.0000 (0.00%)
--- FAIL

All 6 configurations, all k values: 0.0000.

Debug output revealed the smoking gun:

Query: test-query-0
  HNSW top-5:  [vec-933(-0.22), vec-071(-0.21), vec-445(-0.19), ...]
  BF   top-5:  [vec-236(+0.30), vec-560(+0.29), vec-112(+0.28), ...]
  Overlap: 0/5 (0%)

Two observations:

  1. Zero overlap — not a sorting issue, entirely different vectors
  2. HNSW returned negative similarities (-0.22) while BF returned positive (+0.30)

HNSW was returning the least similar vectors.

3. Bug #1: Distance vs. Similarity Confusion

3.1 Root Cause

The heap comparison logic assumed lower distance = better (like Euclidean), but the actual metric was DotProduct similarity (higher = better).

// Original — returns raw similarity
func distanceBetween(vec1, vec2 []float32) (float32, error) {
    return vector.DotProduct(vec1, vec2) // range [-1, 1], higher = more similar
}

// Heap — assumes lower = better
func (h candidateHeap) Less(i, j int) bool {
    return h[i].distance > h[j].distance  // max-heap: largest at root
}

The heap evicts the largest value — but the largest similarity is the best result. Search was expanding the worst candidates first.

3.2 Fix

Convert similarity to distance by negation:

func distanceBetween(vec1, vec2 []float32) (float32, error) {
    sim, err := vector.DotProduct(vec1, vec2)
    if err != nil {
        return 0, err
    }
    return -sim, nil  // higher similarity -> lower distance (more negative)
}

Result (recall@1): 0% -> ~20% (1K-128D), ~15% (1K-256D), ~2% (10K-128D). Progress, but recall dropped sharply with higher dimensions and more data. Far from ideal.

4. Bug #2: Inverted Candidate Selection Condition

// Wrong: selects candidates with LARGER (worse) distance
if distance > w[0].distance || len(w) < ef { ... }

// Correct: selects candidates with SMALLER (better) distance
if distance < w[0].distance || len(w) < ef { ... }

Result: ~20% -> ~30% (1K-128D), ~15% -> ~25% (1K-256D), ~2% -> ~5% (10K-128D). Better, but 10K was still terrible.

5. Bugs #3-#6: Systematic Audit Against the Paper

At this point, I did a full code review against the paper.

5.1 Bug #3: candidates Heap Semantics (Most Fatal)

Paper: candidates = min-heap, W = max-heap.

Implementation: Both were max-heaps. Every iteration popped the furthest candidate for expansion instead of the nearest.

Fix: Change candidates to min-heap.

5.2 Bug #4: SelectNeighbors Without Diversity Heuristic

// Original — just truncate
func getNeighbors(candidates []*HNSWNode, m int) []*HNSWNode {
    return candidates[:m]  // Picks M closest — fatal mistake
}

Fix: Implement Algorithm 4 with RNG diversity heuristic — discard candidates dominated by already-selected neighbors:

func (h *HNSWIndex) selectNeighborsHeuristic(
    query []float32, candidates []*HNSWNode, m int,
) []*HNSWNode {
    // candidates must already be sorted by ascending distance to query,
    // so iteration order matches "extract nearest from W" in Algorithm 4.
    selected := make([]*HNSWNode, 0, m)
    for _, e := range candidates {
        if len(selected) >= m { break }
        distQE, _ := distanceBetween(query, e.Vector)
        dominated := false
        for _, r := range selected {
            // e is dominated if some already-selected neighbor r is
            // closer to e than the query is: r covers e's direction.
            distER, _ := distanceBetween(e.Vector, r.Vector)
            if distER < distQE { dominated = true; break }
        }
        if !dominated { selected = append(selected, e) }
    }
    return selected
}

5.3 Bug #5: Layer 0 Not Using M₀ = 2M

Paper §4 explicitly states Layer 0 should use double the neighbor count.

5.4 Bug #6: O(M²) Bubble Sort in pruneNeighbors

Replaced with sort.Slice for O(M log M).

5.5 Additional Fixes

  • Global RNG race: assignLevel() used shared rand.Float64(). Fixed with per-index rand.Rand.
  • Delete dangling pointers: Deletion didn’t clean up neighbor references. Fixed with tombstone flags and reverse-reference cleanup.

6. Results After All Six Fixes

1K-128D:   recall@1  = 100.00%   recall@10 = 100.00%   recall@100 = 100.00%
1K-256D:   recall@1  = 100.00%   recall@10 = 100.00%   recall@100 = 100.00%
10K-128D:  recall@1  = 100.00%   recall@10 = 100.00%   recall@100 = 100.00%
10K-256D:  recall@1  = 99.99%    recall@10 = 99.98%    recall@100 = 99.96% 
10K-512D:  recall@1  = 98.39%    recall@10 = 98.80%    recall@100 = 98.21% 

From 0% across the board to effectively 100% on the 128D/256D configurations and ~98% on 10K-512D. The remaining ~2% on high-dimensional, larger-scale data is bounded by the curse of dimensionality and the ef trade-off, which is within the expected range.

Impact breakdown (using 1K-128D recall@1 as the observation metric):

| Bug | Type | Recall Trend | Paper Section |
|---|---|---|---|
| Distance/similarity confusion | Semantic | 0% -> ~20% | Distance space convention |
| Search condition inverted | Logic | ~20% -> ~30% | Algorithm 2 termination |
| candidates heap semantics | Data structure | ~30% -> ~90% | Algorithm 2 (C = min-heap) |
| No SelectNeighbors diversity | Algorithm gap | ~90% -> ~97% | Algorithm 4 |
| Layer 0 missing M₀=2M | Parameter | ~97% -> ~99% | §4 parameter spec |
| Bubble sort in pruneNeighbors | Performance | No recall impact | Engineering |

7. High-Dimensional Performance: Adaptive ef

After all bugs were fixed, 10K-512D took 512+ seconds. The fix: scale ef inversely with dimension.

$$ef_{adaptive} = \frac{ef_{base}}{\sqrt{dim / dim_{ref}}}$$

func (h *HNSWIndex) adaptiveEf(baseEf, dim int) int {
    const refDim = 128
    if dim <= refDim { return baseEf }
    ef := int(float64(baseEf) / math.Sqrt(float64(dim)/refDim))
    if ef < h.M*2 { ef = h.M * 2 }
    return ef
}

| Dim | baseEf | adaptiveEf | Build Time | Avg Recall |
|---|---|---|---|---|
| 128D | 600 | 600 | ~10s | ~100% |
| 256D | 600 | 424 | ~20s | ~99% |
| 512D | 600 | 300 | 87s (was 512s+) | ~98% |

8. Gap to Industrial Implementations

| Aspect | Current | Industrial (hnswlib/Faiss/Weaviate) |
|---|---|---|
| SelectNeighbors | Algorithm 4 RNG heuristic (done) | Same |
| Layer 0 M₀ | 2×M (done, paper §4) | Same |
| Distance calc | Go ASM + AVX2/AVX512 (done) | C/C++ SIMD, 2-4× faster |
| Concurrency | FNV sharded locks (done) | Lock-free insert / finer-grained locks |
| Memory layout | map[string]*Node pointer chasing | Contiguous arrays, cache-friendly |
| Delete | Soft delete + neighbor cleanup (done) | Lazy deletion + graph reconnection |
| Quantization | None | Product Quantization, 4-16× memory reduction |
| Dynamic ef | Dimension-adaptive (done) | Per-query budget adaptive |
| Persistence | Full snapshot | mmap direct load, sub-second recovery |

The main bottlenecks now are memory layout and quantization — Go’s map + pointer structure is CPU cache-unfriendly, and uncompressed high-dimensional vectors consume significant memory. These are the next optimization targets.

9. Debugging Methodology Takeaways

  1. Write the test framework before the algorithm — HNSW’s Search never errors; it always “succeeds” with wrong results. Without Recall@K, you’d never know.

  2. 0% is easier to debug than 50% — it signals a systematic directional error, not random precision loss.

  3. Cross-reference the paper line by line — one flipped comparison operator is the difference between 0% and 92% recall.

  4. Don’t lower thresholds to “pass” tests — I once lowered the required recall from 85% to 1% to make the test green. That’s hiding the bug, not fixing it.

  5. Fix one variable at a time — the six bugs weren’t found simultaneously. Fix one, re-run, observe the recall delta, then move to the next.


Project: github.com/uzqw/vex — A vector database implemented in Go with HNSW index, BruteForce index and snapshot persistence.

Reference: Malkov, Y.A. & Yashunin, D.A. (2018). “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.” IEEE TPAMI.