Six Pitfalls Implementing HNSW: A Debug Journal from 0% to 98%+ Recall
While implementing the HNSW algorithm for my vector database project vex (initial code AI-assisted, debugging and optimization done by hand), I ran my first Recall test — every single metric was 0%. Not 50%, not 30%. Zero. The HNSW results and BruteForce results had absolutely no overlap. This post documents the entire journey from 0% to 98%+ recall, cross-referencing each bug with the original HNSW paper.
1. HNSW Paper Core Ideas
Before diving into debugging, let’s establish the foundational concepts from the HNSW paper (Malkov & Yashunin, 2018). Every bug we’ll encounter maps directly to a violation of these principles.
1.1 Layered Skip List + Proximity Graph
HNSW is a multi-layer graph. Upper layers are sparse (for long-range jumps), lower layers are dense (for precise search). Search starts from the entry point at the highest layer and greedily descends.
Intuition: like map navigation — first hop on the highway to get near the target city, then switch to local roads, then walk the streets to the exact address.
1.2 Five Key Parameters
| Parameter | Meaning | Paper Recommendation |
|---|---|---|
| $M$ | Max neighbors per node (non-zero layers) | 12-48 |
| $M_0$ | Max neighbors at Layer 0, $M_0 = 2M$ | Recommended in paper §4 |
| $ef_{Construction}$ | Candidate set size during build | 100-400 |
| $ef$ | Candidate set size during search (beam width) | ≥ k, typically 100-800 |
| $m_L$ | Level assignment factor, $m_L = 1/\ln(M)$ | Computed automatically |
1.3 Two Critical Heaps
Algorithm 2 (search) defines two heaps with strict semantics: `candidates` ($C$) is a min-heap keyed by distance, so each iteration expands the nearest unexplored element; the result set $W$ is a max-heap, so the furthest of the current best $ef$ results can be inspected and evicted cheaply.
This is critical — inverting these heap semantics is the direct root cause of 0% recall.
1.4 Algorithm 4: Diversity Heuristic for SelectNeighbors
The most overlooked contribution of the paper. Neighbor selection is not simply picking the $M$ closest — a candidate is kept only if it is closer to the query than to every already-selected neighbor, which spreads edges across directions instead of clustering them in one dense region.
Intuition: if you’re placing cell towers, don’t cluster them all in the city center. Spread them across different directions so every area has coverage.
2. First Shock: 0% Recall Across the Board
2.1 Test Framework
I built a Recall@K test. The idea is straightforward: insert the same vectors into both HNSW and BruteForce, query the same vectors, compute overlap:
$$Recall@k = \frac{|HNSW_{top\text{-}k} \cap BruteForce_{top\text{-}k}|}{k}$$
Test matrix: 6 configurations (1K/10K vectors × 128/256/512 dimensions).
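The overlap formula above is a few lines of code. A minimal sketch (int IDs for brevity; vex presumably keys vectors differently, but the math is identical):

```go
package main

import "fmt"

// recallAtK computes |hnswTopK ∩ exactTopK| / k for two top-k ID lists.
func recallAtK(hnsw, exact []int, k int) float64 {
	truth := make(map[int]bool, k)
	for _, id := range exact[:k] {
		truth[id] = true
	}
	hits := 0
	for _, id := range hnsw[:k] {
		if truth[id] {
			hits++
		}
	}
	return float64(hits) / float64(k)
}

func main() {
	// 3 of 4 HNSW results appear in the brute-force ground truth.
	fmt.Println(recallAtK([]int{1, 2, 3, 9}, []int{1, 2, 3, 4}, 4)) // 0.75
}
```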
2.2 Total Failure
All 6 configurations, all k values: 0.0000.
Debug output revealed the smoking gun.
Two observations:
- Zero overlap — not a sorting issue, entirely different vectors
- HNSW returned negative similarities (-0.22) while BF returned positive (+0.30)
HNSW was returning the least similar vectors.
3. Bug #1: Distance vs. Similarity Confusion
3.1 Root Cause
The heap comparison logic assumed lower distance = better (like Euclidean), but the actual metric was DotProduct similarity (higher = better).
The heap evicts the largest value — but the largest similarity is the best result. Search was expanding the worst candidates first.
3.2 Fix
Convert similarity to distance by negation.
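A minimal sketch of the fix, assuming a boolean that marks "higher is better" metrics (vex's actual metric enum and function names will differ):

```go
package main

import "fmt"

// toDistance normalizes any score into "lower is better" space, so a
// single heap comparator works for every metric. For similarity metrics
// (DotProduct, Cosine) the score is negated; Euclidean is already a
// distance and passes through unchanged.
func toDistance(score float64, higherIsBetter bool) float64 {
	if higherIsBetter {
		return -score
	}
	return score
}

func main() {
	// A dot product of +0.9 must now rank ahead of +0.3.
	fmt.Println(toDistance(0.9, true) < toDistance(0.3, true)) // true
}
```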
Result (recall@1): 0% -> ~20% (1K-128D), ~15% (1K-256D), ~2% (10K-128D). Progress, but recall dropped sharply with higher dimensions and more data. Far from ideal.
4. Bug #2: Inverted Candidate Selection Condition
Algorithm 2 keeps expanding only while the nearest element of `candidates` is closer to the query than the furthest element of $W$; my implementation had this comparison inverted, discarding exactly the candidates worth exploring.
Result: ~20% -> ~30% (1K-128D), ~15% -> ~25% (1K-256D), ~2% -> ~5% (10K-128D). Better, but 10K was still terrible.
5. Bugs #3-#6: Systematic Audit Against the Paper
At this point, I did a full code review against the paper.
5.1 Bug #3: candidates Heap Semantics (Most Fatal)
Paper: candidates = min-heap, W = max-heap.
Implementation: Both were max-heaps. Every iteration popped the furthest candidate for expansion instead of the nearest.
Fix: Change candidates to min-heap.
5.2 Bug #4: SelectNeighbors Without Diversity Heuristic
The implementation simply took the $M$ closest candidates, skipping the paper's diversity heuristic entirely.
Fix: Implement Algorithm 4 with the RNG diversity heuristic — discard candidates dominated by already-selected neighbors.
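A standalone sketch of the Algorithm 4 rule (function names and the L2 metric are assumptions, not vex's code). Candidates are assumed pre-sorted by distance to the query, as in the paper:

```go
package main

import "fmt"

// selectNeighborsHeuristic walks candidates in order of increasing
// distance to q and keeps e only if e is closer to q than to every
// neighbor already kept — i.e. no selected neighbor "dominates" e.
func selectNeighborsHeuristic(q []float64, candidates [][]float64, m int) [][]float64 {
	selected := make([][]float64, 0, m)
	for _, e := range candidates {
		if len(selected) == m {
			break
		}
		keep := true
		for _, s := range selected {
			if l2(e, s) < l2(e, q) { // e dominated by an existing neighbor
				keep = false
				break
			}
		}
		if keep {
			selected = append(selected, e)
		}
	}
	return selected
}

// l2 returns squared Euclidean distance (same ordering, no sqrt needed).
func l2(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

func main() {
	q := []float64{0, 0}
	// Two candidates clustered to the right of q, one off to the left.
	cands := [][]float64{{1, 0}, {1.2, 0.1}, {-1.5, 0}}
	// Naive top-2 would take both right-side points; the heuristic
	// drops the dominated one and keeps the left point for coverage.
	fmt.Println(selectNeighborsHeuristic(q, cands, 2))
}
```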
5.3 Bug #5: Layer 0 Not Using M₀ = 2M
Paper §4 explicitly states Layer 0 should use double the neighbor count.
5.4 Bug #6: O(M²) Bubble Sort in pruneNeighbors
Replaced with sort.Slice for O(M log M).
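A sketch of the replacement (names and signature are illustrative, not vex's actual `pruneNeighbors`):

```go
package main

import (
	"fmt"
	"sort"
)

// pruneNeighbors trims a neighbor list to at most maxM entries, keeping
// the closest. sort.Slice is O(n log n) vs the original O(n²) bubble sort.
func pruneNeighbors(ids []int, dist map[int]float64, maxM int) []int {
	sort.Slice(ids, func(i, j int) bool { return dist[ids[i]] < dist[ids[j]] })
	if len(ids) > maxM {
		ids = ids[:maxM]
	}
	return ids
}

func main() {
	d := map[int]float64{7: 0.9, 8: 0.1, 9: 0.5}
	fmt.Println(pruneNeighbors([]int{7, 8, 9}, d, 2)) // [8 9]
}
```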
5.5 Additional Fixes
- Global RNG race: `assignLevel()` used the shared `rand.Float64()`. Fixed with a per-index `rand.Rand`.
- Delete dangling pointers: deletion didn't clean up neighbor references. Fixed with tombstone flags and reverse-reference cleanup.
6. Results After All Six Fixes
From 0% across the board to near 100% on 1K-10K/128D-256D, ~98% on 10K-512D. The last 2% on high-dimensional large-scale data is bounded by the curse of dimensionality and ef trade-offs — within expected range.
Impact breakdown (using 1K-128D recall@1 as the observation metric):
| Bug | Type | recall Trend | Paper Section |
|---|---|---|---|
| Distance/similarity confusion | Semantic | 0% -> ~20% | Distance space convention |
| Search condition inverted | Logic | ~20% -> ~30% | Algorithm 2 termination |
| candidates heap semantics | Data structure | ~30% -> ~90% | Algorithm 2 (C = min-heap) |
| No SelectNeighbors diversity | Algorithm gap | ~90% -> ~97% | Algorithm 4 |
| Layer 0 missing M₀=2M | Parameter | ~97% -> ~99% | §4 parameter spec |
| Bubble sort in pruneNeighbors | Performance | No recall impact | Engineering |
7. High-Dimensional Performance: Adaptive ef
After all bugs were fixed, the 10K-512D build took 512+ seconds. The fix: scale $ef$ down by the square root of the dimension ratio.
$$ef_{adaptive} = \frac{ef_{base}}{\sqrt{dim / dim_{ref}}}$$
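In code, the formula is a one-liner (function and parameter names are assumptions; `dimRef = 128` is the baseline dimension from the benchmark table below):

```go
package main

import (
	"fmt"
	"math"
)

// adaptiveEf scales the search beam width down as dimensionality grows:
// efBase / sqrt(dim / dimRef), capped at efBase for low dimensions.
func adaptiveEf(efBase, dim, dimRef int) int {
	ef := float64(efBase) / math.Sqrt(float64(dim)/float64(dimRef))
	if ef > float64(efBase) {
		ef = float64(efBase)
	}
	return int(ef)
}

func main() {
	for _, d := range []int{128, 256, 512} {
		fmt.Println(d, adaptiveEf(600, d, 128))
	}
	// 128 → 600, 256 → 424, 512 → 300
}
```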
| Dim | baseEf | adaptiveEf | Build Time | Avg Recall |
|---|---|---|---|---|
| 128D | 600 | 600 | ~10s | ~100% |
| 256D | 600 | 424 | ~20s | ~99% |
| 512D | 600 | 300 | 87s (was 512s+) | ~98% |
8. Gap to Industrial Implementations
| Dimension | Current | Industrial (hnswlib/Faiss/Weaviate) |
|---|---|---|
| SelectNeighbors | Algorithm 4 RNG heuristic (done) | Same |
| Layer 0 M₀ | 2×M (done, paper §4) | Same |
| Distance calc | Go ASM + AVX2/AVX512 (done) | C/C++ SIMD, 2-4× faster |
| Concurrency | FNV sharded locks (done) | Lock-free insert / finer-grained locks |
| Memory layout | map[string]*Node pointer chasing | Contiguous arrays, cache-friendly |
| Delete | Soft delete + neighbor cleanup (done) | Lazy deletion + graph reconnection |
| Quantization | None | Product Quantization, 4-16× memory reduction |
| Dynamic ef | Dimension-adaptive (done) | Per-query budget adaptive |
| Persistence | Full snapshot | mmap direct load, sub-second recovery |
The main bottlenecks now are memory layout and quantization — Go’s map + pointer structure is CPU cache-unfriendly, and uncompressed high-dimensional vectors consume significant memory. These are the next optimization targets.
9. Debugging Methodology Takeaways
- Write the test framework before the algorithm — HNSW's Search never errors; it always "succeeds" with wrong results. Without Recall@K, you'd never know.
- 0% is easier to debug than 50% — it signals a systematic directional error, not random precision loss.
- Cross-reference the paper line by line — one flipped comparison operator is the difference between 0% and 92% recall.
- Don't lower thresholds to "pass" tests — I once lowered the required recall from 85% to 1% to make the test green. That's hiding the bug, not fixing it.
- Fix one variable at a time — the six bugs weren't found simultaneously. Fix one, re-run, observe the recall delta, then move to the next.
Project: github.com/uzqw/vex — A vector database implemented in Go with HNSW index, BruteForce index and snapshot persistence.
Reference: Malkov, Y.A. & Yashunin, D.A. (2018). “Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.” IEEE TPAMI.