Tens of Millions of Vectors in RAG on a Single PostgreSQL with pgvectorscale's StreamingDiskANN and Vector Compression — 28x Faster Than Pinecone
TL;DR: pgvectorscale adds StreamingDiskANN indexing and Statistical Binary Quantization (SBQ) to PostgreSQL, enabling tens of millions of vectors to be processed at 99% recall without a dedicated vector database. According to Timescale's own benchmark (50M vectors, 768 dimensions), this yielded a 28x reduction in p95 latency and 75% cost savings compared to Pinecone.
If you're running a dedicated vector database, you've probably had the same thoughts I have. Staring at a Pinecone bill wondering "do I really need this?", or struggling to keep data in sync between PostgreSQL and Pinecone thinking "can't I just consolidate these into one?" Honestly, I used to think a dedicated vector DB was mandatory. One quick note for those new to this — embeddings are text or images converted into arrays of hundreds to thousands of numbers, and vector search is the technology for quickly finding similarities between those numbers. But moving through 2024 and 2025, the industry mood has noticeably shifted. Confident AI ripped out Pinecone and migrated to pgvector due to cost and operational complexity, and companies like NVIDIA, Reddit, and TripAdvisor are running PostgreSQL-based vector search in production.
At the center of this trend is pgvectorscale. This PostgreSQL extension built by Timescale combines two technologies — StreamingDiskANN indexing and Statistical Binary Quantization (SBQ) — to handle tens of millions of vectors with high accuracy on a single PostgreSQL server. In this article, I'll break down how these two technologies work, how to set them up in practice, and when they actually make a difference, as if talking to a colleague on the same team.
Core Concepts
StreamingDiskANN — Hundreds of Millions of Vectors Without Worrying About Memory
pgvector's HNSW index is excellent but has one critical weakness: the entire index must fit in RAM. For 50M vectors at 1536 dimensions, that's roughly 64GB of RAM. Costs skyrocket as you scale up your server specs.
StreamingDiskANN is a graph-based ANN (Approximate Nearest Neighbor) index — a reimplementation of Microsoft Research's DiskANN algorithm by Timescale, adapted for PostgreSQL. The idea itself is simple and clever.
Core Idea: Only the "hot path" nodes — those frequently visited during graph traversal — are kept in memory, while the rest of the data is streamed from NVMe SSD. As a result, data far exceeding RAM capacity can be processed.
To briefly explain how graph-based ANN works: each vector becomes a node in the graph, and similar vectors are connected by edges. When a query arrives, it starts from an entry node and moves toward candidates progressively closer to the query. DiskANN caches only the critical portions of this traversal path, making it highly memory-efficient.
| Index | RAM Required for 50M Vectors (1536 dims) | Max Dimensions | Characteristics |
|---|---|---|---|
| HNSW (pgvector) | ~64GB | 2,000 dims | Entire index resident in memory |
| StreamingDiskANN | ~1–4GB (for cache; rest on SSD) | 16,000 dims | Hot path in memory, rest on SSD |
The 16,000-dimension support is also worth noting. Recent LLMs increasingly generate high-dimensional embeddings, and HNSW's 2,000-dimension limit becomes a bottleneck in such scenarios.
Statistical Binary Quantization (SBQ) — Compress Without Sacrificing Accuracy
If StreamingDiskANN solves the memory problem, SBQ addresses index size and traversal speed. The two technologies work together, with SBQ acting as the compression layer for StreamingDiskANN.
Among vector compression techniques, there's Binary Quantization (BQ) — a method that converts each dimension value to 0 or 1, dramatically reducing storage, but with a significant downside: too much information is lost. I've tested BQ in production myself and abandoned it when recall dropped too low.
Statistical Binary Quantization (SBQ), devised by the Timescale research team, solves this problem. While standard BQ fixes the quantization boundary at 0, SBQ analyzes the statistical distribution of each dimension's values and uses the per-dimension mean as the threshold. By finding an optimized boundary for each individual dimension, it achieves far higher accuracy than standard BQ at the same compression ratio.
How SBQ Works: In the first stage of index traversal, compressed binary vectors are used to quickly narrow down candidates. In the second stage, the shortlisted candidates are reranked using the original full-precision vectors to produce the final result. A 2-pass structure that captures both speed and accuracy.
In practice, real-world benchmarks achieve 96–99% recall while significantly reducing index size.
Term clarification — recall: The proportion of truly similar vectors that appear in the search results. 99% recall means 99 out of 100 genuinely similar vectors are found. In ANN search, recall and speed are a trade-off.
Practical Application
Example 1: Building a Basic StreamingDiskANN Index
Installation requires pgvector to be installed first. pgvectorscale operates as a layer on top of pgvector. Let's start by defining the table schema.
-- Basic table structure (based on 1536-dim OpenAI embeddings)
CREATE TABLE documents (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding VECTOR(1536) NOT NULL
);
-- Enable pgvector and pgvectorscale extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
-- Simplest form of a StreamingDiskANN index
CREATE INDEX ON documents USING diskann (embedding vector_cosine_ops);In a real production environment, it's better to tune the parameters yourself. Here's the full configuration with SBQ compression enabled:
-- It's recommended to allocate sufficient memory for PostgreSQL index builds before building the index
SET maintenance_work_mem = '4GB';
-- Detailed index with SBQ + memory-optimized layout
CREATE INDEX documents_embedding_diskann
ON documents USING diskann (embedding vector_cosine_ops)
WITH (
storage_layout = 'memory_optimized', -- Key option to enable SBQ compression
num_neighbors = 50, -- Max neighbors per graph node
search_list_size = 100, -- Candidate search list size during build
max_alpha = 1.2 -- Graph pruning coefficient
);| Parameter | Role | Default | Tuning Tip |
|---|---|---|---|
storage_layout |
Enables SBQ when set to memory_optimized |
— | Always set for large datasets |
num_neighbors |
Graph connection density | 50 | Higher = more accurate but larger index |
search_list_size |
Search range during build | 100 | Higher = better quality but longer build time |
max_alpha |
Pruning aggressiveness | 1.2 | Adjust between 1.0–1.5 |
maintenance_work_mem is the memory limit PostgreSQL allocates for maintenance operations like index builds. I once ignored this setting and was surprised by how much longer the build took than expected. For 10M or more vectors, it's recommended to increase this generously beforehand.
Example 2: Balancing Accuracy and Performance by Tuning Query Parameters
Once the index is created, there are also levers you can adjust at query time. The convenient part is that these can be changed in real time according to your service SLA.
-- Adjust search accuracy per session
-- To apply globally, you can set it at the database level as shown below:
-- ALTER DATABASE mydb SET diskann.query_search_list_size = 100;
SET diskann.query_search_list_size = 100;
-- Higher values = more accurate but increased latency (default: 100)
-- If recall is insufficient, try raising to 150–200
SET diskann.query_rescore = 50;
-- Number of candidates for the SBQ reranking stage (default: 50)
-- Setting to 0 skips reranking: faster but lower accuracy
-- ANN search based on cosine similarity
-- $1: VECTOR type (query embedding)
SELECT
id,
content,
embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;query_rescore controls the number of candidates in SBQ's second-stage reranking. In practice, the default value of 50 has worked well for most cases. If you want to apply this across the entire service rather than per-session with SET, you can also specify it at the database level using ALTER DATABASE or ALTER ROLE.
Example 3: Hybrid Search in a RAG Pipeline
Vector search alone sometimes isn't enough. For queries where keywords matter, combining with Full-Text Search (FTS) produces much better results. With PostgreSQL, you can solve this right inside the database.
-- pgvectorscale (vector search) + PostgreSQL FTS (full-text search) hybrid
-- $1: VECTOR type (query embedding)
-- $2: TEXT type (keywords for full-text search, e.g. 'postgresql & vector')
SELECT
id,
content,
ts_rank(to_tsvector('simple', content), query) AS text_score,
embedding <=> $1 AS vector_score
FROM
documents,
to_tsquery('simple', $2) query
WHERE
to_tsvector('simple', content) @@ query -- FTS narrows candidates first
ORDER BY
vector_score -- Final sort by vector similarity
LIMIT 20;Here text_score appears in the SELECT but is not used for sorting. The intent of this query is to use FTS only as a pre-filter, leaving final ranking to vector similarity. If you need more sophisticated ranking by combining both scores, consider Reciprocal Rank Fusion (RRF) or a weighted sum approach.
With a dedicated vector DB, a query like this would require bouncing between two databases at the application layer. In PostgreSQL, it's done with a single SQL query. This is a situation that comes up frequently in practice, and the simplicity makes a bigger difference than you might expect.
If you're using LangChain or LlamaIndex, you can connect pgvectorscale directly as a vector store via the TimescaleVectorStore integration. The framework handles it automatically with no extra code.
Pros and Cons
Advantages
| Item | Details |
|---|---|
| Memory efficiency | Significantly lower RAM requirements vs. HNSW — 50M vectors without exceeding RAM |
| Throughput | 28x reduction in p95 latency vs. Pinecone at 99% recall, 11.4x QPS advantage over Qdrant (Timescale's own benchmark, 768 dims) |
| Cost savings | No need to run both a dedicated vector DB and an OLTP DB — ~75% cost reduction on AWS EC2 (Timescale's own benchmark) |
| Operational simplicity | Reuse existing PostgreSQL backup, replication, and ACID transaction infrastructure as-is |
| High-dimension support | Supports embeddings up to 16,000 dimensions without HNSW's 2,000-dimension limit |
| SBQ accuracy | Higher accuracy than standard BQ — achieves 96–99% recall in general benchmarks |
Disadvantages and Caveats
| Item | Details | Mitigation |
|---|---|---|
| Index build time | ~40 min on a mid-tier instance for 10M vectors | Set maintenance_work_mem to 4GB+ before building |
| HNSW is faster at small scale | HNSW may outperform in environments with millions of records or fewer and sufficient RAM | Choose index based on data scale |
| Rust build environment required | pgvectorscale is written in Rust and requires a compilation environment | Use Docker images or package managers |
| pgvector dependency | pgvector must be installed before pgvectorscale | Can be handled automatically with CREATE EXTENSION vectorscale CASCADE |
| Rerank cost | Higher query_rescore values increase latency |
Tune to SLA; the default of 50 is generally appropriate |
| Update-heavy workloads | Bulk updates incur graph index reconstruction costs | Consider REINDEX after batch updates |
Most Common Mistakes in Practice
-
Using DiskANN on small datasets. If you have a few million records or fewer and sufficient server RAM, pgvector's HNSW is faster. DiskANN shows its true value with large-scale data that exceeds RAM.
-
Not checking
maintenance_work_membefore building the index. If this value is too small, build time increases significantly. SettingSET maintenance_work_mem = '4GB'before building makes a noticeable difference. -
Setting
query_rescoreto 0 and missing the drop in recall. Turning off reranking during speed optimization causes a visible drop in accuracy. It's strongly recommended to measure recall before and after any such change.
Closing Thoughts
The StreamingDiskANN + SBQ combination in pgvectorscale is a technology that makes you reconsider the premise that "you need a dedicated vector database." If you can maintain 99% recall at tens of millions of vectors while dramatically cutting costs and retaining all the operational benefits of PostgreSQL, there's every reason to take a closer look.
Three steps you can start right now:
-
Spin up an environment with Docker. Timescale's official Docker image includes both pgvector and pgvectorscale. You can start immediately with
docker pull timescale/timescaledb-ha:pg17and verify index behavior locally without worrying about a build environment. For the list of extensions included per tag, it's safest to check the latest installation guide on the pgvectorscale official GitHub. -
Try swapping the index on an existing pgvector table. Create a new index with
CREATE INDEX ... USING diskann, then useEXPLAIN ANALYZEto compare the query plan against your existing HNSW. If you have a small amount of data, the query planner may choose a sequential scan over the index — in that case, temporarily settingSET enable_seqscan = offhelps you verify index behavior. No data migration required. -
Tune
diskann.query_search_list_sizeanddiskann.query_rescoreand measure the recall-latency trade-off yourself. Finding the right parameter combination for your service SLA is more intuitive than you'd expect. The pgvectorscale GitHub README has clear documentation on each parameter — keep it open alongside you.
References
- pgvectorscale GitHub | timescale/pgvectorscale
- PostgreSQL and Pgvector: Now Faster Than Pinecone, 75% Cheaper, and 100% Open Source | Timescale Medium
- pgvectorscale: An Extension for Improved Vector Search in Postgres | dbvis.com
- Advanced Vector Workloads with pgvectorscale and Hybrid Search | DigitalOcean Official Docs
- Enable and use DiskANN — Azure Database for PostgreSQL | Microsoft Learn
- PostgreSQL Vector Search: VectorChord vs. pgvector vs. pgvectorscale | VectorChord Blog
- pgvectorscale — Accelerating AI development | UnfoldAI