Tens of Millions of Vectors in RAG on a Single PostgreSQL with pgvectorscale's StreamingDiskANN and Vector Compression — 28x Faster Than Pinecone

TL;DR: pgvectorscale adds StreamingDiskANN indexing and Statistical Binary Quantization (SBQ) to PostgreSQL, enabling tens of millions of vectors to be processed at 99% recall without a dedicated vector database. According to Timescale's own benchmark (50M vectors, 768 dimensions), this yielded a 28x reduction in p95 latency and 75% cost savings compared to Pinecone.

If you're running a dedicated vector database, you've probably had the same thoughts I have. Staring at a Pinecone bill wondering "do I really need this?", or struggling to keep data in sync between PostgreSQL and Pinecone thinking "can't I just consolidate these into one?" Honestly, I used to think a dedicated vector DB was mandatory. One quick note for those new to this — embeddings are text or images converted into arrays of hundreds to thousands of numbers, and vector search is the technology for quickly finding similarities between those numbers. But moving through 2024 and 2025, the industry mood has noticeably shifted. Confident AI ripped out Pinecone and migrated to pgvector due to cost and operational complexity, and companies like NVIDIA, Reddit, and TripAdvisor are running PostgreSQL-based vector search in production.

At the center of this trend is pgvectorscale. This PostgreSQL extension built by Timescale combines two technologies — StreamingDiskANN indexing and Statistical Binary Quantization (SBQ) — to handle tens of millions of vectors with high accuracy on a single PostgreSQL server. In this article, I'll break down how these two technologies work, how to set them up in practice, and when they actually make a difference, as if talking to a colleague on the same team.

Core Concepts

StreamingDiskANN — Hundreds of Millions of Vectors Without Worrying About Memory

pgvector's HNSW index is excellent but has one critical weakness: the entire index must fit in RAM. For 50M vectors at 1536 dimensions, that's roughly 64GB of RAM. Costs skyrocket as you scale up your server specs.

StreamingDiskANN is a graph-based ANN (Approximate Nearest Neighbor) index — a reimplementation of Microsoft Research's DiskANN algorithm by Timescale, adapted for PostgreSQL. The idea itself is simple and clever.

Core Idea: Only the "hot path" nodes — those frequently visited during graph traversal — are kept in memory, while the rest of the data is streamed from NVMe SSD. As a result, data far exceeding RAM capacity can be processed.

To briefly explain how graph-based ANN works: each vector becomes a node in the graph, and similar vectors are connected by edges. When a query arrives, it starts from an entry node and moves toward candidates progressively closer to the query. DiskANN caches only the critical portions of this traversal path, making it highly memory-efficient.

Index	RAM Required for 50M Vectors (1536 dims)	Max Dimensions	Characteristics
HNSW (pgvector)	~64GB	2,000 dims	Entire index resident in memory
StreamingDiskANN	~1–4GB (for cache; rest on SSD)	16,000 dims	Hot path in memory, rest on SSD

The 16,000-dimension support is also worth noting. Recent LLMs increasingly generate high-dimensional embeddings, and HNSW's 2,000-dimension limit becomes a bottleneck in such scenarios.

Statistical Binary Quantization (SBQ) — Compress Without Sacrificing Accuracy

If StreamingDiskANN solves the memory problem, SBQ addresses index size and traversal speed. The two technologies work together, with SBQ acting as the compression layer for StreamingDiskANN.

Among vector compression techniques, there's Binary Quantization (BQ) — a method that converts each dimension value to 0 or 1, dramatically reducing storage, but with a significant downside: too much information is lost. I've tested BQ in production myself and abandoned it when recall dropped too low.

Statistical Binary Quantization (SBQ), devised by the Timescale research team, solves this problem. While standard BQ fixes the quantization boundary at 0, SBQ analyzes the statistical distribution of each dimension's values and uses the per-dimension mean as the threshold. By finding an optimized boundary for each individual dimension, it achieves far higher accuracy than standard BQ at the same compression ratio.

How SBQ Works: In the first stage of index traversal, compressed binary vectors are used to quickly narrow down candidates. In the second stage, the shortlisted candidates are reranked using the original full-precision vectors to produce the final result. A 2-pass structure that captures both speed and accuracy.

In practice, real-world benchmarks achieve 96–99% recall while significantly reducing index size.

Term clarification — recall: The proportion of truly similar vectors that appear in the search results. 99% recall means 99 out of 100 genuinely similar vectors are found. In ANN search, recall and speed are a trade-off.

Practical Application

Example 1: Building a Basic StreamingDiskANN Index

Installation requires pgvector to be installed first. pgvectorscale operates as a layer on top of pgvector. Let's start by defining the table schema.

sql

-- Basic table structure (based on 1536-dim OpenAI embeddings)
CREATE TABLE documents (
    id        BIGSERIAL    PRIMARY KEY,
    content   TEXT         NOT NULL,
    embedding VECTOR(1536) NOT NULL
);
 
-- Enable pgvector and pgvectorscale extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
 
-- Simplest form of a StreamingDiskANN index
CREATE INDEX ON documents USING diskann (embedding vector_cosine_ops);

In a real production environment, it's better to tune the parameters yourself. Here's the full configuration with SBQ compression enabled:

sql

-- It's recommended to allocate sufficient memory for PostgreSQL index builds before building the index
SET maintenance_work_mem = '4GB';
 
-- Detailed index with SBQ + memory-optimized layout
CREATE INDEX documents_embedding_diskann
ON documents USING diskann (embedding vector_cosine_ops)
WITH (
    storage_layout   = 'memory_optimized',  -- Key option to enable SBQ compression
    num_neighbors    = 50,                  -- Max neighbors per graph node
    search_list_size = 100,                 -- Candidate search list size during build
    max_alpha        = 1.2                  -- Graph pruning coefficient
);

Parameter	Role	Default	Tuning Tip
`storage_layout`	Enables SBQ when set to `memory_optimized`	—	Always set for large datasets
`num_neighbors`	Graph connection density	50	Higher = more accurate but larger index
`search_list_size`	Search range during build	100	Higher = better quality but longer build time
`max_alpha`	Pruning aggressiveness	1.2	Adjust between 1.0–1.5

maintenance_work_mem is the memory limit PostgreSQL allocates for maintenance operations like index builds. I once ignored this setting and was surprised by how much longer the build took than expected. For 10M or more vectors, it's recommended to increase this generously beforehand.

Example 2: Balancing Accuracy and Performance by Tuning Query Parameters

Once the index is created, there are also levers you can adjust at query time. The convenient part is that these can be changed in real time according to your service SLA.

sql

-- Adjust search accuracy per session
-- To apply globally, you can set it at the database level as shown below:
-- ALTER DATABASE mydb SET diskann.query_search_list_size = 100;
SET diskann.query_search_list_size = 100;
-- Higher values = more accurate but increased latency (default: 100)
-- If recall is insufficient, try raising to 150–200
 
SET diskann.query_rescore = 50;
-- Number of candidates for the SBQ reranking stage (default: 50)
-- Setting to 0 skips reranking: faster but lower accuracy
 
-- ANN search based on cosine similarity
-- $1: VECTOR type (query embedding)
SELECT
    id,
    content,
    embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;

query_rescore controls the number of candidates in SBQ's second-stage reranking. In practice, the default value of 50 has worked well for most cases. If you want to apply this across the entire service rather than per-session with SET, you can also specify it at the database level using ALTER DATABASE or ALTER ROLE.

Example 3: Hybrid Search in a RAG Pipeline

Vector search alone sometimes isn't enough. For queries where keywords matter, combining with Full-Text Search (FTS) produces much better results. With PostgreSQL, you can solve this right inside the database.

sql

-- pgvectorscale (vector search) + PostgreSQL FTS (full-text search) hybrid
-- $1: VECTOR type (query embedding)
-- $2: TEXT type (keywords for full-text search, e.g. 'postgresql & vector')
SELECT
    id,
    content,
    ts_rank(to_tsvector('simple', content), query) AS text_score,
    embedding <=> $1                               AS vector_score
FROM
    documents,
    to_tsquery('simple', $2) query
WHERE
    to_tsvector('simple', content) @@ query   -- FTS narrows candidates first
ORDER BY
    vector_score                               -- Final sort by vector similarity
LIMIT 20;

Here text_score appears in the SELECT but is not used for sorting. The intent of this query is to use FTS only as a pre-filter, leaving final ranking to vector similarity. If you need more sophisticated ranking by combining both scores, consider Reciprocal Rank Fusion (RRF) or a weighted sum approach.

With a dedicated vector DB, a query like this would require bouncing between two databases at the application layer. In PostgreSQL, it's done with a single SQL query. This is a situation that comes up frequently in practice, and the simplicity makes a bigger difference than you might expect.

If you're using LangChain or LlamaIndex, you can connect pgvectorscale directly as a vector store via the TimescaleVectorStore integration. The framework handles it automatically with no extra code.

Pros and Cons

Advantages

Item	Details
Memory efficiency	Significantly lower RAM requirements vs. HNSW — 50M vectors without exceeding RAM
Throughput	28x reduction in p95 latency vs. Pinecone at 99% recall, 11.4x QPS advantage over Qdrant (Timescale's own benchmark, 768 dims)
Cost savings	No need to run both a dedicated vector DB and an OLTP DB — ~75% cost reduction on AWS EC2 (Timescale's own benchmark)
Operational simplicity	Reuse existing PostgreSQL backup, replication, and ACID transaction infrastructure as-is
High-dimension support	Supports embeddings up to 16,000 dimensions without HNSW's 2,000-dimension limit
SBQ accuracy	Higher accuracy than standard BQ — achieves 96–99% recall in general benchmarks

Disadvantages and Caveats

Item	Details	Mitigation
Index build time	~40 min on a mid-tier instance for 10M vectors	Set `maintenance_work_mem` to 4GB+ before building
HNSW is faster at small scale	HNSW may outperform in environments with millions of records or fewer and sufficient RAM	Choose index based on data scale
Rust build environment required	pgvectorscale is written in Rust and requires a compilation environment	Use Docker images or package managers
pgvector dependency	pgvector must be installed before pgvectorscale	Can be handled automatically with `CREATE EXTENSION vectorscale CASCADE`
Rerank cost	Higher `query_rescore` values increase latency	Tune to SLA; the default of 50 is generally appropriate
Update-heavy workloads	Bulk updates incur graph index reconstruction costs	Consider `REINDEX` after batch updates

Most Common Mistakes in Practice

Using DiskANN on small datasets. If you have a few million records or fewer and sufficient server RAM, pgvector's HNSW is faster. DiskANN shows its true value with large-scale data that exceeds RAM.
Not checking maintenance_work_mem before building the index. If this value is too small, build time increases significantly. Setting SET maintenance_work_mem = '4GB' before building makes a noticeable difference.
Setting query_rescore to 0 and missing the drop in recall. Turning off reranking during speed optimization causes a visible drop in accuracy. It's strongly recommended to measure recall before and after any such change.

Closing Thoughts

The StreamingDiskANN + SBQ combination in pgvectorscale is a technology that makes you reconsider the premise that "you need a dedicated vector database." If you can maintain 99% recall at tens of millions of vectors while dramatically cutting costs and retaining all the operational benefits of PostgreSQL, there's every reason to take a closer look.

Three steps you can start right now:

Spin up an environment with Docker. Timescale's official Docker image includes both pgvector and pgvectorscale. You can start immediately with docker pull timescale/timescaledb-ha:pg17 and verify index behavior locally without worrying about a build environment. For the list of extensions included per tag, it's safest to check the latest installation guide on the pgvectorscale official GitHub.
Try swapping the index on an existing pgvector table. Create a new index with CREATE INDEX ... USING diskann, then use EXPLAIN ANALYZE to compare the query plan against your existing HNSW. If you have a small amount of data, the query planner may choose a sequential scan over the index — in that case, temporarily setting SET enable_seqscan = off helps you verify index behavior. No data migration required.
Tune diskann.query_search_list_size and diskann.query_rescore and measure the recall-latency trade-off yourself. Finding the right parameter combination for your service SLA is more intuitive than you'd expect. The pgvectorscale GitHub README has clear documentation on each parameter — keep it open alongside you.

References

#pgvectorscale#PostgreSQL#벡터검색#StreamingDiskANN#RAG#벡터데이터베이스#임베딩#양자화#HNSW#하이브리드검색

Database

Tens of Millions of Vectors in RAG on a Single PostgreSQL with pgvectorscale's StreamingDiskANN and Vector Compression — 28x Faster Than Pinecone

TL;DR: pgvectorscale adds StreamingDiskANN indexing and Statistical Binary Quantization (SBQ) to PostgreSQL, enabling tens of millions of vectors to be processed at 99% recall without a dedicated vector database. According to Timescale's own benchmark (50M vectors, 768 dimensions), this yielded a 28x reduction in p95 latency and 75% cost savings compared to Pinecone.

Core Concepts

StreamingDiskANN — Hundreds of Millions of Vectors Without Worrying About Memory

Core Idea: Only the "hot path" nodes — those frequently visited during graph traversal — are kept in memory, while the rest of the data is streamed from NVMe SSD. As a result, data far exceeding RAM capacity can be processed.

Index	RAM Required for 50M Vectors (1536 dims)	Max Dimensions	Characteristics
HNSW (pgvector)	~64GB	2,000 dims	Entire index resident in memory
StreamingDiskANN	~1–4GB (for cache; rest on SSD)	16,000 dims	Hot path in memory, rest on SSD

The 16,000-dimension support is also worth noting. Recent LLMs increasingly generate high-dimensional embeddings, and HNSW's 2,000-dimension limit becomes a bottleneck in such scenarios.

Statistical Binary Quantization (SBQ) — Compress Without Sacrificing Accuracy

If StreamingDiskANN solves the memory problem, SBQ addresses index size and traversal speed. The two technologies work together, with SBQ acting as the compression layer for StreamingDiskANN.

How SBQ Works: In the first stage of index traversal, compressed binary vectors are used to quickly narrow down candidates. In the second stage, the shortlisted candidates are reranked using the original full-precision vectors to produce the final result. A 2-pass structure that captures both speed and accuracy.

In practice, real-world benchmarks achieve 96–99% recall while significantly reducing index size.

Term clarification — recall: The proportion of truly similar vectors that appear in the search results. 99% recall means 99 out of 100 genuinely similar vectors are found. In ANN search, recall and speed are a trade-off.

Practical Application

Example 1: Building a Basic StreamingDiskANN Index

Installation requires pgvector to be installed first. pgvectorscale operates as a layer on top of pgvector. Let's start by defining the table schema.

sql

-- Basic table structure (based on 1536-dim OpenAI embeddings)
CREATE TABLE documents (
    id        BIGSERIAL    PRIMARY KEY,
    content   TEXT         NOT NULL,
    embedding VECTOR(1536) NOT NULL
);
 
-- Enable pgvector and pgvectorscale extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;
 
-- Simplest form of a StreamingDiskANN index
CREATE INDEX ON documents USING diskann (embedding vector_cosine_ops);

In a real production environment, it's better to tune the parameters yourself. Here's the full configuration with SBQ compression enabled:

sql

-- It's recommended to allocate sufficient memory for PostgreSQL index builds before building the index
SET maintenance_work_mem = '4GB';
 
-- Detailed index with SBQ + memory-optimized layout
CREATE INDEX documents_embedding_diskann
ON documents USING diskann (embedding vector_cosine_ops)
WITH (
    storage_layout   = 'memory_optimized',  -- Key option to enable SBQ compression
    num_neighbors    = 50,                  -- Max neighbors per graph node
    search_list_size = 100,                 -- Candidate search list size during build
    max_alpha        = 1.2                  -- Graph pruning coefficient
);

Parameter	Role	Default	Tuning Tip
`storage_layout`	Enables SBQ when set to `memory_optimized`	—	Always set for large datasets
`num_neighbors`	Graph connection density	50	Higher = more accurate but larger index
`search_list_size`	Search range during build	100	Higher = better quality but longer build time
`max_alpha`	Pruning aggressiveness	1.2	Adjust between 1.0–1.5

Example 2: Balancing Accuracy and Performance by Tuning Query Parameters

Once the index is created, there are also levers you can adjust at query time. The convenient part is that these can be changed in real time according to your service SLA.

sql

-- Adjust search accuracy per session
-- To apply globally, you can set it at the database level as shown below:
-- ALTER DATABASE mydb SET diskann.query_search_list_size = 100;
SET diskann.query_search_list_size = 100;
-- Higher values = more accurate but increased latency (default: 100)
-- If recall is insufficient, try raising to 150–200
 
SET diskann.query_rescore = 50;
-- Number of candidates for the SBQ reranking stage (default: 50)
-- Setting to 0 skips reranking: faster but lower accuracy
 
-- ANN search based on cosine similarity
-- $1: VECTOR type (query embedding)
SELECT
    id,
    content,
    embedding <=> $1 AS distance
FROM documents
ORDER BY embedding <=> $1
LIMIT 10;

Example 3: Hybrid Search in a RAG Pipeline

sql

-- pgvectorscale (vector search) + PostgreSQL FTS (full-text search) hybrid
-- $1: VECTOR type (query embedding)
-- $2: TEXT type (keywords for full-text search, e.g. 'postgresql & vector')
SELECT
    id,
    content,
    ts_rank(to_tsvector('simple', content), query) AS text_score,
    embedding <=> $1                               AS vector_score
FROM
    documents,
    to_tsquery('simple', $2) query
WHERE
    to_tsvector('simple', content) @@ query   -- FTS narrows candidates first
ORDER BY
    vector_score                               -- Final sort by vector similarity
LIMIT 20;

Pros and Cons

Advantages

Item	Details
Memory efficiency	Significantly lower RAM requirements vs. HNSW — 50M vectors without exceeding RAM
Throughput	28x reduction in p95 latency vs. Pinecone at 99% recall, 11.4x QPS advantage over Qdrant (Timescale's own benchmark, 768 dims)
Cost savings	No need to run both a dedicated vector DB and an OLTP DB — ~75% cost reduction on AWS EC2 (Timescale's own benchmark)
Operational simplicity	Reuse existing PostgreSQL backup, replication, and ACID transaction infrastructure as-is
High-dimension support	Supports embeddings up to 16,000 dimensions without HNSW's 2,000-dimension limit
SBQ accuracy	Higher accuracy than standard BQ — achieves 96–99% recall in general benchmarks

Disadvantages and Caveats

Item	Details	Mitigation
Index build time	~40 min on a mid-tier instance for 10M vectors	Set `maintenance_work_mem` to 4GB+ before building
HNSW is faster at small scale	HNSW may outperform in environments with millions of records or fewer and sufficient RAM	Choose index based on data scale
Rust build environment required	pgvectorscale is written in Rust and requires a compilation environment	Use Docker images or package managers
pgvector dependency	pgvector must be installed before pgvectorscale	Can be handled automatically with `CREATE EXTENSION vectorscale CASCADE`
Rerank cost	Higher `query_rescore` values increase latency	Tune to SLA; the default of 50 is generally appropriate
Update-heavy workloads	Bulk updates incur graph index reconstruction costs	Consider `REINDEX` after batch updates

Most Common Mistakes in Practice

Using DiskANN on small datasets. If you have a few million records or fewer and sufficient server RAM, pgvector's HNSW is faster. DiskANN shows its true value with large-scale data that exceeds RAM.
Not checking maintenance_work_mem before building the index. If this value is too small, build time increases significantly. Setting SET maintenance_work_mem = '4GB' before building makes a noticeable difference.
Setting query_rescore to 0 and missing the drop in recall. Turning off reranking during speed optimization causes a visible drop in accuracy. It's strongly recommended to measure recall before and after any such change.

Closing Thoughts

Three steps you can start right now:

Spin up an environment with Docker. Timescale's official Docker image includes both pgvector and pgvectorscale. You can start immediately with docker pull timescale/timescaledb-ha:pg17 and verify index behavior locally without worrying about a build environment. For the list of extensions included per tag, it's safest to check the latest installation guide on the pgvectorscale official GitHub.
Try swapping the index on an existing pgvector table. Create a new index with CREATE INDEX ... USING diskann, then use EXPLAIN ANALYZE to compare the query plan against your existing HNSW. If you have a small amount of data, the query planner may choose a sequential scan over the index — in that case, temporarily setting SET enable_seqscan = off helps you verify index behavior. No data migration required.
Tune diskann.query_search_list_size and diskann.query_rescore and measure the recall-latency trade-off yourself. Finding the right parameter combination for your service SLA is more intuitive than you'd expect. The pgvectorscale GitHub README has clear documentation on each parameter — keep it open alongside you.

References

#pgvectorscale#PostgreSQL#벡터검색#StreamingDiskANN#RAG#벡터데이터베이스#임베딩#양자화#HNSW#하이브리드검색

Core Concepts

StreamingDiskANN — Hundreds of Millions of Vectors Without Worrying About Memory

Statistical Binary Quantization (SBQ) — Compress Without Sacrificing Accuracy

Practical Application

Example 1: Building a Basic StreamingDiskANN Index

Example 2: Balancing Accuracy and Performance by Tuning Query Parameters

Example 3: Hybrid Search in a RAG Pipeline

Pros and Cons

Advantages

Disadvantages and Caveats

Most Common Mistakes in Practice

Closing Thoughts

References

Core Concepts

StreamingDiskANN — Hundreds of Millions of Vectors Without Worrying About Memory

Statistical Binary Quantization (SBQ) — Compress Without Sacrificing Accuracy

Practical Application

Example 1: Building a Basic StreamingDiskANN Index

Example 2: Balancing Accuracy and Performance by Tuning Query Parameters

Example 3: Hybrid Search in a RAG Pipeline

Pros and Cons

Advantages

Disadvantages and Caveats

Most Common Mistakes in Practice

Closing Thoughts

References

Recommended Posts

PgBouncer vs Supavisor: PostgreSQL Connection Pooling — What Should You Use in the Serverless Era?

Building a Multi-Tenant SaaS with PostgreSQL Row Level Security (RLS): Code, Pitfalls, and PgBouncer Integration

Implementing Team RBAC with Supabase Auth + RLS — From JWT Claims Design to Role-Based Access Policies

Up to 10 Million Vectors with PostgreSQL Alone — Eliminating the Vector DB with pgvector·pgrag·pgai

Replacing Elasticsearch with PostgreSQL Alone: Hybrid Search with VectorChord-BM25 and RRF

When Does pgvector Hit Its Performance Limits — pgvector vs VectorChord Benchmark Comparison