Up to 10 Million Vectors with PostgreSQL Alone — Eliminating the Vector DB with pgvector·pgrag·pgai
Creating a new Pinecone account, tweaking LangChain config files, running a separate embedding generation script, which is then wired up to a Redis queue... When I first built a RAG pipeline, I took this kind of setup for granted. Then, while initially embedding tens of thousands of customer documents for a B2B SaaS project, the Pinecone index costs came in at three times what I expected — and on top of that, data inconsistencies started appearing between the vector DB and the source DB. That's when I first thought, "Do I really need all of this?"
This article covers the architecture of a PostgreSQL-only RAG pipeline, the role of each extension — pgvector, pgrag, and pgai — and how far you can realistically trust it in production. The target audience is backend and full-stack developers who are reasonably comfortable with SQL and have some experience operating PostgreSQL. The core idea is a deliberate choice to "eliminate one system to operate, in exchange for getting the most out of SQL's expressive power." For services with roughly 10 million vectors or fewer, you can achieve sufficient performance at far lower complexity and cost than combining a dedicated vector DB with external orchestration.
The short answer: yes, it works. And quite well, at that. With the stabilization of HNSW support in pgvector 0.6, the GA release of pgai Vectorizer, and the arrival of pgrag, it's now a realistic option to complete the entire pipeline without ever leaving psql.
Core Concepts
The RAG Pipeline at a Glance
RAG (Retrieval-Augmented Generation) is a pattern where, before sending a question to an LLM, you first retrieve relevant context from your own data and inject it into the prompt. Most requirements along the lines of "I want GPT to answer based on our company's documents" are solved with this pattern.
Document input (PDF, HTML, DOCX...)
↓
Chunking — split into appropriate sizes (pgrag: chunks_by_token_count)
↓
Embedding generation — text → vector (pgrag / pgai Vectorizer)
↓
Vector storage & indexing (pgvector: HNSW index)
↓
Retrieval — vector similarity + FTS + trigrams (RRF ensemble)
↓
Reranking — precise reordering of candidates (pgrag: rerank_score)
↓
LLM call → response generation (pgai: openai_chat)Here's a comparison of what changes when you handle this entire pipeline with PostgreSQL alone:
| Stage | Traditional Tools | PostgreSQL Only |
|---|---|---|
| Chunking | LangChain TextSplitter | pgrag token-based chunking functions |
| Embedding generation | OpenAI API + Python script | pgai / pgrag SQL functions |
| Vector storage & indexing | Pinecone, Qdrant, Weaviate | pgvector HNSW index |
| Retrieval | Vector DB query API | ORDER BY embedding <=> query_vec |
| Reranking | Cohere Rerank API | pgrag reranking SQL functions |
| Generation | LangChain / LlamaIndex | pgai LLM call SQL functions |
Three Extensions, One-Line Definitions
| Extension | Creator | Role |
|---|---|---|
| pgvector | Open-source community | Vector types (vector, halfvec) + HNSW·IVFFlat indexes |
| pgrag | Neon (experimental) | PDF parsing, chunking, local embeddings, reranking, LLM calls — all as SQL functions |
| pgai Vectorizer | Timescale | Automatic embedding synchronization via table triggers |
pgrag maturity warning: pgrag is an experimental extension published by Neon, and as of 2025 it has been validated primarily in Neon's serverless environment. Its production stability is lower than pgvector's, so thorough testing is required before introducing it to a self-hosted environment.
pgvector — The Most Fundamental Building Block
pgvector is an extension that adds vector types and indexes to PostgreSQL. Once installed, you can declare column types like vector(1536) and perform cosine similarity search with the <=> operator.
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(1536)
);
-- HNSW index: can be created even before data exists
CREATE INDEX ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);HNSW vs IVFFlat: HNSW (Hierarchical Navigable Small World) is a graph-based index that can be created even before data is inserted. IVFFlat requires data to exist first for cluster partitioning, making it a poor fit for blue/green deployments. Since pgvector 0.6, official documentation recommends HNSW as the default.
Practical Application
Example 1: A Complete RAG Pipeline in psql Alone — pgrag + DeepSeek
I'll admit, at first I thought "reading a PDF with SQL?" — but once you actually run it, it feels surprisingly natural. You can try it immediately on Neon serverless PostgreSQL or a local instance with pgrag installed.
-- 1. Install extensions
CREATE EXTENSION IF NOT EXISTS rag CASCADE;
CREATE EXTENSION IF NOT EXISTS rag_bge_small_en_v15;
CREATE EXTENSION IF NOT EXISTS rag_jina_reranker_v1_tiny_en;
-- 2. Table for storing chunks
CREATE TABLE chunks (
id BIGSERIAL PRIMARY KEY,
content TEXT NOT NULL,
embedding vector(384) -- bge-small-en-v1.5 dimensions
);
-- 3. Read PDF, chunk it, and insert
-- pg_read_binary_file requires superuser or pg_read_server_files role
INSERT INTO chunks (content)
SELECT chunk
FROM chunks_by_token_count(
convert_pdf_to_text(pg_read_binary_file('/data/manual.pdf')),
192, -- max_tokens
8 -- overlap
);
-- 4. Generate embeddings in batch (passage mode — for storage)
UPDATE chunks
SET embedding = embedding_for_passage(content)
WHERE embedding IS NULL;
-- 5. Retrieval + reranking: ORDER BY rerank_score DESC is required
WITH candidates AS (
SELECT content
FROM chunks
ORDER BY embedding <=> embedding_for_query('HNSW index tuning')
LIMIT 10
),
reranked AS (
SELECT content,
rerank_score(content, 'HNSW index tuning') AS score
FROM candidates
WHERE rerank_score(content, 'HNSW index tuning') > 0.3
ORDER BY score DESC
LIMIT 5
)
-- 6. Inject context and call LLM
SELECT openai_chat(
'deepseek-chat',
prompt_with_context(
'How do I tune PostgreSQL HNSW indexes?',
(SELECT string_agg(content, E'\n---\n') FROM reranked)
)
);| Function | Role |
|---|---|
convert_pdf_to_text() |
Extracts text from a PDF binary |
chunks_by_token_count() |
Sliding window chunking by token count |
embedding_for_passage() |
Embedding for storage (includes BGE asymmetric padding) |
embedding_for_query() |
Embedding for queries (must be kept separate from passage) |
rerank_score() |
Calculates relevance score using jina-reranker |
prompt_with_context() |
Injects search results into a prompt template |
pg_read_binary_filepermissions: This function requires superuser or thepg_read_server_filesrole to execute. Calling it from a regular application account will result in a permission error. In production, consider using a dedicated migration account with file access privileges, or having the application read the file content and pass it directly asBYTEA.
Example 2: Automatic Embedding Synchronization with pgai Vectorizer
If you want embeddings to be generated automatically every time a new post is added to the documents table, here's how to do it. The most surprising thing when I first used this approach was that I didn't need to change a single line of application code.
-- Register a Vectorizer — one line of SQL
SELECT ai.create_vectorizer(
'documents'::regclass,
embedding => ai.embedding_openai('text-embedding-3-small', 1536),
chunking => ai.chunking_recursive_character_text_splitter(
'content',
512, -- chunk_size (tokens)
50 -- overlap
),
destination => 'documents_embedding_store'
);After registration, whenever an INSERT occurs on the documents table, a separately running pgai worker container asynchronously generates embeddings and writes them to documents_embedding_store. Note that if you run the SQL above without the worker, no embeddings will be generated.
# Run the pgai worker container (Docker example)
docker run -d \
-e PGAI_VECTORIZER_WORKER_DB_URL="postgres://user:pass@host:5432/db" \
-e OPENAI_API_KEY="sk-..." \
timescale/pgai-vectorizer-worker:latestLSN (Log Sequence Number) tracking: pgai Vectorizer tracks the PostgreSQL WAL (Write-Ahead Log) position to prevent duplicate processing. Even if the worker crashes and restarts, it knows exactly where it left off, so there is no risk of lost or duplicate embeddings.
Example 3: Using SQL JOINs in Multi-Tenant RAG
Metadata filtering in a dedicated vector DB is more cumbersome than you'd expect. If you want to filter by tenant ID while also applying a date range, it often requires special handling. In PostgreSQL, it's just a JOIN. This is the most common situation I encounter in practice, and I think it's the biggest practical advantage of the standalone setup.
SELECT
c.content,
c.embedding <=> $2 AS distance
FROM document_chunks c
JOIN documents d ON c.document_id = d.id
WHERE d.tenant_id = $1
AND d.created_at > NOW() - INTERVAL '30 days'
AND d.category = ANY($3)
ORDER BY distance
LIMIT 5;Tenant isolation, date filtering, and category filtering are all expressed as index predicates and processed within a single transaction. This is far more efficient than the post-filter approach used by dedicated vector DBs. If you want to verify performance directly, prepend EXPLAIN (ANALYZE, BUFFERS) to check whether an Index Scan is being used properly.
Example 4: Improving Recall with Hybrid Search
Searching with vector similarity alone can miss cases where keywords match exactly. Conversely, using only BM25 full-text search fails to capture semantically similar expressions. Combining all three tends to improve Recall@10 from ~0.62 to ~0.84 on English QA benchmarks.
If this is your first time: This query is a combination of three CTEs + FULL OUTER JOIN + COALESCE, which can feel intimidating at first glance. It's recommended to implement vector search alone first, then incrementally add FTS and trigrams, observing the change in Recall at each step.
-- Hybrid search based on RRF (Reciprocal Rank Fusion)
-- For Korean documents: plainto_tsquery('korean', ...) with the default parser
-- does not perform morphological analysis — consider installing pg_bigm
WITH
vector_ranked AS (
SELECT id, content,
ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
FROM document_chunks
LIMIT 60
),
fts_ranked AS (
SELECT id, content,
ROW_NUMBER() OVER (ORDER BY ts_rank(fts_vector, query) DESC) AS rank
FROM document_chunks,
plainto_tsquery('korean', $2) query
WHERE fts_vector @@ query
LIMIT 60
),
trigram_ranked AS (
SELECT id, content,
ROW_NUMBER() OVER (ORDER BY similarity(content, $2) DESC) AS rank
FROM document_chunks
WHERE content % $2
LIMIT 60
)
SELECT
COALESCE(v.id, f.id, t.id) AS id,
COALESCE(v.content, f.content, t.content) AS content,
-- RRF score: sum of reciprocal ranks (k=60 is a common default)
COALESCE(1.0 / (60 + v.rank), 0) +
COALESCE(1.0 / (60 + f.rank), 0) +
COALESCE(1.0 / (60 + t.rank), 0) AS rrf_score
FROM vector_ranked v
FULL OUTER JOIN fts_ranked f ON v.id = f.id
FULL OUTER JOIN trigram_ranked t ON v.id = t.id
ORDER BY rrf_score DESC
LIMIT 10;RRF (Reciprocal Rank Fusion): An ensemble method that determines final rankings by summing the reciprocal ranks from multiple search strategies in the form
1 / (k + rank). k=60 is a commonly well-performing default value.
Pros and Cons Analysis
Advantages
The most frequent question I get is "why not just use Pinecone?" — but in terms of cost and operational complexity, a PostgreSQL-only setup is often practically advantageous.
| Item | Details |
|---|---|
| Operational simplicity | No need for a vector DB, message queue, or separate orchestration service. The number of things to deploy and monitor is reduced to one. |
| Transactional consistency | Embeddings and source data are managed within the same transaction, eliminating data inconsistencies caused by partial failures. |
| SQL JOIN filtering | Complex filters like tenant, date, and category can be expressed as index predicates, making it more flexible than metadata filtering in a dedicated vector DB. |
| Cost | No additional infrastructure costs. You can leverage your existing PostgreSQL instance as-is. |
| Security | Sensitive data never leaves the DB boundary, making regulatory compliance easier. |
| Built-in hybrid search | tsvector, pg_trgm, and pgvector can be freely combined in SQL. |
Drawbacks and Caveats
On the flip side, the HNSW memory issue is the most commonly encountered operational problem in practice. The typical pattern is things working fine at first, then queries suddenly slowing down as data accumulates.
| Item | Details | Mitigation |
|---|---|---|
| Horizontal scaling limits | PostgreSQL is fundamentally a single-node architecture. Once vectors exceed tens of millions, distributed sharding becomes necessary. | PostgreSQL up to ~10M vectors; consider Qdrant or Weaviate beyond that |
| HNSW memory requirements | The HNSW index must reside in RAM for peak performance. If the index size exceeds available memory, I/O latency spikes dramatically. | Use pgvectorscale's StreamingDiskANN, or use halfvec quantization to cut memory usage in half |
| Initial embedding generation load | CPU and API costs are concentrated when first embedding large volumes of data. | Leverage pgai Vectorizer's batch processing and rate limit handling |
| Index build time | Building an HNSW index over millions of vectors can take tens of minutes to hours. | Tune with SET maintenance_work_mem = '4GB'; use blue/green deployment to switch without service interruption |
halfvec: A pgvector type that stores vectors as 16-bit floating point. Compared to the default
vector(32-bit), RAM usage is halved. Combined with Matryoshka embeddings, query latency of under 10ms can be maintained when processing millions of records on a single server.
3 Common Pitfalls
These are issues I've encountered directly in practice, or heard most frequently from colleagues.
-
Confusing
embedding_for_passage()andembedding_for_query()— BGE-family models use an asymmetric approach where different prefixes are applied to storage embeddings (passage) and query embeddings (query). Think of it as optimizing for the full document context when storing, and for the question pattern when searching. Using the same function for both significantly distorts similarity scores. -
Creating the HNSW index after data has been inserted — Building an index on a table that already contains millions of rows requires a lot of server memory and time. Whenever possible, create the index immediately after the table is created, before loading data.
-
Using only vector search and skipping hybrid search — Honestly, when building an initial prototype, vector search alone seems sufficient. But once exact product names, codenames, or abbreviations are mixed in, Recall drops noticeably without BM25 and trigram search. It is recommended to set up hybrid search from the beginning.
Closing Thoughts
A PostgreSQL-only RAG pipeline is a deliberate choice to "eliminate one system to operate, in exchange for getting the most out of SQL's expressive power." For services with roughly 10 million vectors or fewer, you can achieve sufficient performance at far lower complexity and cost than combining a dedicated vector DB with external orchestration.
This is where I started, and following this order will save you a lot of trial and error.
- Create a Neon free account and run
CREATE EXTENSION IF NOT EXISTS vector;— pgvector, pgrag, and pgai are all pre-installed, so you can follow Example 1 immediately. In a local environment, pgrag and pgai require separate builds, so starting with Neon is much easier at first. - Connect automatic embeddings with a single line:
SELECT ai.create_vectorizer('documents'::regclass, ...)— After this, embeddings are generated automatically with every INSERT. Note that the pgai worker container must be running alongside it for this to actually work, so refer to the Docker command in Example 2 as well. - Implement basic search with
ORDER BY embedding <=> query_vec LIMIT 10, then gradually migrate to RRF hybrid — At each stage, verify performance withEXPLAIN (ANALYZE, BUFFERS)and directly observe the change in Recall as you addtsvectorandpg_trgm.
References
- Build an end-to-end RAG pipeline entirely in psql using pgrag and DeepSeek | Neon
- GitHub — neondatabase/pgrag
- pgai: Transforming PostgreSQL into a Production-Ready AI Retrieval Engine | BrightCoding
- GitHub — timescale/pgai
- Building Hybrid Search for RAG: Combining pgvector and Full-Text Search with RRF | DEV Community
- pg_trgm + pgvector Hybrid Retrieval: Build Better RAG in Postgres (2026) | CallSphere
- pgai Vectorizer: Automate AI Embeddings With One SQL Command | DEV Community
- pgvector vs Pinecone: Which Vector Database to Choose in 2026 | Encore
- pgvector, pgvectorscale and the Postgres Vector Search Stack Explained | SoftwareSeni
- Boring RAG: When similarity is just a SQL query | Red Hat Developer
- Building a RAG Server with PostgreSQL - Part 1 | pgEdge