Up to 10 Million Vectors with PostgreSQL Alone — Eliminating the Vector DB with pgvector·pgrag·pgai

Creating a new Pinecone account, tweaking LangChain config files, running a separate embedding generation script, which is then wired up to a Redis queue... When I first built a RAG pipeline, I took this kind of setup for granted. Then, while initially embedding tens of thousands of customer documents for a B2B SaaS project, the Pinecone index costs came in at three times what I expected — and on top of that, data inconsistencies started appearing between the vector DB and the source DB. That's when I first thought, "Do I really need all of this?"

This article covers the architecture of a PostgreSQL-only RAG pipeline, the role of each extension — pgvector, pgrag, and pgai — and how far you can realistically trust it in production. The target audience is backend and full-stack developers who are reasonably comfortable with SQL and have some experience operating PostgreSQL. The core idea is a deliberate choice to "eliminate one system to operate, in exchange for getting the most out of SQL's expressive power." For services with roughly 10 million vectors or fewer, you can achieve sufficient performance at far lower complexity and cost than combining a dedicated vector DB with external orchestration.

The short answer: yes, it works. And quite well, at that. With the stabilization of HNSW support in pgvector 0.6, the GA release of pgai Vectorizer, and the arrival of pgrag, it's now a realistic option to complete the entire pipeline without ever leaving psql.

Core Concepts

The RAG Pipeline at a Glance

RAG (Retrieval-Augmented Generation) is a pattern where, before sending a question to an LLM, you first retrieve relevant context from your own data and inject it into the prompt. Most requirements along the lines of "I want GPT to answer based on our company's documents" are solved with this pattern.

Document input (PDF, HTML, DOCX...)
    ↓
Chunking — split into appropriate sizes (pgrag: chunks_by_token_count)
    ↓
Embedding generation — text → vector (pgrag / pgai Vectorizer)
    ↓
Vector storage & indexing (pgvector: HNSW index)
    ↓
Retrieval — vector similarity + FTS + trigrams (RRF ensemble)
    ↓
Reranking — precise reordering of candidates (pgrag: rerank_score)
    ↓
LLM call → response generation (pgai: openai_chat)

Here's a comparison of what changes when you handle this entire pipeline with PostgreSQL alone:

Stage	Traditional Tools	PostgreSQL Only
Chunking	LangChain TextSplitter	`pgrag` token-based chunking functions
Embedding generation	OpenAI API + Python script	`pgai` / `pgrag` SQL functions
Vector storage & indexing	Pinecone, Qdrant, Weaviate	`pgvector` HNSW index
Retrieval	Vector DB query API	`ORDER BY embedding <=> query_vec`
Reranking	Cohere Rerank API	`pgrag` reranking SQL functions
Generation	LangChain / LlamaIndex	`pgai` LLM call SQL functions

Three Extensions, One-Line Definitions

Extension	Creator	Role
pgvector	Open-source community	Vector types (`vector`, `halfvec`) + HNSW·IVFFlat indexes
pgrag	Neon (experimental)	PDF parsing, chunking, local embeddings, reranking, LLM calls — all as SQL functions
pgai Vectorizer	Timescale	Automatic embedding synchronization via table triggers

pgrag maturity warning: pgrag is an experimental extension published by Neon, and as of 2025 it has been validated primarily in Neon's serverless environment. Its production stability is lower than pgvector's, so thorough testing is required before introducing it to a self-hosted environment.

pgvector — The Most Fundamental Building Block

pgvector is an extension that adds vector types and indexes to PostgreSQL. Once installed, you can declare column types like vector(1536) and perform cosine similarity search with the <=> operator.

sql

CREATE EXTENSION IF NOT EXISTS vector;
 
CREATE TABLE document_chunks (
  id        BIGSERIAL PRIMARY KEY,
  content   TEXT NOT NULL,
  embedding vector(1536)
);
 
-- HNSW index: can be created even before data exists
CREATE INDEX ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

HNSW vs IVFFlat: HNSW (Hierarchical Navigable Small World) is a graph-based index that can be created even before data is inserted. IVFFlat requires data to exist first for cluster partitioning, making it a poor fit for blue/green deployments. Since pgvector 0.6, official documentation recommends HNSW as the default.

Practical Application

Example 1: A Complete RAG Pipeline in psql Alone — pgrag + DeepSeek

I'll admit, at first I thought "reading a PDF with SQL?" — but once you actually run it, it feels surprisingly natural. You can try it immediately on Neon serverless PostgreSQL or a local instance with pgrag installed.

sql

-- 1. Install extensions
CREATE EXTENSION IF NOT EXISTS rag CASCADE;
CREATE EXTENSION IF NOT EXISTS rag_bge_small_en_v15;
CREATE EXTENSION IF NOT EXISTS rag_jina_reranker_v1_tiny_en;
 
-- 2. Table for storing chunks
CREATE TABLE chunks (
  id        BIGSERIAL PRIMARY KEY,
  content   TEXT NOT NULL,
  embedding vector(384)  -- bge-small-en-v1.5 dimensions
);
 
-- 3. Read PDF, chunk it, and insert
--    pg_read_binary_file requires superuser or pg_read_server_files role
INSERT INTO chunks (content)
SELECT chunk
FROM chunks_by_token_count(
  convert_pdf_to_text(pg_read_binary_file('/data/manual.pdf')),
  192,  -- max_tokens
  8     -- overlap
);
 
-- 4. Generate embeddings in batch (passage mode — for storage)
UPDATE chunks
SET embedding = embedding_for_passage(content)
WHERE embedding IS NULL;
 
-- 5. Retrieval + reranking: ORDER BY rerank_score DESC is required
WITH candidates AS (
  SELECT content
  FROM chunks
  ORDER BY embedding <=> embedding_for_query('HNSW index tuning')
  LIMIT 10
),
reranked AS (
  SELECT content,
         rerank_score(content, 'HNSW index tuning') AS score
  FROM candidates
  WHERE rerank_score(content, 'HNSW index tuning') > 0.3
  ORDER BY score DESC
  LIMIT 5
)
-- 6. Inject context and call LLM
SELECT openai_chat(
  'deepseek-chat',
  prompt_with_context(
    'How do I tune PostgreSQL HNSW indexes?',
    (SELECT string_agg(content, E'\n---\n') FROM reranked)
  )
);

Function	Role
`convert_pdf_to_text()`	Extracts text from a PDF binary
`chunks_by_token_count()`	Sliding window chunking by token count
`embedding_for_passage()`	Embedding for storage (includes BGE asymmetric padding)
`embedding_for_query()`	Embedding for queries (must be kept separate from passage)
`rerank_score()`	Calculates relevance score using jina-reranker
`prompt_with_context()`	Injects search results into a prompt template

pg_read_binary_file permissions: This function requires superuser or the pg_read_server_files role to execute. Calling it from a regular application account will result in a permission error. In production, consider using a dedicated migration account with file access privileges, or having the application read the file content and pass it directly as BYTEA.

Example 2: Automatic Embedding Synchronization with pgai Vectorizer

If you want embeddings to be generated automatically every time a new post is added to the documents table, here's how to do it. The most surprising thing when I first used this approach was that I didn't need to change a single line of application code.

sql

-- Register a Vectorizer — one line of SQL
SELECT ai.create_vectorizer(
  'documents'::regclass,
  embedding  => ai.embedding_openai('text-embedding-3-small', 1536),
  chunking   => ai.chunking_recursive_character_text_splitter(
                  'content',
                  512,   -- chunk_size (tokens)
                  50     -- overlap
                ),
  destination => 'documents_embedding_store'
);

After registration, whenever an INSERT occurs on the documents table, a separately running pgai worker container asynchronously generates embeddings and writes them to documents_embedding_store. Note that if you run the SQL above without the worker, no embeddings will be generated.

bash

# Run the pgai worker container (Docker example)
docker run -d \
  -e PGAI_VECTORIZER_WORKER_DB_URL="postgres://user:pass@host:5432/db" \
  -e OPENAI_API_KEY="sk-..." \
  timescale/pgai-vectorizer-worker:latest

LSN (Log Sequence Number) tracking: pgai Vectorizer tracks the PostgreSQL WAL (Write-Ahead Log) position to prevent duplicate processing. Even if the worker crashes and restarts, it knows exactly where it left off, so there is no risk of lost or duplicate embeddings.

Example 3: Using SQL JOINs in Multi-Tenant RAG

Metadata filtering in a dedicated vector DB is more cumbersome than you'd expect. If you want to filter by tenant ID while also applying a date range, it often requires special handling. In PostgreSQL, it's just a JOIN. This is the most common situation I encounter in practice, and I think it's the biggest practical advantage of the standalone setup.

sql

SELECT
  c.content,
  c.embedding <=> $2 AS distance
FROM document_chunks c
JOIN documents d ON c.document_id = d.id
WHERE d.tenant_id = $1
  AND d.created_at > NOW() - INTERVAL '30 days'
  AND d.category = ANY($3)
ORDER BY distance
LIMIT 5;

Tenant isolation, date filtering, and category filtering are all expressed as index predicates and processed within a single transaction. This is far more efficient than the post-filter approach used by dedicated vector DBs. If you want to verify performance directly, prepend EXPLAIN (ANALYZE, BUFFERS) to check whether an Index Scan is being used properly.

Example 4: Improving Recall with Hybrid Search

Searching with vector similarity alone can miss cases where keywords match exactly. Conversely, using only BM25 full-text search fails to capture semantically similar expressions. Combining all three tends to improve Recall@10 from ~0.62 to ~0.84 on English QA benchmarks.

If this is your first time: This query is a combination of three CTEs + FULL OUTER JOIN + COALESCE, which can feel intimidating at first glance. It's recommended to implement vector search alone first, then incrementally add FTS and trigrams, observing the change in Recall at each step.

sql

-- Hybrid search based on RRF (Reciprocal Rank Fusion)
-- For Korean documents: plainto_tsquery('korean', ...) with the default parser
-- does not perform morphological analysis — consider installing pg_bigm
WITH
vector_ranked AS (
  SELECT id, content,
    ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
  FROM document_chunks
  LIMIT 60
),
fts_ranked AS (
  SELECT id, content,
    ROW_NUMBER() OVER (ORDER BY ts_rank(fts_vector, query) DESC) AS rank
  FROM document_chunks,
    plainto_tsquery('korean', $2) query
  WHERE fts_vector @@ query
  LIMIT 60
),
trigram_ranked AS (
  SELECT id, content,
    ROW_NUMBER() OVER (ORDER BY similarity(content, $2) DESC) AS rank
  FROM document_chunks
  WHERE content % $2
  LIMIT 60
)
SELECT
  COALESCE(v.id, f.id, t.id)                AS id,
  COALESCE(v.content, f.content, t.content) AS content,
  -- RRF score: sum of reciprocal ranks (k=60 is a common default)
  COALESCE(1.0 / (60 + v.rank), 0) +
  COALESCE(1.0 / (60 + f.rank), 0) +
  COALESCE(1.0 / (60 + t.rank), 0)          AS rrf_score
FROM vector_ranked v
FULL OUTER JOIN fts_ranked    f ON v.id = f.id
FULL OUTER JOIN trigram_ranked t ON v.id = t.id
ORDER BY rrf_score DESC
LIMIT 10;

RRF (Reciprocal Rank Fusion): An ensemble method that determines final rankings by summing the reciprocal ranks from multiple search strategies in the form 1 / (k + rank). k=60 is a commonly well-performing default value.

Pros and Cons Analysis

Advantages

The most frequent question I get is "why not just use Pinecone?" — but in terms of cost and operational complexity, a PostgreSQL-only setup is often practically advantageous.

Item	Details
Operational simplicity	No need for a vector DB, message queue, or separate orchestration service. The number of things to deploy and monitor is reduced to one.
Transactional consistency	Embeddings and source data are managed within the same transaction, eliminating data inconsistencies caused by partial failures.
SQL JOIN filtering	Complex filters like tenant, date, and category can be expressed as index predicates, making it more flexible than metadata filtering in a dedicated vector DB.
Cost	No additional infrastructure costs. You can leverage your existing PostgreSQL instance as-is.
Security	Sensitive data never leaves the DB boundary, making regulatory compliance easier.
Built-in hybrid search	`tsvector`, `pg_trgm`, and `pgvector` can be freely combined in SQL.

Drawbacks and Caveats

On the flip side, the HNSW memory issue is the most commonly encountered operational problem in practice. The typical pattern is things working fine at first, then queries suddenly slowing down as data accumulates.

Item	Details	Mitigation
Horizontal scaling limits	PostgreSQL is fundamentally a single-node architecture. Once vectors exceed tens of millions, distributed sharding becomes necessary.	PostgreSQL up to ~10M vectors; consider Qdrant or Weaviate beyond that
HNSW memory requirements	The HNSW index must reside in RAM for peak performance. If the index size exceeds available memory, I/O latency spikes dramatically.	Use `pgvectorscale`'s StreamingDiskANN, or use `halfvec` quantization to cut memory usage in half
Initial embedding generation load	CPU and API costs are concentrated when first embedding large volumes of data.	Leverage pgai Vectorizer's batch processing and rate limit handling
Index build time	Building an HNSW index over millions of vectors can take tens of minutes to hours.	Tune with `SET maintenance_work_mem = '4GB'`; use blue/green deployment to switch without service interruption

halfvec: A pgvector type that stores vectors as 16-bit floating point. Compared to the default vector (32-bit), RAM usage is halved. Combined with Matryoshka embeddings, query latency of under 10ms can be maintained when processing millions of records on a single server.

3 Common Pitfalls

These are issues I've encountered directly in practice, or heard most frequently from colleagues.

Confusing embedding_for_passage() and embedding_for_query() — BGE-family models use an asymmetric approach where different prefixes are applied to storage embeddings (passage) and query embeddings (query). Think of it as optimizing for the full document context when storing, and for the question pattern when searching. Using the same function for both significantly distorts similarity scores.
Creating the HNSW index after data has been inserted — Building an index on a table that already contains millions of rows requires a lot of server memory and time. Whenever possible, create the index immediately after the table is created, before loading data.
Using only vector search and skipping hybrid search — Honestly, when building an initial prototype, vector search alone seems sufficient. But once exact product names, codenames, or abbreviations are mixed in, Recall drops noticeably without BM25 and trigram search. It is recommended to set up hybrid search from the beginning.

Closing Thoughts

A PostgreSQL-only RAG pipeline is a deliberate choice to "eliminate one system to operate, in exchange for getting the most out of SQL's expressive power." For services with roughly 10 million vectors or fewer, you can achieve sufficient performance at far lower complexity and cost than combining a dedicated vector DB with external orchestration.

This is where I started, and following this order will save you a lot of trial and error.

Create a Neon free account and run CREATE EXTENSION IF NOT EXISTS vector; — pgvector, pgrag, and pgai are all pre-installed, so you can follow Example 1 immediately. In a local environment, pgrag and pgai require separate builds, so starting with Neon is much easier at first.
Connect automatic embeddings with a single line: SELECT ai.create_vectorizer('documents'::regclass, ...) — After this, embeddings are generated automatically with every INSERT. Note that the pgai worker container must be running alongside it for this to actually work, so refer to the Docker command in Example 2 as well.
Implement basic search with ORDER BY embedding <=> query_vec LIMIT 10, then gradually migrate to RRF hybrid — At each stage, verify performance with EXPLAIN (ANALYZE, BUFFERS) and directly observe the change in Recall as you add tsvector and pg_trgm.

References

#pgvector#RAG#PostgreSQL#pgai#pgrag#하이브리드검색#벡터DB#HNSW#임베딩#RRF

1,000만 벡터까지는 PostgreSQL 하나로 — pgvector·pgrag·pgai로 벡터 DB 없애기 | DEV BAK - 기술블로그

Database

Up to 10 Million Vectors with PostgreSQL Alone — Eliminating the Vector DB with pgvector·pgrag·pgai

Core Concepts

The RAG Pipeline at a Glance

Document input (PDF, HTML, DOCX...)
    ↓
Chunking — split into appropriate sizes (pgrag: chunks_by_token_count)
    ↓
Embedding generation — text → vector (pgrag / pgai Vectorizer)
    ↓
Vector storage & indexing (pgvector: HNSW index)
    ↓
Retrieval — vector similarity + FTS + trigrams (RRF ensemble)
    ↓
Reranking — precise reordering of candidates (pgrag: rerank_score)
    ↓
LLM call → response generation (pgai: openai_chat)

Here's a comparison of what changes when you handle this entire pipeline with PostgreSQL alone:

Stage	Traditional Tools	PostgreSQL Only
Chunking	LangChain TextSplitter	`pgrag` token-based chunking functions
Embedding generation	OpenAI API + Python script	`pgai` / `pgrag` SQL functions
Vector storage & indexing	Pinecone, Qdrant, Weaviate	`pgvector` HNSW index
Retrieval	Vector DB query API	`ORDER BY embedding <=> query_vec`
Reranking	Cohere Rerank API	`pgrag` reranking SQL functions
Generation	LangChain / LlamaIndex	`pgai` LLM call SQL functions

Three Extensions, One-Line Definitions

Extension	Creator	Role
pgvector	Open-source community	Vector types (`vector`, `halfvec`) + HNSW·IVFFlat indexes
pgrag	Neon (experimental)	PDF parsing, chunking, local embeddings, reranking, LLM calls — all as SQL functions
pgai Vectorizer	Timescale	Automatic embedding synchronization via table triggers

pgrag maturity warning: pgrag is an experimental extension published by Neon, and as of 2025 it has been validated primarily in Neon's serverless environment. Its production stability is lower than pgvector's, so thorough testing is required before introducing it to a self-hosted environment.

pgvector — The Most Fundamental Building Block

sql

CREATE EXTENSION IF NOT EXISTS vector;
 
CREATE TABLE document_chunks (
  id        BIGSERIAL PRIMARY KEY,
  content   TEXT NOT NULL,
  embedding vector(1536)
);
 
-- HNSW index: can be created even before data exists
CREATE INDEX ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

HNSW vs IVFFlat: HNSW (Hierarchical Navigable Small World) is a graph-based index that can be created even before data is inserted. IVFFlat requires data to exist first for cluster partitioning, making it a poor fit for blue/green deployments. Since pgvector 0.6, official documentation recommends HNSW as the default.

Practical Application

Example 1: A Complete RAG Pipeline in psql Alone — pgrag + DeepSeek

sql

-- 1. Install extensions
CREATE EXTENSION IF NOT EXISTS rag CASCADE;
CREATE EXTENSION IF NOT EXISTS rag_bge_small_en_v15;
CREATE EXTENSION IF NOT EXISTS rag_jina_reranker_v1_tiny_en;
 
-- 2. Table for storing chunks
CREATE TABLE chunks (
  id        BIGSERIAL PRIMARY KEY,
  content   TEXT NOT NULL,
  embedding vector(384)  -- bge-small-en-v1.5 dimensions
);
 
-- 3. Read PDF, chunk it, and insert
--    pg_read_binary_file requires superuser or pg_read_server_files role
INSERT INTO chunks (content)
SELECT chunk
FROM chunks_by_token_count(
  convert_pdf_to_text(pg_read_binary_file('/data/manual.pdf')),
  192,  -- max_tokens
  8     -- overlap
);
 
-- 4. Generate embeddings in batch (passage mode — for storage)
UPDATE chunks
SET embedding = embedding_for_passage(content)
WHERE embedding IS NULL;
 
-- 5. Retrieval + reranking: ORDER BY rerank_score DESC is required
WITH candidates AS (
  SELECT content
  FROM chunks
  ORDER BY embedding <=> embedding_for_query('HNSW index tuning')
  LIMIT 10
),
reranked AS (
  SELECT content,
         rerank_score(content, 'HNSW index tuning') AS score
  FROM candidates
  WHERE rerank_score(content, 'HNSW index tuning') > 0.3
  ORDER BY score DESC
  LIMIT 5
)
-- 6. Inject context and call LLM
SELECT openai_chat(
  'deepseek-chat',
  prompt_with_context(
    'How do I tune PostgreSQL HNSW indexes?',
    (SELECT string_agg(content, E'\n---\n') FROM reranked)
  )
);

Function	Role
`convert_pdf_to_text()`	Extracts text from a PDF binary
`chunks_by_token_count()`	Sliding window chunking by token count
`embedding_for_passage()`	Embedding for storage (includes BGE asymmetric padding)
`embedding_for_query()`	Embedding for queries (must be kept separate from passage)
`rerank_score()`	Calculates relevance score using jina-reranker
`prompt_with_context()`	Injects search results into a prompt template

pg_read_binary_file permissions: This function requires superuser or the pg_read_server_files role to execute. Calling it from a regular application account will result in a permission error. In production, consider using a dedicated migration account with file access privileges, or having the application read the file content and pass it directly as BYTEA.

Example 2: Automatic Embedding Synchronization with pgai Vectorizer

sql

-- Register a Vectorizer — one line of SQL
SELECT ai.create_vectorizer(
  'documents'::regclass,
  embedding  => ai.embedding_openai('text-embedding-3-small', 1536),
  chunking   => ai.chunking_recursive_character_text_splitter(
                  'content',
                  512,   -- chunk_size (tokens)
                  50     -- overlap
                ),
  destination => 'documents_embedding_store'
);

bash

# Run the pgai worker container (Docker example)
docker run -d \
  -e PGAI_VECTORIZER_WORKER_DB_URL="postgres://user:pass@host:5432/db" \
  -e OPENAI_API_KEY="sk-..." \
  timescale/pgai-vectorizer-worker:latest

LSN (Log Sequence Number) tracking: pgai Vectorizer tracks the PostgreSQL WAL (Write-Ahead Log) position to prevent duplicate processing. Even if the worker crashes and restarts, it knows exactly where it left off, so there is no risk of lost or duplicate embeddings.

Example 3: Using SQL JOINs in Multi-Tenant RAG

sql

SELECT
  c.content,
  c.embedding <=> $2 AS distance
FROM document_chunks c
JOIN documents d ON c.document_id = d.id
WHERE d.tenant_id = $1
  AND d.created_at > NOW() - INTERVAL '30 days'
  AND d.category = ANY($3)
ORDER BY distance
LIMIT 5;

Example 4: Improving Recall with Hybrid Search

If this is your first time: This query is a combination of three CTEs + FULL OUTER JOIN + COALESCE, which can feel intimidating at first glance. It's recommended to implement vector search alone first, then incrementally add FTS and trigrams, observing the change in Recall at each step.

sql

-- Hybrid search based on RRF (Reciprocal Rank Fusion)
-- For Korean documents: plainto_tsquery('korean', ...) with the default parser
-- does not perform morphological analysis — consider installing pg_bigm
WITH
vector_ranked AS (
  SELECT id, content,
    ROW_NUMBER() OVER (ORDER BY embedding <=> $1) AS rank
  FROM document_chunks
  LIMIT 60
),
fts_ranked AS (
  SELECT id, content,
    ROW_NUMBER() OVER (ORDER BY ts_rank(fts_vector, query) DESC) AS rank
  FROM document_chunks,
    plainto_tsquery('korean', $2) query
  WHERE fts_vector @@ query
  LIMIT 60
),
trigram_ranked AS (
  SELECT id, content,
    ROW_NUMBER() OVER (ORDER BY similarity(content, $2) DESC) AS rank
  FROM document_chunks
  WHERE content % $2
  LIMIT 60
)
SELECT
  COALESCE(v.id, f.id, t.id)                AS id,
  COALESCE(v.content, f.content, t.content) AS content,
  -- RRF score: sum of reciprocal ranks (k=60 is a common default)
  COALESCE(1.0 / (60 + v.rank), 0) +
  COALESCE(1.0 / (60 + f.rank), 0) +
  COALESCE(1.0 / (60 + t.rank), 0)          AS rrf_score
FROM vector_ranked v
FULL OUTER JOIN fts_ranked    f ON v.id = f.id
FULL OUTER JOIN trigram_ranked t ON v.id = t.id
ORDER BY rrf_score DESC
LIMIT 10;

RRF (Reciprocal Rank Fusion): An ensemble method that determines final rankings by summing the reciprocal ranks from multiple search strategies in the form 1 / (k + rank). k=60 is a commonly well-performing default value.

Pros and Cons Analysis

Advantages

The most frequent question I get is "why not just use Pinecone?" — but in terms of cost and operational complexity, a PostgreSQL-only setup is often practically advantageous.

Item	Details
Operational simplicity	No need for a vector DB, message queue, or separate orchestration service. The number of things to deploy and monitor is reduced to one.
Transactional consistency	Embeddings and source data are managed within the same transaction, eliminating data inconsistencies caused by partial failures.
SQL JOIN filtering	Complex filters like tenant, date, and category can be expressed as index predicates, making it more flexible than metadata filtering in a dedicated vector DB.
Cost	No additional infrastructure costs. You can leverage your existing PostgreSQL instance as-is.
Security	Sensitive data never leaves the DB boundary, making regulatory compliance easier.
Built-in hybrid search	`tsvector`, `pg_trgm`, and `pgvector` can be freely combined in SQL.

Drawbacks and Caveats

Item	Details	Mitigation
Horizontal scaling limits	PostgreSQL is fundamentally a single-node architecture. Once vectors exceed tens of millions, distributed sharding becomes necessary.	PostgreSQL up to ~10M vectors; consider Qdrant or Weaviate beyond that
HNSW memory requirements	The HNSW index must reside in RAM for peak performance. If the index size exceeds available memory, I/O latency spikes dramatically.	Use `pgvectorscale`'s StreamingDiskANN, or use `halfvec` quantization to cut memory usage in half
Initial embedding generation load	CPU and API costs are concentrated when first embedding large volumes of data.	Leverage pgai Vectorizer's batch processing and rate limit handling
Index build time	Building an HNSW index over millions of vectors can take tens of minutes to hours.	Tune with `SET maintenance_work_mem = '4GB'`; use blue/green deployment to switch without service interruption

halfvec: A pgvector type that stores vectors as 16-bit floating point. Compared to the default vector (32-bit), RAM usage is halved. Combined with Matryoshka embeddings, query latency of under 10ms can be maintained when processing millions of records on a single server.

3 Common Pitfalls

These are issues I've encountered directly in practice, or heard most frequently from colleagues.

Confusing embedding_for_passage() and embedding_for_query() — BGE-family models use an asymmetric approach where different prefixes are applied to storage embeddings (passage) and query embeddings (query). Think of it as optimizing for the full document context when storing, and for the question pattern when searching. Using the same function for both significantly distorts similarity scores.
Creating the HNSW index after data has been inserted — Building an index on a table that already contains millions of rows requires a lot of server memory and time. Whenever possible, create the index immediately after the table is created, before loading data.
Using only vector search and skipping hybrid search — Honestly, when building an initial prototype, vector search alone seems sufficient. But once exact product names, codenames, or abbreviations are mixed in, Recall drops noticeably without BM25 and trigram search. It is recommended to set up hybrid search from the beginning.

Closing Thoughts

This is where I started, and following this order will save you a lot of trial and error.

Create a Neon free account and run CREATE EXTENSION IF NOT EXISTS vector; — pgvector, pgrag, and pgai are all pre-installed, so you can follow Example 1 immediately. In a local environment, pgrag and pgai require separate builds, so starting with Neon is much easier at first.
Connect automatic embeddings with a single line: SELECT ai.create_vectorizer('documents'::regclass, ...) — After this, embeddings are generated automatically with every INSERT. Note that the pgai worker container must be running alongside it for this to actually work, so refer to the Docker command in Example 2 as well.
Implement basic search with ORDER BY embedding <=> query_vec LIMIT 10, then gradually migrate to RRF hybrid — At each stage, verify performance with EXPLAIN (ANALYZE, BUFFERS) and directly observe the change in Recall as you add tsvector and pg_trgm.

References

#pgvector#RAG#PostgreSQL#pgai#pgrag#하이브리드검색#벡터DB#HNSW#임베딩#RRF

Core Concepts

The RAG Pipeline at a Glance

Three Extensions, One-Line Definitions

pgvector — The Most Fundamental Building Block

Practical Application

Example 1: A Complete RAG Pipeline in psql Alone — pgrag + DeepSeek

Example 2: Automatic Embedding Synchronization with pgai Vectorizer

Example 3: Using SQL JOINs in Multi-Tenant RAG

Example 4: Improving Recall with Hybrid Search

Pros and Cons Analysis

Advantages

Drawbacks and Caveats

3 Common Pitfalls

Closing Thoughts

References

Core Concepts

The RAG Pipeline at a Glance

Three Extensions, One-Line Definitions

pgvector — The Most Fundamental Building Block

Practical Application

Example 1: A Complete RAG Pipeline in psql Alone — pgrag + DeepSeek

Example 2: Automatic Embedding Synchronization with pgai Vectorizer

Example 3: Using SQL JOINs in Multi-Tenant RAG

Example 4: Improving Recall with Hybrid Search

Pros and Cons Analysis

Advantages

Drawbacks and Caveats

3 Common Pitfalls

Closing Thoughts

References

Recommended Posts

Tens of Millions of Vectors in RAG on a Single PostgreSQL with pgvectorscale's StreamingDiskANN and Vector Compression — 28x Faster Than Pinecone

PgBouncer vs Supavisor: PostgreSQL Connection Pooling — What Should You Use in the Serverless Era?

Building a Multi-Tenant SaaS with PostgreSQL Row Level Security (RLS): Code, Pitfalls, and PgBouncer Integration

Replacing Elasticsearch with PostgreSQL Alone: Hybrid Search with VectorChord-BM25 and RRF

When Does pgvector Hit Its Performance Limits — pgvector vs VectorChord Benchmark Comparison

How PostgreSQL 18's `uuidv7()` and `uuid_extract_timestamp()` Speed Up Time-Series Queries by 44% Without `created_at`