Building a Multimodal RAG Pipeline: Making LLMs Understand Images and Tables

When I first introduced RAG, I had a similar experience. I parsed a few hundred PDFs, loaded them into a vector DB, and ran some searches — it retrieved text-heavy content reasonably well, but charts, product photos, and circuit diagrams embedded in the middle of documents were completely ignored. Even when that visual information was essentially the core of the document. It was a structural limitation of traditional text RAG.

Multimodal RAG starts exactly from that point. It's an architecture that retrieves and uses images, tables, audio, and video within the same pipeline as text. This article covers how multimodal RAG works, how to build an actual pipeline, and the tradeoffs you must consider in production. This article assumes readers who have built a basic text RAG pipeline at least once. If vector DBs, embedding models, and chunking are still unfamiliar, it's recommended to review RAG fundamentals first.

This technology is already used in production services across multiple domains — medical imaging, e-commerce, technical documentation — and now is a great time to get started. Native multimodal embedding models like Google Gemini Embedding 2, which map text, images, and audio into a single vector space, have just stabilized, making implementation noticeably less complex compared to a year ago.

Core Concepts

The Three-Stage Structure of Multimodal RAG

If traditional RAG follows a linear flow of "text → embedding → vector search → generation," multimodal RAG performs modality-specific processing in parallel at each stage.

css

[Source Document]
    ├── Text chunks → Transformer embedding
    ├── Images/charts → Vision Encoder (CLIP, ColPali, etc.)
    └── Table data → Convert to text → Transformer embedding
 
          ↓ (stored in a unified vector space)
 
[Query] → Cross-modal retrieval → [Text + Image + Table context]
                                          ↓
                                  VLM (GPT-4o, Gemini, etc.)
                                          ↓
                                    [Final response]

Stage 1 — Multimodal Indexing: Each modality is embedded using a dedicated encoder, then stored in a single unified vector space. For text, models like BGE-M3 (a transformer-based embedding model with strong multilingual support) are used. For images, CLIP or ColPali is used — the two take quite different approaches. Where CLIP aligns image-text pairs, ColPali uses a Late Interaction approach that embeds document page images directly without OCR, preserving layout and formatting information. Embedding quality at this stage determines the entire downstream retrieval performance.

Stage 2 — Cross-modal Retrieval: The key is that even when a text query comes in, related images and tables are retrieved alongside it. Vector similarity search alone has limitations, so combining BM25 keyword search with Tensor Reranking on top of hybrid search is currently closest to the production standard for improving accuracy.

Stage 3 — Multimodal Generation: The retrieved text, images, and tables are all injected as context, and a VLM like GPT-4o or Gemini generates the response. Hallucination is empirically reduced significantly compared to using text alone — which makes sense once you try it. With visual grounding available, the model has a harder time making things up.

What's Different from Traditional Text RAG

Honestly, at first I thought "can't you just extract text from images and feed it in?" — wouldn't OCR-extracted image text be covered by existing RAG? But information like color distribution in charts, spatial patterns in medical images, and texture in product photos is lost the moment it's converted to text. As this accumulates, it directly impacts retrieval quality.

VisRAG: An approach that retrieves and reasons over document page images directly, without parsing documents into text. It can preserve the original document structure without the layout information loss that occurs during OCR conversion.

Key Changes to Watch in 2025–2026

The multimodal RAG ecosystem is changing rapidly — an architecture from a month ago can already feel outdated.

Trend	Description
Native multimodal embeddings	Google Gemini Embedding 2 maps text, images, video, and audio into a single vector space — no need for separate encoders per modality
Agentic RAG	Evolving from a single retrieval-generation loop to multi-step reasoning with Retriever, Validator, and Summarizer agents collaborating
Real-time RAG	Expanding from static indexes to dynamic real-time data connections for news, sensors, stock prices, etc.
RAG-as-a-Service	Rapid growth of enterprise-grade managed services providing multimodal RAG capabilities via API

Practical Application

Let's look at how to actually build this pipeline in code.

Example 1: Building a Multimodal PDF Pipeline with LlamaIndex

Consider a technical documentation assistant scenario — processing engineering PDFs that mix circuit diagrams, schematics, code snippets, and text descriptions. I also ran into a NameError when initially trying to create an index without storage_context — you must create separate text and image stores first, then tie them together with StorageContext. Skip this step and even copy-pasted code will error immediately.

python

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.indices.multi_modal import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.clip import ClipEmbedding
import qdrant_client
 
# 1. Load multimodal documents (extract text + images together)
documents = SimpleDirectoryReader(
    input_dir="./docs",
    required_exts=[".pdf"],
    recursive=True
).load_data()
 
# 2. Configure vector stores (separate indexes for text/images)
client = qdrant_client.QdrantClient(host="localhost", port=6333)
 
text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
 
# 3. Combine text/image stores with StorageContext (required step)
storage_context = StorageContext.from_defaults(
    vector_store=text_store,
    image_store=image_store,
)
 
# 4. Align images and text with CLIP embeddings
clip_embedding = ClipEmbedding()
 
# 5. Build multimodal index
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    image_embed_model=clip_embedding,
    storage_context=storage_context,
)
 
# 6. Configure VLM and build query engine
mm_llm = OpenAIMultiModal(
    model="gpt-4o",
    max_new_tokens=1500
)
 
query_engine = index.as_query_engine(
    multi_modal_llm=mm_llm,
    similarity_top_k=5,       # top 5 text chunks
    image_similarity_top_k=3, # top 3 images
)
 
response = query_engine.query(
    "Explain how the voltage regulation module works in this circuit"
)
print(response)

Code Point	Role
`MultiModalVectorStoreIndex`	Manages text and image indexes in a unified way
`StorageContext.from_defaults`	Required step to bind text and image vector stores together
`ClipEmbedding`	Maps images and text into the same vector space
`image_similarity_top_k`	Controls the number of image candidates during retrieval (cost-quality tradeoff)
`OpenAIMultiModal`	VLM that processes image + text context together

Vector search alone can miss keyword-match cases. Combining BM25 with hybrid search plus tensor reranking makes a meaningful difference in practice. The parameters you'll tune most often in production are similarity_top_k and top_n — the balance between these simultaneously determines retrieval quality and latency.

python

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.postprocessor.colbert_rerank import ColbertRerank
from llama_index.core.query_engine import RetrieverQueryEngine
 
# Define mm_llm if running standalone
mm_llm = OpenAIMultiModal(
    model="gpt-4o",
    max_new_tokens=1500
)
 
# Vector search retriever
vector_retriever = index.as_retriever(similarity_top_k=10)
 
# BM25 keyword search retriever
bm25_retriever = BM25Retriever.from_defaults(
    index=index,
    similarity_top_k=10
)
 
# Combine both retrievers with Reciprocal Rank Fusion
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,
    mode="reciprocal_rerank",
)
 
# Final ranking adjustment with ColBERT-based tensor reranking
reranker = ColbertRerank(top_n=3)
 
query_engine = RetrieverQueryEngine(
    retriever=hybrid_retriever,
    node_postprocessors=[reranker],
    llm=mm_llm,
)

Reciprocal Rank Fusion (RRF): An ensemble technique that converts rankings from multiple search results into scores and combines them. It's frequently used in practice because it's simple to implement yet delivers reasonably stable performance.

Example 3: Extracting Images and Tables Separately at the Document Parsing Stage

Pipeline quality is determined at the ingestion stage. You can use Unstructured.io to separately extract text, images, and tables from a PDF.

python

from unstructured.partition.pdf import partition_pdf
 
def parse_multimodal_document(pdf_path: str) -> dict:
    """Separately extract text, images, and tables from a PDF"""
 
    elements = partition_pdf(
        filename=pdf_path,
        extract_images_in_pdf=True,
        extract_image_block_types=["Image", "Table"],
        infer_table_structure=True,
        strategy="hi_res",         # High-resolution parsing (slow but accurate)
    )
 
    result = {"texts": [], "images": [], "tables": []}
 
    for element in elements:
        element_type = type(element).__name__
 
        if element_type == "Table":
            result["tables"].append({
                "content": element.text,
                "metadata": element.metadata.to_dict()
            })
        elif element_type == "Image":
            if hasattr(element.metadata, "image_base64"):
                result["images"].append({
                    "base64": element.metadata.image_base64,
                    "metadata": element.metadata.to_dict()
                })
        else:
            result["texts"].append({
                "content": element.text,
                "metadata": element.metadata.to_dict()
            })
 
    return result

Parsing Strategy	Speed	Accuracy	Recommended For
`fast`	Fast	Low	Text-heavy documents, prototyping
`auto`	Medium	Medium	General PDFs
`hi_res`	Slow	High	Technical documents with many images/tables, production

Pros and Cons Analysis

Advantages

Item	Description
Reduced information loss	Directly utilizes visual information — such as chart color distribution and spatial relationships in images — that would be lost during text conversion
Hallucination suppression	Multi-source grounding empirically and meaningfully improves generation reliability compared to text-only approaches
Richer response quality	Returning text + images + tables together improves the user experience
Existing ecosystem compatibility	Extensible within the LangChain and LlamaIndex ecosystems. Existing RAG assets can be reused

Disadvantages and Caveats

Item	Description	Mitigation
Infrastructure complexity	Modality-specific encoders, chunking strategies, and indexes all differ, sharply increasing operational burden	Simplify with native multimodal embeddings like Gemini Embedding 2
Storage and retrieval cost	At the scale of millions of pages, indexes can grow to TB-level	Vector DB sharding, image resolution optimization, tiered storage design
Latency	Serialized modality processing stages slow down response times	Async ingestion queues, caching, parallel per-modality processing
Cross-modal alignment quality	When retrieving images via text queries, embedding quality differences directly impact performance	Choose high-quality CLIP/ColPali embeddings; consider domain fine-tuning
Lack of evaluation metrics	Standard benchmarks for measuring multimodal RAG quality are still immature	Use RAGAS framework combined with per-modality individual evaluation metrics

Tensor Reranker: A technique that reranks results by computing token-level interactions between a query and documents in a ColBERT-style approach. It's more accurate than standard cosine similarity but computationally expensive, so it's typically applied only to top candidates.

The Most Common Mistakes in Practice

Skipping image preprocessing — Embedding low-resolution or noisy images as-is causes retrieval quality to drop sharply. It's best to apply resizing, normalization, and quality filtering at the ingestion stage ahead of time.
Severing the positional link between text chunks and images — If the connection between text like "see Figure 3" and the actual metadata of Figure 3 is broken during parsing, retrieval will surface images with no context. It's recommended to preserve positional metadata such as page numbers and section information at the parsing stage.
Validating the entire pipeline with a single modality — Assuming that because text search works well, image search will too, leads to serious problems later. A separate evaluation set per modality, with individual verification for each, is necessary.

Closing Thoughts

Multimodal RAG is the core architecture for moving from "AI that understands only text" to "AI that understands the entire document."

It may look complex, but the starting point is closer than you think. Here are 3 steps you can take right now.

Set up the environment — The command below sets up an environment capable of running all three examples from this article. BM25Retriever and ColbertRerank from Example 2 are separate packages, so install them together.

bash

pip install llama-index-multi-modal-llms-openai \
            llama-index-vector-stores-qdrant \
            llama-index-embeddings-clip \
            llama-index-retrievers-bm25 \
            llama-index-postprocessor-colbert-rerank \
            unstructured[pdf]
 
docker run -p 6333:6333 qdrant/qdrant

Validate the pipeline with the simplest case — Start with one or two PDFs containing around 10–20 images. Parse with strategy="hi_res", index with CLIP embeddings, then fire off a query describing an image — you'll be able to see with your own eyes how multimodal retrieval works. If the image retrieval results look off on the first run, don't panic. Check the ingestion stage logs first and the cause will usually surface quickly.
Add hybrid search and reranking in the second iteration — Once you've confirmed the basic pipeline is working, layer in BM25 combination and ColBERT reranking incrementally and compare how retrieval quality changes. This lets you immediately feel each component's real-world contribution.

References

#멀티모달RAG#LlamaIndex#VectorDB#CLIP#ColPali#하이브리드검색#LLM#임베딩#Qdrant#Unstructured

Building a Multimodal RAG Pipeline: Making LLMs Understand Images and Tables

Core Concepts

The Three-Stage Structure of Multimodal RAG

If traditional RAG follows a linear flow of "text → embedding → vector search → generation," multimodal RAG performs modality-specific processing in parallel at each stage.

css

[Source Document]
    ├── Text chunks → Transformer embedding
    ├── Images/charts → Vision Encoder (CLIP, ColPali, etc.)
    └── Table data → Convert to text → Transformer embedding
 
          ↓ (stored in a unified vector space)
 
[Query] → Cross-modal retrieval → [Text + Image + Table context]
                                          ↓
                                  VLM (GPT-4o, Gemini, etc.)
                                          ↓
                                    [Final response]

What's Different from Traditional Text RAG

VisRAG: An approach that retrieves and reasons over document page images directly, without parsing documents into text. It can preserve the original document structure without the layout information loss that occurs during OCR conversion.

Key Changes to Watch in 2025–2026

The multimodal RAG ecosystem is changing rapidly — an architecture from a month ago can already feel outdated.

Trend	Description
Native multimodal embeddings	Google Gemini Embedding 2 maps text, images, video, and audio into a single vector space — no need for separate encoders per modality
Agentic RAG	Evolving from a single retrieval-generation loop to multi-step reasoning with Retriever, Validator, and Summarizer agents collaborating
Real-time RAG	Expanding from static indexes to dynamic real-time data connections for news, sensors, stock prices, etc.
RAG-as-a-Service	Rapid growth of enterprise-grade managed services providing multimodal RAG capabilities via API

Practical Application

Let's look at how to actually build this pipeline in code.

Example 1: Building a Multimodal PDF Pipeline with LlamaIndex

python

from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.indices.multi_modal import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.clip import ClipEmbedding
import qdrant_client
 
# 1. Load multimodal documents (extract text + images together)
documents = SimpleDirectoryReader(
    input_dir="./docs",
    required_exts=[".pdf"],
    recursive=True
).load_data()
 
# 2. Configure vector stores (separate indexes for text/images)
client = qdrant_client.QdrantClient(host="localhost", port=6333)
 
text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
 
# 3. Combine text/image stores with StorageContext (required step)
storage_context = StorageContext.from_defaults(
    vector_store=text_store,
    image_store=image_store,
)
 
# 4. Align images and text with CLIP embeddings
clip_embedding = ClipEmbedding()
 
# 5. Build multimodal index
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    image_embed_model=clip_embedding,
    storage_context=storage_context,
)
 
# 6. Configure VLM and build query engine
mm_llm = OpenAIMultiModal(
    model="gpt-4o",
    max_new_tokens=1500
)
 
query_engine = index.as_query_engine(
    multi_modal_llm=mm_llm,
    similarity_top_k=5,       # top 5 text chunks
    image_similarity_top_k=3, # top 3 images
)
 
response = query_engine.query(
    "Explain how the voltage regulation module works in this circuit"
)
print(response)

Code Point	Role
`MultiModalVectorStoreIndex`	Manages text and image indexes in a unified way
`StorageContext.from_defaults`	Required step to bind text and image vector stores together
`ClipEmbedding`	Maps images and text into the same vector space
`image_similarity_top_k`	Controls the number of image candidates during retrieval (cost-quality tradeoff)
`OpenAIMultiModal`	VLM that processes image + text context together

python

from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.postprocessor.colbert_rerank import ColbertRerank
from llama_index.core.query_engine import RetrieverQueryEngine
 
# Define mm_llm if running standalone
mm_llm = OpenAIMultiModal(
    model="gpt-4o",
    max_new_tokens=1500
)
 
# Vector search retriever
vector_retriever = index.as_retriever(similarity_top_k=10)
 
# BM25 keyword search retriever
bm25_retriever = BM25Retriever.from_defaults(
    index=index,
    similarity_top_k=10
)
 
# Combine both retrievers with Reciprocal Rank Fusion
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,
    mode="reciprocal_rerank",
)
 
# Final ranking adjustment with ColBERT-based tensor reranking
reranker = ColbertRerank(top_n=3)
 
query_engine = RetrieverQueryEngine(
    retriever=hybrid_retriever,
    node_postprocessors=[reranker],
    llm=mm_llm,
)

Reciprocal Rank Fusion (RRF): An ensemble technique that converts rankings from multiple search results into scores and combines them. It's frequently used in practice because it's simple to implement yet delivers reasonably stable performance.

Example 3: Extracting Images and Tables Separately at the Document Parsing Stage

Pipeline quality is determined at the ingestion stage. You can use Unstructured.io to separately extract text, images, and tables from a PDF.

python

from unstructured.partition.pdf import partition_pdf
 
def parse_multimodal_document(pdf_path: str) -> dict:
    """Separately extract text, images, and tables from a PDF"""
 
    elements = partition_pdf(
        filename=pdf_path,
        extract_images_in_pdf=True,
        extract_image_block_types=["Image", "Table"],
        infer_table_structure=True,
        strategy="hi_res",         # High-resolution parsing (slow but accurate)
    )
 
    result = {"texts": [], "images": [], "tables": []}
 
    for element in elements:
        element_type = type(element).__name__
 
        if element_type == "Table":
            result["tables"].append({
                "content": element.text,
                "metadata": element.metadata.to_dict()
            })
        elif element_type == "Image":
            if hasattr(element.metadata, "image_base64"):
                result["images"].append({
                    "base64": element.metadata.image_base64,
                    "metadata": element.metadata.to_dict()
                })
        else:
            result["texts"].append({
                "content": element.text,
                "metadata": element.metadata.to_dict()
            })
 
    return result

Parsing Strategy	Speed	Accuracy	Recommended For
`fast`	Fast	Low	Text-heavy documents, prototyping
`auto`	Medium	Medium	General PDFs
`hi_res`	Slow	High	Technical documents with many images/tables, production

Pros and Cons Analysis

Advantages

Item	Description
Reduced information loss	Directly utilizes visual information — such as chart color distribution and spatial relationships in images — that would be lost during text conversion
Hallucination suppression	Multi-source grounding empirically and meaningfully improves generation reliability compared to text-only approaches
Richer response quality	Returning text + images + tables together improves the user experience
Existing ecosystem compatibility	Extensible within the LangChain and LlamaIndex ecosystems. Existing RAG assets can be reused

Disadvantages and Caveats

Item	Description	Mitigation
Infrastructure complexity	Modality-specific encoders, chunking strategies, and indexes all differ, sharply increasing operational burden	Simplify with native multimodal embeddings like Gemini Embedding 2
Storage and retrieval cost	At the scale of millions of pages, indexes can grow to TB-level	Vector DB sharding, image resolution optimization, tiered storage design
Latency	Serialized modality processing stages slow down response times	Async ingestion queues, caching, parallel per-modality processing
Cross-modal alignment quality	When retrieving images via text queries, embedding quality differences directly impact performance	Choose high-quality CLIP/ColPali embeddings; consider domain fine-tuning
Lack of evaluation metrics	Standard benchmarks for measuring multimodal RAG quality are still immature	Use RAGAS framework combined with per-modality individual evaluation metrics

Tensor Reranker: A technique that reranks results by computing token-level interactions between a query and documents in a ColBERT-style approach. It's more accurate than standard cosine similarity but computationally expensive, so it's typically applied only to top candidates.

The Most Common Mistakes in Practice

Skipping image preprocessing — Embedding low-resolution or noisy images as-is causes retrieval quality to drop sharply. It's best to apply resizing, normalization, and quality filtering at the ingestion stage ahead of time.
Severing the positional link between text chunks and images — If the connection between text like "see Figure 3" and the actual metadata of Figure 3 is broken during parsing, retrieval will surface images with no context. It's recommended to preserve positional metadata such as page numbers and section information at the parsing stage.
Validating the entire pipeline with a single modality — Assuming that because text search works well, image search will too, leads to serious problems later. A separate evaluation set per modality, with individual verification for each, is necessary.

Closing Thoughts

Multimodal RAG is the core architecture for moving from "AI that understands only text" to "AI that understands the entire document."

It may look complex, but the starting point is closer than you think. Here are 3 steps you can take right now.

Set up the environment — The command below sets up an environment capable of running all three examples from this article. BM25Retriever and ColbertRerank from Example 2 are separate packages, so install them together.

bash

pip install llama-index-multi-modal-llms-openai \
            llama-index-vector-stores-qdrant \
            llama-index-embeddings-clip \
            llama-index-retrievers-bm25 \
            llama-index-postprocessor-colbert-rerank \
            unstructured[pdf]
 
docker run -p 6333:6333 qdrant/qdrant

Validate the pipeline with the simplest case — Start with one or two PDFs containing around 10–20 images. Parse with strategy="hi_res", index with CLIP embeddings, then fire off a query describing an image — you'll be able to see with your own eyes how multimodal retrieval works. If the image retrieval results look off on the first run, don't panic. Check the ingestion stage logs first and the cause will usually surface quickly.
Add hybrid search and reranking in the second iteration — Once you've confirmed the basic pipeline is working, layer in BM25 combination and ColBERT reranking incrementally and compare how retrieval quality changes. This lets you immediately feel each component's real-world contribution.

References

#멀티모달RAG#LlamaIndex#VectorDB#CLIP#ColPali#하이브리드검색#LLM#임베딩#Qdrant#Unstructured

Core Concepts

The Three-Stage Structure of Multimodal RAG

What's Different from Traditional Text RAG

Key Changes to Watch in 2025–2026

Practical Application

Example 1: Building a Multimodal PDF Pipeline with LlamaIndex

Example 2: Improving Cross-modal Accuracy with Hybrid Search

Example 3: Extracting Images and Tables Separately at the Document Parsing Stage

Pros and Cons Analysis

Advantages

Disadvantages and Caveats

The Most Common Mistakes in Practice

Closing Thoughts

References

Core Concepts

The Three-Stage Structure of Multimodal RAG

What's Different from Traditional Text RAG

Key Changes to Watch in 2025–2026

Practical Application

Example 1: Building a Multimodal PDF Pipeline with LlamaIndex

Example 2: Improving Cross-modal Accuracy with Hybrid Search

Example 3: Extracting Images and Tables Separately at the Document Parsing Stage

Pros and Cons Analysis

Advantages

Disadvantages and Caveats

The Most Common Mistakes in Practice

Closing Thoughts

References

Recommended Posts

Building LLM Tracing with OpenTelemetry: Tracking RAG and Multi-Agent Flows with the gen_ai Standard

Pydantic AI: Implementing Type-Safe LLM Tool Calls in Python AI Agents

Building Your Own LLM Evaluation Framework vs. Off-the-Shelf Tools: Team Decision Criteria for 2026

Comparing Long-Term Memory for AI Agents: Mem0 vs Letta vs Zep — Three Philosophies and How to Choose

LangGraph Supervisor Pattern: How to Stay in Control in a Multi-Agent System

Why 88% of AI Agents Fail in Production: The 5-Layer Harness Architecture Is the Answer