Building a Multimodal RAG Pipeline: Making LLMs Understand Images and Tables
When I first introduced RAG, I had a similar experience. I parsed a few hundred PDFs, loaded them into a vector DB, and ran some searches — it retrieved text-heavy content reasonably well, but charts, product photos, and circuit diagrams embedded in the middle of documents were completely ignored. Even when that visual information was essentially the core of the document. It was a structural limitation of traditional text RAG.
Multimodal RAG starts exactly from that point. It's an architecture that retrieves and uses images, tables, audio, and video within the same pipeline as text. This article covers how multimodal RAG works, how to build an actual pipeline, and the tradeoffs you must consider in production. This article assumes readers who have built a basic text RAG pipeline at least once. If vector DBs, embedding models, and chunking are still unfamiliar, it's recommended to review RAG fundamentals first.
This technology is already used in production services across multiple domains — medical imaging, e-commerce, technical documentation — and now is a great time to get started. Native multimodal embedding models like Google Gemini Embedding 2, which map text, images, and audio into a single vector space, have just stabilized, making implementation noticeably less complex compared to a year ago.
Core Concepts
The Three-Stage Structure of Multimodal RAG
If traditional RAG follows a linear flow of "text → embedding → vector search → generation," multimodal RAG performs modality-specific processing in parallel at each stage.
[Source Document]
├── Text chunks → Transformer embedding
├── Images/charts → Vision Encoder (CLIP, ColPali, etc.)
└── Table data → Convert to text → Transformer embedding
↓ (stored in a unified vector space)
[Query] → Cross-modal retrieval → [Text + Image + Table context]
↓
VLM (GPT-4o, Gemini, etc.)
↓
[Final response]Stage 1 — Multimodal Indexing: Each modality is embedded using a dedicated encoder, then stored in a single unified vector space. For text, models like BGE-M3 (a transformer-based embedding model with strong multilingual support) are used. For images, CLIP or ColPali is used — the two take quite different approaches. Where CLIP aligns image-text pairs, ColPali uses a Late Interaction approach that embeds document page images directly without OCR, preserving layout and formatting information. Embedding quality at this stage determines the entire downstream retrieval performance.
Stage 2 — Cross-modal Retrieval: The key is that even when a text query comes in, related images and tables are retrieved alongside it. Vector similarity search alone has limitations, so combining BM25 keyword search with Tensor Reranking on top of hybrid search is currently closest to the production standard for improving accuracy.
Stage 3 — Multimodal Generation: The retrieved text, images, and tables are all injected as context, and a VLM like GPT-4o or Gemini generates the response. Hallucination is empirically reduced significantly compared to using text alone — which makes sense once you try it. With visual grounding available, the model has a harder time making things up.
What's Different from Traditional Text RAG
Honestly, at first I thought "can't you just extract text from images and feed it in?" — wouldn't OCR-extracted image text be covered by existing RAG? But information like color distribution in charts, spatial patterns in medical images, and texture in product photos is lost the moment it's converted to text. As this accumulates, it directly impacts retrieval quality.
VisRAG: An approach that retrieves and reasons over document page images directly, without parsing documents into text. It can preserve the original document structure without the layout information loss that occurs during OCR conversion.
Key Changes to Watch in 2025–2026
The multimodal RAG ecosystem is changing rapidly — an architecture from a month ago can already feel outdated.
| Trend | Description |
|---|---|
| Native multimodal embeddings | Google Gemini Embedding 2 maps text, images, video, and audio into a single vector space — no need for separate encoders per modality |
| Agentic RAG | Evolving from a single retrieval-generation loop to multi-step reasoning with Retriever, Validator, and Summarizer agents collaborating |
| Real-time RAG | Expanding from static indexes to dynamic real-time data connections for news, sensors, stock prices, etc. |
| RAG-as-a-Service | Rapid growth of enterprise-grade managed services providing multimodal RAG capabilities via API |
Practical Application
Let's look at how to actually build this pipeline in code.
Example 1: Building a Multimodal PDF Pipeline with LlamaIndex
Consider a technical documentation assistant scenario — processing engineering PDFs that mix circuit diagrams, schematics, code snippets, and text descriptions. I also ran into a NameError when initially trying to create an index without storage_context — you must create separate text and image stores first, then tie them together with StorageContext. Skip this step and even copy-pasted code will error immediately.
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.indices.multi_modal import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.clip import ClipEmbedding
import qdrant_client
# 1. Load multimodal documents (extract text + images together)
documents = SimpleDirectoryReader(
input_dir="./docs",
required_exts=[".pdf"],
recursive=True
).load_data()
# 2. Configure vector stores (separate indexes for text/images)
client = qdrant_client.QdrantClient(host="localhost", port=6333)
text_store = QdrantVectorStore(
client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
client=client, collection_name="image_collection"
)
# 3. Combine text/image stores with StorageContext (required step)
storage_context = StorageContext.from_defaults(
vector_store=text_store,
image_store=image_store,
)
# 4. Align images and text with CLIP embeddings
clip_embedding = ClipEmbedding()
# 5. Build multimodal index
index = MultiModalVectorStoreIndex.from_documents(
documents,
image_embed_model=clip_embedding,
storage_context=storage_context,
)
# 6. Configure VLM and build query engine
mm_llm = OpenAIMultiModal(
model="gpt-4o",
max_new_tokens=1500
)
query_engine = index.as_query_engine(
multi_modal_llm=mm_llm,
similarity_top_k=5, # top 5 text chunks
image_similarity_top_k=3, # top 3 images
)
response = query_engine.query(
"Explain how the voltage regulation module works in this circuit"
)
print(response)| Code Point | Role |
|---|---|
MultiModalVectorStoreIndex |
Manages text and image indexes in a unified way |
StorageContext.from_defaults |
Required step to bind text and image vector stores together |
ClipEmbedding |
Maps images and text into the same vector space |
image_similarity_top_k |
Controls the number of image candidates during retrieval (cost-quality tradeoff) |
OpenAIMultiModal |
VLM that processes image + text context together |
Example 2: Improving Cross-modal Accuracy with Hybrid Search
Vector search alone can miss keyword-match cases. Combining BM25 with hybrid search plus tensor reranking makes a meaningful difference in practice. The parameters you'll tune most often in production are similarity_top_k and top_n — the balance between these simultaneously determines retrieval quality and latency.
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.postprocessor.colbert_rerank import ColbertRerank
from llama_index.core.query_engine import RetrieverQueryEngine
# Define mm_llm if running standalone
mm_llm = OpenAIMultiModal(
model="gpt-4o",
max_new_tokens=1500
)
# Vector search retriever
vector_retriever = index.as_retriever(similarity_top_k=10)
# BM25 keyword search retriever
bm25_retriever = BM25Retriever.from_defaults(
index=index,
similarity_top_k=10
)
# Combine both retrievers with Reciprocal Rank Fusion
hybrid_retriever = QueryFusionRetriever(
retrievers=[vector_retriever, bm25_retriever],
similarity_top_k=5,
num_queries=1,
mode="reciprocal_rerank",
)
# Final ranking adjustment with ColBERT-based tensor reranking
reranker = ColbertRerank(top_n=3)
query_engine = RetrieverQueryEngine(
retriever=hybrid_retriever,
node_postprocessors=[reranker],
llm=mm_llm,
)Reciprocal Rank Fusion (RRF): An ensemble technique that converts rankings from multiple search results into scores and combines them. It's frequently used in practice because it's simple to implement yet delivers reasonably stable performance.
Example 3: Extracting Images and Tables Separately at the Document Parsing Stage
Pipeline quality is determined at the ingestion stage. You can use Unstructured.io to separately extract text, images, and tables from a PDF.
from unstructured.partition.pdf import partition_pdf
def parse_multimodal_document(pdf_path: str) -> dict:
"""Separately extract text, images, and tables from a PDF"""
elements = partition_pdf(
filename=pdf_path,
extract_images_in_pdf=True,
extract_image_block_types=["Image", "Table"],
infer_table_structure=True,
strategy="hi_res", # High-resolution parsing (slow but accurate)
)
result = {"texts": [], "images": [], "tables": []}
for element in elements:
element_type = type(element).__name__
if element_type == "Table":
result["tables"].append({
"content": element.text,
"metadata": element.metadata.to_dict()
})
elif element_type == "Image":
if hasattr(element.metadata, "image_base64"):
result["images"].append({
"base64": element.metadata.image_base64,
"metadata": element.metadata.to_dict()
})
else:
result["texts"].append({
"content": element.text,
"metadata": element.metadata.to_dict()
})
return result| Parsing Strategy | Speed | Accuracy | Recommended For |
|---|---|---|---|
fast |
Fast | Low | Text-heavy documents, prototyping |
auto |
Medium | Medium | General PDFs |
hi_res |
Slow | High | Technical documents with many images/tables, production |
Pros and Cons Analysis
Advantages
| Item | Description |
|---|---|
| Reduced information loss | Directly utilizes visual information — such as chart color distribution and spatial relationships in images — that would be lost during text conversion |
| Hallucination suppression | Multi-source grounding empirically and meaningfully improves generation reliability compared to text-only approaches |
| Richer response quality | Returning text + images + tables together improves the user experience |
| Existing ecosystem compatibility | Extensible within the LangChain and LlamaIndex ecosystems. Existing RAG assets can be reused |
Disadvantages and Caveats
| Item | Description | Mitigation |
|---|---|---|
| Infrastructure complexity | Modality-specific encoders, chunking strategies, and indexes all differ, sharply increasing operational burden | Simplify with native multimodal embeddings like Gemini Embedding 2 |
| Storage and retrieval cost | At the scale of millions of pages, indexes can grow to TB-level | Vector DB sharding, image resolution optimization, tiered storage design |
| Latency | Serialized modality processing stages slow down response times | Async ingestion queues, caching, parallel per-modality processing |
| Cross-modal alignment quality | When retrieving images via text queries, embedding quality differences directly impact performance | Choose high-quality CLIP/ColPali embeddings; consider domain fine-tuning |
| Lack of evaluation metrics | Standard benchmarks for measuring multimodal RAG quality are still immature | Use RAGAS framework combined with per-modality individual evaluation metrics |
Tensor Reranker: A technique that reranks results by computing token-level interactions between a query and documents in a ColBERT-style approach. It's more accurate than standard cosine similarity but computationally expensive, so it's typically applied only to top candidates.
The Most Common Mistakes in Practice
-
Skipping image preprocessing — Embedding low-resolution or noisy images as-is causes retrieval quality to drop sharply. It's best to apply resizing, normalization, and quality filtering at the ingestion stage ahead of time.
-
Severing the positional link between text chunks and images — If the connection between text like "see Figure 3" and the actual metadata of Figure 3 is broken during parsing, retrieval will surface images with no context. It's recommended to preserve positional metadata such as page numbers and section information at the parsing stage.
-
Validating the entire pipeline with a single modality — Assuming that because text search works well, image search will too, leads to serious problems later. A separate evaluation set per modality, with individual verification for each, is necessary.
Closing Thoughts
Multimodal RAG is the core architecture for moving from "AI that understands only text" to "AI that understands the entire document."
It may look complex, but the starting point is closer than you think. Here are 3 steps you can take right now.
- Set up the environment — The command below sets up an environment capable of running all three examples from this article.
BM25RetrieverandColbertRerankfrom Example 2 are separate packages, so install them together.
pip install llama-index-multi-modal-llms-openai \
llama-index-vector-stores-qdrant \
llama-index-embeddings-clip \
llama-index-retrievers-bm25 \
llama-index-postprocessor-colbert-rerank \
unstructured[pdf]
docker run -p 6333:6333 qdrant/qdrant-
Validate the pipeline with the simplest case — Start with one or two PDFs containing around 10–20 images. Parse with
strategy="hi_res", index with CLIP embeddings, then fire off a query describing an image — you'll be able to see with your own eyes how multimodal retrieval works. If the image retrieval results look off on the first run, don't panic. Check the ingestion stage logs first and the cause will usually surface quickly. -
Add hybrid search and reranking in the second iteration — Once you've confirmed the basic pipeline is working, layer in BM25 combination and ColBERT reranking incrementally and compare how retrieval quality changes. This lets you immediately feel each component's real-world contribution.
References
- Building a Multimodal RAG with Text, Images, Tables | Towards Data Science
- mRAG: Elucidating the Design Space of Multi-modal RAG | arXiv 2505.24073
- Multimodal RAG Explained: From Text to Images and Beyond | USAII
- Gemini Embedding 2 — How Multimodal Embeddings Change RAG | jangwook.net
- From RAG to Context: A 2025 Year-End Review | RAGFlow
- What is Multimodal RAG? | IBM
- Multimodal RAG: A Hands-On Guide | DataCamp
- Multimodal RAG Development: 12 Best Practices | Augment Code
- Bridging Modalities: Multimodal RAG for Advanced Information Retrieval | InfoQ
- RAG-Anything: All-in-One RAG Framework | GitHub (HKUDS)
- Real-World Applications of Multimodal Search and RAG | Milvus
- Best Open-Source RAG Frameworks 2026 | Firecrawl
- RAG in 2026: How Retrieval-Augmented Generation Works for Enterprise AI | Techment