DEV BAK - TECH BLOG
AI

Building a Multimodal RAG Pipeline: Making LLMs Understand Images and Tables

When I first adopted RAG, I hit this wall myself. I parsed a few hundred PDFs, loaded them into a vector DB, and ran some searches. It retrieved text-heavy content reasonably well, but charts, product photos, and circuit diagrams embedded in the middle of documents were completely ignored, even when that visual information was essentially the core of the document. This is a structural limitation of traditional text RAG.

Multimodal RAG starts exactly from that point. It's an architecture that retrieves and uses images, tables, audio, and video within the same pipeline as text. This article covers how multimodal RAG works, how to build an actual pipeline, and the tradeoffs you must consider in production. It assumes you have built a basic text RAG pipeline at least once. If vector DBs, embedding models, and chunking are still unfamiliar, review RAG fundamentals first.

This technology is already used in production services across multiple domains — medical imaging, e-commerce, technical documentation — and now is a great time to get started. Native multimodal embedding models like Google Gemini Embedding 2, which map text, images, and audio into a single vector space, have just stabilized, making implementation noticeably less complex compared to a year ago.


Core Concepts

The Three-Stage Structure of Multimodal RAG

If traditional RAG follows a linear flow of "text → embedding → vector search → generation," multimodal RAG performs modality-specific processing in parallel at each stage.

text
[Source Document]
    ├── Text chunks → Transformer embedding
    ├── Images/charts → Vision Encoder (CLIP, ColPali, etc.)
    └── Table data → Convert to text → Transformer embedding
 
          ↓ (stored in a unified vector space)
 
[Query] → Cross-modal retrieval → [Text + Image + Table context]
                                          ↓
                                  VLM (GPT-4o, Gemini, etc.)
                                          ↓
                                    [Final response]

Stage 1 (Multimodal Indexing): Each modality is embedded with a suitable encoder and stored in the index. For text, models like BGE-M3 (a transformer-based embedding model with strong multilingual support) are used. For images, CLIP or ColPali is used, and the two take quite different approaches. CLIP aligns image-text pairs in a shared vector space, so a text query can be compared directly against image vectors. ColPali uses a late-interaction approach that embeds document page images directly without OCR, preserving layout and formatting information. Vectors from encoders that were not trained together (BGE-M3 and CLIP, for example) are not directly comparable, so in that case each modality gets its own collection and the results are merged at query time. Embedding quality at this stage determines the entire downstream retrieval performance.

Stage 2 (Cross-modal Retrieval): The key is that even when a text query comes in, related images and tables are retrieved alongside it. Vector similarity search alone has limitations, so the setup closest to a production standard today is hybrid search (vector search combined with BM25 keyword search) followed by tensor reranking of the top candidates.

Stage 3 (Multimodal Generation): The retrieved text, images, and tables are all injected as context, and a VLM like GPT-4o or Gemini generates the response. In my experience, hallucination drops noticeably compared to using text alone, which makes sense: when the model can see the actual chart or diagram, it has less room to make things up.
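
To make the retrieval stage concrete, here is a minimal sketch of cross-modal search over a shared vector space. The three-dimensional vectors are hand-made stand-ins for what a CLIP-style encoder might produce, and the item ids are hypothetical. Only the ranking logic is the point.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy index: every item lives in the same space regardless of modality.
# These vectors are invented for illustration, not real encoder output.
INDEX = [
    {"id": "chunk-12", "modality": "text",  "vector": [0.9, 0.1, 0.0]},
    {"id": "fig-3",    "modality": "image", "vector": [0.8, 0.2, 0.1]},
    {"id": "table-2",  "modality": "table", "vector": [0.1, 0.9, 0.2]},
    {"id": "photo-7",  "modality": "image", "vector": [0.0, 0.1, 0.9]},
]

def search(query_vector, top_k=2):
    """Rank all items by similarity to the query, ignoring modality."""
    scored = [(cosine(query_vector, item["vector"]), item) for item in INDEX]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(item["id"], item["modality"]) for _, item in scored[:top_k]]

# A text query whose (made-up) embedding sits close to chunk-12 and fig-3
hits = search([1.0, 0.1, 0.0])
print(hits)
```

Because the ranking never looks at the modality field, a text query can surface an image as long as their vectors are close. That only holds when all vectors come from an encoder trained to align the modalities.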


What's Different from Traditional Text RAG

Honestly, at first I thought "can't you just extract text from images and feed it in?" — wouldn't OCR-extracted image text be covered by existing RAG? But information like color distribution in charts, spatial patterns in medical images, and texture in product photos is lost the moment it's converted to text. As this accumulates, it directly impacts retrieval quality.

VisRAG: An approach that retrieves and reasons over document page images directly, without parsing documents into text. It can preserve the original document structure without the layout information loss that occurs during OCR conversion.


Key Changes to Watch in 2025–2026

The multimodal RAG ecosystem is changing rapidly — an architecture from a month ago can already feel outdated.

| Trend | Description |
| --- | --- |
| Native multimodal embeddings | Google Gemini Embedding 2 maps text, images, video, and audio into a single vector space, so separate encoders per modality are not needed |
| Agentic RAG | Evolving from a single retrieval-generation loop to multi-step reasoning with Retriever, Validator, and Summarizer agents collaborating |
| Real-time RAG | Expanding from static indexes to dynamic real-time data connections for news, sensors, stock prices, etc. |
| RAG-as-a-Service | Rapid growth of enterprise-grade managed services providing multimodal RAG capabilities via API |

Practical Application

Let's look at how to actually build this pipeline in code.

Example 1: Building a Multimodal PDF Pipeline with LlamaIndex

Consider a technical documentation assistant scenario: processing engineering PDFs that mix circuit diagrams, schematics, code snippets, and text descriptions. When I first tried to create the index without defining storage_context, I hit a NameError. You must create separate text and image stores first, then tie them together with StorageContext. Skip this step and even copy-pasted code will fail immediately.

python
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.indices.multi_modal import MultiModalVectorStoreIndex
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.clip import ClipEmbedding
import qdrant_client
 
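# 1. Load documents. The default PDF reader returns page text only, so
#    figures need to be exported as image files first (see Example 3)
#    for the image index to receive anything.
documents = SimpleDirectoryReader(
    input_dir="./docs",
    required_exts=[".pdf", ".png", ".jpg"],
    recursive=True
).load_data()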
 
# 2. Configure vector stores (separate indexes for text/images)
client = qdrant_client.QdrantClient(host="localhost", port=6333)
 
text_store = QdrantVectorStore(
    client=client, collection_name="text_collection"
)
image_store = QdrantVectorStore(
    client=client, collection_name="image_collection"
)
 
# 3. Combine text/image stores with StorageContext (required step)
storage_context = StorageContext.from_defaults(
    vector_store=text_store,
    image_store=image_store,
)
 
# 4. Align images and text with CLIP embeddings
clip_embedding = ClipEmbedding()
 
# 5. Build multimodal index
index = MultiModalVectorStoreIndex.from_documents(
    documents,
    image_embed_model=clip_embedding,
    storage_context=storage_context,
)
 
# 6. Configure VLM and build query engine
mm_llm = OpenAIMultiModal(
    model="gpt-4o",
    max_new_tokens=1500
)
 
query_engine = index.as_query_engine(
    multi_modal_llm=mm_llm,
    similarity_top_k=5,       # top 5 text chunks
    image_similarity_top_k=3, # top 3 images
)
 
response = query_engine.query(
    "Explain how the voltage regulation module works in this circuit"
)
print(response)

| Code Point | Role |
| --- | --- |
| MultiModalVectorStoreIndex | Manages text and image indexes in a unified way |
| StorageContext.from_defaults | Required step to bind text and image vector stores together |
| ClipEmbedding | Maps images and text into the same vector space |
| image_similarity_top_k | Controls the number of image candidates during retrieval (cost-quality tradeoff) |
| OpenAIMultiModal | VLM that processes image + text context together |

Example 2: Improving Cross-modal Accuracy with Hybrid Search

Vector search alone can miss exact keyword matches. Hybrid search that adds BM25, followed by tensor reranking, makes a meaningful difference in practice. The parameters you'll tune most often in production are similarity_top_k and top_n; the balance between them determines both retrieval quality and latency.

python
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.core.retrievers import QueryFusionRetriever
from llama_index.postprocessor.colbert_rerank import ColbertRerank
from llama_index.core.query_engine import RetrieverQueryEngine
 
# Define mm_llm if running standalone
mm_llm = OpenAIMultiModal(
    model="gpt-4o",
    max_new_tokens=1500
)
 
# Vector search retriever (`index` is the MultiModalVectorStoreIndex built in Example 1)
vector_retriever = index.as_retriever(similarity_top_k=10)
 
# BM25 keyword search retriever
bm25_retriever = BM25Retriever.from_defaults(
    index=index,
    similarity_top_k=10
)
 
# Combine both retrievers with Reciprocal Rank Fusion
hybrid_retriever = QueryFusionRetriever(
    retrievers=[vector_retriever, bm25_retriever],
    similarity_top_k=5,
    num_queries=1,
    mode="reciprocal_rerank",
)
 
# Final ranking adjustment with ColBERT-based tensor reranking
reranker = ColbertRerank(top_n=3)
 
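# from_args builds the response synthesizer from the given LLM; the plain
# constructor does not take an `llm` argument. This assumes a llama-index
# version in which OpenAIMultiModal is accepted wherever an LLM is expected.
query_engine = RetrieverQueryEngine.from_args(
    retriever=hybrid_retriever,
    node_postprocessors=[reranker],
    llm=mm_llm,
)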

Reciprocal Rank Fusion (RRF): An ensemble technique that converts rankings from multiple search results into scores and combines them. It's frequently used in practice because it's simple to implement yet delivers reasonably stable performance.
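
RRF is small enough to write out by hand, which helps when debugging why a result ended up where it did. The sketch below is a standalone illustration, not the code that QueryFusionRetriever runs internally. The constant k=60 is the value conventionally used with RRF, and the document ids are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids (each ordered best-first).

    Each document earns 1 / (k + rank) from every list it appears in,
    and the fused order is by total score, highest first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical results from the two retrievers
vector_hits = ["fig-3", "chunk-12", "table-2"]
bm25_hits = ["chunk-12", "chunk-40", "fig-3"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Only ranks enter the formula, never raw scores, which is why BM25 scores and cosine similarities can be combined without any normalization step.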


Example 3: Extracting Images and Tables Separately at the Document Parsing Stage

Pipeline quality is determined at the ingestion stage. You can use Unstructured.io to separately extract text, images, and tables from a PDF.

python
from unstructured.partition.pdf import partition_pdf
 
def parse_multimodal_document(pdf_path: str) -> dict:
    """Separately extract text, images, and tables from a PDF"""
 
    elements = partition_pdf(
        filename=pdf_path,
        extract_images_in_pdf=True,
        extract_image_block_types=["Image", "Table"],
        extract_image_block_to_payload=True,  # keep crops as base64 in metadata instead of writing files
        infer_table_structure=True,
        strategy="hi_res",         # High-resolution parsing (slow but accurate)
    )

    result = {"texts": [], "images": [], "tables": []}

    for element in elements:
        element_type = type(element).__name__

        if element_type == "Table":
            result["tables"].append({
                "content": element.text,
                "metadata": element.metadata.to_dict()
            })
        elif element_type == "Image":
            # image_base64 is None when no crop was attached to the element
            image_b64 = getattr(element.metadata, "image_base64", None)
            if image_b64:
                result["images"].append({
                    "base64": image_b64,
                    "metadata": element.metadata.to_dict()
                })
        else:
            result["texts"].append({
                "content": element.text,
                "metadata": element.metadata.to_dict()
            })
 
    return result

| Parsing Strategy | Speed | Accuracy | Recommended For |
| --- | --- | --- | --- |
| fast | Fast | Low | Text-heavy documents, prototyping |
| auto | Medium | Medium | General PDFs |
| hi_res | Slow | High | Technical documents with many images/tables, production |
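
Extracted tables still have to become text before a text embedding model can index them. One common option is to linearize each row into a "column: value" line so that every chunk carries its own header context. The sketch below assumes the table has already been split into a header and rows (for example by parsing the HTML that Unstructured places in `metadata.text_as_html` when `infer_table_structure=True`). The function name and the sample data are hypothetical.

```python
def linearize_table(header, rows, caption=""):
    """Turn a table into one self-describing text line per row."""
    prefix = f"{caption} | " if caption else ""
    lines = []
    for row in rows:
        # Pair each cell with its column name and skip empty cells
        cells = [f"{col}: {val}" for col, val in zip(header, row) if val != ""]
        lines.append(prefix + "; ".join(cells))
    return lines

header = ["Part", "Input (V)", "Output (V)"]
rows = [["U1", "12", "5"], ["U2", "5", "3.3"]]

chunks = linearize_table(header, rows, caption="Table 2: Regulators")
for chunk in chunks:
    print(chunk)
```

Repeating the caption and column names in every line costs a few tokens, but it means a single retrieved row is still interpretable without the rest of the table.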

Pros and Cons Analysis

Advantages

| Item | Description |
| --- | --- |
| Reduced information loss | Directly uses visual information that would be lost during text conversion, such as chart color distribution and spatial relationships in images |
| Hallucination suppression | In practice, multi-source grounding meaningfully improves generation reliability compared to text-only approaches |
| Richer response quality | Returning text + images + tables together improves the user experience |
| Existing ecosystem compatibility | Extensible within the LangChain and LlamaIndex ecosystems. Existing RAG assets can be reused |

Disadvantages and Caveats

| Item | Description | Mitigation |
| --- | --- | --- |
| Infrastructure complexity | Modality-specific encoders, chunking strategies, and indexes all differ, sharply increasing operational burden | Simplify with native multimodal embeddings like Gemini Embedding 2 |
| Storage and retrieval cost | At the scale of millions of pages, indexes can grow to TB-level | Vector DB sharding, image resolution optimization, tiered storage design |
| Latency | Serialized modality processing stages slow down response times | Async ingestion queues, caching, parallel per-modality processing |
| Cross-modal alignment quality | When retrieving images via text queries, embedding quality differences directly impact performance | Choose high-quality CLIP/ColPali embeddings; consider domain fine-tuning |
| Lack of evaluation metrics | Standard benchmarks for measuring multimodal RAG quality are still immature | Use RAGAS framework combined with per-modality individual evaluation metrics |
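
The "parallel per-modality processing" mitigation from the latency row can be sketched with asyncio. The three embed functions below are placeholders that only sleep and return labeled strings; in a real pipeline they would call the actual encoders. All names here are hypothetical.

```python
import asyncio

async def embed_texts(chunks):
    """Placeholder for a text embedding call."""
    await asyncio.sleep(0.01)  # simulates network or GPU latency
    return [f"text-vec({c})" for c in chunks]

async def embed_images(images):
    """Placeholder for an image embedding call."""
    await asyncio.sleep(0.01)
    return [f"image-vec({i})" for i in images]

async def embed_tables(tables):
    """Placeholder for embedding linearized tables."""
    await asyncio.sleep(0.01)
    return [f"table-vec({t})" for t in tables]

async def ingest(parsed):
    """Run the three modality branches concurrently instead of one by one."""
    texts, images, tables = await asyncio.gather(
        embed_texts(parsed["texts"]),
        embed_images(parsed["images"]),
        embed_tables(parsed["tables"]),
    )
    return {"texts": texts, "images": images, "tables": tables}

parsed = {"texts": ["intro"], "images": ["fig-3.png"], "tables": ["table-2"]}
vectors = asyncio.run(ingest(parsed))
print(vectors)
```

With `asyncio.gather`, total wall time approaches that of the slowest branch rather than the sum of all three, provided the underlying calls are I/O-bound or otherwise release the event loop.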

Tensor Reranker: A technique that reranks results by computing token-level interactions between a query and documents in a ColBERT-style approach. It's more accurate than standard cosine similarity but computationally expensive, so it's typically applied only to top candidates.
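
The token-level interaction described above is usually computed as a "MaxSim" score: each query token takes its best-matching document token, and those maxima are summed. Here is a minimal pure-Python sketch with tiny hand-made token vectors. Real ColBERT-style models produce many vectors with far more dimensions per token, and the document ids are invented.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token vector, take its best
    match among the document token vectors, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def rerank(query_tokens, candidates, top_n=3):
    """Reorder candidate documents by MaxSim score, best first."""
    ranked = sorted(
        candidates.items(),
        key=lambda item: maxsim_score(query_tokens, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:top_n]]

# Two query tokens and three candidates, each a list of token vectors
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "doc-a": [[0.9, 0.1], [0.2, 0.8]],
    "doc-b": [[0.7, 0.7]],
    "doc-c": [[1.0, 0.0], [0.9, 0.1]],
}

print(rerank(query, candidates, top_n=2))
```

The cost grows with (query tokens × document tokens) per candidate, which is why this step is applied to a short candidate list rather than the whole index.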

The Most Common Mistakes in Practice

  1. Skipping image preprocessing — Embedding low-resolution or noisy images as-is causes retrieval quality to drop sharply. It's best to apply resizing, normalization, and quality filtering at the ingestion stage ahead of time.

  2. Severing the positional link between text chunks and images — If the connection between text like "see Figure 3" and the actual metadata of Figure 3 is broken during parsing, retrieval will surface images with no context. It's recommended to preserve positional metadata such as page numbers and section information at the parsing stage.

  3. Validating the entire pipeline with a single modality — Assuming that because text search works well, image search will too, leads to serious problems later. A separate evaluation set per modality, with individual verification for each, is necessary.
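
For the second mistake, one lightweight safeguard is to scan each text chunk for figure references at ingestion time and store the matching figure ids in the chunk's metadata. The sketch below assumes the parser has already assigned a figure number to every extracted image, which is not something every parser provides. The field names are hypothetical.

```python
import re

# Matches "Figure 3", "figure 12", "Fig. 5"
FIGURE_REF = re.compile(r"\b(?:Figure|Fig\.)\s*(\d+)", re.IGNORECASE)

def link_figures(text_chunks, figures):
    """Attach the ids of referenced figures to each text chunk.

    text_chunks: list of dicts with "id", "page", "text"
    figures:     list of dicts with "id", "page", "number"
    """
    by_number = {fig["number"]: fig for fig in figures}
    linked = []
    for chunk in text_chunks:
        numbers = {int(n) for n in FIGURE_REF.findall(chunk["text"])}
        figure_ids = [by_number[n]["id"] for n in sorted(numbers) if n in by_number]
        linked.append({**chunk, "figure_ids": figure_ids})
    return linked

chunks = [
    {"id": "c1", "page": 4, "text": "The regulator stage (see Figure 3) feeds Fig. 5."},
    {"id": "c2", "page": 5, "text": "No references here."},
]
figures = [
    {"id": "img-3", "page": 4, "number": 3},
    {"id": "img-5", "page": 6, "number": 5},
]

linked = link_figures(chunks, figures)
print(linked[0]["figure_ids"])
```

At query time, a retrieved text chunk can then pull in its linked figures even when the image embeddings alone would not have ranked them highly.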


Closing Thoughts

Multimodal RAG is the core architecture for moving from "AI that understands only text" to "AI that understands the entire document."

It may look complex, but the starting point is closer than you think. Here are 3 steps you can take right now.

  1. Set up the environment: The commands below set up an environment capable of running all three examples from this article. BM25Retriever and ColbertRerank from Example 2 live in separate packages, so install them together.
bash
# llama-index pulls in the core package and the default PDF reader
pip install llama-index \
            llama-index-multi-modal-llms-openai \
            llama-index-vector-stores-qdrant \
            llama-index-embeddings-clip \
            llama-index-retrievers-bm25 \
            llama-index-postprocessor-colbert-rerank \
            "unstructured[pdf]"

# ClipEmbedding also needs OpenAI's CLIP package, which is installed from GitHub
pip install git+https://github.com/openai/CLIP.git

docker run -p 6333:6333 qdrant/qdrant
  2. Validate the pipeline with the simplest case: Start with one or two PDFs containing around 10–20 images. Parse with strategy="hi_res", index with CLIP embeddings, then fire off a query describing an image. You'll see with your own eyes how multimodal retrieval works. If the image retrieval results look off on the first run, don't panic. Check the ingestion stage logs first and the cause will usually surface quickly.

  3. Add hybrid search and reranking in the second iteration: Once you've confirmed the basic pipeline is working, layer in BM25 and ColBERT reranking incrementally and compare how retrieval quality changes. This lets you see each component's real-world contribution right away.
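
To compare retrieval quality between iterations, it helps to have a small labeled set and a metric broken down by modality. The sketch below computes recall@k per modality. The evaluation set and the canned retriever are invented stand-ins; in practice `retrieve` would wrap the real retriever and return node ids.

```python
from collections import defaultdict

def recall_at_k_by_modality(eval_set, retrieve, k=5):
    """Fraction of queries whose relevant item appears in the top k,
    reported separately for each modality.

    eval_set: list of dicts with "query", "relevant_id", "modality"
    retrieve: callable(query, k) -> list of ids, best first
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case in eval_set:
        totals[case["modality"]] += 1
        if case["relevant_id"] in retrieve(case["query"], k):
            hits[case["modality"]] += 1
    return {modality: hits[modality] / totals[modality] for modality in totals}

# Hypothetical evaluation set
EVAL_SET = [
    {"query": "regulator dropout voltage", "relevant_id": "chunk-12", "modality": "text"},
    {"query": "pinout description", "relevant_id": "chunk-40", "modality": "text"},
    {"query": "regulator schematic", "relevant_id": "fig-3", "modality": "image"},
    {"query": "board photo", "relevant_id": "photo-7", "modality": "image"},
]

# Canned results standing in for a real retriever
CANNED = {
    "regulator dropout voltage": ["chunk-12", "fig-3"],
    "pinout description": ["chunk-40"],
    "regulator schematic": ["fig-3", "chunk-12"],
    "board photo": ["chunk-40"],
}

def fake_retrieve(query, k):
    return CANNED[query][:k]

report = recall_at_k_by_modality(EVAL_SET, fake_retrieve, k=5)
print(report)
```

Running the same evaluation set before and after adding BM25 or reranking shows which modality actually benefited, instead of a single blended number hiding a regression in image retrieval.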

