How to Build a Visual RAG System for PDFs with Tables

Jake McCluskey

Traditional text-based RAG systems break when they hit PDFs with tables, charts, and complex layouts because they rely on text extraction that strips away visual structure. You need visual RAG, which treats document pages as images and uses vision-language models to understand content in its original layout. ColQwen2 offers a free, open-source path to build this without expensive API costs, and you can run the entire stack on Google Colab with Gemini handling the final answer generation.

The difference isn't subtle. When you extract text from a financial table, you get a jumbled mess of numbers with no column headers or row relationships. Visual RAG sees the table as humans do.

Why Traditional RAG Can't Read Tables and Charts

Text extraction tools like PyPDF2, pdfplumber, or even OCR engines parse PDFs linearly, reading left to right, top to bottom. They ignore spatial relationships. A three-column layout becomes a single stream of text where column boundaries disappear.

Tables suffer the worst. A balance sheet with rows and columns becomes a string of numbers separated by spaces or newlines, with no indication which number belongs to which quarter or line item. Research shows that roughly 65% of business documents contain at least one table or chart that carries critical information, yet standard text extraction loses this structure entirely.

Charts and diagrams vanish completely. Your RAG system never sees the trend line in a sales chart or the relationship arrows in a system architecture diagram. It's answering questions with one eye closed.

Even when extraction succeeds technically, layout matters semantically. A sidebar note, a callout box, or a figure caption carries different weight than body text, but text-only RAG treats everything as an undifferentiated stream.

What ColQwen2 Is and How Vision Models Enable Multimodal Document Understanding

ColQwen2 is a vision-language model built on the ColPali architecture, designed specifically for document retrieval. Unlike general vision models, it generates embeddings optimized for matching document images to text queries. You can run it locally without API costs.

The model processes an entire PDF page as an image, using a vision transformer to create embeddings that capture both text content and visual layout. When you query "What was Q3 revenue?", ColQwen2 retrieves the page image where that table appears, preserving the table structure for downstream processing.

The ColPali architecture uses late interaction: instead of compressing a page into a single vector, it generates multiple embeddings per image, one per visual patch. This lets it match specific regions of a page to your query. A 10-page document might generate 2,000+ patch embeddings, enabling fine-grained retrieval.
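
To make that concrete, here's a minimal sketch of MaxSim scoring, the matching rule ColPali-style models use: every query token is compared against every visual patch, keeps its best match, and those maxima are summed into the page score. Tensor names and shapes are illustrative.


import torch

def maxsim_score(query_tokens: torch.Tensor, page_patches: torch.Tensor) -> torch.Tensor:
    # query_tokens: (num_query_tokens, dim); page_patches: (num_patches, dim)
    similarity = query_tokens @ page_patches.T  # (tokens, patches)
    # Each query token keeps its best-matching patch, then the maxima are summed
    return similarity.max(dim=1).values.sum()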

The model was trained on a large corpus of document-query pairs, learning to recognize tables, charts, forms, and diagrams. It understands that numbers in a grid formation represent a table, that a line graph shows trends, that a flowchart depicts process steps. This visual reasoning doesn't exist in text-only embeddings.

You pair ColQwen2 with a vision-capable LLM like Gemini or GPT-4V for the final answer generation. ColQwen2 retrieves the relevant page images, then the LLM reads those images to extract the answer. The LLM sees what you see.
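
In code, that two-stage flow reduces to a couple of lines (retrieve_pages and generate_answer are hypothetical helpers standing in for the retrieval and generation steps built out below):


def visual_rag_answer(question: str) -> str:
    # Stage 1: retrieve the most relevant page images (ColQwen2 + vector search)
    pages = retrieve_pages(question, k=3)
    # Stage 2: a vision-capable LLM reads those images and answers
    return generate_answer(question, pages)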

Step-by-Step Visual RAG Architecture for PDFs with Complex Layouts

Visual RAG replaces the text extraction step with image conversion, then uses vision models for both embedding and retrieval. Here's the complete pipeline.

Convert PDF Pages to Images

Use pdf2image or PyMuPDF to render each PDF page as a PNG or JPEG. Quality matters: 150-200 DPI captures tables and charts clearly without creating massive files. A 50-page document at 200 DPI produces roughly 50-100 MB of images.


from pdf2image import convert_from_path

images = convert_from_path('financial_report.pdf', dpi=200)
for i, image in enumerate(images):
    image.save(f'page_{i}.png', 'PNG')

Store images locally or in cloud storage. You'll reference them during retrieval and pass them to the vision LLM for answer generation.

Generate Vision Embeddings with ColQwen2

Load ColQwen2 and process each page image to create embeddings. The model outputs a tensor of patch embeddings per page. You'll store these in a vector database alongside metadata linking to the original page image.


from colpali_engine.models import ColQwen2, ColQwen2Processor
from PIL import Image
import torch

model = ColQwen2.from_pretrained("vidore/colqwen2-v0.1").eval()
processor = ColQwen2Processor.from_pretrained("vidore/colqwen2-v0.1")

# Process a single page: the processor expects a list of PIL images,
# and the model returns one embedding per visual patch
image = Image.open('page_0.png')
inputs = processor.process_images([image])

with torch.no_grad():
    embeddings = model(**inputs)  # shape: (1, num_patches, embedding_dim)

Each page generates hundreds of embeddings (one per visual patch). Store these with page identifiers so you can retrieve the correct image later.
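
A simple way to keep that link is a side-car metadata list where position i in the vector index corresponds to record i (a minimal sketch; page_paths is assumed to be the list of image paths saved during conversion):


page_store = [
    {"page": i, "image_path": path}  # index position i maps back to this page
    for i, path in enumerate(page_paths)
]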

Build Visual Retrieval with Vector Search

Index your vision embeddings in a vector database like FAISS, Qdrant, or Chroma. When a user queries, embed the query text with the same ColQwen2 model (it handles both page images and text queries), then search for the most similar page embeddings.


import faiss
import numpy as np

# Mean-pool the patch embeddings into one vector per page. This is a
# simplification: it trades ColPali's late-interaction matching for a
# plain FAISS index that is easy to run on Colab.
page_embeddings = embeddings.mean(dim=1).float().numpy()

# Create index (inner product for similarity)
dimension = page_embeddings.shape[-1]
index = faiss.IndexFlatIP(dimension)
index.add(page_embeddings)

# Embed the query with the same model, then pool it to a single vector
query = "What was the revenue growth in Q3?"
query_inputs = processor.process_queries([query])
with torch.no_grad():
    query_embedding = model(**query_inputs).mean(dim=1).float().numpy()

# Retrieve top 3 pages
distances, indices = index.search(query_embedding, k=3)

The retrieval returns page indices. Load the corresponding page images for the next step.
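
Mapping the results back to images is a one-liner; indices has shape (1, k) after the search above:


# Load the page images for the top-k retrieved indices
retrieved_images = [Image.open(f'page_{idx}.png') for idx in indices[0]]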

Generate Answers with Vision LLM

Pass the retrieved page images to a vision-capable LLM along with the original query. Gemini 2.0 Flash offers free tier access with good vision capabilities. The LLM reads the image and extracts the answer.


import google.generativeai as genai

genai.configure(api_key='your_api_key')
# Keep a separate name so the ColQwen2 retrieval model isn't overwritten
gemini_model = genai.GenerativeModel('gemini-2.0-flash')

# Load the top retrieved page
retrieved_page = Image.open(f'page_{indices[0][0]}.png')

# Generate the answer from the page image plus the original query
response = gemini_model.generate_content([
    "Answer this question based on the document page: " + query,
    retrieved_page
])

print(response.text)

The LLM sees the table structure, chart axes, and diagram labels exactly as they appear. It can read across columns, follow chart legends, and interpret visual relationships.

How to Implement Free Visual RAG Stack on Google Colab Without API Costs

You can run the entire visual RAG pipeline on Google Colab's free tier, using ColQwen2 for embeddings and retrieval, then Gemini's free API tier for answer generation. This setup handles documents up to 100 pages without hitting resource limits.

Start by installing dependencies in a Colab notebook. ColQwen2 requires transformers and the colpali-engine package. pdf2image needs poppler-utils for PDF rendering.


!apt-get install -y poppler-utils
!pip install pdf2image colpali-engine transformers torch pillow faiss-cpu google-generativeai

Upload your PDF to Colab's file system or mount Google Drive. Convert pages to images at 150 DPI to balance quality and memory usage. Colab's free tier provides roughly 12 GB RAM, enough for 50-100 page documents.
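
If your PDF lives in Google Drive, the standard mount works (the path under /content/drive depends on where you keep the file):


from google.colab import drive
from pdf2image import convert_from_path

drive.mount('/content/drive')

# 150 DPI keeps memory usage manageable on the free tier
images = convert_from_path('/content/drive/MyDrive/financial_report.pdf', dpi=150)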

Load ColQwen2 in inference mode to reduce memory footprint. The model is about 3 GB, leaving plenty of room for image processing and vector storage.


model = ColQwen2.from_pretrained(
    "vidore/colqwen2-v0.1",
    torch_dtype=torch.float16,
    device_map="auto"
).eval()

Process pages in batches of 5-10 to avoid memory spikes. Store embeddings in FAISS with IndexFlatIP for exact search. For larger document sets (500+ pages), switch to IndexIVFFlat for faster approximate search.
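
Here's a rough sketch of both ideas, assuming images is the list of page images and model and processor are the ColQwen2 objects loaded above; the batch size and nlist values are illustrative:


import faiss
import numpy as np

batch_size = 8
page_vectors = []

# Embed pages in small batches to avoid memory spikes on the free tier
for start in range(0, len(images), batch_size):
    batch = images[start:start + batch_size]
    inputs = processor.process_images(batch).to(model.device)
    with torch.no_grad():
        batch_embeddings = model(**inputs)  # (batch, patches, dim)
    page_vectors.append(batch_embeddings.mean(dim=1).float().cpu().numpy())

page_vectors = np.concatenate(page_vectors, axis=0)

# Exact search is fine for small corpora
index = faiss.IndexFlatIP(page_vectors.shape[-1])
index.add(page_vectors)

# For 500+ page corpora, an IVF index trades exactness for speed
nlist = 64
quantizer = faiss.IndexFlatIP(page_vectors.shape[-1])
ivf_index = faiss.IndexIVFFlat(quantizer, page_vectors.shape[-1], nlist, faiss.METRIC_INNER_PRODUCT)
ivf_index.train(page_vectors)
ivf_index.add(page_vectors)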

Gemini's free tier allows 15 requests per minute, which is sufficient for interactive Q&A. Cache retrieved page images to avoid redundant API calls when users ask follow-up questions about the same section.
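
A small in-memory cache is one way to do that: memoize answers per question and page so repeats don't consume free-tier requests (a sketch; answer_question is a hypothetical helper and gemini_model is the GenerativeModel instance from earlier):


answer_cache = {}

def answer_question(question: str, page_idx: int) -> str:
    # Reuse cached answers so repeated questions about the same page
    # don't trigger another Gemini request
    key = (question, page_idx)
    if key not in answer_cache:
        page_image = Image.open(f'page_{page_idx}.png')
        response = gemini_model.generate_content([
            "Answer this question based on the document page: " + question,
            page_image,
        ])
        answer_cache[key] = response.text
    return answer_cache[key]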

The complete implementation runs in under 30 minutes for a 20-page document on Colab's free GPU. No credit card, no vendor lock-in. If you're building document AI systems, this stack proves the concept before you commit to paid infrastructure.

When Visual RAG Outperforms Text Extraction

Visual RAG excels with financial reports, where tables dominate. A 10-K filing contains dozens of financial statements with nested rows, merged cells, and footnote references. Text extraction produces garbage. Visual RAG reads it like an analyst.

Research papers with charts and diagrams benefit enormously. When a user asks "What was the accuracy improvement?", visual RAG retrieves the results table or bar chart, and the LLM reads the exact numbers from the visual. Text extraction might miss the chart entirely or mangle the table formatting.

Technical manuals and specification sheets use complex layouts: multi-column text, callout boxes, annotated diagrams, parts lists. These documents encode information spatially. A wiring diagram shows connections through lines and labels, and text extraction can't capture that.

Infographics and marketing materials rely entirely on visual design. Statistics in large fonts, icons representing concepts, flowcharts showing processes. There's often minimal body text to extract. Visual RAG treats the entire page as the content.

Forms and applications preserve field structure. An insurance claim form has boxes, checkboxes, and signature lines. Visual RAG can answer "What's the deductible amount?" by locating the correct field, while text extraction loses the field-value pairing.

Testing shows visual RAG achieves roughly 40% higher accuracy on table-heavy documents compared to text-based RAG, measured by exact answer match. The gap widens with complex layouts: 60%+ improvement on documents with multi-column layouts or embedded charts.

Performance Considerations and Choosing Visual vs Text-Based RAG

Visual RAG costs more computationally. Processing a page image through a vision transformer takes 2-3 seconds on CPU, versus milliseconds for text embedding. For a 100-page document, expect 5-10 minutes of processing time upfront.

Storage increases significantly. Text embeddings for a page might be 1-2 KB, but vision patch embeddings can reach 50-100 KB per page. A 1,000-page document corpus requires 50-100 MB of embedding storage versus 1-2 MB for text-only.

Retrieval speed stays comparable. Vector search across vision embeddings takes the same time as text embeddings given proper indexing. FAISS handles millions of vectors efficiently regardless of what they represent.

Answer generation is slower because you're passing images to the LLM instead of text snippets. A vision LLM processes a page image in 3-5 seconds versus under 1 second for text. For interactive applications, this latency matters.

Choose text-based RAG when your documents are text-heavy with simple layouts: novels, articles, transcripts, plain text reports. If you can extract clean paragraphs with minimal formatting loss, text RAG is faster and cheaper.

Choose visual RAG when layout carries meaning: any document with tables, charts, diagrams, forms, or multi-column layouts. Also when visual elements (photos, illustrations, maps) contain information relevant to queries. For more on implementing AI systems effectively, see how to implement AI in your business without wasting money.

Hybrid approaches work well. Use text RAG for narrative sections and visual RAG for pages flagged as containing tables or charts. This requires page classification upfront but optimizes cost and speed.
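
A crude but workable page classifier is a text-extraction heuristic: if a page has detectable tables or very little extractable text, route it to the visual pipeline. Here's a sketch using pdfplumber; the 200-character threshold is an arbitrary cutoff you'd tune for your documents.


import pdfplumber

def route_pages(pdf_path: str) -> dict:
    # Returns {page_number: "visual" | "text"} for downstream routing
    routing = {}
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            has_table = len(page.extract_tables()) > 0
            routing[i] = "visual" if has_table or len(text) < 200 else "text"
    return routing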

For production systems handling thousands of documents, consider the infrastructure costs. Vision embeddings and image storage add up. Run cost analysis comparing visual RAG compute costs against the business value of accurate answers on complex documents.

Building Your Visual RAG Implementation

Start small with a single complex PDF that breaks your current text-based RAG. Convert it to images, run it through ColQwen2, and test retrieval quality. Compare answers from visual RAG versus your existing system.

Measure accuracy on questions that require reading tables or charts. If visual RAG shows clear improvement (20%+ higher exact match rate), expand to your full document set. If the improvement is marginal, your documents might not need visual processing.
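
Exact match is straightforward to compute yourself; here's a minimal version that normalizes only case and whitespace (you may want to extend it for number and currency formatting):


def exact_match_rate(predictions: list[str], gold_answers: list[str]) -> float:
    # Fraction of predictions that match the reference answer exactly
    # after trimming whitespace and lowercasing
    matches = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, gold_answers)
    )
    return matches / len(gold_answers)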

Optimize your pipeline iteratively. Test different DPI settings (150 vs 200 vs 300) to find the sweet spot between image quality and file size. Experiment with batch sizes during embedding generation to maximize GPU utilization without running out of memory.

Monitor retrieval precision. If ColQwen2 retrieves irrelevant pages, you might need to adjust your query formulation or add query expansion. Vision models work best with specific queries ("What was Q3 revenue?") rather than vague ones ("Tell me about finances").

Consider the broader AI architecture when building document systems. For related techniques on structuring AI components, check out how to build hybrid RAG systems with knowledge graphs and how AI systems like ChatGPT actually work for foundational understanding.

Look, visual RAG isn't a replacement for all document processing, but it solves a real problem that text extraction can't touch. When your documents contain information that lives in tables, charts, and layouts, treating pages as images and using vision models for retrieval gives you accurate answers that text-based systems miss. ColQwen2 and Gemini provide a free path to implementation you can test today on Google Colab, proving the approach before committing to production infrastructure.
