The Fatal Flaw in Standard RAG
Most RAG pipelines start with a simple premise: split text into 500-character chunks and hope for the best. This works for basic blog posts. However, the moment you feed it a 10-K financial filing with nested tables or a technical manual full of circuit diagrams, the system breaks.
Standard chunking often slices a table right down the middle, separating a row from its headers. When the retriever pulls that isolated row, the LLM sees a string of random numbers with no context. It doesn’t just fail; it hallucinates with confidence.
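To see the failure concretely, here is a minimal sketch of a naive fixed-size splitter applied to a small table rendered as plain text. The table values and the 60-character chunk size are made up for illustration; real pipelines use larger sizes, but the effect is the same once a table crosses a chunk boundary.

```python
# A toy financial table rendered as plain text (illustrative data, not from a real filing)
table_text = (
    "Segment | Q2 Revenue ($M) | Q3 Revenue ($M)\n"
    "Cloud | 410 | 455\n"
    "Hardware | 220 | 198\n"
    "Services | 130 | 142\n"
)


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive splitter: cut every `size` characters, ignoring any structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


chunks = fixed_size_chunks(table_text, 60)
for n, chunk in enumerate(chunks):
    print(n, repr(chunk))
```

The second chunk holds the Hardware and Services rows with no header line, and the Cloud figure "455" is itself split into "45" and "5". A retriever that returns only that second chunk hands the LLM bare numbers with no column meaning.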
I spent months troubleshooting why a production RAG system was failing on basic PDF queries. The culprit wasn’t the LLM’s reasoning—it was the retrieval strategy. By switching to a Multi-vector approach, specifically the Parent Document strategy, we increased our retrieval accuracy on tabular data from a shaky 55% to over 92%. I have since deployed this architecture across several enterprise projects where data precision is non-negotiable.
How Retrieval Strategies Actually Compare
Before writing code, you need to understand why the “Naive” approach fails where Multi-vector succeeds. It comes down to how we balance searchability with context.
The Naive Approach (Standard RAG)
In a typical setup, you split a document into small, equal-sized chunks, embed them, and store them. At query time, the system finds the top-K chunks. Small chunks are great for pinpointing specific keywords but lack the surrounding context. Conversely, large chunks contain too much “noise,” which dilutes the embedding vector and makes the search less precise. You are stuck in a trade-off where you either get precision without context or context without precision.
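The dilution effect can be illustrated with a crude stand-in for embeddings. The sketch below uses bag-of-words term counts and cosine similarity (an assumption for illustration; dense embedding models behave differently in detail) to compare the same relevant sentence as a small chunk versus buried inside a larger chunk of unrelated text.

```python
import math
import re
from collections import Counter


def tf_vector(text: str) -> Counter:
    """Bag-of-words term counts: a crude stand-in for a dense embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


query = "Q3 revenue growth"
small_chunk = "Q3 revenue growth was 12 percent year over year."
# The same sentence surrounded by unrelated material, as happens in a large chunk
large_chunk = small_chunk + " " + " ".join([
    "The board approved a new office lease in Austin.",
    "Employee headcount rose after the spring hiring cycle.",
    "Litigation reserves were unchanged from the prior filing.",
])

q = tf_vector(query)
print(f"small chunk: {cosine(q, tf_vector(small_chunk)):.3f}")
print(f"large chunk: {cosine(q, tf_vector(large_chunk)):.3f}")
```

The relevant sentence is identical in both chunks, yet the large chunk scores lower because every unrelated term adds to the vector's norm without contributing to the match.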
The Multi-vector Strategy (Parent-Child)
The Multi-vector Retriever decouples the data we search from the data we send to the LLM. We index small “child” chunks or concise summaries for the search process.
These are linked to a larger “parent” document in a separate store. When a child chunk matches a query, the retriever pulls the entire parent document—perhaps an entire 2,000-word chapter or a full 15-row table—and hands it to the LLM. You get the surgical precision of a small-chunk search combined with the rich context of a full document.
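The decoupling can be sketched without any external services. In the toy index below, keyword overlap stands in for vector similarity (an assumption; a real system embeds the children and queries a vector store), and the class name ParentChildIndex is hypothetical. Only the small children are scored; whole parents are returned, deduplicated when several children point at the same parent.

```python
import re
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class ParentChildIndex:
    """Toy multi-vector index (hypothetical): search small children, return whole parents."""

    parents: dict[str, str] = field(default_factory=dict)          # parent_id -> full text
    children: list[tuple[str, str]] = field(default_factory=list)  # (child_text, parent_id)

    def add(self, parent_id: str, parent_text: str, child_texts: list[str]) -> None:
        self.parents[parent_id] = parent_text
        self.children.extend((c, parent_id) for c in child_texts)

    def search(self, query: str, k: int = 2) -> list[str]:
        q = tokens(query)
        # Only children are scored; parents are never compared to the query
        ranked = sorted(self.children, key=lambda c: len(q & tokens(c[0])), reverse=True)
        results, seen = [], set()
        for _child_text, pid in ranked[:k]:
            if pid not in seen:  # several children may map to the same parent
                seen.add(pid)
                results.append(self.parents[pid])
        return results


index = ParentChildIndex()
revenue_table = "Segment | Q2 | Q3\nCloud | 410 | 455\nHardware | 220 | 198"
index.add("tbl-1", revenue_table, ["Cloud revenue by quarter", "Hardware revenue by quarter"])
index.add("txt-1", "Risk factors: supply chain disruptions could delay shipments.",
          ["Supply chain risk factors"])
hits = index.search("cloud revenue", k=2)
print(hits)
```

Both top-scoring children belong to the revenue table, so the LLM receives that table once, headers included.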
The Trade-offs: Is It Worth the Effort?
This method is powerful, but it isn’t a free lunch. You need to weigh the performance gains against the infrastructure costs.
The Benefits
- Document Integrity: Tables stay whole. The LLM sees the headers, the units, and the footnotes together.
- Semantic Clarity: By indexing summaries of images or tables, you make non-textual data searchable.
- Higher Signal-to-Noise: Searching a 100-word summary is often more accurate than searching a 1,000-word raw text block.
The Costs
- Storage Demands: You are essentially storing data twice—once as embeddings and once as raw text/images.
- Ingestion Latency: Parsing a 100-page PDF using high-fidelity tools can take minutes rather than seconds.
- API Expenses: Generating summaries for 50 tables in a report requires 50 extra LLM calls during the upload phase.
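A back-of-envelope estimate helps size these costs before committing. The sketch below assumes one LLM call per table or image summary, no summary calls for narrative text, and a placeholder per-call price; every number in the usage line is hypothetical, not a measured figure.

```python
from dataclasses import dataclass


@dataclass
class IngestionEstimate:
    extra_llm_calls: int
    summary_cost_usd: float
    stored_items: int


def estimate_ingestion(num_tables: int, num_images: int, num_text_chunks: int,
                       cost_per_summary_usd: float) -> IngestionEstimate:
    """Rough overhead of the multi-vector pattern (assumptions in the lead-in).

    Every item is counted twice in storage: once as a searchable child and
    once as the raw parent.
    """
    calls = num_tables + num_images
    return IngestionEstimate(
        extra_llm_calls=calls,
        summary_cost_usd=round(calls * cost_per_summary_usd, 4),
        stored_items=2 * (num_tables + num_images + num_text_chunks),
    )


# Hypothetical report: 50 tables, 10 charts, 200 text chunks, placeholder $0.01 per call
est = estimate_ingestion(50, 10, 200, cost_per_summary_usd=0.01)
print(est)
```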
The Modern Tech Stack
To build this, you need tools that can actually “see” document structure rather than just reading strings of characters.
- Unstructured.io: This is the gold standard for partitioning PDFs into distinct elements like tables and narrative text.
- LangChain: Provides the MultiVectorRetriever class to manage the link between summaries and raw data.
- ChromaDB or Pinecone: High-performance stores for your vectors.
- Redis: An excellent choice for the docstore (the parent document store), keeping parent documents ready for typically sub-millisecond retrieval.
- GPT-4o or Claude 3.5 Sonnet: These models excel at summarizing complex tables into searchable descriptions.
Step-by-Step Implementation
Let’s build a pipeline that processes a financial report containing both text and complex tables.
1. Extracting Structural Elements
We use unstructured to identify table boundaries. This prevents the “sliced table” problem from the start.
```python
from unstructured.partition.pdf import partition_pdf

# Extract elements with high-fidelity table detection
raw_pdf_elements = partition_pdf(
    filename="q3_report.pdf",
    strategy="hi_res",            # layout-model parsing; table structure inference applies to hi_res
    extract_images_in_pdf=False,
    infer_table_structure=True,   # The secret sauce for tables
    chunking_strategy="by_title",
    max_characters=4000,
)

# Separate tables for specialized processing
tables = [el for el in raw_pdf_elements if el.category == "Table"]
texts = [el for el in raw_pdf_elements if el.category == "CompositeElement"]
```
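One refinement worth considering: with infer_table_structure=True, Unstructured typically attaches an HTML rendering of each table as element.metadata.text_as_html, which keeps rows and columns aligned better than the flattened element.text. The sketch below prefers that HTML for the parent store and falls back to plain text. It uses SimpleNamespace stand-ins (hypothetical data) instead of real parser output so it runs without the library.

```python
from types import SimpleNamespace


def parent_content(element) -> str:
    """Return the HTML table rendering if one is attached, else the plain text."""
    html = getattr(element.metadata, "text_as_html", None)
    return html if html else element.text


# Stand-ins shaped like Unstructured elements (hypothetical data, not real parser output)
demo_table = SimpleNamespace(
    category="Table",
    text="Segment Q3 Cloud 455",
    metadata=SimpleNamespace(
        text_as_html="<table><tr><th>Segment</th><th>Q3</th></tr>"
                     "<tr><td>Cloud</td><td>455</td></tr></table>"
    ),
)
demo_text = SimpleNamespace(
    category="CompositeElement",
    text="Revenue grew on cloud demand.",
    metadata=SimpleNamespace(),  # no HTML rendering for narrative text
)

print(parent_content(demo_table))
print(parent_content(demo_text))
```

In the real pipeline this would mean passing [parent_content(t) for t in tables] instead of [t.text for t in tables] as the raw contents in the add_data call of step 3.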
2. Creating Searchable Summaries
Raw HTML or Markdown tables are often too dense for vector search. Summarizing them creates a “searchable bridge.”
```python
from langchain_openai import ChatOpenAI

model = ChatOpenAI(model="gpt-4o", temperature=0)

table_summaries = []
for table in tables:
    prompt = f"Summarize this table for semantic search. Include key metrics and row/column headers: {table.text}"
    summary = model.invoke(prompt)
    table_summaries.append(summary.content)
```
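The loop above makes one call at a time, which is where much of the ingestion latency comes from. A common mitigation is to issue the summary calls concurrently (LangChain runnables also offer .batch() with a max_concurrency setting in the config). The sketch below uses only the standard library; summarize is a hypothetical callable you would back with your model, and fake_llm is an offline stand-in so the example runs without an API key.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

PROMPT = ("Summarize this table for semantic search. "
          "Include key metrics and row/column headers: {table}")


def summarize_all(table_texts: list[str], summarize: Callable[[str], str],
                  max_workers: int = 8) -> list[str]:
    """Run one summary call per table concurrently; results keep input order."""
    prompts = [PROMPT.format(table=t) for t in table_texts]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `prompts`
        return list(pool.map(summarize, prompts))


def fake_llm(prompt: str) -> str:
    """Offline stand-in for an LLM call: echoes the table text after the last ': '."""
    return "SUMMARY: " + prompt.rsplit(": ", 1)[-1]


summaries = summarize_all(["Cloud | 455", "Hardware | 198"], fake_llm)
print(summaries)
```

Keeping the output order aligned with the input matters here, because each summary must later be paired with its own raw table when the ids are assigned.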
3. Wiring the Multi-vector Retriever
This is where we link the searchable summaries in the vector store to the original raw data in the document store.
```python
import uuid

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Vector store for the searchable summaries (children)
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# Document store for the raw parent content
store = InMemoryStore()
id_key = "doc_id"

retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    docstore=store,
    id_key=id_key,
)


def add_data(summaries, raw_contents):
    if not summaries:  # nothing to index (e.g. a PDF with no tables)
        return
    doc_ids = [str(uuid.uuid4()) for _ in summaries]
    # Add summaries to vector search, each tagged with its parent's id
    summary_docs = [
        Document(page_content=s, metadata={id_key: doc_ids[i]})
        for i, s in enumerate(summaries)
    ]
    retriever.vectorstore.add_documents(summary_docs)
    # Add raw data to the parent store under the same ids
    retriever.docstore.mset(list(zip(doc_ids, raw_contents)))


add_data(table_summaries, [t.text for t in tables])
# For narrative text, the first 1,000 characters act as a cheap searchable child
add_data([t.text[:1000] for t in texts], [t.text for t in texts])
```
4. The Result
When a user asks, “What was our Q3 revenue growth?”, the system finds the summary of the revenue table. It then retrieves the entire raw table for the LLM. The model now has every number it needs to calculate the answer accurately.
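The last step is packing the retrieved parents into the generation prompt. The sketch below is one hypothetical way to do it; the function name and instruction wording are assumptions, and in the pipeline above you would pass the resulting string to model.invoke.

```python
def build_answer_prompt(question: str, parents: list[str]) -> str:
    """Pack retrieved parent documents (whole tables or sections) into one prompt."""
    context = "\n\n".join(f"[Source {i}]\n{p}" for i, p in enumerate(parents, start=1))
    return (
        "Answer using only the sources below. Show the numbers you use.\n\n"
        f"{context}\n\nQuestion: {question}"
    )


# Hypothetical retrieved parent: the full table, headers and units intact
retrieved = ["Segment | Q2 Revenue ($M) | Q3 Revenue ($M)\nCloud | 410 | 455\nHardware | 220 | 198"]
prompt = build_answer_prompt("What was our Q3 revenue growth?", retrieved)
print(prompt)
```

Because the parent is the whole table, the header row with its units travels with the numbers, which is exactly what the sliced-chunk setup lost.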
Extending to Images and Charts
You can apply this exact same pattern to visual data. Use a multimodal model like GPT-4o to write a 200-word description of a line chart. Index that description. Link it to the original image. When a user asks about trends, the description triggers the retrieval, and the LLM receives the actual image to analyze. It’s a robust way to handle data that text-only models simply cannot see.
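At answer time, the image retrieved from the parent store has to be sent alongside the question. The sketch below builds a message body in the OpenAI-style content-part format (a text part plus an image_url part holding a base64 data URL), which LangChain's HumanMessage also accepts for vision models. The store, id, and image bytes are hypothetical placeholders.

```python
import base64


def image_message_content(question: str, description: str, image_bytes: bytes,
                          mime: str = "image/png") -> list[dict]:
    """Pair the indexed description and the user's question with the raw image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"type": "text", "text": f"Chart description: {description}\nQuestion: {question}"},
        {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
    ]


# Parent store keyed by the same id as the indexed description (hypothetical data)
image_store = {"chart-7": b"\x89PNG\r\n\x1a\n"}  # placeholder bytes, not a real chart
content = image_message_content(
    "How did revenue trend from Q1 to Q3?",
    "Line chart of quarterly revenue, Q1 to Q3",
    image_store["chart-7"],
)
print(content[0]["text"])
```

The description only has to be good enough to be found; the model still reasons over the actual pixels.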
Final Thoughts
Moving beyond basic RAG is necessary for any serious enterprise application. Standard chunking is fine for prototypes, but it fails the moment it hits real-world document layouts. While Multi-vector architectures require more orchestration and higher ingestion costs, the payoff is a system that users can actually trust. If your PDFs contain more than just plain paragraphs, this is the architecture you should be building today.

