Building Production RAG Pipelines: LangChain, ChromaDB, and Lessons Learned

Tags: RAG, LangChain, Python, AI

By Pavan Sharma — AI Agent Developer & Full Stack Engineer

Why RAG is Harder Than It Looks

Retrieval-Augmented Generation (RAG) is the architecture behind almost every production LLM application — customer support bots, internal knowledge bases, document Q&A systems. The concept is straightforward: when a user asks a question, retrieve relevant documents, inject them into the prompt as context, and let the LLM generate an answer grounded in your data.

In practice, most RAG systems fail silently. They retrieve the wrong documents, inject too much context, or produce confident answers from irrelevant sources. Here's what I've learned from building real RAG pipelines.

The Core Architecture

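A production RAG pipeline has six stages, covered in the sections below. At query time, a request flows through the following steps:

User Query → Query Processing → Vector Retrieval → Reranking → Context Assembly → LLM Generation

Each step is a failure point, and so are document ingestion and evaluation, which run offline and do not appear in this flow.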

Stage 1: Document Ingestion

This is where most tutorials skip the hard part. Real document collections include:

▸PDFs with scanned pages (need OCR)
▸HTML with navigation menus polluting the content
▸Tables that chunk badly
▸Code blocks that need to stay together

The chunking strategy matters enormously. Fixed-size chunking (split every 512 tokens) is the default in most tutorials but performs poorly on structured documents, because it cuts through tables, code blocks, and sentences. Semantic chunking, which splits on meaningful boundaries like paragraphs, sections, and topic shifts, usually gives better retrieval precision.

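from langchain.text_splitter import RecursiveCharacterTextSplitter

# Tries each separator in order (paragraph, line, sentence, word, character)
# until the pieces fit. chunk_size is measured in characters by default, not tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " ", ""],
)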
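
To make "splitting on meaningful boundaries" concrete, here is a minimal sketch in plain Python that does not depend on LangChain. It packs whole paragraphs into chunks up to a character budget and never cuts inside a paragraph. The function name `chunk_by_paragraph` and the `max_chars` parameter are illustrative assumptions, not part of any library, and a real pipeline would also need special handling for tables and code blocks.

```python
def chunk_by_paragraph(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A paragraph longer than max_chars is kept whole as its own chunk
    rather than being cut mid-sentence.
    """
    # Blank lines are treated as paragraph boundaries (an assumption
    # that holds for plain text and most Markdown).
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Either the paragraph fits, or the chunk is empty and we
            # must accept the paragraph even if it is oversized.
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = "Intro paragraph.\n\nSecond paragraph with details.\n\nThird paragraph."
    for chunk in chunk_by_paragraph(doc, max_chars=50):
        print(repr(chunk))
```

Unlike fixed-size splitting, chunk lengths here vary with the document's structure, which is the trade-off: boundaries are cleaner, but chunk sizes are less predictable.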

Stage 2: Embedding and Vector Storage

I use ChromaDB for local development and Pinecone for production. The embedding model choice matters more than the vector database choice. OpenAI's text-embedding-3-small is a solid default. For domain-specific retrieval, embeddings fine-tuned on your own corpus often outperform general-purpose ones.

One mistake I made early: not normalizing metadata. When your vectors have inconsistent metadata (some documents have source, others have url, others have nothing), filtering queries become unreliable.
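
One way to avoid this is to pass every document's metadata through a single normalization function before it reaches the vector store. The sketch below is a minimal example under stated assumptions: the canonical keys (`source`, `title`, `doc_type`), the alias table, and the `"unknown"` sentinel are all hypothetical choices for illustration, not a schema required by ChromaDB or Pinecone.

```python
from typing import Any, Optional

# Hypothetical canonical schema: every chunk gets exactly these keys.
CANONICAL_KEYS = ("source", "title", "doc_type")
# Hypothetical aliases seen in raw documents, mapped to canonical names.
ALIASES = {"url": "source", "path": "source", "name": "title"}


def normalize_metadata(raw: Optional[dict[str, Any]]) -> dict[str, str]:
    """Return metadata with a fixed set of keys and string values."""
    # Start from sentinel values so no key is ever missing.
    normalized = {key: "unknown" for key in CANONICAL_KEYS}
    for key, value in (raw or {}).items():
        canonical = ALIASES.get(key.lower(), key.lower())
        # Keys outside the schema are dropped; empty values keep the sentinel.
        if canonical in normalized and value not in (None, ""):
            normalized[canonical] = str(value)
    return normalized


if __name__ == "__main__":
    print(normalize_metadata({"url": "https://example.com/pricing"}))
    print(normalize_metadata(None))
```

With a fixed schema, a filter on `source` behaves the same way for every chunk, including those whose original metadata was missing or used a different key.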

Stage 3: Query Processing

Naive RAG sends the raw user query to the retriever. This breaks for:

▸Ambiguous queries ("what did we decide about pricing?")
▸Multi-part questions ("compare X and Y and tell me which is better for Z")
▸Follow-up questions in conversations ("what about the previous version?")

HyDE (Hypothetical Document Embeddings) solves part of this — you ask the LLM to generate a hypothetical answer, embed that, and retrieve against it. The hypothetical answer often matches real document embeddings better than the question does.
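
The HyDE flow can be sketched as three steps: generate, embed, search. The sketch below keeps the LLM, the embedding model, and the vector store as plain callables so it stays independent of any particular SDK. The function name `hyde_retrieve`, its parameters, and the prompt wording are assumptions for illustration; in a real pipeline you would pass in wrappers around your own LLM client, embedding model, and vector store query.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def hyde_retrieve(
    question: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], list[str]],
    k: int = 5,
) -> list[str]:
    """Retrieve using the embedding of a hypothetical answer, not the question."""
    # Step 1: ask the LLM for a plausible answer. It may be factually
    # wrong; it only needs to look like the documents we want to find.
    prompt = f"Write a short passage that plausibly answers this question:\n{question}"
    hypothetical_answer = generate(prompt)
    # Step 2: embed the hypothetical answer instead of the raw question.
    query_vector = embed(hypothetical_answer)
    # Step 3: run the usual nearest-neighbor search with that vector.
    return search(query_vector, k)
```

Because the hypothetical answer is only used for retrieval, it should never be shown to the user or included in the final prompt as if it were a source.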

Stage 4: Reranking

Vector similarity is not the same as relevance. A cross-encoder reranker (I use Cohere's rerank endpoint) takes your top-K retrieved chunks and re-scores them with full attention to both query and document. The top 3-5 after reranking are substantially more relevant than top-K by vector similarity alone.
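
The retrieve-then-rerank step has a simple shape, sketched below with the scoring function passed in as a parameter. In production that function would call a cross-encoder or a hosted rerank endpoint. The `word_overlap` scorer here is only a toy stand-in so the example runs without a model; it is not a cross-encoder, and the names `rerank` and `word_overlap` are illustrative assumptions.

```python
from typing import Callable


def rerank(
    query: str,
    chunks: list[str],
    score: Callable[[str, str], float],
    top_n: int = 3,
) -> list[tuple[str, float]]:
    """Re-score retrieved chunks against the query and keep the top_n best."""
    scored = [(chunk, score(query, chunk)) for chunk in chunks]
    # Highest score first; Python's sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def word_overlap(query: str, chunk: str) -> float:
    """Toy scorer: fraction of query words found in the chunk. NOT a cross-encoder."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)


if __name__ == "__main__":
    retrieved = [
        "pricing tiers for enterprise",
        "office party photos",
        "enterprise pricing decision notes",
    ]
    for chunk, value in rerank("enterprise pricing decision", retrieved, word_overlap, top_n=2):
        print(f"{value:.2f}  {chunk}")
```

A common pattern is to retrieve a generous top-K by vector similarity (for example 20 to 50 chunks) and let the reranker cut that down, since the reranker can only promote chunks that retrieval returned in the first place.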

Stage 5: Context Assembly and the Lost-in-the-Middle Problem

Research (Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts") has shown that LLMs have a "lost in the middle" problem: information in the middle of a long context window is used less reliably than information at the beginning or end. For RAG this means:

▸Put the most relevant chunks at the start and end of the context, with weaker ones in the middle
▸As a rule of thumb, keep total context under 2,000 tokens for reliability
▸Add a brief summary line before each chunk indicating its source
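
The three rules above can be sketched as a small assembly function. This is a minimal example under stated assumptions: chunks arrive as dicts with hypothetical `source` and `text` keys, already sorted best-first by the reranker, and tokens are approximated by whitespace-separated words rather than a real tokenizer.

```python
def order_for_context(ranked: list[dict]) -> list[dict]:
    """Place the strongest chunks at the edges and the weakest in the middle.

    `ranked` is sorted best-first. Items at even positions fill the front;
    items at odd positions fill the back in reverse order.
    """
    front = ranked[0::2]
    back = ranked[1::2][::-1]
    return front + back


def assemble_context(ranked: list[dict], max_tokens: int = 2000) -> str:
    """Build the prompt context from best-first chunks within a rough token budget."""
    kept, used = [], 0
    # Apply the budget before reordering so the weakest chunks are dropped first.
    for chunk in ranked:
        cost = len(chunk["text"].split())  # crude whitespace token estimate
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    # Prefix each chunk with a one-line source label.
    parts = [f"[Source: {c['source']}]\n{c['text']}" for c in order_for_context(kept)]
    return "\n\n".join(parts)


if __name__ == "__main__":
    ranked = [{"source": f"doc{i}.md", "text": f"chunk {i}"} for i in range(1, 6)]
    print(assemble_context(ranked))
```

With five chunks ranked 1 to 5, the resulting order is 1, 3, 5, 4, 2: the best chunk opens the context, the second best closes it, and the weakest sits in the middle.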

Stage 6: Evaluating RAG Quality

This is the most neglected part. Use RAGAs metrics:

▸Context Precision: are the retrieved chunks actually relevant?
▸Context Recall: did retrieval capture all necessary information?
▸Answer Faithfulness: is every claim in the generated answer supported by the retrieved context?
▸Answer Relevancy: does the answer address the question?

Building an evaluation harness before optimizing your pipeline is essential — otherwise you don't know if your changes are helping.
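
A harness does not have to be elaborate to be useful. The sketch below is a simplified, label-based stand-in for the two retrieval metrics: it compares retrieved chunk IDs against hand-labeled relevant IDs. It is not how the RAGAs library computes its metrics (those typically rely on an LLM as judge), and the test-case keys `question` and `relevant_ids` are hypothetical names chosen for this example.

```python
from typing import Callable


def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunk IDs that are labeled relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for cid in retrieved if cid in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of labeled-relevant chunk IDs that were actually retrieved."""
    if not relevant:
        return 1.0  # nothing was required, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


def evaluate(cases: list[dict], retrieve: Callable[[str], list[str]]) -> dict[str, float]:
    """Average both metrics over a labeled test set."""
    precisions, recalls = [], []
    for case in cases:
        retrieved = retrieve(case["question"])
        precisions.append(context_precision(retrieved, case["relevant_ids"]))
        recalls.append(context_recall(retrieved, case["relevant_ids"]))
    n = max(len(cases), 1)
    return {
        "context_precision": sum(precisions) / n,
        "context_recall": sum(recalls) / n,
    }
```

Running `evaluate` on the same labeled cases before and after a pipeline change gives a direct comparison, which is the point of building the harness first. Faithfulness and answer relevancy need a judge over the generated text and are not covered by this sketch.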

Final Takeaway

Good RAG is not about finding the best vector database. It's about clean ingestion, smart chunking, query expansion, reranking, and continuous evaluation. In my experience, most RAG failures happen in stages 1 and 2, not in the LLM.
