Part 5 | RAG — Giving Claude a Memory | Chapter 17

Chunking, Embeddings, and the Ranking Pipeline

Your retrieval system is only as good as how you split your documents, how you represent them numerically, and how ruthlessly you filter the results.
Reading time: 12 mins

Most teams that struggle with RAG quality blame the model. They try switching from Sonnet to Opus, tweak the temperature, rewrite the system prompt for the fifteenth time. None of it helps — because the real problem happened three steps earlier, when they split a 40-page document into 500-character blocks with no regard for paragraph boundaries, embedded those fragments with a generic model, and retrieved the top five by cosine similarity without any reranking. They handed Claude a briefing packet full of torn sentences and half-thoughts, then blamed Claude for the incoherent answer.

The retrieval pipeline has three layers that each demand deliberate engineering: chunking (how you split documents into searchable units), embeddings (how you convert text into numerical representations for similarity search), and ranking (how you filter the initial search results down to the documents that actually answer the question). Get any one of these wrong and your RAG system underperforms. Get all three right and you have something that consistently finds the needle in the haystack.

Chunking: Where Most Pipelines Die

Chunking is the decision about where to draw the boundaries in your source documents. It sounds trivial — just split on paragraphs, right? But the chunk size and splitting strategy have more impact on retrieval quality than your choice of embedding model or vector database.

I've used five chunking strategies in production systems, and each serves a different use case.

Fixed-size chunking splits text at rigid character intervals — every 500 characters, every 1,000 characters, regardless of content. It's fast and deterministic. It's also destructive: sentences get cut mid-thought, paragraphs get split across chunks, and the resulting fragments lose coherence.

def fixed_size_chunks(text: str, chunk_size: int = 500) -> list[dict]:
    """Split text into fixed-size character blocks. Fast but destructive."""
    chunks = []
    for i in range(0, len(text), chunk_size):
        chunks.append({
            "index": len(chunks),
            "text": text[i:i + chunk_size]
        })
    return chunks

I use fixed-size chunking only for preprocessing steps where I need uniform block sizes — embedding benchmarks, token counting, or feeding into a secondary splitting stage. Never as the final chunking strategy for retrieval.

Recursive chunking respects document structure. It splits on paragraphs first, then falls back to sentences when a paragraph exceeds the size limit. This preserves natural language boundaries while enforcing a maximum chunk size.

import re

def recursive_chunks(text: str, max_size: int = 800) -> list[dict]:
    """Split on paragraphs, then sentences if a paragraph is too large.

    max_size is measured in characters, not tokens.
    """
    paragraphs = text.split("\n\n")
    chunks = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_size:
            chunks.append({"index": len(chunks), "text": para})
        else:
            # Fall back to sentence splitting on ., !, or ? followed by whitespace.
            # A single sentence longer than max_size is kept whole as its own chunk.
            sentences = re.split(r"(?<=[.!?])\s+", para)
            current = ""
            for sentence in sentences:
                if len(current) + len(sentence) + 1 > max_size:
                    if current:
                        chunks.append({"index": len(chunks), "text": current.strip()})
                    current = sentence
                else:
                    current = f"{current} {sentence}" if current else sentence
            if current:
                chunks.append({"index": len(chunks), "text": current.strip()})
    return chunks

This is my default for articles, documentation, and blog posts: any text where paragraph boundaries carry meaning. In my experience it handles roughly 80% of real-world documents well enough.

Document-based chunking uses explicit markers in the source material — page breaks, section headers, or separator characters — as chunk boundaries. Each page or section becomes one chunk. This is the right choice for PDFs, scanned documents, or any content where the original page structure matters for context.
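
A minimal sketch of document-based chunking, assuming the source is Markdown-style text where lines starting with `#` mark section boundaries. The function name `document_chunks` and the header pattern are illustrative assumptions; for paginated text you could split on a page-break character in the same way.

```python
import re

def document_chunks(text: str) -> list[dict]:
    """Split on Markdown-style headers; each section becomes one chunk."""
    # Zero-width split: break right before any line that starts with 1-6 '#' characters.
    sections = re.split(r"^(?=#{1,6}\s)", text, flags=re.MULTILINE)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        # The first line is the title when the section opens with a header.
        first_line = section.splitlines()[0]
        title = first_line.lstrip("#").strip() if first_line.startswith("#") else ""
        chunks.append({"index": len(chunks), "title": title, "text": section})
    return chunks
```

Text that appears before the first header is kept as its own untitled chunk, so nothing is silently dropped.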

Framework · The Chunk Coherence Test · CCT

Read each chunk in isolation, without seeing the surrounding text. If the chunk makes sense on its own — if someone could read it and understand the point being made — it's a good chunk. If it starts mid-sentence or ends without completing a thought, your chunking strategy is broken and no amount of embedding sophistication will save it.
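
The test above is a manual read-through, but a crude automated pass can tell you which chunks to read first. The sketch below is only a heuristic proxy, not the test itself: it assumes a coherent chunk starts like the start of a sentence (or a header or list item) and ends with terminal punctuation. Both function names are hypothetical.

```python
def looks_coherent(chunk_text: str) -> bool:
    """Rough proxy: the chunk starts cleanly and ends with terminal punctuation."""
    text = chunk_text.strip()
    if not text:
        return False
    # Accept capital letters, digits, and common opening marks (headers, quotes, bullets).
    starts_cleanly = text[0].isupper() or text[0].isdigit() or text[0] in "#\"'([-*"
    ends_cleanly = text[-1] in ".!?:\"')]"
    return starts_cleanly and ends_cleanly

def flag_suspect_chunks(chunks: list[dict]) -> list[int]:
    """Return the indexes of chunks worth reading by hand first."""
    return [c["index"] for c in chunks if not looks_coherent(c["text"])]
```

A chunk that passes this check can still be incoherent, so treat the flagged list as a starting point for the manual read, not a replacement for it.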

Semantic chunking hands the splitting decision to an LLM. You send the full document to Claude with instructions to identify meaningful topic boundaries and return structured JSON with each section labeled.

import json
from anthropic import Anthropic

client = Anthropic()

def semantic_chunks(text: str) -> list[dict]:
    """Use Claude to identify meaningful content boundaries."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Split the following document into meaningful sections based on "
                "topic shifts and conceptual boundaries. Return a JSON array where "
                "each element has 'index', 'title', and 'text' fields.\n\n"
                "Return ONLY the JSON array, no other text.\n\n"
                f"Document:\n{text}"
            ),
        }],
    )
    return json.loads(response.content[0].text)

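Semantic chunking produces the highest-quality boundaries but costs an API call per document and doesn't scale to millions of pages. Because the model has to echo the full text back, a document longer than the max_tokens output limit comes back truncated and fails to parse, so split very long documents before sending them. I use it for knowledge bases under a few hundred documents where retrieval precision matters more than processing cost.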
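
The `json.loads` call in `semantic_chunks` assumes the reply is bare JSON. Models sometimes wrap JSON in a Markdown code fence despite instructions, so a defensive parser is worth having. The sketch below is one way to do it; `parse_chunk_json` is a hypothetical helper name, and renumbering the `index` field locally is a design choice of this sketch, not something the original code does.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence marker, built here to keep this listing readable

def parse_chunk_json(raw: str) -> list[dict]:
    """Parse a model reply that should contain a JSON array of chunks."""
    cleaned = raw.strip()
    # Drop a leading fence (optionally tagged "json") and a trailing fence if present.
    cleaned = re.sub(r"^" + FENCE + r"(?:json)?\s*", "", cleaned)
    cleaned = re.sub(r"\s*" + FENCE + r"$", "", cleaned)
    data = json.loads(cleaned)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of chunks")
    chunks = []
    for item in data:
        if not isinstance(item, dict) or "text" not in item:
            raise ValueError("each chunk needs a 'text' field")
        chunks.append({
            "index": len(chunks),  # renumber locally instead of trusting the model
            "title": item.get("title", ""),
            "text": item["text"],
        })
    return chunks
```

With this in place, the last line of `semantic_chunks` could become `return parse_chunk_json(response.content[0].text)`.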

Agentic chunking goes further: Claude not only splits the document but generates metadata for each chunk — a title, a summary, and keywords. This metadata becomes searchable itself, dramatically improving retrieval when the user's question doesn't match the chunk's literal text but does match its topic.
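
One simple way to make that metadata searchable, sketched under the assumption that each chunk dict carries optional `title`, `summary`, and `keywords` fields, is to fold it into the string you embed or index. The helper name `searchable_text` is hypothetical, and storing metadata in separate indexed fields is an equally valid design.

```python
def searchable_text(chunk: dict) -> str:
    """Combine metadata and body into one string for embedding or BM25 indexing."""
    parts = [
        chunk.get("title", ""),
        chunk.get("summary", ""),
        " ".join(chunk.get("keywords", [])),
        chunk["text"],
    ]
    # Skip empty fields so chunks without metadata index exactly as before.
    return "\n".join(part for part in parts if part)
```

You would then index `searchable_text(chunk)` in place of `chunk["text"]`, while still passing only the original text to Claude.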

Chunk size sweet spot

For most retrieval tasks, I target chunks between 200 and 800 tokens. Below 200 and you lose context — the chunk is too small to carry a complete idea. Above 800 and you get noise pollution — the chunk contains too many ideas and matches too many queries. Start at 500 tokens and adjust based on your retrieval precision metrics.
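
Note that the chunking functions earlier in this chapter measure size in characters, while this guideline is in tokens. The sketch below bridges the two with a very rough rule of thumb of about four characters per token for English prose. That ratio is an assumption that varies by tokenizer and language, so use a real tokenizer when precision matters. Both function names are hypothetical.

```python
def estimate_tokens(text: str) -> int:
    """Very rough token estimate: about 4 characters per token for English prose."""
    return max(1, round(len(text) / 4))

def size_report(chunks: list[dict], low: int = 200, high: int = 800) -> dict:
    """Count chunks below, inside, and above the target token range."""
    report = {"too_small": 0, "in_range": 0, "too_large": 0}
    for chunk in chunks:
        tokens = estimate_tokens(chunk["text"])
        if tokens < low:
            report["too_small"] += 1
        elif tokens > high:
            report["too_large"] += 1
        else:
            report["in_range"] += 1
    return report
```

Under this estimate, a `max_size` of 800 characters lands at roughly 200 tokens, the bottom of the range, so a 500-token target corresponds to about 2,000 characters.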

Embeddings: Turning Text into Searchable Vectors

Once you have clean chunks, you need to make them searchable. Keyword matching works for exact terms but fails completely when the user phrases their question differently from how the document states the answer. If your knowledge base says "refund policy" and the user asks "how do I get my money back?", keyword search returns nothing.

Embeddings solve this by converting text into high-dimensional numerical vectors — arrays of hundreds or thousands of floating-point numbers — where semantically similar texts end up close together in vector space. "Refund policy" and "get my money back" map to nearby vectors even though they share no words.

import numpy as np
import voyageai

voyage_client = voyageai.Client()

def embed_chunks(chunks: list[dict]) -> list[list[float]]:
    """Generate embeddings for a list of text chunks."""
    texts = [chunk["text"] for chunk in chunks]
    result = voyage_client.embed(texts, model="voyage-3", input_type="document")
    return result.embeddings

def embed_query(query: str) -> list[float]:
    """Generate a query-optimized embedding."""
    result = voyage_client.embed([query], model="voyage-3", input_type="query")
    return result.embeddings[0]

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a, b = np.array(vec_a), np.array(vec_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

The key detail most tutorials skip: the embedding model distinguishes between document embeddings and query embeddings. When you embed your chunks for storage, you set input_type="document". When you embed the user's question at query time, you set input_type="query". The model optimizes the vector representation differently for each case — document embeddings capture comprehensive meaning, while query embeddings are optimized for matching against those stored representations. Mixing them up degrades retrieval quality silently.

Key takeaway

Embedding models that take an input type, Voyage's included, produce different vectors for the same text depending on whether it's being stored as a document or used as a search query. Always use the correct input type: mixing them up is a common silent killer of retrieval precision.

To find the most relevant chunk for a given question, you compute the cosine similarity between the query embedding and every chunk embedding, then sort by score:

def search_chunks(query: str, chunks: list[dict], embeddings: list[list[float]], top_k: int = 5) -> list[tuple[dict, float]]:
    """Return the top_k chunks most similar to the query."""
    query_embedding = embed_query(query)
    scored = []
    for chunk, embedding in zip(chunks, embeddings):
        score = cosine_similarity(query_embedding, embedding)
        scored.append((chunk, score))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:top_k]

Cosine similarity returns a value between -1 and 1. In practice, relevant chunks typically score above 0.7 and irrelevant ones fall below 0.4, but these thresholds vary by embedding model and domain. Don't set hard cutoffs until you've calibrated against your actual data.
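
Calibrating a cutoff can be as simple as labeling a sample of retrieved chunks as relevant or not and picking the threshold that scores best on them. The sketch below maximizes F1 over hand-labeled `(score, is_relevant)` pairs. The function name and the choice of F1 as the metric are assumptions; pick whichever metric matches your tolerance for missed versus irrelevant chunks.

```python
def best_threshold(scored: list[tuple[float, bool]]) -> tuple[float, float]:
    """Return the (threshold, f1) pair that maximizes F1 over labeled scores."""
    total_relevant = sum(1 for _, relevant in scored if relevant)
    best = (0.0, 0.0)
    # Try every observed score as a candidate cutoff.
    for threshold in sorted({score for score, _ in scored}):
        kept = [relevant for score, relevant in scored if score >= threshold]
        true_pos = sum(kept)
        if true_pos == 0:
            continue
        precision = true_pos / len(kept)
        recall = true_pos / total_relevant
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (threshold, f1)
    return best
```

Feed it the scores returned by `search_chunks` for a set of questions whose correct chunks you already know, and rerun it whenever you change the embedding model or the chunking strategy.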

Choosing an Embedding Model

The embedding model you choose determines the quality ceiling of your semantic search. Three factors matter:

Dimensionality. Higher-dimensional embeddings capture more nuance but cost more to store and compare. Voyage-3 uses 1,024 dimensions. OpenAI's text-embedding-3-large goes up to 3,072. For most knowledge bases, 1,024 dimensions is the sweet spot — enough precision for production retrieval without the storage overhead of larger models.

Domain relevance. General-purpose embedding models work well for general-purpose content. If your knowledge base is highly specialized — medical literature, legal filings, source code — a domain-specific embedding model will outperform a generic one. Test your retrieval precision with at least two models before committing to one.

Speed and cost. Embedding every chunk in your knowledge base is a one-time cost at ingest. Embedding every user query is an ongoing cost at query time. Choose a model where query-time embedding is fast enough for your latency requirements — typically under 100ms per query.

Embedding model migration

If you switch embedding models after building your index, you must re-embed every chunk. Embeddings from different models live in different vector spaces — you cannot compare a Voyage-3 query embedding against an OpenAI document embedding. Plan your model choice early and treat it as a commitment.
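
A cheap safeguard is to record which model built the index and refuse queries embedded with anything else. This is a minimal in-memory sketch; the `EmbeddingIndex` class and its method names are hypothetical, and a real system would persist the model name alongside the vectors.

```python
from dataclasses import dataclass, field

@dataclass
class EmbeddingIndex:
    """Vectors plus the name and dimensionality of the model that produced them."""
    model: str
    dimensions: int
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, vector: list[float]) -> None:
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")
        self.vectors.append(vector)

    def check_query(self, model: str, vector: list[float]) -> None:
        """Raise if a query embedding cannot be compared against this index."""
        if model != self.model:
            raise ValueError(f"index built with {self.model}, query embedded with {model}")
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")
```

The dimension check alone is not enough, because two different models can share a dimensionality and still live in unrelated vector spaces.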

BM25: The Keyword Layer That Refuses to Die

Before you dismiss keyword search entirely in favor of embeddings, know this: BM25, the ranking algorithm behind most traditional search engines, still outperforms pure semantic search on exact-match queries. When the user searches for "error code E-4012" or a specific product SKU, embeddings struggle because those strings have no semantic content — they're arbitrary identifiers. BM25 finds them instantly because it matches the literal tokens.

from rank_bm25 import BM25Okapi
import re

def tokenize(text: str) -> list[str]:
    """Clean and tokenize text for BM25 indexing."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", "", text)
    return text.split()

def build_bm25_index(chunks: list[dict]) -> BM25Okapi:
    """Build a BM25 index over chunk texts."""
    tokenized = [tokenize(chunk["text"]) for chunk in chunks]
    return BM25Okapi(tokenized)

def bm25_search(index: BM25Okapi, chunks: list[dict], query: str, top_k: int = 5) -> list[tuple[dict, float]]:
    """Retrieve top_k chunks by BM25 score."""
    query_tokens = tokenize(query)
    scores = index.get_scores(query_tokens)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

The pattern I've settled on in production is hybrid retrieval: run both BM25 and semantic search in parallel, then merge the results. BM25 catches the exact-match cases that embeddings miss, and embeddings catch the paraphrase cases that BM25 misses. The merged list goes to the reranker, which decides the final ordering.

✕ Semantic search only
  • Handles paraphrases and synonyms
  • Misses exact identifiers and codes
  • Requires embedding infrastructure
  • Struggles with short, precise queries
✓ Hybrid (semantic + BM25)
  • Catches both meaning and exact matches
  • Covers the full query spectrum
  • More resilient to edge cases
  • Slightly more complex but dramatically more accurate
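
The merge step is left open above. One common option, swapped in here as an illustration and not necessarily what the production system described uses, is reciprocal rank fusion: each retriever contributes 1 / (k + rank) for every chunk it returns, and chunks are ordered by the summed score. It needs only ranks, so BM25 scores and cosine similarities never have to share a scale. The constant k = 60 is a commonly used default.

```python
def reciprocal_rank_fusion(
    result_lists: list[list[int]], k: int = 60, top_k: int = 10
) -> list[int]:
    """Merge ranked lists of chunk indexes; a higher fused score ranks first."""
    fused: dict[int, float] = {}
    for results in result_lists:
        for rank, chunk_index in enumerate(results, start=1):
            fused[chunk_index] = fused.get(chunk_index, 0.0) + 1.0 / (k + rank)
    ranked = sorted(fused, key=lambda idx: fused[idx], reverse=True)
    return ranked[:top_k]
```

To use it with the functions above, pass the chunk indexes from each retriever, for example `[c["index"] for c, _ in search_chunks(...)]` and `[c["index"] for c, _ in bm25_search(...)]`. Chunks found by both retrievers rise to the top.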

Reranking: The Precision Layer

Here's the problem with retrieval: it's designed to cast a wide net. Both BM25 and semantic search return candidates — chunks that might be relevant. But "might be relevant" is not good enough when you're assembling a briefing packet for Claude. Every irrelevant chunk in that packet dilutes the signal and wastes tokens.

Reranking is the precision layer. After your retrievers produce a candidate list, a reranker scores each candidate against the original question and reorders them by actual usefulness — not just surface similarity.

The most effective reranking approach I've found uses Claude itself as the reranker. You send Claude the question, the candidate documents, and strict instructions to score and reorder:

import json

def rerank_with_claude(question: str, candidates: list[dict], top_k: int = 3) -> list[dict]:
    """Use Claude to rerank candidate chunks by relevance to the question."""
    docs_text = "\n\n".join(
        f"[Document {i+1}]: {c['text'][:500]}"
        for i, c in enumerate(candidates)
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Given this question: {question}\n\n"
                f"Rank these documents by how useful each one is for answering "
                f"the question. Return a JSON array of objects with 'doc_index' "
                f"(1-based), 'score' (0-1), and 'reason' fields.\n\n"
                f"Return ONLY valid JSON.\n\n{docs_text}"
            ),
        }],
    )
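    rankings = json.loads(response.content[0].text)
    # Drop entries whose doc_index is missing or outside the candidate list,
    # so a malformed ranking cannot raise an IndexError.
    valid = [
        r for r in rankings
        if isinstance(r.get("doc_index"), int) and 1 <= r["doc_index"] <= len(candidates)
    ]
    valid.sort(key=lambda r: r.get("score", 0), reverse=True)
    return [candidates[r["doc_index"] - 1] for r in valid[:top_k]]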

Semantic search gets you close. Reranking gives you accuracy.

You can also use dedicated reranking models — Cohere's reranker, Voyage's reranker, or a cross-encoder from the sentence-transformers library. These are faster and cheaper than an LLM call but less flexible. For most production systems, I start with a lightweight model reranker for latency-sensitive paths and reserve Claude-based reranking for high-stakes queries where precision justifies the extra API call.

When to skip reranking

If your knowledge base is small (under 50 chunks) and your semantic search consistently returns the right documents in the top three results, reranking adds latency without improving quality. Measure first. Add complexity only when the metrics justify it.

The Full Retrieval Stack

Putting it all together, a production retrieval pipeline flows through four stages:

  1. Chunk the source documents using recursive or semantic splitting, targeting 200-800 token chunks that pass the Chunk Coherence Test.
  2. Embed every chunk and store the vectors in a database (or in memory for smaller datasets). Also build a BM25 index over the same chunks.
  3. Retrieve candidates from both the semantic index and the BM25 index, merging the top results from each.
  4. Rerank the merged candidates to produce the final context — the three to five chunks most likely to contain the answer.
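
The query-time half of these stages (retrieve, merge, rerank) can be sketched as one small function. To keep the example self-contained, the retrievers and the reranker are passed in as plain callables. The names `build_context`, `Retriever`, and `Reranker` are hypothetical, and the merge here is a simple de-duplication by chunk index.

```python
from typing import Callable

Retriever = Callable[[str], list[dict]]
Reranker = Callable[[str, list[dict]], list[dict]]

def build_context(
    question: str, retrievers: list[Retriever], rerank: Reranker, final_k: int = 3
) -> str:
    """Retrieve from every source, de-duplicate, rerank, and format the context block."""
    seen: set[int] = set()
    candidates: list[dict] = []
    for retrieve in retrievers:
        for chunk in retrieve(question):
            if chunk["index"] not in seen:  # keep the first copy of each chunk
                seen.add(chunk["index"])
                candidates.append(chunk)
    top = rerank(question, candidates)[:final_k]
    return "\n\n".join(f"[Chunk {c['index']}]\n{c['text']}" for c in top)
```

In practice the retrievers would wrap `search_chunks` and `bm25_search`, and the reranker would wrap `rerank_with_claude` or a dedicated reranking model.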

The reranked chunks become the context block in your prompt to Claude. This is the briefing packet. Everything upstream of this point exists to make this packet as precise and relevant as possible.

Framework · The Three-Stage Filter · TSF

Think of retrieval as a funnel with three progressively tighter filters. Stage one (chunking) determines what units exist to search. Stage two (retrieval) casts a wide net to find candidates. Stage three (reranking) narrows the candidates to the documents that actually answer the question. Skipping any stage means the next one has to compensate — and it usually can't.

Storing Your Chunks

For production systems, you need persistent storage for both the chunks and their embeddings. SQLite works well for prototypes and single-machine deployments:

import sqlite3

def init_db(db_path: str = "chunks.db"):
    """Initialize the chunks database."""
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            doc_id TEXT NOT NULL,
            chunk_index INTEGER NOT NULL,
            title TEXT,
            summary TEXT,
            keywords TEXT,
            text TEXT NOT NULL
        )
    """)
    conn.commit()
    conn.close()

def save_chunks(doc_id: str, chunks: list[dict], db_path: str = "chunks.db"):
    """Persist chunks to SQLite."""
    conn = sqlite3.connect(db_path)
    for chunk in chunks:
        keywords = ",".join(chunk.get("keywords", []))
        conn.execute(
            "INSERT INTO chunks (doc_id, chunk_index, title, summary, keywords, text) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (doc_id, chunk["index"], chunk.get("title", ""),
             chunk.get("summary", ""), keywords, chunk["text"])
        )
    conn.commit()
    conn.close()

When you scale beyond a single machine or need sub-second retrieval over millions of vectors, you move to a dedicated vector database — Pinecone, Qdrant, Weaviate, or pgvector if you're already running PostgreSQL. The chunking and embedding logic stays the same; only the storage and search layer changes.
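
The schema above stores chunk text and metadata but not the vectors themselves. Here is a minimal sketch of keeping embeddings in the same SQLite file, assuming each vector is packed as 32-bit floats into a BLOB column. The table name `embeddings` and the three helper functions are hypothetical. Packing as float32 halves storage relative to Python's native 64-bit floats at the cost of a small loss of precision.

```python
import sqlite3
from array import array

def init_embedding_table(conn: sqlite3.Connection) -> None:
    """Create a table keyed by chunk id, recording which model made each vector."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS embeddings (
            chunk_id INTEGER PRIMARY KEY,
            model TEXT NOT NULL,
            vector BLOB NOT NULL
        )
    """)

def save_embedding(conn: sqlite3.Connection, chunk_id: int, model: str, vector: list[float]) -> None:
    """Pack the vector as 32-bit floats and upsert it."""
    blob = array("f", vector).tobytes()
    conn.execute(
        "INSERT OR REPLACE INTO embeddings (chunk_id, model, vector) VALUES (?, ?, ?)",
        (chunk_id, model, blob),
    )

def load_embedding(conn: sqlite3.Connection, chunk_id: int) -> list[float]:
    """Read a vector back as a list of floats."""
    row = conn.execute(
        "SELECT vector FROM embeddings WHERE chunk_id = ?", (chunk_id,)
    ).fetchone()
    if row is None:
        raise KeyError(chunk_id)
    vec = array("f")
    vec.frombytes(row[0])
    return vec.tolist()
```

Storing the model name next to each vector makes a later model migration easier to detect and to run incrementally.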

What to Do Monday Morning

Chunk a real document three ways and compare

Take one document from your knowledge base. Run it through fixed-size, recursive, and semantic chunking. Read the output chunks from each strategy. Which chunks pass the Chunk Coherence Test? Which strategy produces fragments that make sense in isolation? That's your chunking strategy.

Set up a Voyage AI account and embed your chunks

Create a free Voyage AI account, get an API key, and embed your chunks from the recursive strategy. Store the embeddings alongside the chunks in SQLite. Run five questions against the embedded chunks using cosine similarity and compare the results to your keyword baseline from Chapter 16.

Add BM25 alongside semantic search

Install the rank-bm25 package. Build a BM25 index over the same chunks. For each of your ten golden questions, compare which retriever finds the right answer: BM25 alone, semantic alone, or the merged list. Record where each one wins and where each one fails.

Try Claude-based reranking on your worst queries

Take the three queries where retrieval performed worst. Run the merged candidate list through Claude-based reranking. Did the correct chunk move up in the rankings? If yes, reranking belongs in your pipeline. If no, the problem is upstream in chunking or embedding — go back to Step 1.

Persist your chunks with metadata

Extend your SQLite storage to include title, summary, and keyword fields. Run agentic chunking on your most important documents to generate rich metadata. Compare retrieval quality with and without metadata-enhanced chunks.

The ranking pipeline is where retrieval quality is won or lost. Chunking determines what exists to be found. Embeddings determine how it's found. Reranking determines whether the right thing actually makes it into Claude's context window.