Introduction
RAG is the part of the cookbook I was most interested in, mostly because it's where "the model knows things" stops being a convincing story. In practice, RAG is a retrieval engineering problem with an LLM stapled on the end—and the cookbook does a good job showing the moving pieces without pretending it's magic.
What I've learned the hard way: you don't fix weak retrieval by stuffing more context into the prompt. You fix it with better chunking, better ranking, and tighter instructions about what the model is allowed to claim.
1. Basic RAG Pipeline
Location: capabilities/retrieval_augmented_generation/guide.ipynb
Document Processing
```python
from anthropic import Anthropic
import voyageai

anthropic = Anthropic()
voyage = voyageai.Client()


def chunk_document(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap  # step back by `overlap` so context spans chunk boundaries
    return chunks
```
Chunking is one of those knobs that looks boring and then dominates everything. I usually start close to this (size ~800–1500 chars, overlap ~10–25%), then tune based on the following (a paragraph-aware variant is sketched after this list):
- whether answers require cross-paragraph context,
- whether I'm retrieving code/specs (often needs smaller chunks),
- cost/latency constraints.
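When answers keep needing cross-paragraph context, I sometimes swap the fixed window for paragraph packing. This is my own variant, not something from the cookbook; `chunk_by_paragraph` and its `max_chars` budget are names I made up for the sketch.

```python
def chunk_by_paragraph(text: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks instead of cutting at a fixed offset."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk once adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```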
Embedding Generation
```python
def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # Documents and queries use different input_type values in Voyage.
    result = voyage.embed(
        texts=chunks,
        model="voyage-3",
        input_type="document",
    )
    return result.embeddings


def embed_query(query: str) -> list[float]:
    result = voyage.embed(
        texts=[query],
        model="voyage-3",
        input_type="query",
    )
    return result.embeddings[0]
```
I like the separation between document and query embedding types here; it's easy to accidentally mix them and silently degrade retrieval.
Similarity Search
```python
import numpy as np


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieve_relevant(query: str, chunks: list[str],
                      embeddings: list[list[float]], top_k: int = 5) -> list[str]:
    query_embedding = embed_query(query)
    similarities = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(embeddings)
    ]
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [chunks[i] for i, _ in similarities[:top_k]]
```
In my experience, top_k is rarely "the answer." What matters is: do you have a reranker (next section), and do you have guardrails when retrieved chunks conflict?
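One cheap guardrail I like (my addition, not the cookbook's): refuse to hand the model anything when even the best chunk is a weak match. `retrieve_with_guardrail` and the `min_score` cutoff below are illustrative; the threshold has to be tuned per embedding model and corpus.

```python
def retrieve_with_guardrail(query: str, chunks: list[str],
                            embeddings: list[list[float]],
                            top_k: int = 5, min_score: float = 0.3) -> list[str]:
    query_embedding = embed_query(query)
    scored = sorted(
        ((cosine_similarity(query_embedding, emb), i) for i, emb in enumerate(embeddings)),
        reverse=True,
    )
    # If nothing clears the bar, return nothing and let the caller say "not found"
    # instead of letting the model improvise from weak context.
    return [chunks[i] for score, i in scored[:top_k] if score >= min_score]
```

An empty result is a feature here: it's the signal to tell the user the corpus doesn't cover the question.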
Context Injection
```python
def rag_query(query: str, chunks: list[str], embeddings: list[list[float]]) -> str:
    relevant_chunks = retrieve_relevant(query, chunks, embeddings)
    context = "\n\n---\n\n".join(relevant_chunks)
    response = anthropic.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""Answer the question using only the provided context.

Context:
{context}

Question: {query}

If the context doesn't contain enough information, say so.""",
        }],
    )
    return response.content[0].text
```
This "use only the provided context" instruction is simple, but it's doing a lot of work. I'll often add one more line like "If context is ambiguous, list assumptions" because ambiguity is where hallucinations tend to sneak in.
2. Contextual Embeddings
Location: capabilities/contextual-embeddings/guide.ipynb
I've found standard chunking loses context in ways that are subtle until you debug a bad answer. Contextual embeddings are one of the cleaner fixes: you add document-level context before embedding so each chunk is semantically anchored.
The Problem
A chunk like "He increased revenue by 40%" loses meaning without knowing who "He" refers to.
Solution: Context Prepending
```python
def add_context_to_chunk(chunk: str, document_summary: str, chunk_position: int) -> str:
    context_prefix = f"""Document Summary: {document_summary}
This is chunk {chunk_position} of the document.
---
"""
    return context_prefix + chunk


def create_contextual_embeddings(document: str) -> tuple[list[str], list[list[float]]]:
    # One summary per document, generated once and reused for every chunk.
    summary = anthropic.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{"role": "user", "content": f"Summarize this document in 2-3 sentences:\n\n{document}"}],
    ).content[0].text
    chunks = chunk_document(document)
    contextualized = [
        add_context_to_chunk(chunk, summary, i)
        for i, chunk in enumerate(chunks)
    ]
    embeddings = embed_chunks(contextualized)
    return chunks, embeddings  # store original chunks, but embed the contextualized versions
```
The trade-off is extra cost/latency (you're summarizing each document). I still like this pattern when documents are long and pronoun-heavy (reports, narratives, tickets), because it reduces those "I retrieved the right chunk but it still didn't answer" failures.
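Usage is otherwise unchanged, which is the nice part: the contextual embeddings feed retrieval, while the original chunks are what land in the prompt. A quick sketch reusing the helpers above (the path and question are placeholders):

```python
document = open("docs/quarterly_report.txt", encoding="utf-8").read()  # placeholder
chunks, embeddings = create_contextual_embeddings(document)
# retrieve_relevant/rag_query compare the query against the contextualized
# embeddings but return and inject the original chunk text.
print(rag_query("Who led the revenue increase, and by how much?", chunks, embeddings))
```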
3. Summary Indexing
I use this when the corpus is big enough that "search all chunks every time" starts to hurt. The idea is: search documents by summary first, then dive into the best few.
```python
class SummaryIndex:
    def __init__(self):
        self.documents = []
        self.doc_summaries = []
        self.doc_embeddings = []
        self.chunk_data = []  # list of (chunks, embeddings) per document

    def add_document(self, doc_id: str, text: str):
        summary = anthropic.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=300,
            messages=[{"role": "user", "content": f"Summarize:\n{text}"}],
        ).content[0].text
        self.documents.append(doc_id)
        self.doc_summaries.append(summary)
        # Summaries are searched *against*, so embed them as documents, not queries.
        self.doc_embeddings.append(embed_chunks([summary])[0])
        chunks = chunk_document(text)
        chunk_embeddings = embed_chunks(chunks)
        self.chunk_data.append((chunks, chunk_embeddings))

    def search(self, query: str, top_docs: int = 3, top_chunks: int = 5) -> list[str]:
        query_emb = embed_query(query)
        doc_scores = [
            (i, cosine_similarity(query_emb, emb))
            for i, emb in enumerate(self.doc_embeddings)
        ]
        doc_scores.sort(key=lambda x: x[1], reverse=True)
        # Score chunks from the best documents, then keep the best chunks overall
        # (not just whatever the first document happened to contribute).
        scored_chunks = []
        for doc_idx, _ in doc_scores[:top_docs]:
            chunks, embeddings = self.chunk_data[doc_idx]
            for i, emb in enumerate(embeddings):
                scored_chunks.append((cosine_similarity(query_emb, emb), chunks[i]))
        scored_chunks.sort(key=lambda x: x[0], reverse=True)
        return [chunk for _, chunk in scored_chunks[:top_chunks]]
```
One thing to watch out for: if your documents change often, you need a refresh strategy (or you'll end up retrieving stale summaries that don't match the underlying text).
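The cookbook doesn't prescribe a refresh strategy, so the sketch below is my own minimal take: drop the stale entries for a document and re-add it from the current text. `RefreshableSummaryIndex` is a name I've invented for the example.

```python
class RefreshableSummaryIndex(SummaryIndex):
    def refresh_document(self, doc_id: str, new_text: str):
        if doc_id in self.documents:
            idx = self.documents.index(doc_id)
            # Delete from every parallel list so they stay aligned.
            for lst in (self.documents, self.doc_summaries,
                        self.doc_embeddings, self.chunk_data):
                del lst[idx]
        # Re-summarize and re-embed from the current text.
        self.add_document(doc_id, new_text)
```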
4. Reranking
I think reranking is the "make it feel smart" step. Vector search gets you decent recall; reranking is how you stop feeding the model plausible-but-wrong context.
Cross-encoder Reranking with Claude
```python
import json


def rerank_with_claude(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    chunks_text = "\n\n".join([f"[{i}] {chunk}" for i, chunk in enumerate(chunks)])
    response = anthropic.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Rank these passages by relevance to the query.

Query: {query}

Passages:
{chunks_text}

Return the indices of the {top_k} most relevant passages as a JSON array.
Example: [2, 0, 4]""",
        }],
    )
    # Assumes the model returns bare JSON; in practice you may want to strip
    # code fences or fall back to the original order if parsing fails.
    indices = json.loads(response.content[0].text)
    return [chunks[i] for i in indices]
```
This is also a nice place to add "diversity" constraints if you keep getting near-duplicate chunks.
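One way to get that diversity, which is my own habit rather than anything in the cookbook: drop near-duplicate chunks before reranking, judged by embedding similarity. The `max_similarity` cutoff below is a guess to tune.

```python
def deduplicate_chunks(chunks: list[str], embeddings: list[list[float]],
                       max_similarity: float = 0.95) -> list[str]:
    kept_chunks: list[str] = []
    kept_embeddings: list[list[float]] = []
    for chunk, emb in zip(chunks, embeddings):
        # Keep a chunk only if it isn't almost identical to something already kept.
        if all(cosine_similarity(emb, kept) < max_similarity for kept in kept_embeddings):
            kept_chunks.append(chunk)
            kept_embeddings.append(emb)
    return kept_chunks
```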
Hybrid Retrieval
Combine vector search with keyword search:
```python
from rank_bm25 import BM25Okapi


def hybrid_retrieve(query: str, chunks: list[str],
                    embeddings: list[list[float]], alpha: float = 0.5) -> list[str]:
    # Dense scores from the embedding model.
    query_emb = embed_query(query)
    vector_scores = np.array([cosine_similarity(query_emb, emb) for emb in embeddings])
    # Sparse scores from BM25 over whitespace-tokenized chunks.
    tokenized_chunks = [chunk.lower().split() for chunk in chunks]
    bm25 = BM25Okapi(tokenized_chunks)
    bm25_scores = np.array(bm25.get_scores(query.lower().split()))
    # Min-max normalize both so alpha blends comparable ranges.
    vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min() + 1e-8)
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-8)
    combined = alpha * vector_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(combined)[::-1][:5]
    return [chunks[i] for i in top_indices]
```
In my experience, hybrid retrieval helps most on the following (see the sketch after this list for wiring it into the reranker):
- short queries,
- proper nouns / IDs / error codes,
- domains where exact phrasing matters (APIs, logs, legal terms).
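And here's how I'd wire hybrid retrieval and reranking together: cast a wide net with the blended scores, then let the reranker pick the final few. A sketch under the assumption that you keep the function signatures above.

```python
def retrieve_and_rerank(query: str, chunks: list[str],
                        embeddings: list[list[float]]) -> list[str]:
    # Hybrid retrieval for recall, Claude reranking for precision.
    candidates = hybrid_retrieve(query, chunks, embeddings, alpha=0.5)
    return rerank_with_claude(query, candidates, top_k=3)
```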
Summary
| Technique | Purpose |
|---|---|
| Basic RAG | Ground responses in external knowledge |
| Contextual Embeddings | Preserve document context in chunks |
| Summary Indexing | Hierarchical search for large corpora |
| Reranking | Improve precision after initial retrieval |
| Hybrid Search | Combine semantic and keyword matching |
Next I'll switch from "retrieve and answer" to tool use—because RAG is great for knowledge, but tools are how you actually do things.