Introduction
RAG is the part of the cookbook I was most interested in, mostly because it's where "the model knows things" stops being a convincing story. In practice, RAG is a retrieval engineering problem with an LLM stapled on the end—and the cookbook does a good job showing the moving pieces without pretending it's magic.
What I've learned the hard way: you don't fix weak retrieval by stuffing more context into the prompt. You fix it with better chunking, better ranking, and tighter instructions about what the model is allowed to claim.
1. Basic RAG Pipeline
Location: capabilities/retrieval_augmented_generation/guide.ipynb
Document Processing
```python
from anthropic import Anthropic
import voyageai

anthropic = Anthropic()
voyage = voyageai.Client()


def chunk_document(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows with overlap."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk)
        start = end - overlap  # step back by `overlap` so context spans chunk boundaries
    return chunks
```
Chunking is one of those knobs that looks boring and then dominates everything. I usually start close to this (size ~800–1500 chars, overlap ~10–25%), then tune based on the following (a paragraph-aware variant is sketched after this list):
- whether answers require cross-paragraph context,
- whether I'm retrieving code/specs (often needs smaller chunks),
- cost/latency constraints.
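When answers keep needing cross-paragraph context, I sometimes swap the fixed window for paragraph packing. This is my own variant, not something from the cookbook; `chunk_by_paragraph` and its `max_chars` budget are names I made up for the sketch.

```python
def chunk_by_paragraph(text: str, max_chars: int = 1200) -> list[str]:
    """Pack whole paragraphs into chunks instead of cutting at a fixed offset."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk once adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```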
Embedding Generation
```python
def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # Documents and queries use different input_type values in Voyage.
    result = voyage.embed(
        texts=chunks,
        model="voyage-3",
        input_type="document",
    )
    return result.embeddings


def embed_query(query: str) -> list[float]:
    result = voyage.embed(
        texts=[query],
        model="voyage-3",
        input_type="query",
    )
    return result.embeddings[0]
```
I like the separation between document and query embedding types here; it's easy to accidentally mix them and silently degrade retrieval.
Similarity Search
```python
import numpy as np


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieve_relevant(query: str, chunks: list[str],
                      embeddings: list[list[float]], top_k: int = 5) -> list[str]:
    query_embedding = embed_query(query)
    similarities = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(embeddings)
    ]
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [chunks[i] for i, _ in similarities[:top_k]]
```
In my experience, top_k is rarely "the answer." What matters is: do you have a reranker (next section), and do you have guardrails when retrieved chunks conflict?
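One cheap guardrail I like (my addition, not the cookbook's): refuse to hand the model anything when even the best chunk is a weak match. `retrieve_with_guardrail` and the `min_score` cutoff below are illustrative; the threshold has to be tuned per embedding model and corpus.

```python
def retrieve_with_guardrail(query: str, chunks: list[str],
                            embeddings: list[list[float]],
                            top_k: int = 5, min_score: float = 0.3) -> list[str]:
    query_embedding = embed_query(query)
    scored = sorted(
        ((cosine_similarity(query_embedding, emb), i) for i, emb in enumerate(embeddings)),
        reverse=True,
    )
    # If nothing clears the bar, return nothing and let the caller say "not found"
    # instead of letting the model improvise from weak context.
    return [chunks[i] for score, i in scored[:top_k] if score >= min_score]
```

An empty result is a feature here: it's the signal to tell the user the corpus doesn't cover the question.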
Context Injection
```python
def rag_query(query: str, chunks: list[str], embeddings: list[list[float]]) -> str:
    relevant_chunks = retrieve_relevant(query, chunks, embeddings)
    context = "\n\n---\n\n".join(relevant_chunks)
    response = anthropic.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"""Answer the question using only the provided context.

Context:
{context}

Question: {query}

If the context doesn't contain enough information, say so.""",
        }],
    )
    return response.content[0].text
```
This "use only the provided context" instruction is simple, but it's doing a lot of work. I'll often add one more line like "If context is ambiguous, list assumptions" because ambiguity is where hallucinations tend to sneak in.
2. Contextual Embeddings
Location: capabilities/contextual-embeddings/guide.ipynb
I've found standard chunking loses context in ways that are subtle until you debug a bad answer. Contextual embeddings are one of the cleaner fixes: you add document-level context before embedding so each chunk is semantically anchored.
The Problem
A chunk like "He increased revenue by 40%" loses meaning without knowing who "He" refers to.
Solution: Context Prepending
```python
def add_context_to_chunk(chunk: str, document_summary: str, chunk_position: int) -> str:
    context_prefix = f"""Document Summary: {document_summary}
This is chunk {chunk_position} of the document.
---
"""
    return context_prefix + chunk


def create_contextual_embeddings(document: str) -> tuple[list[str], list[list[float]]]:
    # One summary per document, generated once and reused for every chunk.
    summary = anthropic.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{"role": "user", "content": f"Summarize this document in 2-3 sentences:\n\n{document}"}],
    ).content[0].text
    chunks = chunk_document(document)
    contextualized = [
        add_context_to_chunk(chunk, summary, i)
        for i, chunk in enumerate(chunks)
    ]
    embeddings = embed_chunks(contextualized)
    return chunks, embeddings  # store original chunks, but embed the contextualized versions
```
The trade-off is extra cost/latency (you're summarizing each document). I still like this pattern when documents are long and pronoun-heavy (reports, narratives, tickets), because it reduces those "I retrieved the right chunk but it still didn't answer" failures.
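Usage is otherwise unchanged, which is the nice part: the contextual embeddings feed retrieval, while the original chunks are what land in the prompt. A quick sketch reusing the helpers above (the path and question are placeholders):

```python
document = open("docs/quarterly_report.txt", encoding="utf-8").read()  # placeholder
chunks, embeddings = create_contextual_embeddings(document)
# retrieve_relevant/rag_query compare the query against the contextualized
# embeddings but return and inject the original chunk text.
print(rag_query("Who led the revenue increase, and by how much?", chunks, embeddings))
```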
3. Summary Indexing
I use this when the corpus is big enough that "search all chunks every time" starts to hurt. The idea is: search documents by summary first, then dive into the best few.
```python
class SummaryIndex:
    def __init__(self):
        self.documents = []
        self.doc_summaries = []
        self.doc_embeddings = []
        self.chunk_data = []  # list of (chunks, embeddings) per document

    def add_document(self, doc_id: str, text: str):
        summary = anthropic.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=300,
            messages=[{"role": "user", "content": f"Summarize:\n{text}"}],
        ).content[0].text
        self.documents.append(doc_id)
        self.doc_summaries.append(summary)
        # Summaries are searched *against*, so embed them as documents, not queries.
        self.doc_embeddings.append(embed_chunks([summary])[0])
        chunks = chunk_document(text)
        chunk_embeddings = embed_chunks(chunks)
        self.chunk_data.append((chunks, chunk_embeddings))

    def search(self, query: str, top_docs: int = 3, top_chunks: int = 5) -> list[str]:
        query_emb = embed_query(query)
        doc_scores = [
            (i, cosine_similarity(query_emb, emb))
            for i, emb in enumerate(self.doc_embeddings)
        ]
        doc_scores.sort(key=lambda x: x[1], reverse=True)
        # Score chunks from the best documents, then keep the best chunks overall
        # (not just whatever the first document happened to contribute).
        scored_chunks = []
        for doc_idx, _ in doc_scores[:top_docs]:
            chunks, embeddings = self.chunk_data[doc_idx]
            for i, emb in enumerate(embeddings):
                scored_chunks.append((cosine_similarity(query_emb, emb), chunks[i]))
        scored_chunks.sort(key=lambda x: x[0], reverse=True)
        return [chunk for _, chunk in scored_chunks[:top_chunks]]
```
One thing to watch out for: if your documents change often, you need a refresh strategy (or you'll end up retrieving stale summaries that don't match the underlying text).
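The cookbook doesn't prescribe a refresh strategy, so the sketch below is my own minimal take: drop the stale entries for a document and re-add it from the current text. `RefreshableSummaryIndex` is a name I've invented for the example.

```python
class RefreshableSummaryIndex(SummaryIndex):
    def refresh_document(self, doc_id: str, new_text: str):
        if doc_id in self.documents:
            idx = self.documents.index(doc_id)
            # Delete from every parallel list so they stay aligned.
            for lst in (self.documents, self.doc_summaries,
                        self.doc_embeddings, self.chunk_data):
                del lst[idx]
        # Re-summarize and re-embed from the current text.
        self.add_document(doc_id, new_text)
```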
4. Reranking
I think reranking is the "make it feel smart" step. Vector search gets you decent recall; reranking is how you stop feeding the model plausible-but-wrong context.
Cross-encoder Reranking with Claude
```python
import json


def rerank_with_claude(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    chunks_text = "\n\n".join([f"[{i}] {chunk}" for i, chunk in enumerate(chunks)])
    response = anthropic.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Rank these passages by relevance to the query.

Query: {query}

Passages:
{chunks_text}

Return the indices of the {top_k} most relevant passages as a JSON array.
Example: [2, 0, 4]""",
        }],
    )
    # Assumes the model returns bare JSON; in practice you may want to strip
    # code fences or fall back to the original order if parsing fails.
    indices = json.loads(response.content[0].text)
    return [chunks[i] for i in indices]
```
This is also a nice place to add "diversity" constraints if you keep getting near-duplicate chunks.
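One way to get that diversity, which is my own habit rather than anything in the cookbook: drop near-duplicate chunks before reranking, judged by embedding similarity. The `max_similarity` cutoff below is a guess to tune.

```python
def deduplicate_chunks(chunks: list[str], embeddings: list[list[float]],
                       max_similarity: float = 0.95) -> list[str]:
    kept_chunks: list[str] = []
    kept_embeddings: list[list[float]] = []
    for chunk, emb in zip(chunks, embeddings):
        # Keep a chunk only if it isn't almost identical to something already kept.
        if all(cosine_similarity(emb, kept) < max_similarity for kept in kept_embeddings):
            kept_chunks.append(chunk)
            kept_embeddings.append(emb)
    return kept_chunks
```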
Hybrid Retrieval
Combine vector search with keyword search:
```python
from rank_bm25 import BM25Okapi


def hybrid_retrieve(query: str, chunks: list[str],
                    embeddings: list[list[float]], alpha: float = 0.5) -> list[str]:
    # Dense scores from the embedding model.
    query_emb = embed_query(query)
    vector_scores = np.array([cosine_similarity(query_emb, emb) for emb in embeddings])
    # Sparse scores from BM25 over whitespace-tokenized chunks.
    tokenized_chunks = [chunk.lower().split() for chunk in chunks]
    bm25 = BM25Okapi(tokenized_chunks)
    bm25_scores = np.array(bm25.get_scores(query.lower().split()))
    # Min-max normalize both so alpha blends comparable ranges.
    vector_scores = (vector_scores - vector_scores.min()) / (vector_scores.max() - vector_scores.min() + 1e-8)
    bm25_scores = (bm25_scores - bm25_scores.min()) / (bm25_scores.max() - bm25_scores.min() + 1e-8)
    combined = alpha * vector_scores + (1 - alpha) * bm25_scores
    top_indices = np.argsort(combined)[::-1][:5]
    return [chunks[i] for i in top_indices]
```
In my experience, hybrid retrieval helps most on the following (see the sketch after this list for wiring it into the reranker):
- short queries,
- proper nouns / IDs / error codes,
- domains where exact phrasing matters (APIs, logs, legal terms).
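And here's how I'd wire hybrid retrieval and reranking together: cast a wide net with the blended scores, then let the reranker pick the final few. A sketch under the assumption that you keep the function signatures above.

```python
def retrieve_and_rerank(query: str, chunks: list[str],
                        embeddings: list[list[float]]) -> list[str]:
    # Hybrid retrieval for recall, Claude reranking for precision.
    candidates = hybrid_retrieve(query, chunks, embeddings, alpha=0.5)
    return rerank_with_claude(query, candidates, top_k=3)
```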
Summary
| Technique | Purpose |
|---|---|
| Basic RAG | Ground responses in external knowledge |
| Contextual Embeddings | Preserve document context in chunks |
| Summary Indexing | Hierarchical search for large corpora |
| Reranking | Improve precision after initial retrieval |
| Hybrid Search | Combine semantic and keyword matching |
Next I'll switch from "retrieve and answer" to tool use—because RAG is great for knowledge, but tools are how you actually do things.