Claude Cookbooks Study: Part 1 of 8

Claude Cookbooks (1): Overview & Basic Capabilities

Introduction

I spent time going through Anthropic's Claude Cookbooks repository because I wanted more than "prompt tips"—I wanted runnable patterns I could actually reuse. The notebooks are uneven (some are more polished than others), but the capabilities folder is a solid baseline for the three things I keep reaching for in real projects: classification, summarization, and text-to-SQL.

A quick framing: these are "sharp tools." They work best when you constrain outputs, validate on your side, and assume the model will occasionally surprise you.


Repository Structure

claude-cookbooks/
├── capabilities/          # Core use cases
│   ├── classification/
│   ├── summarization/
│   ├── retrieval_augmented_generation/
│   ├── contextual-embeddings/
│   └── text_to_sql/
├── tool_use/              # Function calling
├── multimodal/            # Vision capabilities
├── patterns/agents/       # Agentic architectures
├── extended_thinking/     # Complex reasoning
├── claude_agent_sdk/      # SDK tutorials
├── skills/                # Document generation
├── finetuning/            # Model customization
└── observability/         # Monitoring & costs

1. Classification

Location: capabilities/classification/guide.ipynb

When I'm doing classification with an LLM, the biggest lever isn't the model—it's how hard I pin down the output format. If you let the model "explain itself," you'll get prose. If you make it pick from an explicit set, it behaves more like a classifier.

Zero-shot Classification

import json

from anthropic import Anthropic

client = Anthropic()

def classify_text(text: str, categories: list[str]) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Classify the following text into exactly one category.

Categories: {', '.join(categories)}

Text: {text}

Respond with only the category name."""
        }]
    )
    return response.content[0].text.strip()

In my experience, this pattern holds up surprisingly well as long as:

  • your category list is short and mutually exclusive (or at least mostly),
  • you don't ask for a rationale in the same response (that's when the model gets chatty).
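
A quick usage sketch (the ticket text and categories here are made up for illustration):

categories = ["billing", "bug report", "feature request", "other"]
label = classify_text("I was charged twice this month.", categories)

# Guard against the model answering outside the allowed set
if label not in categories:
    label = "other"
print(label)  # expected: "billing"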

Multi-label Classification

For documents that may belong to multiple categories:

def classify_multi_label(text: str, categories: list[str]) -> list[str]:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Analyze the following text and identify ALL applicable categories.

Categories: {', '.join(categories)}

Text: {text}

Return a JSON array of matching categories. Example: ["category1", "category2"]"""
        }]
    )
    return json.loads(response.content[0].text)

A common mistake is treating json.loads(...) as "done." In real code, I usually wrap this with retry + a stricter schema (or tool-based extraction) because one stray trailing comment will break parsing.
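
Here's a sketch of the kind of wrapper I mean (the retry count, the fallback to an empty list, and the function name are my own choices, not from the cookbook):

def classify_multi_label_safe(text: str, categories: list[str], retries: int = 2) -> list[str]:
    for _ in range(retries + 1):
        try:
            labels = classify_multi_label(text, categories)
        except json.JSONDecodeError:
            continue  # malformed output; make a fresh API call
        # Never trust labels outside the allowed set
        return [label for label in labels if label in categories]
    return []  # exhausted retries; callers treat this as "unclassified"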

Confidence Scores

Request confidence along with classification:

def classify_with_confidence(text: str, categories: list[str]) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Classify this text and provide confidence scores.

Categories: {', '.join(categories)}
Text: {text}

Return JSON: {{"category": "...", "confidence": 0.0-1.0, "reasoning": "..."}}"""
        }]
    )
    return json.loads(response.content[0].text)

My take: confidence numbers from LLMs are useful as ranking signals, not calibrated probabilities. If you need real calibration, you'll still want held-out evaluation and thresholds tuned to your domain.
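
To make that concrete: I use the score for routing, not as a probability. A minimal sketch, where the 0.8 threshold is a placeholder you'd tune on labeled data:

REVIEW_THRESHOLD = 0.8  # placeholder; tune on a held-out labeled set

def classify_or_escalate(text: str, categories: list[str]) -> str:
    result = classify_with_confidence(text, categories)
    # Low-confidence items go to a human queue instead of being trusted
    if result["confidence"] < REVIEW_THRESHOLD:
        return "needs_human_review"
    return result["category"]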


2. Summarization

Location: capabilities/summarization/guide.ipynb

Summarization looks trivial until you run it at scale. The cookbook patterns are a good starting point, but what matters most is being explicit about length and what to preserve (numbers, names, decisions, action items, etc.).

Extractive vs. Abstractive

Claude supports both approaches:

def summarize(text: str, style: str = "abstractive", length: str = "medium") -> str:
    length_guide = {
        "short": "2-3 sentences",
        "medium": "1 paragraph",
        "long": "3-4 paragraphs"
    }
    
    style_instruction = (
        "Rephrase and synthesize the main ideas in your own words."
        if style == "abstractive"
        else "Extract and quote the most important sentences directly."
    )
    
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Summarize the following text.

Length: {length_guide[length]}
Style: {style_instruction}

Text:
{text}"""
        }]
    )
    return response.content[0].text

If I'm summarizing anything where "exact wording" matters (legal, policy, incident postmortems), I lean extractive first, then do a second pass to reorganize—otherwise it's too easy to accidentally soften or reinterpret.
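
Roughly, that two-pass flow looks like this, reusing summarize() from above (the reorganization prompt wording is mine, not the cookbook's):

def summarize_two_pass(text: str) -> str:
    # Pass 1: pull exact sentences so wording can't drift
    quotes = summarize(text, style="extractive", length="long")
    # Pass 2: reorder and connect the quotes without paraphrasing them
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Reorganize these quoted excerpts into a coherent summary.
Keep every quote verbatim; only reorder them and add brief connective framing.

Excerpts:
{quotes}"""
        }]
    )
    return response.content[0].text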

Hierarchical Summarization

For very long documents, summarize in chunks then synthesize:

def hierarchical_summarize(chunks: list[str], final_length: str = "medium") -> str:
    chunk_summaries = []
    for chunk in chunks:
        summary = summarize(chunk, length="short")
        chunk_summaries.append(summary)
    
    combined = "\n\n".join(chunk_summaries)
    return summarize(combined, length=final_length)

This is one of those patterns that "just works," but I've also seen it compound mistakes (one bad chunk summary can pollute the final). If the document is high-stakes, I add a verification step ("list 10 key facts and quote the source sentence") before producing the final narrative.
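
That verification step is roughly the following (the prompt wording and fact count are my own; scale them to the document):

def verify_summary_facts(chunks: list[str], draft: str) -> str:
    source = "\n\n".join(chunks)
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1500,
        messages=[{
            "role": "user",
            "content": f"""Below are a draft summary and its source text.
List 10 key facts from the summary and, for each one, quote the source
sentence that supports it. Flag any fact with no supporting sentence.

Source:
{source}

Draft summary:
{draft}"""
        }]
    )
    return response.content[0].text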


3. Text-to-SQL

Location: capabilities/text_to_sql/guide.ipynb

Text-to-SQL is the fastest way I know to turn "we should make data accessible" into something demo-able. It's also where you absolutely have to treat outputs as untrusted: validate, restrict permissions, and log everything.

Schema-aware Query Generation

def text_to_sql(question: str, schema: str, dialect: str = "postgresql") -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Generate a {dialect} query for this question.

Database Schema:
{schema}

Question: {question}

Return only the SQL query, no explanation."""
        }]
    )
    return response.content[0].text.strip()

One thing I don't think the docs emphasize enough: the formatting of the schema matters. I've had better results when I include primary keys, foreign keys, and a couple of representative rows, especially when table names are vague.
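
Here's the level of schema detail I mean. This string is a made-up illustration of the format, not taken from the cookbook:

SCHEMA = """
TABLE orders (
    id          SERIAL PRIMARY KEY,
    customer_id INT REFERENCES customers(id),
    total_cents INT,          -- totals stored in cents, not dollars
    created_at  TIMESTAMPTZ
)
-- representative rows:
-- (1, 42, 1999, '2025-01-03 14:02:00+00')
-- (2, 42, 450,  '2025-01-04 09:30:00+00')
"""

query = text_to_sql("What did customer 42 spend in January?", SCHEMA)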

With Query Validation

Add self-verification:

def text_to_sql_validated(question: str, schema: str) -> dict:
    query = text_to_sql(question, schema)
    
    validation = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Review this SQL query for correctness.

Schema: {schema}
Question: {question}
Query: {query}

Return JSON: {{"valid": true/false, "issues": [...], "corrected_query": "..."}}"""
        }]
    )
    
    result = json.loads(validation.content[0].text)
    result["original_query"] = query
    return result

This self-check catches a lot of silly mistakes (wrong join key, wrong table). I still run the query in a sandboxed role and cap runtime/rows, because "looks valid" isn't the same as "safe."
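
The execution side, as a minimal sketch assuming PostgreSQL with psycopg and a read-only role already provisioned (the role name, connection string, and limits are all placeholders):

import psycopg

def run_readonly(query: str, max_rows: int = 1000) -> list[tuple]:
    # Connect as a role that only has SELECT on the reporting schema
    with psycopg.connect("dbname=app user=readonly_reporting") as conn:
        with conn.cursor() as cur:
            cur.execute("SET statement_timeout = '5s'")  # cap runtime
            cur.execute(query)
            return cur.fetchmany(max_rows)  # cap rows returned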


Summary

If you're skimming, these are the three patterns I keep coming back to:

Capability     | Use Case
Classification | Categorizing text, sentiment analysis, intent detection
Summarization  | Document condensation, meeting notes, article summaries
Text-to-SQL    | Natural language database queries

Next up, I get into RAG and contextual embeddings, where these "single-call" recipes start to compound into something that feels like a real system.