Introduction
I spent time going through Anthropic's Claude Cookbooks repository because I wanted more than "prompt tips"—I wanted runnable patterns I could actually reuse. The notebooks are uneven (some are more polished than others), but the capabilities folder is a solid baseline for the three things I keep reaching for in real projects: classification, summarization, and text-to-SQL.
A quick framing: these are "sharp tools." They work best when you constrain outputs, validate on your side, and assume the model will occasionally surprise you.
Repository Structure
claude-cookbooks/
├── capabilities/              # Core use cases
│   ├── classification/
│   ├── summarization/
│   ├── retrieval_augmented_generation/
│   ├── contextual-embeddings/
│   └── text_to_sql/
├── tool_use/                  # Function calling
├── multimodal/                # Vision capabilities
├── patterns/agents/           # Agentic architectures
├── extended_thinking/         # Complex reasoning
├── claude_agent_sdk/          # SDK tutorials
├── skills/                    # Document generation
├── finetuning/                # Model customization
└── observability/             # Monitoring & costs
1. Classification
Location: capabilities/classification/guide.ipynb
When I'm doing classification with an LLM, the biggest lever isn't the model—it's how hard I pin down the output format. If you let the model "explain itself," you'll get prose. If you make it pick from an explicit set, it behaves more like a classifier.
Zero-shot Classification
from anthropic import Anthropic

client = Anthropic()

def classify_text(text: str, categories: list[str]) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=100,
        messages=[{
            "role": "user",
            "content": f"""Classify the following text into exactly one category.
Categories: {', '.join(categories)}
Text: {text}
Respond with only the category name."""
        }]
    )
    return response.content[0].text.strip()
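A quick call, with a label set I made up purely for illustration:

labels = ["billing", "technical_support", "sales", "other"]  # hypothetical categories
ticket = "My invoice shows a charge I don't recognize from last month."
print(classify_text(ticket, labels))  # most likely "billing"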
In my experience, this pattern holds up surprisingly well as long as:
- your category list is short and mutually exclusive (or at least mostly),
- you don't ask for a rationale in the same response (that's when the model gets chatty).
Multi-label Classification
For documents that may belong to multiple categories:
import json

def classify_multi_label(text: str, categories: list[str]) -> list[str]:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": f"""Analyze the following text and identify ALL applicable categories.
Categories: {', '.join(categories)}
Text: {text}
Return a JSON array of matching categories. Example: ["category1", "category2"]"""
        }]
    )
    return json.loads(response.content[0].text)
A common mistake is treating json.loads(...) as "done." In real code, I usually wrap this with retry + a stricter schema (or tool-based extraction) because one stray trailing comment will break parsing.
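As a rough sketch of what I mean, here's the retry-and-validate wrapper I tend to use. The retry count, the prompt wording, and the "reject unknown labels" check are my own choices, not something from the cookbook:

def classify_multi_label_safe(text: str, categories: list[str], retries: int = 2) -> list[str]:
    # Retry on parse failures and reject any label outside the allowed set.
    for _ in range(retries + 1):
        raw = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=200,
            messages=[{
                "role": "user",
                "content": f"""Identify ALL applicable categories.
Categories: {', '.join(categories)}
Text: {text}
Return ONLY a JSON array, no prose, no code fences."""
            }]
        ).content[0].text.strip()
        try:
            labels = json.loads(raw)
            if isinstance(labels, list) and all(label in categories for label in labels):
                return labels
        except json.JSONDecodeError:
            pass  # malformed output; fall through and retry
    raise ValueError("Could not get a clean JSON array of known categories")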
Confidence Scores
Request confidence along with classification:
import json

def classify_with_confidence(text: str, categories: list[str]) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": f"""Classify this text and provide confidence scores.
Categories: {', '.join(categories)}
Text: {text}
Return JSON: {{"category": "...", "confidence": 0.0-1.0, "reasoning": "..."}}"""
        }]
    )
    return json.loads(response.content[0].text)
My take: confidence numbers from LLMs are useful as ranking signals, not calibrated probabilities. If you need real calibration, you'll still want held-out evaluation and thresholds tuned to your domain.
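Concretely, I treat the score as a routing signal. The 0.7 cutoff below is arbitrary; in practice it should come from a labeled held-out set:

REVIEW_THRESHOLD = 0.7  # arbitrary; tune on held-out data for your domain

def classify_or_escalate(text: str, categories: list[str]) -> dict:
    result = classify_with_confidence(text, categories)
    # Below the threshold, don't trust the label; flag it for human review instead.
    result["needs_review"] = result.get("confidence", 0.0) < REVIEW_THRESHOLD
    return result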
2. Summarization
Location: capabilities/summarization/guide.ipynb
Summarization looks trivial until you run it at scale. The cookbook patterns are a good starting point, but what matters most is being explicit about length and what to preserve (numbers, names, decisions, action items, etc.).
Extractive vs. Abstractive
Claude supports both approaches:
def summarize(text: str, style: str = "abstractive", length: str = "medium") -> str:
    length_guide = {
        "short": "2-3 sentences",
        "medium": "1 paragraph",
        "long": "3-4 paragraphs",
    }
    style_instruction = (
        "Rephrase and synthesize the main ideas in your own words."
        if style == "abstractive"
        else "Extract and quote the most important sentences directly."
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Summarize the following text.
Length: {length_guide[length]}
Style: {style_instruction}
Text:
{text}"""
        }]
    )
    return response.content[0].text
If I'm summarizing anything where "exact wording" matters (legal, policy, incident postmortems), I lean extractive first, then do a second pass to reorganize—otherwise it's too easy to accidentally soften or reinterpret.
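The two-pass version I use looks roughly like this; the second-pass prompt is mine, not from the notebook:

def summarize_two_pass(text: str) -> str:
    # Pass 1: pull exact sentences so nothing gets paraphrased away.
    extractive = summarize(text, style="extractive", length="long")
    # Pass 2: reorganize the quotes without introducing new claims.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""Reorganize these quoted excerpts into a coherent summary.
Keep figures, names, and decisions exactly as quoted. Do not add new claims.
Excerpts:
{extractive}"""
        }]
    )
    return response.content[0].text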
Hierarchical Summarization
For very long documents, summarize in chunks then synthesize:
def hierarchical_summarize(chunks: list[str], final_length: str = "medium") -> str:
    chunk_summaries = []
    for chunk in chunks:
        summary = summarize(chunk, length="short")
        chunk_summaries.append(summary)
    combined = "\n\n".join(chunk_summaries)
    return summarize(combined, length=final_length)
This is one of those patterns that "just works," but I've also seen it compound mistakes (one bad chunk summary can pollute the final). If the document is high-stakes, I add a verification step ("list 10 key facts and quote the source sentence") before producing the final narrative.
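That verification step, as a sketch (the prompt wording and the "up to 10 facts" limit are my own):

def extract_verified_facts(chunk: str) -> str:
    # Tie each fact to a verbatim quote from the source chunk, so the final
    # pass can be checked against the document rather than another summary.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{
            "role": "user",
            "content": f"""List the key facts in this text (up to 10).
After each fact, quote the exact source sentence it comes from.
Text:
{chunk}"""
        }]
    )
    return response.content[0].text

def hierarchical_summarize_verified(chunks: list[str], final_length: str = "medium") -> str:
    verified = "\n\n".join(extract_verified_facts(c) for c in chunks)
    return summarize(verified, length=final_length)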
3. Text-to-SQL
Location: capabilities/text_to_sql/guide.ipynb
Text-to-SQL is the fastest way I know to turn "we should make data accessible" into something demo-able. It's also where you absolutely have to treat outputs as untrusted: validate, restrict permissions, and log everything.
Schema-aware Query Generation
def text_to_sql(question: str, schema: str, dialect: str = "postgresql") -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Generate a {dialect} query for this question.
Database Schema:
{schema}
Question: {question}
Return only the SQL query, no explanation."""
        }]
    )
    return response.content[0].text.strip()
One thing I don't think the docs emphasize enough: how you format the schema matters. I've had better results when I include primary keys, foreign keys, and a couple of representative rows, especially when table names are vague.
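Here's the shape of schema string I mean; the tables, columns, and sample rows are entirely made up for illustration:

schema = """
Table: orders
  order_id    INTEGER PRIMARY KEY
  customer_id INTEGER REFERENCES customers(customer_id)
  total_cents INTEGER
  created_at  TIMESTAMP
Sample rows:
  (1001, 42, 2599, '2024-03-01 10:15:00')
  (1002, 7,  480,  '2024-03-01 11:02:00')

Table: customers
  customer_id INTEGER PRIMARY KEY
  email       TEXT
  region      TEXT
Sample rows:
  (42, 'a@example.com', 'EMEA')
  (7,  'b@example.com', 'APAC')
"""

print(text_to_sql("What was total revenue by region last month?", schema))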
With Query Validation
Add self-verification:
import json

def text_to_sql_validated(question: str, schema: str) -> dict:
    query = text_to_sql(question, schema)
    validation = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"""Review this SQL query for correctness.
Schema: {schema}
Question: {question}
Query: {query}
Return JSON: {{"valid": true/false, "issues": [...], "corrected_query": "..."}}"""
        }]
    )
    result = json.loads(validation.content[0].text)
    result["original_query"] = query
    return result
This self-check catches a lot of silly mistakes (wrong join key, wrong table). I still run the query in a sandboxed role and cap runtime/rows, because "looks valid" isn't the same as "safe."
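For the execution side, a minimal Postgres sketch. This assumes psycopg (v3) and a role that only has SELECT grants; the 5-second timeout and 1000-row cap are arbitrary choices of mine:

import psycopg

def run_readonly(query: str, dsn: str, max_rows: int = 1000):
    # Connect with a statement timeout and mark the session read-only,
    # then cap how many rows we pull back regardless of what the query returns.
    with psycopg.connect(dsn, options="-c statement_timeout=5000") as conn:
        conn.read_only = True
        with conn.cursor() as cur:
            cur.execute(query)
            return cur.fetchmany(max_rows)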
Summary
If you're skimming, these are the three patterns I keep coming back to:
| Capability | Use Case |
|---|---|
| Classification | Categorizing text, sentiment analysis, intent detection |
| Summarization | Document condensation, meeting notes, article summaries |
| Text-to-SQL | Natural language database queries |
Next up, I'll get into RAG and contextual embeddings, where these "single-call" recipes start to compound into something that feels like a real system.