Introduction
"Extended Thinking" is one of those features I was skeptical about until I started using it on problems that actually deserve multi-step reasoning: tricky debugging, tradeoff-heavy decisions, multi-stage calculations, and anything where I want the model to slow down.
That said, I don't think it's a "turn it on everywhere" setting. It costs tokens, it can add latency, and (depending on your product) you may not want to show internal reasoning to end users. I treat it like a tool: useful when the task demands it, noisy when it doesn't.
Location: extended_thinking/
1. Basic Extended Thinking
Location: extended_thinking/extended_thinking.ipynb
Enabling Thinking
This is the minimal shape: set thinking with a token budget and then parse the mixed response blocks. The structure matters because you'll often want to log the thinking internally even if you only show the final answer to a user.
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={
"type": "enabled",
"budget_tokens": 10000 # Max tokens for thinking
},
messages=[{
"role": "user",
"content": "Solve this step by step: If a train travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours, what is the average speed for the entire journey?"
}]
)
# Response contains both thinking and text blocks
for block in response.content:
if block.type == "thinking":
print("=== THINKING ===")
print(block.thinking)
print()
elif block.type == "text":
print("=== ANSWER ===")
print(block.text)
When to Use Extended Thinking
I've found this table to be basically right, with one extra note: if your prompt already has a tight structure and the task is routine, thinking often doesn't help much.
| Use Case | Benefit |
|---|---|
| Math problems | Step-by-step calculation |
| Logic puzzles | Systematic reasoning |
| Code analysis | Thorough examination |
| Complex decisions | Weighing multiple factors |
| Debugging | Methodical investigation |
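Before standardizing on a budget, I like to measure rather than guess. Here's a minimal sketch that runs the same prompt with and without thinking and compares output token usage; it assumes the client from the basic example above, and my understanding that thinking tokens are billed as output tokens:

def compare_thinking_cost(question: str) -> None:
    # Same prompt, two calls: one baseline, one with a thinking budget
    base = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16000,
        messages=[{"role": "user", "content": question}],
    )
    with_thinking = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 5000},
        messages=[{"role": "user", "content": question}],
    )
    # Thinking tokens count toward output tokens, so the delta here
    # approximates the extra cost of enabling thinking for this prompt
    print(f"Without thinking: {base.usage.output_tokens} output tokens")
    print(f"With thinking:    {with_thinking.usage.output_tokens} output tokens")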
2. Thinking with Tool Use
Location: extended_thinking/extended_thinking_with_tool_use.ipynb
This is where Extended Thinking gets more interesting for me: it improves tool sequencing, especially on "I need to compute X, then convert units, then compute Y" style tasks.
One thing I'm not fully sure about (and it probably depends on your setup) is how much of the improvement comes from thinking vs. simply having tools available. But in practice, enabling thinking tends to reduce the weird "tool thrash" where the model calls tools in an unhelpful order.
tools = [
{
"name": "calculator",
"description": "Perform mathematical calculations",
"input_schema": {
"type": "object",
"properties": {
"expression": {"type": "string", "description": "Math expression to evaluate"}
},
"required": ["expression"]
}
},
{
"name": "unit_converter",
"description": "Convert between units",
"input_schema": {
"type": "object",
"properties": {
"value": {"type": "number"},
"from_unit": {"type": "string"},
"to_unit": {"type": "string"}
},
"required": ["value", "from_unit", "to_unit"]
}
}
]
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 5000},
tools=tools,
messages=[{
"role": "user",
"content": "A car travels 150 kilometers in 2 hours. What is its speed in miles per hour?"
}]
)
# Claude will think about:
# 1. What information is given
# 2. What needs to be calculated
# 3. Which tools to use and in what order
# Then request the tool calls, which your code still has to execute (see the loop below)
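The comments above stop at the interesting part, so here's a hedged sketch of the rest of the loop. One detail that matters (per my reading of the docs): when thinking is enabled, the assistant's thinking blocks must be passed back unmodified alongside the tool_use blocks when you return tool results, and passing response.content through verbatim handles that. run_tool is a hypothetical dispatcher for your own tool implementations, not an SDK function.

messages = [{
    "role": "user",
    "content": "A car travels 150 kilometers in 2 hours. What is its speed in miles per hour?"
}]
while response.stop_reason == "tool_use":
    # Echo the assistant turn back verbatim, thinking blocks included
    messages.append({"role": "assistant", "content": response.content})
    tool_results = [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": str(run_tool(block.name, block.input)),  # hypothetical dispatcher
        }
        for block in response.content
        if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 5000},
        tools=tools,
        messages=messages,
    )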
Complex Multi-step Problems
When I'm deciding whether to spend a bigger thinking budget, I look for "nested" tasks like this—multiple steps, unit conversions, and money math at the end.
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 8000},
tools=tools,
messages=[{
"role": "user",
"content": """A cylindrical water tank has a radius of 3 meters and height of 5 meters.
1. Calculate the volume in cubic meters
2. Convert to gallons
3. If water costs $0.003 per gallon, how much to fill the tank?"""
}]
)
3. Structured Problem Solving
Analysis Framework
This helper function is a pattern I keep reusing: give the model a framework and force it to walk through it. Even without Extended Thinking, structured prompts help; with thinking enabled, the model tends to adhere to the structure more consistently.
def analyze_with_thinking(problem: str, framework: str = "general") -> dict:
frameworks = {
"general": """Analyze this problem using:
1. Understanding: What is being asked?
2. Given: What information do we have?
3. Approach: What method should we use?
4. Solution: Step-by-step work
5. Verification: Check the answer""",
"debugging": """Debug this issue using:
1. Symptoms: What is the observed behavior?
2. Expected: What should happen?
3. Hypotheses: Possible causes
4. Investigation: Check each hypothesis
5. Root Cause: Identify the actual issue
6. Fix: Proposed solution""",
"decision": """Analyze this decision using:
1. Options: What are the choices?
2. Criteria: What factors matter?
3. Analysis: Evaluate each option
4. Tradeoffs: Pros and cons
5. Recommendation: Best choice and why"""
}
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 10000},
messages=[{
"role": "user",
"content": f"{frameworks[framework]}\n\nProblem:\n{problem}"
}]
)
thinking = ""
answer = ""
for block in response.content:
if block.type == "thinking":
thinking = block.thinking
elif block.type == "text":
answer = block.text
return {"thinking": thinking, "answer": answer}
Code Review with Thinking
I like this pattern, but I'm cautious about it: it's easy to get false confidence from a "thorough-sounding" review. I treat it as a starting point, not a substitute for tests or careful reading.
def review_code(code: str) -> dict:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 8000},
messages=[{
"role": "user",
"content": f"""Review this code for:
1. Correctness - Does it work as intended?
2. Bugs - Any potential issues?
3. Performance - Can it be optimized?
4. Security - Any vulnerabilities?
5. Style - Does it follow best practices?
Code:
```
{code}
```
Provide specific line references and suggestions."""
}]
)
result = {"thinking": "", "review": ""}
for block in response.content:
if block.type == "thinking":
result["thinking"] = block.thinking
elif block.type == "text":
result["review"] = block.text
return result
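A quick usage example; the inline comment is what I'd expect a review to flag, not a guaranteed output:

buggy = """
def get_user(users, user_id):
    for u in users:
        if u.id == user_id:
            return u
"""
result = review_code(buggy)
print(result["review"])  # should flag the implicit None return when no user matches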
4. Thinking Budget Strategies
Adaptive Budgets
I've experimented with "auto budgets" like this and I still don't know if it's always worth the extra classifier call. But when costs matter, it can be a reasonable compromise: most prompts don't need 10k thinking tokens, and a small classifier pass can prevent waste.
def query_with_adaptive_thinking(question: str, complexity: str = "auto") -> str:
budgets = {
"simple": 2000,
"medium": 5000,
"complex": 10000,
"very_complex": 15000
}
if complexity == "auto":
# Estimate complexity based on question
classifier = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=50,
messages=[{
"role": "user",
"content": f"Rate the reasoning complexity (simple/medium/complex/very_complex): {question}"
}]
)
complexity = classifier.content[0].text.strip().lower()
if complexity not in budgets:
complexity = "medium"
budget = budgets.get(complexity, 5000)
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": budget},
messages=[{"role": "user", "content": question}]
)
    return next((b.text for b in response.content if b.type == "text"), "")
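In use, the "auto" path costs one extra (cheap, 50-token) call before the real one, so I only reach for it where question complexity genuinely varies:

# Likely classified "simple": small budget, minimal waste
print(query_with_adaptive_thinking("What's 15% of 80?"))
# Likely classified "complex" or higher: gets a bigger budget automatically
print(query_with_adaptive_thinking(
    "Design a rate limiter for a multi-region API and compare token bucket "
    "vs. sliding window approaches under clock skew."
))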
5. Thinking Visibility Control
When to Show Thinking
This is a subtle product decision. For internal tooling, I often want thinking visible for debugging. For end-user apps, I usually default to hiding it unless there's clear value and a clear policy.
def query_with_thinking_control(question: str, show_thinking: bool = False) -> str | dict:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 5000},
messages=[{"role": "user", "content": question}]
)
thinking = ""
answer = ""
for block in response.content:
if block.type == "thinking":
thinking = block.thinking
elif block.type == "text":
answer = block.text
if show_thinking:
return {"thinking": thinking, "answer": answer}
return answer
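One edge case worth handling if you ever surface thinking: the API can return redacted_thinking blocks, where the reasoning was flagged by safety systems and comes back encrypted in a data field instead of as readable text. A small hedged extension of the parsing loop above:

thinking, answer = "", ""
for block in response.content:
    if block.type == "thinking":
        thinking = block.thinking
    elif block.type == "redacted_thinking":
        # Encrypted content; never show it raw to users, but pass the block
        # back unmodified if you continue the conversation
        thinking = "[thinking redacted]"
    elif block.type == "text":
        answer = block.text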
Summary
| Concept | Description |
|---|---|
| Basic Thinking | Enable reasoning transparency with thinking parameter |
| Budget Control | Allocate tokens for thinking process |
| Tool Integration | Improved tool selection with reasoning |
| Structured Analysis | Framework-based problem solving |
| Adaptive Budgets | Match complexity to thinking allocation |
In the final post, I'll pull together a handful of "advanced" topics that feel like the real work of shipping: tuning, document generation, observability, tool evaluation, and even prompt patterns for UI output.