Introduction
"Extended Thinking" is one of those features I was skeptical about until I started using it on problems that actually deserve multi-step reasoning: tricky debugging, tradeoff-heavy decisions, multi-stage calculations, and anything where I want the model to slow down.
That said, I don't think it's a "turn it on everywhere" setting. It costs tokens, it can add latency, and (depending on your product) you may not want to show internal reasoning to end users. I treat it like a tool: useful when the task demands it, noisy when it doesn't.
Location: extended_thinking/
1. Basic Extended Thinking
Location: extended_thinking/extended_thinking.ipynb
Enabling Thinking
This is the minimal shape: set thinking with a token budget and then parse the mixed response blocks. The structure matters because you'll often want to log the thinking internally even if you only show the final answer to a user.
from anthropic import Anthropic
client = Anthropic()
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={
"type": "enabled",
"budget_tokens": 10000 # Max tokens for thinking
},
messages=[{
"role": "user",
"content": "Solve this step by step: If a train travels at 60 mph for 2.5 hours, then at 80 mph for 1.5 hours, what is the average speed for the entire journey?"
}]
)
# Response contains both thinking and text blocks
for block in response.content:
if block.type == "thinking":
print("=== THINKING ===")
print(block.thinking)
print()
elif block.type == "text":
print("=== ANSWER ===")
print(block.text)
When to Use Extended Thinking
I've found this table to be basically right, with one extra note: if your prompt already has a tight structure and the task is routine, thinking often doesn't help much.
| Use Case | Benefit |
|---|---|
| Math problems | Step-by-step calculation |
| Logic puzzles | Systematic reasoning |
| Code analysis | Thorough examination |
| Complex decisions | Weighing multiple factors |
| Debugging | Methodical investigation |
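Before standardizing on a budget, I like to measure rather than guess. Here's a minimal sketch that runs the same prompt with and without thinking and compares output token usage; it assumes the client from the basic example above, and my understanding that thinking tokens are billed as output tokens:

def compare_thinking_cost(question: str) -> None:
    # Same prompt, two calls: one baseline, one with a thinking budget
    base = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16000,
        messages=[{"role": "user", "content": question}],
    )
    with_thinking = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 5000},
        messages=[{"role": "user", "content": question}],
    )
    # Thinking tokens count toward output tokens, so the delta here
    # approximates the extra cost of enabling thinking for this prompt
    print(f"Without thinking: {base.usage.output_tokens} output tokens")
    print(f"With thinking:    {with_thinking.usage.output_tokens} output tokens")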
2. Thinking with Tool Use
Location: extended_thinking/extended_thinking_with_tool_use.ipynb
This is where Extended Thinking gets more interesting for me: it improves tool sequencing, especially on "I need to compute X, then convert units, then compute Y" style tasks.
One thing I'm not fully sure about (and it probably depends on your setup) is how much of the improvement comes from thinking vs. simply having tools available. But in practice, enabling thinking tends to reduce the weird "tool thrash" where the model calls tools in an unhelpful order.
tools = [
{
"name": "calculator",
"description": "Perform mathematical calculations",
"input_schema": {
"type": "object",
"properties": {
"expression": {"type": "string", "description": "Math expression to evaluate"}
},
"required": ["expression"]
}
},
{
"name": "unit_converter",
"description": "Convert between units",
"input_schema": {
"type": "object",
"properties": {
"value": {"type": "number"},
"from_unit": {"type": "string"},
"to_unit": {"type": "string"}
},
"required": ["value", "from_unit", "to_unit"]
}
}
]
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 5000},
tools=tools,
messages=[{
"role": "user",
"content": "A car travels 150 kilometers in 2 hours. What is its speed in miles per hour?"
}]
)
# Claude will think about:
# 1. What information is given
# 2. What needs to be calculated
# 3. Which tools to use and in what order
# Then request the tool calls, which your code still has to execute (see the loop below)
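The comments above stop at the interesting part, so here's a hedged sketch of the rest of the loop. One detail that matters (per my reading of the docs): when thinking is enabled, the assistant's thinking blocks must be passed back unmodified alongside the tool_use blocks when you return tool results, and passing response.content through verbatim handles that. run_tool is a hypothetical dispatcher for your own tool implementations, not an SDK function.

messages = [{
    "role": "user",
    "content": "A car travels 150 kilometers in 2 hours. What is its speed in miles per hour?"
}]
while response.stop_reason == "tool_use":
    # Echo the assistant turn back verbatim, thinking blocks included
    messages.append({"role": "assistant", "content": response.content})
    tool_results = [
        {
            "type": "tool_result",
            "tool_use_id": block.id,
            "content": str(run_tool(block.name, block.input)),  # hypothetical dispatcher
        }
        for block in response.content
        if block.type == "tool_use"
    ]
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 5000},
        tools=tools,
        messages=messages,
    )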
Complex Multi-step Problems
When I'm deciding whether to spend a bigger thinking budget, I look for "nested" tasks like this—multiple steps, unit conversions, and money math at the end.
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 8000},
tools=tools,
messages=[{
"role": "user",
"content": """A cylindrical water tank has a radius of 3 meters and height of 5 meters.
1. Calculate the volume in cubic meters
2. Convert to gallons
3. If water costs $0.003 per gallon, how much to fill the tank?"""
}]
)
3. Structured Problem Solving
Analysis Framework
This helper function is a pattern I keep reusing: give the model a framework and force it to walk through it. Even without Extended Thinking, structured prompts help; with thinking enabled, the model tends to adhere to the structure more consistently.
def analyze_with_thinking(problem: str, framework: str = "general") -> dict:
frameworks = {
"general": """Analyze this problem using:
1. Understanding: What is being asked?
2. Given: What information do we have?
3. Approach: What method should we use?
4. Solution: Step-by-step work
5. Verification: Check the answer""",
"debugging": """Debug this issue using:
1. Symptoms: What is the observed behavior?
2. Expected: What should happen?
3. Hypotheses: Possible causes
4. Investigation: Check each hypothesis
5. Root Cause: Identify the actual issue
6. Fix: Proposed solution""",
"decision": """Analyze this decision using:
1. Options: What are the choices?
2. Criteria: What factors matter?
3. Analysis: Evaluate each option
4. Tradeoffs: Pros and cons
5. Recommendation: Best choice and why"""
}
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 10000},
messages=[{
"role": "user",
"content": f"{frameworks[framework]}\n\nProblem:\n{problem}"
}]
)
thinking = ""
answer = ""
for block in response.content:
if block.type == "thinking":
thinking = block.thinking
elif block.type == "text":
answer = block.text
return {"thinking": thinking, "answer": answer}
Code Review with Thinking
I like this pattern, but I'm cautious about it: it's easy to get false confidence from a "thorough-sounding" review. I treat it as a starting point, not a substitute for tests or careful reading.
def review_code(code: str) -> dict:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 8000},
messages=[{
"role": "user",
"content": f"""Review this code for:
1. Correctness - Does it work as intended?
2. Bugs - Any potential issues?
3. Performance - Can it be optimized?
4. Security - Any vulnerabilities?
5. Style - Does it follow best practices?
Code:
```
{code}
```
Provide specific line references and suggestions."""
}]
)
result = {"thinking": "", "review": ""}
for block in response.content:
if block.type == "thinking":
result["thinking"] = block.thinking
elif block.type == "text":
result["review"] = block.text
return result
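A quick usage example; the inline comment is what I'd expect a review to flag, not a guaranteed output:

buggy = """
def get_user(users, user_id):
    for u in users:
        if u.id == user_id:
            return u
"""
result = review_code(buggy)
print(result["review"])  # should flag the implicit None return when no user matches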
4. Thinking Budget Strategies
Adaptive Budgets
I've experimented with "auto budgets" like this and I still don't know if it's always worth the extra classifier call. But when costs matter, it can be a reasonable compromise: most prompts don't need 10k thinking tokens, and a small classifier pass can prevent waste.
def query_with_adaptive_thinking(question: str, complexity: str = "auto") -> str:
budgets = {
"simple": 2000,
"medium": 5000,
"complex": 10000,
"very_complex": 15000
}
if complexity == "auto":
# Estimate complexity based on question
classifier = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=50,
messages=[{
"role": "user",
"content": f"Rate the reasoning complexity (simple/medium/complex/very_complex): {question}"
}]
)
complexity = classifier.content[0].text.strip().lower()
if complexity not in budgets:
complexity = "medium"
budget = budgets.get(complexity, 5000)
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": budget},
messages=[{"role": "user", "content": question}]
)
    return next((b.text for b in response.content if b.type == "text"), "")
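In use, the "auto" path costs one extra (cheap, 50-token) call before the real one, so I only reach for it where question complexity genuinely varies:

# Likely classified "simple": small budget, minimal waste
print(query_with_adaptive_thinking("What's 15% of 80?"))
# Likely classified "complex" or higher: gets a bigger budget automatically
print(query_with_adaptive_thinking(
    "Design a rate limiter for a multi-region API and compare token bucket "
    "vs. sliding window approaches under clock skew."
))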
5. Thinking Visibility Control
When to Show Thinking
This is a subtle product decision. For internal tooling, I often want thinking visible for debugging. For end-user apps, I usually default to hiding it unless there's clear value and a clear policy.
def query_with_thinking_control(question: str, show_thinking: bool = False) -> str | dict:
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=16000,
thinking={"type": "enabled", "budget_tokens": 5000},
messages=[{"role": "user", "content": question}]
)
thinking = ""
answer = ""
for block in response.content:
if block.type == "thinking":
thinking = block.thinking
elif block.type == "text":
answer = block.text
if show_thinking:
return {"thinking": thinking, "answer": answer}
return answer
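One edge case worth handling if you ever surface thinking: the API can return redacted_thinking blocks, where the reasoning was flagged by safety systems and comes back encrypted in a data field instead of as readable text. A small hedged extension of the parsing loop above:

thinking, answer = "", ""
for block in response.content:
    if block.type == "thinking":
        thinking = block.thinking
    elif block.type == "redacted_thinking":
        # Encrypted content; never show it raw to users, but pass the block
        # back unmodified if you continue the conversation
        thinking = "[thinking redacted]"
    elif block.type == "text":
        answer = block.text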
Summary
| Concept | Description |
|---|---|
| Basic Thinking | Enable reasoning transparency with thinking parameter |
| Budget Control | Allocate tokens for thinking process |
| Tool Integration | Improved tool selection with reasoning |
| Structured Analysis | Framework-based problem solving |
| Adaptive Budgets | Match complexity to thinking allocation |
In the final post, I'll pull together a handful of "advanced" topics that feel like the real work of shipping: tuning, document generation, observability, tool evaluation, and even prompt patterns for UI output.