Introduction
For the last post in this series, I wanted to collect the things that feel "production-shaped" but don't fit neatly into a single pattern: customization, document generation, cost tracking, evaluation, and the oddly important craft of getting the model to output UI that doesn't look generic.
Some of these topics are optional (you can ship a lot without fine-tuning), but I've found that observability and evaluation stop being optional the moment anything important depends on the agent.
1. Fine-tuning on Amazon Bedrock
Location: finetuning/finetuning_on_bedrock.ipynb
Fine-tune Claude 3 Haiku for specialized tasks.
I'm a bit cautious about fine-tuning as a first move. In my experience, teams often reach for it when they really need better data formatting, tighter constraints, or retrieval. But when you do need consistent style or domain behavior, a small fine-tune can be a clean solution.
Data Preparation
import json

def prepare_training_data(examples: list[dict]) -> str:
    """Convert examples to JSONL format for Bedrock."""
    lines = []
    for ex in examples:
        record = {
            "system": ex.get("system", ""),
            "messages": [
                {"role": "user", "content": ex["input"]},
                {"role": "assistant", "content": ex["output"]}
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)
# Example: Customer support fine-tuning data
examples = [
    {
        "system": "You are a helpful customer support agent for TechCorp.",
        "input": "My device won't turn on",
        "output": "I understand your device isn't powering on. Let's troubleshoot: 1) Check if it's charged - connect to power for 15 minutes. 2) Try a hard reset - hold power button for 10 seconds. 3) Check the charging cable and adapter. Did any of these help?"
    },
    # ... more examples
]

training_data = prepare_training_data(examples)
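The training job below reads its data from S3, so the JSONL has to be uploaded first. A minimal sketch using boto3; the bucket name and key are placeholders and should match the trainingDataConfig in the job definition:

import boto3

# Upload the prepared JSONL so the customization job can read it.
# Bucket and key are placeholders - keep them in sync with trainingDataConfig below.
s3 = boto3.client("s3", region_name="us-west-2")
s3.put_object(
    Bucket="my-bucket",
    Key="training-data.jsonl",
    Body=training_data.encode("utf-8"),
)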
Training Job
import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")

response = bedrock.create_model_customization_job(
    jobName="techcorp-support-v1",
    customModelName="techcorp-support-haiku",
    roleArn="arn:aws:iam::...:role/BedrockFineTuningRole",
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0",
    trainingDataConfig={
        "s3Uri": "s3://my-bucket/training-data.jsonl"
    },
    outputDataConfig={
        "s3Uri": "s3://my-bucket/output/"
    },
    hyperParameters={
        "epochCount": "2",
        "batchSize": "8",
        "learningRate": "0.00001"
    }
)

job_arn = response["jobArn"]
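Customization jobs take a while, so it helps to poll for completion before trying to use the custom model. A minimal polling sketch using the job ARN from above:

import time

# Poll until the customization job finishes
# (typical statuses: InProgress, Completed, Failed, Stopped).
while True:
    job = bedrock.get_model_customization_job(jobIdentifier=job_arn)
    status = job["status"]
    print(f"Job status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)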
2. Skills API for Document Generation
Location: skills/
Claude's Skills API generates Excel, PowerPoint, and PDF files directly.
This is one of those features that feels like a gimmick until you need it. If your workflow produces reports, dashboards, or slide decks, generating the file directly can remove a whole layer of "LLM output → templating → rendering" glue.
Basic Document Generation
from anthropic import Anthropic

client = Anthropic()

# Generate an Excel file
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    skills=["excel_generation"],
    messages=[{
        "role": "user",
        "content": """Create an Excel file with:
Sheet 1 "Sales":
- Columns: Date, Product, Units, Revenue
- 10 rows of sample sales data
Sheet 2 "Summary":
- Total revenue
- Average units per sale
- Best selling product"""
    }]
)

# Response includes file data
for block in response.content:
    if block.type == "file":
        with open("sales_report.xlsx", "wb") as f:
            f.write(block.data)
Financial Dashboard
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=8192,
    skills=["excel_generation"],
    messages=[{
        "role": "user",
        "content": """Create a financial dashboard Excel file:
1. "Input" sheet:
- Revenue, Expenses, Investments for Q1-Q4
- Use input cells highlighted in yellow
2. "Dashboard" sheet:
- Calculate: Net Income, Profit Margin, YoY Growth
- Create charts: Revenue trend, Expense breakdown pie chart
- Conditional formatting: Red if negative, Green if positive
3. "Projections" sheet:
- Project next year based on growth rates
- Include best/worst/expected scenarios"""
    }]
)
3. Observability & Cost Management
Location: observability/usage_cost_api.ipynb
Monitor usage and costs programmatically.
If I had to pick one "advanced topic" that I think should be basic, it's this. It's hard to improve an agent system if you don't know where tokens are going or which workflows are exploding costs.
Usage API
from anthropic import Anthropic

client = Anthropic()

# Get usage statistics
usage = client.admin.usage.retrieve(
    start_date="2025-01-01",
    end_date="2025-01-31"
)

print(f"Total requests: {usage.total_requests}")
print(f"Total input tokens: {usage.total_input_tokens}")
print(f"Total output tokens: {usage.total_output_tokens}")
print(f"Estimated cost: ${usage.estimated_cost:.2f}")

# Breakdown by model
for model in usage.by_model:
    print(f" {model.name}: {model.requests} requests, ${model.cost:.2f}")
Cost Tracking Wrapper
I like having a wrapper like this during development because it makes cost visible at the exact moment you introduce a change. Otherwise, costs show up later as a surprise in a dashboard.
from dataclasses import dataclass

@dataclass
class UsageTracker:
    input_tokens: int = 0
    output_tokens: int = 0
    requests: int = 0

    # Pricing per 1M tokens (example rates)
    INPUT_COST_PER_M = 3.00    # Sonnet input
    OUTPUT_COST_PER_M = 15.00  # Sonnet output

    def record(self, response):
        self.input_tokens += response.usage.input_tokens
        self.output_tokens += response.usage.output_tokens
        self.requests += 1

    @property
    def estimated_cost(self) -> float:
        input_cost = (self.input_tokens / 1_000_000) * self.INPUT_COST_PER_M
        output_cost = (self.output_tokens / 1_000_000) * self.OUTPUT_COST_PER_M
        return input_cost + output_cost

    def report(self) -> str:
        return f"""Usage Report:
Requests: {self.requests}
Input tokens: {self.input_tokens:,}
Output tokens: {self.output_tokens:,}
Estimated cost: ${self.estimated_cost:.4f}"""

tracker = UsageTracker()

# Wrap API calls
response = client.messages.create(...)
tracker.record(response)
print(tracker.report())
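To avoid having to remember tracker.record() after every call, I sometimes wrap the call itself. A minimal sketch; the helper name and the example prompt are mine, not from the notebook:

def tracked_create(tracker: UsageTracker, **kwargs):
    """Call the Messages API and record token usage in one place."""
    response = client.messages.create(**kwargs)
    tracker.record(response)
    return response

# Hypothetical usage
response = tracked_create(
    tracker,
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize our Q3 support tickets."}],
)
print(tracker.report())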
4. Tool Evaluation Framework
Location: tool_evaluation/tool_evaluation.ipynb
Systematically evaluate agent tool use performance.
I like this section because it makes evaluation concrete. If you're using tools, you can test: did the agent pick the right tool, pass the right args, and return something that matches expectations?
A common mistake is only evaluating "final answer quality." For tool-based agents, the tool call sequence is often the real product behavior.
from dataclasses import dataclass

@dataclass
class ToolEvalCase:
    query: str
    expected_tools: list[str]
    expected_args: dict[str, dict]  # tool_name -> expected args
    expected_result_contains: str

def evaluate_tool_use(agent, test_cases: list[ToolEvalCase]) -> dict:
    results = {
        "total": len(test_cases),
        "passed": 0,
        "failed": 0,
        "details": []
    }

    for case in test_cases:
        # Run the agent, capturing its tool calls and final response
        tool_calls = []
        response = None
        # ... agent execution with tool call logging ...

        # Evaluate: right tools, right arguments, expected content in the result
        tools_correct = {tc["name"] for tc in tool_calls} == set(case.expected_tools)
        args_correct = all(
            tool_calls_match_expected(tc, case.expected_args.get(tc["name"], {}))
            for tc in tool_calls
        )
        result_correct = case.expected_result_contains in str(response)

        passed = tools_correct and args_correct and result_correct
        if passed:
            results["passed"] += 1
        else:
            results["failed"] += 1

        results["details"].append({
            "query": case.query,
            "passed": passed,
            "tools_correct": tools_correct,
            "args_correct": args_correct,
            "result_correct": result_correct
        })

    results["accuracy"] = results["passed"] / results["total"]
    return results
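What the elided "agent execution" step looks like depends on your agent loop, but one common way to capture tool calls is to pull the tool_use blocks out of each Messages API response. A sketch, plus a hypothetical test case; get_weather and my_agent are placeholders for illustration, not code from the notebook:

def extract_tool_calls(response) -> list[dict]:
    """Collect tool_use blocks from a Messages API response."""
    return [
        {"name": block.name, "args": block.input}
        for block in response.content
        if block.type == "tool_use"
    ]

# Hypothetical test case - tool name and arguments are illustrative only
test_cases = [
    ToolEvalCase(
        query="What's the weather in Paris?",
        expected_tools=["get_weather"],
        expected_args={"get_weather": {"city": "Paris"}},
        expected_result_contains="Paris",
    )
]

# Once the execution step is filled in, running the eval looks like:
# results = evaluate_tool_use(my_agent, test_cases)
# print(f"Accuracy: {results['accuracy']:.0%}")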
5. Frontend Aesthetics Prompting
Location: coding/prompting_for_frontend_aesthetics.ipynb
Guide Claude toward distinctive, polished UI designs.
This is the "soft" topic that turns out to be surprisingly practical. If you've ever asked a model for UI code and got the same generic SaaS card layout, you know why this matters. The model isn't wrong—it's just optimizing for the most common pattern it has seen.
I've had better luck when I'm explicit about constraints (palette size, typography pairing, spacing system) and when I ask for micro-interactions. Otherwise the output tends to look like a template.
Aesthetic System Prompts
aesthetics_prompt = """You are a senior frontend developer with a strong design background.
Design Principles:
1. **Distinctive**: Avoid generic Bootstrap/Material look. Create memorable interfaces.
2. **Cohesive**: Consistent color palette, typography, and spacing.
3. **Purposeful**: Every element should serve a function.
4. **Refined**: Attention to micro-interactions, transitions, and polish.
Style Guidelines:
- Use a curated color palette (provide 5-7 colors max)
- Typography: One serif + one sans-serif, with clear hierarchy
- Spacing: Consistent 4px/8px grid system
- Animations: Subtle, purposeful (150-300ms duration)
- Shadows: Soft, natural-looking
- Borders: Minimal, use spacing and color for separation
When generating code:
- Include CSS custom properties for theming
- Add hover/focus states
- Consider dark mode
- Make it responsive"""
Specific UI Components
def generate_ui_component(component_type: str, requirements: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=aesthetics_prompt,
        messages=[{
            "role": "user",
            "content": f"""Create a {component_type} component.
Requirements:
{requirements}
Style inspiration: Modern SaaS dashboard, clean and professional with a touch of personality.
Provide:
1. React/TypeScript component code
2. Tailwind CSS or CSS-in-JS styles
3. Example usage"""
        }]
    )
    return response.content[0].text
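Tying back to the earlier point about explicit constraints: I get noticeably less generic output when the requirements spell out palette, type pairing, spacing, and micro-interactions instead of leaving them implicit. A hypothetical call; the specific constraint values are just examples:

# Hypothetical usage - constraint values are illustrative, not prescriptive
pricing_card = generate_ui_component(
    component_type="pricing card",
    requirements="""- Palette: deep navy background, warm off-white text, single coral accent
- Typography: Fraunces for headings, Inter for body, 1.25 type scale
- Spacing: 8px grid, generous padding inside the card
- Micro-interactions: 200ms lift-and-shadow on hover, visible focus ring on the CTA""",
)
print(pricing_card)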
Series Conclusion
Over 8 posts, we covered:
| Post | Topics |
|---|---|
| 1. Overview | Classification, Summarization, Text-to-SQL |
| 2. RAG | Retrieval, Contextual Embeddings, Reranking |
| 3. Tool Use | Function calling, Pydantic, Memory |
| 4. Multimodal | Vision, Charts, OCR, Crop Tool |
| 5. Agent Patterns | Chaining, Routing, Orchestration |
| 6. Agent SDK | Production agents, MCP integration |
| 7. Extended Thinking | Complex reasoning, Problem solving |
| 8. Advanced | Fine-tuning, Skills, Observability |
If I had to summarize what stuck with me: the best "agent" systems aren't the ones with the most autonomy—they're the ones with the most structure. Chaining, routing, eval loops, observability, and a little discipline around tools tend to beat vague "be smart" prompts.
The Claude Cookbooks repository continues to grow, and I suspect the most valuable additions over time will be the unglamorous ones: evaluation harnesses, debugging workflows, and practical production patterns.