Introduction
For the last post in this series, I wanted to collect the things that feel "production-shaped" but don't fit neatly into a single pattern: customization, document generation, cost tracking, evaluation, and the oddly important craft of getting the model to output UI that doesn't look generic.
Some of these topics are optional (you can ship a lot without fine-tuning), but I've found that observability and evaluation stop being optional the moment anything important depends on the agent.
1. Fine-tuning on Amazon Bedrock
Location: finetuning/finetuning_on_bedrock.ipynb
Fine-tune Claude 3 Haiku for specialized tasks.
I'm a bit cautious about fine-tuning as a first move. In my experience, teams often reach for it when they really need better data formatting, tighter constraints, or retrieval. But when you do need consistent style or domain behavior, a small fine-tune can be a clean solution.
Data Preparation
import json

def prepare_training_data(examples: list[dict]) -> str:
    """Convert examples to JSONL format for Bedrock."""
    lines = []
    for ex in examples:
        record = {
            "system": ex.get("system", ""),
            "messages": [
                {"role": "user", "content": ex["input"]},
                {"role": "assistant", "content": ex["output"]}
            ]
        }
        lines.append(json.dumps(record))
    return "\n".join(lines)
# Example: Customer support fine-tuning data
examples = [
    {
        "system": "You are a helpful customer support agent for TechCorp.",
        "input": "My device won't turn on",
        "output": "I understand your device isn't powering on. Let's troubleshoot: 1) Check if it's charged - connect to power for 15 minutes. 2) Try a hard reset - hold power button for 10 seconds. 3) Check the charging cable and adapter. Did any of these help?"
    },
    # ... more examples
]

training_data = prepare_training_data(examples)
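The training job below reads its data from S3, so the JSONL has to be uploaded first. A minimal sketch using boto3; the bucket name and key are placeholders and should match the trainingDataConfig in the job definition:

import boto3

# Upload the prepared JSONL so the customization job can read it.
# Bucket and key are placeholders - keep them in sync with trainingDataConfig below.
s3 = boto3.client("s3", region_name="us-west-2")
s3.put_object(
    Bucket="my-bucket",
    Key="training-data.jsonl",
    Body=training_data.encode("utf-8"),
)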
Training Job
import boto3

bedrock = boto3.client("bedrock", region_name="us-west-2")

response = bedrock.create_model_customization_job(
    jobName="techcorp-support-v1",
    customModelName="techcorp-support-haiku",
    roleArn="arn:aws:iam::...:role/BedrockFineTuningRole",
    baseModelIdentifier="anthropic.claude-3-haiku-20240307-v1:0",
    trainingDataConfig={
        "s3Uri": "s3://my-bucket/training-data.jsonl"
    },
    outputDataConfig={
        "s3Uri": "s3://my-bucket/output/"
    },
    hyperParameters={
        "epochCount": "2",
        "batchSize": "8",
        "learningRate": "0.00001"
    }
)

job_arn = response["jobArn"]
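Customization jobs take a while, so it helps to poll for completion before trying to use the custom model. A minimal polling sketch using the job ARN from above:

import time

# Poll until the customization job finishes
# (typical statuses: InProgress, Completed, Failed, Stopped).
while True:
    job = bedrock.get_model_customization_job(jobIdentifier=job_arn)
    status = job["status"]
    print(f"Job status: {status}")
    if status in ("Completed", "Failed", "Stopped"):
        break
    time.sleep(60)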
2. Skills API for Document Generation
Location: skills/
Claude's Skills API generates Excel, PowerPoint, and PDF files directly.
This is one of those features that feels like a gimmick until you need it. If your workflow produces reports, dashboards, or slide decks, generating the file directly can remove a whole layer of "LLM output → templating → rendering" glue.
Basic Document Generation
from anthropic import Anthropic

client = Anthropic()

# Generate an Excel file
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    skills=["excel_generation"],
    messages=[{
        "role": "user",
        "content": """Create an Excel file with:
Sheet 1 "Sales":
- Columns: Date, Product, Units, Revenue
- 10 rows of sample sales data
Sheet 2 "Summary":
- Total revenue
- Average units per sale
- Best selling product"""
    }]
)

# Response includes file data
for block in response.content:
    if block.type == "file":
        with open("sales_report.xlsx", "wb") as f:
            f.write(block.data)
Financial Dashboard
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=8192,
    skills=["excel_generation"],
    messages=[{
        "role": "user",
        "content": """Create a financial dashboard Excel file:
1. "Input" sheet:
- Revenue, Expenses, Investments for Q1-Q4
- Use input cells highlighted in yellow
2. "Dashboard" sheet:
- Calculate: Net Income, Profit Margin, YoY Growth
- Create charts: Revenue trend, Expense breakdown pie chart
- Conditional formatting: Red if negative, Green if positive
3. "Projections" sheet:
- Project next year based on growth rates
- Include best/worst/expected scenarios"""
    }]
)
3. Observability & Cost Management
Location: observability/usage_cost_api.ipynb
Monitor usage and costs programmatically.
If I had to pick one "advanced topic" that I think should be basic, it's this. It's hard to improve an agent system if you don't know where tokens are going or which workflows are exploding costs.
Usage API
from anthropic import Anthropic

client = Anthropic()

# Get usage statistics
usage = client.admin.usage.retrieve(
    start_date="2025-01-01",
    end_date="2025-01-31"
)

print(f"Total requests: {usage.total_requests}")
print(f"Total input tokens: {usage.total_input_tokens}")
print(f"Total output tokens: {usage.total_output_tokens}")
print(f"Estimated cost: ${usage.estimated_cost:.2f}")

# Breakdown by model
for model in usage.by_model:
    print(f" {model.name}: {model.requests} requests, ${model.cost:.2f}")
Cost Tracking Wrapper
I like having a wrapper like this during development because it makes cost visible at the exact moment you introduce a change. Otherwise, costs show up later as a surprise in a dashboard.
from dataclasses import dataclass

@dataclass
class UsageTracker:
    input_tokens: int = 0
    output_tokens: int = 0
    requests: int = 0

    # Pricing per 1M tokens (example rates)
    INPUT_COST_PER_M = 3.00    # Sonnet input
    OUTPUT_COST_PER_M = 15.00  # Sonnet output

    def record(self, response):
        self.input_tokens += response.usage.input_tokens
        self.output_tokens += response.usage.output_tokens
        self.requests += 1

    @property
    def estimated_cost(self) -> float:
        input_cost = (self.input_tokens / 1_000_000) * self.INPUT_COST_PER_M
        output_cost = (self.output_tokens / 1_000_000) * self.OUTPUT_COST_PER_M
        return input_cost + output_cost

    def report(self) -> str:
        return f"""Usage Report:
Requests: {self.requests}
Input tokens: {self.input_tokens:,}
Output tokens: {self.output_tokens:,}
Estimated cost: ${self.estimated_cost:.4f}"""

tracker = UsageTracker()

# Wrap API calls
response = client.messages.create(...)
tracker.record(response)
print(tracker.report())
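To avoid having to remember tracker.record() after every call, I sometimes wrap the call itself. A minimal sketch; the helper name and the example prompt are mine, not from the notebook:

def tracked_create(tracker: UsageTracker, **kwargs):
    """Call the Messages API and record token usage in one place."""
    response = client.messages.create(**kwargs)
    tracker.record(response)
    return response

# Hypothetical usage
response = tracked_create(
    tracker,
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize our Q3 support tickets."}],
)
print(tracker.report())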
4. Tool Evaluation Framework
Location: tool_evaluation/tool_evaluation.ipynb
Systematically evaluate agent tool use performance.
I like this section because it makes evaluation concrete. If you're using tools, you can test: did the agent pick the right tool, pass the right args, and return something that matches expectations?
A common mistake is only evaluating "final answer quality." For tool-based agents, the tool call sequence is often the real product behavior.
from dataclasses import dataclass

@dataclass
class ToolEvalCase:
    query: str
    expected_tools: list[str]
    expected_args: dict[str, dict]  # tool_name -> expected args
    expected_result_contains: str

def evaluate_tool_use(agent, test_cases: list[ToolEvalCase]) -> dict:
    results = {
        "total": len(test_cases),
        "passed": 0,
        "failed": 0,
        "details": []
    }

    for case in test_cases:
        # Run the agent, capturing its tool calls and final response
        tool_calls = []
        response = None
        # ... agent execution with tool call logging ...

        # Evaluate: right tools, right arguments, expected content in the result
        tools_correct = {tc["name"] for tc in tool_calls} == set(case.expected_tools)
        args_correct = all(
            tool_calls_match_expected(tc, case.expected_args.get(tc["name"], {}))
            for tc in tool_calls
        )
        result_correct = case.expected_result_contains in str(response)

        passed = tools_correct and args_correct and result_correct
        if passed:
            results["passed"] += 1
        else:
            results["failed"] += 1

        results["details"].append({
            "query": case.query,
            "passed": passed,
            "tools_correct": tools_correct,
            "args_correct": args_correct,
            "result_correct": result_correct
        })

    results["accuracy"] = results["passed"] / results["total"]
    return results
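What the elided "agent execution" step looks like depends on your agent loop, but one common way to capture tool calls is to pull the tool_use blocks out of each Messages API response. A sketch, plus a hypothetical test case; get_weather and my_agent are placeholders for illustration, not code from the notebook:

def extract_tool_calls(response) -> list[dict]:
    """Collect tool_use blocks from a Messages API response."""
    return [
        {"name": block.name, "args": block.input}
        for block in response.content
        if block.type == "tool_use"
    ]

# Hypothetical test case - tool name and arguments are illustrative only
test_cases = [
    ToolEvalCase(
        query="What's the weather in Paris?",
        expected_tools=["get_weather"],
        expected_args={"get_weather": {"city": "Paris"}},
        expected_result_contains="Paris",
    )
]

# Once the execution step is filled in, running the eval looks like:
# results = evaluate_tool_use(my_agent, test_cases)
# print(f"Accuracy: {results['accuracy']:.0%}")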
5. Frontend Aesthetics Prompting
Location: coding/prompting_for_frontend_aesthetics.ipynb
Guide Claude toward distinctive, polished UI designs.
This is the "soft" topic that turns out to be surprisingly practical. If you've ever asked a model for UI code and got the same generic SaaS card layout, you know why this matters. The model isn't wrong—it's just optimizing for the most common pattern it has seen.
I've had better luck when I'm explicit about constraints (palette size, typography pairing, spacing system) and when I ask for micro-interactions. Otherwise the output tends to look like a template.
Aesthetic System Prompts
aesthetics_prompt = """You are a senior frontend developer with a strong design background.
Design Principles:
1. **Distinctive**: Avoid generic Bootstrap/Material look. Create memorable interfaces.
2. **Cohesive**: Consistent color palette, typography, and spacing.
3. **Purposeful**: Every element should serve a function.
4. **Refined**: Attention to micro-interactions, transitions, and polish.
Style Guidelines:
- Use a curated color palette (provide 5-7 colors max)
- Typography: One serif + one sans-serif, with clear hierarchy
- Spacing: Consistent 4px/8px grid system
- Animations: Subtle, purposeful (150-300ms duration)
- Shadows: Soft, natural-looking
- Borders: Minimal, use spacing and color for separation
When generating code:
- Include CSS custom properties for theming
- Add hover/focus states
- Consider dark mode
- Make it responsive"""
Specific UI Components
def generate_ui_component(component_type: str, requirements: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        system=aesthetics_prompt,
        messages=[{
            "role": "user",
            "content": f"""Create a {component_type} component.
Requirements:
{requirements}
Style inspiration: Modern SaaS dashboard, clean and professional with a touch of personality.
Provide:
1. React/TypeScript component code
2. Tailwind CSS or CSS-in-JS styles
3. Example usage"""
        }]
    )
    return response.content[0].text
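Tying back to the earlier point about explicit constraints: I get noticeably less generic output when the requirements spell out palette, type pairing, spacing, and micro-interactions instead of leaving them implicit. A hypothetical call; the specific constraint values are just examples:

# Hypothetical usage - constraint values are illustrative, not prescriptive
pricing_card = generate_ui_component(
    component_type="pricing card",
    requirements="""- Palette: deep navy background, warm off-white text, single coral accent
- Typography: Fraunces for headings, Inter for body, 1.25 type scale
- Spacing: 8px grid, generous padding inside the card
- Micro-interactions: 200ms lift-and-shadow on hover, visible focus ring on the CTA""",
)
print(pricing_card)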
Series Conclusion
Over 8 posts, we covered:
| Post | Topics |
|---|---|
| 1. Overview | Classification, Summarization, Text-to-SQL |
| 2. RAG | Retrieval, Contextual Embeddings, Reranking |
| 3. Tool Use | Function calling, Pydantic, Memory |
| 4. Multimodal | Vision, Charts, OCR, Crop Tool |
| 5. Agent Patterns | Chaining, Routing, Orchestration |
| 6. Agent SDK | Production agents, MCP integration |
| 7. Extended Thinking | Complex reasoning, Problem solving |
| 8. Advanced | Fine-tuning, Skills, Observability |
If I had to summarize what stuck with me: the best "agent" systems aren't the ones with the most autonomy—they're the ones with the most structure. Chaining, routing, eval loops, observability, and a little discipline around tools tend to beat vague "be smart" prompts.
The Claude Cookbooks repository continues to grow, and I suspect the most valuable additions over time will be the unglamorous ones: evaluation harnesses, debugging workflows, and practical production patterns.