Part 3 | The API Layer | Chapter 9

Multi-Turn Conversations

The API doesn't remember you. Every request is a blank slate — and that's not a limitation, it's the most important architectural decision you'll make.

The trap is assuming the API maintains state. Many developers arrive from the chat interface, where you type a message and Claude remembers everything you've said, and they carry that mental model into the API expecting the same behavior. It doesn't work that way. Every API call is independent. Claude receives the messages in the request, generates a response, and retains nothing afterward. There is no session. There is no memory. There is no hidden thread linking your requests together.

This is not a bug. It's a design decision that gives you complete control — and complete responsibility — over what Claude knows at any given moment.

The API doesn't forget your conversation. It never knew your conversation existed. You are the memory.

I've seen teams build chat products on the Claude API and wonder why the assistant "forgets" what the user said two messages ago. The answer is always the same: they're sending only the latest message. Claude processes that message in isolation, generates a coherent response to that single input, and the team interprets the lack of continuity as a model failure. It's an integration failure. The model is doing exactly what it was asked to do — respond to the only message it received.

Understanding statelessness is the prerequisite for building anything conversational. Once you internalize it, everything else in this chapter — message arrays, conversation history, context management — becomes obvious engineering rather than mysterious API behavior.

The good news: once you understand what's happening, the implementation is straightforward. The API gives you a clean primitive — a message list — and you build everything on top of it. No proprietary session management. No hidden state you can't inspect. Just data structures you own and control completely.

How Multi-Turn Actually Works

Multi-turn conversations in the Claude API work through a simple mechanism: you send the entire conversation history with every request. Not a summary. Not a session ID. The actual messages, in order, from the beginning.

Here's what a three-turn conversation looks like from the API's perspective:

import anthropic

client = anthropic.Anthropic()

# Turn 1: Just one message
response_1 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "I'm building a REST API in Flask. What's the cleanest way to handle authentication?"}
    ]
)

# Turn 2: Include Turn 1's exchange + new message
response_2 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "I'm building a REST API in Flask. What's the cleanest way to handle authentication?"},
        {"role": "assistant", "content": response_1.content[0].text},
        {"role": "user", "content": "I like the JWT approach. Show me the middleware implementation."}
    ]
)

# Turn 3: Include Turns 1+2 + new message
response_3 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "I'm building a REST API in Flask. What's the cleanest way to handle authentication?"},
        {"role": "assistant", "content": response_1.content[0].text},
        {"role": "user", "content": "I like the JWT approach. Show me the middleware implementation."},
        {"role": "assistant", "content": response_2.content[0].text},
        {"role": "user", "content": "How do I handle token refresh without logging the user out?"}
    ]
)

Each request is self-contained. Turn 3 contains all five messages: the two earlier user messages, Claude's two replies, and the new question. Claude reads the entire history, understands the context (Flask, JWT, middleware already discussed), and generates a response that builds on everything that came before.

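The pattern is: user, assistant, user, assistant, user. Messages alternate between roles, and an ordinary chat request ends with the user message you want answered. Treat this structure as a requirement rather than a convention: a request that breaks it may be rejected, and where consecutive same-role messages are accepted they are merged into a single turn, which is rarely what you meant.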

Framework: The Conversation Ledger

Your application is the accountant. Every user message gets appended to the ledger. Every assistant response gets appended to the ledger. Every API call sends the full ledger. If the ledger is wrong, the conversation is wrong — Claude has no independent record to fall back on.

Building a Conversation Loop

The three-turn example above is explicit but impractical. In a real application, you maintain a list and append to it dynamically:

import anthropic

client = anthropic.Anthropic()

conversation_history = []

def chat(user_message: str) -> str:
    conversation_history.append({
        "role": "user",
        "content": user_message
    })

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        system="You are a senior software architect. Be concise. Use code examples when they clarify your point.",
        messages=conversation_history
    )

    assistant_message = response.content[0].text

    conversation_history.append({
        "role": "assistant",
        "content": assistant_message
    })

    return assistant_message

# Interactive loop
print("Chat with Claude (type 'exit' to quit)\n")
while True:
    user_input = input("You: ").strip()
    if user_input.lower() in ("exit", "quit"):
        break
    reply = chat(user_input)
    print(f"\nClaude: {reply}\n")

This is the skeleton of every conversational Claude application. The conversation_history list is the ledger. The chat function appends the user message, sends the full ledger, appends the response, and returns the text.

Notice the system prompt: it's set as a separate parameter, not inside the messages array. This is important. The system prompt exists outside the conversation history. It doesn't need to alternate with user/assistant messages, it doesn't grow with the conversation, and it applies consistently on every turn. Keep it stable. Change it only when the assistant's role genuinely needs to shift — which, in most applications, is never.

Also notice that the chat function always appends both sides of the exchange. If you forget to append Claude's response, the next turn's message array will have two consecutive user messages and no record of what Claude said. Whether the API rejects that request or merges the two user messages into one turn, the ledger is wrong. Keep the alternation (user, assistant, user, assistant) intact in your own data.

Every call sends the entire history. On turn 1, you send one message. On turn 10, you send nineteen messages: ten user messages, including the new one, and the nine assistant replies so far. On turn 50, you send ninety-nine messages. The payload of each request grows linearly with the number of turns.
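The arithmetic behind those counts is easy to check. The following is a minimal sketch, assuming every turn consists of exactly one user message and one assistant reply; `messages_sent_on_turn` and `total_messages_sent` are illustrative names, not part of any SDK.

```python
def messages_sent_on_turn(turn: int) -> int:
    """Number of messages in the request for a 1-indexed turn.

    The request carries `turn` user messages (including the new one)
    and `turn - 1` earlier assistant replies.
    """
    if turn < 1:
        raise ValueError("turn must be >= 1")
    return 2 * turn - 1


def total_messages_sent(turns: int) -> int:
    """Messages transmitted across every request from turn 1 to `turns`."""
    return sum(messages_sent_on_turn(t) for t in range(1, turns + 1))


for turn in (1, 10, 50):
    print(f"turn {turn}: {messages_sent_on_turn(turn)} messages in the request")

print(f"all 50 requests combined: {total_messages_sent(50)} messages")
```

Each individual request grows linearly, but the combined traffic over a whole conversation grows with the square of the turn count, because every earlier message is resent on every later turn.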

This is where most developers discover the first real constraint of multi-turn conversations: the context window.

The Context Window Is Not Infinite

Claude's context window is large — up to 200,000 tokens depending on the model. But "large" is not "infinite," and conversation history eats into that window from both directions: your messages consume input tokens, and Claude's responses consume both output tokens (when generated) and input tokens (when replayed in subsequent requests).

A practical conversation grows fast. If each exchange averages 300 tokens (150 user, 150 assistant), you hit 30,000 tokens, or 15% of a 200,000-token window, after 100 turns. That sounds like plenty until your system prompt is 2,000 tokens, the user pastes a 5,000-token document in turn 3, and Claude generates a 3,000-token code review in turn 4. Now you're past 10,000 tokens by turn 4, and the conversation hasn't even started in earnest.

Token math matters

You pay for input tokens on every request. If your conversation history is 20,000 tokens and you make ten more API calls, you're paying for 200,000+ input tokens just to maintain context — before Claude generates a single word of output. This is why conversation management is both a technical and a financial concern.
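To see how replayed history adds up, here is a rough sketch of the cumulative input-token count. It assumes fixed message sizes (the 150/150 split used earlier in this section) and that every request replays the system prompt plus the full prior history. It ignores real tokenization and any pricing features, and `cumulative_input_tokens` is a hypothetical helper name.

```python
def cumulative_input_tokens(
    turns: int,
    user_tokens: int = 150,
    assistant_tokens: int = 150,
    system_tokens: int = 0,
) -> int:
    """Total input tokens sent across `turns` requests.

    Assumption: every message has a fixed size, and each request replays
    the system prompt and the whole prior history.
    """
    total = 0
    history_tokens = 0  # tokens from earlier exchanges, replayed on each call
    for _ in range(turns):
        total += system_tokens + history_tokens + user_tokens
        history_tokens += user_tokens + assistant_tokens
    return total


# 100 turns of 150/150 exchanges, without and with a 2,000-token system prompt.
print(cumulative_input_tokens(100))
print(cumulative_input_tokens(100, system_tokens=2_000))
```

Under these assumptions, a 100-turn conversation whose final history is only about 30,000 tokens has sent 1,500,000 input tokens along the way, which is why the per-request size alone understates the cost.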

There are three strategies I've seen work in production for managing growing conversation history:

Truncation — Drop the oldest messages when the history exceeds a threshold. Simple, but you lose early context. The user says "remember I mentioned Flask earlier?" and Claude has no idea what they're talking about because those messages were pruned.

Summarization — Periodically summarize the conversation so far into a compact message and replace the history with that summary plus the recent messages. This preserves the gist while controlling token count. The tradeoff is that summaries are lossy — specific details, code snippets, and exact numbers get compressed or dropped.

Sliding window with pinned messages — Keep the system prompt and a few "pinned" early messages (the user's initial request, any critical context) plus the most recent N turns. This balances recency with foundational context. I use this pattern most often because it's predictable and easy to reason about.

The right strategy depends on your application. A customer support bot needs early context (the user's initial complaint) more than middle messages, so sliding window with pinning is ideal. A creative writing assistant benefits from summarization because the narrative arc matters but exact wording doesn't. A technical debugging session is the hardest case: you might need the exact error message from turn 3, the code snippet from turn 7, and the latest stack trace, so truncation alone would lose critical information.
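For comparison, the first two strategies can be sketched in a few lines. This is an illustration under stated assumptions, not a production implementation: messages are plain dicts with `role` and `content`, limits are counted in messages rather than tokens, and `summarize` is a hypothetical callable you supply (in practice it might be another model call).

```python
from typing import Callable


def truncate_history(history: list[dict], max_messages: int) -> list[dict]:
    """Truncation: keep only the newest messages, opening on a user turn."""
    if max_messages < 1:
        raise ValueError("max_messages must be >= 1")
    recent = history[-max_messages:]
    # Drop leading assistant messages so the list still starts with the user.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return recent


def summarize_history(
    history: list[dict],
    keep_recent: int,
    summarize: Callable[[list[dict]], str],
) -> list[dict]:
    """Summarization: fold older messages into one summary exchange."""
    if len(history) <= keep_recent:
        return history
    recent = truncate_history(history, keep_recent)
    older = history[: len(history) - len(recent)]
    summary = summarize(older)
    # The summary is stored as a user/assistant pair so roles keep alternating.
    return [
        {"role": "user", "content": f"Summary of the conversation so far: {summary}"},
        {"role": "assistant", "content": "Understood. I'll continue from that summary."},
        *recent,
    ]


history = [
    {"role": "user", "content": "I'm building a REST API in Flask."},
    {"role": "assistant", "content": "Here are three authentication options..."},
    {"role": "user", "content": "I like the JWT approach."},
    {"role": "assistant", "content": "Here is the middleware..."},
    {"role": "user", "content": "How do I handle token refresh?"},
]

truncated = truncate_history(history, max_messages=2)
summarized = summarize_history(
    history,
    keep_recent=3,
    summarize=lambda older: f"{len(older)} earlier messages about Flask auth",
)
print([m["role"] for m in truncated])
print([m["role"] for m in summarized])
```

Both helpers return a list that starts with a user message, which keeps the result safe to send. The lossy part is visible in the demo: after truncation the Flask context is gone entirely, and after summarization it survives only as whatever the summarizer chose to keep.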

Here's my sliding window implementation:

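def manage_history(history: list, max_messages: int = 20) -> list:
    """Keep the first exchange (context) and the most recent messages."""
    if max_messages < 3:
        raise ValueError("max_messages must be at least 3")
    if len(history) <= max_messages:
        return history
    # Pin the first user+assistant exchange
    pinned = history[:2]
    # Keep the most recent messages
    recent = history[-(max_messages - 2):]
    # The pinned pair ends with an assistant message, so the recent slice
    # must open with a user message or the roles stop alternating.
    if recent[0]["role"] != "user":
        recent = recent[1:]
    return pinned + recent

Key takeaway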

Statelessness is not a limitation you work around — it's a property you design for. The conversation history is your application's data structure. You control what goes in, what gets pruned, and what Claude sees on every turn. That control is the entire point.

Stateless vs. Stateful: A Demonstration

The difference between stateless and stateful behavior is the difference between a conversation and a sequence of unrelated questions. Here's a concrete demonstration:

import anthropic

client = anthropic.Anthropic()

# --- Stateless: each call knows nothing about the previous one ---
print("=== STATELESS ===")

r1 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=256,
    messages=[{"role": "user", "content": "My name is Hesham and I build automation systems."}]
)
print(f"Turn 1: {r1.content[0].text}\n")

r2 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=256,
    messages=[{"role": "user", "content": "What's my name and what do I do?"}]
)
print(f"Turn 2: {r2.content[0].text}\n")
# Claude will say it doesn't know — because it genuinely doesn't.

# --- Stateful: full history carried forward ---
print("=== STATEFUL ===")

history = []
history.append({"role": "user", "content": "My name is Hesham and I build automation systems."})

r3 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=256,
    messages=history
)
history.append({"role": "assistant", "content": r3.content[0].text})
print(f"Turn 1: {r3.content[0].text}\n")

history.append({"role": "user", "content": "What's my name and what do I do?"})

r4 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=256,
    messages=history
)
print(f"Turn 2: {r4.content[0].text}\n")
# Claude knows: Hesham, automation systems.

Run this code and the difference is stark. In the stateless version, Claude politely says it has no information about you. In the stateful version, it recalls your name and profession because that information is right there in the message history you sent.

This demonstration is worth running even if the concept seems obvious. Developers who understand statelessness intellectually still build applications that assume state, because the chat interface trained them to expect memory by default. Seeing the two side by side makes the engineering requirement concrete: you are the memory layer.

If Claude seems forgetful, the first question is not "what's wrong with the model?" — it's "what am I actually sending in the messages array?"

The System Prompt in Multi-Turn Context

A subtle point about the system prompt deserves its own section. The system prompt is not part of the messages array. It's a top-level parameter that you send with every request, and the model reads it ahead of the conversation history. This means the system prompt doesn't grow with the conversation and doesn't need to alternate with anything. It still counts toward the context window, but its cost is fixed and predictable on every turn.

The practical implication: use the system prompt for instructions that apply to every turn. Use the conversation history for context that evolves. I've seen teams stuff turn-specific instructions into the system prompt, changing it on every request. This is technically valid but architecturally messy. If Claude's behavior needs to change mid-conversation, that change should be a user message ("From now on, respond in bullet points") rather than a system prompt swap — because the user message becomes part of the history and the context makes sense across turns. A system prompt change is invisible to the conversation; Claude sees the new instructions but has no record of the old ones.

Production Patterns

Real applications need more than a growing list and a while loop. Here are the patterns I've seen work reliably in production:

Persist the history externally. The in-memory list dies when the process restarts. For any user-facing application, store the conversation history in a database — Postgres, Redis, or even a JSON file — keyed by session ID. Reload it when the user returns.
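As one possible shape for this, here is a minimal sketch using SQLite from the standard library. The table name, the function names, and the choice to store each history as a single JSON string are all assumptions made for illustration; a real application would likely want per-message rows, timestamps, and migrations.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the conversation store. ':memory:' is for demos only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS conversations ("
        "session_id TEXT PRIMARY KEY, history TEXT NOT NULL)"
    )
    return conn


def save_history(conn: sqlite3.Connection, session_id: str, history: list[dict]) -> None:
    """Overwrite the stored history for one session."""
    conn.execute(
        "INSERT OR REPLACE INTO conversations (session_id, history) VALUES (?, ?)",
        (session_id, json.dumps(history)),
    )
    conn.commit()


def load_history(conn: sqlite3.Connection, session_id: str) -> list[dict]:
    """Return the stored history, or an empty list for an unknown session."""
    row = conn.execute(
        "SELECT history FROM conversations WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []


conn = open_store()
original = [
    {"role": "user", "content": "My name is Hesham and I build automation systems."},
    {"role": "assistant", "content": "Nice to meet you, Hesham."},
]
save_history(conn, "session-123", original)
restored = load_history(conn, "session-123")
print(restored == original)
```

Passing a file path instead of `":memory:"` makes the history survive a process restart, which is the property worth testing before anything else.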

Separate the system prompt from the history. The system prompt is not a message in the history array. It's a separate parameter. This means it doesn't count toward the alternating user/assistant pattern, and it applies consistently without being replayed as part of the conversation. Keep it in your configuration, not in the message list.

Validate the message structure before sending. The conversation should alternate strictly: user, assistant, user, assistant. Two consecutive user messages or two consecutive assistant messages will either cause an error or be merged into a single turn, and both mean the history no longer matches what actually happened. If your application allows message editing or deletion, re-validate the structure before every API call.
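A pre-flight check along these lines might look like the sketch below. It encodes the rules described in this chapter (start with the user, alternate roles, end with the user for an ordinary chat turn); `history_problems` is a hypothetical helper, and it returns a list of problems instead of raising so the caller can decide how to recover.

```python
VALID_ROLES = ("user", "assistant")


def history_problems(history: list[dict]) -> list[str]:
    """Return a description of every structural problem; empty means sendable."""
    if not history:
        return ["history is empty"]

    problems = []
    for index, message in enumerate(history):
        role = message.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {index}: unknown role {role!r}")
        elif index > 0 and role == history[index - 1].get("role"):
            problems.append(f"message {index}: two consecutive {role} messages")

    if history[0].get("role") != "user":
        problems.append("first message must come from the user")
    if history[-1].get("role") != "user":
        problems.append("last message must come from the user")
    return problems


good = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
    {"role": "user", "content": "What's next?"},
]
broken = [
    {"role": "user", "content": "Hi"},
    {"role": "user", "content": "Are you there?"},
]
print(history_problems(good))
print(history_problems(broken))
```

Running the check immediately before each request means edits or deletions made anywhere else in the application cannot silently corrupt what Claude sees.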

Log token usage on every call. The response.usage object tells you exactly how many tokens each request consumed. Log it. Graph it over time. You'll spot conversations that are growing too large, prompts that are more expensive than expected, and patterns that need optimization.
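A small accumulator is often enough to start with. The sketch below reads the `input_tokens` and `output_tokens` fields from a usage object; the `UsageTotals` class and the logger name are illustrative, and the `SimpleNamespace` objects stand in for `response.usage` so the example runs without an API call.

```python
import logging
from dataclasses import dataclass
from types import SimpleNamespace

logger = logging.getLogger("claude.usage")  # hypothetical logger name


@dataclass
class UsageTotals:
    """Running totals for one conversation."""

    calls: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def record(self, usage) -> None:
        """Add one response's usage and emit a log line for it."""
        self.calls += 1
        self.input_tokens += usage.input_tokens
        self.output_tokens += usage.output_tokens
        logger.info(
            "call=%d input_tokens=%d output_tokens=%d",
            self.calls,
            usage.input_tokens,
            usage.output_tokens,
        )


totals = UsageTotals()

# Stand-ins for response.usage from two consecutive calls.
for fake_usage in (
    SimpleNamespace(input_tokens=120, output_tokens=80),
    SimpleNamespace(input_tokens=330, output_tokens=95),
):
    totals.record(fake_usage)

print(totals)
```

In the chat loop from earlier in the chapter, the equivalent call would be `totals.record(response.usage)` right after each request returns.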

Implement a conversation reset mechanism. Every long-running conversational application needs a way for the user (or the system) to start fresh. A simple "reset" command that clears the history and starts a new session prevents the inevitable context window overflow. In my projects, I add an automatic reset warning when the history exceeds 80% of the context window — before the API call fails, not after.
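One way to approximate that 80% check is to reuse the token counts from the most recent response. This is a sketch under assumptions: it treats the last request's input tokens plus the reply's output tokens as the size of the history carried into the next turn, and it uses the 200,000-token window quoted earlier, which varies by model. `should_warn_reset` is a hypothetical helper.

```python
def should_warn_reset(
    input_tokens: int,
    output_tokens: int,
    context_window: int = 200_000,
    warn_percent: int = 80,
) -> bool:
    """True when the history carried forward reaches the warning threshold.

    input_tokens:  tokens in the last request (system prompt + history sent)
    output_tokens: tokens in the reply that will be appended to the history
    """
    carried_forward = input_tokens + output_tokens
    # Integer comparison avoids floating-point surprises at the boundary.
    return carried_forward * 100 >= warn_percent * context_window


print(should_warn_reset(100_000, 2_000))   # well under the threshold
print(should_warn_reset(150_000, 10_000))  # exactly at 80%
```

Because the estimate excludes the user's next message, a slightly lower `warn_percent` leaves more headroom for large pastes.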

✕ Naive multi-turn
  • Grows without limit
  • No persistence across sessions
  • No token tracking
  • Crashes on malformed history
✓ Production multi-turn
  • Sliding window or summarization
  • Database-backed history
  • Token usage logged per call
  • History validated before send

One more production pattern worth calling out: separate conversation threads. Most applications aren't a single endless conversation. They have distinct contexts — a support ticket, a document review, a code debugging session. Each context should have its own conversation history. I've seen applications that dump all user interactions into a single thread, which means Claude is asked to review Python code with the context of an unrelated marketing discussion from an hour ago still in the history. The solution is simple: scope conversation histories to a context identifier (ticket ID, document ID, task type) and load only the relevant history for each API call.
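Scoping can be as simple as a dictionary keyed by the context identifier. The sketch below keeps everything in memory for brevity; the `ConversationThreads` class and the example IDs are hypothetical, and a real application would back this with durable storage.

```python
from collections import defaultdict


class ConversationThreads:
    """Keeps one independent message history per context identifier."""

    def __init__(self) -> None:
        self._threads: defaultdict[str, list[dict]] = defaultdict(list)

    def append(self, context_id: str, role: str, content: str) -> None:
        """Add a message to the thread for this context only."""
        self._threads[context_id].append({"role": role, "content": content})

    def history(self, context_id: str) -> list[dict]:
        """Return a copy of one thread; unknown contexts yield an empty list."""
        return list(self._threads.get(context_id, []))


threads = ConversationThreads()
threads.append("ticket-481", "user", "My invoice total looks wrong.")
threads.append("doc-review-7", "user", "Please review this Python module.")
threads.append("ticket-481", "assistant", "Let me check the line items.")

print(len(threads.history("ticket-481")))
print(len(threads.history("doc-review-7")))
```

Each API call then sends `threads.history(context_id)` for the context at hand, so the code review never inherits the invoice discussion.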

Monday-Morning Moves

Build the conversation loop from this chapter

Copy the chat() function pattern. Run it in your terminal. Have a real multi-turn conversation. Watch how Claude maintains context across turns — and understand that you are providing that context, not Claude.

Run the stateless vs. stateful demo

Execute the side-by-side comparison code. See the difference with your own eyes. This demonstration is worth more than any explanation because it makes statelessness visceral, not abstract.

Implement a history management strategy

Choose one: truncation, summarization, or sliding window with pinned messages. Implement it before your conversation history reaches 50 turns in development. The first time your API call fails with a context length error in production, you'll wish you had done this on day one.

Add token logging to every API call

Pull response.usage.input_tokens and response.usage.output_tokens from every response. Print them during development. Log them in production. This data tells you when your conversations are getting expensive and where to optimize.

Persist your history somewhere durable

Pick a storage backend — even a SQLite database or JSON file is better than in-memory only. Key it by session or user ID. Test that your application can resume a conversation after a restart.

You are the memory layer. The API provides the reasoning. If the reasoning seems off, check the memory you're feeding it before you blame the model.