Putting LLMs in Production Workflows: What Actually Works | The Workflow Engineer

Here's the trap I see teams fall into. You wire up an OpenAI node, test it on five sample records in the builder, and the output looks perfect. You switch the trigger to a schedule or webhook and walk away. Two weeks later you discover the workflow has been silently failing on edge cases, burning through tokens because there's no routing layer, and posting unmoderated hallucinations to your customers.

It didn't crash. It degraded in ways no default dashboard shows you. That's worse than a crash.

I don't rescue these workflows with better prompting. I fix them with what I call the AI Production Stack.

Framework · The AI Production Stack · router + parser + memory + escalation

Four layers that wrap around the model: the model router decides which LLM sees the input; the parser guarantees the output is machine-readable; memory keeps state across calls; escalation decides when the machine should stop and ask for help. Skip one and you're running a prototype with a schedule trigger.

Together they turn a fancy text generator into infrastructure you can sleep through.

Structured Output Is the Contract

The first layer I enforce is the parser. In a demo, you can read the model's response with your eyes, tweak the prompt, and move on. In production, the model will eventually wrap its JSON in markdown backticks, add a polite preamble like "Here is the result you requested," or hallucinate a field name. If the next node expects $json.urgency === "critical" to evaluate cleanly and instead gets a string wrapped in a code fence, your workflow doesn't fail gracefully. It fails downstream, in a different system, with a different owner, at 2 AM.

Key takeaway

Structured output is the contract between the AI and the rest of your workflow. Not a suggestion. A contract.

JSON mode is table stakes. Both OpenAI and Anthropic support constraining output to valid JSON. But JSON mode only promises syntactically valid JSON; it doesn't promise the right shape, the right keys, or the right types. You still need a schema, and you need validation against that schema.

My system prompts include the schema explicitly — field names, types, enums, and exactly what null means. For a support ticket classifier, the output isn't "a JSON object with some fields." It's a locked specification:

{
  "category": "billing" | "technical" | "account" | "general",
  "urgency": "low" | "medium" | "high" | "critical",
  "summary": "One sentence summary of the issue",
  "requires_human": true | false
}

But the model still drifts. Temperature changes, prompt injections in user data, or simply bad luck can produce malformed output. So I add a Structured Output Parser step. I often use a small, fast model like GPT-4.1 Mini for this job. The parser doesn't do creative reasoning; it enforces the contract. It takes the main model's messy output and rewrites it into compliant JSON, or it throws a validation error that routes the item to a manual review queue.
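
To make the validation half of that step concrete, here is a minimal sketch in plain JavaScript. The field names and enums come from the schema above; the function name validateTicket, the error strings, and the choice to reject unknown fields are assumptions for illustration, not part of any n8n node.

```javascript
// Hypothetical validator for the support-ticket contract shown above.
const CATEGORIES = ['billing', 'technical', 'account', 'general'];
const URGENCIES = ['low', 'medium', 'high', 'critical'];
const ALLOWED_KEYS = ['category', 'urgency', 'summary', 'requires_human'];

function validateTicket(ticket) {
  if (ticket === null || typeof ticket !== 'object' || Array.isArray(ticket)) {
    return { valid: false, errors: ['output is not a JSON object'] };
  }
  const errors = [];
  if (!CATEGORIES.includes(ticket.category)) errors.push('category is not an allowed value');
  if (!URGENCIES.includes(ticket.urgency)) errors.push('urgency is not an allowed value');
  if (typeof ticket.summary !== 'string' || ticket.summary.trim() === '') {
    errors.push('summary must be a non-empty string');
  }
  if (typeof ticket.requires_human !== 'boolean') {
    errors.push('requires_human must be a boolean');
  }
  // Assumption: hallucinated extra fields are rejected rather than ignored.
  for (const key of Object.keys(ticket)) {
    if (!ALLOWED_KEYS.includes(key)) errors.push(`unexpected field: ${key}`);
  }
  return { valid: errors.length === 0, errors };
}
```

An item that comes back with valid: false would go to the manual review queue along with its errors array, so the reviewer can see which part of the contract was broken.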

The downstream nodes never parse text. They consume fields directly. An IF node checks $json.requires_human without string splitting, without regex, without praying.

When I need a non-obvious format — like extracting calendar events where "noon" must become "12:00" and multi-day events need explicit start and end dates — I add few-shot examples to the system prompt. Two or three concrete examples embedded in the prompt typically drop format errors from roughly 15% to under 3%. That doesn't sound dramatic until you're processing ten thousand items and 15% means fifteen hundred broken executions to clean up manually.
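
One possible way to assemble such a prompt is sketched below. The helper name buildFewShotPrompt, the example inputs, and the output field names (title, start_date, end_date, start_time) are all made up for illustration; only the "noon becomes 12:00" and "explicit start and end dates" rules come from the text.

```javascript
// Hypothetical helper: append worked examples to a base system prompt.
function buildFewShotPrompt(instructions, examples) {
  const shots = examples.map(
    (ex, i) => `Example ${i + 1}\nInput: ${ex.input}\nOutput: ${JSON.stringify(ex.output)}`
  );
  return [instructions, ...shots].join('\n\n');
}

// Illustrative examples only; the field names are assumptions for this sketch.
const systemPrompt = buildFewShotPrompt(
  'Extract the calendar event as JSON. Times use 24-hour HH:MM. Always include start_date and end_date.',
  [
    {
      input: 'Lunch with Sam at noon on 2025-03-15',
      output: { title: 'Lunch with Sam', start_date: '2025-03-15', end_date: '2025-03-15', start_time: '12:00' },
    },
    {
      input: 'Offsite from 2025-04-02 through 2025-04-04',
      output: { title: 'Offsite', start_date: '2025-04-02', end_date: '2025-04-04', start_time: null },
    },
  ]
);
```

Serializing the example outputs with JSON.stringify keeps them in exactly the shape the parser expects, which is the point of showing examples at all.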

Always wrap your final JSON.parse

Even with a parser layer, wrap your final JSON.parse in a try-catch. Production means planning for the day the model ignores every instruction you gave it.
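
A minimal sketch of that last line of defense follows. The helper name safeParseModelJson and its { ok, value, error } return shape are assumptions for this example. It only handles a single top-level JSON object, and it treats everything between the first "{" and the last "}" as the candidate payload.

```javascript
// Hypothetical helper: pull a JSON object out of a model response that may be
// wrapped in a markdown code fence or surrounded by polite prose.
function safeParseModelJson(text) {
  try {
    const start = text.indexOf('{');
    const end = text.lastIndexOf('}');
    if (start === -1 || end <= start) {
      return { ok: false, error: 'no JSON object found' };
    }
    return { ok: true, value: JSON.parse(text.slice(start, end + 1)) };
  } catch (error) {
    // Covers malformed JSON and non-string input alike.
    return { ok: false, error: error.message };
  }
}
```

Returning a result object instead of throwing lets the next node branch on ok and send failures to the review queue without a separate error workflow.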

Cheap-First-Then-Smart

The second layer is the model router. Not every input needs a large, expensive model.

Framework · Cheap-first-then-smart

A fast, cheap model classifies or triages the input. Only the subset that actually needs sophisticated reasoning gets routed to the expensive model. At scale this is a 70%+ cost reduction.

For a workflow handling roughly 1,000 customer questions per day, sending everything to Claude Sonnet costs around $15 a day. Routing through GPT-4o-mini first drops that to about $4.50, roughly a 70% reduction, because most questions are simple FAQs that a template or a mini model can handle.

The structure is straightforward:

[Webhook] → [Classify: gpt-4o-mini] → [IF: is_complex] ─┬─ true ──→ [Generate: Claude Sonnet]
                                                        └─ false ─→ [Template Response]

The classification step returns a structured decision (faq, complex, or escalate) and the workflow branches on that signal. The classification prompt itself is a contract: it must return JSON with a type field, an faq_topic field that is null whenever the question isn't an FAQ, and nothing else. Because the IF node branches only on the type, a complex question that comes back with "faq_topic": null still routes correctly.
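
The branching logic can be sketched in a few lines. The three type values come from the text; the branch names, the function name routeByType, and the decision to send anything off-contract to a human are assumptions for this example.

```javascript
// Hypothetical router: map the classifier's `type` to a workflow branch.
const BRANCHES = {
  faq: 'template_response',
  complex: 'generate_with_large_model',
  escalate: 'human_review',
};

function routeByType(decision) {
  const type = decision && typeof decision === 'object' ? decision.type : undefined;
  // Own-property check so keys like "constructor" cannot match by accident.
  if (typeof type === 'string' && Object.prototype.hasOwnProperty.call(BRANCHES, type)) {
    return BRANCHES[type];
  }
  // Assumption: anything off-contract goes to a human rather than a guess.
  return 'human_review';
}
```

Defaulting unknown values to review is deliberately conservative: a misclassified item costs a reviewer a minute, while a misrouted one can reach a customer.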

This isn't premature optimization. At low volume, the savings don't matter. Above a few hundred calls per day, this pattern is the difference between an AI feature your finance team questions and one they ignore because the cost is negligible. I've seen teams burn through thousands of dollars on unnecessary Sonnet calls because they classified everything as "complex" by default.

In more advanced setups, I use a multi-model architecture where the primary model handles the creative work, a secondary model enforces structured output, and a fallback model catches failures. I built a student support agent that uses exactly this stack: Mistral Large drafts the response, GPT-4.1 Mini parses the decision schema, and Anthropic Claude waits as the fallback if Mistral fails or gets rate-limited. Each model does what it's good at.

Memory That Doesn't Evaporate

The third layer is memory. Out of the box, every LLM call is stateless. That's fine for classifying a single email or extracting entities from a form. It's useless for a conversation. If a user asks "how does that compare to last quarter?" and the workflow has no memory, the model either invents an answer or asks for context it already had. Both outcomes destroy trust.

I use three kinds of memory in production, depending on the risk profile and the cost constraints:

Session-scoped memory — the minimum viable layer. A Window Buffer Memory sub-node keeps the last N messages in a conversation thread. Default to 10 messages. The session key matters more than the window size: if you use a static ID, every user shares the same context and you leak data between unrelated conversations. Generate a random 48-character session key per thread, or bind it to something inherently unique like a Slack thread_ts.
Persistent memory — what you need when conversations span hours, days, or multiple workflow executions. Redis is my default. It survives n8n restarts, works across multiple workers in queue mode, and gives each thread an isolated memory context.
Hybrid memory — what I use when token costs bite. Older messages get compressed into a summary; only the last few are kept verbatim. For high-volume chat workflows, cap memory by tokens instead of messages, keeping as much history as fits in a fixed budget and dropping the oldest context first.
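
For the session key advice in the first item, a small sketch in Node.js might look like this. It assumes the built-in crypto module is available (self-hosted n8n may restrict built-in modules in Code nodes); the function names and the "slack:" prefix are hypothetical.

```javascript
const { randomBytes } = require('node:crypto');

// 24 random bytes encode to exactly 48 hexadecimal characters.
function newSessionKey() {
  return randomBytes(24).toString('hex');
}

// Alternative: derive the key from something already unique per conversation.
// The "slack:" prefix and argument names are assumptions for this sketch.
function slackSessionKey(channelId, threadTs) {
  return `slack:${channelId}:${threadTs}`;
}
```

Either approach works as long as the key is created once per thread and stored with it; generating a fresh random key on every message would give each message its own empty memory.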

Memory is not free

Every prior message is re-sent as prompt tokens. If you don't bound it (by window, summary, or token cap), your per-interaction cost grows linearly with conversation length until the conversation hits the model's context limit and truncation starts pushing out older content, instructions included. Bound aggressively. You can always raise the cap; explaining why the AI forgot how to follow instructions is harder.
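
A token cap can be sketched as follows. The four-characters-per-token estimate is a rough assumption for English text, not a real tokenizer, and the function names and the { role, content } message shape are hypothetical.

```javascript
// Rough token estimate: about 4 characters per token for English text.
// This is an assumption; swap in a real tokenizer for accurate budgeting.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep the newest messages that fit in `maxTokens`, dropping the oldest first.
function trimToTokenBudget(messages, maxTokens) {
  const kept = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]); // preserve chronological order
    used += cost;
  }
  return kept;
}
```

This sketch trims conversation history only. The system prompt should be kept outside the trimmed array and budgeted separately so it can never be the thing that gets dropped.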

Retry, Fallback, and the 429 Reality

Resilience wraps around all four layers. LLM APIs fail. They rate-limit. They return 429s. In a batch workflow processing 500 feedback items, hitting the rate limit isn't a risk. It's a certainty on item 87.

n8n has a built-in retry toggle, but here's the catch: it uses fixed intervals. A fixed 1,000-millisecond wait between retries will hammer the API at the exact same cadence and fail again if the limit window hasn't reset. I don't use fixed-interval retries for AI nodes.

Instead, I wrap AI calls in a Code node with exponential backoff and jitter:

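// Retry a rate-limited call with exponential backoff plus jitter.
// `callApi` stands in for your own request (HTTP Request helper, SDK call, etc.).
const MAX_ATTEMPTS = 5;
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function callWithBackoff(callApi) {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return await callApi();
    } catch (error) {
      const rateLimited = String(error.message).includes('429');
      // Give up on non-429 errors, or once the attempts are used up.
      if (!rateLimited || attempt === MAX_ATTEMPTS) throw error;
      // Waits of 2s, 4s, 8s, 16s, each with up to 1s of random jitter.
      const delay = Math.pow(2, attempt) * 1000 + Math.random() * 1000;
      await sleep(delay);
    }
  }
}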

The jitter matters. If you have multiple workflow executions hitting the same API, synchronized retries create thundering herds. Randomizing the delay spreads the load.

But retry logic only buys time. If the API is down, your account is throttled, or the model is temporarily overloaded, retries alone will eventually fail the execution. That's why I pair retries with fallback models. If the primary model — say, Mistral Large — returns a 500 or times out after backoff, the workflow calls Anthropic Claude. The fallback doesn't need to be identical in capability; it needs to be good enough to prevent the execution from dying in front of the customer.
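
The fallback chain itself is simple to express. In this sketch, callWithFallback and the { name, call } provider shape are hypothetical; each call is your own async function, which may already include retry logic.

```javascript
// Try each provider in order and return the first successful result.
// Each entry is { name, call }, where `call` is your own async function
// (already wrapped in retry logic if you want backoff per provider).
async function callWithFallback(providers, prompt) {
  const failures = [];
  for (const provider of providers) {
    try {
      const text = await provider.call(prompt);
      // Report which model answered so it can be logged downstream.
      return { model: provider.name, text };
    } catch (error) {
      failures.push(`${provider.name}: ${error.message}`);
    }
  }
  throw new Error(`All providers failed (${failures.join('; ')})`);
}
```

Returning the name of the model that actually answered matters for cost logs and audits, because a fallback response may differ in quality and price from the primary.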

I also add self-throttling. A 200-500 millisecond Wait node between iterations in a loop keeps me under rate limits proactively. It's cheaper than retries and reduces latency variance. For a batch job, waiting half a second between items turns a 500-item run from a rate-limiting nightmare into a boring, predictable process.
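
Inside a Code node, or anywhere outside n8n, the same idea can be sketched as a sequential loop with a pause between calls. The function name processWithThrottle and the 300 ms default are assumptions chosen from the 200-500 ms range above.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Process items one at a time, pausing between calls (not after the last one).
async function processWithThrottle(items, handler, delayMs = 300) {
  const results = [];
  for (let i = 0; i < items.length; i++) {
    results.push(await handler(items[i]));
    if (i < items.length - 1) await sleep(delayMs);
  }
  return results;
}
```

The trade-off is wall-clock time: at 500 ms, a 500-item run adds 499 pauses, or roughly four minutes, on top of the API calls themselves.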

Human-in-the-Loop: Where and When

Automation should not mean abdication.

For any output that reaches a customer, influences a financial decision, or carries legal weight, I insert a human gate. Not everywhere — that defeats the purpose. But at the point of publication, or at the point of high-stakes routing.

The pattern I use is a Wait node with a webhook callback. The workflow generates the draft, sends it to Slack with Approve and Reject buttons, and pauses. If the reviewer clicks Approve, the workflow publishes. If they reject, it captures feedback and can loop back for regeneration. I always set a 24-hour limit on the Wait node. If nobody reviews in time, the workflow routes to a fallback — a reminder, a skip, or a secondary reviewer — but it never waits indefinitely. An unfinished workflow is a resource leak.
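
The timeout rule can be sketched outside n8n as a race between the reviewer's decision and a deadline. This illustrates the logic only and is not how the Wait node is implemented; waitForDecision, its arguments, and the 'timeout' return value are hypothetical.

```javascript
// Resolve with the reviewer's decision, or with 'timeout' if the deadline passes.
// `decisionPromise` is a stand-in for whatever resumes your workflow
// (for example, a webhook callback that resolves to 'approved' or 'rejected').
function waitForDecision(decisionPromise, timeoutMs) {
  let timer;
  const deadline = new Promise((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  // Clear the timer either way so a settled race leaves nothing pending.
  return Promise.race([decisionPromise, deadline]).finally(() => clearTimeout(timer));
}

const TWENTY_FOUR_HOURS_MS = 24 * 60 * 60 * 1000;
```

A caller would branch three ways on the result: publish on 'approved', capture feedback on 'rejected', and route to the reminder or secondary reviewer on 'timeout'.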

There's a second, lighter place to insert humans: moderation before auto-publishing. For lower-stakes content like social posts, I run a moderation pass with a cheap model checking for hallucinated facts, off-brand tone, and invented links. I have the moderation step return a structured decision with a confidence score:

Above 0.85, no issues found → auto-publish.
0.7–0.85 → publish the suggested edit but flag for human review later.
Below 0.7 → stop and wait.
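
Those tiers translate into a small decision function. This sketch assumes the confidence score already reflects "no issues found", treats exactly 0.85 as the middle tier, and holds anything that isn't a valid number; the function and action names are hypothetical.

```javascript
// Map a moderation confidence score (0 to 1) to one of the three tiers above.
function moderationAction(confidence) {
  // Assumption: a missing or non-numeric score is never safe to publish.
  if (typeof confidence !== 'number' || Number.isNaN(confidence)) return 'hold_for_review';
  if (confidence > 0.85) return 'auto_publish';
  if (confidence >= 0.7) return 'publish_and_flag';
  return 'hold_for_review';
}
```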

This creates a tiered trust system instead of a binary all-or-nothing gate. The expensive human only sees the edge cases. The cheap model handles the routine safety checks at machine speed.

Audit Trails for AI Decisions

If you can't reconstruct why the AI made a decision, you don't have a production system. You have a black box that occasionally emits money and complaints.

I log two things religiously: the input before processing, and the cost of every call.

Input logging means writing the raw message or request to a tracking sheet before it ever reaches the model. If the agent fails, the data isn't lost, and you have a record for debugging and compliance. In the student support agent, every incoming Udemy message hits Google Sheets first — sender, timestamp, content, thread ID, and conversation history. Only after that row exists does the AI agent see the text. If the workflow crashes five seconds later, the message is still captured.
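
The ordering is the whole point, so here is a minimal sketch of it. appendRow and runAgent are hypothetical async functions standing in for the Google Sheets write and the AI agent call, and the message property names are assumptions.

```javascript
// Persist the raw message before the model sees it, so a later failure
// cannot lose the input. `appendRow` and `runAgent` are hypothetical
// async functions standing in for the Sheets write and the AI agent call.
async function handleIncoming(message, { appendRow, runAgent }) {
  await appendRow({
    sender: message.sender,
    timestamp: message.timestamp,
    content: message.content,
    thread_id: message.threadId,
    history: message.history,
  });
  // Only reached once the row exists; errors here propagate to the caller.
  return runAgent(message.content);
}
```

If runAgent throws, the row written by appendRow is already stored, which is what makes the failure recoverable.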

Cost logging means appending a row after every LLM call with model name, token count, and estimated spend. I keep a live pricing table in a shared sub-workflow and call it after each AI node:

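// USD per 1M tokens. Update this table whenever a provider changes its prices.
const pricing = {
  'gpt-4o': { input: 2.50, output: 10.00 },
  'gpt-4o-mini': { input: 0.15, output: 0.60 },
  'claude-sonnet-4-20250514': { input: 3.00, output: 15.00 },
};

// Token counts come from the usage data returned with each LLM response.
function estimateCost(model, inputTokens, outputTokens) {
  const rate = pricing[model];
  if (!rate) throw new Error(`No pricing entry for model: ${model}`);
  return (inputTokens * rate.input + outputTokens * rate.output) / 1_000_000;
}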

That data lands in a running sheet:

Timestamp	Workflow	Model	Input Tokens	Output Tokens	Cost (USD)
2025-03-15T10:30:00Z	Support Classifier	gpt-4o-mini	450	85	$0.000119
2025-03-15T10:30:01Z	Support Classifier	claude-sonnet-4-20250514	1,200	650	$0.013350

I set alerts for any single execution that crosses $0.50. That's usually a sign of a runaway agent loop, an unchunked document being fed whole into the prompt, or a memory buffer that grew out of control. Without this telemetry, you find out about the cost problem when the invoice arrives, and by then you've already trained the team to ignore the AI feature because it's "too expensive."
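
The alert check is a sum and a comparison. In this sketch, checkExecutionCost and the { model, cost } row shape are hypothetical; the $0.50 threshold is the one described above.

```javascript
const ALERT_THRESHOLD_USD = 0.5;

// `calls` is an array of { model, cost } rows logged for one execution.
function checkExecutionCost(calls, threshold = ALERT_THRESHOLD_USD) {
  const total = calls.reduce((sum, call) => sum + call.cost, 0);
  return { total, alert: total > threshold };
}
```

Summing per execution rather than per call is what catches agent loops: each individual call can look cheap while the execution as a whole does not.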

Audit trails also mean logging the routing decision. When the structured parser outputs escalate_to_instructor: true, that boolean gets recorded alongside the reason, the confidence level, and the model that produced it. Six months later, when the business asks why 40% of tickets are escalating, you have the data to answer.

What to Do Monday Morning

You don't need to rebuild everything to make your AI workflows production-grade. You need to add the four layers around the model you already have, then make them resilient and auditable.

Lock the contract with structured output

Pick the workflow closest to your customers and replace any freeform text parsing with a JSON schema and a validation step. Add a catch path for malformed output and test it with ten diverse inputs before you deploy.

Add cheap-first-then-smart routing

Identify one workflow making more than a hundred LLM calls per day and insert a classification step with a mini model. Route only the complex exceptions to your expensive model. Watch your daily API cost drop by half.

Bound your memory

If you have a conversational workflow, check whether the context window is growing unbounded. Cap it at ten messages, or switch to summary memory, and make sure the session key is actually unique per conversation.

Fix your retry logic

Replace fixed-interval retries with exponential backoff and jitter, and define a fallback model for any workflow where a failed execution would be visible to a customer.

Insert one human gate

Find the highest-stakes output your AI publishes and add a Wait node with a 24-hour deadline before it goes live. Set up the Slack buttons this week.

Create a logging sub-workflow

Write every AI input and every token count to a Google Sheet. Set a $0.50 per-execution alert.

Do that, and you're no longer running a demo on a schedule. You're running a system.