The obvious cut is usually swapping GPT-4o for GPT-4o-mini on a single node and declaring victory. That might shave twenty percent off. It is not enough.
The reductions I aim for — sixty, seventy, eighty percent — come from refusing to make the call in the first place.
They come from caching, from batching, from conditional execution, and from what I call the cheap-first chain. These are not optimizations you bolt on later; they are design decisions you make before the workflow ever hits production. I have rebuilt enough workflows after bill shock to recognize the same four or five leaks every time. This is how I fix them.
Start with the cheapest model or API that could possibly do the job. Escalate to the expensive one only when the cheap one fails, returns output you cannot parse, or scores below a confidence threshold you define.
Most tutorials on LLM routing get this backwards. They show you the powerful model first, then mention in a footnote that a smaller one exists. In production, that means every single request burns the full rate. I do the opposite. I default to the small model for classification, extraction, routing, and yes/no decisions. Only the minority of requests that actually need long-form generation or complex reasoning see the large model.
On a support pipeline handling roughly a thousand tickets a day, classifying every ticket with GPT-4o costs about thirty dollars. Routing them through GPT-4o-mini for classification (about ten cents for all thousand), then escalating only the two hundred tickets that need a custom reply to GPT-4o (six dollars), drops the daily cost to six dollars and ten cents. That is roughly an eighty percent reduction before you have done anything clever with caching or batching.
In n8n, I implement this with two AI nodes and an IF node. The first node uses a small model with a tightly constrained prompt and a max token limit of ten or twenty.
{
  "resource": "chat",
  "model": "gpt-4o-mini",
  "messages": [
    {
      "role": "system",
      "content": "Classify this ticket into billing, technical, feature_request, or spam. Reply with exactly one word."
    },
    {
      "role": "user",
      "content": "={{ $json.emailBody }}"
    }
  ],
  "temperature": 0,
  "maxTokens": 10
}
If the classification indicates the ticket needs human-level reasoning — or if the small model returns a format I cannot parse — the workflow branches to the second node using the large model. If not, it routes to a template response and stops. The large model is the exception, not the rule.
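The branch condition behind that IF node can be sketched as a small pure function. This is only an illustration under stated assumptions: `routeTicket`, the `ESCALATE` set, and the choice of which labels need a custom reply are hypothetical and should be replaced with your own categories.

```javascript
// Labels the small model is allowed to return (from the system prompt above).
const ALLOWED = new Set(['billing', 'technical', 'feature_request', 'spam']);
// Assumption: which labels need a custom reply. Adjust to your own pipeline.
const ESCALATE = new Set(['technical', 'feature_request']);

function routeTicket(rawReply) {
  // Normalize: trim, lowercase, drop anything that is not a letter or underscore.
  const label = String(rawReply ?? '')
    .trim()
    .toLowerCase()
    .replace(/[^a-z_]/g, '');

  // Anything outside the allowed set counts as a parse failure and escalates.
  if (!ALLOWED.has(label)) {
    return { route: 'large_model', label: null, reason: 'unparseable' };
  }
  if (ESCALATE.has(label)) {
    return { route: 'large_model', label, reason: 'needs_custom_reply' };
  }
  return { route: 'template', label, reason: 'handled_by_template' };
}
```

Treating an unparseable reply as an escalation rather than an error keeps the cheap path strict: the small model either answers in the expected format or hands the ticket up.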
There is a legitimate exception: some workflows genuinely need the large model first. Complex contract analysis, nuanced sentiment detection, and multi-step reasoning usually require the heavy model from the start. But in my experience, those workflows are less than twenty percent of what teams actually build. The rest are overpaying for classification tasks dressed up as generation tasks.
Batching is the second place I look. Many APIs charge per request, not per item, and their rate limits force you into slow serial loops if you call them one by one. Batching fixes both problems. But only sometimes.
Default to batching any non-real-time API call that accepts bulk payloads. A geocoding job for five hundred addresses finishes in under two minutes batched instead of eight minutes serially, and it avoids the rate-limit retries that serial calling triggers. When the API has a native bulk endpoint, I use a Code node to chunk items into batches of fifty, then fire each batch with a controlled interval.
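// Group incoming items into batches of 50 for a bulk endpoint.
const items = $input.all();
const batchSize = 50;
const batches = [];
for (let i = 0; i < items.length; i += batchSize) {
  const chunk = items.slice(i, i + batchSize);
  batches.push({
    json: {
      // One request body per batch, plus an index for tracing failures later.
      addresses: chunk.map(item => item.json.address),
      batchIndex: Math.floor(i / batchSize)
    }
  });
}
// Each returned item becomes one bulk request downstream.
return batches;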
Where batching fails is real-time pipelines and APIs without bulk support. If a webhook needs a synchronous response within five seconds, batching introduces latency you cannot afford. If the API has no batch endpoint, you are left with n8n's built-in item batching, which helps with rate limits but does not reduce the total request count. In those cases, I do not batch; I throttle with exponential backoff and controlled concurrency instead.
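Throttling with exponential backoff and a concurrency cap can be sketched in plain JavaScript as follows. The function names, the default delays, and the retry count are assumptions chosen for illustration; a production version would likely add random jitter and retry only on rate-limit or transient errors.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Retry an async call with exponential backoff. `call` is any function returning a promise.
async function withBackoff(call, { retries = 4, baseMs = 500, capMs = 30000 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (attempt >= retries) throw err; // out of retries: surface the error
      await sleep(backoffDelayMs(attempt, baseMs, capMs));
    }
  }
}

// Run `worker` over `items` with at most `limit` calls in flight at once.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function lane() {
    while (next < items.length) {
      const i = next++; // safe: no await between the check and the increment
      results[i] = await worker(items[i], i);
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, () => lane());
  await Promise.all(lanes);
  return results;
}
```

Inside a Code node this would wrap the per-item request, for example `await mapWithConcurrency(items, 3, item => withBackoff(() => callApi(item)))`, where `callApi` is a hypothetical function that performs the HTTP call.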
Batching has its own failure mode: partial success. When you send fifty items and three fail, the API often returns a mixed success-and-error response. You need downstream logic to split the batch, retry the failures individually, and log them. Teams that skip that step discover weeks later that three percent of their data never made it through.
I handle this by inspecting the batch response in a Code node and routing failed items to a retry loop while letting successes pass through. If retrying is not an option, I log the failure and alert rather than silently dropping the record.
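A minimal sketch of that split is below. The response shape is an assumption: it presumes the API returns one result per submitted item, in order, as either `{ ok: true, data }` or `{ ok: false, error }`. Real bulk endpoints differ, so `splitBatchResponse` and its field names are hypothetical.

```javascript
// Assumed response shape: one result per submitted item, in the same order,
// each either { ok: true, data } or { ok: false, error }. Real APIs differ.
function splitBatchResponse(submitted, results) {
  const succeeded = [];
  const failed = [];
  submitted.forEach((item, i) => {
    const result = results[i];
    if (result && result.ok) {
      succeeded.push({ item, data: result.data });
    } else {
      // A missing result is treated as a failure so nothing is dropped silently.
      failed.push({ item, error: result ? result.error : 'no result returned' });
    }
  });
  return { succeeded, failed };
}
```

The `failed` list feeds the retry loop or the alerting branch, and `succeeded` continues down the normal path.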
Before I call an expensive API, I ask whether the input has actually changed since the last run. If it has not, I skip the call entirely.
This sounds obvious, but most webhook-driven workflows ignore it. A CRM fires an event on every field change, including irrelevant metadata updates like last_viewed_at. If the workflow regenerates an AI product description for every event, it makes five thousand LLM calls a day when only fifty of them involved a meaningful change to the product name, category, features, or price.
I fix this with an input hash. In a Code node, I take only the fields that affect the downstream result, stringify them, and generate an MD5 hash. I compare that hash against the previous run's hash, stored in a database or Google Sheet. If they match, the workflow returns the cached result. If they differ, it proceeds to the expensive API and writes the new hash.
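// On self-hosted n8n, built-in modules may need to be allowed via NODE_FUNCTION_ALLOW_BUILTIN.
const crypto = require('crypto');
const product = $input.first().json;
// Only the fields that change the generated description go into the hash.
const relevant = {
  name: product.name,
  category: product.category,
  features: product.features,
  price: product.price
};
// MD5 is fine here: this is change detection, not security.
const hash = crypto
  .createHash('md5')
  .update(JSON.stringify(relevant))
  .digest('hex');
// Assumes an upstream lookup merged the stored hash into the item as _previous_hash.
const previous = product._previous_hash;
return [{
  json: {
    ...product,
    currentHash: hash,
    // Also true on the first run, when no previous hash exists.
    needsRegeneration: hash !== previous
  }
}];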
Downstream, an IF node checks needsRegeneration. The false branch returns the previously generated description without touching the LLM. On a catalog with a thousand products and five thousand daily CRM events, this check reduces five thousand OpenAI calls to fifty. At three cents per call, the 4,950 skipped calls save about one hundred fifty dollars a day ($148.50), all from a single comparison.
If you include a timestamp, an auto-generated ID, or a last_modified field that changes on every touch, your hash will never match and you will get zero benefit. Hash only the semantic inputs that actually alter the output.
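The difference is easy to demonstrate outside n8n. In this sketch, `semanticHash`, the `FIELDS` list, and the sample product are made up for illustration. The point is only that a change to `last_viewed_at` leaves the hash alone while a price change does not.

```javascript
const crypto = require('crypto');

// Hash only the named fields, in a fixed order, so unrelated changes are ignored.
function semanticHash(record, fields) {
  const relevant = fields.map(name => [name, record[name] ?? null]);
  return crypto.createHash('md5').update(JSON.stringify(relevant)).digest('hex');
}

const FIELDS = ['name', 'category', 'features', 'price'];

// Illustrative sample data, not taken from a real catalog.
const original = {
  name: 'Desk lamp',
  category: 'lighting',
  features: ['dimmable'],
  price: 39,
  last_viewed_at: '2024-05-01T10:00:00Z'
};
const viewedAgain = { ...original, last_viewed_at: '2024-05-02T08:30:00Z' };
const repriced = { ...original, price: 35 };

console.log(semanticHash(original, FIELDS) === semanticHash(viewedAgain, FIELDS)); // true
console.log(semanticHash(original, FIELDS) === semanticHash(repriced, FIELDS));    // false
```

Building the hash input from an explicit field list also fixes the key order, so two records with the same values always serialize identically.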
Conditional execution skips work when the data has not changed. Caching skips work when the data changes slowly. I use both.
The classic example is an exchange rate lookup. If a workflow converts prices from USD to EUR every time a product is viewed, it might call the rate API five hundred times a day. But exchange rates change daily, not per page view. I cache the response with a twenty-four-hour TTL in a Google Sheet or Postgres table, then check the cache before every call.
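// Input: the cached row looked up upstream (empty if nothing is cached yet).
const cacheEntry = $input.first().json;
const ttlHours = 24;
if (cacheEntry?.rate && cacheEntry?.cached_at) {
  // Age of the cached rate in hours.
  const age = (Date.now() - new Date(cacheEntry.cached_at).getTime()) / (1000 * 60 * 60);
  if (age < ttlHours) {
    // Fresh enough: return the cached rate and skip the API call.
    return [{
      json: {
        rate: cacheEntry.rate,
        source: 'cache',
        expires_in_hours: Math.round(ttlHours - age)
      }
    }];
  }
}
// Missing or expired entry: tell the downstream branch to call the API and rewrite the cache.
return [{ json: { source: 'miss', needs_refresh: true } }];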
A cache hit costs zero API calls. On a five-hundred-call-per-day workflow, this drops the API usage to one call per day. That is a 99.8 percent reduction on that single endpoint.
TTL discipline matters. A TTL that is too short wastes calls; a TTL that is too long serves stale data. I set the TTL to match the real-world volatility of the data: twenty-four hours for exchange rates, five minutes for stock prices, a week for company metadata. When in doubt, I start long and tighten based on observed data quality issues.
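One way to keep those TTLs in a single place is a small lookup table with a freshness check. This is a sketch: the key names and the `isFresh` helper are hypothetical, and only the durations come from the paragraph above.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Durations from the text; the key names are hypothetical.
const TTL_MS = {
  exchange_rate: 24 * HOUR_MS,
  stock_price: 5 * 60 * 1000,
  company_metadata: 7 * 24 * HOUR_MS
};

// True when a cached entry of the given kind is still within its TTL.
function isFresh(kind, cachedAt, now = Date.now()) {
  const ttl = TTL_MS[kind];
  const age = now - new Date(cachedAt).getTime();
  // Unknown kinds and unparseable timestamps count as stale, forcing a refresh.
  return ttl !== undefined && Number.isFinite(age) && age >= 0 && age < ttl;
}
```

Failing closed on unknown kinds means a typo in a key name costs an extra API call rather than serving stale data indefinitely.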
Development runs leak money too. After a successful test run, pin the output data on each paid node so that later test runs reuse the pinned output instead of calling the API again while you edit downstream nodes. Combined with swapping the Webhook trigger for a Manual Trigger during development, this eliminates the silent cost of building.
Idempotency is part of the same discipline. Webhook senders retry on timeout, and without deduplication keys, you process the same event twice and pay for both executions. I log every incoming webhook's idempotency key to a table before processing, and I return 200 OK immediately if the key already exists. The sender stops retrying, and I stop double-paying.
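The deduplication check can be sketched with an in-memory set standing in for the table. The `idempotencyKey` field name and the `createWebhookDeduper` helper are assumptions for illustration. Senders name this key differently, and a real deployment would use a database table with a unique constraint so that concurrent deliveries cannot both pass the check.

```javascript
// In-memory stand-in for the idempotency table described above.
function createWebhookDeduper() {
  const seen = new Set();
  return function handle(event, processEvent) {
    const key = event.idempotencyKey; // assumed field name; senders differ
    if (!key) {
      // No key to dedupe on: process it, but report that it was unkeyed.
      processEvent(event);
      return { status: 200, duplicate: false, keyed: false };
    }
    if (seen.has(key)) {
      // Already handled: acknowledge so the sender stops retrying, do no paid work.
      return { status: 200, duplicate: true, keyed: true };
    }
    seen.add(key); // record the key before processing
    processEvent(event);
    return { status: 200, duplicate: false, keyed: true };
  };
}
```

Recording the key before processing means a crash mid-run leaves the event marked as seen, so pair this with failure logging if the event must never be lost.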
There is a class of API that looks harmless in isolation and becomes a hemorrhage in aggregate. The worst offenders are the ones priced at fractions of a penny.
I see this most often with reference data lookups: exchange rates, geocoding, enrichment APIs, and configuration fetches. A workflow processes a hundred orders, and because nobody enabled Execute Once on the config node, it fetches the same exchange rate a hundred times.
The Execute Once checkbox is right there in the node settings. One call instead of a hundred. At a thousand executions a day, that single checkbox saves a hundred dollars a month.
Token waste falls into the same category. LLM pricing is per token, and sending a thirty-page transcript to GPT-4o when only the first five pages and last two pages matter is like mailing a ream of paper when a postcard would do. I pre-process long inputs in a Code node, keeping the head and tail and summarizing the middle with a cheap model before sending the package to the expensive one.
// Keep the head and tail verbatim and mark the middle for summarization.
const transcript = $input.first().json.transcript ?? '';
const maxChars = 12000;
// Short transcripts pass through untouched.
if (transcript.length <= maxChars) {
  return [{ json: { optimizedText: transcript, strategy: 'none' } }];
}
const head = transcript.substring(0, 4000);
const tail = transcript.substring(transcript.length - 3000);
const middle = transcript.substring(4000, transcript.length - 3000);
return [{
  json: {
    head,
    middle,
    tail,
    strategy: 'head-tail-with-middle-summary',
    // Upper bound at roughly 4 characters per token; the summary still costs some tokens.
    estimatedTokensSaved: Math.floor(middle.length / 4)
  }
}];
Then I send the middle section through GPT-4o-mini for compression, and feed the concatenated result to GPT-4o. On a call transcript pipeline, this cuts the per-transcript cost by roughly seventy-eight percent.
So what happens when you stack these patterns on a single workflow? The savings do not just add up; they multiply, because each layer removes calls that the previous layer would have processed.
Consider a pipeline that handles support tickets, enriches customer data, and generates draft replies. Before optimization:
| Cost Center | Before | Driver |
|---|---|---|
| Classification / routing | $30.00/day | 1,000 calls × $0.03 on GPT-4o |
| Data enrichment | $10.00/day | 1,000 calls × $0.01 |
| Exchange rate lookup | $1.00/day | 1,000 calls × $0.001 |
| Daily total | $41.00 | |
After applying the cheap-first chain, conditional execution, caching, and Execute Once:
| Cost Center | After | Driver |
|---|---|---|
| Classification | $0.10/day | 1,000 calls × $0.0001 on GPT-4o-mini |
| Escalated replies | $6.00/day | 200 calls × $0.03 on GPT-4o |
| Data enrichment | $0.50/day | 50 calls × $0.01 after input-hash dedup |
| Exchange rate | $0.001/day | 1 call × $0.001 with 24h TTL |
| Daily total | $6.60 | 84% reduction |
That is an eighty-four percent reduction. Over a month, the difference between $1,230 and $198 pays for actual engineering time.
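The totals are easy to re-derive. The sketch below copies the call counts and per-call prices from the two tables and recomputes the daily totals and the reduction. The object keys are just labels chosen for this example.

```javascript
// Figures copied from the two tables above: [calls per day, USD per call].
const costsBefore = {
  classification: [1000, 0.03],
  enrichment: [1000, 0.01],
  exchangeRate: [1000, 0.001]
};
const costsAfter = {
  classification: [1000, 0.0001],
  escalatedReplies: [200, 0.03],
  enrichment: [50, 0.01],
  exchangeRate: [1, 0.001]
};

// Sum calls × price across all cost centers.
const dailyTotal = rows =>
  Object.values(rows).reduce((sum, [calls, price]) => sum + calls * price, 0);

const beforeTotal = dailyTotal(costsBefore);
const afterTotal = dailyTotal(costsAfter);
const reductionPct = (1 - afterTotal / beforeTotal) * 100;

console.log(beforeTotal.toFixed(2), afterTotal.toFixed(2), reductionPct.toFixed(1));
// prints "41.00 6.60 83.9"
```

Swapping in your own call counts and prices gives a quick estimate before you touch the workflow.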
None of this works if you cannot see where the money is going. I add a lightweight logging node after every paid API call that records the timestamp, workflow, node, model, input tokens, output tokens, and estimated cost.
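// Runs directly after a paid LLM node and expects the raw API response as input.
const response = $input.first().json;
const usage = response.usage || {};
// USD per 1,000 tokens. Check these against your provider's current price list.
const pricing = {
  'gpt-4o': { input: 0.0025, output: 0.01 },
  'gpt-4o-mini': { input: 0.00015, output: 0.0006 }
};
const model = response.model || 'unknown';
// The API may return a dated name such as "gpt-4o-2024-08-06", so use the longest pricing key that prefixes it.
const key = Object.keys(pricing)
  .filter(name => model.startsWith(name))
  .sort((a, b) => b.length - a.length)[0];
const p = pricing[key] || { input: 0, output: 0 };
const cost = ((usage.prompt_tokens || 0) / 1000) * p.input +
  ((usage.completion_tokens || 0) / 1000) * p.output;
return [{
  json: {
    timestamp: new Date().toISOString(),
    // Field names match the columns used in the weekly query.
    workflow_name: $workflow.name,
    node: $prevNode.name,
    model,
    input_tokens: usage.prompt_tokens || 0,
    output_tokens: usage.completion_tokens || 0,
    // Six decimals so sub-cent gpt-4o-mini calls do not round to zero.
    estimated_cost_usd: Math.round(cost * 1e6) / 1e6
  }
}];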
I write this to a Postgres table and run a weekly query grouping by workflow and model.
SELECT
workflow_name,
model,
COUNT(*) as call_count,
SUM(estimated_cost_usd) as total_cost
FROM api_usage_log
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY workflow_name, model
ORDER BY total_cost DESC;
Without this log, you are optimizing in the dark. With it, the worst offender is obvious and the payoff from fixing it is quantified before you start.
API cost optimization is not a one-time audit. It is a maintenance habit.
Pull last month's invoice. Name the three most expensive nodes.
Replace any classification, routing, or extraction step with the cheapest model that can handle it. Escalate only on failure.
Before every LLM call that runs on a schedule or webhook, add an input-hash check. If the data has not changed, skip it.
Exchange rates, geocoding, company info — cache them with a TTL that matches how often the data actually changes. Twenty-four hours for FX; a week for company metadata.
During development, swap active Webhook triggers for Manual Triggers, and pin test data on every paid API node so test runs reuse its output.
Any node that fetches shared configuration, tokens, or exchange rates gets the Execute Once checkbox.
Open the weekly cost query. Attack the top item.
The teams that keep their API bills sane are not the ones with the best vendor discounts. They are the ones that treat every call as a failure mode until proven otherwise.