The Economics of Parallel Coding Agents

I watched a terminal tell me it had written a complete reservations module — entity, migration, DTOs, mapper, service, controller, ten stories, sixty-nine green Jest suites — in seven minutes and forty-five seconds.

Then I checked the bill. Four cents.

Not a toy. Not a stubbed-out happy path. A real vertical feature on a real NestJS backend, following the repo's own conventions, with the no-overlap and capacity rules enforced and the full existing suite still passing. DeepSeek charged me less than a US nickel and went back to idling.

I had sat down to write an architecture post — baro versus Claude Code, parallel agents versus one smart session. I got up writing an economics post. The fastest run of the day was also, by two orders of magnitude, the cheapest. And that stopped being a benchmark result and became a question: if bounded code costs four cents, what exactly am I paying frontier prices for?

The timing was not an accident. On May 28, 2026, Anthropic announced dynamic workflows in Claude Code: orchestration that fans a task out across many parallel subagents inside one session, exposed through an effort setting called ultracode that raises effort to xhigh and lets Claude decide when a workflow is worth it. Suddenly every timeline was screenshots of Claude splitting a task, running it in parallel, and coming back with a PR. It is a good direction. It is also the one we had been testing for a while with Mozaik and baro.

baro's bet has always been a little different from "make one agent smarter": turn a coding task into a DAG — plan once, split the work into bounded stories, run them in parallel, then reconcile. Mozaik is the event bus and participant model underneath; baro is the coding product on top. So when Claude Code moved into the same territory, I wanted a real comparison. Not a toy todo app. Not a screenshot race. The same repo, the same starting commit, the same acceptance bar, real pull requests at the end.

  • Fastest execution: 7m45s (DeepSeek V4 Flash on the same 10-story DAG)
  • Cheapest successful run: $0.04 (DeepSeek V4 Flash execution cost)
  • Acceptance bar: 932 Jest tests green after the Flash run

The task

The repo was the same private NestJS + TypeORM service I used in the 808-test run: shops, tables, menus, promotions, table sessions, role-based access, the kind of backend where conventions matter more than cleverness. The baseline was clean: 64 Jest suites, 808 tests, all green.

The task was a vertical feature: add a complete reservations module. Entity, migration, DTOs, mapper, service, controller, module wiring, and unit specs. The important rules were not cosmetic: no overlapping active reservations for the same table, party size capped by table capacity, time windows validated, status transitions enforced, soft delete, and the same NestJS guard/role conventions as the existing tables module.
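The two rules that carry the most weight, no overlapping active reservations and party size capped by table capacity, reduce to small predicates. What follows is a minimal sketch, not the code from the private repo: the `ReservationWindow` shape, the status names, and the choice of which statuses count as "active" are all assumptions made for illustration.

```typescript
// Hypothetical shapes for illustration; the real entity in the private repo may differ.
type ReservationStatus = "pending" | "confirmed" | "cancelled" | "completed";

interface ReservationWindow {
  tableId: string;
  startsAt: Date;
  endsAt: Date;
  status: ReservationStatus;
}

// Assumption: only pending and confirmed reservations block a table.
const ACTIVE_STATUSES: ReadonlySet<ReservationStatus> = new Set<ReservationStatus>([
  "pending",
  "confirmed",
]);

// Two half-open windows [start, end) overlap when each starts before the other ends.
function windowsOverlap(a: ReservationWindow, b: ReservationWindow): boolean {
  return (
    a.startsAt.getTime() < b.endsAt.getTime() &&
    b.startsAt.getTime() < a.endsAt.getTime()
  );
}

// A candidate conflicts if any active reservation on the same table overlaps it.
function hasConflict(
  candidate: ReservationWindow,
  existing: ReservationWindow[],
): boolean {
  return existing.some(
    (r) =>
      r.tableId === candidate.tableId &&
      ACTIVE_STATUSES.has(r.status) &&
      windowsOverlap(candidate, r),
  );
}

// Party size must be a positive whole number that fits the table.
function fitsTable(partySize: number, tableCapacity: number): boolean {
  return Number.isInteger(partySize) && partySize >= 1 && partySize <= tableCapacity;
}
```

With half-open windows, a reservation ending at 19:00 and another starting at 19:00 on the same table do not conflict. Whether the real service treats back-to-back bookings that way is a design call this sketch does not settle.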

I used the same starting commit and the same acceptance bar for every run: build must pass, the full Jest suite must pass, and the implementation has to follow the existing codebase shape rather than inventing a parallel universe.

Is parallel enough?

The first uncomfortable result was that Claude Code was better than I wanted it to be.

The pure-Claude baseline finished in 34 minutes. One Opus session, one workflow, one PR, full suite green. That matters. A single warm Claude Code session has advantages that are easy to underestimate when you are thinking in distributed-system diagrams: it keeps context warm, it avoids subprocess cold starts, and it spends the subscription budget through one session shape Anthropic actually optimizes for.

When I forced baro to use Claude Opus at max effort for every phase, it was slower: about 53 minutes. It also produced the most tests, which is not nothing. But it was not the practical winner. It was expensive in the resource that mattered: Claude session budget.

Figure: baro pure-Claude run completing the reservations module · 10 stories · ~38m35s execution

That is the first real lesson: parallelizing expensive Claude sessions multiplies expensive Claude sessions. If every story is a fresh claude --print process at max effort, the architecture is parallel, but the economics are bad.

What if I split the bill?

The hybrid run was the first time the shape clicked. Claude did the work I actually wanted Claude for: Architect and Planner. It read the repo, made the design calls, decomposed the feature into ten stories, and wrote the authoritative spec the workers would follow.

Then the stories ran on Codex. In my local Codex config that meant gpt-5.5 with model_reasoning_effort = "high". Not max. Not Claude. Just a capable execution backend pointed at a precise DAG.

Figure: baro hybrid run · Claude plan + Codex stories · 10 stories · 11m49s execution

That finished the execution phase in 11 minutes and 49 seconds, opened a green PR, and did it without touching the Claude session bucket for story work. The total wall time was about 26 minutes including planning. At that point I thought the post was going to be about Claude for planning, Codex for execution.

Then I pointed the same DAG at DeepSeek.

Four cents

baro 0.47 added an OpenAI-compatible backend path. That means the same StoryAgent loop can talk to anything exposing a Chat Completions-compatible endpoint. DeepSeek was the obvious stress test: cheap, fast, and strict enough that bad tool-call formatting tends to surface quickly.
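To make "Chat Completions-compatible" concrete, here is a sketch of the request such a backend receives. This is not baro's implementation. The `buildChatRequest` helper and the `ChatMessage` and `ChatRequest` types are hypothetical names, and the `/chat/completions` path appended to the base URL is an assumption based on the common OpenAI-style layout. The base URL and model name are the ones used later in this post.

```typescript
// Hypothetical helper, not part of baro: builds an OpenAI-style chat request.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ChatRequest {
  url: string;
  init: {
    method: "POST";
    headers: Record<string, string>;
    body: string;
  };
}

function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[],
): ChatRequest {
  // Assumption: the endpoint lives at <baseUrl>/chat/completions.
  const url = `${baseUrl.replace(/\/+$/, "")}/chat/completions`;
  return {
    url,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}

// Example: the same story prompt can target any compatible endpoint
// by changing only the base URL and the model name.
const exampleRequest = buildChatRequest(
  "https://api.deepseek.com",
  "sk-placeholder",
  "deepseek-v4-flash",
  [{ role: "user", content: "Implement the reservation status enum." }],
);
```

In a runtime with a global `fetch` (Node 18 or later), `fetch(exampleRequest.url, exampleRequest.init)` would send it. Swapping providers is then a configuration change rather than a code change, which is the property the backend path relies on.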

I reused the same Claude-authored DAG and ran the ten stories on deepseek-v4-flash.

Figure: baro + DeepSeek V4 Flash run · 10/10 stories · 7m45s execution · 1.8× speedup

It finished in 7 minutes and 45 seconds. Ten out of ten stories passed. Zero retries. Build green. Full Jest suite green: 69 suites, 932 tests. The DeepSeek dashboard showed four cents.

The total is the headline, but the shape is the lesson, and the replay of the run shows it. Level 0 lays the foundation: the status enum (0:45) and the TypeORM migration (0:48). Levels 1 and 2 fan out across entities, DTOs, the response mapper, and the service, with four to seven agents writing at once. Level 3 is the bottleneck: the service unit spec (2:00) and the controller (2:13) cannot start until the code they depend on exists. That is why parallel execution saved roughly five minutes against sequential, a 1.8× speedup, and also why it cannot save more. A DAG has a critical path, and you only go as fast as your deepest dependency chain.
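The critical-path limit can be computed directly. The sketch below uses a small hypothetical DAG with invented durations. It is not the ten-story plan from the run, and the `Story` shape is an assumption. With unlimited workers, the best possible wall time is the longest dependency chain, so the speedup ceiling is total work divided by that chain.

```typescript
// Hypothetical story shape and durations; not the actual plan from the run.
interface Story {
  id: string;
  seconds: number;
  dependsOn: string[];
}

// Earliest finish time per story when each starts as soon as its dependencies finish.
function earliestFinish(stories: Story[]): Map<string, number> {
  const byId = new Map<string, Story>();
  for (const s of stories) byId.set(s.id, s);

  const finish = new Map<string, number>();
  const visiting = new Set<string>();

  const visit = (id: string): number => {
    const cached = finish.get(id);
    if (cached !== undefined) return cached;
    const story = byId.get(id);
    if (!story) throw new Error(`unknown story: ${id}`);
    if (visiting.has(id)) throw new Error(`dependency cycle at: ${id}`);
    visiting.add(id);
    const deps = story.dependsOn.map((dep) => visit(dep));
    visiting.delete(id);
    const end = Math.max(0, ...deps) + story.seconds;
    finish.set(id, end);
    return end;
  };

  for (const s of stories) visit(s.id);
  return finish;
}

function speedupBound(stories: Story[]) {
  const sequential = stories.reduce((sum, s) => sum + s.seconds, 0);
  const criticalPath = Math.max(0, ...Array.from(earliestFinish(stories).values()));
  return { sequential, criticalPath, maxSpeedup: sequential / criticalPath };
}

const exampleStories: Story[] = [
  { id: "enum", seconds: 40, dependsOn: [] },
  { id: "migration", seconds: 50, dependsOn: [] },
  { id: "entity", seconds: 60, dependsOn: ["enum"] },
  { id: "dtos", seconds: 30, dependsOn: ["enum"] },
  { id: "service", seconds: 80, dependsOn: ["entity", "dtos"] },
  { id: "service-spec", seconds: 120, dependsOn: ["service"] },
  { id: "controller", seconds: 130, dependsOn: ["service"] },
];

const bound = speedupBound(exampleStories);
console.log(bound); // sequential: 510, criticalPath: 310, maxSpeedup about 1.65
```

In this toy plan, seven stories can never beat about 1.65× no matter how many agents run, because the chain enum → entity → service → controller takes 310 of the 510 seconds. Applied to the real run, 7m45s at 1.8× implies roughly 14 minutes of sequential story work.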

Figure: DeepSeek usage tooltip showing $0.04 for v4-flash and $0.11 for v4-pro
Figure: DeepSeek monthly usage page showing $0.16 total API spend

I ran deepseek-v4-pro next on the same DAG. It took 12 minutes and 12 seconds, also green, at eleven cents. Slower than Flash, cleaner than Flash in the PR hygiene, and still absurdly cheap for a full vertical backend feature.

Figure: baro + DeepSeek V4 Pro run · 10 stories · 12m12s execution

Here are all five reservations runs side by side:

| Run | Setup | Execution | Total | Cost | Build + test | Diff |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Code ultracode #14 | Opus 4.8 max, one Claude Code workflow | included | 34m | Claude subscription | 70 suites / 923 tests | 24 files, +2,306 |
| baro hybrid #15 | Claude Opus max plan + Codex gpt-5.5 high stories | 11m49s | ~26m | ChatGPT subscription | 69 suites / 905 tests | 25 files, +1,997 |
| baro pure-Claude #16 | Claude Opus max everywhere | ~38m35s | ~53m | Claude session-limit heavy | 70 suites / 934 tests | 27 files, +2,266 |
| baro + DeepSeek V4 Flash #17 | same Claude DAG + deepseek-v4-flash stories | 7m45s | ~22m projected | $0.04 | 69 suites / 932 tests | 25 feature files, +2,272 / -2 |
| baro + DeepSeek V4 Pro #18 | same Claude DAG + deepseek-v4-pro stories | 12m12s | ~27m projected | $0.11 | 69 suites / 926 tests | 25 feature files, +2,303 |

The DeepSeek rows reuse the same Claude-generated DAG from the baro hybrid run. That holds planning constant and changes only the story execution backend. The projected total adds the roughly 14 minutes of Claude planning from the original run to the resumed execution time.

Reservations is still a bounded vertical feature, though. Before I let four cents change how I think, I wanted to watch it survive a much bigger, messier task.

Does it hold at scale?

The reservations benchmark could not tell me what happens when the DAG gets wide enough that orchestration failures become part of the result. So I ran a second task on the same MenuService repo: build an Orders domain with order creation, item mutation, payment recording and refunds, kitchen tickets, status propagation, totals, migrations, controllers, and tests.

This was a 24-story plan. Claude Code ultracode ran it as one Claude Code workflow. baro ran the same kind of plan with Codex, DeepSeek V4 Pro, and DeepSeek V4 Flash as execution backends. It was large enough to expose the failure modes the smaller benchmark hid: blocked dependencies, failed story recovery, and the difference between a green test suite and a production-ready domain model.

  • Larger stress test: 24 stories (Orders, order items, payments, and kitchen tickets)
  • Fastest large run: 33m54s (DeepSeek V4 Flash execution, full suite green)
  • Large-run cost: $0.30 (25.6M input tokens and 338k output tokens)

| Run | Setup | Time | Cost | Build + test | Diff | Review note |
| --- | --- | --- | --- | --- | --- | --- |
| Claude Code ultracode #19 | Opus 4.8 max, one Claude Code workflow | 62m | 55% Claude session | 90 suites / 1,242 tests | 81 files, +10,105 | green; broadest test count, still needed domain review |
| baro hybrid Codex #20 | Claude plan + Codex gpt-5.5 high stories | 45m45s | 11% Codex 5h sub | 83 suites / 1,008 tests | 73 files, +6,396 | green; leanest diff, weaker domain invariants |
| baro + DeepSeek V4 Pro #21 | same baro DAG + deepseek-v4-pro stories | 41m31s | ~$0.62 marginal | 83 suites / 1,207 tests | 79 files, +10,310 | green; many tests, same class of domain gaps |
| baro + DeepSeek V4 Flash #24 | same baro DAG + deepseek-v4-flash stories | 33m54s | $0.30 | 83 suites / 1,167 tests | 75 files, +10,344 | green; fastest/cheapest, not production-ready without review |

The most interesting row is the last one. DeepSeek V4 Flash completed the 24-story Orders execution in 33 minutes and 54 seconds, with the full suite green, for thirty cents.
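The reported figures allow a quick back-of-the-envelope check. The sketch below derives a blended price per million tokens for the Orders run from the numbers in this post ($0.30 for 25.6M input and 338k output tokens), and a cost per story for both Flash runs. The function names are mine. The billed amounts are rounded to the cent, and the blended figure hides the split between input, output, and any cached tokens, so treat the result as an order of magnitude rather than a price sheet.

```typescript
// Blended USD per million tokens: total billed cost spread over all tokens.
function blendedUsdPerMillionTokens(
  costUsd: number,
  inputTokens: number,
  outputTokens: number,
): number {
  const totalTokens = inputTokens + outputTokens;
  if (totalTokens <= 0) throw new Error("token count must be positive");
  return (costUsd / totalTokens) * 1_000_000;
}

// Average billed cost per story in a run.
function usdPerStory(costUsd: number, stories: number): number {
  if (stories <= 0) throw new Error("story count must be positive");
  return costUsd / stories;
}

// Figures reported in this post for the DeepSeek V4 Flash runs.
const ordersBlended = blendedUsdPerMillionTokens(0.3, 25_600_000, 338_000);
const ordersPerStory = usdPerStory(0.3, 24);
const reservationsPerStory = usdPerStory(0.04, 10);

console.log(ordersBlended.toFixed(4)); // "0.0116", about 1.2 cents per million tokens
console.log(ordersPerStory.toFixed(4)); // "0.0125"
console.log(reservationsPerStory.toFixed(4)); // "0.0040"
```

At about a cent per story, the cost of a failed story is small enough that retrying it is cheaper than thinking hard about whether to retry it.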

That number is not a merge recommendation. I reviewed the generated Orders PRs, and all of the large runs still needed human domain review. The recurring issues were exactly the kind you expect in a real backend: cross-shop protections, terminal-state immutability, kitchen-ticket invariants, and payment/refund edge cases. But that is also why the result matters. Thirty cents did not produce a perfect production patch. It produced a large, compiling, tested backend implementation that was good enough to review.

The four-cent question

Four cents did not buy me a toy completion. It bought a green PR for a real NestJS feature: entity, migration, DTOs, mapper, controller, service logic, tests, no-overlap rules, capacity checks, lifecycle transitions, and the full existing Jest suite still passing. The Orders task made the same point with a noisier, more realistic workload, for thirty cents.

I do not know DeepSeek's true marginal inference cost. Public API pricing is not a physics constant. It includes strategy, subsidies, capacity allocation, competition, and whatever a provider wants the market to believe this work is worth. But I do know the price I was charged, and I know what came out of it. That is enough to change how I think about agent architecture.

Claude is expensive because it earns a premium in places where ambiguity is high: reading a messy repo, deciding where the seams belong, choosing a plan, catching subtle mistakes, and recovering when the task goes sideways. That does not mean every migration, DTO, mapper, and controller test should pay the same premium.

The expensive mistake is not using Claude. The expensive mistake is using Claude everywhere.

That sentence changes the benchmark. The fastest successful run was not the biggest model. It was not the largest context window. It was not the purest multi-agent architecture. It was a premium planning model producing a precise DAG, followed by a cheap execution model doing bounded work in parallel.

Before running this, I was still half-thinking about agent systems as model competitions. Claude vs Codex. Codex vs DeepSeek. Pure-Claude vs hybrid. That framing is too small. The useful unit is not the model. The useful unit is the phase. Planning is not execution. Reviewing is not migration-writing. Controller wiring is not status-transition design. A coding DAG gives you boundaries, and once you have boundaries, you can price them differently.

That is the economics of parallel coding agents: parallelism is only half the story. If you parallelize work across the same expensive session budget, you can make the budget disappear faster. If you parallelize work into well-scoped stories and route each phase to the backend that makes sense, the cost curve changes.

What I'd actually run

If you remember one thing

  • Plan with Claude (Architect + Planner). Pay the premium where ambiguity is highest.
  • Execute on the cheapest backend that clears the acceptance bar. For bounded stories that is usually not a frontier model.
  • Escalate only the risky stories — or reruns of failed ones — to a stronger model.

Concretely, for a real run today I would not start with pure-Claude max. I would start with something like this:

OPENAI_API_KEY="$DEEPSEEK_API_KEY" \
baro --llm hybrid \
  --story-llm openai \
  --openai-base-url https://api.deepseek.com \
  --story-model deepseek-v4-flash \
  "Add a complete reservations module"

For riskier stories I would route only those to Pro, or rerun failed stories on Pro. For architecture-heavy planning I still want Claude. For broad mechanical execution, I no longer think the premium model should be the default.
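That escalation rule is simple enough to write down. This is a sketch of the policy as described, not baro's configuration format or API: the `StoryState` shape, the `risky` flag, and `pickStoryBackend` are hypothetical, and how a story gets marked risky is left to the planner or a human.

```typescript
// Hypothetical policy sketch; baro's real routing options may look different.
type StoryBackend = "deepseek-v4-flash" | "deepseek-v4-pro";

interface StoryState {
  id: string;
  risky: boolean; // assumption: flagged by the planner or a reviewer
  failedAttempts: number; // how many times this story already failed
}

// Default to the cheapest backend; escalate risky stories and reruns of failures.
function pickStoryBackend(story: StoryState): StoryBackend {
  if (story.risky || story.failedAttempts > 0) return "deepseek-v4-pro";
  return "deepseek-v4-flash";
}

const exampleQueue: StoryState[] = [
  { id: "dto-mapper", risky: false, failedAttempts: 0 },
  { id: "status-transitions", risky: true, failedAttempts: 0 },
  { id: "controller-spec", risky: false, failedAttempts: 1 },
];

const routed = exampleQueue.map((s) => `${s.id} -> ${pickStoryBackend(s)}`);
console.log(routed);
// [ "dto-mapper -> deepseek-v4-flash",
//   "status-transitions -> deepseek-v4-pro",
//   "controller-spec -> deepseek-v4-pro" ]
```

Planning stays on Claude and is outside this function. The policy only decides which execution backend a bounded story gets.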

What this says about baro

The honest result is not "baro beats Claude Code." Claude Code was the better pure-Claude workflow in this test. It deserves that credit.

The stronger result is that baro lets the unit of routing be smaller than the whole coding session. It can spend Claude on the part where Claude earns it, then spend Codex or DeepSeek on the parts where they are good enough, faster, or dramatically cheaper.

The best architecture was not the one that used the strongest model everywhere. It was the one that let me choose.

That is the post I did not expect to write. I thought I was benchmarking parallel coding agents. I ended up benchmarking the price of not having a routing layer.

baro is on npm and GitHub. The event bus underneath it is Mozaik.

If this kind of agent architecture is interesting to you, join us in the JigJoy Discord. We talk a lot about baro, Mozaik, event-based reactive agents, model routing, and what actually happens when you try to run this stuff on real codebases.

You can also follow or DM me on X / Twitter if you want to compare notes, argue with the benchmark, or talk about parallel coding agents.

Miodrag Todorović

Co-founder @ JigJoy

My passion is to tell the world the stories about the beautiful stuff we have built.

Different is better than better.
Unnamed LLM spat this quote out

Made with baro, several confused benchmark runs, and less than one dollar of DeepSeek.