Extensions · 7 min read

Letting Agents Work in Shifts

Or: why one tired agent is worse than five focused ones

There's a familiar rhythm to working with an AI coding assistant. You ask for a feature. It writes the code. You spot something off. You ask for a fix. It introduces a new thing that's off. You ask again. Three rounds in, you've spent more time steering than you would have spent just writing the thing yourself, and you're starting to wonder whether the productivity gain is mostly imaginary.

I've been there. Probably you have too.

The thing is, the agent isn't bad at coding. The agent is, on most days, perfectly capable. The problem is what we're asking it to do, which is to be the architect, the implementer, the tester, and the reviewer — all at once, all in the same breath, all without breaking stride. No human team works like that, and for good reason. A developer who writes code and reviews their own code in the same sitting catches fewer bugs than a fresh set of eyes would. We know this. We have entire engineering rituals — code review, pairing, QA — built on the simple observation that the person who made the thing is not the best person to find what's wrong with it.

LLMs are no different. They get tunnel vision too. They commit to an approach and then defend it. They miss what they're not specifically looking for.

Elyra Swarm leans into this. Instead of one agent doing everything, Swarm runs a small pipeline of focused agents in sequence. Each one has a narrow job, a clear name, and — crucially — constraints on what it's allowed to touch.

Install

elyra install npm:@elyracode/swarm

The build pipeline

This is the one you'll reach for most often. You describe what you want, and five agents take turns:

/swarm build add user registration with email verification

Here's the relay race.

Planner goes first, and it's read-only. It pokes around the codebase, maps the architecture, and produces a concrete plan: file paths, function signatures, migration steps, test strategy. Because it physically can't edit anything, it has to actually think — and write a plan clear enough for someone else to follow. The constraint is the feature.

Coder takes the plan and does the work. It edits files, creates new ones, and writes the actual code, following the plan step by step. Notably, it doesn't write tests. That's the next agent's job, and leaving it there keeps the coder honest.

Tester reads what the coder built and writes tests against it. It follows the project's existing patterns, covers happy paths, edge cases, error scenarios. It runs the tests if it can. Because the tester didn't write the implementation, it tends to ask the kinds of questions a reviewer would — what happens if this is null? what if the email is already taken? what if the verification link is clicked twice?

Reviewer is read-only again. It looks at everything the coder and tester produced and checks correctness, security, edge cases, test quality. It sorts findings into critical, warning, and suggestion. It cannot "just fix it," which means it has to articulate the problem clearly enough for the next agent to act on. (You know that feeling when you're reviewing a PR and it would be faster to push a commit than write the comment? The reviewer doesn't have that option.)

Fixer comes last. It reads the review and addresses the findings — critical issues first, warnings where clearly beneficial, skipping the pure style nits. Runs the tests again afterward.
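The fixer's ordering can be pictured as a small triage step. This is only a sketch of the policy described above, not Swarm's actual code: the Finding class, the clearly_beneficial flag, and the sample findings are all made up for illustration.

```python
from dataclasses import dataclass

# Severity labels come from the reviewer; the numeric ranks are an assumption.
SEVERITY_RANK = {"critical": 0, "warning": 1, "suggestion": 2}


@dataclass(frozen=True)
class Finding:
    severity: str                     # "critical", "warning", or "suggestion"
    summary: str
    clearly_beneficial: bool = False  # hypothetical flag, only used for warnings


def fixer_queue(findings):
    """Return the findings a fixer would act on, most severe first."""
    selected = [
        f for f in findings
        if f.severity == "critical"
        or (f.severity == "warning" and f.clearly_beneficial)
    ]
    # sorted() is stable, so findings of equal severity keep the review's order.
    return sorted(selected, key=lambda f: SEVERITY_RANK[f.severity])


# Invented review output for the registration example.
review = [
    Finding("suggestion", "rename tmp to pending_user"),
    Finding("warning", "verification link can be used twice", clearly_beneficial=True),
    Finding("critical", "an already-taken email creates a second account"),
    Finding("warning", "log message could be more descriptive"),
]

queue = fixer_queue(review)
for finding in queue:
    print(f"[{finding.severity}] {finding.summary}")
```

The style nit and the low-value warning never reach the queue, which is the "skip the pure style nits" behavior in miniature.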

The result is a feature built with separation of concerns, tested by someone who didn't write it, reviewed by someone who didn't implement or test it, and fixed based on specific written feedback. Which is, you'll notice, roughly how a healthy engineering team operates.
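If it helps to see the relay as code, here is a minimal sketch of five stages running in sequence. It assumes nothing about how Swarm is implemented: the Stage class, the stub function, and run_pipeline are hypothetical, and each "agent" is a stand-in that only reports which earlier stages it was handed.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str
    read_only: bool
    # (task, earlier stage outputs) -> this stage's output
    run: Callable[[str, list], str]


def stub(label):
    """Stand-in for a model call; it just lists the stages it can see."""
    def run(task, earlier):
        seen = ", ".join(name for name, _ in earlier) or "nothing yet"
        return f"{label} output (saw: {seen})"
    return run


# Read-only flags follow the article: planner and reviewer cannot edit.
BUILD = [
    Stage("PLANNER", True, stub("plan")),
    Stage("CODER", False, stub("code")),
    Stage("TESTER", False, stub("tests")),
    Stage("REVIEWER", True, stub("review")),
    Stage("FIXER", False, stub("fixes")),
]


def run_pipeline(task, stages):
    """Run stages in order; each one receives every earlier output."""
    transcript = []
    for stage in stages:
        output = stage.run(task, list(transcript))
        transcript.append((stage.name, output))
    return transcript


transcript = run_pipeline("add user registration with email verification", BUILD)
for name, output in transcript:
    print(f"{name}: {output}")
```

The point of the sketch is the shape: a fixed order, one job per stage, and a transcript that grows as the baton is passed.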

The review pipeline

When you want a proper review of code that already exists:

/swarm review the payment processing module

Five stages. All read-only except the last.

The interesting bit is that each reviewer only looks for one kind of problem.

  1. Analyze maps the code structure, data flow, and entry points.

  2. Correctness hunts for logic errors, null handling, off-by-ones, missing error handling. Nothing else.

  3. Security looks for SQL injection, XSS, auth bypasses, data leaks. Nothing else.

  4. Tests looks at coverage gaps, assertion quality, missing edge cases. Nothing else.

  5. Synthesize pulls the three review reports together into a prioritized, actionable list.

The "nothing else" matters. A correctness reviewer who's also keeping half an eye on XSS won't go as deep on the null handling. A security reviewer who's also flagging missing tests won't read auth flows as carefully. Focus beats breadth, and running focused passes consistently surfaces more issues than a single "please review this code" prompt does.
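A rough way to picture the three focused passes and the synthesize step follows. Everything here is an assumption made for illustration: the instruction wording, the report format, and the reuse of the critical/warning/suggestion labels from the build pipeline's reviewer. The sample findings are invented.

```python
# One narrow brief per pass, taken from the stage descriptions above.
FOCUS = {
    "correctness": "logic errors, null handling, off-by-ones, missing error handling",
    "security": "SQL injection, XSS, auth bypasses, data leaks",
    "tests": "coverage gaps, assertion quality, missing edge cases",
}


def review_instructions(area):
    """Build the single-focus brief for one review pass (hypothetical wording)."""
    return f"Look only for {FOCUS[area]}. Report nothing else."


def synthesize(reports):
    """Merge per-area reports into one list, most severe first."""
    rank = {"critical": 0, "warning": 1, "suggestion": 2}
    merged = [
        (severity, area, note)
        for area, findings in reports.items()
        for severity, note in findings
    ]
    return sorted(merged, key=lambda item: rank[item[0]])


# Invented findings for a payment module, one list per focused pass.
reports = {
    "correctness": [("warning", "refund amount is not checked for None")],
    "security": [("critical", "order id is interpolated into a SQL string")],
    "tests": [("suggestion", "no test covers a declined card")],
}

print(review_instructions("security"))
for severity, area, note in synthesize(reports):
    print(f"[{severity}] ({area}) {note}")
```

Each pass gets a brief that names one kind of problem, and only the final step looks across all three.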

The refactor pipeline

For when you want to change shape without changing behavior:

/swarm refactor the notification system

Four stages:

  1. Analyze (read-only) — understands the current structure, identifies coupling and pain points.

  2. Plan (read-only) — produces a refactoring plan that explicitly preserves behavior.

  3. Implement — executes the plan, updates tests for structural changes.

  4. Verify (read-only) — confirms behavior is preserved, no regressions, no dead code stranded in the wake.

The read-only stages do a lot of heavy lifting here. A refactoring plan written by an agent that can't impulsively start editing tends to be more careful, more complete, and less prone to "well, while I'm in here…" detours.
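One plausible way to enforce "can't impulsively start editing" is a tool allowlist per stage. This is a sketch under that assumption, not a description of Swarm's internals. The tool names, the ToolNotAllowed exception, and call_tool are hypothetical.

```python
# Hypothetical tool names; a real agent runtime will have its own.
READ_TOOLS = {"read_file", "search", "list_dir"}
WRITE_TOOLS = {"edit_file", "create_file"}

# The four refactor stages and their read-only flags, as listed above.
REFACTOR = [
    ("ANALYZE", True),
    ("PLAN", True),
    ("IMPLEMENT", False),
    ("VERIFY", True),
]


class ToolNotAllowed(Exception):
    """Raised when a stage reaches for a tool outside its allowlist."""


def allowed_tools(read_only):
    """Read-only stages get read tools; other stages get everything."""
    return set(READ_TOOLS) if read_only else READ_TOOLS | WRITE_TOOLS


def call_tool(stage, read_only, tool):
    """Pretend to invoke a tool, refusing anything the stage may not use."""
    if tool not in allowed_tools(read_only):
        raise ToolNotAllowed(f"{stage} may not call {tool}")
    return f"{stage} called {tool}"


for stage, read_only in REFACTOR:
    try:
        print(call_tool(stage, read_only, "edit_file"))
    except ToolNotAllowed as err:
        print(f"blocked: {err}")
```

With a guard like this, the only thing a read-only stage can produce is text, which is exactly what pushes it toward a careful plan.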

Why the stages matter

The constraint is the feature. I keep coming back to this because it's the part that's easy to miss.

An agent that can't edit files writes better analysis. An agent that only reviews security finds more security issues. An agent that implements from a pre-written plan makes fewer architectural mistakes than one that's planning and coding simultaneously. None of this is because the underlying model is different; it's the same model. What changes is what it's allowed to do, and what it's been told to focus on.

This mirrors the unwritten rules of any decent engineering team. The person who designs the system shouldn't be the only person reviewing it. The person who writes the code produces better code when they know a reviewer is coming next. The reviewer produces better feedback when they can't make a problem disappear by quietly fixing it themselves.

Each Swarm stage gets a specific role and name (PLANNER, CODER, TESTER, REVIEWER, FIXER in the build pipeline), focused instructions for one job, a read-only constraint where it helps, and the output of all previous stages as context. No stage after the first sees the original prompt in isolation. The coder sees the planner's output. The reviewer sees the plan, the implementation, and the tests. The fixer sees the review findings, in context. It's a structured conversation, not a black box.
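To make those four ingredients concrete, here is one hypothetical way a stage's context could be assembled. The build_context function, the section labels, and the sample outputs are all assumptions. The real prompt format isn't shown in this post.

```python
def build_context(role, instructions, read_only, task, earlier):
    """Assemble the text one stage would see (illustrative layout only)."""
    parts = [f"ROLE: {role}", f"INSTRUCTIONS: {instructions}"]
    if read_only:
        parts.append("CONSTRAINT: read-only, do not modify files")
    parts.append(f"TASK: {task}")
    # Every earlier stage's output is included, in pipeline order.
    for name, output in earlier:
        parts.append(f"--- {name} OUTPUT ---\n{output}")
    return "\n".join(parts)


# Invented outputs from the stages that ran before the reviewer.
earlier = [
    ("PLANNER", "1. add a users table migration\n2. add a register handler"),
    ("CODER", "created the migration and the handler"),
    ("TESTER", "added tests for duplicate email and expired link"),
]

context = build_context(
    role="REVIEWER",
    instructions="Check correctness, security, edge cases, and test quality.",
    read_only=True,
    task="add user registration with email verification",
    earlier=earlier,
)
print(context)
```

The reviewer's context opens with its role and constraint, then carries the task and everything the planner, coder, and tester produced.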

A few practical notes

Be specific in the task description. "Build user registration" works. "Build user registration with email verification, rate limiting, and a magic-link option" works better. The planner scopes the rest of the pipeline off your description, so a sentence or two more here saves a round trip later.

Run review on your scariest modules. Authentication, payments, anything where a bug has outsized consequences. The multi-pass review is slower than a one-shot prompt, but slower is the point.

Use refactor when you're afraid to touch something. The analyze-plan-implement-verify cycle is built for code you'd otherwise leave alone. The verify stage exists specifically to give you a second look before you commit.

You can still step in. Swarm runs inside your normal Elyra session. If a stage produces something off, correct it before the next stage picks it up. The pipeline isn't an autonomous agent doing things behind your back — it's just structure.

Try it on something small

elyra install npm:@elyracode/swarm

Then start somewhere low-stakes:

/swarm build add a health check endpoint that returns the app version and uptime

Watch the stages flow past. Notice how the planner's output shapes the coder's work, how the tester catches an edge case the coder didn't think about, how the reviewer flags something that would have made it into production otherwise.

After that, try /swarm review on a module you haven't opened in a while. The multi-pass analysis almost always surfaces something — and it's a quiet way to get reacquainted with code you've half-forgotten without committing to changing it.