Your AGENTS.md is full of advice for a dumber model
Every rule your agent helped you write passed review by the model that wrote it — so the errors that survive are the ones that model couldn't see. They sit in your instruction files until a stronger reader shows up. Here's the one prompt to run before you give a new model real work.
There's a guide making the rounds right now — Paweł Huryn's "Claude Fable 5: The Ultimate Guide v2" — and buried inside all the benchmarks is one observation that stopped me cold. It has almost nothing to do with Fable 5 specifically, and everything to do with how all of us actually work.
On day two of testing, mid-task, the model read his instruction file and caught it teaching the exact pattern his own quality gate is built to ban. He hadn't asked for a review. It just noticed the contradiction and flagged it.
His framing is the part worth tattooing somewhere:
A system maintained by model N preserves the errors model N can't see — and they sit there until a stronger reader shows up.
If you've used Elyra for more than a week, you have one of these systems. It's your AGENTS.md, your skills, your memory files. And it is quietly full of advice written by and for a weaker model.
The blind spot, explained
Here's the mechanism, and it's almost unfair. Every rule your agent helped you write passed review by the model that wrote it. So the errors that survive in your instruction files aren't the obvious ones — those got caught. The survivors are precisely the mistakes that model couldn't see. They accumulate, invisibly, until a stronger reader arrives.
Huryn's 320-line instruction file sat on top of 166 files — roughly 300,000 words of rules. When a stronger model finally audited it, here's a sample of what had been hiding in plain sight:
A hardcoded date: (today is 2026-05-24), written in one session and never noticed again, feeding the wrong day into every session that loaded it afterward.
A rule documented with the thing it bans. His style guide forbids em dashes; the file explaining the ban was written with em dashes. Instructions teach by example as much as by rule.
The same rule in three different files — three chances to drift apart.
Guardrails for failure modes the new model doesn't have: "never delegate judgment-heavy work to cheaper models," double-check rituals priced for an older lineup, a threshold that had quietly drifted out of date.
His point about these: a careful grep would catch a couple. The ones that matter, it never would — because they're right for the model they were written for, wrong in meaning rather than syntax. No automated check finds those. And the better your system was for the last model, the more it holds the next one back.
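For the couple of cases a grep would catch, here is a minimal Python sketch of that mechanical pass. It flags ISO-style hardcoded dates and rules that appear in more than one file. The function names, the 30-character threshold, and the normalization are illustrative assumptions, not anything from Huryn's setup, and by design it finds none of the "wrong in meaning" problems described above.

```python
import re
from collections import defaultdict
from pathlib import Path

# ISO-style dates such as 2026-05-24 (a hardcoded "today is ..." is the classic case).
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def normalize(line: str) -> str:
    """Drop leading list markers, lowercase, and collapse whitespace."""
    text = line.strip().lstrip("-*0123456789. ").lower()
    return re.sub(r"\s+", " ", text)


def scan(files: dict, min_rule_len: int = 30) -> dict:
    """files maps a path to its text.

    Returns hardcoded dates and any sufficiently long line that
    appears (after normalization) in more than one file.
    """
    dates = []
    seen = defaultdict(set)  # normalized rule -> set of "path:line" locations
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in DATE_RE.finditer(line):
                dates.append((f"{path}:{lineno}", match.group()))
            rule = normalize(line)
            if len(rule) >= min_rule_len:
                seen[rule].add(f"{path}:{lineno}")
    duplicates = {}
    for rule, locs in seen.items():
        paths = {loc.rsplit(":", 1)[0] for loc in locs}
        if len(paths) > 1:  # only report rules repeated across different files
            duplicates[rule] = sorted(locs)
    return {"dates": dates, "duplicates": duplicates}


def load(paths) -> dict:
    """Read each path into the {path: text} shape that scan() expects."""
    return {str(p): Path(p).read_text(encoding="utf-8") for p in paths}
```

Calling `scan(load(["AGENTS.md", ...]))` gives a short list of places to look. It only matches exact repeats after normalization, so a rule that has already drifted between files will slip through, which is the point of handing the real audit to a model.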
The prompt worth running first
This is the genuinely useful, model-agnostic takeaway, and it works in Elyra today with any capable model. Before you give a new model real work, point it at your own instruction layer and ask it to audit — not fix. Adapted for Elyra's files:
Read your own instruction files end to end — AGENTS.md, the skills in
.elyra/skills/ and ~/.elyra/agent/skills/, and any memory files.
- Where do they contradict each other? Quote both sides.
- Which rules exist to manage a weaker model — guardrails for failure
modes you don't have, recipes for things you no longer need spelled
out, hardcoded facts that have drifted? List them with file:line.
- Which rules teach by bad example — files that violate the patterns
they prescribe?
- What would you delete? What would you keep exactly as is, and why?
Don't fix anything yet. Report first. I decide what gets cut.
The division of labor is the whole trick: the model does the reading, you do the deleting. Run it, then move the genuinely shared findings out of your head and into the repo — trim AGENTS.md, fix the skill, commit it. One audit, and the whole team inherits a cleaner brain.
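Because the prompt asks for findings with file:line, it helps to be able to look those citations up before you delete anything. A small sketch, assuming the instruction files live at the paths named in the prompt: the memory-file location is left out because it isn't specified here, and the helper names are hypothetical.

```python
from pathlib import Path

# Locations named in the prompt above. Memory files are omitted because
# their location is an unknown here; add your own paths to this list.
DEFAULT_ROOTS = [
    Path("AGENTS.md"),
    Path(".elyra/skills"),
    Path.home() / ".elyra/agent/skills",
]


def collect(roots):
    """Return every regular file under the given roots, skipping roots that don't exist."""
    found = []
    for root in roots:
        root = Path(root)
        if root.is_file():
            found.append(root)
        elif root.is_dir():
            found.extend(sorted(p for p in root.rglob("*") if p.is_file()))
    return found


def numbered(path):
    """Yield 'path:line: text' for each line so a cited location can be checked by eye."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    for lineno, line in enumerate(text.splitlines(), start=1):
        yield f"{path}:{lineno}: {line}"
```

Printing `numbered(f)` for each `f` in `collect(DEFAULT_ROOTS)` produces one listing you can search when the audit report says something like "AGENTS.md:212 contradicts a skill file".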
(Confession, since we preach lean AGENTS.md: ours has a hardcoded testing date convention and a tmux section that assumes a tool not every machine has. We're going to run this on ourselves. Glass houses.)
The honest scoreboard (and a timely asterisk)
What makes Huryn's guide worth reading instead of just quoting is that he retested everything at 20 rounds per cell and let two of his own launch-day headlines fail — corrected in place, not quietly deleted. A few of his measured findings, credited to him:
Correctness was a tie. Across 240 graded answers, both Fable 5 and Opus 4.8 were correct. Speed was the only axis that separated them, and Fable started slower (a ~2.4-second longer pause before the first visible output).
The price unit that matters isn't per token. Fable bills ~2x per token, but on the one planted bug that required reading two files against each other, Fable caught the bug far more often, making its cost per deep find roughly a quarter of the cheaper model's. As he puts it: price per token is the wrong unit; price per deep find is the one your invoice feels.
Nesting agents has an invoice. A manager layer over the same work cost ~2.5x median. His rule: nest only when the next level's plan depends on what the previous level found. A flat fan-out beats an org chart for almost everything.
That middle point lands especially close to home — it's exactly why we built budget-aware routing and a /cost burn rate this month: the question isn't "which model is cheapest," it's "which model is cheapest for the result you need."
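The arithmetic behind "price per deep find" is simple enough to sketch. The catch rates and run costs below are made-up illustrative values, not Huryn's measurements, and the sketch assumes both models spend about the same number of tokens per run, which the guide's own logs suggest is not guaranteed.

```python
def cost_per_find(cost_per_run: float, catch_rate: float) -> float:
    """Expected spend per successful catch: cost of one run divided by
    the fraction of runs that actually find the bug."""
    if not 0 < catch_rate <= 1:
        raise ValueError("catch_rate must be in (0, 1]")
    return cost_per_run / catch_rate


# Hypothetical inputs: the pricier model costs 2x per run but catches
# the bug in 80% of runs instead of 10%.
cheap = cost_per_find(cost_per_run=1.00, catch_rate=0.10)
pricey = cost_per_find(cost_per_run=2.00, catch_rate=0.80)
ratio = pricey / cheap  # below 1.0 means the pricier model is cheaper per find
```

With these assumed inputs the pricier model comes out at about a quarter of the cheaper one's cost per find. A 2x price gap is fully offset once the catch rate is more than 2x higher, so the catch rate is the number to measure on your own tasks.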
Now the asterisk we owe you, because honesty is the whole point of a post like this: Fable 5 is unavailable as we publish. A US export-control directive forced Anthropic to suspend it for all customers; a live API call today returns a 404 pointing you to Opus 4.8. So treat "the ultimate guide to Fable 5" as a guide to a moment, not a model you can run this afternoon.
Which, honestly, only sharpens the real lesson.
The lesson that outlives the model
The model in the headline got pulled four days after launch. The benchmarks are already aging: Huryn notes both models now emit 2–3x more tokens on the same problem than their launch-day logs show. The specific numbers rot. The structural insight doesn't:
The models get smarter on someone else's schedule. Your instruction files only get smarter on yours.
The teams that get the most out of each new model aren't the ones with the longest AGENTS.md. They're the ones who, when a stronger reader shows up, know which lines to delete.
Go read Huryn's full guide and his raw logs — it's a genuinely rigorous piece of work, and he shows his receipts. Then come back, point your sharpest available model at your own instruction layer, and ask it the one question you never thought to: what in here was written for someone dumber than you?
Happy building — and happy deleting.