Bi-Directional Class-of-Fix Taxonomies

Or: how to let frontier models add to your code-fix corpus without ruining it.

June 2026 · Lazaro Kawer

When you build a system that uses a frontier model to propose code changes (bug fixes, performance optimizations, security patches, whatever), you have to decide how much novelty the model is allowed to propose. A common pattern is to handle this implicitly: curate a corpus of known-good fix patterns and instruct the model to stay close to it. Hallucinated APIs go down. Spurious recommendations go down. The system feels safer.

But there's a second error that the same approach makes worse, and it's almost never measured: the model surfaces something real, the system can't recognize it, and the proposal gets dropped. Over enough iterations, the corpus becomes the ceiling on what your system can find. Every novel pattern the model could discover gets filtered out by the same grounding that was supposed to make the system trustworthy.

I built a different shape of this loop for a production AI code-optimization system, and the most useful thing about it is what it surfaced: optimization patterns the subject-matter experts on the team had not catalogued. Some of those patterns are spread across more code than any one expert reads, which makes them hard to find from the inside.

The two-error problem

Frame the system as a binary classifier over candidate code changes. There are two failure modes.

False positive. The model proposes a "fix" that looks like a known pattern but doesn't actually work, doesn't apply, or is subtly wrong. Most eval research focuses here. Calibrated judges, source-grounding scorers, replay infrastructure, all designed to catch the model agreeing with something that isn't real.

False negative. The model proposes something real that doesn't fit any pattern the system recognizes. The pipeline discards it, often without surfacing it for review at all.

The first error is the one teams build defenses against. The second is the one that costs you most of the value the model could have provided.
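
One way to make the second error measurable is to audit a random sample of dropped proposals alongside the accepted ones. The following is a minimal sketch, assuming a hypothetical `AuditedProposal` record in which a reviewer has labeled whether each proposal was real; the record shape and function name are illustrative and not taken from the production system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditedProposal:
    """A proposal that a reviewer labeled after the fact (hypothetical shape)."""
    accepted_by_pipeline: bool  # did the pipeline keep the proposal?
    is_real: bool               # reviewer verdict: is the fix real and valid?


def error_rates(sample: list[AuditedProposal]) -> dict[str, float]:
    """Compute both error rates over an audited sample.

    False-positive rate: share of invalid proposals the pipeline accepted.
    False-negative rate: share of real proposals the pipeline dropped.
    """
    fp = sum(p.accepted_by_pipeline and not p.is_real for p in sample)
    tn = sum(not p.accepted_by_pipeline and not p.is_real for p in sample)
    fn = sum(not p.accepted_by_pipeline and p.is_real for p in sample)
    tp = sum(p.accepted_by_pipeline and p.is_real for p in sample)
    return {
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }


# Made-up audit sample: three real proposals, two of which were dropped.
sample = [
    AuditedProposal(accepted_by_pipeline=True, is_real=True),
    AuditedProposal(accepted_by_pipeline=True, is_real=False),
    AuditedProposal(accepted_by_pipeline=False, is_real=True),
    AuditedProposal(accepted_by_pipeline=False, is_real=True),
    AuditedProposal(accepted_by_pipeline=False, is_real=False),
]
rates = error_rates(sample)
print(rates)
```

The false-negative rate can only be computed if dropped proposals are retained and sampled for review, which is exactly what a pipeline that discards them silently cannot do.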

A standard grounded pipeline (corpus of approved patterns, in-context retrieval, prompt template, judge with rubric) is set up to catch false positives, at the cost of more false negatives. Anything the model could propose that isn't in the corpus either never gets prompted into the proposal in the first place, or gets filtered downstream as "not a recognized pattern."

This trade-off is a fine call for many systems; leaning hard against false positives is often the right choice. It is still a trade-off the system designer should make consciously, and most architectures don't make it visible.

The bi-directional loop

The idea is small. Treat the corpus as a living artifact with two flows.

The pattern isn't new in the abstract. Voyager (Wang et al., 2023) builds a similar skill library through self-proposal and environmental validation in Minecraft. Classical active learning has had this loop shape for thirty years: model proposes candidates, human or automated check validates, corpus grows. Fix-pattern mining in software engineering (PAR, TBar, FixMiner) extracts repair patterns from historical bug fixes and applies them to new code.

Forward (corpus to model). The model uses the taxonomy as grounding. Categories, examples, anti-patterns. Reduces hallucination, anchors generation, makes outputs comparable across runs.

Backward (model to corpus). Model-proposed patterns that don't fit any existing category aren't silently dropped. They route through validation gates. Patterns that survive get admitted to the taxonomy.
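
The two flows can be sketched as a router that never drops an unrecognized proposal. This is a minimal illustration under assumed names (`Proposal`, `Router`, and the category strings are hypothetical); a real system would match proposals to categories with something richer than a set lookup.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Proposal:
    description: str
    matched_category: Optional[str]  # None: no existing category fits


@dataclass
class Router:
    known_categories: set[str]
    grounded: list[Proposal] = field(default_factory=list)
    validation_queue: list[Proposal] = field(default_factory=list)

    def route(self, proposal: Proposal) -> str:
        """Forward flow for recognized proposals, backward flow for novel ones."""
        if proposal.matched_category in self.known_categories:
            self.grounded.append(proposal)
            return "grounded"
        # Novel proposals are queued for the validation gates, not discarded.
        self.validation_queue.append(proposal)
        return "queued_for_validation"


router = Router(known_categories={"loop-invariant-hoisting"})
first = router.route(
    Proposal("hoist regex compilation out of the loop", "loop-invariant-hoisting")
)
second = router.route(Proposal("batch lookups across two services", None))
print(first, second)
```

The only structural difference from a static-corpus pipeline is the second branch: the unmatched proposal ends up in a queue someone can inspect.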

Once that second flow exists, the taxonomy starts behaving as a record of what the system has learned rather than a static specification. A new model migration doesn't require rebuilding the corpus from scratch; you re-run the loop, see what new categories the better model surfaces, and validate them. Old categories that no longer get proposed start to age out.

The interesting design work is in the validation gates. If they're too strict, the corpus calcifies and the loop is barely better than the static-corpus baseline. If they're too loose, the corpus drifts toward whatever the model happens to over-generate this week, and the system's behavior becomes a function of its own historical bias.

What the validation gates actually have to do

Five gates have been useful, in increasing order of cost.

Reproducibility. The candidate pattern has to appear consistently across runs with the same input. If the model proposes it once and never again, it's a hallucination that happened to land near a real one.

Cross-model agreement. Two or more frontier models from different families have to land in roughly the same neighborhood when given comparable inputs. If only one model proposes the pattern, you're at risk of encoding that model's bias as a "discovery." If multiple unrelated models converge on it, that's evidence the pattern is in the underlying engineering reality, not in the model's training-set quirks.

Measurable empirical effect. When the pattern is applied to production-realistic code, the relevant metric has to move in the expected direction with a defensible magnitude. This is where most candidate patterns die, and that's healthy. Most of what a frontier model proposes is plausible-sounding noise.

Internal coherence sanity check. A subject-matter expert reads the pattern and the proposed mechanism. The bar to clear is internal consistency: the mechanism story doesn't contradict known constraints. Expert reviewers don't have to agree they would have suggested the pattern. This is fast, low-burden human review (minutes, not days) and it catches the patterns that score well on every automated check but contain a subtle conceptual error a model would not notice.

Source grounding. Can the pattern be traced to a real causal mechanism, or is it just a textual match for the appearance of fixes that worked in the past? Source-grounding scorers are the most labor-intensive layer to build and the most valuable when they work. A pattern that looks correct on every other gate but doesn't have a traceable mechanism is the most dangerous failure mode in the system, because it will be agreed with by judges, replicated by other models (which share training data), and pass empirical effect tests in any sample where the spurious correlation happens to hold.

A pattern that survives all five is admitted to the taxonomy as a new class. The classification gets a confidence level, an attribution (which model first proposed it, which validation runs it passed), and a body of supporting evidence.
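
Ordering the gates by cost means a candidate that fails a cheap check never reaches an expensive one. A minimal sketch of that short-circuit follows; the candidate fields, the thresholds (such as 0.8 for reproducibility), and the `AdmissionRecord` shape are all assumptions made for illustration, and the record omits the confidence level and evidence body for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable

# A gate takes a candidate (a plain dict here) and returns True if it passes.
Gate = Callable[[dict], bool]


@dataclass
class AdmissionRecord:
    pattern_name: str
    proposed_by: str
    gates_passed: list[str] = field(default_factory=list)
    admitted: bool = False


def run_gates(candidate: dict, gates: list[tuple[str, Gate]]) -> AdmissionRecord:
    """Run gates cheapest-first and stop at the first failure."""
    record = AdmissionRecord(candidate["name"], candidate["proposed_by"])
    for gate_name, gate in gates:
        if not gate(candidate):
            return record  # later, costlier gates never run
        record.gates_passed.append(gate_name)
    record.admitted = True
    return record


# Placeholder checks in increasing order of cost; thresholds are illustrative.
GATES: list[tuple[str, Gate]] = [
    ("reproducibility", lambda c: c["runs_proposing"] / c["runs_total"] >= 0.8),
    ("cross_model_agreement", lambda c: len(c["model_families"]) >= 2),
    ("empirical_effect", lambda c: c["metric_delta"] > 0),
    ("internal_coherence", lambda c: c["expert_found_consistent"]),
    ("source_grounding", lambda c: c["mechanism_traced"]),
]

candidate = {
    "name": "batch-cross-service-lookups",  # hypothetical pattern name
    "proposed_by": "model-a",
    "runs_proposing": 9,
    "runs_total": 10,
    "model_families": {"family-a", "family-b"},
    "metric_delta": 0.07,
    "expert_found_consistent": True,
    "mechanism_traced": False,  # passes everything except source grounding
}
record = run_gates(candidate, GATES)
print(record.admitted, record.gates_passed)
```

The example candidate passes the first four gates and is still rejected, which is the case the source-grounding gate exists for.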

A note on cross-model agreement. Of the five gates, this is the only one that runs in both directions. For admission, it's evidence that a pattern lives in the engineering reality and not in one model's training quirks. For deprecation, the same signal flips: a pattern that was previously admitted but stops being proposed by newer or more diverse model families is a candidate for revocation. The mechanism that gets a pattern into the corpus also tells you when to take it out. Most discussions of LLM-corpus loops treat this gate as a one-way valve. Treating it as bi-directional is what gives the system its only real defense against the corpus calcifying around an older generation of models.
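
The same agreement measure can serve both directions. The sketch below assumes each pattern is stored with the set of model families that proposed it in the latest run; the function names, the family labels, and the 0.5 floor are hypothetical choices, not calibrated values.

```python
def agreement(families_proposing: set[str], families_tested: set[str]) -> float:
    """Fraction of the tested model families that proposed the pattern."""
    if not families_tested:
        return 0.0
    return len(families_proposing & families_tested) / len(families_tested)


def admission_signal(
    families_proposing: set[str], families_tested: set[str], min_families: int = 2
) -> bool:
    """Admission direction: enough unrelated families converge on the pattern."""
    return len(families_proposing & families_tested) >= min_families


def revocation_candidates(
    admitted: dict[str, set[str]], families_tested: set[str], floor: float = 0.5
) -> list[str]:
    """Deprecation direction: admitted patterns whose agreement fell below floor."""
    return sorted(
        name
        for name, proposing in admitted.items()
        if agreement(proposing, families_tested) < floor
    )


families_tested = {"family-a", "family-b", "family-c"}
admitted = {
    "pattern-1": {"family-a", "family-b"},  # still proposed by two of three
    "pattern-2": set(),                      # no longer proposed by anyone
    "pattern-3": {"family-c"},               # proposed by one of three
}
stale = revocation_candidates(admitted, families_tested)
print(stale)
```

Falling below the floor only flags an entry for review; whether to revoke it is still a decision for the team.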

The finding I didn't expect

Once the loop was running, what came back wasn't slight variations on patterns the team already knew. It was a small but recurring stream of optimization classes that no one on the team had explicitly catalogued. After the fact, you could trace why: the patterns showed up in code where the relevant constraint was distributed across multiple components, or where the underlying primitive only became expensive at the scale the production system actually operated at. Either condition is hard for an individual expert to hold in their head and easy for a model to land on by reading enough adjacent code. The mechanism is straightforward: the model is reading a much wider candidate space than any single human expert can enumerate.

I want to make that observation more concrete and more falsifiable than the standard "the model found things we missed" claim. The bi-directional loop appears to surface such patterns when at least one of two conditions holds: the relevant constraint is distributed across components no single expert owns, or the primitive in question only becomes expensive at the scale the production system actually operates at. If neither condition holds in your codebase, my prior is that the loop will mostly reproduce patterns experts already catalogue, and the marginal value of running it is low. I'd be curious whether other teams running similar loops see the same pattern hold.

This is the part of the result that matters for alignment. The standard mental model of frontier models in technical workflows is the assistant: faster typist, better recall. The bi-directional loop suggests a different mental model: search engine over a candidate space humans cannot fully enumerate, gated by validation infrastructure humans can audit. The model functions as a wider-net source of candidates. The validation is what makes the wider net safe.

That's a much more honest framing of what these systems can do well today, and it lands differently than the "we built copilot for X" story. It also makes the design constraint clear: the system is only as good as its validation gates.

What bi-directional taxonomies don't fix

They don't replace human judgment. Every gate I described is either an automated proxy for human judgment or a structured way to make human judgment cheaper to apply at the right moments. The goal is to spend the expert's attention on candidates that have already passed the cheap checks.

They don't make the model an oracle. A pattern admitted to the corpus is "supported by enough evidence to be worth grounding future proposals against," which is weaker than "true." If the underlying mechanism turns out to be wrong, the pattern can be revoked. The taxonomy versioning matters: every entry has a revision history, so when you revoke a pattern you can audit which past proposals would now fail.
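
A revision history makes that audit a simple lookup. The sketch below is one possible shape, assuming revisions are appended in ascending version order and that each past proposal records the taxonomy version it was grounded against; `Revision`, `TaxonomyEntry`, and the proposal identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Revision:
    version: int  # taxonomy version at which this change took effect
    status: str   # "admitted" or "revoked"
    reason: str


@dataclass
class TaxonomyEntry:
    name: str
    history: list[Revision] = field(default_factory=list)  # ascending by version

    def status_at(self, version: int) -> str:
        """Status of this entry as of a given taxonomy version."""
        status = "absent"
        for rev in self.history:
            if rev.version <= version:
                status = rev.status
        return status


def affected_proposals(
    past_proposals: list[tuple[str, str, int]],
    entry: TaxonomyEntry,
    current_version: int,
) -> list[str]:
    """IDs of past proposals grounded on an entry that has since been revoked.

    Each past proposal is (proposal_id, pattern_name, taxonomy_version_used).
    """
    if entry.status_at(current_version) != "revoked":
        return []
    return [
        proposal_id
        for proposal_id, pattern, grounded_at in past_proposals
        if pattern == entry.name and entry.status_at(grounded_at) == "admitted"
    ]


entry = TaxonomyEntry(
    name="pattern-x",
    history=[
        Revision(3, "admitted", "passed all five gates"),
        Revision(7, "revoked", "proposed mechanism turned out to be wrong"),
    ],
)
past = [
    ("proposal-1", "pattern-x", 4),  # grounded while pattern-x was admitted
    ("proposal-2", "pattern-y", 5),  # different pattern, unaffected
    ("proposal-3", "pattern-x", 2),  # predates admission, not grounded on it
]
to_recheck = affected_proposals(past, entry, current_version=7)
print(to_recheck)
```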

They don't eliminate the corpus-drift problem. They make it visible. You can watch the rate at which new patterns are admitted, where they're concentrated, and whether the distribution looks like real engineering reality or like an artifact of the validation gates' blind spots. That visibility is the actual win.
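
As a rough illustration of that visibility, admission rate and concentration can be summarized from a log of admitted patterns. This sketch assumes each admission is logged as a (period, component) pair; the period labels and component names are made up, and what counts as a worrying concentration is left to the team.

```python
from collections import Counter


def admission_summary(admissions: list[tuple[str, str]]) -> dict:
    """Summarize admissions given (period, component) pairs, one per pattern."""
    if not admissions:
        return {"per_period": {}, "top_component": None, "top_share": 0.0}
    per_period = Counter(period for period, _ in admissions)
    per_component = Counter(component for _, component in admissions)
    top_component, top_count = per_component.most_common(1)[0]
    return {
        "per_period": dict(per_period),      # admission rate over time
        "top_component": top_component,      # where admissions concentrate
        "top_share": top_count / len(admissions),
    }


admissions = [
    ("2026-W01", "cache"),
    ("2026-W01", "cache"),
    ("2026-W02", "serializer"),
    ("2026-W02", "cache"),
]
summary = admission_summary(admissions)
print(summary)
```

A high share in one component is not evidence of drift on its own; it is a prompt to ask whether the engineering reality is concentrated there or the gates are.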

They don't help if the metric you're optimizing for is wrong. Garbage in, smarter garbage out, more efficiently.

The harder open questions

Three questions came up while building this that I don't think have settled answers in the broader AI-engineering field yet.

How conservative should the validation gates be? I argued for the five-gate stack above, but the calibration is project-specific. If your downstream cost of a false positive is catastrophic (say, security patches that have to go to production unreviewed), you want the gates near-impassable, and you should accept that the back-flow will be slow. If your downstream cost is recoverable (a suggested optimization a human reviews before merging), you can let more candidates through and rely on the human gate. There's no universal answer, but the trade-off should be made explicitly with the operator team, not left implicit in code-review thresholds.
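
One way to keep that trade-off explicit is to store the calibration as named configuration rather than scattering thresholds through the code. The sketch below is illustrative only: the profile names, the fields, and every number are assumptions, and source grounding is deliberately not made configurable here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateCalibration:
    min_reproducibility: float  # share of runs that must propose the pattern
    min_model_families: int     # unrelated model families that must agree
    min_effect_size: float      # minimum relative improvement in the metric


# Hypothetical profiles; the numbers are placeholders, not recommendations.
CALIBRATIONS = {
    "unreviewed_deploy": GateCalibration(0.95, 3, 0.10),
    "human_reviewed": GateCalibration(0.70, 2, 0.02),
}


def passes_automated_gates(
    cal: GateCalibration,
    reproducibility: float,
    model_families: int,
    effect_size: float,
) -> bool:
    """Check a candidate's measurements against one calibration profile."""
    return (
        reproducibility >= cal.min_reproducibility
        and model_families >= cal.min_model_families
        and effect_size >= cal.min_effect_size
    )


# The same candidate measured once, judged under both profiles.
measured = {"reproducibility": 0.8, "model_families": 2, "effect_size": 0.05}
strict = passes_automated_gates(CALIBRATIONS["unreviewed_deploy"], **measured)
lenient = passes_automated_gates(CALIBRATIONS["human_reviewed"], **measured)
print(strict, lenient)
```

Because the profile is a named object, the operator team can review and sign off on it directly.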

How do you avoid encoding model-specific bias in the corpus? Cross-model agreement helps. So does periodically running the loop in "audit mode," re-evaluating existing taxonomy entries with newer or different model families to see which still get proposed and which don't. Entries that stop getting proposed by a diverse set of models are candidates for revocation. This is the closest thing the system has to a way to forget.

Where does attribution sit? When a model proposes a novel pattern, validation gates pass it, and a human reviewer signs off, who gets credit? In practical engineering terms this is bookkeeping. In organizational terms it's a real political problem. It's especially fraught when the proposed pattern improves on something an expert designed. The taxonomy stores the attribution chain (proposed by model X on date Y, validated by gates Z, reviewed by person A) and lets the team decide how to surface it. The taxonomy doesn't settle the question. It just gives the team a record they can argue from.

Why this matters beyond code optimization

The general structure (model-as-candidate-generator, gates-as-validation, corpus-as-versioned-record-of-what-survived) is portable. It applies anywhere you want a frontier model to expand the surface area of what your system considers without losing the ability to audit what got in.

Bug taxonomies. Test case generation. Threat models. Failure-mode catalogs. Code-review heuristics. The pattern works in any domain where the corpus is genuinely incomplete and the validation infrastructure can be made cheap enough to keep up with the model's proposal rate.

What I've come away convinced of: most of the value of frontier models in engineering workflows comes from how much faster they let you explore a candidate space that was previously bounded by what a small number of experts could enumerate. Any single answer matters much less than the size of the space you can credibly search. The taxonomy and the gates are what convert that exploration into something durable.

Why this matters for alignment

Once a model contributes back to the corpus that grounds its own next iteration, you've created a feedback loop where the model's preferences shape the future model's grounding, which shapes the future model's preferences. Without validation gates, that loop is a quiet way for model-specific biases to harden into something the team treats as engineering reality. In that frame the gates carry the entire load: they're the only mechanism by which the corpus stays anchored to the underlying world rather than drifting toward the model's interior. Source grounding is the most important of the five for the same reason. It's the only check that asks whether a pattern corresponds to something real, rather than whether the model would propose it again.

Most current AI-augmented engineering systems frame the model as good enough to ship, and the framing does the wrong work. The version worth defending is narrower: the model is a useful search step in a pipeline whose validation discipline determines whether it produces compounding value or compounding error. Bi-directional taxonomies are one concrete way to build that pipeline.

This is the design I built for a production AI code-optimization system; the specifics are confidential. If you're thinking about a similar loop for your own work, I'd be curious how the validation calibration shakes out for your domain. Reach me at kawer7@gmail.com or on LinkedIn.