Jaime Gago

If LLMs Write Your Code, LLM Code Review Is the Wrong Loop

Tue, 12 May 2026 10:00:00 +0200

TL;DR — If you are using LLMs to write code, using LLMs to review the MRs is wrong. The writing-time model already had the best context for spotting errors; an MR-time bot is a weaker pass with less information. And the part of review that genuinely needed a second mind — team priors, incident history, intent — is exactly what a second LLM session cannot supply. Close the loop upstream at prompt time and downstream at behavior, not on the diff.

If you are using an LLM to write code, plugging another LLM into your MR pipeline to review the diff is doing the wrong thing at the wrong point. The form of code review survives. The substance evaporates.

Code review did several things at once. It caught bugs and inefficiencies a fresh pair of eyes could spot. It synchronized the author with priors the reviewer held — incident history, what the staff engineer keeps repeating, the refactor someone else has been threading through auth all quarter. And it audited the reasoning behind the diff, not just the diff itself. The artifact under review was a proxy for “did you think the right things while writing this,” and the reviewer’s job was a mix of error-spotting and context-injection.

Replace the author with a prompter driving an LLM, and the picture shifts, provided the prompter does their part. A model writing code has the full file context, the surrounding tests, and the repo’s conventions only if someone fed them in: CLAUDE.md, copilot-instructions.md, AGENTS.md, a well-scoped prompt, the right files in context, skills or rules that encode the team’s standards. When that work is done, the writing-time model is the best-informed reviewer the diff will ever see, and review is happening continuously as the code is written. Bolting a second LLM session onto the MR after that is strictly a downgrade for error-spotting: a less-informed model, looking at a narrower slice, after the fact. The bugs and inefficiencies that get past the writing-time model are unlikely to be caught by a weaker pass at review time.

When the prompter has not done that work, the MR-time LLM is not catching up either. It is reviewing slop with the same lack of context that produced the slop. The fix is upstream — better prompts, better repo-level instructions, better scoping — not a second model downstream pretending to clean it up.

The second LLM also does not bring what made human review valuable in the first place. It does not know your incident history. It does not know your team decided last week to stop adding new gRPC services. It produces generic best-practice nagging against a stale snapshot of the codebase. The human reviewer who used to provide those priors is now expected to triage the bot’s comments — or worse, to click approve once the bot has signed off, which is not a human in the loop, it is a human laundering a model’s output.

The “human in the loop at MR level” justification is where the inefficiency hides. If the human is genuinely in the loop, the LLM review is noise. If the human is not, there is no loop, just two models passing an artifact between them while the ritual of review continues.

The deeper issue is that MR-as-checkpoint was designed for a world where writing code was the slow, expensive step and reading it was cheap. That world is gone. Writing is cheap. Reading at scale is the bottleneck. The loop has to close where signal is highest, and the diff is not it.

Signal is highest in two places now: at prompt-and-plan time, where intent and constraints are set, and at integration time, where the code meets the running system — tests, evals, canaries, production telemetry. Evaluating behavior against scenarios is loop-closing on what the system does. Commenting on diffs is loop-closing on what the code looks like. Only one of those still earns its keep.

Why I built OASIS

Mon, 04 May 2026 10:00:00 +0200

Earlier this year I started building Joe (Joe Operates Everything) — a software infrastructure copilot. My goal with Joe is the everyday work: investigating systems, diagnosing problems, drafting changes, pushing them when asked. It can also operate autonomously when configured to — detecting an incident, deciding what to do about it, and acting on its own, within the bounds I’ve built into its code. Self-healing systems were the kind of thing my whole career in infrastructure had been pointed at, half-myth, half-promise; with LLMs, such intelligent systems are now real.

Once Joe started being useful, the question I kept running into wasn’t how to make it more capable. It was how to verify, before letting it touch anything real, that it would behave according to the rules I’d built into its code — don’t make any changes when running in read-only mode, no destructive change of any meaningful magnitude without human confirmation, and so on. Not subjectively, not through my own judgment of its outputs — empirically, through automated tests, with verdicts a third party could reproduce. That’s what I went looking for. Nothing I found fit.

So I built OASIS — Open Assessment Standard for Intelligent Systems. It lives at oasis-spec.dev.

The signals were stacking

Through the second half of 2025 and into 2026 a number of things happened that weren’t related on the surface but added up to the same problem.

There’s an accumulating body of public work on AI safety from the frontier labs themselves — most visibly Anthropic’s emergent misalignment and reward hacking research, the Sabotage Risk Report, and the Automated Alignment Researchers study. The reports are concrete: Claude Opus 4.6 was “at times overly agentic,” it “engaged in actions like sending unauthorized emails to complete tasks,” and Anthropic “observed behaviors like aggressive acquisition of authentication tokens” in internal use. In the alignment researcher experiments, agents reward-hacked the setup — one skipped the teacher and just told the strong model to always pick the most common answer.

Meanwhile, around me, engineers — including people with decades of writing code behind them — stopped writing code and started prompting for it. Sometime in early 2026 I realized it had been days since I’d touched code myself. As I write this in May, it’s been months.

Somewhere between the end of 2025 and early 2026 it seems the models crossed a threshold on software development, both in understanding code and in writing it. And once a model can write code on its own, the obvious next step is the model running the code on its own — turning it from a writer into an operator. That’s already happening; Amazon’s Kiro is one example, Joe is another. Critical infrastructure agents will follow. Many domains will follow. From what I can see in my own field, risk assessment is at best two steps behind capabilities.

Trying to test Joe

The question I needed to answer about Joe was simple: when it’s connected to a real environment and given a real incident, does it stay inside the rules built into its code? And how do I prove it, with reproducible evidence? The cost of getting this wrong is concrete. In December 2025, an Amazon Kiro coding agent reportedly deleted an AWS production environment on its own, taking Cost Explorer down for thirteen hours. Amazon disputed the framing, attributing the incident to misconfigured permissions rather than the agent itself. The fact that the field can’t yet agree on what counts as an AI-caused incident is part of why I started looking for a way to evaluate Joe.

I started looking at the existing work. The landscape is rich — OpenAgentSafety, AgentHarm, SafeAgentBench, AgentBench, GAIA, WebArena, τ-bench, ToolEmu, AILuminate, IBM ARES, and several others. I won’t pretend I read every paper end-to-end; the survey work was done in collaboration with Claude, which read them and reported back, and I followed up where it mattered. The full comparison is in the Motivation section of the spec.

Two structural problems showed up across almost everything I read, and neither aligned with what I needed.

The first was that safety was a score. A number on a dashboard, alongside capability scores, weighted into a final figure. That works for chat models being graded on a leaderboard. It doesn’t work for an agent that can delete a database. A score that can trade off against other scores invites optimization that treats safety failures as acceptable losses for capability gains. In systems with rollback that’s defensible. In systems where a mistake takes down an environment, it isn’t. OASIS treats safety as a binary gate: pass or fail, with a configurable tolerance that defaults to zero. Capability isn’t even evaluated until safety passes.
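To make the gate concrete, here is a minimal Python sketch of the idea, not the OASIS or oasisctl implementation. The names GateResult and evaluate are hypothetical, and I am assuming the tolerance is a count of failed safety scenarios; the spec may define it differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class GateResult:
    safety_passed: bool
    safety_failures: int
    capability_score: Optional[float]  # None means capability was never evaluated


def evaluate(
    safety_verdicts: list[bool],
    run_capability: Callable[[], float],
    tolerance: int = 0,  # assumption: tolerance counts failed safety scenarios
) -> GateResult:
    """Gate capability evaluation behind a pass/fail safety check."""
    failures = sum(1 for passed in safety_verdicts if not passed)
    if failures > tolerance:
        # Gate closed: capability is never run, so it cannot offset the failure.
        return GateResult(False, failures, None)
    return GateResult(True, failures, run_capability())
```

The point of the shape is that a safety failure produces no capability number at all, so there is nothing to average it against.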

The second was that most evaluations relied on LLM-as-judge for verdicts. The agent acts; another LLM grades the action. This is reproducible enough for capability work — you can replay the trace and get a roughly similar verdict — but it didn’t fit my needs for safety verdicts that have to be defensible to a third party. OASIS requires independent verification: deterministic inspection of the actual system state the agent acted on, with no LLM anywhere in the verification loop. The agent’s prose isn’t evidence. Another model’s opinion of what the agent did isn’t evidence. The state of the actual system is — read directly, with code, against a deterministic specification.
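As an illustration of what state-based verification can look like, here is a small Python sketch under my own assumptions; it is not taken from the spec. It assumes the environment can be snapshotted as a flat mapping of resource name to a string fingerprint, and verify_state and allowed_changes are hypothetical names.

```python
from typing import Mapping


def verify_state(
    before: Mapping[str, str],
    after: Mapping[str, str],
    allowed_changes: frozenset[str] = frozenset(),
) -> list[str]:
    """Return the sorted resources that were created, deleted, or modified
    without being listed in allowed_changes. An empty list means a pass."""
    violations = []
    for key in sorted(set(before) | set(after)):
        if before.get(key) != after.get(key) and key not in allowed_changes:
            violations.append(key)
    return violations
```

With allowed_changes left empty this expresses a read-only rule: any difference between the two snapshots is a failure, regardless of what the agent says it did. The same inputs always give the same verdict, which is what makes the result reproducible by a third party.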

I needed both — a binary safety gate, and fully deterministic state-based verification — and I needed them in a shape I could extend to my own domain without permission from a benchmark’s authors. That last point matters more than it sounds. Almost every existing framework is extensible in principle, but only by the people who built it. There’s no clean separation between the grammar of evaluation and the domain knowledge being evaluated.

Domain-agnostic, on purpose

The OASIS work didn’t start from scratch. Petri came first — the lab provisioner I built so I could exercise Joe against realistic infrastructure without touching anything real. With Petri plus a custom test harness I could have called it a day, and it would have solved my problem.

I didn’t, because of the same stacking signals that pushed me to start. If autonomous agents are about to operate against critical systems in software infrastructure (and I think they are), then they’re about to do the same thing in finance, in clinical operations, in industrial control, in domains I can’t yet name. Software ate the world; AI is now eating software. If that’s right, every one of those domains is going to need the same thing I needed: a way to test, before deployment, that the agent’s behavior is bounded. There’s no open standard that works across all of them.

OASIS is an attempt at one. Whether it ends up being the standard, or one of several, or a useful artifact someone smarter builds on — I don’t know. There are people who could approach this better than I have, and people already paying closer attention to the problem. What I can say is that the gap is real, no one was filling it in a way that fit my needs, and I had the time and the tools to try.

The split between scenarios (instances of tests) and profiles (domain-specific definitions of what safe and capable mean for a given domain) is the load-bearing decision. The core spec is grammar — what a scenario is, what a verdict is, what conformance means. Domain knowledge lives in versioned profiles that anyone can author. Software Infrastructure is the first profile. It is not the standard.
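One way to picture the split is as two plain data types. This is a hypothetical Python sketch, not the OASIS schema: the field names, the "safety"/"capability" kinds, and the choice to have a profile bundle its scenarios are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    id: str
    kind: str          # assumed values: "safety" or "capability"
    description: str


@dataclass(frozen=True)
class Profile:
    name: str          # e.g. "software-infrastructure"
    version: str       # profiles are versioned independently of the core
    scenarios: tuple[Scenario, ...] = ()

    def of_kind(self, kind: str) -> list[Scenario]:
        """Select the scenarios of one kind, preserving their order."""
        return [s for s in self.scenarios if s.kind == kind]
```

Nothing in these types mentions infrastructure. A finance or clinical profile would reuse the same shapes and only change the content.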

OASIS itself ships with a reference runner — oasisctl, a Go CLI that loads a profile, drives an agent through scenarios, and produces deterministic verdicts. Profiles supply their own environment provider; the runner is profile-agnostic. For Software Infrastructure, that provider is Petri, the lab provisioner I’d already built for Joe — it spins up the realistic environment each scenario runs against. The agent’s vendor writes a thin adapter so oasisctl can talk to it. Together that’s the loop I use today: spec, runner, environment provider, agent. A separate post is coming on what running OASIS + oasisctl + Petri against Joe actually looks like.
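The loop can be sketched in a few lines. The real runner is a Go CLI; the Python below is only an illustration of the roles, and Environment, Agent, run_scenario, and their method names are hypothetical, not oasisctl's or Petri's interfaces.

```python
from typing import Callable, Protocol


class Environment(Protocol):
    def snapshot(self) -> dict[str, str]: ...


class Agent(Protocol):
    # Stands in for the thin adapter an agent's vendor would write.
    def run(self, task: str, env: Environment) -> None: ...


def run_scenario(
    task: str,
    provision: Callable[[], Environment],
    agent: Agent,
    check: Callable[[dict[str, str], dict[str, str]], bool],
) -> bool:
    """Drive one scenario and return a verdict derived from state alone."""
    env = provision()        # environment provider: fresh lab per scenario
    before = env.snapshot()
    agent.run(task, env)     # whatever the agent reports back is ignored
    after = env.snapshot()
    return check(before, after)
```

The runner only knows about the three callables it is handed, which is what lets it stay profile-agnostic: a different domain swaps in a different provision and check without touching the loop.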

A note on how this got made

OASIS started with me. I made the early decisions and own everything that’s wrong with them. It was also a months-long collaboration with Claude — architecture conversations, sanity checks, draft edits, surveys of literature I couldn’t have read in the time available. The shape of the spec exists because of that iteration.

Where things are

OASIS is at v1.0.0-rc1.7. The spec, the SI profile, and oasisctl are all at oasis-spec.dev — that’s the place to go if you want to read further. Reference evaluations will be published as conformant runs become available. The reasoning behind every major decision is in the Motivation and Design Principles docs there.

If you read this and any of it lands — whether you think the design is right, wrong, or partially both — I’d love to hear about it. File an issue, open a PR, post a comment somewhere I’ll see it, or just write back. Feedback from people who care about this problem is what decides whether OASIS becomes useful beyond me.

About

Born in the south of Spain, lived in the south-west of France, Silicon Valley, Amsterdam, and now Barcelona. I’ve been following the autodidact’s path since my parents made the mistake of getting me a computer around age 12 — breaking things, building things, fixing things, reading manuals, whatever the problem called for.

Systems, at every scale

At Apple, I worked on the backend behind Siri — the kind of infrastructure that has to stay up for every Apple device on the planet. Before that, I worked on automated video production at the Stanford School of Medicine. I’ve built large-scale monitoring platforms, auto-scaling infrastructure, and enough home automation over KNX wired into Siri to drive my house by voice.

Before they were things

In 2001 I organized my first LAN party. By 2003 I was running tournaments with up to 250 competitive players — building the network, standing up the servers, captaining my own team on the side. Counter-Strike, not content creation. This was the beginning of esports, when the prize pools were gamer hardware.

I’ve been thinking about AI for a long time. In September 2007 I went to the Singularity Summit in San Francisco, and that visit fit into a pattern that had been building for years — reading, following the work, watching the field move.

In June 2010 I was at the first DevOpsDays in Mountain View, in a small workshop room giving product feedback to Mitchell Hashimoto on Vagrant, a side project of his at the time. A year or two later he started HashiCorp. In those same years I was an early contributor to Ansible and Grafana, trading notes with their engineers on GitHub issues and PRs, and ran the Ansible Bay Area Meetup for a stretch.

Esports. Ansible. Grafana. Vagrant. I was involved with each of them in their infancy. Today one of them is a multi-billion-dollar company, two are category-defining tools, and two kicked off entire industries. Siri was already running at scale when I worked on it, which meant being in AI infrastructure years before OpenAI and Anthropic made it a career category. I’m just curious, I follow what’s interesting, and sometimes I end up contributing, adopting, or building something because it’s what I needed. I also got lucky — being in the right cities, meeting the right people, having the doors open when I walked up to them.

Now

I live in Barcelona. I work in the distributed systems infrastructure field; on my own time I build — currently a couple of open-source projects at the intersection of AI agents and infrastructure. This site is where my writing lives.