The Jagged Frontier

An agent has never once told me it was guessing. It hands me the wrong answer in the same voice it uses for the right one, formatted just as cleanly, sounding just as sure. That single fact is the reason working with these tools is a skill and not just a convenience. If the failures announced themselves, anyone could use AI well. They do not announce themselves. They arrive wearing the same face as the wins.

The research has a precise name for the shape of the problem. When BCG's consultants used a frontier model on tasks inside its range, they finished about a quarter faster and produced better work. On a task built to sit just outside that range, the same tool made them roughly nineteen percentage points less likely to get the answer right, because it was every bit as fluent on the problem it could not actually do. Dell'Acqua and his coauthors called the boundary a jagged technological frontier, and the word is exact. It is not a clean line with easy on one side and hard on the other. It is jagged. The tool is brilliant at something genuinely difficult and broken at something that looks adjacent and simpler, and the two sit right next to each other with no marker between them.

So the actual skill, the one that decides whether AI helps you or quietly costs you, is reading that boundary in real time. I have written about what working at that edge feels like day to day; this is the mechanism underneath the feeling. I have come to think of it as four moves, and most of working with agents is choosing the right one for the task in front of me.

Delegate, steer, verify, think alone

Delegate is for the deep interior of the frontier. The task is well inside what the tool does reliably, the stakes are low, and I can check the result in seconds. Renaming across a codebase, drafting boilerplate, translating a config from one format to another. I hand it off completely and barely look back. This is where the speed everyone talks about actually lives.

Steer is for tasks inside the frontier but near its jagged edge, where the tool can do the work but not unattended. I stay in the loop, read as it goes, correct course early and often. This is the line I have drawn between agent mode and autonomous mode: the difference is who closes the loop, me or the machine, and near the edge of the frontier it has to be me. Most of my real building is here. It is not hands-off and it is not me typing every character. It is a conversation where I supply the direction and the judgment and the agent supplies the speed and the patience.

Verify is the move people skip, and skipping it is where the bill comes due. The tool produced something plausible, but plausible is not the same as correct, and I cannot trust it on this particular task without checking it against something real: a test, a source, a second pass from a different angle. The output looking right is not evidence. It always looks right. That is the whole danger.

Think alone is the one that is easy to forget you still have. Some tasks are outside the frontier, where the tool will be confidently wrong and I am better off without it. And some tasks are inside it but I should still do them myself, because the thinking is the point. Closing the tool is a legitimate move, not a failure of nerve. The hard part is that from the inside, the task you should hand off and the task you should sit with alone can feel exactly the same.
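
As a memory aid only, the four moves can be written down as a tiny decision function. This is a rough sketch, not a method taken from the research: the function name, the five yes/no inputs, and the order of the branches are all my own assumptions, and it flattens into booleans exactly the judgment that is hard to make in practice.

```python
from enum import Enum


class Move(Enum):
    DELEGATE = "delegate"
    STEER = "steer"
    VERIFY = "verify"
    THINK_ALONE = "think alone"


def choose_move(*, inside_frontier: bool, near_edge: bool, low_stakes: bool,
                cheap_to_check: bool, thinking_is_the_point: bool) -> Move:
    """Hypothetical mapping from a task's traits to one of the four moves.

    Every input is a human judgment call; nothing here measures it for you.
    """
    # Outside the frontier, or the thinking itself is the goal: close the tool.
    if not inside_frontier or thinking_is_the_point:
        return Move.THINK_ALONE
    # Inside but close to the edge: stay in the loop and correct early.
    if near_edge:
        return Move.STEER
    # Deep interior, low stakes, checkable in seconds: hand it off.
    if low_stakes and cheap_to_check:
        return Move.DELEGATE
    # Otherwise the output has to be checked against something real.
    return Move.VERIFY


if __name__ == "__main__":
    rename_task = choose_move(inside_frontier=True, near_edge=False,
                              low_stakes=True, cheap_to_check=True,
                              thinking_is_the_point=False)
    print(rename_task.value)
```

The function is trivial on purpose. All of the difficulty lives in filling in its arguments honestly, and no code can do that part.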

Why it is hard from the inside

The reason this takes judgment and not a checklist is that the two signals you would naturally use are decoupled. Fluency is constant. Competence is jagged. The model writes with the same easy authority whether it is right or wrong, so the confidence in the output tells you nothing about the correctness of it. Every instinct you have for reading a human, where hesitation and hedging and a careful tone usually track real uncertainty, points you the wrong way.

That decoupling has a name in the literature too. Automation bias is the tendency to over-rely on an automated recommendation precisely when it looks fluent and authoritative, and a 2025 review traces it through medicine, law, and public administration, anywhere a confident machine answer meets a tired human. The fix the researchers settle on is not "trust the AI" and not "be skeptical of the AI." Both of those are too blunt to use. The goal is trust calibration, matching your reliance to the tool's actual capability on this specific task. Appropriate reliance is the entire game, and it has to be recomputed every time, because the frontier is jagged and the task just changed.
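
One way to make "appropriate reliance" concrete is to keep a record of how the tool actually did on each kind of task, counting only results that were checked. The sketch below is a hypothetical illustration: the class name, the task labels, and the idea of using raw observed accuracy are assumptions of mine, not something the review prescribes.

```python
from collections import defaultdict
from typing import Optional


class ReliabilityLedger:
    """Hypothetical per-task-type tally of verified outcomes."""

    def __init__(self) -> None:
        # task_type -> [number correct, number checked]
        self._counts = defaultdict(lambda: [0, 0])

    def record(self, task_type: str, was_correct: bool) -> None:
        """Log one result that was checked against something real."""
        counts = self._counts[task_type]
        counts[0] += int(was_correct)
        counts[1] += 1

    def observed_accuracy(self, task_type: str) -> Optional[float]:
        """Share of checked results that were right, or None if never checked."""
        correct, total = self._counts.get(task_type, (0, 0))
        return None if total == 0 else correct / total


if __name__ == "__main__":
    ledger = ReliabilityLedger()
    ledger.record("config translation", True)
    ledger.record("date arithmetic", False)
    print(ledger.observed_accuracy("config translation"))
    print(ledger.observed_accuracy("never tried"))
```

Returning None for a task type with no history is deliberate: no evidence is different from bad evidence, and the fluent output should not be allowed to fill that gap.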

Building the fence around the edge

Because I cannot hold all of that in my head on every task, I do not try. I build the judgment into the system instead of leaning on my own attention to supply it fresh each time.

The most useful thing I do is keep an adversarial second agent whose only job is to attack the first one's work, to try to refute it rather than admire it. A model is good at finding holes in output when you point it at the output and tell it to be hostile, and it does not get tired or invested the way I do. I run my code review this way on purpose: review the change, not the typing, and assume the confident draft is hiding something until a second pass proves it is not. I build gates that close the loops I would otherwise leave open, tests that have to pass, checks that have to clear, so that the verify move is not relying on me remembering to make it.
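
A gate in this sense can be sketched as a list of checks that a draft has to clear before it is accepted. Everything named below is hypothetical: the two checks are simple stand-ins, and the "hostile review" is a stub where a real setup would call a second agent and parse its objections. It illustrates the shape, not my actual tooling.

```python
from typing import Callable, List, Tuple

# A check takes the draft text and returns (passed, reason shown on failure).
Check = Callable[[str], Tuple[bool, str]]


def no_unresolved_markers(draft: str) -> Tuple[bool, str]:
    """Stand-in mechanical check: reject drafts with leftover markers."""
    ok = "TODO" not in draft and "FIXME" not in draft
    return ok, "draft still contains TODO/FIXME markers"


def hostile_review_stub(draft: str) -> Tuple[bool, str]:
    """Stub for the adversarial second pass (assumption: a real version
    would ask a second agent to refute the draft). Here it only flags a
    bare 'except:' clause."""
    ok = "except:" not in draft
    return ok, "reviewer flagged a bare 'except:' that swallows errors"


def run_gates(draft: str, checks: List[Check]) -> Tuple[bool, List[str]]:
    """Run every check; accept the draft only if none of them fails."""
    failures = []
    for check in checks:
        passed, reason = check(draft)
        if not passed:
            failures.append(reason)
    return len(failures) == 0, failures


if __name__ == "__main__":
    draft = "try:\n    load()\nexcept:\n    pass  # TODO handle this\n"
    accepted, failures = run_gates(draft, [no_unresolved_markers, hostile_review_stub])
    print(accepted, failures)
```

The point of the structure is that acceptance is decided by the checks and not by how finished the draft looks, so the verify move happens even on the days I would have forgotten it.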

The other half is knowing my own domain's edges, which is the part that does not transfer and cannot be downloaded. Working as a senior engineer with these tools, I can feel where the agent is reliable and where I have to take the wheel, and that feel is just the accumulated memory of every place it has burned me before. It is the most valuable thing I own in this work, and a junior cannot have it yet, which is its own problem and the subject of its own essay.

The frontier moves

One more thing keeps this from ever becoming a fixed rule. The frontier moves. Every model release redraws the line, usually outward, sometimes in strange local ways, and a task that was outside the boundary last quarter is comfortably inside it now. So the boundary is not something you learn once and file away. It is something you keep testing, deliberately, by handing the tool things at the edge of what you think it can do and watching where it breaks. The people who decided two years ago what AI was bad at and stopped checking are now wrong in both directions, trusting it where it has gotten dangerous and avoiding it where it has gotten good.

This is why I never have a clean answer when someone asks whether AI is overhyped. The question assumes a fixed answer to a moving target. The answer is that it is extraordinary inside the frontier and a liability just outside it, and the only durable skill is the one that tells the two apart on a specific task, today, with this model. That skill is not prompting. Prompting is the cheap part, the part a cheat sheet can teach. This is judgment under uncertainty, and it is the thing that decides how heavy the real bill for AI turns out to be. Guess the frontier right and the tool extends you. Guess it wrong and you have just paid full price to be confidently misled.