It isn't. Anthropic tried building a fairly simple piece of software (a C compiler) with a full spec, thousands of human-written tests, and a reference implementation - all of which were available to the agent, and all of which the model had been trained on. It's hard to imagine a better-tested, better-specified project, and we're talking about 20KLOC. Their agents worked for two weeks and produced a 100KLOC codebase that was unsalvageable - any fix to one thing broke another [1]. Again, they were attempting to write software that's smaller, better tested, and better specified than virtually any piece of real software, and the agents still failed.
Today's agents are simply not capable enough to write evolvable software without close supervision to save them from the catastrophic mistakes they make on their own with alarming frequency.
Specifically, if you look at agent-generated code, it is typically highly defensive, even against bugs in its own code. It establishes an invariant and then writes a contingency in case the invariant doesn't hold. I once asked it to maintain some data structure so that it could avoid a costly loop. It did, but in the same round it added a contingency (that uses the expensive loop) in the code that consumes the data structure in case it maintained it incorrectly.
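To make the pattern concrete, here is a minimal sketch of what I mean. The `Registry` class and its method names are made up for illustration; this is not the actual code from my project. The index exists so lookups can skip the linear scan, and the "defensive" consumer quietly falls back to the scan anyway:

```python
class Registry:
    """Hypothetical example: an id -> record index kept alongside a list."""

    def __init__(self):
        self._records = []
        # Intended invariant: every record in _records is also in _by_id.
        self._by_id = {}

    def add(self, record_id, payload):
        record = (record_id, payload)
        self._records.append(record)
        self._by_id[record_id] = record

    def find_trusting(self, record_id):
        # Relies on the invariant: a miss means the record does not exist.
        return self._by_id.get(record_id)

    def find_defensive(self, record_id):
        # The style described above: consult the index, then fall back to the
        # expensive loop "in case" the index was maintained incorrectly.
        hit = self._by_id.get(record_id)
        if hit is not None:
            return hit
        for record in self._records:
            if record[0] == record_id:
                return record
        return None


registry = Registry()
registry.add("a", 1)
registry.add("b", 2)

# Simulate a maintenance bug that breaks the invariant.
del registry._by_id["b"]

trusting_result = registry.find_trusting("b")
defensive_result = registry.find_defensive("b")
```

With the invariant broken, `find_trusting` visibly misses while `find_defensive` still returns the record, so the bug never surfaces and a reader can no longer tell whether the index is supposed to be authoritative.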
This makes it very hard for both humans and the agent to find later bugs and know what the invariants are. How do you test for that? You may think you can spec against that, but you can't, because these are code-level invariants, not behavioural invariants. The best you can do is ask the agent to document every code-level invariant it establishes and rely on it. That can work for a while, but after some time there's just too much, and the agent starts ignoring the instructions.
I think that people who believe that agents produce fine-but-messy code without close supervision either don't carefully review the code or abandon the project before it collapses. There's no way people who use agents a lot and supervise them closely believe they can just work on their own.
"Incomplete specs" is the way of the world. Even highly engineered projects like buildings have "incomplete specs" because the world is unpredictable and you simply cannot anticipate everything that might come up.
Lol, I largely agree with my beloved dissenters, just not on the same magnitude. I understand complete specs are impossible and equivalent to source code via declaration. My disagreement is with this particular part:
"It's more like asking the agent to build a building from floorplans and spec, and it produces everything in the right measurements and right colours and passes all tests. Except then you find out that the walls and beams are made of foam and the art is load-bearing."
If your test/design of a BUILDING doesn't include at least simulations/approximations of such easy-to-catch structural flaws, it's just bad engineering. Which rhymes a lot with the people that hate AI. By and large, they just don't use it well.
And sometimes it can't even handle it then. I was recently porting Ruby web code to Python. Agents were simultaneously surprisingly good (converting ActiveRecord to the SQLAlchemy ORM) and shockingly, incapably bad.
For example, Ruby uses blocks a lot. Ruby blocks are curious little thingies because they are arguably just syntax sugar for a HOF, but man it's great syntax sugar. Python has "yield", which is the same keyword Ruby uses for blocks but works fundamentally differently: instead of just calling a HOF, it's for producing a generator/iterator. There are some decorators (@contextmanager) that use yield's ability to "pause" execution in the function and send control flow back out of the function for a moment, which feels _even more_ like Ruby blocks. But it's a rather limited trick that requires the decorator to adapt the generator to a context manager, and there's just no good way to generalize that.
Somehow this is the perfect storm to make LLMs completely incapable of converting Ruby code that uses blocks for more than the basic iteration used in the stdlib. It will try to port to Python code that is either nonsensical, or uses yield incorrectly and doesn't actually work (and in a way that type checkers can even spot). And furthermore, even if you can technically whack it with a hammer until it works with yield, that's often not at all the way to do it. Ruby devs use blocks not-uncommonly, while Python devs are not really going to be using yield often at all, perhaps outside of @contextmanager. So the right move is usually to just restructure control flow to not need blocks/HOFs (or double down and explicitly pass in a function). (Rubyists will cringe at this, and rightly so... Ruby is often extraordinarily expressive.)
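A rough sketch of the two Python shapes I mean, with made-up names (`transaction`, `with_retries`, and `flaky` are hypothetical, not from the code I was porting). The first is the one place where yield genuinely resembles a Ruby block: a single yield wrapped by @contextmanager. The second is a block that may run several times and returns a value, which a context manager can't express, so the function is passed in explicitly:

```python
from contextlib import contextmanager

log = []


@contextmanager
def transaction(name):
    # Works because the body runs exactly once, at the single yield.
    # Roughly the shape of a Ruby method that yields once to its block.
    log.append(f"begin {name}")
    try:
        yield name
    except Exception:
        log.append(f"rollback {name}")
        raise
    else:
        log.append(f"commit {name}")


def with_retries(attempts, action):
    # In Ruby this would likely take a block; here the callable is explicit.
    # Assumes attempts >= 1.
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action(attempt)
        except ValueError as exc:
            last_error = exc
    raise last_error


with transaction("t1") as tx:
    log.append(f"work in {tx}")


def flaky(attempt):
    if attempt < 3:
        raise ValueError("not yet")
    return f"ok on attempt {attempt}"


result = with_retries(5, flaky)
```

Trying to write `with_retries` as a generator with yield inside the loop is the kind of thing the agents kept producing: @contextmanager requires the generator to yield exactly once, so the retry loop can't re-run the body of a `with` statement.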
The fact that such a simple language feature trips them up so completely is pretty odd to me. I guess maybe their training data doesn't include a lot of Ruby-to-Python conversions. Maybe that's indicative of something, but I digress.
There's a lot of work, some good and some really bad, about Fisher information and physics. Vijay Balasubramanian and Jim Sethna may interest you, though Sethna is more condensed matter. And of course Amari.
Sorry, just saw this. The first 50 or so pages are gold, not under a chapter proper. Then most of part IV for non-physicists, for exposure to the statistical mechanician's worldview, especially chapters 27-33. I'm not an expert on sections 1-3, so I can't make very high-value claims on their relative value. But every time I did look into topics covered in those chapters, I found clarity in explanation and the "how" of things by coming back to the book. The entire neural networks section contains tons of nuggets that proved both prescient and (mostly) timeless.
I think personally the unit you measure divergence in just doesn’t matter. Yes, nats is technically superior, but as long as you do it consistently, all that you really want to do is to measure how similar A is to B.
In that sense I think many explanations of KL are very convoluted.
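A quick sketch of why the unit is just a constant factor, using made-up discrete distributions (`p`, `q`, and `r` are illustrative only, and the helper assumes `q` is nonzero wherever `p` is):

```python
import math


def kl_divergence(p, q, base=math.e):
    """KL(p || q) for discrete distributions given as equal-length sequences.

    Assumes q[i] > 0 wherever p[i] > 0. Terms with p[i] == 0 contribute 0.
    """
    return sum(pi * math.log(pi / qi, base) for pi, qi in zip(p, q) if pi > 0)


p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]  # close to p
r = [0.1, 0.1, 0.8]  # far from p

kl_pq_nats = kl_divergence(p, q)
kl_pq_bits = kl_divergence(p, q, base=2)
kl_pr_nats = kl_divergence(p, r)
kl_pr_bits = kl_divergence(p, r, base=2)

# Nats and bits differ only by the factor ln(2), so any comparison of
# "how similar is A to B" comes out the same in either unit.
ratio = kl_pq_nats / kl_pq_bits
```

Here `ratio` equals ln 2 up to floating-point error, and `q` is closer to `p` than `r` is whichever unit you pick.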
It's been both normalized and suppressed. I'm old enough to remember not being able to point out SF crime problems without being called a fascist. It's denial, it's perverse. Noah Smith claims that our (USA) "solution" to it, besides just ignoring it, was basically giving up on cities and moving to suburbs.
If you want to hear from the man himself, see link below. It was a fairly soft interview. I listened mainly because it was Noah and wasn't expecting him to be so pro-surveillance. So, even though I don't agree with them, it might be worth listening to their reasoning.
You basically have to be involved if you're Meta. Even if there's only a 5% chance this AI stuff is as disruptive as the labs claim it is, you can't afford to miss out. Even if you're lagging the frontier, you must develop the competency internally. Otherwise you ignored a 5% chance of total annihilation, probably even exposing you to shareholder lawsuits.
This has consistently pissed me off. It seems like we all just accepted that whatever they define as "functioning"/"OK" is suitable. I see the status now shows, but there should be a very loud third party ruthlessly running continuous tests against all of them. Ideally it would also provide proof of the degradation we seem to all agree happens (looking at you, Gemini). Like a leaderboard focused on actual live performance. Of course they'd probably quickly game that too. But something showing time to first response, "global capacity reached" errors, etc., an effective throttling metric, an intelligence metric. Maybe with crowdsourced stats so they can't focus on improving the metrics for just the IPs associated with this hypothetical third-party performance watchdog.
The one that pissed me off the most was Gemini very clearly displaying 1) "user cancelled request" in the Gemini chat app and 2) "user quota reached" in the API. Both were blatant lies. In the latter case, you could find the actual cause, a global quota, later in the error message. I don't know why there isn't more outrage. I'm guessing this sort of behavior is not new, but it's never been so visible to me.
Gemini showed a partial outage on their status page, and once it was over after 10 days, the entry became a single-day partial outage. For me, it was nearly two weeks of 95% failure rates.
Yup, displays as an "auth" issue to me. Just a nice reminder that my original plan was to be provider agnostic, but everything was working so well with cc I lost sight lol.