
> there are clearly reasons to prefer some possible programs over others even when they all satisfy the same constraints

Maintainability is a big one missing from the current LLM/agentic workflow.

When business needs change, you need to be able to add on to the existing program.

We create feedback loops via tests to ensure programs behave according to the spec, but we have little to nothing that gives feedback on code quality or maintainability.


You (in theory) have more control over the quality of the team you are managing than over the quality of the models you are using.

And the quality of the code models put out is, in general, well below the average output of a professional developer.

It is, however, much faster, which makes the gambling loop feel better. Buying and holding a stock for a few months doesn't feel the same as playing a slot machine.


One difference is those developers are moral subjects who feel bad if they screw up whereas a computer is not a moral subject and can never be held accountable.

https://simonwillison.net/2025/Feb/3/a-computer-can-never-be...


Right, you need to hire a scapegoat. Usually the tester has that role: little influence but huge responsibility for quality.

You have a lot of control over LLM quality. There are different models available, and even different effort settings for those models produce different outcomes.

E.g. see the "SWE-Bench Pro (public)" heading on this page: https://openai.com/index/introducing-gpt-5-4/ , which shows reasoning efforts from none to high.

Of course, they don't learn like humans, so you can't do the trick of hiring someone less senior but with great potential and then mentoring them. Instead it's more of an up-front price you have to pay. The top models at the highest settings obviously form a ceiling, though.


You also have control over the workflow they follow and the standards you expect them to stick to, through multiple layers of context. Expecting a model to understand your workflow and standards without doing the work of writing them down is like expecting a new hire to know them without any onboarding. Allowing bad AI code into your production pipeline is a skill issue.

What theory is that?

My experience is the absolute opposite. I am much more in control of quality with AI agents.

I am never letting junior to midlevels into my team again.

In fact, I am not sure I will allow any form of manual programming in a year or so.


> I am never letting junior to midlevels into my team again

Exactly. You control the quality of the people in your team. You can train, fire, hire, etc until you get the skill level you want.

You have effectively no control over the quality of the output from an LLM. You get what the frontier labs give you and must work with that.


That is not correct.

It is much easier to control the quality of an AI than of inexperienced developers.


I think we are talking past each other.

> I am never letting junior to midlevels into my team again

My point is, you control the experience level of the engineers on your team. The fact that you can say you won't let junior or midlevels on your team proves that.

You do not have that level of control with LLMs. Anthropic and OpenAI are roughly the same quality at any given time. The rest are not useful.


Eh. You want a good mix of experience levels; what really matters is that everyone is talented. Less experienced colleagues are unburdened by yesterday’s lessons that may no longer be relevant today, and they don’t have the same blind spots.

Also, our profession is doomed if we won’t give less experienced colleagues a chance to shine.


Our profession is likely doomed not because we don't train people, but because of a lack of demand.

At least I’m not alone.

My company has a vibe coded leaderboard tracking AI usage.

Our token usage and number of lines changed will affect our performance review this year.


I have started using the most token-intensive model I can find and asking for complicated tasks (rewrite this large codebase, review the resulting code, etc.)

The agent will churn in a loop for a good 15-20 minutes and make the leaderboard number go up. The result is verbose and useless but it satisfies the metrics from leadership.


Congrats on becoming AI native

> Our token usage and number of lines changed will affect our performance review this year.

The AI-era equivalent of that old Dilbert strip about rewarding developers directly for fixing bugs ("I'm gonna write me a new mini-van this afternoon!") just substitute intentional bug creation with setting up a simple agent loop to burn tokens on random unnecessary refactoring.


Could you both name and shame?

Name pretty much any company. Every one of my friends has said their company is doing this, across three countries, mind you. Especially if they already use the Microsoft Office suite; those folks got sold Copilot on a deal, it seems.

> or ignore Ellison's legacy, unfortunately.

Can you elaborate?


I've tried it, and I'm not convinced I got measurably better results than just prompting Claude Code directly.

It absolutely tore through tokens though. I don't normally hit my session limits, but hit the 5-hour limits in ~30 minutes and my weekly limits by Tuesday with GSD.


Same experience on multiple occasions.

I've never needed to give any documentation to buy a movie ticket.

Isn’t “bad output” already worst case? Pre-LLMs correct output was table stakes.

You expect your calculator to always give correct answers, your bank to always transfer your money correctly, and so on.


> Isn’t “bad output” already worst case?

Worst case in a modern agentic scenario is more like "drained your bank account to buy bitcoin and then deleted your harddrive along with the private key"

> Pre-LLMs correct output was table stakes

We're only just getting to the point where we have languages and tooling that can reliably prevent segfaults. Correctness isn't even on the table, outside of a few (mostly academic) contexts.


> Worst case in a modern agentic scenario is more like "drained your bank account to buy bitcoin and then deleted your harddrive along with the private key"

Hence my interest in sandboxes!


> drained your bank account to buy bitcoin and then deleted your harddrive

These fall under what I meant by correct output: the software does what you expect it to.

> We're only just getting to the point where we have languages and tooling that can reliably prevent segfaults

This is not really an output issue IMO. This is a failing edge case.

LLMs are moving the industry away from trying to write software that handles all possible edge cases gracefully and towards software developed very quickly that behaves correctly on the happy paths more often than not.


I've seen plenty of decision makers act on bad output from human employees in the past. The company usually survives.

The question then becomes: can LLMs generate code at close to the same quality as professionals?

In my experience, they are not even close.


In a one-shot scenario, I agree. But LLMs make iteration much faster. So the comparison is not really between an AI and an experienced dev coding by hand, it's between the dev iterating with an LLM and the dev iterating by hand. And the former can produce high-quality code much faster than the latter.

The question is, what happens when you have a middling dev iterating with an LLM? In that case, the drop in quality is probably non-linear: it can get pretty bad, pretty fast.


Well, if you keep in mind that "professionals" means "people paid to write code" then LLMs have been generating code at the same quality OR BETTER for about a year now. Most code sucks.

If you compare it to beautiful code written by true experts, then obviously not, but that kind of code isn't what makes the world go 'round.


We should qualify that kind of statement; it’s valuable to define just what percentile of “professional developers” the quality falls into. It will likely never replace p90 developers, for example, but it’s better than developers somewhere between there and p10. (Arbitrary numbers, for the sake of example.)

Can you quantify the quality of a p90 or p10 developer?

I would frame it differently. There are developers successfully shipping product X. Those developers are, on average, as skilled as necessary to work on project X; otherwise they would have moved on, or the project would have failed.

Can LLMs produce the same level of quality as project X developers? The only projects I know of where this is true are toy and hobby projects.


> Can you quantify the quality of a p90 or p10 developer?

Of course not; you have switched “quality” in this statement to modify the developer instead of their work. Regarding the work: each project, as you acknowledged in your reply, has an average quality for its code. Some developers bring that down on the whole, others bring it up. An LLM would have a place somewhere on that spectrum.


This only helps if you notice the code is bad. Especially in overly complex code, you have to really be paying attention to notice when a subtle invariant is broken, an edge case is missed, etc.

It's the same reason a senior + one junior engineer is about as fast as a senior + 100 junior engineers: the senior's review time becomes the bottleneck and does not scale.

And even with the latest models and tooling, the quality of the code is below what I expect from a junior. But you sure can get it fast.


This is the most important point in the thread. The study measures code complexity but the REAL bottleneck is cognitive load (and drain) on the reviewer.

I've been doing 10-12 hour days paired with Claude for months. The velocity gains are absolutely real; I am shipping things I would never have attempted solo before AI, and shipping them faster than ever. BUT the cognitive cost of reviewing AI output is significantly higher than reviewing human code. It's verbose, plausible-looking, and wrong in ways that require sustained deep attention to catch.

The study found "transient velocity increase" followed by "persistent complexity increase." That matches exactly. The speed feels incredible at first, then the review burden compounds and you're spending more time verifying than you saved generating.

The fix isn't "apply traditional methods" — it's recognizing that AI shifts the bottleneck from production to verification, and that verification under sustained cognitive load degrades in ways nobody's measuring yet. I think I've found some fixes to help me personally with this and for me velocity is still high, but only time will tell if this remains true for long.


The part that gets me is when it passes lint, passes tests, and the logic is technically correct, but it quietly changed how something gets called. Rename a parameter. Wrap a return value in a Promise that wasn't there before. Add some intermediate type nobody asked for. None of that shows up as a failure anywhere. You only notice three days later when some other piece of code that depended on the old shape breaks in a way that has nothing to do with the original change.
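The Promise wrap is the classic case for me. A minimal sketch of the failure mode (all names hypothetical, not from any real codebase): a helper quietly becomes async, its own updated tests still pass, and a distant caller that was never touched starts silently misbehaving.

```javascript
// Before the refactor, getUserRole(userId) returned a string synchronously.
// The AI "cleanup" made it async. Lint passes, and the helper's own
// (updated) tests still pass, because they await the result.
async function getUserRole(userId) {
  return userId === 1 ? "admin" : "viewer";
}

// A distant caller, untouched by the change, still treats the result as a
// plain string. Nothing throws; the comparison just never matches again.
function canDelete(userId) {
  const role = getUserRole(userId); // now a Promise, not "admin"
  return role === "admin";          // always false from here on
}

canDelete(1); // false, even for the admin user
```

No failure shows up until something downstream depends on the old shape.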

> The study found "transient velocity increase" followed by "persistent complexity increase."

Companies facing this reality are of course typically going to use AI to help manage the increased complexity. But that leads quickly to AI becoming a crutch, without which even basic maintenance could pose an insurmountable challenge.


>The fix isn't "apply traditional methods"

I would argue they are. Those traditional methods aim at keeping complexity low so that reading code is easier and requires less effort, which accelerates code review.


This matches what I've seen. The bottleneck moved from writing to reviewing, but we didn't update the process to reflect that. What helped our team was shifting to smaller, more frequent commits with tight scope descriptions — reviewing five 30-line diffs is dramatically less taxing than one 150-line diff, even though the total volume is the same. The cognitive load is nonlinear.

> it runs somewhat contrary to standard ideas of class and inequality.

Can you elaborate?


When people think "top executives" they think of a very, very small group of people making tens of millions of dollars a year or much more. The reality is that that's not the case.
