devonkelley's comments | Hacker News

The add/remove churn is the thing nobody talks about enough. We track net lines changed vs gross lines changed as a ratio and it's genuinely horrifying sometimes. An agent will touch 400 lines to net 30 lines of actual progress. And the worst part is the removed code was often fine, the agent just decided to refactor something it didn't need to touch while it was in there. Shorter, tighter loops with clearer scope boundaries have helped us more than any orchestration improvement.
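For concreteness, here's roughly how a gross-vs-net ratio can be computed from `git diff --numstat` output. This is a hedged sketch with made-up helper names, not a real tool:

```python
def parse_numstat(text: str) -> list[tuple[int, int]]:
    """Parse `git diff --numstat` output into (added, removed) per file."""
    stats = []
    for line in text.splitlines():
        added, removed, _path = line.split("\t", 2)
        if added != "-":  # "-" marks binary files, which have no line counts
            stats.append((int(added), int(removed)))
    return stats

def churn_ratio(stats: list[tuple[int, int]]) -> float:
    """Gross lines touched divided by net lines changed.

    A ratio near 1.0 means most edits were real progress; touching
    400 gross lines for 30 net lines is a ratio of roughly 13.
    """
    added = sum(a for a, _ in stats)
    removed = sum(r for _, r in stats)
    gross = added + removed
    net = abs(added - removed)
    return gross / net if net else float("inf")
```

Feed it the output of `git diff --numstat main...HEAD` after each agent run and alert when the ratio spikes.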


The most frustrating thing I have seen with AI coding is that it tends to add vestigial bits of code for no reason: classes that nobody ever uses at any point in the commit history, or code that gets added, and then when we decide to implement the feature a different way, parts of the old implementation stick around.


The approval-link pattern for gating dangerous actions is something I keep coming back to as well, way more robust than trying to teach the agent what's "safe" vs not. How do you handle the case where the agent needs the result of the gated action to continue its chain? Does it block and wait, or does it park the whole task? The suspend/resume problem is where most of these setups get messy in my experience.
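To make the suspend/resume question concrete, the minimal version I've seen work is to park the task state keyed by a one-time token, and have the approval link's handler run the gated action and invoke a stored continuation. A hypothetical sketch; the class name and URL are made up:

```python
import uuid

class ApprovalGate:
    """Park a task on a gated action; the approval link resumes it.

    Illustrative only: a real version would persist `pending` to a
    store and expose `approve` behind an authenticated HTTP endpoint.
    """
    def __init__(self):
        self.pending = {}  # token -> (task_state, action, resume)

    def request(self, task_state, action, resume):
        """Suspend: stash everything needed to continue, return a link."""
        token = uuid.uuid4().hex
        self.pending[token] = (task_state, action, resume)
        return f"https://approvals.example/ok/{token}"  # sent to a human

    def approve(self, token):
        """Resume: run the gated action, feed its result back to the task."""
        task_state, action, resume = self.pending.pop(token)
        return resume(task_state, action())
```

The nice property is that "block and wait" vs "park the whole task" stops being a distinction: the task is always parked, and the continuation decides what happens next.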


I haven't needed that yet! But it seems like the agent could easily be notified of completed activities.


These are real people dude. I know some of them. Some are users or friends who came to comment on our post. They aren't bots and neither am I. Just new to HN.


@tomhow I'm emailing to verify that I am real and that the comments on my post are not bots; I can verify whatever you need me to. It's annoying to be flagged when I am just new here and trying to be part of the community.


I never said you're a bot or that they're not real people. They're mostly people you know, or your sockpuppets. Most people who do this are real people.

https://news.ycombinator.com/item?id=46719034

https://news.ycombinator.com/item?id=46742799

https://news.ycombinator.com/item?id=46290617

https://news.ycombinator.com/item?id=46487533

None of these are "bots".

https://news.ycombinator.com/threads?id=lighthouse1212

https://news.ycombinator.com/item?id=46720287

https://news.ycombinator.com/threads?id=Phil_BoaM

All "real people", all posting multiple LLM generated comments.


The prompt injection concerns are valid, but I think there's a more fundamental issue: agents are non-deterministic systems that fail in ways that are hard to predict or debug.

Security is one failure mode. But "agent did something subtly wrong that didn't trigger any errors" is another. And unlike a hacked system where you notice something's off, a flaky agent just... occasionally does the wrong thing. Sometimes it works. Sometimes it doesn't. Figuring out which case you're in requires building the same observability infrastructure you'd use for any unreliable distributed system.

The people running these agents connected to their email or filesystem aren't just accepting prompt injection risk. They're accepting that the system will randomly succeed or fail at tasks depending on model performance that day, and they may not notice the failures until later.


How are these agents stress tested today? Are there tools that are typically being used for QA and/or security?


Running agents in production, I've stopped trying to figure out why things degrade. The answer changes weekly.

Model drift, provider load, API changes, tool failures - it doesn't matter. What matters is that yesterday's 95% success rate is today's 70%, and by the time you notice, debug, and ship a fix, something else has shifted.

The real question isn't "is the model degraded?" It's "what should my agent do right now given current conditions?"

We ended up building systems that canary multiple execution paths continuously and route traffic based on what's actually working. When Claude degrades, traffic shifts to the backup path automatically. No alerts, no dashboards, no incident.
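A stripped-down sketch of that routing idea (my own names, not the real system): keep a small exploration budget so every path is continuously canaried, and otherwise send traffic to whatever has been succeeding recently.

```python
import random

def route(paths: dict[str, list[bool]], epsilon: float = 0.1) -> str:
    """Pick an execution path by recent success rate.

    `paths` maps path name -> recent success/failure history. With
    probability epsilon we canary a random path, so a recovered
    primary gets re-detected; otherwise we pick the best recent
    success rate. Paths with no history default to 0.5 so they
    still get tried.
    """
    if random.random() < epsilon:
        return random.choice(list(paths))

    def rate(history: list[bool]) -> float:
        return sum(history) / len(history) if history else 0.5

    return max(paths, key=lambda name: rate(paths[name]))
```

When the primary path's recent window fills with failures, the next call shifts to the backup with no human in the loop, which is the "no alerts, no dashboards, no incident" behavior described above.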

Treating this as a measurement problem assumes humans will act on the data. At scale, that assumption breaks.


LLM-generated comments are so obvious. Please just talk from your personal experience; nobody cares about this imagined experience.


Wow, yes. You nailed the framing. Autonomous control plane is the perfect way to describe Kalibr.

Defining success: We don't normalize it. Teams define their own outcome signals (latency, cost, user ratings, task completion, etc). You don't need perfect attribution to beat static configs; even noisy signals surface real patterns when aggregated correctly.

Oscillation: Thompson Sampling. Instead of greedily chasing the current best path, we maintain uncertainty estimates and explore proportionally. Sparse or noisy outcomes widen confidence intervals, which naturally dampens switching. Wilson scoring handles the low-sample edge cases without the wild swings you'd get from raw percentages.
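For readers unfamiliar with the combination, a toy version of Thompson Sampling over Beta posteriors plus the Wilson lower bound looks roughly like this (my sketch of the general technique, not Kalibr's code):

```python
import math
import random

def thompson_pick(arms: dict[str, tuple[int, int]]) -> str:
    """`arms` maps path -> (successes, failures). Sample each Beta
    posterior and take the max: sparse data means a wide posterior,
    which keeps exploration alive and dampens abrupt switching."""
    return max(arms, key=lambda a: random.betavariate(arms[a][0] + 1,
                                                      arms[a][1] + 1))

def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval: a pessimistic
    success-rate estimate that stays sane at low sample counts
    (1 success in 1 trial scores about 0.21, not 1.0)."""
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom
```

The Wilson bound is what prevents a path with one lucky success from looking better than a path with ninety successes out of a hundred.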

Confidence/regret: Explicit in the routing math. Every path carries uncertainty that decays with evidence. The system minimizes cumulative regret over time rather than optimizing point-in-time decisions.

The gap we're closing is exactly what you mentioned. Self-correcting instead of babysat.


That makes a lot of sense, and I like that you’re being explicit about regret minimization rather than chasing local optima.

The Thompson Sampling + Wilson score combo is a pragmatic choice. In practice, most agent systems I see fail not because they lack metrics, but because they overreact to them. Noisy reward signals plus greedy selection is how teams end up whipsawing configs or freezing change altogether. Treating uncertainty as a first-class input instead of something to smooth away is the right move.

I also agree with your point on attribution. Perfect attribution is a trap. In real production environments, partial and imperfect outcome signals still dominate static configs if the system can reason probabilistically over time. This mirrors what we learned in reliability and delivery metrics years ago: trend dominance beats point accuracy.

One area I’d be curious about as this matures is organizational adoption rather than the math:

- How teams reason about defining outcomes without turning it into a governance bottleneck

- How you help users build intuition around uncertainty and regret so they trust the system when it routes “away” from what feels intuitively right

- Where humans still need to intervene, if anywhere, once the control plane is established

If this holds up across long-tail tasks and low-frequency failures, it feels like a real step toward agents that behave more like adaptive systems and less like fragile workflows with LLMs bolted on.

Appreciate the thoughtful reply.

