> ... I immediately feel the need to go ask a fresh instance the question and/or another LLM
Not to criticize at all, but it's remarkable that LLMs have already become so embedded that when we get the sense they're lying to us, the instinct is to go ask another LLM and not some more trustworthy source. Just goes to show that convenience reigns supreme, I suppose.
What is that more trustworthy source, exactly? At least to me it feels like the internet age has eroded most things we considered trustworthy. Behind everything humans need there is some company or person willing to sell out trustworthiness for an extra dollar. Consumer protections get dumped in favor of more profit.
LLMs start to feel more like a harmless dummy compared to the amount of ill intent coming from other places. So yeah, I can see how it happens to people.
At the moment, maybe Google Search, throwing away the AI response at the top? Or Duck Duck Go, if you don't really trust Google?
I can see a day when even that won't be trustworthy, because too much AI slop output will wind up in the search corpus. But I don't think we're there yet.
> At the moment, maybe Google Search, throwing away the AI response at the top? Or Duck Duck Go,
Even past the summary and the ads, a huge share of the results that come back from both Google and DDG are AI generated. It's sometimes harder to find a reliable source for information in search results these days than it was 20 years ago.
Google is NNs all the way down these days. There might still be an honest index under it all, but a truly accurate representation of the Web has been effectively outlawed in the U.S. since the DMCA.
But they're not exactly lying. Lying assumes an intent to deceive. It's because we know an LLM's limitations that it makes sense to ask it the opposite question, or the question without context, etc.
If it was easy to look up/check the fact without an LLM, wary users probably wouldn't have gone to the LLM in the first place.
Funny thing for me is that it's not the LLM lying to me. It's the creators. The LLM is just doing what its weights tell it to. I'll admit, I went a bit nuclear the first time I ran one locally and watched its outputs and chain-of-thought diverge, demonstrating an intent to hide information. I'd never seen software straight up deceive before. Even obfuscated/anti-debug code is straightforward in doing what it does once you decompile the shit. To see a bunch of matrix math trying to perception-manage me on my own machine... I did not take it well.

It took a few days of cooling down and further research to reestablish firmly that any mendacity was a projection of the intent of the organization that built it. Once you realize that an LLM is basically a glorified influence agent/engagement pipeline built by someone else, so much clicks into place it's downright scary. Problem is, it's hard to realize that in the moment you're confronting the radical novelty of a computer doing things an entire lifetime of working professionally with computers tells you a computer simply cannot do. You have to get over the shock first. That shock is a hell of a hit.
They seem to be a victim of their own success. Their response times are quite bad, and it's widely believed they are doing something to degrade service quality (quantizing?) in order to stretch resources. They just announced that they're cutting their usage limits down during peak hours as well.
They're at serious risk of losing their lead with this sort of performance.
> it's widely believed they are doing something to degrade service quality (quantizing?) in order to stretch resources
God, I wish this inane bullshit would just fucking die already.
Models are not "degrading". They're not being "secretly quantized". And no one is swapping out your 1.2T frontier behemoth for a cheap 120B toy and hoping you wouldn't notice!
It's just that humans are completely full of shit, and can't be trusted to measure LLM performance objectively!
Every time you use an LLM, you learn its capability profile better. You start using it more aggressively at what it's "good" at, until you find the limits and expose the flaws. You start paying attention to the more subtle issues you overlooked at first. Your honeymoon period wears off and you see that "the model got dumber". It didn't. You got better at pushing it to its limits, exposing the ways in which it was always dumb.
Now, will the likes of Anthropic just "API error: overloaded" you on any day of the week that ends in Y? Will they reduce your usage quotas and hope that you don't notice because they never gave you a number anyway? Oh, definitely. But that "they're making the models WORSE" bullshit lives in people's heads way more than in any reality.
It's possible though - there was a bug once where a model pool instance wasn't updated properly and served a very old model for several months; whoever hit that instance would receive a response from a previous version of the model.
While it's true that people are naturally predisposed to invent the "secret quantizing" conspiracy regardless of whether the actual conspiracy exists or not, I think there's more to the story.
I've seen Sonnet consistently start hallucinating on the exact same inputs for a couple hours, and then just go back to normal like nothing ever happened. It may just be a combination of hardware malfunction + session pinning. But at the end of the day the effects are indistinguishable from "secret quantizing".
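One way to move from vibes to evidence here is to pin a few fixed prompts, store a baseline response, and periodically diff fresh responses against it. This is just a sketch of the comparison step; the prompt text and the drift threshold are made up for illustration, and a real check would call the model API to produce `current`.

```python
import difflib

def drift_score(baseline: str, current: str) -> float:
    """Return dissimilarity in [0, 1] between a stored baseline response and a
    fresh response to the same fixed prompt; higher means more drift."""
    return 1.0 - difflib.SequenceMatcher(None, baseline, current).ratio()

# Illustrative: an unchanged answer scores 0.0; alert above some threshold
# (0.5 here is arbitrary) and keep a history so a two-hour blip is visible.
baseline = "The capital of France is Paris."
current = "The capital of France is Paris."
print(drift_score(baseline, current))  # identical -> 0.0
```

Sequence similarity is crude (it flags benign rewording too), but a logged history of scores at least distinguishes "the model got worse for two hours" from "I started noticing flaws".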
You'll notice I specifically said "victims of their own success". Obviously these problems are induced by the fact that they have so many users. Blowing a lead due to inability to handle the demands of success is still a path to losing the lead.
Gemini CLI has been broken for the past 2-3 days, with no response from Google. Really embarrassing for a multi-trillion dollar company. At this point Codex is the only reliable CLI app, out of the big three.
Gemini CLI is absolutely terrible, not at all comparable to browser access. I've started using the 'AI Pro' tier lately and I regularly get 15-minute response times from Gemini 3 'Flash'.
> Any massive infra migration is going to cause issues.
What? No, no it's not. The entire discipline of infrastructure and systems engineering is dedicated to doing these sorts of things. There are well-worn paths to making stable changes. I've done a dozen massive infrastructure migrations, some at companies bigger than GitHub, and I've never once come close to this sort of instability.
This is a botched infrastructure migration, onto a frankly inferior platform, not something that just happens to everyone.
I share your concern. Airlines seem to be anticipating this. There was a recent publicized incident of American Airlines removing a woman from a flight for playing audio over her phone speakers. United has similar policies. As I understand it, both are saying they will ban passengers over it.
I've tried several of these sorts of things, and I keep coming away with the feeling that they are a lot of ceremony and complication for not much value. I appreciate that people are experimenting with how to work with AI and get actual value, but I think pretty much all of these approaches are adding complexity without much, or often any, gain.
That's not a reason to stop trying. This is the iterative process of figuring out what works.
Since we're comparing to Meta, you just have to look at the state of their publicly facing products that feature AI. Google has better AI models (Gemini, Nanobanana) and they've integrated them successfully into way more products than Meta has.
Meta spends a lot of money on AI research with little to show for it. As imperfect as Google may be, they're still doing much better.
Google knows how to do research - and at the very least lets other people figure out the products, and then becomes the #3 or #4 player.
Both GCP and Gemini are products of this. Modern cloud was arguably built by Google (think Chubby, GFS, Bigtable as building blocks) - they just spent 10 years ceding it to Amazon before competing.
These categories are extremely broad. Top Executive includes general managers, legislators, school superintendents, mayors, city administrators, and a lot of other government jobs. The name is misleading; it's basically non-frontline management.
Chief Executives is actually a specific sub-category of it and is, obviously, much smaller.
I'm not sure if this is what the writer was getting at, but I tend to check telemetry for my production applications regularly not because I'm looking for things that would fire alerts, but to keep a sense of what production looks like. Things like request rate, average latency, top request paths etc. It's not about knowing something is broken, it's about knowing what healthy looks like.
Understanding what your code looks like in production gives you a much better sense of how to update it, and how to fix it when it does inevitably break. I think having AI check this for you will make building that sense basically impossible, and that probably makes it a pretty bad idea.
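The habit described above can be made concrete with a tiny summary over request records, pulled from whatever telemetry store you use. The field names and sample data here are illustrative, not from the original comment:

```python
from collections import Counter
from statistics import mean

def summarize(requests):
    """Boil a window of request records down to the 'what does healthy look
    like' numbers: total count, average latency, and the most-hit paths."""
    paths = Counter(r["path"] for r in requests)
    return {
        "count": len(requests),
        "avg_latency_ms": round(mean(r["latency_ms"] for r in requests), 1),
        "top_paths": paths.most_common(3),
    }

# Hypothetical sample of one scrape window.
sample = [
    {"path": "/api/users", "latency_ms": 42},
    {"path": "/api/users", "latency_ms": 55},
    {"path": "/health", "latency_ms": 3},
]
print(summarize(sample))
```

The point isn't the script itself; it's that glancing at these same few numbers regularly is what builds the baseline that alerts alone never give you.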
This is a good answer, and I agree that having a good production intuition like this is important. You're probably also right that having AI do it probably doesn't get that value.
I'm not sure I'd do this once a day. I tend to take note of things to build that intuition when I have other reasons to go and look at dashboards, and we have a weekly SLO review as a team, but perhaps there's a place for this in some way.
Yeah, agreed. Daily isn't really necessary outside of initial launch and maybe a busy season. It's really just often enough to build a good sense of production use, and keep it up to date.
> You can be anti billionaire and still not be a fuckass racist
Genuinely, if you can't handle discussing a basic political disagreement without becoming apoplectic, you should take a breath and wait to respond. This is the opposite of what HN is for.