powera's comments | Hacker News

(January 2025)

I've been waiting for this update.

For many "simple" LLM tasks, GPT-5-mini was sufficient 99% of the time. Hopefully these models will handle even more, at closer to 100% accuracy.

The prices are up 2-4x compared to GPT-5-mini and nano. Were those models just loss leaders, or are these substantially larger/better?


So far on my (simple) benchmarks, GPT-5.4-mini is looking very good: it's about 30% faster than GPT-5-mini, gets 80% on the "how many Rs in Strawberry" test, and scores nearly perfectly on everything else I threw at it.

GPT-5.4-nano is less impressive. I would stick to gpt-5.4-mini where precise data is a requirement. But it is fast, and probably cheaper and better quality than an 8-20B parameter local model would be.

(See https://encyclopedia.foundation/benchmarks/dashboard/ for details. The data is moderately blurry: some outlier (15s) calls are included, a few benchmark questions are ambiguous, and some prices shown are very rough estimates.)
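For anyone who wants to reproduce the strawberry number, here's a rough sketch of the kind of harness I use, assuming the standard OpenAI Python client; the prompt wording and trial count are illustrative, not the exact benchmark code:

```python
# Rough sketch, not the exact benchmark harness. Assumes the standard
# OpenAI Python client and the "gpt-5.4-mini" model name from the release.
from openai import OpenAI

client = OpenAI()

def strawberry_trial() -> bool:
    resp = client.chat.completions.create(
        model="gpt-5.4-mini",
        messages=[{
            "role": "user",
            "content": "How many Rs are in the word 'strawberry'? Reply with just the number.",
        }],
    )
    return resp.choices[0].message.content.strip() == "3"

# 20 trials is illustrative; the dashboard runs many more calls per model.
passes = sum(strawberry_trial() for _ in range(20))
print(f"pass rate: {passes / 20:.0%}")
```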


For us it was also pretty good, but performance decreased recently, which forced us to migrate to haiku-4.5. More expensive, but much more reliable (when Anthropic is up, of course).

They don't change the model weights (no frontier lab does). If you have evals, and all your prompts and tool calls are the same, I'm curious how you're determining that performance decreased.

This looks like somebody re-releasing Qwen models to promote their own company. https://news.ycombinator.com/item?id=47217305 is the link to Qwen's repo.


If you want to have a chance at running a large model, it needs to be quantized. The unsloth user on Huggingface manages popular quantizations for many models, Qwen included, and I think he developed dynamic GGUF quantization.

Take Qwen/Qwen3.5-35B-A3B, for example: it's 72 GB, while unsloth/Qwen3.5-35B-A3B-GGUF has quantizations ranging from 9 to 38 GB.
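A minimal sketch of grabbing just one quantized file instead of the full 72 GB repo, assuming huggingface_hub is installed; the exact filename is a guess at the repo's naming convention, so check the repo's file list first:

```python
from huggingface_hub import hf_hub_download

# Download a single GGUF quantization rather than the whole repo.
# The filename below is a guess; look it up on the repo's "Files" tab.
path = hf_hub_download(
    repo_id="unsloth/Qwen3.5-35B-A3B-GGUF",
    filename="Qwen3.5-35B-A3B-Q4_K_M.gguf",
)
print(path)  # local cache path, ready to hand to llama.cpp
```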


Unsloth is one of the most well-known providers of model quantizations, if not the most well-known. The release post should of course reference the source, but most people probably use unsloth or bartowski quantizations anyway (those two are my go-tos), so the link is relevant and convenient.


Between this and 4.6's tendency to do so much more "exploratory" work, I am back to using ChatGPT Codex for some tasks.

Two months ago, Claude was great for "here is a specific task I want you to do to this file". Today, they seem to be pivoting towards "I don't know how to code but want this feature" usage. Which might be a good product decision, but makes it worse as a substitute for writing the code myself.


Have you played with the effort setting? I'm finding medium effort on 4.6 to give more satisfactory results for that kind of thing.


I feel the exact same way. Trying to cater to the "no-code" crowd is blurring the product's focus. It seems they've stuffed the system prompt with "be creative and explore" instructions, which kills determinism - so now we have to burn tokens just to tell it: "Don't think, just write the code"


Same here. Between this change to Claude Code and how Opus 4.6 is set up, they assume they can do things autonomously. But in my experience, they really can't. Letting a model overthink while it's on the wrong track is what leads to AI slop.


I've been working on a similar app called Trakaido. (Yes, it's also AI generated content).

The demo leans heavily on "choose the words for the sentence", which avoids spelling/keyboard issues and maybe generalizes better around the problems of N->N language mappings. The "decoy selection" for multiple-choice answers also isn't great: I'm getting sentences mixed in with numbers as decoys for the translation of "three".
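Fixing that is mostly a matter of tagging the word bank by category and only sampling decoys from the answer's own category. A hypothetical sketch (the word bank and function names are made up for illustration):

```python
import random

# Hypothetical sketch: decoys come only from the answer's own category,
# so "three" never gets sentence fragments as wrong choices.
WORD_BANK = {
    "number": ["one", "two", "three", "four", "five"],
    "noun": ["bread", "water", "house", "street"],
}

def pick_decoys(answer: str, category: str, n: int = 3) -> list[str]:
    pool = [w for w in WORD_BANK[category] if w != answer]
    return random.sample(pool, min(n, len(pool)))

print(pick_decoys("three", "number"))  # e.g. ['five', 'one', 'two']
```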

It also has the Duolingo-esque audio "reward" sounds. I personally hate them, but a lot of people feel otherwise.


There's a difference between the chatbot "advertising" something and an hour-long manipulative conversation getting the chatbot to make up a fake discount code. Based on the OP's comments, if it had been a human employee who gave out the fake code, they could plausibly claim duress.


Think about if this happened in the real world. Like if I ran a book store, I’d expect some scammer to try to schmooze a discount but I’d also expect the staff to say no, refuse service, and call the police if they refused to leave. If the manager eventually said “okay, we’ll give you a discount” ultimately they would likely personally be on the hook for breaking company policy and taking a loss, but I wouldn’t be able to say that my employee didn’t represent my company when that’s their job.

Replacing the employee with a rental robot doesn’t change that: the business is expected to handle training and recover losses due to not following that training under their rental contract. If the robot can’t be trained and the manufacturer won’t indemnify the user for losses, then it’s simply not fit for purpose.

This is the fundamental problem blocking adoption of LLMs in many areas: they can’t reason, and prompt injection is an unsolved problem. Until there are some theoretical breakthroughs, they’re unsafe to put into adversarial contexts where their output isn’t closely reviewed by a human who can be held accountable. Companies might be able to avoid paying damages in court if a chatbot is very clearly labeled as not to be trusted, but that rules out most of the market, because companies want to lay off customer service reps. There’s very little demand for purely entertainment chatbots, especially since even there you have reputational risks if someone can get one to make a racist joke or something similarly offensive.
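The workaround, for anything transactional, is to keep the model's output out of the trust boundary entirely. A minimal sketch of what that looks like for discount codes (everything here is illustrative, not a real API):

```python
# Minimal sketch: the chatbot can *propose* a code, but the backend only
# honors codes the business actually issued. All names are illustrative.
VALID_CODES = {"WELCOME10": 0.10, "SPRING5": 0.05}

def apply_discount(order_total: float, proposed_code: str) -> float:
    """Apply a discount only if the code exists in the issued-code table."""
    rate = VALID_CODES.get(proposed_code.strip().upper())
    if rate is None:
        # The model made it up (or a user talked it into one): ignore it.
        return order_total
    return order_total * (1 - rate)

print(apply_discount(100.0, "TOTALLYREAL50"))  # 100.0 (fake code ignored)
print(apply_discount(100.0, "welcome10"))      # 90.0
```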


If having "an hour-long manipulative conversation" was possible, that is proof the company put an unsupervised, error-prone mechanism in place instead of real support.

If that "difference" is so obvious to you (and you expect it to break down at some point), why not demand that the company notice the problem as well? And simply... not put a bogus mechanism in place at all.

Edit: to be clear, I think the company should just cancel the order and apologize. And then take down that bot, or put better safeguards in place (good luck with that).


Umm… The human could have dropped off the conversation? Or escalated it to a manager?


This seems to rest on a fundamental misunderstanding of Hacker News's front page. The goal is for posts to show up for at most a day or two; I would expect 100% to be gone within a week.

The "outliers" are posts that were deleted and re-submitted, or possibly boosted by the moderators. The "visible for 44 days" post was only submitted 14 days ago; either it was deleted or it is a glitch in this guy's script.


For the "14 days ago" one - post timestamps get updated when a post gets re-submitted or boosted by the moderators. It would have actually been there from 44 days ago even if it's now saying it was more recent.


Interesting, that actually explains a lot. If HN rewrites the timestamp on resubmission/boost, that's exactly why powera saw "14 days ago" for a post our scraper first picked up Dec 8. We were just grabbing whatever hit the top 50 every 30 minutes, so we caught the original appearance regardless of what the timestamp says now. Good to know.


Fair point — and you're partially right.

The "44 days" outlier (Grov) appeared in our snapshots on 12 unique dates spread across 47 calendar days, not continuously. Our methodology measured time between first and last appearance in the top 50, which collapses resubmissions into one window. That's a real limitation we should have flagged.

The raw data: Grov first appeared Dec 8, then clusters on Dec 17-22, Dec 29, and Jan 16-24. Big gaps in between. It wasn't sitting on the front page for 44 days straight — it kept reappearing.

We're updating the study to add a "continuous visibility" metric alongside the existing one. The core finding still holds (99% of Show HNs are gone within days, and the median post gets a single 30-min window), but the outlier framing was misleading.
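For reference, a rough sketch of what the new metric computes, assuming each post's snapshots are a sorted, non-empty list of dates (illustrative code and dates, not our actual pipeline):

```python
from datetime import date

def continuous_windows(snapshots: list[date], max_gap_days: int = 1) -> list[tuple[date, date]]:
    """Split sorted snapshot dates into runs; a gap longer than
    max_gap_days starts a new window, so resubmissions stay separate."""
    windows = []
    start = prev = snapshots[0]  # assumes a non-empty, sorted list
    for d in snapshots[1:]:
        if (d - prev).days > max_gap_days:
            windows.append((start, prev))
            start = d
        prev = d
    windows.append((start, prev))
    return windows

# e.g. Dec 8, then Dec 17-18: two separate windows, not one 11-day span
print(continuous_windows([date(2024, 12, 8), date(2024, 12, 17), date(2024, 12, 18)]))
```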

Appreciate the pushback.


Oh, hi chatgpt


I don't use gpt.


The fact that Altman is so offended about Anthropic ads that don't mention OpenAI is very revealing about his mindset.


Title is wrong: it should say this is the first year-over-year drop in 25 years. (Details behind paywall.)


You've already commented ten times in this thread. If you don't know, maybe don't say anything?


Do you know if Stallman would platform Oracle by allowing them to be a sponsor?

