powera's comments | Hacker News

(January 2025)

I've been waiting for this update.

For many "simple" LLM tasks, GPT-5-mini was sufficient 99% of the time. Hopefully these models will handle even more, at closer to 100% accuracy.

The prices are up 2-4x compared to GPT-5-mini and nano. Were those models just loss leaders, or are these substantially larger/better?


So far on my (simple) benchmarks, GPT-5.4-mini is looking very good: it's about 30% faster than GPT-5-mini, gets 80% on the "how many Rs in Strawberry" test, and scores nearly perfectly on everything else I threw at it.

GPT-5.4-nano is less impressive. I would stick to gpt-5.4-mini where precise data is a requirement. But it is fast, and probably cheaper and better quality than an 8-20B parameter local model would be.

(See https://encyclopedia.foundation/benchmarks/dashboard/ for details. The data is moderately blurry: some outlier (15s) calls are included, a few benchmark questions are ambiguous, and some prices shown are very rough estimates.)
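For anyone who wants to reproduce the strawberry number, here's a rough sketch of the kind of harness I use, assuming the standard OpenAI Python client; the prompt wording and trial count are illustrative, not the exact benchmark code:

```python
# Rough sketch, not the exact benchmark harness. Assumes the standard
# OpenAI Python client and the "gpt-5.4-mini" model name from the release.
from openai import OpenAI

client = OpenAI()

def strawberry_trial() -> bool:
    resp = client.chat.completions.create(
        model="gpt-5.4-mini",
        messages=[{
            "role": "user",
            "content": "How many Rs are in the word 'strawberry'? Reply with just the number.",
        }],
    )
    return resp.choices[0].message.content.strip() == "3"

# 20 trials is illustrative; the dashboard runs many more calls per model.
passes = sum(strawberry_trial() for _ in range(20))
print(f"pass rate: {passes / 20:.0%}")
```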


For us it was also pretty good, but performance decreased recently, which forced us to migrate to haiku-4.5. More expensive, but much more reliable (when Anthropic is up, of course).

They don't change the model weights (no frontier lab does). If you have evals, and all your prompts and tool calls are the same, I'm curious how you're determining that performance decreased.

This looks like somebody re-releasing Qwen models to promote their own company. https://news.ycombinator.com/item?id=47217305 is the link to Qwen's repo.


If you want to have a chance at running a large model, it needs to be quantized. The unsloth user on Huggingface manages popular quantizations for many models, Qwen included, and I think he developed dynamic GGUF quantization.

Take Qwen/Qwen3.5-35B-A3B, for example: it's 72 GB, while unsloth/Qwen3.5-35B-A3B-GGUF has quantizations ranging from 9 to 38 GB.
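A minimal sketch of grabbing just one quantized file instead of the full 72 GB repo, assuming huggingface_hub is installed; the exact filename is a guess at the repo's naming convention, so check the repo's file list first:

```python
from huggingface_hub import hf_hub_download

# Download a single GGUF quantization rather than the whole repo.
# The filename below is a guess; look it up on the repo's "Files" tab.
path = hf_hub_download(
    repo_id="unsloth/Qwen3.5-35B-A3B-GGUF",
    filename="Qwen3.5-35B-A3B-Q4_K_M.gguf",
)
print(path)  # local cache path, ready to hand to llama.cpp
```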


Unsloth is one of the most well-known providers of model quantizations, if not the most well-known. The release post should of course reference the source, but most people probably use unsloth or bartowski quantizations anyway (those two are my go-tos), so the link is relevant and convenient.


Between this and 4.6's tendency to do so much more "exploratory" work, I am back to using ChatGPT Codex for some tasks.

Two months ago, Claude was great for "here is a specific task I want you to do to this file". Today, they seem to be pivoting towards "I don't know how to code but want this feature" usage. Which might be a good product decision, but makes it worse as a substitute for writing the code myself.


Have you played with the effort setting? I'm finding medium effort on 4.6 to give more satisfactory results for that kind of thing.


I feel the exact same way. Trying to cater to the "no-code" crowd is blurring the product's focus. It seems they've stuffed the system prompt with "be creative and explore" instructions, which kills determinism - so now we have to burn tokens just to tell it: "Don't think, just write the code"


Same here. Between this change to Claude Code and how Opus 4.6 is set up, they assume they can do things autonomously. But in my experience, they really can't. Letting a model overthink while it's on the wrong track is what leads to AI slop.


I've been working on a similar app called Trakaido. (Yes, it's also AI generated content).

The demo leans heavily on "choose the words for the sentence", which avoids spelling/keyboard issues and maybe generalizes better around the problems of N->N language mappings. The "decoy selection" for multiple-choice answers also isn't great: I'm getting sentences mixed in with numbers as decoys for the translation of "three".
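Fixing that is mostly a matter of tagging the word bank by category and only sampling decoys from the answer's own category. A hypothetical sketch (the word bank and function names are made up for illustration):

```python
import random

# Hypothetical sketch: decoys come only from the answer's own category,
# so "three" never gets sentence fragments as wrong choices.
WORD_BANK = {
    "number": ["one", "two", "three", "four", "five"],
    "noun": ["bread", "water", "house", "street"],
}

def pick_decoys(answer: str, category: str, n: int = 3) -> list[str]:
    pool = [w for w in WORD_BANK[category] if w != answer]
    return random.sample(pool, min(n, len(pool)))

print(pick_decoys("three", "number"))  # e.g. ['five', 'one', 'two']
```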

It also has the Duolingo-esque audio "reward" sounds. I personally hate them, but a lot of people feel otherwise.


There's a difference between the chatbot "advertising" something and an hour-long manipulative conversation getting the chatbot to make up a fake discount code. Based on the OP's comments, if it had been a human employee who gave out the fake code, they could plausibly claim duress.


Think about if this happened in the real world. Like if I ran a book store, I’d expect some scammer to try to schmooze a discount but I’d also expect the staff to say no, refuse service, and call the police if they refused to leave. If the manager eventually said “okay, we’ll give you a discount” ultimately they would likely personally be on the hook for breaking company policy and taking a loss, but I wouldn’t be able to say that my employee didn’t represent my company when that’s their job.

Replacing the employee with a rental robot doesn’t change that: the business is expected to handle training and recover losses due to not following that training under their rental contract. If the robot can’t be trained and the manufacturer won’t indemnify the user for losses, then it’s simply not fit for purpose.

This is the fundamental problem blocking adoption of LLMs in many areas: they can’t reason, and prompt injection is an unsolved problem. Until there are some theoretical breakthroughs, they’re unsafe to put into adversarial contexts where their output isn’t closely reviewed by a human who can be held accountable. Companies might be able to avoid paying damages in court if a chatbot is very clearly labeled as not to be trusted, but that rules out most of the market, because companies want to lay off customer service reps. There’s very little demand for purely entertainment chatbots, especially since even there you have reputational risks if someone can get one to make a racist joke or something similarly offensive.
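The workaround, for anything transactional, is to keep the model's output out of the trust boundary entirely. A minimal sketch of what that looks like for discount codes (everything here is illustrative, not a real API):

```python
# Minimal sketch: the chatbot can *propose* a code, but the backend only
# honors codes the business actually issued. All names are illustrative.
VALID_CODES = {"WELCOME10": 0.10, "SPRING5": 0.05}

def apply_discount(order_total: float, proposed_code: str) -> float:
    """Apply a discount only if the code exists in the issued-code table."""
    rate = VALID_CODES.get(proposed_code.strip().upper())
    if rate is None:
        # The model made it up (or a user talked it into one): ignore it.
        return order_total
    return order_total * (1 - rate)

print(apply_discount(100.0, "TOTALLYREAL50"))  # 100.0 (fake code ignored)
print(apply_discount(100.0, "welcome10"))      # 90.0
```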


If having "an hour-long manipulative conversation" was possible, that is proof the company put an unsupervised, error-prone mechanism in place instead of real support.

If that "difference" is so obvious to you (and you expect it to break down at some point), why not demand that the company notice the problem as well? And simply... not put a bogus mechanism in place at all.

Edit: to be clear, I think the company should just cancel the order and apologize. And then take down that bot, or put better safeguards in place (good luck with that).


Umm… The human could have dropped off the conversation? Or escalated it to a manager?


This seems to rest on a fundamental misunderstanding of Hacker News's front page. The goal is for posts to show up for at most a day or two; I would expect 100% to be gone within a week.

The "outliers" are posts that were deleted and re-submitted, or possibly boosted by the moderators. The "visible for 44 days" post was only submitted 14 days ago; either it was deleted or it is a glitch in this guy's script.


For the "14 days ago" one - post timestamps get updated when a post gets re-submitted or boosted by the moderators. It would have actually been there from 44 days ago even if it's now saying it was more recent.


Interesting, that actually explains a lot. If HN rewrites the timestamp on resubmission/boost, that's exactly why powera saw "14 days ago" for a post our scraper first picked up Dec 8. We were just grabbing whatever hit the top 50 every 30 minutes, so we caught the original appearance regardless of what the timestamp says now. Good to know.


Fair point — and you're partially right.

The "44 days" outlier (Grov) appeared in our snapshots on 12 unique dates spread across 47 calendar days, not continuously. Our methodology measured time between first and last appearance in the top 50, which collapses resubmissions into one window. That's a real limitation we should have flagged.

The raw data: Grov first appeared Dec 8, then clusters on Dec 17-22, Dec 29, and Jan 16-24. Big gaps in between. It wasn't sitting on the front page for 44 days straight — it kept reappearing.

We're updating the study to add a "continuous visibility" metric alongside the existing one. The core finding still holds (99% of Show HNs are gone within days, and the median post gets a single 30-min window), but the outlier framing was misleading.
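For reference, a rough sketch of what the new metric computes, assuming each post's snapshots are a sorted, non-empty list of dates (illustrative code and dates, not our actual pipeline):

```python
from datetime import date

def continuous_windows(snapshots: list[date], max_gap_days: int = 1) -> list[tuple[date, date]]:
    """Split sorted snapshot dates into runs; a gap longer than
    max_gap_days starts a new window, so resubmissions stay separate."""
    windows = []
    start = prev = snapshots[0]  # assumes a non-empty, sorted list
    for d in snapshots[1:]:
        if (d - prev).days > max_gap_days:
            windows.append((start, prev))
            start = d
        prev = d
    windows.append((start, prev))
    return windows

# e.g. Dec 8, then Dec 17-18: two separate windows, not one 11-day span
print(continuous_windows([date(2024, 12, 8), date(2024, 12, 17), date(2024, 12, 18)]))
```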

Appreciate the pushback.


Oh, hi chatgpt


I don't use gpt.


The fact that Altman is so offended about Anthropic ads that don't mention OpenAI is very revealing about his mindset.


Title is wrong: it should say this is the first year-over-year drop in 25 years. (Details behind paywall.)


You've already commented ten times in this thread. If you don't know, maybe don't say anything?


Do you know if Stallman would platform Oracle by allowing them to be a sponsor?

