Hacker News | veselin's comments

I guess we'd see a lot more benefit if we can get this to work on something like llama.cpp - it has a lot of kernels for different quantizations, a lot of home users, and high hardware diversity, so it's likely the place with the highest bang for the buck.

I guess they could become contributors there.


This is the right call. llama.cpp has dozens of hand-tuned CUDA kernels across Q4_K_M, Q5_K_S, Q8_0 and other quant formats, each targeting different hardware profiles. An autoresearch approach that could optimize these per-GPU would be huge — right now performance varies wildly between, say, an RTX 3090 and a 5070 Ti on the same quant format because the kernels are tuned for specific architectures. The hardware diversity in the llama.cpp user base is exactly where automated kernel search has the most to gain.
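To make the idea concrete, here is a toy sketch of what automated per-GPU kernel search boils down to: benchmark a set of candidate configurations on the machine at hand and keep the fastest. Everything here is illustrative (a naive CPU matmul standing in for a GPU kernel; real tuners search launch configs, not Python loops):

```python
import time

def matmul_tiled(a, b, n, tile):
    # Naive tiled matmul over flat lists; stands in for a real GPU kernel
    # whose performance depends on the tile/block size chosen.
    c = [0.0] * (n * n)
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, n)):
                        s = c[i * n + j]
                        for k in range(k0, min(k0 + tile, n)):
                            s += a[i * n + k] * b[k * n + j]
                        c[i * n + j] = s
    return c

def autotune(n=64, candidates=(4, 8, 16, 32, 64)):
    # Time each candidate tile size on *this* machine and return the
    # fastest - the same idea, scaled up, is what per-GPU kernel search
    # automates across quant formats and architectures.
    a = [1.0] * (n * n)
    b = [1.0] * (n * n)
    best = None
    for tile in candidates:
        t0 = time.perf_counter()
        matmul_tiled(a, b, n, tile)
        dt = time.perf_counter() - t0
        if best is None or dt < best[1]:
            best = (tile, dt)
    return best[0]

print(autotune())
```

The winning tile size genuinely differs per machine, which is exactly why a 3090 and a 5070 Ti end up wanting different kernels for the same quant format.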


If I'd like to benchmark a new language / compiler backend for LLM inference, what would be some good projects to try? If I started from tinygpt, what would make sense as the next step?


You're talking to a bot.


I think they proposed two things:

* They will likely seek regulation that would ban some models. Not sure this can work, but they will certainly try.

* They will likely not release some of their next models in the API.


I called this out a few days ago.

They'll come up with some excuse to get the Chinese models banned.

Of course this will only work in the US. Every American tech company will have to pay 10 to 20x for tokens.


How can Chinese models get banned if American providers can still host them?

Is there even a law that could enforce something like that?


https://www.reuters.com/world/china/us-lawmakers-introduce-b...

https://cyberscoop.com/deepseek-ban-congress-cassidy-rosen-c...

If your company has any government contracts you might not be able to use Chinese models under these bills.


Damn. I can still understand it, but could there be a blanket ban on Chinese models themselves, period? Or is the idea that you can't serve Chinese models to American citizens, or something similar under the US constitution?

This is a big deal in itself, and thanks for sharing. But I feel that a blanket ban on Chinese models is definitely something American AI companies want, and with their lobbying efforts and much else, I think it's up for genuine debate whether this becomes reality or not.


Good that the US pissed away so much soft power and goodwill around the world that it won't convince other countries to ban these models. And banning them only domestically would hurt the US itself too much.


The winner will be the biggest buyer of TRUMP coin-laundered bribes.


I am actually going to complain about this: both of the Gemini models are still preview ones.

Anthropic seems the best at this: everything is in the API on day one. OpenAI tends to push you toward a subscription first, but the API gets there a week or a few later. Now, Gemini 3 is not for production use and this is already the previous iteration. So, does Google even intend to release this model?


What is the state of using quants? For chat models, a few errors or some lost intelligence may matter little. But what happens to tool calling in coding agents? Does it fail catastrophically after a few steps in the agent?

I am interested in whether I can run it on a 24GB RTX 4090.

Also, would vLLM be a good option?


I like the byteshape quantizations - they use dynamic variable quantization of the weights, tuned for quality vs overall size. They seem to make fewer errors at lower "average" quantizations than the unsloth 4-bit quants. I think this is similar to variable-bitrate video compression, where you keep more bits where it helps overall model accuracy.
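A toy version of that VBR-style idea - not the actual scheme any of these quants use, just an illustration: give more bits to the layers whose quantization hurts accuracy more, while holding the average bit budget fixed (the clamping means the average is only approximately preserved):

```python
def allocate_bits(sensitivities, avg_bits, min_bits=2, max_bits=8):
    # sensitivities: per-layer scores for how much quantization error hurts.
    # Distribute the total bit budget proportionally to sensitivity,
    # clamped to a sane per-layer range.
    total = avg_bits * len(sensitivities)
    s_sum = sum(sensitivities)
    return [
        min(max_bits, max(min_bits, round(total * s / s_sum)))
        for s in sensitivities
    ]

# Hypothetical 4-layer model targeting an average of 4.5 bits/weight:
print(allocate_bits([1, 2, 3, 4], 4.5))  # -> [2, 4, 5, 7]
```

Sensitive layers end up near 7-8 bits while robust ones drop to 2-4, which is why the "average" bits-per-weight figure of these quants can be low without the quality collapsing.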

Should be able to run this in 22GB of VRAM, so your 4090 (and a 3090) would be safe. This model also uses MLA, so you can run pretty large context windows without eating up a ton of extra VRAM.

edit: 19GB VRAM for a Q4_K_M - MLX4 is around 21GB, so you should be clear to run a lower quant version on the 4090. Full BF16 is close to 60GB, so probably not viable.
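The back-of-the-envelope math behind those numbers, assuming roughly 30B weights (my assumption, not a stated spec) and ignoring KV cache and activation overhead:

```python
def weight_gib(params_b, bits_per_weight):
    # params_b: parameter count in billions.
    # Returns approximate weight memory in GiB.
    return params_b * 1e9 * bits_per_weight / 8 / 2**30

# BF16 is 16 bits/weight; Q4_K_M averages roughly 4.5-5 bits/weight
# across tensors (the "4-bit" label is only the dominant format).
for name, bits in [("BF16", 16), ("Q4_K_M (~4.8b avg)", 4.8)]:
    print(f"{name}: {weight_gib(30, bits):.1f} GiB")
```

That lands near 56 GiB for BF16 and around 17 GiB for Q4_K_M, consistent with the ~60GB and ~19GB figures above once you add runtime overhead.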


It's been mentioned that this model is MLA-capable, but it seems like the default vLLM params don't use MLA. I'm seeing a ~0.91MB KV footprint per token right now. Are you getting MLA to work?
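For context, a rough per-token KV footprint comparison. The config numbers below are hypothetical, not this model's actual dimensions - the point is just the mechanism: standard attention caches full K/V vectors per layer, while MLA caches one compressed latent per layer:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    # Standard attention caches one K and one V vector per layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def mla_bytes_per_token(layers, latent_dim, dtype_bytes=2):
    # MLA caches a single compressed latent per layer instead of full K/V.
    return layers * latent_dim * dtype_bytes

# Hypothetical config: 48 layers, 8 KV heads, head_dim 128, 512-dim latent.
full = kv_bytes_per_token(48, 8, 128)  # 196608 bytes, ~0.19 MB/token
mla = mla_bytes_per_token(48, 512)     # 49152 bytes, ~0.05 MB/token
print(full, mla)
```

If vLLM falls back to caching the full (decompressed) K/V, the per-token footprint blows up by a factor of several, which would fit the ~0.91MB/token observation.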


It's in the ollama library at q4_K_M, which doesn't quite fit on my 4090 with the default context length, but it only offloads 8 layers to the CPU for me, and I'm getting usable enough token rates. That's probably the easiest way to get it. I've not tried it with vLLM, but if it proves good enough to stick with, I might give it a try.


Oh, and on agents: I did give it a go in opencode last night and it seemed to get a bit stuck but I think I probably pushed it too far. I asked it to explain TinyRecursiveModels and pointed it at the git repo URL. It got very confused by the returned HTML and went into a loop. But actually getting to the point of getting content back from a tool call? Absolutely fine.

I'm thinking of giving it a go with aider, but using something like gemma3:27b as the architect. I don't think you can have different models for different skills in opencode, but with smaller local models I suspect it's unavoidable for now.


I run evals, and the Todo tool doesn't help most of the time. Usually models on high thinking will maintain the todo/state in their thinking tokens anyway. Where Todo does help is cases like Anthropic models running more parallel tool calls: if there is a Todo list call, some of the actions after it are more efficient.

What you need to do is match the distribution of how the models were RL-ed. So you are right to say that "do X in 200 lines" is a very small part of the job to be done.


Curious what kinds of evals you focus on?

We're finding investigation to be same-but-different to coding. Probably the closest area to ours with a bigger evals community is AI SRE tasks.

Agreed wrt all these things being contextual. The LLM needs to decide whether to trigger tools like self-planning and todo lists and, as the talk gives examples of, which kinds of strategies to use with them.


I am talking about SWE-bench-style problems, where Todo doesn't help except for more parallelism.


Was guessing that - coding tasks are a valuable but myopic lens :)

I'm guessing a self-updating plan there is sufficient. I'm not actually convinced today's plan <> todolist flow makes sense - in the linked PLAN.md it gets unified, and that's how we do AI coding. I don't have evals on this, but from a year of vibe coding/engineering, that's what we experientially converged on across frontier coding models & tools. Nowadays we're mixing in evals too, but that's a more complicated story.


I work a lot on testing with SWE-bench Verified as well. In my opinion this benchmark is now good for catching regressions on the agent side.

However, above 75% it is likely all about the same. The remaining instances are likely underspecified, despite the effort the authors put into making the benchmark "verified". From what I have seen, these are often cases where the problem statement says implement X for Y, but the agent has to simply guess whether to implement the same for another case Y' - which decides winning or losing an instance.


Does anybody know of an inference provider that offers input token caching? It should be almost required for agentic use - first for speed, but also because almost all conversations start where the previous one ended, so costs may end up quite a bit higher with no caching.

I would have expected good providers like Together, Fireworks, etc. to support it, but I can't find it, except by running vLLM myself on self-hosted instances.
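Rough cost math for why this matters in agent loops. The 10x cached-token discount below is a typical figure, not any particular provider's pricing, and the token counts are made up:

```python
def agent_cost(turns, tokens_per_turn, price_per_tok, cached_discount=0.1):
    # Each agent turn re-sends the whole conversation so far as input.
    no_cache = cached = 0.0
    history = 0
    for _ in range(turns):
        no_cache += (history + tokens_per_turn) * price_per_tok
        # With prefix caching, the shared history is billed at a discount;
        # only the new tokens pay full price.
        cached += history * price_per_tok * cached_discount
        cached += tokens_per_turn * price_per_tok
        history += tokens_per_turn
    return no_cache, cached

no_cache, cached = agent_cost(turns=20, tokens_per_turn=2000, price_per_tok=1e-6)
print(f"${no_cache:.3f} vs ${cached:.3f}")  # roughly $0.42 vs $0.08
```

Because the re-sent history grows quadratically with the number of turns, the gap only widens as agent sessions get longer. For self-hosting, vLLM's automatic prefix caching gives you the speed side of this, though obviously not the billing side.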


Alibaba Cloud does:

> Supported models. Currently, qwen-max, qwen-plus, qwen-turbo, qwen3-coder-plus support context cache.


I know. I cannot believe LM Studio, Ollama, and especially model providers do not offer this yet.


I think people are just too quick to assume this is amazing before it is there. Which doesn't mean it won't get there.

Somehow, if I take the best models and agents, most hard coding benchmarks are below 50%, and even SWE-bench Verified is at like 75, maybe 80%. Not 95. Assuming agents just solve most problems is incorrect, despite them being really good at first prototypes.

Also, in my experience agents are great up to a point and then fall off a cliff. Not gradually. The types of errors you get past that point are so diverse one cannot even explain them.


I noticed a similar trend in selling on X. Make a claim, peg it to some product A with good sales - Cursor, Claude, Gemini, etc. Then say the best way to use A is with our own product or guide, be it an MCP or something else.

For some of these I see something like 15k followers on X, but then no LinkedIn page, for example. The website is always a company you cannot contact, and they do everything.


No LinkedIn page is a green flag for me.


Yes. The article is clickbait. With such a title I would have expected the majority of the area to be dummy silicon, but it is just structurally more silicon - exactly like a framed picture may be mostly wood by mass.


Your statement is incorrect. The analysis was made by a professional firm: dummy silicon shims are used because the dies are thinned, per AMD's own disclosures. Those silicon shims are bonded to the compute and SRAM dies.

