It’s architecturally not a good approach. System RAM is much slower, so you should put data that doesn’t need to be touched often on it. That knowledge lives at the application layer. Adding a CUDA shim makes system RAM appear like VRAM, which gets things to run, but it will never run very well.
The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.
Not true for unified systems. And for Strix Halo you need to dedicate a fixed amount of memory to the GPU up front, which is annoying.
You’re basically stating that swapping is also a bad idea.
And to take it further, any memory or storage is a bad idea, because there’s L1 cache/SRAM, which is faster than the rest.
The fundamental problem here is that the workload of LLMs is (vastly simplified) a repeated linear read of all the weights, in order. That is, there is no memory locality in time. There is literally anti-locality: when you read a set of weights, you know you will not need them again until you have processed everything else.
This means that many of the old approaches don't work, because time locality is such a core assumption underlying all of them. The best you can do is really a very large pool of very fast RAM.
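A toy illustration of why cache-style approaches buy nothing here (pure-Python sketch with made-up numbers): for a cyclic sequential scan over more blocks than the cache holds, LRU hits exactly never, because each block is evicted just before its next use.

```python
from collections import OrderedDict

def lru_hits(access_pattern, cache_size):
    """Count cache hits for an access pattern under LRU eviction."""
    cache = OrderedDict()
    hits = 0
    for key in access_pattern:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits

# 3 full passes over 100 "weight blocks", with room for only 90 of them:
pattern = list(range(100)) * 3
print(lru_hits(pattern, cache_size=90))  # 0 -- every block is evicted
                                         # just before it is needed again
```

This is the classic LRU worst case; a cache that fits everything would of course hit on every repeat pass, which is just another way of saying the only fix is a pool of fast memory big enough for all the weights.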
In the long term, compute is probably going to move towards the memory.
The main blocker with swapping is not even the limited bandwidth, it's actually the extreme write workload on data elements such as the per-layer model activations - and, to a much lesser extent, the KV-cache. In contrast, there are elements such as inactive experts for highly sparse MoE models, where swapping makes sense since any given expert will probably be unused. You're better off using that VRAM/RAM for something else. So the logic of "reserve VRAM for the highest-value uses, use system RAM as a second tier, finally use storage as a last resort or for read-only data" is still quite valid.
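That tiering logic is mundane enough to live in ordinary application code. A minimal sketch of such a placement policy (names, sizes, and priorities are made up for illustration):

```python
def place(tensors, vram_budget, ram_budget):
    """Greedy tier assignment: the hottest / most write-heavy data claims
    VRAM first, then system RAM, then storage (read-mostly data only).
    `tensors` is a list of (name, size, priority) tuples, where priority
    encodes access frequency and write intensity."""
    placement = {}
    for name, size, _prio in sorted(tensors, key=lambda t: -t[2]):
        if vram_budget >= size:
            placement[name], vram_budget = "vram", vram_budget - size
        elif ram_budget >= size:
            placement[name], ram_budget = "ram", ram_budget - size
        else:
            placement[name] = "storage"
    return placement

tensors = [
    ("activations",  2, 10),  # rewritten every token: must stay in VRAM
    ("kv_cache",     4, 8),   # appended every token
    ("hot_experts",  6, 5),   # frequently routed to
    ("cold_experts", 20, 1),  # rarely activated: fine in RAM or storage
]
print(place(tensors, vram_budget=12, ram_budget=16))
```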
How do you get the weights for the right set of experts for a given batch of tokens into fast memory at the right time?
The set of activated experts is only known after routing, at which point you need the weights immediately and will get very poor performance if they are on the other side of PCIe.
Once your model is large enough you'll have to eat the offload cost for something, and it might as well be something where most of that VRAM footprint isn't even used. For current models, inactive experts arguably fit that description best. Of course, it may be the case that shifting that part of the graph to CPU compute is a better deal than paying the CPU-to-GPU cost for the active weights and computing on GPU; that's how llama.cpp does it.
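A back-of-envelope sketch of that trade-off, with made-up but plausible numbers: fetching an active expert over PCIe can cost more than simply streaming the same weights through the CPU at system-RAM bandwidth, which is roughly what a batch-1, memory-bound expert matmul costs.

```python
# Hypothetical numbers for illustration only.
expert_bytes = 3e9   # one active expert's weights
pcie_bps     = 25e9  # effective host-to-device bandwidth (bytes/s)
ram_bps      = 80e9  # system RAM bandwidth (bytes/s)

# Option A: copy the active expert to the GPU, then compute there.
transfer_s = expert_bytes / pcie_bps       # ~0.12 s just to move it

# Option B: compute on the CPU where the weights already live; a
# memory-bound matmul streams the weights once at RAM bandwidth.
cpu_compute_s = expert_bytes / ram_bps     # ~0.04 s

print(f"PCIe fetch: {transfer_s * 1e3:.0f} ms, "
      f"CPU compute: {cpu_compute_s * 1e3:.1f} ms")
```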
Try turning swap off and really find out if you’re not grateful for it. Might be fine if you’re never using all your RAM, but if you are, swap off isn’t fun and you might realize you’ve been unconsciously grateful this whole time. ;) Swap might be important for GPU usage even when not using something like greenboost, since display GPUs sometimes use system RAM to back the GPU VRAM.
> Try turning swap off and really find out if you’re not grateful
Er, I did exactly this over a decade ago and never looked back. It's literally one of the first things I do on a new machine.
> Might be fine if you’re never using all your RAM
That's definitely happened occasionally, and no, swap almost always just makes it worse. The thrashing makes the entire machine unusable instead of making the allocating app(s) potentially unstable. I've recovered most times by just immediately killing the app I'm using. And in fact I have warnings that sometimes tell me fast enough before I reach the limit to avoid such issues in the first place.
Strix Halo’s unified setup is pretty cool. On systems with 128GB of memory, set the dedicated GPU memory in the BIOS to the smallest permitted value, and the drivers will use the whole main memory pool appropriately on Linux and Windows.
The OSS AMDGPU drivers by default allocate a fixed percentage of system RAM for GTT (up to 75%), they do not automatically use the entire system memory. You can override this with the kernel options I posted in my original comment, but as I mentioned, there are some serious negative consequences. You also may need to disable IOMMU or use PT mode. Personally I have had a lot of crashes as a result of this stuff, so I went back to the defaults and just don't run big models.
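For context, GTT overrides of this kind typically look something like the following (illustrative values for a 128GB machine; parameter availability varies by kernel version, and these are not necessarily the exact values the parent used):

```shell
# /etc/default/grub -- illustrative only. amdgpu.gttsize is in MiB;
# the ttm limits are in 4 KiB pages. 122880 MiB = 31457280 pages = 120 GiB.
GRUB_CMDLINE_LINUX="amdgpu.gttsize=122880 ttm.pages_limit=31457280 ttm.page_pool_size=31457280 iommu=pt"
```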
The biggest factor in whether AMD GPUs on Linux are a PITA or not is ROCm. Strix Halo is supported in ROCm 6.x, so it should be supported on most platforms (I haven't tested it, though). ROCm 7.x is supposed to be better, but not all apps support it yet.
AMD, if you're reading this, please hire more SWEs. Nvidia will continue to dominate until you beat them at software.
It's not true for unified systems, because they have no secondary RAM that could be used to extend the GPU memory.
It's pretty weird to insist on a counterargument that has no implications or consequences to the presented argument.
Yes, swapping is a bad idea.
Your second argument also falls flat, because the standard CUDA hardware setup doesn't use CXL, so cache coherence isn't available. You're left with manual memory synchronization. Pretending that GPUs have a cache for system RAM when they don't is pretty suspect.
From my experience, accessing system RAM from the GPU is so slow, it might as well count as "does not work". It's orders of magnitude faster to memcpy large swaths of the memory you are going to use over to the GPU than to access system memory from a kernel, which waits ages for one small block/page of memory, then waits again for the next, and so on. Latency hiding doesn't work anymore when the latency is that large.
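A crude cost model shows why: every transfer pays a fixed latency before streaming at link bandwidth, so page-sized accesses end up latency-dominated (numbers are illustrative, not measured):

```python
def transfer_ms(total_bytes, chunk_bytes, latency_us=10, bw_bps=25e9):
    """Total time to move `total_bytes` in chunks of `chunk_bytes`:
    each chunk pays a fixed latency, then streams at link bandwidth."""
    n_chunks = total_bytes // chunk_bytes
    per_chunk_ms = latency_us / 1000 + chunk_bytes / bw_bps * 1e3
    return n_chunks * per_chunk_ms

total = 1 << 30  # 1 GiB
print(f"one bulk copy: {transfer_ms(total, total):.1f} ms")  # ~43 ms
print(f"4 KiB chunks:  {transfer_ms(total, 4096):.0f} ms")   # ~60x slower,
                                                             # latency-dominated
```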
You’re right for some workloads, but not all of them. The same could have been said of disk swap since the beginning, and people still found it valuable. Disk swapping with spinning drives used to be multiple orders of magnitude slower than RAM, but it prevented applications or the system from crashing.
Using system memory from the GPU isn’t that bad if your compute intensity is high enough and you don’t transfer that much data. There are commercial applications that support it and see only a low double-digit percentage performance impact, not the multiples you might expect. Plus, on Windows on Nvidia hardware, the driver will automatically use system memory if you oversubscribe VRAM; I believe this was introduced to support running Stable Diffusion on smaller GPUs.
Yes, with current LLMs and current hardware and current supporting software this is a true statement. My point wasn't that this approach suddenly changes that, it was that it makes it easier to explore alternatives that might change that. Let's imagine some possibilities:
- Models that use a lot of weight reuse: if you strategically reuse layers 3-4x, that could give a lot of time for async loading of future weights.
- Models that select experts for several layers at a time: same thing; while crunching on the current layer, you have teed-up future layers that can be transferring in.
- HW makers keep improving memory bandwidth: this is already happening, right? AMD and Apple are pushing unified memory architectures with much higher bandwidth, though still not quite there compared to GPUs. This could lead to a hybrid approach that makes those machines much more competitive. Similarly, HW makers could bring back technologies that died on the vine, things like Intel's Optane. Start making mass storage as fast as system memory is now and the equation may change.
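The first two bullets reduce to a simple overlap condition: prefetching the next layer's weights is free as long as the transfer fits under the compute you are doing in the meantime. A toy model (illustrative numbers):

```python
def transfer_hidden(k_reuse, t_compute_ms, t_transfer_ms):
    """True if the next layer's weight transfer fits entirely under the
    compute time of the current layer, reused k_reuse times."""
    return k_reuse * t_compute_ms >= t_transfer_ms

# One pass per layer: a 120 ms PCIe fetch cannot hide under 30 ms of compute.
print(transfer_hidden(1, 30, 120))  # False
# Reuse the same layer 4x and the fetch is fully overlapped.
print(transfer_hidden(4, 30, 120))  # True
```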
These are quick dart throws that probably have obvious holes in them, but the point is that platforms like this help us explore paths that appeared dead-end until the one change that makes them viable and lets them take over. It may not happen. It may be a dead end. But that dismissive logic means we would never go out on a limb and try something new. We need people and tech that challenge assumptions and make it easy to try out ideas, to keep the tech ecosystem evolving. This does that. Even if this particular project doesn't succeed, it is a great thing to do, if for no other reason than that it likely just spurred a bunch of people to try their own crazy hacks for LLM inference. Maybe it even enabled a use case with GPUs that nobody realized existed and has nothing to do with LLMs.
The market cap is unambiguous, a more correct estimate of "how much to buy all the shares?" is situational and would just distract from getting the point across.
> According to the researchers, an unpatched Windows PC connected to the Internet will last for only about 20 minutes before it's compromised by malware, on average. That figure is down from around 40 minutes, the group's estimate in 2003.
This was from two decades ago, and cursory searching suggests the average lifetime of an unpatched system is even lower now.
2. The Claude Code system prompt almost certainly gives directions about how to deal with MCP tools, and may also include the list of tools
3. Instruction adherence is higher when the instructions are placed in the system prompt
If you put these three facts together then it’s quite likely that Claude Code usage of a particular tool (in the generic sense) is higher as an MCP server than as a CLI command.
But why let this be a limitation? Make an MCP server that calls your bash commands. Claude Code will happily vibe code this for you, if you don’t switch to a coding tool that gives better direct control of your system prompt.
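The wrapper really is that thin. Stripped of the MCP plumbing, the body of such a tool is just a subprocess call (hypothetical tool name, sketch only):

```python
import subprocess

def run_command(command: str, timeout: int = 60) -> str:
    """Hypothetical MCP tool body: run a shell command and return its
    combined output, so the model can use existing CLIs as tools."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

print(run_command("echo hello from a wrapped CLI"))
```

In a real server you would register this function as a tool via your MCP framework of choice and almost certainly restrict which commands it may run.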
I think the idea is that instead of spending an additional $4000 on external hardware, you can just buy one thing (your main work machine) and call it a day. Also, the Mac Studio isn’t that much cheaper at that price point.
> Being able to leave the thing at home and access it anywhere is a feature, not a bug.
I can do that with a laptop too. And with a dedicated GPU. Or a blade in a data center. I thought the feature of the DGX was that you can throw it in a backpack.
You're not going to use the DGX as your main machine, so you'll need another computer. Sure, not a $4000 one, but you'll want at least some performance, so it'll be another $1000-$2000.
Because Nvidia is incredibly slow with kernel updates, and you are lucky if you get them at all after just two years. I am curious whether they will update these machines for longer than their older DGX-like hardware.
Now that you bring it up, the M3 Ultra Mac Studio goes up to 512GB for about a $10k config, with around 850 GB/s bandwidth, for those who "need" a near-frontier large model. I think 4x the RAM is not quite worth more than doubling the price, especially if MoE support gets better, but it's interesting that you can get a DeepSeek R1 quant running on prosumer hardware.
The platter is a circle, so using the uniform distribution on [0, 1] is incorrect; you should use the uniform circular distribution on [0, 2pi]. And since the platter spins in a single direction, the distance is only computed going one way around (if the target is right before the current position, it's nearly one full spin).
But you can simplify this problem down and ask: with no loss of generality, if your starting point is always 0 degrees, how many degrees clockwise is a random point on average, if the target is uniformly distributed?
Since 0-180 has the same arc length as 180-360, the average distance is 180 degrees. So the average seek is a half-platter rotation, half the cost of a full-platter seek.
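A quick Monte Carlo sanity check of the 180-degree claim, plus the either-direction variant for comparison (pure-Python sketch):

```python
import random

random.seed(0)
targets = [random.uniform(0, 360) for _ in range(200_000)]

# One-directional (the platter spins one way): the clockwise distance
# from a start fixed at 0 degrees is just the target angle itself.
one_way = sum(targets) / len(targets)                              # ~180

# If you could rotate either way, the expected distance halves,
# since you always take the shorter arc.
either_way = sum(min(t, 360 - t) for t in targets) / len(targets)  # ~90

print(round(one_way), round(either_way))
```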
What you wrote only applies to rotational latency, not seek latency. Seek latency is the time it takes for the head to reach the target track. Heads only rotate within a small range, something like [0..25°], and they are designed for rapid movements in either direction.
> If you're copying and pasting something, there probably isn't a good reason for that.
I would embrace copying and pasting for functionality that I want to be identical in two places right now, but that I’m not sure ought to be identical in the future.
I agree completely. DRY shouldn't be a compression algorithm.
If two countries happen to calculate some tax in the same way at a particular time, I'm still going to keep those functions separate, because the rules are made by two different parliaments independently of each other.
Referring to the same function would simply be an incorrect abstraction. It would suggest that one tax calculation should change whenever the other changes.
If, on the other hand, both countries were referring to a common international standard then I would use a shared function to mirror the reference/dependency that they decided to put into their respective laws.
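In code, that intent is easiest to see when the duplication is left standing (country names and rates are made up for illustration):

```python
# Deliberate duplication: the two calculations are textually identical
# today, but each mirrors a different legislature's rules and must be
# free to diverge independently.
def vat_country_a(net: float) -> float:
    return round(net * 0.20, 2)

def vat_country_b(net: float) -> float:
    return round(net * 0.20, 2)  # same formula -- for now

# When country B changes its rate, only its function changes:
print(vat_country_a(100.0), vat_country_b(100.0))  # 20.0 20.0
```

If both instead referenced a shared international standard, a single `vat_per_standard()` function would be the honest model of that dependency.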
btw, VeraCrypt is the name of the follow-up project. TrueCrypt shut down over a decade ago, rather abruptly, so anything labeled TrueCrypt today is suspect as either out of date or potential malware.
Wasn't the conspiracy theory that truecrypt got shut down because it was 'too effective', and the successor projects presumably have intentional backdoors or something?
TrueCrypt was likely developed by only one man, Paul Le Roux, who likely shut it down because he was on the run for being an international drug/human smuggler and cartel member. It’s kind of a crazy story.
But either way both truecrypt and veracrypt were independently audited and no major flaws were found. Not sure when the last veracrypt audit was done.
Nah, just drop a few thousand 1GB flash drives from a plane. Load them with a tor browser, a wireguard client, and instructions on finding a remote exit. Only one copy needs to survive and it can spread very quickly and irreversibly by foot.
Yeah, this is a great approach if you're already at war with a country.
If you're not and they're still allowing your planes to fly through their airspace then this is a great way to ensure that they lock your (and your friends') planes out.
It's unjust in the same sense that some people complain about capitalism being unjust: some people are wealthy who didn't cosmically deserve it, but just got lucky. There is disagreement over in which way they were lucky (random luck, or lucky to have the right parents, education, genes, etc.)