It’s architecturally not a good approach. System RAM is much slower, so you should put data that doesn’t need to be touched often on it. That knowledge lives at the application layer. Adding a CUDA shim makes system RAM appear like VRAM, which gets things to run, but it will never run very well.
The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.
Not true for unified systems. And for Strix Halo you need to dedicate a fixed amount of memory to the GPU up front, which is annoying.
You’re basically stating that swapping is also a bad idea.
And to take it further, any memory or storage is a bad idea, because there’s L1 cache/SRAM, which is faster than the rest.
The fundamental problem here is that the workload of LLMs is (vastly simplified) a repeated linear read of all the weights, in order. That is, there is no memory locality in time. There is literally anti-locality: when you read a set of weights, you know you will not need them again until you have processed everything else.
This means that many of the old approaches don't work, because time locality is such a core assumption underlying all of them. The best you can do is really a very large pool of very fast RAM.
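A toy illustration of why cache-style approaches buy nothing here (pure-Python sketch with made-up numbers): for a cyclic sequential scan over more blocks than the cache holds, LRU hits exactly never, because each block is evicted just before its next use.

```python
from collections import OrderedDict

def lru_hits(access_pattern, cache_size):
    """Count cache hits for an access pattern under LRU eviction."""
    cache = OrderedDict()
    hits = 0
    for key in access_pattern:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits

# 3 full passes over 100 "weight blocks", with room for only 90 of them:
pattern = list(range(100)) * 3
print(lru_hits(pattern, cache_size=90))  # 0 -- every block is evicted
                                         # just before it is needed again
```

This is the classic LRU worst case; a cache that fits everything would of course hit on every repeat pass, which is just another way of saying the only fix is a pool of fast memory big enough for all the weights.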
In the long term, compute is probably going to move towards the memory.
The main blocker with swapping is not even the limited bandwidth, it's actually the extreme write workload on data elements such as the per-layer model activations - and, to a much lesser extent, the KV-cache. In contrast, there are elements such as inactive experts for highly sparse MoE models, where swapping makes sense since any given expert will probably be unused. You're better off using that VRAM/RAM for something else. So the logic of "reserve VRAM for the highest-value uses, use system RAM as a second tier, finally use storage as a last resort or for read-only data" is still quite valid.
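That tiering logic is mundane enough to live in ordinary application code. A minimal sketch of such a placement policy (names, sizes, and priorities are made up for illustration):

```python
def place(tensors, vram_budget, ram_budget):
    """Greedy tier assignment: the hottest / most write-heavy data claims
    VRAM first, then system RAM, then storage (read-mostly data only).
    `tensors` is a list of (name, size, priority) tuples, where priority
    encodes access frequency and write intensity."""
    placement = {}
    for name, size, _prio in sorted(tensors, key=lambda t: -t[2]):
        if vram_budget >= size:
            placement[name], vram_budget = "vram", vram_budget - size
        elif ram_budget >= size:
            placement[name], ram_budget = "ram", ram_budget - size
        else:
            placement[name] = "storage"
    return placement

tensors = [
    ("activations",  2, 10),  # rewritten every token: must stay in VRAM
    ("kv_cache",     4, 8),   # appended every token
    ("hot_experts",  6, 5),   # frequently routed to
    ("cold_experts", 20, 1),  # rarely activated: fine in RAM or storage
]
print(place(tensors, vram_budget=12, ram_budget=16))
```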
How do you get the weights for the right set of experts for a given batch of tokens into fast memory at the right time?
The set of activated experts is only known after routing, at which point you need the weights immediately and will get very poor performance if they are on the other side of PCIe.
Once your model is large enough you'll have to eat the offload cost for something, and it might as well be something where most of that VRAM footprint isn't even used. For current models, inactive experts arguably fit that description best. Of course, it may be the case that shifting that part of the graph to CPU compute is a better deal than paying the CPU-to-GPU cost for the active weights and computing on GPU; that's how llama.cpp does it.
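A back-of-envelope sketch of that trade-off, with made-up but plausible numbers: fetching an active expert over PCIe can cost more than simply streaming the same weights through the CPU at system-RAM bandwidth, which is roughly what a batch-1, memory-bound expert matmul costs.

```python
# Hypothetical numbers for illustration only.
expert_bytes = 3e9   # one active expert's weights
pcie_bps     = 25e9  # effective host-to-device bandwidth (bytes/s)
ram_bps      = 80e9  # system RAM bandwidth (bytes/s)

# Option A: copy the active expert to the GPU, then compute there.
transfer_s = expert_bytes / pcie_bps       # ~0.12 s just to move it

# Option B: compute on the CPU where the weights already live; a
# memory-bound matmul streams the weights once at RAM bandwidth.
cpu_compute_s = expert_bytes / ram_bps     # ~0.04 s

print(f"PCIe fetch: {transfer_s * 1e3:.0f} ms, "
      f"CPU compute: {cpu_compute_s * 1e3:.1f} ms")
```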
Try turning swap off and really find out if you’re not grateful for it. Might be fine if you’re never using all your RAM, but if you are, swap off isn’t fun and you might realize you’ve been unconsciously grateful this whole time. ;) Swap might be important for GPU usage even when not using something like greenboost, since display GPUs sometimes use system RAM to back the GPU VRAM.
> Try turning swap off and really find out if you’re not grateful
Er, I did exactly this over a decade ago and never looked back. It's literally one of the first things I do on a new machine.
> Might be fine if you’re never using all your RAM
That's definitely happened occasionally, and no, swap almost always just makes it worse. The thrashing makes the entire machine unusable instead of making the allocating app(s) potentially unstable. I've recovered most times by just immediately killing the app I'm using. And in fact I have warnings that sometimes tell me fast enough before I reach the limit to avoid such issues in the first place.
Strix Halo’s unified setup is pretty cool. On systems with 128GB of memory, set the dedicated GPU memory in the BIOS to the smallest permitted value, and the drivers will use the whole main memory pool appropriately on Linux and Windows.
The OSS AMDGPU drivers by default allocate a fixed percentage of system RAM for GTT (up to 75%), they do not automatically use the entire system memory. You can override this with the kernel options I posted in my original comment, but as I mentioned, there are some serious negative consequences. You also may need to disable IOMMU or use PT mode. Personally I have had a lot of crashes as a result of this stuff, so I went back to the defaults and just don't run big models.
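For context, GTT overrides of this kind typically look something like the following (illustrative values for a 128GB machine; parameter availability varies by kernel version, and these are not necessarily the exact values the parent used):

```shell
# /etc/default/grub -- illustrative only. amdgpu.gttsize is in MiB;
# the ttm limits are in 4 KiB pages. 122880 MiB = 31457280 pages = 120 GiB.
GRUB_CMDLINE_LINUX="amdgpu.gttsize=122880 ttm.pages_limit=31457280 ttm.page_pool_size=31457280 iommu=pt"
```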
The biggest factor in whether AMD GPUs on Linux are a PITA or not is ROCm. Strix Halo is supported in ROCm 6.x, so it should be supported on most platforms (I haven't tested it, though). ROCm 7.x is supposed to be better, but not all apps support it yet.
AMD, if you're reading this, please hire more SWEs. Nvidia will continue to dominate until you beat them at software.
It's not true for unified systems, because they have no secondary RAM that could be used to extend the GPU memory.
It's pretty weird to insist on a counterargument that has no implications or consequences to the presented argument.
Yes, swapping is a bad idea.
Your second argument also falls flat, because the standard CUDA hardware setup doesn't use CXL, so cache coherence isn't available. You're left with manual memory synchronization. Pretending that GPUs have a cache for system RAM when they don't is pretty suspect.
From my experience, accessing system RAM from the GPU is so slow, it might as well count as "does not work". It's orders of magnitude faster to memcpy large swaths of the memory you are going to use over to the GPU than to access system memory from a kernel, which waits ages for one small block/page of memory, then waits again for the next, and so on. Latency hiding doesn't work anymore when the latency is that large.
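A crude cost model shows why: every transfer pays a fixed latency before streaming at link bandwidth, so page-sized accesses end up latency-dominated (numbers are illustrative, not measured):

```python
def transfer_ms(total_bytes, chunk_bytes, latency_us=10, bw_bps=25e9):
    """Total time to move `total_bytes` in chunks of `chunk_bytes`:
    each chunk pays a fixed latency, then streams at link bandwidth."""
    n_chunks = total_bytes // chunk_bytes
    per_chunk_ms = latency_us / 1000 + chunk_bytes / bw_bps * 1e3
    return n_chunks * per_chunk_ms

total = 1 << 30  # 1 GiB
print(f"one bulk copy: {transfer_ms(total, total):.1f} ms")  # ~43 ms
print(f"4 KiB chunks:  {transfer_ms(total, 4096):.0f} ms")   # ~60x slower,
                                                             # latency-dominated
```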
You’re right for some workloads, but not all of them. The same could have been said of disk swap since the beginning, and people still found it valuable. Disk swapping with spinning drives used to be multiple orders of magnitude slower than RAM, but it prevented applications or the system from crashing.
Using system memory from the GPU isn’t that bad if your compute intensity is high enough and you don’t transfer that much data. There are commercial applications that support it and see only a low double-digit percentage performance impact, not the multiples you might expect. Plus, on Windows on Nvidia hardware, the driver will automatically use system memory if you oversubscribe VRAM; I believe this was introduced to support running Stable Diffusion on smaller GPUs.
Yes, with current LLMs and current hardware and current supporting software this is a true statement. My point wasn't that this approach suddenly changes that, it was that it makes it easier to explore alternatives that might change that. Let's imagine some possibilities:
- Models that use a lot of weight reuse: if you strategically reuse layers 3-4x, that could give a lot of time for async loading of future weights.
- Models that select experts for several layers at a time: same thing; while crunching on the current layer, you have teed-up future layers that can be transferring in.
- HW makers keep improving memory bandwidth: this is already happening, right? AMD and Apple are pushing unified memory architectures with much higher bandwidth, though still not quite there compared to GPUs. This could lead to a hybrid approach that makes those machines much more competitive. Similarly, HW makers could bring back technologies that died on the vine, things like Intel's Optane. Start making mass storage as fast as system memory is now and the equation may change.
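The first two bullets reduce to a simple overlap condition: prefetching the next layer's weights is free as long as the transfer fits under the compute you are doing in the meantime. A toy model (illustrative numbers):

```python
def transfer_hidden(k_reuse, t_compute_ms, t_transfer_ms):
    """True if the next layer's weight transfer fits entirely under the
    compute time of the current layer, reused k_reuse times."""
    return k_reuse * t_compute_ms >= t_transfer_ms

# One pass per layer: a 120 ms PCIe fetch cannot hide under 30 ms of compute.
print(transfer_hidden(1, 30, 120))  # False
# Reuse the same layer 4x and the fetch is fully overlapped.
print(transfer_hidden(4, 30, 120))  # True
```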
These are quick dart throws that probably have obvious holes in them, but the point is that platforms like this help us explore paths that appeared dead-end until the one change that makes them viable and lets them take over. It may not happen. It may be a dead end. But that dismissive logic means we would never go out on a limb and try something new. We need people and tech that challenge assumptions and make it easy to try out ideas, to keep the tech ecosystem evolving. This does that. Even if this particular project doesn't succeed, it is a great thing to do, if for no other reason than that it likely just spurred a bunch of people to try their own crazy hacks for LLM inference. Maybe it even enabled a use case with GPUs that nobody realized existed and has nothing to do with LLMs.
The market cap is unambiguous, a more correct estimate of "how much to buy all the shares?" is situational and would just distract from getting the point across.
> According to the researchers, an unpatched Windows PC connected to the Internet will last for only about 20 minutes before it's compromised by malware, on average. That figure is down from around 40 minutes, the group's estimate in 2003.
This was from two decades ago, and cursory searching suggests the average lifetime of an unpatched system is even lower now.
2. The Claude Code system prompt almost certainly gives directions about how to deal with MCP tools, and may also include the list of tools
3. Instruction adherence is higher when the instructions are placed in the system prompt
If you put these three facts together then it’s quite likely that Claude Code usage of a particular tool (in the generic sense) is higher as an MCP server than as a CLI command.
But why let this be a limitation? Make an MCP server that calls your bash commands. Claude Code will happily vibe code this for you, if you don’t switch to a coding tool that gives better direct control of your system prompt.
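The wrapper really is that thin. Stripped of the MCP plumbing, the body of such a tool is just a subprocess call (hypothetical tool name, sketch only):

```python
import subprocess

def run_command(command: str, timeout: int = 60) -> str:
    """Hypothetical MCP tool body: run a shell command and return its
    combined output, so the model can use existing CLIs as tools."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=timeout
    )
    return result.stdout + result.stderr

print(run_command("echo hello from a wrapped CLI"))
```

In a real server you would register this function as a tool via your MCP framework of choice and almost certainly restrict which commands it may run.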
I think the idea is that instead of spending an additional $4000 on external hardware, you can just buy one thing (your main work machine) and call it a day. Also, the Mac Studio isn’t that much cheaper at that price point.
> Being able to leave the thing at home and access it anywhere is a feature, not a bug.
I can do that with a laptop too. And with a dedicated GPU. Or a blade in a data center. I thought the feature of the DGX was that you can throw it in a backpack.
You're not going to use the DGX as your main machine, so you'll need another computer. Sure, not a $4000 one, but you'll want at least some performance, so it'll be another $1000-$2000.
Because Nvidia is incredibly slow with kernel updates, and you are lucky if you get them at all after just two years. I am curious whether they will update these machines for longer than their older DGX-like hardware.
Now that you bring it up, the M3 Ultra Mac Studio goes up to 512GB for about a $10k config, with around 850 GB/s bandwidth, for those who "need" a near-frontier large model. I think 4x the RAM is not quite worth more than doubling the price, especially if MoE support gets better, but it's interesting that you can get a DeepSeek R1 quant running on prosumer hardware.
The platter is a circle, so using the uniform distribution on [0, 1] is incorrect; you should use the uniform circular distribution on [0, 2pi]. And since the platter spins in a single direction, the distance is only computed going one way around (if the target is right before the current position, it's nearly one full spin).
But you can simplify this problem down and ask: with no loss of generality, if your starting point is always 0 degrees, how many degrees clockwise is a random point on average, if the target is uniformly distributed?
Since 0-180 has the same arc length as 180-360, the average distance is 180 degrees. So the average seek is a half-platter rotation, half the cost of a full-platter seek.
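A quick Monte Carlo sanity check of the 180-degree claim, plus the either-direction variant for comparison (pure-Python sketch):

```python
import random

random.seed(0)
targets = [random.uniform(0, 360) for _ in range(200_000)]

# One-directional (the platter spins one way): the clockwise distance
# from a start fixed at 0 degrees is just the target angle itself.
one_way = sum(targets) / len(targets)                              # ~180

# If you could rotate either way, the expected distance halves,
# since you always take the shorter arc.
either_way = sum(min(t, 360 - t) for t in targets) / len(targets)  # ~90

print(round(one_way), round(either_way))
```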
What you wrote only applies to rotational latency, not seek latency. Seek latency is the time it takes for the head to reach the target track. Heads only rotate within a small range, something like [0..25°], and they are designed for rapid movements in either direction.
> If you're copying and pasting something, there probably isn't a good reason for that.
I would embrace copying and pasting for functionality that I want to be identical in two places right now, but that I’m not sure ought to be identical in the future.
I agree completely. DRY shouldn't be a compression algorithm.
If two countries happen to calculate some tax in the same way at a particular time, I'm still going to keep those functions separate, because the rules are made by two different parliaments independently of each other.
Referring to the same function would simply be an incorrect abstraction. It would suggest that one tax calculation should change whenever the other changes.
If, on the other hand, both countries were referring to a common international standard then I would use a shared function to mirror the reference/dependency that they decided to put into their respective laws.
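In code, that intent is easiest to see when the duplication is left standing (country names and rates are made up for illustration):

```python
# Deliberate duplication: the two calculations are textually identical
# today, but each mirrors a different legislature's rules and must be
# free to diverge independently.
def vat_country_a(net: float) -> float:
    return round(net * 0.20, 2)

def vat_country_b(net: float) -> float:
    return round(net * 0.20, 2)  # same formula -- for now

# When country B changes its rate, only its function changes:
print(vat_country_a(100.0), vat_country_b(100.0))  # 20.0 20.0
```

If both instead referenced a shared international standard, a single `vat_per_standard()` function would be the honest model of that dependency.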
btw, VeraCrypt is the name of the follow-up project. TrueCrypt shut down over a decade ago, rather abruptly, so anything labeled TrueCrypt today is suspect as either out of date or potential malware.
Wasn't the conspiracy theory that truecrypt got shut down because it was 'too effective', and the successor projects presumably have intentional backdoors or something?
TrueCrypt was likely developed by only one man, Paul Le Roux, who likely shut it down because he was on the run for being an international drug/human smuggler and cartel member. It’s kind of a crazy story.
But either way both truecrypt and veracrypt were independently audited and no major flaws were found. Not sure when the last veracrypt audit was done.
Nah, just drop a few thousand 1GB flash drives from a plane. Load them with a tor browser, a wireguard client, and instructions on finding a remote exit. Only one copy needs to survive and it can spread very quickly and irreversibly by foot.
Yeah, this is a great approach if you're already at war with a country.
If you're not and they're still allowing your planes to fly through their airspace then this is a great way to ensure that they lock your (and your friends') planes out.
It's unjust in the same sense that some people complain about capitalism being unjust: some people are wealthy who didn't cosmically deserve it, but just got lucky. There is disagreement over in which way they were lucky (random luck, or lucky to have the right parents, education, genes, etc.)