Hacker News

Modern x86_64 has supported multiple page sizes for a long time. I'm on commodity Zen 5 hardware (9900X) with 128 GiB of RAM. Linux will still use a base page size of 4 KiB but also supports both 2 MiB and 1 GiB huge pages. You can pass something like `default_hugepagesz=2M hugepagesz=1G hugepages=16` to your kernel on boot to make 2 MiB the default huge page size but also reserve sixteen 1 GiB pages for later use.
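For reference, on a GRUB-based distro those parameters typically go on the kernel command line, and the reservation can be checked after boot. The config path and sysfs layout below assume a standard Linux setup; adjust for your bootloader:

```shell
# /etc/default/grub (GRUB-based distros; run update-grub and reboot afterwards)
GRUB_CMDLINE_LINUX_DEFAULT="default_hugepagesz=2M hugepagesz=1G hugepages=16"

# After rebooting, verify the 1 GiB pool (should show 16 if the reservation succeeded):
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
grep HugePages /proc/meminfo
```

Note the reservation has to happen at boot for 1 GiB pages, since the kernel needs physically contiguous memory that's hard to find once the system is fragmented.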

The nice thing about mimalloc is that there are a ton of configurable knobs available via environment variables. I'm able to hand those sixteen 1 GiB pages to the program at launch via `MIMALLOC_RESERVE_HUGE_OS_PAGES=16`.
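For a binary that doesn't link mimalloc directly, the usual pattern is to preload it. This is a sketch: the program name is a placeholder and the shared-library path varies by distro and build:

```shell
# Preload mimalloc and have it claim the reserved 1 GiB pages at startup.
# `./your-app` is a placeholder; the libmimalloc.so path depends on your install.
MIMALLOC_RESERVE_HUGE_OS_PAGES=16 \
LD_PRELOAD=/usr/lib/libmimalloc.so \
./your-app
```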

EDIT: after re-reading your comment a few times, I apologize if you already knew this (which it sounds like you did).




Right, but on Intel the 1 GiB page size has historically been the odd one out. For example, Skylake-X has 1536 shared L2 TLB entries for either 4 KiB or 2 MiB pages, but only 16 entries usable for 1 GiB pages. It wasn't unified until Cascade Lake. And Skylake-like Xeons are still incredibly common in the cloud, so it's hard to target only the later parts.

So for any process that's using less than 16 GiB, it's a significant performance boost. And most processes using more RAM, but not splitting accesses across more than 16 hot 1 GiB regions in rapid succession, will also see a performance boost.

My old Intel CPU only has 4 TLB slots for 1 GiB pages, and that was enough to get me about a 20% performance boost in Factorio. (I think a couple of percent might have come from the allocator change, but the boost from forcing huge pages was very significant.)


That strikes me as a common hugepages win. People never believe you, though, when you say you can make their thing 20% faster for free.

Then it should be pretty easy to demonstrate that 20% "faster for free", no? But as always, the devil is in the details. I experimented a lot with huge pages, and although in theory you should see the performance boost, the workloads I used to test this hypothesis didn't show anything statistically significant or even measurable. So my conclusion was... it depends.
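One way to check whether a given workload is even TLB-bound before reaching for huge pages is to count walk-related events with perf. Event names vary by CPU and kernel version, and the binary here is a placeholder:

```shell
# Compare data-TLB misses against total loads (Linux perf; needs perf_event access).
# On Intel, dtlb_load_misses.walk_completed counts misses that triggered a full walk.
perf stat -e dTLB-loads,dTLB-load-misses ./your-workload
```

If the miss rate is a tiny fraction of loads, huge pages are unlikely to move the needle no matter how they're configured.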

Try a big Factorio map as a test case. It's a bit of an outlier on performance; in particular, it's very heavy on memory bandwidth.

Of course, it only helps workloads that exhibit high rates of page-table walks per instruction. But those are really common.

Yes, I understand that. It is implied that there's a high TLB miss rate. However, I'm wondering whether the penalty, which we can quantify as up to 4 extra memory accesses for a 4-level page table (amounting to ~20 cycles if the page-table entries are already in L1 cache, or ~60-200 cycles if they are in L2/L3), would be noticeable in workloads that are I/O-bound. In other words, would such workloads benefit from switching to huge pages when the CPU spends most of its time waiting for data to arrive from storage anyway?
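As a rough sanity check of that intuition, here's a back-of-envelope calculation. The miss rate, walk latency, and IPC below are assumptions for illustration, not measurements:

```shell
# Hypothetical numbers: 20 TLB misses per 1000 instructions, a ~24-cycle
# walk when the page-table entries are cache-resident, and an IPC of 2.
misses_per_1k=20
walk_cycles=24
ipc=2
base_cycles=$((1000 / ipc))                    # cycles to retire 1000 instructions
walk_overhead=$((misses_per_1k * walk_cycles)) # extra cycles spent in page walks
echo "${walk_overhead} walk cycles on top of ${base_cycles} base cycles"
```

With these made-up numbers the walk overhead is comparable to the useful work, so eliminating it matters a lot for a CPU-bound thread. But if the thread is stalled on storage 99% of the time, the same saving shrinks to a rounding error in wall-clock terms, which supports the "it depends" conclusion.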

In a multi-tenant environment, yes. The faster they can get off the CPU and yield to some other tenant, the better it is.

    > commodity
    > zen 5
    > 128GiB
Are you from the future?

I'm not sure what point you're trying to make.

In the middle of last year, a 9900X was around $350 and 128GB of memory was also around $350. That's very easily "commodity" range.


Damn. I feel old and must've missed that boat. Several other boats too, I guess.

Here I was thinking 16GiB is pretty good. I get to compile LibreOffice in an afternoon. QtWebEngine overnight.

Doesn't 128GiB make rowhammer much more feasible? You'd have 32GiB per DIMM.

Oh well


Two 64GiB DIMMs would be the more likely setup. The current CPUs strongly prefer having only one stick of DDR5 per channel.

The effectiveness of rowhammer depends on how well the manufacturer implemented target row refresh. But the internal ECC on DDR5 should help defend against it somewhat.

Personally I've been in the 24-32GiB range since 2013, and that's despite the fact that I'm still on DDR3.



