Hacker News | menaerus's comments

> Rust is quite capable of expression templates, as its iterator adapters prove.

AFAIU iterator adapters are not quite the same as expression templates, because they rely on compiler optimizations rather than on a built-in language feature that enables you to do this without relying on the compiler pipeline.


I had always thought expression templates at the very least needed the optimizer to inline/flatten the tree of function calls that are built up. For instance, for something like x + y * z I'd expect an expression template type like sum<vector, product<vector, vector>> where sum would effectively have:

    vector l;
    product<vector, vector> r;
    auto operator[](size_t i) {
        return l[i] + r[i];
    }
And then product<vector, vector> would effectively have:

    vector l;
    vector r;
    auto operator[](size_t i) {
        return l[i] * r[i];
    }
That would require the optimizer to inline the latter into the former to end up with a single expression, though. Is there a different way to express this that doesn't rely on the optimizer for inlining?

Expression templates do not rely on the optimizer, since you're not dealing with the computations directly but rather with expressions (nodes) through which you defer the computation until the very last moment (when you have fully built an expression of expressions, basically almost an AST). This guarantees that you get zero cost when you really need it. What you're describing is something akin to copy elision and function folding through inlining, which is pretty much basic in any C++ compiler and happens automatically without special care.

> since you're not dealing with the computations directly but rather with expressions (nodes) through which you defer the computation until the very last moment (when you have fully built an expression of expressions, basically almost an AST).

Right, I understand that. What is not exactly clear to me is how you get from the tree of deferred expressions to the "flat" optimized expression without involving the optimizer.

Take something like the above example for instance - w = x + y * z for vectors w/x/y/z. How do you get from that to effectively

    for (size_t i = 0; i < w.size(); ++i) {
        w[i] = x[i] + y[i] * z[i];
    }
without involving the optimizer at all?

The example is false because that's not how you would write an expression template for the given computation, so the question of how the optimizer manages not to be involved is also not set in the correct context, and I can't give you an answer to it. Of course the optimizer is generally going to be involved, as it is for all code and not just expression templates, but expression templates do not require the optimizer in the way you're trying to suggest. Expression templates do not rely on O1, O2 or O3 levels being set - they work the same way at O0 too, and that may be the hint you were looking for.

> The example is false because that's not how you would write an expression template for the given computation

OK, so how would you write an expression template for the given computation, then?

> Expression templates do not rely on O1, O2 or O3 levels being set - they work the same way at O0 too, and that may be the hint you were looking for.

This claim confuses me, given how expression templates seem to work in practice.

For example, consider Todd Veldhuizen's 1994 paper introducing expression templates [0]. If you take the examples linked at the top of the page and plug them into Godbolt [1] (with slight modifications to isolate the actual work of interest), you can see that with -O0 you get calls to overloaded operators instead of the nice flattened/unrolled/optimized operations you get with -O1.

You see something similar with Eigen [2] - you get function calls to "raw" expression template internals with -O0, and you need to enable the optimizer to get unrolled/flattened/etc. operations.

Similar thing yet again with Blaze [3].

At least to me, it looks like expression templates produce quite different outputs when the optimizer is enabled vs. disabled, and the -O0 outputs very much don't resemble the manually-unrolled/flattened-like output one might expect (and arguably gets with optimizations enabled). Did all of these get expression templates wrong as well?

[0]: https://web.archive.org/web/20050210090012/http://osl.iu.edu...

[1]: https://cpp.godbolt.org/z/Pdcqdrobo

[2]: https://cpp.godbolt.org/z/3x69scorG

[3]: https://cpp.godbolt.org/z/7vh7KMsnv


Look, I have just completed work on a high-performance serialization library which avoids computing heavy expressions and temporary allocations, all by using expression templates, and no, optimization levels are not needed. The code works as advertised at O0 - that's the whole point. If you have a genuine question you should ask one, but please do not disguise it so that it only goes to prove your point. I am not that naive. All I can say is that your understanding of expression templates is not complete and therefore you draw incorrect conclusions. The silly example you provided shows that you don't understand what expression template code looks like, and yet you're trying to prove your point over and over again. Also, most of the time I am writing my comments on my mobile, so I understand that my responses sometimes appear too blunt, but in any case I am obviously not going to write, run or check the code as if I were at work. My comments here are not work, and I am not here to win arguments but, most of the time, to learn from other people's experiences, and sometimes to dispute conclusions based on those experiences too. If you don't believe me, or you believe expression templates work differently, then so be it.

No, they are not.

They are both; there are things that Rust's macros can do metaprogramming-wise that C++ templates cannot do and vice-versa.

Rust's macros work on a syntactic level, so they are more powerful in that they can work with "normally" invalid code and perform token-to-token transformations (and, in the case of proc macros, effectively function as compiler extensions/plugins), and less powerful in that they don't have access to semantic information.


Incorrect.

Totally. It's funny how many people never actually realize this and get stuck in a cargo-cult mindset forever, no matter the years of experience.

I have never observed that issue, and I have been using it to build multi-million-LoC repositories. Perhaps the reason is that I always use it coupled with ccache. Have you tried that?

ccache is a workaround for the mtime problem. You can either hash with ccache or hash directly in the build system, but either way there's no substitute for hashing something. Ccache hashes the inputs to the build, but there may be elements of the build that lie outside ccache's awareness which a hash-aware build system would take care of. Partial rebuilds devolve into a cache invalidation problem pretty quickly either way.

I'm obviously aware that ccache solves the mtime problem, which is why I find it disingenuous to say that switching branches with ninja is "totally unusable". Hence my question.

Hash-aware build systems like Bazel, if that's what you're alluding to, are a nightmare to work with and come with their own set of problems, which makes them much less appealing than (some of) the limitations found in cmake+ninja.


Retrofitting new patterns or ideas goes underutilized only when it is not worth the change. The string_view example is trivial: anyone who cared enough about the extra allocations that could happen (no copy elision taking place) already rolled their own version of string_view or simply used the char+len pattern. Those folks do not wait for a new standard to come along when they can already have the solution now.

The std::optional example, OTOH, is also a bad one because it is heavily opinionated, and having it baked into the API across the standard library would be the wrong choice.


Existing APIs for file IO in the STL don't return string views into the library's file buffer (when using buffered IO). That is something you could do, as an example.

I don't think I agree that optional is opinionated. It is better to have an optional of something that can't be null (such as a reference) than to have everything be implicitly nullable (such as raw pointers). This means you have to care about the nullable case when it can happen, and only when it can happen.

There is a caveat for C++ though: optional<T&> is larger in memory than a raw pointer. Rust optimises this case to be the same size (one pointer) by noting that the zero value can never be valid, so it is a "niche" that can be used for something else, such as the None variant of the Option. Such niche optimisation applies widely across the language, to user-defined types as well. That would be impossible to retrofit onto C++ without at the very least breaking ABI, and probably impossible even at the language level. Maybe it could be done on a type-by-type basis with an attribute to opt in.


I work on a codebase which is heavily influenced by the same sentiment you share wrt optional, and I can tell you it's a nightmare. Has the number of bugs somehow magically decreased? No, it has not. As a matter of fact, the complexity it introduces - coupled, to be honest, with the monadic programming patterns that are normally enforced in such environments - just made it more probable to introduce buggy code, at no obvious advantage but at great cost: ergonomics, reasoning about the code, and performance. So, yeah, I will keep the position that it is heavily opinionated and not solving any real problem until I see otherwise - the evidence being really complex C++ production code. I have worked with many traditional C and C++ codebases, so that is my baseline here. I prefer working with the latter.

Niche optimizations are trivial to automate in modern C++ if you wish. Many code bases automagically generate them.

The caveat is that niche optimizations are not perfectly portable, they can have edge cases. Strict portability is likely why the C++ standard makes niche optimization optional.


> optional<T&>

This is a C++26 feature which will have pointer-like semantics; aren't you confusing it with optional<reference_wrapper<T>>?


My first thought was also that this is reminiscent of RLMs - they ought to solve the same problem, as far as my understanding goes. The authors say "Self-improving AI systems aim to reduce reliance on human engineering by learning to improve their own learning and problem-solving processes", which is what RLMs are trying to solve, so my understanding is that this work shares the same goal but takes a different approach. E.g. instead of using a REPL-like environment with multiple (or even single) agents, which is what RLMs do, they suggest using agents that can modify themselves. I didn't read the paper so I don't know how this really works, but it caught my attention, so if you could share more insights I would appreciate it.

The success of the model responding to you with correct information is a function of giving it proper context, too.

That hasn't changed, nor do I think it will, even with models having very large context windows (e.g. Gemini has 2M). It has been observed that having a large context alone is not enough, and that it is better to give the model sufficient, high-quality information rather than filling it with virtually everything. The latter is also impossible, and does not scale well with long and complicated tasks where reaching the context limit is inevitable. In that case you need RAG that is smart enough to extract the relevant information from previous answers/context and make it part of the new context, which in turn will make it possible for the model to keep its performance at a satisfactory level.


Golang has a GC and that makes a lot of things easier.

So perhaps a mixed read+write workload would be more interesting, no? Write-only is characteristic of ingestion workloads. That said, libaio vs io_uring difference is interesting. Did you perhaps run a perf profile to understand where the differences are coming from? My gut feeling is that it is not necessarily an artifact of less context-switching with io_uring but something else.

There are a couple of challenges with mixed read+write workloads on NVMe.

In practice, read latency tends to degrade over time under mixed load. We observe this even across relatively short consecutive runs. To get meaningful results, you need to first drive the device into a steady state. In our case, however, we were primarily interested in software overhead rather than device behavior.

For a cleaner comparison, it would probably make sense to use something like an in-memory block device (e.g., ublk), but we didn’t dig into it.

As for profiling: we didn’t run perf, so the following is my educated guess:

1. With libaio, control structures are copied as part of submission/completion. io_uring avoids some of this overhead via shared rings and pre-registered resources.

2. In our experience (in YDB), AIO syscall latency tends to be less predictable, even when well-tuned.

3. Although we report throughput, the setup is effectively latency-bound (single fio job). With more concurrency, libaio might catch up.

We intentionally used a single job because we typically aim for one thread per disk (two at most if polling is enabled). In our setup (usually 6 disks), increasing concurrency per device is not desirable.


Some quality thoughts here, thanks.

> In practice, read latency tends to degrade over time under mixed load. We observe this even across relatively short consecutive runs. To get meaningful results, you need to first drive the device into a steady state. In our case, however, we were primarily interested in software overhead rather than device behavior.

I see. A provocative thought in that case would be: in what % are io_uring improvements (over libaio) undermined by the device behavior (firmware) in mixed workloads? That % could range from noticeable to almost nothing, so it might very well affect the experiment's conclusion.

For example, if one is posing the question of whether switching to io_uring is worth it, I could definitely see different outcomes of that experiment in mixed workloads, per the observations that you described.

> For a cleaner comparison, it would probably make sense to use something like an in-memory block device (e.g., ublk), but we didn’t dig into it.

Yeah, but in that case wouldn't you then be testing the limits of ublk performance? Also, it seems to be implemented on top of io_uring, AFAICS.

I have personally learned to run experiments, and derive conclusions from them, in an environment which is as close as it gets to the one in production. Otherwise, there's really no guarantee that behavior observed in env1 will be reproducible in, or correlate to, the behavior in env2. Env1 in this particular case could be a write-only workload while env2 would be a mixed workload.

> We intentionally used a single job because we typically aim for one thread per disk (two at most if polling is enabled). In our setup (usually 6 disks), increasing concurrency per device is not desirable

This is also interesting. May I ask why that is the case? Are you able to saturate the NVMe disk with just a single thread? I assume not, but you may be using some particular workloads and/or avoiding the kernel, which makes this possible.


> A provocative thought in that case would be: in what % are io_uring improvements (over libaio) undermined by the device behavior (firmware) in mixed workloads? That % could range from noticeable to almost nothing, so it might very well affect the experiment's conclusion.

That’s absolutely fair. Also, it would be useful to test across different devices, since their behavior can vary significantly, especially when preconditioned or under corner-case workloads.

In our case, we focused on scenarios typical for YDB deployments, so we didn’t extend the study further. That said, we believe the observed trends are fairly general.

> For example, if one is posing the question of whether switching to io_uring is worth it, I could definitely see different outcomes of that experiment in mixed workloads, per the observations that you described.

I agree that for mixed workloads the outcome may differ. However, for us the primary concern in the AIO vs io_uring comparison is syscall behavior.

It is critical that submission does not block unpredictably. Even without polling, io_uring shows consistently better latency across the full range of iodepths. If device latency dominates (as in your scenario), the relative benefit may shrink, but a faster submission path still helps drive higher effective queue depth and utilize the device better.

> This is also interesting. May I ask why that is the case? Are you able to saturate the NVMe disk with just a single thread? I assume not, but you may be using some particular workloads and/or avoiding the kernel, which makes this possible.

The component we are working on is designed for write-intensive workloads. Due to DWPD constraints, we intentionally limit sustained write throughput to what the device can safely handle over its lifetime. In practice, this is often on the order of ~200–300 MB/s, which a single thread can easily saturate.

At the same time, we care a lot about burst behavior. With AIO, we observed poor predictability: total latency depends heavily on how requests are submitted (especially with batching), and syscall time can grow proportionally to batch size * event count.

io_uring largely eliminates this issue by decoupling submission from syscalls and providing a much more stable submission path. Additionally, for bursty workloads we can use SQPOLL + IOPOLL to further reduce latency in specialized setups.


> Also, it would be useful to test across different devices, since their behavior can vary significantly, especially when preconditioned or under corner-case workloads.

Agreed. And from first-hand experience I know how painful this is, and how proving or disproving a hypothesis about a certain cog in the system can turn into a crazy rabbit hole, especially in infrastructure software, which, due to the ever increasing volume of data (and the distribution thereof), is stressing software and hardware to their limits.

I used to test my algos across the wide range of HW I had access to. It included "slow" HDDs and "fast" NVMe disks, even Optane; low and high amounts of RAM; slow and fast CPUs; different cache sizes and topologies; NUMA vs no-NUMA, etc. This was because the software I developed didn't have the luxury of running on fully controlled SW/HW, so I had to make sure that it ran well across different configurations, even different operating systems, microarchitectures, etc.

And it was a challenge to be able to decouple the noise from the signal, given how many experiments one had to run and how volatile (stateful) our HW generally is, not to mention all the non-determinism imposed by the software (database kernel + operating system kernel).

> In our case, we focused on scenarios typical for YDB deployments, so we didn’t extend the study further.

Yes, that is fair enough and basically all that matters - not solving a "general" problem but solving the problem at hand has been the most successful strategy for me as well.

> It is critical that submission does not block unpredictably. Even without polling, io_uring shows consistently better latency across the full range of iodepths. If device latency dominates (as in your scenario), the relative benefit may shrink, but a faster submission path still helps drive higher effective queue depth and utilize the device better.

Yes, I would probably agree that io_uring is in general the better design. A C++ executors-like design, but in the kernel itself; pretty advanced from what I could tell the last time I delved into the implementation details (~2 years ago). Given that I have developed an executor-like (userspace) library myself, I figure that in more extreme cases one would like to gain total control of the IO scheduling and processing. This is an exercise I would like to do at some point.

> ... io_uring largely eliminates this issue by decoupling submission from syscalls and providing a much more stable submission path. Additionally, for bursty workloads we can use SQPOLL + IOPOLL to further reduce latency in specialized setups.

Thanks for sharing the details. I figured there was something peculiar about what you're doing. Quite interesting requirements.


> but there is more shoe consumption than ever before in history.

Is it because the population is constantly growing, or because per-person shoe-units are increasing due to that person's increased wealth, or because per-person shoe-units are increasing due to a 5x lower price per generic shoe-unit? How exactly does that transfer to the production of software and the market absorbing the software hyper-inflation?

