> Inspired by the performance gains and to address the scalability issue of BOLT, we went about designing a scalable infrastructure that can perform BOLT-like post-link optimizations.
> Our experiments on large real-world applications and SPEC with code layout show that Propeller can optimize as effectively as BOLT, with just 20% of its memory footprint and time overhead.
One thing I really wanted to be able to do with Propeller was post-link optimization of a Go program. Go's compiler is both fast and terrible: if you glance at the emitted code, opportunities for simple optimizations abound. Unfortunately, the layout of the program, the references between the debug information and the function entry points, and the fact that Propeller was designed for C++ programs kept me from getting very far. I made some progress, but not enough to keep a long-running Go program from crashing.
It's like Nat Friedman's "GNU Rope"[1] from 1998, which, if memory serves, was inspired by a tool that did the same thing on IRIX. I'd like to see this functionality built into an OS loader, so that the on-disk binaries can remain unchanged (think verifying reproducible builds) but the loader relocates code in memory to maximize locality of reference.
(An aside: I was at that talk at the 1998 Atlanta Linux Showcase and remember it fairly well. I'm the guy who asked Nat if he'd considered using a genetic algorithm for the optimization process.)
Why not hook this into LLVM's compiler infrastructure, instead of recovering the CFG from a binary (and not just any binary, but one that has to be built with special flags anyway)? This seems backwards.
Because if you statically link libraries like the language runtime, it allows you to reorder the blocks in those as well, improving locality and instruction cache utilization.
For example if your code calls a particular runtime function shortly after starting, that function (or the basic blocks from that function that are executed frequently) can be placed close to the call site.
If you're statically linking, then LTO would give you cross-TU visibility to do the same optimization at link time, rather than post-link, after you've lost a lot of fidelity.
Not if it’s a library that you’re not compiling, like language runtimes or system libraries.
LTO requires that you’re compiling all the code to get the benefits.
I’m also not sure if either gcc or clang’s LTO does layout at the basic block level or at the function level. There is a lot of benefit from doing it at the basic block level. You can literally lay out all the code that runs at application start time so that it is mostly adjacent and fall-through from block to block during execution.
This isn’t something you necessarily do instead of LTO, but rather something you can do in addition to it.
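For the curious, the flags involved look roughly like this. A hedged sketch of a Propeller-style build with Clang/LLD, assuming a recent LLVM; the profile-conversion step in the middle is deliberately hand-waved, since its tooling and options vary by version:

```shell
# 1. Build with basic-block labels so a profile can later be mapped
#    back to individual basic blocks.
clang++ -O2 -funique-internal-linkage-names \
        -fbasic-block-sections=labels -o app main.cc

# 2. Profile a representative run using hardware branch (LBR) samples.
perf record -e cycles:u -j any,u -- ./app --typical-workload

# 3. Convert the perf data into a basic-block cluster/ordering file
#    (e.g. with the AutoFDO tooling; details depend on the version).

# 4. Rebuild: hot blocks are split into their own sections, and the
#    layout file tells the toolchain how to order them, so hot paths
#    become mostly fall-through.
clang++ -O2 -funique-internal-linkage-names \
        -fbasic-block-sections=list=clusters.txt -o app main.cc
```

This is what makes the approach compose with LTO: LTO decides what code exists, and the profile-driven relink decides where it goes.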
you're absolutely right. it seems like a huge number of compilers have had 'global optimization' on their todo list, but never got around to it.
i think this paper (can't find a non-paywalled copy) https://dl.acm.org/doi/10.1145/12276.13338 got around 10% just by attempting to align the allocation of callers and callees.
maybe memories are large enough now, and the whole proprietary binary-only library business model gone enough, that we should really be abandoning the idea of separate compilation units as the default.
I think it's a demonstrated fact that nobody cares about out-of-the-box distro performance, because recompiling Ubuntu for your actual CPU (or anything newer than an Opteron) gives large speedups on every program. Doing whole-distro PGO and LTO would be challenging because there's no one true workload upon which all users agree; an optimization for me might be bad for you. Also, PGO and LTO will tend to think initialization and startup code is "cold", whereas a workload like a shell pipeline that execs many programs may be bottlenecked by exactly those functions. If I profile a long-running daemon and rebuild libc with those profiles, it could make startup slower for an exec-heavy workload.
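The workload-dependence problem is easy to see with ordinary instrumented PGO. These are standard Clang/llvm-profdata invocations; the daemon-vs-CLI split is a made-up example to illustrate the point:

```shell
# Instrument, then profile only ONE kind of user: a long-running daemon.
clang -O2 -fprofile-generate=./prof -o tool tool.c
./tool --daemon-mode          # startup code runs once, looks "cold"
llvm-profdata merge -o tool.profdata ./prof/*.profraw

# Rebuild with that profile baked in.
clang -O2 -fprofile-use=tool.profdata -o tool tool.c
# Initialization paths may now be outlined or pushed far from the hot
# text, hurting an exec-heavy user who mostly pays startup cost.
```

A distro shipping one such binary is implicitly choosing whose workload wins.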