> Inspired by the performance gains and to address the scalability issue of BOLT, we went about designing a scalable infrastructure that can perform BOLT-like post-link optimizations.
> Our experiments on large real-world applications and SPEC with code layout show that Propeller can optimize as effectively as BOLT, with just 20% of its memory footprint and time overhead.
One thing I really wanted to be able to do with Propeller was post-link optimization of a Go program. Go's compiler is both fast and terrible: if you glance at the emitted code, opportunities for simple optimizations abound. Unfortunately, the layout of the program, the references between the debug information and the function entry points, and the fact that Propeller was designed for C++ programs kept me from getting very far. I made some progress, but not enough to keep a long-running Go program from crashing.
It's like Nat Friedman's "GNU Rope"[1] from 1998, which, if memory serves, was inspired by a tool that did the same thing on IRIX. I'd like to see this functionality built into an OS loader, so that the on-disk binaries can remain unchanged (think verifying reproducible builds) but the loader relocates code in memory to maximize locality of reference.
(An aside: I was at that talk at the 1998 Atlanta Linux Showcase and remember it fairly well. I'm the guy who asked Nat if he'd considered using a genetic algorithm for the optimization process.)
Why not hook this into LLVM's compiler infrastructure, instead of recovering the CFG from a binary (and not just any binary, but one that has to be built with special flags anyway)? This seems backwards.
Because if you statically link libraries like the language runtime, it allows you to reorder the blocks in those as well, improving locality and instruction cache utilization.
For example if your code calls a particular runtime function shortly after starting, that function (or the basic blocks from that function that are executed frequently) can be placed close to the call site.
If you're statically linking, then LTO would give you cross-TU visibility to do the same optimization at link time, rather than post-link, after you've lost a lot of fidelity.
Not if it’s a library that you’re not compiling, like language runtimes or system libraries.
LTO requires that you’re compiling all the code to get the benefits.
I’m also not sure if either gcc or clang’s LTO does layout at the basic block level or at the function level. There is a lot of benefit from doing it at the basic block level. You can literally lay out all the code that runs at application start time so that it is mostly adjacent and fall-through from block to block during execution.
This isn’t something you necessarily do instead of LTO, but rather something you can do in addition to it.
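For the curious, the flags involved look roughly like this. A hedged sketch of a Propeller-style build with Clang/LLD, assuming a recent LLVM; the profile-conversion step in the middle is deliberately hand-waved, since its tooling and options vary by version:

```shell
# 1. Build with basic-block labels so a profile can later be mapped
#    back to individual basic blocks.
clang++ -O2 -funique-internal-linkage-names \
        -fbasic-block-sections=labels -o app main.cc

# 2. Profile a representative run using hardware branch (LBR) samples.
perf record -e cycles:u -j any,u -- ./app --typical-workload

# 3. Convert the perf data into a basic-block cluster/ordering file
#    (e.g. with the AutoFDO tooling; details depend on the version).

# 4. Rebuild: hot blocks are split into their own sections, and the
#    layout file tells the toolchain how to order them, so hot paths
#    become mostly fall-through.
clang++ -O2 -funique-internal-linkage-names \
        -fbasic-block-sections=list=clusters.txt -o app main.cc
```

This is what makes the approach compose with LTO: LTO decides what code exists, and the profile-driven relink decides where it goes.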
you're absolutely right. it seems like a huge number of compilers have had 'global optimization' on their todo list, but never got around to it.
i think this paper (can't find a non-paywalled copy) https://dl.acm.org/doi/10.1145/12276.13338 got around 10% just by attempting to align the allocation of callers and callees.
maybe memories are large enough now, and the whole proprietary binary-only library business model gone enough, that we should really be abandoning the idea of separate compilation units as the default.
I think it's a demonstrated fact that nobody cares about out-of-the-box distro performance, because recompiling Ubuntu for your actual CPU (or anything newer than an Opteron) gives large speedups on every program. Doing whole-distro PGO and LTO would be challenging because there's no one true workload upon which all users agree; an optimization for me might be bad for you. Also, PGO and LTO will tend to think initialization and startup code is "cold", whereas a workload like a shell pipeline that execs many programs may be bottlenecked by exactly those functions. If I profile a long-running daemon and rebuild libc with those profiles, it could make startup slower for an exec-heavy workload.
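The workload-dependence problem is easy to see with ordinary instrumented PGO. These are standard Clang/llvm-profdata invocations; the daemon-vs-CLI split is a made-up example to illustrate the point:

```shell
# Instrument, then profile only ONE kind of user: a long-running daemon.
clang -O2 -fprofile-generate=./prof -o tool tool.c
./tool --daemon-mode          # startup code runs once, looks "cold"
llvm-profdata merge -o tool.profdata ./prof/*.profraw

# Rebuild with that profile baked in.
clang -O2 -fprofile-use=tool.profdata -o tool tool.c
# Initialization paths may now be outlined or pushed far from the hot
# text, hurting an exec-heavy user who mostly pays startup cost.
```

A distro shipping one such binary is implicitly choosing whose workload wins.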