In some ways, the actual execution of code is opaque to compilers. Modern x86 processors translate their instructions into internal micro-ops in the decode units. AMD and Intel have both had this internal instruction set deeply ingrained into every CPU since roughly the K7 for AMD and the Pentium Pro for Intel. The Pentium M, and later the Core architecture, added micro-op fusion: instead of just rearranging micro-ops, certain micro-ops are combined into composite ones that can be executed in one step. Fusion plus out-of-order execution essentially makes the CPU act internally like a compiler for its own binary format. It's like a JIT runtime for machine code, implemented in hardware.
As far as executable size and performance go, compiling with -Os in GCC will occasionally yield a performance increase, and the effect can even vary across CPUs and architectures as the memory subsystems hit a good rhythm or there are fewer cache misses overall. Usually smaller is better here. -O3 will occasionally unroll gigantic loops, while profile-guided optimization, which analyzes which parts of a binary benefit from unrolling versus fewer misses from a smaller executable, can strike an even better balance between memory-subsystem behavior and execution speed.
Architectures like MIPS have further blind alleys, such as branch-delay slots: the instruction in the slot after a branch always executes, even when the branch is taken. That is effectively out-of-order execution expressed in the program itself, but putting the burden on the compiler instead of implementing the reordering in hardware became a nuisance: the architecture couldn't change how it expected instructions to be scheduled without breaking binary compatibility, and the compiler couldn't tune for different CPUs without a fat-binary approach.