
Some notes: 1. Consider that M1's 8-wide decoder doesn't hit the 5+ GHz clock speeds that Intel Golden Cove's decoder can. More complex logic with more delays is harder to clock up. Of course M1 may be held back by another critical path, but it's interesting that no one has managed to get an 8-wide Arm decoder running at the clock speeds that Zen 3/4 and Golden Cove reach.

A715's slides say the L1 icache gains uop cache features, including caching fusion cases. Likely it's a predecode scheme much like AMD K10's, just more aggressive about what happens in the predecode stage. Arm has been doing predecode (moving some decode stages to the L1i fill path rather than the hotter L1i hit path) to mitigate decode costs for a long time. Mitigating decode costs again with a uop cache never made much sense, especially considering their low clock speeds. Picking one solution or the other is a good move, as Intel/AMD have done. Arm picked predecode for A715.

2. The paper does not say 22% of core power is in the decoders. It does say core power is ~22% of package power. Wrong figure? Also, can you determine if the decoder power situation is different on Arm cores? I haven't seen any studies on that.

3. Multiple decode clusters don't penalize throughput once the load balancing is done right, which Gracemont did. And you have to massively unroll a loop to screw up Tremont anyway. Conversely, clustered decode may lose less throughput on branchy code. Consider that decode slots after a taken branch are wasted, and clustered decode gets around that. Intel stated they preferred 3x3 over 2x4 for that reason.

4. "uops used by ARM are extremely close to the original instructions" It's the same on x86, micro-op count is nearly equal to instruction count. It's helpful to gather data to substantiate your conclusions. For example, on Zen 4 and libx264 video encoding, there's ~4.7% more micro-ops than instructions. Neoverse V2 retires ~19.3% more micro-ops than instructions in the same workload. Ofc it varies by workload. It's even possible to get negative micro-op expansion on both architectures if you hit branch fusion cases enough.

8. You also have to tell your Arm compiler which of the dozen or so ISA extension levels you want to target (see https://gcc.gnu.org/onlinedocs/gcc/AArch64-Options.html#inde...). It's not one option by any means. Not sure what you mean by "peephole heuristic optimizations", but people certainly micro-optimize for both Arm and x86. For Arm, see https://github.com/dotnet/runtime/pull/106191/files as an example. Of course optimizations will vary across ISAs and microarchitectures. x86 is more widely used in performance-critical applications, so there's been more research on optimizing for x86 architectures, but that doesn't mean Arm's cores won't benefit from similar optimization attention should they be pressed into a performance-critical role.



> Not sure what you mean by "peephole heuristic optimizations"

Post-emit or within-emit-stage optimization, where a sequence of instructions is replaced with a shorter, more efficient variant.

Think replacing pairs of ldr and str with ldp and stp, replacing an ldr followed by an increment with an ldr using the post-index addressing mode, or replacing an address calculation before an atomic load with an atomic load that takes an addressing mode (I think it was in ARMv8.3-a?).

The "heuristic" here might be possibly related to additional analysis when doing such optimizations.

For example, the previously mentioned ldr, ldr -> ldp (or stp) optimization is not always a win. During work on .NET 9, there was a change[0] that improved load and store reordering to make it more likely that simple consecutive loads and stores are merged on ARM64. However, this change caused regressions in various hot paths because, for example, a previously matched ldr w0, [addr]; ldr w1, [addr+4] -> modify w0 -> str w0, [addr] sequence got replaced with ldp w0, w1, [addr] -> modify w0 -> str w0, [addr].

Turns out this kind of merging defeated store forwarding on Firestorm (and newer) as well as other ARM cores. The regression was subsequently fixed[1], but I think the parent comment author may have had scenarios like these in mind.
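To make the shape concrete, here's a minimal C sketch (a made-up function, not code from the linked PRs): two adjacent 32-bit loads that a peephole pass may merge into an ldp, followed by a store back to the first word. If p[0] was stored shortly beforehand (say, by a previous call in a hot path), the merged ldp only partially overlaps that store, which is plausibly what trips up store-to-load forwarding here.

    /* Hypothetical example, not from the dotnet PRs. The two adjacent loads are
     * candidates for merging into a single "ldp w0, w1, [x0]". If p[0] was stored
     * shortly before this runs, the merged ldp covers both the stored word and an
     * unstored one, and that partial overlap plausibly defeats store-to-load
     * forwarding on Firestorm and other Arm cores, while two separate ldr's
     * can each forward or hit independently. */
    void bump_first(unsigned int *p)
    {
        unsigned int a = p[0];  /* ldr w0, [x0]     */
        unsigned int b = p[1];  /* ldr w1, [x0, #4] */
        a += b;                 /* modify w0        */
        p[0] = a;               /* str w0, [x0]     */
    }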

[0]: https://github.com/dotnet/runtime/pull/92768

[1]: https://github.com/dotnet/runtime/pull/105695


1. Why would you WANT to hit 5+ GHz when the superlinear power cost takes over? High clocks aren't a feature -- they are a cope.

AMD/Intel maintain an I-cache and a uop cache kept in sync. Using a tiny bit of logic to predecode is different from a massive uop cache working as far in advance as possible, in the hopes that your loops will keep you busy enough that your tiny 4-wide decoder doesn't become overwhelmed.

2. The float workload was always BS because you can't run nothing but floats. The integer workload had 22.1w total core power and 4.8w power for the decoder. 4.8/22.1 is 21.7%. Even the 1.8w float case is 8% of total core power. The only other argument would be that the study is wrong and 4.8w isn't actually just decoder power.

3. We're talking about worst cases here. Nothing stops ARM cores from creating a "work pool" of upcoming branches in priority order for them to decode if they run out of stuff on the main branch. This is the best of both worlds where you can be faster on the main branch AND still do the same branchy code trick too.

4. This is the tail wagging the dog (and something else if your numbers are correct). Complex x86 instructions have garbage performance, so they are avoided by the compiler. The problem is that you can't GUARANTEE those instructions will NEVER be used, so the mere specter of them forces complex algorithms all over the place where ARM can do simpler things.

In any case, your numbers raise a VERY interesting question about x86 being RISC under the hood.

Consider this. Say that we have 1024 bytes of ARM code (256 instructions). x86 is around 15% smaller (~870 bytes), and with the longer 4.25-byte average instruction length, x86 should have around 205 instructions. If ARM is generating 19.3% more uops than instructions, we have about 305 uops. x86 with just 4.7% more has about 215 uops (the difference is way outside any margin of error).

If both are doing the same work, x86 uops must be in the range of 30% more complex. Given the limits of what an ALU can accomplish, we can say with certainty that x86 uops are doing SOMETHING that isn't the RISC they claim to be doing. Perhaps one could claim that x86 does some more sophisticated instructions in hardware, but that claim would need to be substantiated: I don't know of ISA instructions that give a 15% advantage when done in hardware but aren't already in the ARM ISA, and I don't see ARM refusing to add circuitry for current instructions if it could reduce uops by 15% either.
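For what it's worth, here's that back-of-the-envelope arithmetic as a tiny C program. All the inputs are the estimates from the comments above, so the output is only as good as they are; it works out to ARM executing roughly 40% more uops, or equivalently x86 executing roughly 30% fewer.

    #include <stdio.h>

    int main(void)
    {
        /* All inputs are the estimates quoted above, not measurements of mine. */
        double arm_bytes = 1024.0;
        double arm_inst  = arm_bytes / 4.0;    /* fixed 4-byte Arm instructions   */
        double x86_bytes = arm_bytes * 0.85;   /* assume x86 code ~15% smaller    */
        double x86_inst  = x86_bytes / 4.25;   /* ~4.25 bytes per x86 instruction */
        double arm_uops  = arm_inst * 1.193;   /* ~19.3% expansion on Neoverse V2 */
        double x86_uops  = x86_inst * 1.047;   /* ~4.7% expansion on Zen 4        */

        printf("ARM: %.0f instructions -> %.0f uops\n", arm_inst, arm_uops); /* 256 -> ~305  */
        printf("x86: %.0f instructions -> %.0f uops\n", x86_inst, x86_uops); /* ~205 -> ~215 */
        printf("ARM/x86 uop ratio: %.2f\n", arm_uops / x86_uops);            /* ~1.42        */
        return 0;
    }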

8. https://en.wikipedia.org/wiki/Peephole_optimization

The final optimization stage is basically heuristic find & replace. There could in theory be a mathematically provable "best instruction selection", but finding it would require trying every possible combination, which isn't tractable (and there's no efficient shortcut unless P = NP).

My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though. Instead, you take your basic, short instruction and pad it with useless bytes. Add a couple useless bytes to a bunch of instructions and you now have the right length to push the function over to the cache boundary without adding any NOPs.

But the issues go deeper. When do you use a REX prefix? You may want it so you can use 16 registers, but it also increases code size. APX (with REX2 and its extended EVEX forms) is going to push this further: you must juggle when to use 8, 16, or 32 registers and when the longer prefixes are worth it for things like the new three-operand instructions. All kinds of weird tradeoffs exist throughout the system. Because the compilers optimize for the CPU and the CPU optimizes for the compiler, you can wind up in very weird places.

In an ISA like ARM, there isn't any code density weirdness to consider. In fact, there's very little weirdness at all. Write it the intuitive way and you're pretty much guaranteed to get good performance. Total time to work on the compiler is a zero-sum game given the limited number of experts. If you have to deal with these kinds of heuristic headaches, there's something else you can't be working on.


> My favorite absurdity of x86 (though hardly the only one) is padding. You want to align function calls at cacheline boundaries, but that means padding the previous cache line with NOPs. Those NOPs translate into uops though.

I'd call that more neat than absurd.

> You may want it so you can use 16 registers, but it also increases code size.

RISC-V has the exact same issue, some compressed instructions having only 3 bits for operand registers. And on x86, 64-bit-operand instructions need the REX prefix anyway. And it's not that hard to solve reasonably well: just assign registers by their use count.
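As a toy sketch of that heuristic (made-up data, nothing from a real register allocator): count uses per virtual register and hand the short encodings (the REX-free low eight on x86-64, or the RVC-reachable x8-x15 on RISC-V) to the most-used ones.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy example, not real compiler code: sort virtual registers by use count
     * and give the first SHORT_REGS of them the "cheap" encodings. */
    #define SHORT_REGS 2

    struct vreg { int id; int uses; };

    static int by_uses_desc(const void *a, const void *b)
    {
        return ((const struct vreg *)b)->uses - ((const struct vreg *)a)->uses;
    }

    int main(void)
    {
        struct vreg v[] = { {0, 3}, {1, 17}, {2, 9}, {3, 1} };
        int n = (int)(sizeof v / sizeof v[0]);

        qsort(v, n, sizeof v[0], by_uses_desc);
        for (int i = 0; i < n; i++)
            printf("vreg %d (%d uses) -> %s encoding\n",
                   v[i].id, v[i].uses, i < SHORT_REGS ? "short" : "long");
        return 0;
    }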

Peephole optimizations specifically here are basically irrelevant. Much of the complexity for x86 comes from just register allocation around destructive operations (though, that said, that does have rather wide-ranging implications). Other than that, there's really not much difference; all have the same general problems of moving instructions together for fusing, reordering to reduce register pressure vs putting parallelizable instructions nearer, rotating loops to reduce branches, branches vs branchless.


RISC-V has a different version of this issue that is pretty straightforward. Preferring 2-register (destructive) operations is already done to save encoding space. The only real extra is preferring the 8 registers the compressed (C) extension can encode for arithmetic. After this, it's all just compression.

x86 has a multitude of factors other than just compression. This is especially true with standard vs REX instructions, because most of the original 8 registers have specific purposes and instructions that depend on them (e.g., accumulator forms use the A register, mul/div use A and D, variable shifts use C, etc). It's a problem a lot harder than simple compression.

Just as cracking an alphanumeric password is exponentially harder than a same-length password with numbers only, solving for all the x86 complications and exceptions is also exponentially harder.


If anything, I'd say x86's fixed operands make register allocation easier! Don't have to register-allocate that which you can't. (ok, it might end up worse if you need some additional 'mov's. And in my experience more 'mov's is exactly what compilers often do.)

And, right, RISC-V even has the problem of being two-operand for some compressed instructions. So the same register allocation code that's gone into x86 can still help RISC-V (and vice versa)! On RISC-V, failure means going from 2 to 4 bytes on a compressed instruction, and on x86 it means +3 bytes for a 'mov'. (Granted, the additional REX prefix cost is separate on x86, while on RISC-V it's already counted in losing the compressed form.)


With 16 registers, you can't just avoid a register because it has a special use. Instead, you must work to efficiently schedule around that special use.

Lack of special GPRs means you can rename with impunity (this will change slightly with the load/store pair extension). Having 31 truly general-purpose registers rather than 8 GPRs plus 8 special-purpose ones also gives a lot of freedom to compilers.


Function arguments and return values are already effectively special-use, and they should show up at least as often as, if not far more often than, the couple of x86 instructions with fixed registers.

Both clang and gcc support calls with differing calling conventions within one function, which ends up effectively identical to fixed-register instructions (i.e. an x86 'imul r64' can be modeled as a pseudo-call where the return values are in rdx & rax, one input is in rax, everything else is non-volatile, and the dynamically-choosable input can be allocated separately). And '__asm__()' can mix fixed and non-fixed registers anyway.
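To illustrate the '__asm__()' point, a sketch of mine (assuming GCC/Clang extended asm on x86-64): a one-operand widening multiply where rax/rdx are pinned by constraints while the other input can be allocated to any GPR.

    /* Sketch, assuming GCC/Clang extended asm on x86-64: one-operand imul has
     * fixed rax/rdx operands (pinned by the "a"/"d" constraints), while the
     * second input uses a plain "r" constraint and is allocated normally. */
    static inline __int128 widening_imul(long x, long y)
    {
        long lo, hi;
        __asm__("imulq %3"
                : "=a"(lo), "=d"(hi)   /* outputs fixed to rax/rdx      */
                : "a"(x), "r"(y)       /* x pinned to rax, y in any GPR */
                : "cc");
        return ((__int128)hi << 64) | (unsigned long)lo;
    }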


Unlike x86, none of this is strictly necessary. As long as you put things back as expected, you may use all the registers however you like.


The option of not needing any fixed register usage would apply to, what, optimizing compilers without support for function calls (at least ones that pass arguments/results in registers)? That's a very tiny niche to use as an argument for simplified compiler behavior.

And good register allocation is still pretty important on RISC-V - using more registers, besides leading to less compressed instruction usage, means more non-volatile register spilling/restoring in function prologue/epilogue, which on current compilers (esp. clang) happens at the start & end of functions, even in paths that don't need the registers.
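A small illustration of the prologue/epilogue point (my own example, not from any compiler's test suite): values kept live across calls need callee-saved registers, and the saves/restores tend to be paid even on the early-exit path.

    /* Illustration only: a, b and c stay live across the calls to g(), so they
     * need callee-saved registers; as noted above, current compilers (clang
     * especially) emit the corresponding saves/restores in the prologue and
     * epilogue, which the early-exit path then pays for too. */
    extern int g(int);

    int f(int a, int b, int c, int early)
    {
        if (early)
            return a;   /* still runs the prologue saves under such codegen */
        return g(a) + g(b) + g(c);
    }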

That said, yes, RISC-V still indeed has much saner baseline behavior here and allows for simpler basic register allocation, but for non-trivial compilers the actual set of useful optimizations isn't that different.


Not just simpler basic allocation. There are fewer hazards to account for as well. The process on RISC-V should be shorter, faster, and with less risk that the chosen heuristics are bad in an edge case.


1. Performance. Also, Arm implemented instruction cache coherency.

Predecode/uop cache are both means to the same end, mitigating decode power. AMD and Intel have used both (though not on the same core). Arm has used both, including both on the same core for quite a few generations.

And a uop cache is just a cache. It's also big enough on current generations to cache more than just loops, to the point where it covers a majority of the instruction stream. Not sure where the misunderstanding of the uop cache "working as far in advance as possible" comes from. Unless you're talking about the BPU running ahead and prefetching into it? Which it does for L1i, and L2 as well?

2. "you can't run nothing but floats" they didn't do that in the paper, they did D += A[j] + B[j] ∗ C[j]. Something like matrix multiplication comes to mind, and that's not exactly a rare workload considering some ML stuff these days.

But also, has a study been done on Arm cores? For all we know they could spend similar power budgets on decode, or more. I could say an Arm core uses 99% of its power budget on decode, and be just as right as you are (they probably don't; my point is that you don't have concrete data on both Arm and x86 decode power, which would be necessary for a productive discussion on the subject).

3. You're describing letting the BPU run ahead, which everyone has been doing for the past 15 years or so. Losing fetch bandwidth past a taken branch is a different thing.

4. Not sure where you're going. You started by suggesting Arm has less micro-op expansion than x86, and I provided a counterexample. Now you're talking about avoiding complex instructions, which a) compilers do on both architectures (they'll avoid stuff like division), and b) humans don't in cases where complex instructions are beneficial; see the Linux kernel using rep movsb (https://github.com/torvalds/linux/blob/5189dafa4cf950e675f02...), and Arm introducing similar complex instructions (https://community.arm.com/arm-community-blogs/b/architecture...)

Also "complex" x86 instructions aren't avoided in the video encoding workload. On x86 it takes ~16.5T instructions to finish the workload, and ~19.9T on Arm (and ~23.8T micro-ops on Neoverse V2). If "complex" means more work per instruction, then x86 used more complex instructions, right?

8. You can use a variable-length NOP on x86, or multiple NOPs on Arm, to align function calls to cacheline boundaries. What's the difference? Isn't the latter worse if you need to move by more than 4 bytes, since you have multiple NOPs (and thus multiple uops, which you assume is always the case but isn't, as some x86 and some Arm CPUs can fuse NOP pairs)?

But seriously, do try gathering some data to see if cacheline alignment matters. A lot of x86/Arm cores that do micro-op caching don't seem to care if a function (or branch target) is aligned to the start of a cacheline. Golden Cove's return predictor does appear to track targets at cacheline granularity, but that's a special case. Earlier Intel and pretty much all AMD cores don't seem to care, nor do the Arm ones I've tested.

Anyway, you're making a lot of unsubstantiated guesses on "weirdness" without anything to suggest it has any effect. I don't think this is the right approach. Instead of "tail wagging the dog" or whatever, I suggest a data-based approach where you conduct experiments on some x86/Arm CPUs, and analyze some x86/Arm programs. I guess the analogy is, tell the dog to do something and see how it behaves? Then draw conclusions off that?


1. The biggest chip market is laptops, and getting 15% better performance for 80% more power (like we saw with X Elite recently) isn't worth doing outside the marketing win of a halo product (a big reason why almost everyone is using slower X Elite variants). The most profitable (per-chip) market is servers. They also prefer lower clocks and better perf/watt, because even with high chip costs, the energy will wind up costing them more over the chip's lifespan. There's also a real cost to adding extra pipeline stages; Tejas/Jayhawk are Intel's cancelled examples of this.

L1 cache is "free" in that you can fill it with simple data moves. uop cache requires actual work to decode and store elements for use in addition to moving the data. As to working ahead, you already covered this yourself. If you have a nearly 1-to-1 instruction-to-uop ratio, having just 4 decoders (eg, zen4) is a problem because you can execute a lot more than just 4 instructions on the backend. 6-wide Zen4 means you use 50% more instructions than you decode per clock. You make up for this in loops, but that means while you're executing your current loop, you must be maxing out the decoders to speculatively fill the rest of the uop cache before the loop finishes. If the loop finishes and you don't have the next bunch of instructions decoded, you have a multi-cycle delay coming down the pipeline.

2. I'd LOVE to see a similar study of current ARM chips, but I think the answer here is pretty simple to deduce. ARM's slide says "4x smaller decoders vs A710" despite adding a 5th decoder. They claim a 20% reduction in power at the same performance, and the biggest change is the decoder. As x86 decode is absolutely more complex than aarch32 decode, we can only deduce that switching from x86 to aarch64 would be an even more massive reduction. If we assume an identical 75% reduction in decoder power, we'd move from 4.8w for Haswell's decoder down to 1.2w, reducing total core power from 22.1w to 18.5w, a ~16% overall reduction. This isn't too far from the power numbers claimed by ARM.
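Spelling that out (the Haswell figures are the ones quoted from the paper above; the 75% decoder reduction is an assumption carried over from the A715 slide):

    #include <stdio.h>

    int main(void)
    {
        double core_w   = 22.1;   /* Haswell total core power, integer workload */
        double decode_w = 4.8;    /* Haswell decoder power                      */
        double assumed  = 0.75;   /* assumed decoder power reduction            */
        double new_core = core_w - decode_w * assumed;

        printf("decoder: %.1fw -> %.1fw\n", decode_w, decode_w * (1 - assumed)); /* 4.8 -> 1.2 */
        printf("core:    %.1fw -> %.1fw (%.0f%% lower)\n",
               core_w, new_core, 100.0 * (core_w - new_core) / core_w);          /* ~16%       */
        return 0;
    }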

4. This was a tangent. I was talking about uops rather than the ISA. Intel claims to be simple RISC internally just like ARM, but if Intel is using nearly 30% fewer uops to do the same work, their "RISC" backend is way more complex than they're admitting.

8. I believe aligning functions to cacheline boundaries is a default flag at higher optimization levels, and I'm pretty sure the compiler developers did the analysis before enabling it by default. x86's NOP flexibility is superior to ARM's (as is its ability to avoid NOPs entirely), but the cause is the weirdness of the x86 ISA, and I think it's an overall net negative.
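For reference, GCC and Clang let you ask for this per function as well as via the global -falign-functions flag; a minimal sketch of mine, assuming a 64-byte cacheline:

    /* Sketch: ask for this function to start on a 64-byte boundary instead of
     * relying on the global -falign-functions default. The gap before it is
     * filled by the assembler, typically with multi-byte NOPs on x86. */
    __attribute__((aligned(64)))
    int hot_entry(int x)
    {
        return x * x;
    }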

Loads of x86 instructions are microcode only. Use one and it'll be thousands of cycles. They remain in microcode because nobody uses them (so why even try to optimize them), and they aren't used because they are dog slow. How would you collect data about this? Nothing will ever change unless someone pours millions of dollars of man-hours into attempting to speed them up, but why would anyone want to do that?

Optimizing for a local maximum rather than a global maximum happens all over technology, and it happens exactly because of the data-driven approach you are talking about: look for the hot code and optimize it, without considering that there may be a better architecture you could be using instead. Many successes relied on an intuitive hunch.

ISA history has a ton of examples. iAPX432 super-CISC, the RISC movement, branch delay slots, register windows, EPIC/VLIW, Bulldozer's CMT, or even the Mill design. All of these were attempts to find new maxima with greater or lesser degrees of success. When you look into these, pretty much NONE of them had any real data to drive them because there wasn't any data until they'd actually started work.


1. Yeah I agree, both X Elite and many Intel/AMD chips clock well past their efficiency sweet spot at stock. There is a cost to extra pipeline stages, but no one is designing anything like Tejas/Jayhawk, or even earlier P4 variants these days. Also P4 had worse problems (like not being able to cancel bogus ops until retirement) than just a long pipeline.

Arm's predecoded L1i cache is not "free" and can't be filled with simple data moves. You need predecode logic to translate raw instruction bytes into an intermediate format. If Arm expanded predecode to handle fusion cases in A715, that predecode logic is likely more complex than in prior generations.

2. Size/area is different from power consumption. Also, the decoder is far from the only change. The BTBs went from two levels to three, and that can help efficiency (you could make a smaller L2 BTB with similar latency, while a slower third level keeps capacity up). TLBs are bigger, probably reducing page walks. Remember page walks are memory accesses, and the paper earlier showed data transfers account for a large percentage of dynamic power.

4. IMO no one is really RISC or CISC these days.

8. Sure you can align the function or not. I don't think it matters except in rare corner cases on very old cores. Not sure why you think it's an overall net negative. "feeling weird" does not make for solid analysis.

Most x86 instructions are not microcode only. Again, check your data with performance counters: microcoded instructions are in the extreme minority. Maybe they were more common in 1978 with the 8086, but a few things have changed between then and now. Also, microcoded instructions do not cost thousands of cycles. Have you checked? E.g. a gather is ~22 micro-ops on Haswell, per https://uops.info/table.html, and Golden Cove does it in 5-7 uops.
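To put a face on the gather example (my own snippet, requires AVX2; 'table'/'idx' are just illustrative names):

    #include <immintrin.h>

    /* One AVX2 gather (vpgatherdd): loads eight 32-bit elements from table at
     * the eight indices in idx. Per the uops.info figures above, that's ~22
     * micro-ops on Haswell and 5-7 on Golden Cove, i.e. nowhere near thousands
     * of cycles. Compile with -mavx2. */
    static inline __m256i gather8(const int *table, __m256i idx)
    {
        return _mm256_i32gather_epi32(table, idx, 4);  /* scale = 4 bytes */
    }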

ISA history has a lot of failed examples where people tried to lean on the ISA to simplify the core architecture. EPIC/VLIW, branch delay slots, and register windows have all died off. Mill is a dumb idea and never went anywhere. Everyone has converged on big OoO machines for a reason, even though doing OoO execution is really complex.

If you're interested in cases where ISA does matter, look at GPUs. VLIW had some success there (AMD Terascale, the HD 2xxx to 6xxx generations). Static instruction scheduling has been used in Nvidia GPUs since Kepler. In CPUs the ISA really doesn't matter unless you do something that actively makes an OoO implementation harder, like register windows or predication.



