
> Why was it not a problem for Intel to add MMX, several SSE iterations, AVX, AVX2, but now 9 years of nothing, even though parallel computing is annoying to program for and dependence on GPUs is only gaining in importance?

AVX512 is significantly more intensive than previous generation SIMD instruction sets.

It takes up a lot of die space and it uses a massive amount of power.

It can use so much power and is so complex that the clock speed of the CPU is reduced slightly when AVX512 instructions are running. This led to exaggerated outrage from enthusiast/gaming users who didn't want their clock speed reduced temporarily just because some program somewhere on their machine executed some AVX512 instructions. To this day you can still find articles about how to disable AVX512 for this reason.

AVX512 is also incompatible with having a heterogeneous architecture with smaller efficiency cores. The AVX512 part of the die is massive and power-hungry, so it's not really an option for making efficiency cores.

I think Intel is making the right choice to move away from AVX512 for consumer use cases. Very few, if any, consumer-targeted applications would even benefit from AVX512 optimizations. Very few companies would actually want to do it, given that the optimizations wouldn't run on AMD CPUs or many Intel CPUs.

It's best left as a specific optimization technique for very specific use cases where you know you're controlling the server hardware.



This part about "a lot of die space" is largely false. The die shots show that AVX-512 adds only a modest amount of area, mostly for the wider register file and 64-byte path to memory.

The "huge power hungry" execution units are in fact largely the same execution units used for AVX and AVX2 - the CPU simply uses 2 adjacent 256-bit AVX/AVX2 units as a single AVX-512 unit: this is the so-called port 0+1 fusion. Only the larger shuffles and the (optional) extra FMA unit on port 5 take significant additional space.

In fact, the shared core die area for SKL (no AVX-512) and SKX (has AVX-512) is the same! There is some dead space on the SKL die due to the lack of AVX-512 support, but it is small: you can bet Intel didn't ship hundreds of millions of SKL and SKL-derivative chips with a big space-wasting area reserved for AVX-512 logic they couldn't use; they would have made a new, more targeted layout if that were the case.


MMX was 64-bit wide, SSE 128-bit wide and AVX 256-bit wide, and they succeeded each other rapidly (when compared to the situation now). AVX512 is another doubling, so what's the hold up with this one compared to the previous two doublings?

Maybe it's time for AMD to do something about this, like when they took the initiative to create x86-64.


The right thing to do IMO would be to have vector instructions that take a variable number of elements. Quit adding wider instructions; just tell the instruction how many elements you want to add/multiply and let the CPU take care of it. This could also allow the CPU to decide the most efficient way to prefetch, etc., or to run these operations asynchronously.


That's exactly what ARM did with SVE, so it's easy to discuss the pros and cons.

The biggest problem, to my understanding, is that SIMD extensions are mostly targeted by hand-optimized assembly code. It's fine to say that the SIMD is variable-width, but you still have to run this on a processor where the hardware is not variable-width - not just the SIMD unit, but also the datapaths, etc. And some things a "generic" framework might let you do could have catastrophic performance consequences.

In practice a lot of this will be avoided by just knowing the hardware you're going to target and not doing the things that hurt performance, but that completely misses the point of variable width: you might as well target the hardware directly at that point.

And actually your data may not be "variable-width" either.


It seems to me it's extremely common to work with vectors that are much larger than the host CPU's SIMD width. In that case you could get much better performance by letting the CPU schedule a large batch of work rather than trying to micro-optimize for multiple possible targets.

Software is typically compiled for x86-64, but not necessarily for each specific x86-64 processor available. You often just get a generic precompiled binary. That's not really optimized for the machine you're running the software on.

Maybe you want to have some restrictions on the size of the vectors you're working with to make it more convenient for the hardware, like "should be a multiple of 4x64 bits", but I still think there could be big gains with variable width in most cases.


I thought the same, but that's not how it works. It's variable width at the assembly level. You give it entire buffers with the total length and it internally uses the appropriate width (or tells you the width it used during the loop? I forget that detail). Anyway, it's not a SW framework but a core part of the CPU, implemented in hardware.


That's not how SVE works. Any given SVE CPU still has a fixed vector width at runtime, but the CPU exposes instructions which tell you the vector width and other convenience instructions for performing loop counter math and handling the "final" iteration which may not use a full vector.

Then you need to write your loops in such a way that they work with a variable width: i.e., doing more iterations if the vector width is smaller.


Zen 4 is reported to have AVX-512 support. I'm not sure if that'll be included on Ryzen or not, though; only Epyc is "confirmed" as far as I know.


It would be strange if AMD didn't offer this on Ryzens. First, they don't have a history of locking people out of instruction sets. Second, they need it for being competitive with Intel anyway. Third, they're using the same die for multiple purposes, not like Intel with separate server dies for Xeons.


As the fourth doubling (32 → 64 → 128 → 256 → 512 bits), that's a 16X increase over the original scalar base.

512-bit registers, ALUs, data paths, etc, are all really really physically big.


My sense is that it's because AVX-512 arrived right around when Intel's litho progress began stalling out. It was designed for a world the uarch folks were expecting, with smaller transistors than they ended up getting.


The power/clockspeed hit for AVX512 is less on the most recent Intel CPUs, per previous discussion: https://news.ycombinator.com/item?id=24215022


The size of the die space and the amount of the power have very little to do with AVX-512.

AVX-512 is a better instruction set: many tasks can be done with fewer instructions than when using AVX or SSE, resulting in lower energy consumption, even at the same data width.

The increase in size and power consumption is due almost entirely to the fact that AVX-512 has both twice the number of registers and double-width registers in comparison with AVX. Moreover, the current implementations have a corresponding widening of the execution units and datapaths.

If SSE or AVX had been widened for higher performance, they would have had the same increases in size and power, but they would have remained less efficient instruction sets.

Even in the worst AVX-512 implementation, in Skylake Server, doing a computation in AVX-512 mode reduces energy consumption considerably.

The problem with AVX-512 in Skylake Server and derived CPUs, e.g. Cascade Lake, is that those Intel CPUs have worse methods of limiting power consumption than the contemporaneous AMD Zen. Whatever method Intel used, it reacted too slowly to consumption peaks. Because of that, the Intel CPUs had to reduce the clock frequency in advance whenever they anticipated that a too-large power draw could happen, e.g. when they see a sequence of AVX-512 instructions and expect that more will follow.

While this does not matter for programs that do long computations with AVX-512, where the clock frequency really needs to go down, it handicaps programs that execute only a few AVX-512 instructions - enough to trigger the decrease in clock frequency, which slows down the non-AVX-512 instructions that follow.

This was a serious problem for all Intel CPUs derived from Skylake, where you must take care not to use AVX-512 instructions unless you intend to use many of them.

However, it was not really a problem of AVX-512 but of Intel's methods for power and die-temperature control. Those can be improved, and Intel did improve them in later CPUs.

AVX-512 is not the only feature that caused such undesirable behavior. Even on much older Intel CPUs, the same kind of problem appears when you want maximum single-thread performance but some random background process starts on another, previously idle core. Even if that background process consumes negligible power, the CPU fears that it might start to consume a lot, so it drastically reduces the maximum turbo frequency compared with the single-active-core case, slowing down the program you actually care about.

This is exactly the same kind of problem, and it is especially visible on Windows, which has a huge number of enabled system services that may start to execute unexpectedly, even when you believe the computer should be idle. Nevertheless, people got used to this behavior, especially because there was little they could do about it, so it was much less discussed than the AVX-512 slowdown.


When AVX first came out, it made the process of overclock stability tuning more difficult, because there was a desire to test under both AVX and non-AVX pathological load conditions. Sometimes the change in voltage drop over the die, along with the frequency change, would cause instability in one case but not the other.

Having gone through the process, I could understand how this would be an annoying feature that violated previous assumptions about how Intel CPUs work.



