One of these has to be true (or both true): 1. ARM is inherently more efficient ...

kllrnohj · on Aug 11, 2024

Performance per watt is a bad metric. You want instead performance for a given power budget (eg, how much performance can I get at 15w? 30w? etc...)

Otherwise you can trivially win 100% of performance/watt comparisons by just setting clocks to the limit of the lowest usable voltage level.

For example compare the 7950X to the 7950X@65w using the officially supported eco mode option: https://www.anandtech.com/show/17585/amd-zen-4-ryzen-9-7950x...

Cinebench R23 MT:

7950X stock: 225 points/watt

7950X @ 65w eco mode: 482 points/watt

Over 2x perf/watt improvement on the exact same chip, and a power efficiency that tops the charts, beating every laptop chip in that notebookcheck test by a large amount as well. And yet how many 7950x owners are using the 65w eco mode option? Probably none. Because perf/watt isn't actually meaningful. Rather it's how much performance can I get for a given power budget.

aurareturn · on Aug 11, 2024

>Performance per watt is a bad metric.

It isn't. It needs context.

Sure, you can get an Intel Celeron to have more perf/watt than an M1 if you give the Celeron low enough wattage.

The key here is the absolute performance.

In this case, both the M3 and the X Elite are not only significantly more efficient than both Zen4 and Zen5 in ST, they are also straight up faster while being more efficient.

AnthonyMouse · on Aug 11, 2024

> 1. ARM is inherently more efficient than x86 CPUs in most tasks

> 2. Nuvia and Apple are better CPU designers than AMD and Intel

The third possibility is that they just pick a different point on the efficiency curve. You can double power consumption in exchange for a few percent higher performance, double it again for an even smaller increase.

The max turbo on the i9 14900KS is 253 W. The power efficiency is bad. But it generally outperforms the M3, despite being on a significantly worse process node, because that's the trade off.

AMD is only on a slightly worse process node and doesn't have to do anything so aggressive, but they'll also sell you whatever you want. The 8845HS and 8840U are basically the same chip, but former has around double the TDP. In exchange for that you get ~2% more single thread performance and ~15% more multi-thread performance. Whereas the performance per watt for the 8840U is nearly that of the M3, and the remaining difference is basically the process node.

aurareturn · on Aug 11, 2024

>The third possibility is that they just pick a different point on the efficiency curve. You can double power consumption in exchange for a few percent higher performance, double it again for an even smaller increase.

This only makes sense if the Zen5 is actually faster in ST than the M3. In this case, the M3 is 1.24x faster and 3.4x more efficient in ST than Zen5.

AMD's Zen5 chip is just straight up slower in any curve.

>The max turbo on the i9 14900KS is 253 W. The power efficiency is bad. But it generally outperforms the M3, despite being on a significantly worse process node, because that's the trade off.

It's not a trade off that Intel wants. The 14900KS runs at 253w (sometimes 400w+) because that's the only way Intel is able to stay remotely competitive at the very high end. An M3 Max will often match a 14900KS in performance using 5-10% of the power.

AnthonyMouse · on Aug 11, 2024

> This only makes sense if the Zen5 is actually faster in ST than the M3. In this case, the M3 is 1.24x faster and 3.4x more efficient in ST than Zen5.

It makes sense if Zen5 is faster in MT, since that's when the CPUs will be power limited, and it is. For ST the performance generally isn't power-limited for either of them and then the M3 is on a newer process node.

It also depends on the benchmark. For example, Zen5 is faster in ST on Cinebench R23. It's not obvious what's going on with R24, but it's a difference in the code rather than the hardware.

The power numbers in that link also don't inspire a lot of confidence. They have two systems with the same CPU but one of them uses 119.3W and the other one uses 46.7W? Plausibly the OEMs could have configured them differently but that kind of throws out the entire premise of using the comparison to measure the efficiency of the CPUs. The number doesn't mean anything if the power consumption is being set as a configuration parameter by the OEM and the number of watts going to the display or a discrete GPU are an uncontrolled hidden variable.

> It's not a trade off that Intel wants.

It's the one they've always taken, even when they were unquestionably in the lead. They were selling up to 150W desktop processors in 2008, because people buy them, because they're faster.

Now they have to do it just to be competitive because their process isn't as good, but the process is a different thing than the ISA or the design of the CPU.

aurareturn · on Aug 11, 2024

>It also depends on the benchmark. For example, Zen5 is faster in ST on Cinebench R23. It's not obvious what's going on with R24, but it's a difference in the code rather than the hardware.

Cinebench R23 uses Intel Embree underneath. It's hand optimized for AVX instruction set and poorly translated to NEON. It's not even clear if it has any NEON optimization.

dagmx · on Aug 12, 2024

R23 doesn’t have the same SIMD optimizations available for ARM as it does for x86.

Anyone using R23 instead of R24, is putting arm at a disadvantage. Notebookcheck is often called out for this and haven’t really addressed why they stick with R23 beyond not wanting to redo tests for older hardware. They are by far the outlier for performance numbers and why the discussion around performance gets muddied.

AnthonyMouse · on Aug 12, 2024

> R23 doesn’t have the same SIMD optimizations available for ARM as it does for x86.

The single-thread benchmark is SIMD-heavy?

Now it just sounds like Cinebench ST is a useless benchmark because it's putting a parallelizable SIMD workload on a single core. In real life you'd always be running those multi-threaded, whereas the reason people care about ST performance is for the serialized branch-heavy spaghetti code that inherently only runs on one core. "Run the SIMD code, but clamp it to a single thread" is a garbage proxy for that.

dagmx · on Aug 12, 2024

Yes, there’s no difference between the single and multi core benchmark other than how many threads get spun up.

I’m not sure why you’re trying to equate simd with parallelization. Tbh, a lot of your response seems odd to me because it’s making several incorrect assumptions.

You can’t really escape parallelization with how any modern core works, even on a single core. You may have certain operations process concurrently depending on how the cores resources are available at a given time and what is needed.

Regardless, SIMD isn’t concurrency. It’s batching.

There’s still significant benefit to having SIMD on a single threaded task. There’s a lot of thread overhead to using multiple cores to do something, whereas SIMD lets you effectively batch things on a single core.

AnthonyMouse · on Aug 12, 2024

SIMD workloads generally imply that you're doing the same operation repeatedly. It's literally in the name; single instruction, multiple data. There are occasional cases where that happens but doesn't parallelize well. TLS is probably a good example because you might have to encrypt a network packet and it's big enough to benefit from SIMD but not big enough that the overhead of splitting it across cores is worth it.

But most of the time if you're doing the same operation repeatedly you'll benefit from using more cores. Even for TLS, the client might not split the individual connection across multiple cores, but the server is going to handle multiple clients at once in parallel. Heavy workloads like video encoding make this even more apparent. In general the things that benefit from SIMD are parallel tasks that benefit from multiple cores.

Compare this with, say, a browser running JavaScript in a single tab. There is nothing to put on another core, you don't know what instructions will be executed next until you get there. This is where people actually care about single-thread performance, and where processors achieve it by using branch prediction etc. But these exercise very different parts of the CPU than SIMD-heavy workloads. The latter can easily fill the execution units of a wide processor that would be stymied by the former.

dagmx · on Aug 12, 2024

This feels like a really absurd stretch of trying to discern SIMD away from stuff like standard integer and float operations.

Maybe in the early 90s, but they’re such a part of processor design that you can’t realistically avoid them.

Especially for rendering, which is matrix math heavy, you’d have to design something completely bespoke to avoid it. SIMD is a natural necessity for rendering with any kind of performance.

And because SIMD is such a part of every mainstream processor, it’s very important that benchmarks show how well they perform.

I also don’t understand why you think a JavaScript runtime wouldn’t use SIMD. V8 can make use of SIMD, whether directly targeted or indirectly via the compiler that compiled the runtime itself.

If you want to stress very specific parts of a processor, then use something like SPEC. Cinebench is meant to be a realistic reflection of production rendering.

AnthonyMouse · on Aug 12, 2024

> Especially for rendering, which is matrix math heavy, you’d have to design something completely bespoke to avoid it. SIMD is a natural necessity for rendering with any kind of performance.

Well sure, but rendering is a classically parallel operation which is regularly implemented as threaded.

> I also don’t understand why you think a JavaScript runtime wouldn’t use SIMD. V8 can make use of SIMD, whether directly targeted or indirectly via the compiler that compiled the runtime itself.

JavaScript runtimes are executing code, so they'll implement the whole gamut and their execution will depend on what kind of code it actually is. But the common JavaScript code, and the kind presumably being tested in JavaScript benchmarks because it's what people care about, isn't implementing a video encoder using SIMD. It's manipulating DOM objects and parsing short pieces of text input, which is branch-heavy code with lots of indirection and very little use of SIMD if any.

> If you want to stress very specific parts of a processor, then use something like SPEC. Cinebench is meant to be a realistic reflection of production rendering.

Which is kind of my point. Production rendering is going to be threaded and max out all the cores, which is Cinebench MT. "CineBench ST" is measuring something that nobody does in real life and doesn't even really correlate with the things people actually do.

It doesn't represent real threaded workloads (which optimally run on many low-clocked cores, not one high-clocked one) nor real serialized workloads (which are full of conditional jumps and cache misses).

dagmx · on Aug 12, 2024

Rendering is parallel, yes, but it also makes heavy use of SIMD to accelerate the operations per thread. One does not obviate the other.

In the most trivial case, sure, the JavaScript runtimes won’t compile to use SIMD but there’s lot of cases where they will as well as part of their JIT. I think you’re trivializing how they work.

And back to the main point, it doesn’t really matter if you believe Cinebench reflects real world rendering. The fact is that R23 uses SIMD for x86, but not for ARM. R24 rectifies that. Both R23 and R24 use the same rendering code path regardless of running in single or multi threaded mode.

So using R23 as benchmarks for efficiency and performance will naturally benefit x86 significantly. There’s a reason none of the people who push the “AMD is almost the same” use the fairer benchmark to do so. R24 really highlights the actual discrepancy when both are given a fair playing field.

AnthonyMouse · on Aug 12, 2024

> The fact is that R23 uses SIMD for x86, but not for ARM. R24 rectifies that. Both R23 and R24 use the same rendering code path regardless of running in single or multi threaded mode.

But that's not the issue. Even if Cinebench R23 isn't a valid comparison, Zen5 also faster for Cinebench R24 MT.

Cinebench ST (R24 or R23) turns out to be a silly benchmark, because nobody in real life is going to artificially limit their renderer to one thread, but a renderer limited to one thread is also a bad proxy for real single-threaded workloads.

What it's mostly telling you is how wide the CPU is. Which only matters for real single-threaded code if the CPU can find enough instruction-level parallelism to exploit (which that test doesn't probe), and only matters for multi-threaded code to the extent that the processor can maintain that instruction density without becoming limited by thermals/cache/memory/etc. when the code is running on all the cores, which is the thing the MT benchmark tests.

dagmx · on Aug 12, 2024

Again, I think your logic here is flawed. It’s still valuable to know how a single core behaves for a rendering work load.

I’ve worked professionally in feature film production. We test single threaded performance all the time.

It tells us the performance characteristics of the nodes on our farm, and lets us more accurately get a picture of how jobs will scale and/or be packed.

It’s not uncommon to have a single thread rendering, for example to update part of an image while an artist works on other parts.

It’s not just a test of how wide the chip is. It also tests things like how it actually handles the various instruction sets from a real world codebase. Not all processors that are equally “wide” (not the right term but whatever) handle AVX the same, and you need to know how a single core behaves for that. It’s also useful to see how the cores actually behave on their own so you can eliminate the overhead of thread synchronization and system scheduling affecting you.

Wytwwww · on Aug 11, 2024

> An M3 Max will often match a 14900KS in performance using 5-10% of the power.

On Cinebench and Passmark CPU the 14900K is 50-60% faster so I'm not sure that's true.

merb · on Aug 14, 2024

The 14900ks and ryzen 7950 can basically turbo way over 253w as long as the chip stays cool. Both chips have two (or more?) eco modes. It’s actually super silly. Because the 125w mode is actually most often good enough

kllrnohj · on Aug 11, 2024

> 1. ARM is inherently more efficient than x86 CPUs in most tasks

I'm not sure how you're reaching the conclusion of "most tasks" when Cinebench R24 is the only test you used because R23, which doesn't agree, was rejected for hand-wavey nebulous reasons, and nothing else was tested.

R24 is hardly a representative workload of "most tasks" nor is it claiming/trying to be.

hajile · on Aug 11, 2024

Anandtech shows[0] that M3 is massively ahead in integer performance, but slightly behind in float performance on Spec 2017.

Integer workloads are by far the most common, but they tend to not scale to multiple cores very well. Most workloads that scale well across cores also benefit from big FP/SIMD units too.

Put another way, the real issue with R24 is that it makes HX370 look better than it would look in more normal consumer workloads.

[0] https://www.anandtech.com/show/21485/the-amd-ryzen-ai-hx-370...

kllrnohj · on Aug 11, 2024

> that M3 is massively ahead in integer performance

The M3 is certainly an impressive chip, but note that it's only massively ahead in some of the int tests. It's not a consistent gap.

> Integer workloads are by far the most common, but they tend to not scale to multiple cores very well.

The HX370 does better than the Me in specint MT though.

But regardless the anandtech results paint a much closer picture than the single R24 results that GP used as the basis of the efficiency thesis.

aurareturn · on Aug 11, 2024

The HX370 should win in SPECINT MT. It has 12 cores to the M3's 8 cores and it runs at significantly higher power.

Compare HX370 SPECINT MT To an M3 Pro and let's see the results.

kllrnohj · on Aug 11, 2024

> [HX370] runs at significantly higher power.

It used 33w. Meanwhile the M3 result came from a 2023 MacBook Pro 14-Inch, which certainly has the potential for a TDP of around that. If you can find SPECINT MT numbers w/ power data for an M3 Pro lets see it. Or even just power data for an M3 non-pro in the 14" MBP. A quick search isn't turning up any.

aurareturn · on Aug 13, 2024

M3's CPU uses around ~10w while running ST tasks.

The ~30w TDP of an M3 Pro is if it's running CPU and GPU tasks at max load.

aurareturn · on Aug 11, 2024

>I'm not sure how you're reaching the conclusion of "most tasks" when Cinebench R24 is the only test you used because R23, which doesn't agree, was rejected for hand-wavey nebulous reasons, and nothing else was tested.

There are no hand-wavey nebulous reasons.

Cinebench R23 uses Intel Embree engine, which is hand optimized for x86 CPUs. That's why x86 CPUs look far better than ARM CPUs in it.

If there is an application that is purely hand optimized for ARM, and then compiled for x86, do you think it's fair to use it to compare the two architectures?

SPEC & GB6 mostly agrees with Cinebench 2024.

trynumber9 · on Aug 11, 2024

The Snapdragon X Elite is on the same node and when actually doing a lot of work (i.e. cores loaded) it is close enough to HX 370 while delivering similar throughput.

Why wouldn't the inherent inefficiency of x64 be as noticeable in MT when all the inefficient cores are working? Because it is running at lower clocks? Then what allows it to match the SDXE in throughput? Does that need to lower its clock even more? I'm not seeing what makes it inherent.

rafaelmn · on Aug 11, 2024

From you link - Intel is topping the performance charts (alongside AMD in SC) - they probably tune power usage agressively to achieve these results.

I would guess it's more to do with coming from desktop CPU design to mobile vs. phones to laptops.

aurareturn · on Aug 11, 2024

>From you link - Intel is topping the performance charts (alongside AMD in SC) - they probably tune power usage agressively to achieve these results.

Cinebench 2024 ST:

* M3: 142 points

* X Elite: 123 points

* AMD HX 370: 116 points

* AMD 8845HS: 102 points

* Intel 155H: 108 points

Amongst each company's best laptop ST SoCs, no, Intel and AMD are far behind in both ST scores and perf/watt.

If you're referring to desktop speeds, then yes, Intel's 14900k does top the charts in ST Cinebench but it likely uses well over 100w.

I mostly care about laptop SoCs. In the case of the M3, it doesn’t even have a fan.

rafaelmn · on Aug 11, 2024

That's what I'm thinking - they make trade-offs to reach peak performance in desktop designs that don't translate optimally to laptops and when you start from mobile designs you probably made the opposite trade-offs - that would be my guess for the discrepancy.

hajile · on Aug 11, 2024

Laptops vastly outsell desktops, so this tradeoff means hurting the majority of your customers to please a small minority. Servers also care about perf/watt a LOT and they are the highest profit margin segment.

Why would AMD choose a target that hurts the majority of their market unless there wasn't another good option available?

rafaelmn · on Aug 11, 2024

The architecture started in desktop space and data center/mobile was an afterthought up until Intel shitting the bed repeatedly. If they redesigned from ground up they could probably get better instructions/watt but that would look terrible if it wasn't accompanied by a perf boost over previous generation. Just like Apple doesn't seem to scale well with more power.

aurareturn · on Aug 11, 2024

> Just like Apple doesn't seem to scale well with more power.

How do you know this? When has Apple ever given 400w to the M3 Max like the 14900KS can get up to?

PS. An M4 running at 7w is faster in ST than a 14900KS running in 250w+.

aurareturn · on Aug 11, 2024

I'm pretty sure that M3 Max closely matches the 14900k in ST speeds but using something like 5 - 10% of the power.

rafaelmn · on Aug 11, 2024

Not sure - they had power/thermal envelope in desktop parts and no difference in performance AFAIK.

jeswin · on Aug 11, 2024

In the same article, if you picked their R23 benchmarks the advantage vanishes in MT benchmarks; the 3nm M3 Max is actually behind 4nm Strix Point in efficiency.

aurareturn · on Aug 11, 2024

Cinebench R23 is hand optimized for X86. It uses Intel Embree engine underneath.

That's why Cinebench R23 heavily favors x86.

sudosysgen · on Aug 11, 2024

Why are you comparing a chip optimized for performance with a chip optimized for efficiency? Take an Ultrabook Zen5, not the HX370.