How does POWER8 compare to x86, e.g. Haswell? Just skimming some of the architec...

nkurz · on Oct 22, 2014

IBM has solid documentation of more of the Power8 details here: http://www.setphaserstostun.org/power8/POWER8_UM_external_22...

Going only by specs, it seems that in many other cases each Power8 core has about 2-4x the resources of a Haswell core:

  64KB vs 32KB L1D
  512KB vs 64KB L2
  8MB vs 2.5MB L3
  16 vs 10(?) outstanding L1 requests
  8 vs 4 instructions issued per cycle
  2 vs 1 stores per cycle (or 4 vs 2 loads if no stores)
  3 vs 4/5 cycle L1 latency for 64-bit data
  5 vs 6/7 cycle L1 latency for vector data
  2048 vs 512 TLB entries (integrated with huge pages)

Specs that aren't higher than Haswell are usually the same --- skimming, I haven't found any that are lower. This makes me think that a 4xSMT or 8xSMT Power8 core will probably be about equivalent to a 2xSMT (hyperthreaded) Haswell core. Dedicated core operation (non-hyperthreaded) will vary more based on workload, but likely be 1x to 2x in favor of Power8. But these are guesses based on written specs -- I'm eager to see more real benchmarks.

justincormack · on Oct 22, 2014

Your huge page comment also reminded me that Power has 64k page support, rather than the 4k or 2MB that amd64 has, and this is normally the page size (e.g. in RHEL), which reduces TLB pressure a lot.
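To put rough numbers on the TLB-pressure point, here is a back-of-envelope sketch of "TLB reach" (entries times page size). The entry counts are the ones quoted earlier in the thread, not measured values, and the helper name `tlb_reach_bytes` is made up for illustration. Real TLBs are multi-level and mix page sizes, so treat this as an upper bound under simplifying assumptions.

```python
# TLB reach = number of entries * page size.
# Assumes every entry maps a distinct page of the same size.
KIB = 1024
MIB = 1024 * KIB

def tlb_reach_bytes(entries: int, page_size_bytes: int) -> int:
    """Memory that can be mapped without a TLB miss (idealized)."""
    return entries * page_size_bytes

# Entry counts as quoted in the thread (2048 vs 512); page sizes 64k vs 4k.
power8_64k = tlb_reach_bytes(2048, 64 * KIB)
haswell_4k = tlb_reach_bytes(512, 4 * KIB)

print(power8_64k // MIB, "MiB vs", haswell_4k // MIB, "MiB")
```

With those inputs the idealized reach is 128 MiB versus 2 MiB, which is the sense in which a larger base page "reduces TLB pressure".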

apaprocki · on Oct 23, 2014

In AIX the text/data page size can be controlled independently per-process with environment variables. What drives me up a wall is that the per-process setting does not actually flow through to their POSIX layer. Have any code that calls sysconf(_SC_PAGESIZE)? The page size is 4k! No, wait...
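For code that currently hard-codes 4096, a minimal sketch of querying the page size at run time looks like the following. It uses Python's `os.sysconf("SC_PAGESIZE")`, which wraps the same POSIX call mentioned above, so it inherits the caveat from this comment: the reported value may not match the page size actually backing a given mapping. The helper names `page_size` and `round_up_to_page` are hypothetical.

```python
import mmap
import os

def page_size() -> int:
    """Ask the OS for the page size instead of assuming 4096."""
    try:
        # Unix-only; wraps sysconf(_SC_PAGESIZE).
        return os.sysconf("SC_PAGESIZE")
    except (AttributeError, ValueError, OSError):
        # Portable fallback provided by the mmap module.
        return mmap.PAGESIZE

def round_up_to_page(nbytes: int, page: int) -> int:
    """Round nbytes up to a multiple of page (page must be a power of two)."""
    return (nbytes + page - 1) & ~(page - 1)

print(page_size(), round_up_to_page(100_000, page_size()))
```

Passing the page size as a parameter keeps buffer-alignment logic correct on a 64k-page system as well as a 4k one.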

jedbrown · on Oct 22, 2014

> 9x bandwidth to RAM (230 GB/s vs. 25.6)

Haswell-EP supports 4 DDR4-2133 channels, or 68 GB/s (theoretical). TDP on POWER8 is almost double and it's much more expensive, so we should think about it as having perhaps double the DRAM bandwidth (still excellent).
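The 68 GB/s figure follows from simple arithmetic, sketched here under the usual assumptions: a 64-bit (8-byte) data bus per channel and the nominal 2133 MT/s rate. The function name is invented for illustration, and the 230 GB/s POWER8 number is taken from the quoted comment rather than derived.

```python
def peak_dram_bandwidth_gb_s(channels: int, mega_transfers_per_s: float,
                             bytes_per_transfer: int = 8) -> float:
    """Theoretical peak: channels * transfers/s * bytes per transfer."""
    return channels * mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9

haswell_ep = peak_dram_bandwidth_gb_s(4, 2133)  # 4 x DDR4-2133
ratio = 230 / haswell_ep                        # 230 GB/s as quoted above

print(f"{haswell_ep:.1f} GB/s, POWER8/Haswell-EP ratio {ratio:.1f}x")
```

That gives roughly 68.3 GB/s and a raw ratio of about 3.4x per socket, before any adjustment for TDP or price.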

Note that POWER8 gets this bandwidth through many more channels, which favors the use of many memory streams. This has been the case historically, with POWER favoring data structures that result in many streams, while the same transformation has been catastrophic for memory performance on Blue Gene's PPC, with its limited prefetch engine and small number of outstanding memory requests.

rdtsc · on Oct 22, 2014

That is interesting. RAM bandwidth could significantly impact a lot of applications. I guess the only way is to measure and compare.

The problem I see with POWER is that it has a different endianness, and even though compilers know how to handle it, a lot of libraries and code might just assume some specific ordering (little endian) and thus fail unexpectedly on POWER.
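A small sketch of the kind of assumption being described: code that serializes integers in the host's native order produces different bytes on BE and LE machines, while code that names the byte order explicitly does not. This uses Python's standard `struct` module. The helper `read_u32_le` is hypothetical.

```python
import struct
import sys

value = 0x01020304

little = struct.pack("<I", value)  # explicit little-endian layout
big = struct.pack(">I", value)     # explicit big-endian layout
native = struct.pack("=I", value)  # host byte order: differs between machines

print(sys.byteorder, little.hex(), big.hex(), native.hex())

def read_u32_le(buf: bytes) -> int:
    """Portable: the format's byte order is stated, not inherited from the host."""
    return struct.unpack("<I", buf[:4])[0]
```

Code that relies on the `native` form for a file or wire format is what breaks when moved between architectures; the explicit forms behave identically everywhere.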

davidw · on Oct 22, 2014

A lot of those problems were solved years ago. I remember a big push in the late 90s, on Debian, to ensure proper endianness, because of the many architectures on which it ran.

wmf · on Oct 22, 2014

That's why Power is now little-endian.

pavlov · on Oct 22, 2014

Wasn't PowerPC always able to switch between little-endian and big-endian at boot? Maybe they migrated that capability to POWER.

(It's been a long time since I last typed "PowerPC", gave me a strange feeling of nostalgia...)

apaprocki · on Oct 22, 2014

POWER architecture can switch between BE/LE via a special purpose register (SPR). This can even be done to support mixing BE/LE threads on the core. The software complexity of managing that, though, has made them decide to enable the entire OS as BE or LE. One cool thing is that the support will be extended so that the chips can support running either BE or LE guest VMs natively.

edit: Straight from IBM https://www.ibm.com/developerworks/community/blogs/fe313521-...

justincormack · on Oct 22, 2014

Most of them, although not the Apple G5s which are BE only.

PhuFighter · on Oct 22, 2014

Power has been little endian for a while. No one bothered to tell the software guys to write a little endian OS and tools for it....

wmf · on Oct 22, 2014

My point is that little-endian Ubuntu is now available.

justincormack · on Oct 22, 2014

Yes, and IBM is pushing this (plus the v2 ABI which is a bit more like x86 and simplified in some small ways).

gsnedders · on Oct 22, 2014

As far as I'm aware, IBM intend on maintaining both big and little endian Linux ports (and toolchains).

rdtsc · on Oct 22, 2014

I had no idea. I haven't looked at it in many years.

Well, heck, that's great then.

jacques_chester · on Oct 22, 2014

> What is it good for?

Running very demanding single-system software. It's the sort of box you'd install DB2 or Oracle on, especially if you've already tied your business to IBM solutions in the past.

PhuFighter · on Oct 22, 2014

It depends on what you mean by "demanding". I think that single-threaded applications on x86 (with nothing else running) will smoke the Power8 system. But if you have lots of threads and high levels of concurrency going on, I think that the results will be very much workload dependent.

Marat_Dukhan · on Oct 22, 2014

+ Hardware Transactional Memory that isn't buggy

apaprocki · on Oct 23, 2014

Curious aside -- is there any blog / article out there where someone has tried to max out network bandwidth on an x86 box? I checked a POWER8 we have today and it has 14 PCIe3 slots (mix of 8x/16x), so theoretically one could load up 14 40GbE NICs on it. I'd be interested to see some kind of I/O shootout to see where both systems hit a wall.
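As a rough sanity check on the "14 40GbE NICs" idea, here is a sketch of per-slot PCIe 3.0 throughput. It assumes the standard 8 GT/s per lane and 128b/130b line encoding, and ignores packet and protocol overhead, so real throughput is lower. The function name is made up, and the slot count is the one given in the comment.

```python
def pcie3_gbps(lanes: int) -> float:
    """Per-direction PCIe 3.0 line rate in Gbit/s, after 128b/130b encoding."""
    return lanes * 8.0 * 128 / 130

x8 = pcie3_gbps(8)
x16 = pcie3_gbps(16)
aggregate_nic_gbps = 14 * 40  # 14 slots, one 40GbE port each

print(f"x8: {x8:.0f} Gbps, x16: {x16:.0f} Gbps, NIC total: {aggregate_nic_gbps} Gbps")
```

Under these assumptions an x8 slot (about 63 Gbps) can feed one 40GbE port, so the slots themselves would not be the first limit. Where the wall actually sits is the question the comment is asking.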

e.g. Snabb Switch 20x10GbE on x86 https://lukego.github.io/blog/2013/06/23/echoing-packets-wit...

wmf · on Oct 24, 2014

I have some x86 servers with seven slots and 14 10G ports, but we only managed to get 8 ports working in netmap. Intel has done some crazy stuff with DPDK and I've seen a demo of over 300 Gbps on a 4S machine.

I've also seen some stuff on Power8 networking but it's not public.

apaprocki · on Oct 22, 2014

Worth mentioning that POWER is the only chip I'm aware of that has hardware DFP instructions.

desdiv · on Oct 22, 2014

In case anyone else is confused: DFP = Decimal Floating Point, i.e. native base 10 arithmetic on the FPU.
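For anyone wondering why base 10 arithmetic matters, a minimal illustration of the semantics follows. Note the assumption: Python's `decimal` module is a software implementation and this snippet says nothing about whether DFP instructions are used. It only shows the behavioural difference between binary and decimal floating point.

```python
from decimal import Decimal

binary_sum = 0.1 + 0.2                          # IEEE 754 binary doubles
decimal_sum = Decimal("0.1") + Decimal("0.2")   # base-10 arithmetic

print(binary_sum == 0.3)               # False: 0.1 and 0.2 are not exact in binary
print(decimal_sum == Decimal("0.3"))   # True: exact in decimal
```

Financial and billing code tends to want the second behaviour, which is why doing it in hardware rather than software is attractive.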

Googling "DFP instructions" led me to nothing but DoubleClick links.

polack · on Oct 22, 2014

Sounds interesting! What's needed from the user in order to use this feature? Will it be possible to use it from high-level languages like Java, or do you have to use assembly? I can think of quite a lot of applications that would benefit A LOT from this.

apaprocki · on Oct 23, 2014

When you compile C99 apps using _Decimal{32,64,128} types, and you enable a flag on IBM's compiler to use the DFP instructions and tell it you're compiling for at least a POWER6 (e.g., -qarch=pwr6 -qtune=pwr7) it will emit the optimized code. The same types are available in C++ apps as well. (I'm not sure if GCC/LLVM also support emitting the instructions -- haven't checked.)

If higher level languages implemented in C/C++ implement their decimal support using the built-in C/C++ types, they'll get the speed boost.

desdiv · on Oct 22, 2014

High level languages like Python, Ruby, and Go could take advantage of this without having to make any code changes, provided that the appropriate libraries are updated to use this new hardware.

Java, however, is a special case. Since Java bytecode does not have a BigDecimal type, there's unfortunately no way to update the JVM to take advantage of this hardware.

krylon · on Oct 22, 2014

I think I remember reading that zSeries has Decimal Floating Point, too, and that IBM's JVM does take advantage of the instructions on z/OS.

Of course, even if that is true, it is a different architecture. But still, if they made the change on one architecture, it would be silly not to make it on the other one as well.

Unless that was intentional to keep some sort of advantage for zSeries.

wmf · on Oct 22, 2014

I would assume that the IBM JVM uses DFP instructions to implement the BigDecimal class.

apaprocki · on Oct 23, 2014

Yes, according to IBM's presentations, BigDecimal in their JVM uses 64-bit DFP instructions on their hardware. Minimum hardware level is POWER6 or Z10 (Z9 supported via microcode).

http://www.ibm.com/developerworks/rational/cafe/docBodyAttac...

benjarrell · on Oct 22, 2014

Based on experience with POWER7, emptying your wallet.

PhuFighter · on Oct 22, 2014

It's surprising how much cheaper these Power8 boxes are compared to the Power7 boxes. You can get a Tyan 2U, 1 socket system for $2800 FOB HK: http://www.tyan.com/campaign/openpower/