Going only by specs, it seems that in many other cases each Power8 core has about 2-4x the resources of a Haswell core:
64KB vs 32KB L1D
512KB vs 64KB L2
8MB vs 2.5MB L3
16 vs 10(?) outstanding L1 requests
8 vs 4 instructions issued per cycle
2 vs 1 stores per cycle (or 4 vs 2 loads if no stores)
3 vs 4/5 cycle L1 latency for 64-bit data
5 vs 6/7 cycle L1 latency for vector data
2048 vs 512 TLB entries (integrated with huge pages)
Specs that aren't higher than Haswell are usually the same --- skimming, I haven't found any that are lower. This makes me think that a 4xSMT or 8xSMT Power8 core will probably be about equivalent to a 2xSMT (hyperthreaded) Haswell core. Dedicated core operation (non-hyperthreaded) will vary more based on workload, but likely be 1x to 2x in favor of Power8.
But these are guesses based on written specs -- I'm eager to see more real benchmarks.
Your huge page comment also reminded me that Power has 64k page support, rather than the 4k or 2MB that amd64 has, and this is normally the page size (e.g. in RHEL), which reduces TLB pressure a lot.
In AIX the text/data page size can be controlled independently per-process with environment variables. What drives me up a wall is that the per-process setting does not actually flow through to their POSIX layer. Have any code that calls sysconf(_SC_PAGESIZE)? The page size is 8k! No, wait...
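For anyone who wants to check what their own box reports, here is a minimal sketch in Python that asks for the page size through two standard-library routes. The helper name reported_page_sizes is made up for illustration, and it assumes a Unix-like system where os.sysconf is available. It only shows what the POSIX layer says, which per the comment above may not match a per-process setting on AIX.

```python
import mmap
import os


def reported_page_sizes():
    """Return the page size as reported by two standard-library routes.

    Hypothetical helper for illustration; both values come from the OS,
    so they reflect what the POSIX layer reports, not necessarily the
    page size actually backing a given mapping.
    """
    return {
        "sysconf(SC_PAGESIZE)": os.sysconf("SC_PAGESIZE"),
        "mmap.PAGESIZE": mmap.PAGESIZE,
    }


if __name__ == "__main__":
    for source, size in reported_page_sizes().items():
        print(f"{source}: {size} bytes ({size // 1024} KiB)")
```

On a typical amd64 Linux install this would be expected to print 4096 for both, and 65536 on a distribution built for 64k pages, but I have not run it on AIX.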
Haswell-EP supports 4 DDR4-2133 channels, or 68 GB/s (theoretical). TDP on POWER8 is almost double and it's much more expensive, so we should think about it as having perhaps double the DRAM bandwidth (still excellent).
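The 68 GB/s figure is just channels times transfer rate times bus width. A quick sketch of that arithmetic, assuming the usual 64-bit (8-byte) data path per DDR4 channel and decimal gigabytes; the function name is made up for illustration:

```python
def peak_dram_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s (10**9 bytes per second).

    Assumes every channel moves bus_bytes per transfer (8 bytes for a
    64-bit DDR4 channel, ECC bits not counted). Real sustained bandwidth
    is lower.
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


# Haswell-EP as described above: 4 channels of DDR4-2133.
haswell_ep = peak_dram_bandwidth_gb_s(channels=4, mega_transfers_per_s=2133)

if __name__ == "__main__":
    print(f"Haswell-EP theoretical peak: {haswell_ep:.1f} GB/s")  # 68.3 GB/s
```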
Note that POWER8 gets this bandwidth through many more channels, which favors the use of many memory streams. This has been the case historically, with POWER favoring data structures that result in many streams, while the same transformation has been catastrophic for memory performance on Blue Gene's PPC, with its limited prefetch engine and small number of outstanding memory requests.
That is interesting. RAM bandwidth could significantly impact a lot of applications. I guess the only way is to measure and compare.
The problem I see with POWER is that it has a different endianness, and even though compilers know how to handle it, a lot of libraries and code might just assume some specific ordering (little endian) and thus fail unexpectedly on POWER.
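To make the failure mode concrete, here is a small Python sketch of the kind of assumption that bites: decoding a little-endian field with the host's byte order versus decoding it with an explicit byte order. The function names are invented for illustration; the struct format characters are from the standard library.

```python
import struct
import sys

VALUE = 0x01020304

# The same 32-bit value serialized three ways.
native = struct.pack("=I", VALUE)  # host byte order
little = struct.pack("<I", VALUE)  # always little endian
big = struct.pack(">I", VALUE)     # always big endian


def read_u32_portable(buf):
    """Decode a little-endian 32-bit field regardless of host byte order."""
    return struct.unpack("<I", buf)[0]


def read_u32_fragile(buf):
    """Decode using host byte order; only right when host and data agree."""
    return struct.unpack("=I", buf)[0]


if __name__ == "__main__":
    print("host byte order:", sys.byteorder)
    print("portable:", hex(read_u32_portable(little)))
    print("fragile: ", hex(read_u32_fragile(little)))
```

On a little-endian host both readers agree; on a big-endian host the fragile one would return the bytes reversed, which is the sort of silent breakage the comment above is worried about.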
A lot of those problems were solved years ago. I remember a big push in the late 90s, on Debian, to ensure proper endianness, because of the many architectures on which it ran.
POWER architecture can switch between BE/LE via a special purpose register (SPR). This can even be done to support mixing BE/LE threads on the core. The software complexity of managing that, though, has made them decide to enable the entire OS as BE or LE. One cool thing is that the support will be extended so that the chips can support running either BE or LE guest VMs natively.
Running very demanding single-system software. It's the sort of box you'd install DB2 or Oracle on, especially if you've already tied your business to IBM solutions in the past.
It depends on what you mean by "demanding". I think that single-threaded applications on x86 (with nothing else running) will smoke the Power8 system. But if you have lots of threads and high levels of concurrency going on, I think that the results will be very much workload dependent.
Curious aside -- is there any blog / article out there where someone has tried to max out network bandwidth on an x86 box? I checked a POWER8 we have today and it has 14 PCIe3 slots (mix of 8x/16x), so theoretically one could load up 14 40GbE NICs on it. I'd be interested to see some kind of I/O shootout to see where both systems hit a wall.
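A rough back-of-the-envelope for that idea, as a Python sketch. It assumes the standard PCIe 3.0 figures (8 GT/s per lane, 128b/130b encoding), single-port NICs, one direction only, and it ignores packet and protocol overhead, so treat the numbers as upper bounds. The function names are made up for illustration.

```python
PCIE3_GT_PER_S = 8.0       # raw transfer rate per lane, PCIe 3.0
PCIE3_ENCODING = 128 / 130  # 128b/130b line encoding


def pcie3_gbit_per_s(lanes):
    """Usable link rate in Gbit/s per direction, before protocol overhead."""
    return lanes * PCIE3_GT_PER_S * PCIE3_ENCODING


def nic_demand_gbit_per_s(nics, gbit_per_nic=40):
    """Aggregate line rate if every NIC runs flat out (assumed single-port)."""
    return nics * gbit_per_nic


if __name__ == "__main__":
    print(f"x8 slot:  {pcie3_gbit_per_s(8):.1f} Gbit/s")
    print(f"x16 slot: {pcie3_gbit_per_s(16):.1f} Gbit/s")
    print(f"14 NICs:  {nic_demand_gbit_per_s(14)} Gbit/s aggregate")
```

By this estimate an x8 slot has headroom for one 40GbE port, so the slots themselves would probably not be the first wall; memory bandwidth and the software stack seem like more likely limits, but that is a guess.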
I have some x86 servers with seven slots and 14 10G ports, but we only managed to get 8 ports working in netmap. Intel has done some crazy stuff with DPDK and I've seen a demo of over 300 Gbps on a 4S machine.
I've also seen some stuff on Power8 networking but it's not public.
Sounds interesting! What's needed from the user in order to use this feature? Will it be possible to use from high level languages like java or do you have to use assembly?
Can think of quite a lot of applications that would benefit A LOT from this.
When you compile C99 apps using _Decimal{32,64,128} types, and you enable a flag on IBM's compiler to use the DFP instructions and tell it you're compiling for at least a POWER6 (e.g., -qarch=pwr6 -qtune=pwr7) it will emit the optimized code. The same types are available in C++ apps as well. (I'm not sure if GCC/LLVM also support emitting the instructions -- haven't checked.)
If higher level languages implemented in C/C++ implement their decimal support using the built-in C/C++ types, they'll get the speed boost.
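As an illustration of why decimal types matter at all, here is a Python sketch comparing binary floating point with decimal arithmetic. Note the assumptions: Python's decimal module is a software implementation, not the DFP instructions, and the context below only approximates the precision and exponent range of IEEE 754-2008 decimal64 (16 digits). The names DECIMAL64_LIKE, add_binary and add_decimal are made up for this example.

```python
from decimal import ROUND_HALF_EVEN, Context, Decimal

# Assumption: approximates decimal64 (16 significant digits, Emax 384).
# This runs in software; it does not use hardware DFP.
DECIMAL64_LIKE = Context(prec=16, Emax=384, Emin=-383, rounding=ROUND_HALF_EVEN)


def add_binary(a, b):
    """Add two decimal strings using binary (IEEE double) floats."""
    return float(a) + float(b)


def add_decimal(a, b):
    """Add two decimal strings using decimal arithmetic."""
    return DECIMAL64_LIKE.add(Decimal(a), Decimal(b))


if __name__ == "__main__":
    print(add_binary("0.1", "0.2") == 0.3)              # False
    print(add_decimal("0.1", "0.2") == Decimal("0.3"))  # True
```

The point of hardware DFP would be to get the second kind of result at something closer to the speed of the first.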
High level languages like Python, Ruby, and Go could take advantage of this without having to make any code changes, provided that the appropriate libraries are updated to use this new hardware.
Java, however, is a special case. Since Java bytecode does not have a BigDecimal type, there's unfortunately no way to update the JVM to take advantage of this hardware.
I think I remember reading that zSeries has Decimal Floating Point, too, and that IBM's JVM does take advantage of the instructions on z/OS.
Of course, even if that is true, it is a different architecture. But still, if they made the change on one architecture, it would be silly not to make it on the other one as well.
Unless that was intentional to keep some sort of advantage for zSeries.
Yes, according to IBM's presentations, BigDecimal in their JVM uses 64-bit DFP instructions on their hardware. Minimum hardware level is POWER6 or Z10 (Z9 supported via microcode).
It's surprising how much cheaper these Power8 boxes are compared to the Power7 boxes. You can get a Tyan 2U, 1-socket system for $2800 FOB HK: http://www.tyan.com/campaign/openpower/
* 4x hardware threads per core (8-way SMT vs. 2)
* 1/4th FP throughput per core (8 SP flops/cycle vs. 32)
* 3x bandwidth to RAM (230 GB/s vs. 68) [edit: updated for Haswell-EP]
https://en.wikipedia.org/wiki/POWER8#Specifications
http://www.redbooks.ibm.com/abstracts/tips1153.html
What is it good for?