> At Google/FB scale tweaks to improve cost may make sense, but it doesn't seem like, architecturally speaking, there is some major design decision being left on the table anymore (I'd love to hear a specific counterpoint though!).
I'd expect if I knew any, they'd be under NDA. I'll just point out that a GPU, even one with specific "ML cores" as NVIDIA calls them, is going to have a bunch of silicon that is being used inefficiently (for more "conventional" GPU uses). There's room for cost saving there. Perhaps NVIDIA eventually moves into that space and produces ML-only chips, but they don't appear to be heading in that direction yet.
There are many researchers not bound by NDAs making novel chip architectures, some of which are on HN.
Secondly, GPUs don't really have very much hardware specialized for graphics anymore. If we called it a TPU I'm not sure you'd be making the same point :P At a company like MS/FB/Google, where you don't need to leverage selling the same chip for gaming/VR/ML/mining for economy of scale, you can, like you said, reduce your transistor count and keep the same compute fabric. That would reduce your idle power consumption through leakage, but you wouldn't expect a huge drop in power during active compute. Because lower-precision compute increases the compute available per byte of memory bandwidth, you either need to find more parallelism to RAM, get faster RAM, reduce the clock rate of the compute, or reduce the number of compute elements to arrive at a balanced architecture at lower precision. If you just shrink the number of compute elements - voila! - you're close to what NVIDIA is doing with its ML cores.
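The balance argument above can be sketched with back-of-the-envelope numbers. This is a rough roofline-style calculation with made-up (illustrative) hardware figures, not real chip specs; the assumption is that halving precision roughly doubles peak compute on the same silicon while DRAM bandwidth stays fixed:

```python
# Illustrative roofline-style sketch. All numbers are assumptions,
# not specs of any real chip.

def machine_balance(peak_flops, mem_bw_bytes_per_s):
    """FLOPs the chip can execute per byte of DRAM traffic."""
    return peak_flops / mem_bw_bytes_per_s

fp32_flops = 20e12   # assumed 20 TFLOP/s at FP32
mem_bw = 1e12        # assumed 1 TB/s memory bandwidth

# Halving precision ~doubles peak compute on the same silicon,
# but memory bandwidth is unchanged.
fp16_flops = 2 * fp32_flops

b32 = machine_balance(fp32_flops, mem_bw)  # 20 FLOPs/byte
b16 = machine_balance(fp16_flops, mem_bw)  # 40 FLOPs/byte

# A kernel stays compute-bound only if its arithmetic intensity
# meets the machine balance; at FP16 that bar doubles, so you must
# add bandwidth, cut compute elements, or lower clocks to rebalance.
print(b32, b16)
```

The doubled balance point is exactly why shrinking the compute array (rather than, say, chasing faster RAM) is one valid way to rebalance the design.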
> Secondly, GPUs don't really have very much hardware specialized for graphics anymore. If we called it a TPU I'm not sure you'd be making the same point :P
I think this is semantics. Modern GPUs do have a lot of hardware that isn't specialized for machine learning. My limited knowledge says that very recent NVIDIA GPUs have some things that vaguely resemble TPU cores ("Tensor cores"), but they also have a lot of silicon for classic CUDA cores. I was calling that "graphics" hardware, but it might be better described as "silicon that isn't optimized for the memory bandwidth DL needs". So it's still used non-optimally.
To be clear, you can still use the CUDA cores for DL. We did that just fine for a long time; they're just decidedly less efficient than Tensor cores.