In deep ReLU networks that aren't renormalized in between layers (like OverFeat, which has no normalization), some of the activations get pretty big! (on the order of 1e3).
You also can't clip them and get away with it; you either have to renormalize the layers to do half precision (and live with the extra cost) or stick to full precision. I was doing fun stuff earlier this year with fixed-precision nets (8-bit/16-bit). Things get very interesting :)
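A rough toy version of what I mean (made-up sizes and weight scale, not the actual OverFeat weights), just to show the drift in an un-normalized ReLU stack and why fp16 gets uncomfortable there:

    import numpy as np

    # Toy un-normalized ReLU stack: arbitrary width and a slightly "expansive"
    # weight scale so the activation magnitude drifts upward with depth.
    rng = np.random.default_rng(0)
    x32 = rng.standard_normal(512).astype(np.float32)
    x16 = x32.astype(np.float16)

    for layer in range(16):
        W = (rng.standard_normal((512, 512)) * (2.0 / np.sqrt(512))).astype(np.float32)
        x32 = np.maximum(W @ x32, np.float32(0))                     # fp32 reference
        x16 = np.maximum(W.astype(np.float16) @ x16, np.float16(0))  # same net run in fp16
        print(layer, float(x32.max()), float(x16.max()))
    # fp16 tops out at 65504, and at magnitudes around 1e3 adjacent representable
    # values are already ~0.5-1 apart, so once activations drift up there the fp16
    # copy starts to diverge from the fp32 one instead of just being a cheaper format.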
That's plenty, IMO, for most inputs and weights. Where it gets tricky is accumulation. You could constrain the weights for each unit, I guess, but that's the sort of work best done under the hood rather than by the data scientist. I'd personally choose 32-bit accumulation just because it drastically simplifies code development.
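Toy illustration of the accumulation point (the sizes and the constant value 100 are arbitrary, chosen only so the sum is easy to check by hand):

    import numpy as np

    # One unit with 1024 int8 inputs and int8 weights; each product fits in
    # 16 bits, but the per-unit sum (1024 * 100 * 100 = 10,240,000) needs ~24 bits.
    w = np.full(1024, 100, dtype=np.int8)
    x = np.full(1024, 100, dtype=np.int8)

    prods = w.astype(np.int32) * x.astype(np.int32)
    acc32 = prods.sum(dtype=np.int32)                   # wide accumulator: exact
    acc16 = prods.astype(np.int16).sum(dtype=np.int16)  # narrow accumulator: wraps mod 2**16

    print("int32 accumulation:", int(acc32))  # 10240000
    print("int16 accumulation:", int(acc16))  # 16384 (= 10240000 % 65536), silently wrong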
I've also worked with fixed precision elsewhere. It's awesome if you understand the dynamic range of your application. It's a migraine headache if you don't.
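Toy example of what getting the dynamic range wrong looks like in fixed point (the Q4.11 format and the test values are made up):

    import numpy as np

    FRAC_BITS = 11  # Q4.11 in 16 bits: 1 sign bit, 4 integer bits, 11 fractional bits

    def to_q4_11(x):
        # store round(x * 2**11) as int16, saturating at the int16 limits
        q = np.round(np.asarray(x, dtype=np.float64) * 2**FRAC_BITS)
        return np.clip(q, -32768, 32767).astype(np.int16)

    def from_q4_11(q):
        return q.astype(np.float64) / 2**FRAC_BITS

    vals = np.array([0.01, 3.7, 15.9, 900.0])
    back = from_q4_11(to_q4_11(vals))
    for v, b in zip(vals, back):
        print(f"{v:8.2f} -> {b:10.4f}")
    # In-range values round-trip with ~2**-11 error, but 900.0 saturates to ~16.0
    # because Q4.11 only covers [-16, 16); if your activations can hit 1e3, this is
    # simply the wrong format -- that's the dynamic-range guesswork.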