Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

It’s quadratic if you implement the transformer naiively, but if you add a KV cache it’s linear compute at the cost of correspondingly linear growth in memory.


This is false. The const of producing a single token is linear but the cost of producing an entire sequence of length N is O(N^2) still (which is always what we meant when we talked about quadratic cost not the cost of a single token).




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: