It’s quadratic if you implement the transformer naively, but if you add a KV ca... | Hacker News

vlovich123 on April 5, 2025 | on: The Llama 4 herd

It’s quadratic if you implement the transformer naively, but if you add a KV cache it’s linear compute at the cost of correspondingly linear growth in memory.

hexomancer on April 5, 2025

This is false. The cost of producing a single token is linear, but the cost of producing an entire sequence of length N is still O(N^2) (which is what we always meant when we talked about quadratic cost, not the cost of a single token).
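
A rough way to see the difference is to count query-key dot products for one attention head in one layer. This is only a sketch under a simplified cost model, not a benchmark of any real implementation: the function names are made up for illustration, and everything outside attention (MLP layers, projections, memory traffic) is ignored. With a KV cache, producing token t compares one new query against t cached keys, so the per-token cost is linear in t, and summing over the sequence gives N(N+1)/2.

```python
def cached_step_cost(t: int) -> int:
    """Query-key dot products to produce token t (1-indexed) with a KV cache.

    Assumed cost model: the new token's query is compared against the
    t keys in the cache (its own key included). Linear in t.
    """
    return t


def cached_sequence_cost(n: int) -> int:
    """Total query-key dot products to generate n tokens one at a time.

    This is 1 + 2 + ... + n = n * (n + 1) / 2, which is O(n^2).
    """
    return sum(cached_step_cost(t) for t in range(1, n + 1))


if __name__ == "__main__":
    # Columns: sequence length, cost of the last token, cost of the whole sequence.
    for n in (1_000, 2_000, 4_000):
        print(n, cached_step_cost(n), cached_sequence_cost(n))
```

Under this model, doubling N doubles the cost of the last token but roughly quadruples the cost of the whole sequence, which is the distinction between the two comments above.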
