Hacker News | Icko_'s comments

Distillation and teacher-student models are definitely way older than 2024.


My point is: OpenAI raised $40 billion and Anthropic raised $10 billion, claiming they needed the money to buy more expensive Nvidia servers to train bigger models. Then Chinese experts basically said, "No, you don't," and proved it.


Those of us who live outside the US don't think like that. We love German cars, Chinese phones, Korean TV series, and Vietnamese food. Our taste is not guided by ideology; it's guided by quality and value.

In school, for example, they taught me that the Soviet Union won the "space race" by sending Yuri Gagarin into orbit, making him the first human in space, and that the US won the "moon race." What I'm saying is, not all of us live in the same country or have to choose between black and white. We think it's fair to give credit where it's due.


Not even that. With KV-caching, generating each new token is linear in the size of the context; and if someone figured out an attention variant with, say, O(n log n) total complexity, I imagine KV-caching might bring the per-token cost down to O(log n). (If the new algorithm permits that kind of caching.)


When people say that attention is quadratic, they mean that the cost to process n tokens is O(n²), so the amortized cost per token is indeed O(n). KV-caching is a way to maintain that amortized cost when appending tokens one at a time instead of ingesting the whole sequence at once. But in the end people want to be able to generate multiple tokens, so we're back at O(n²) total time again.
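A toy sketch of the amortized argument above (plain Python, with made-up key/query values just to count work): decoding step t computes attention over the t entries in the cache, so per-step cost is O(t) and the total over n steps is 1 + 2 + … + n = O(n²).

```python
import math

def attend(q, keys, values):
    """Single-query softmax attention over the cache: O(t) work for t cached tokens."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    ws = [math.exp(s - m) for s in scores]
    z = sum(ws)
    ws = [w / z for w in ws]
    return [sum(w * v[i] for w, v in zip(ws, values)) for i in range(d)]

d = 4
keys, values = [], []
work = []                      # how many cached entries each decode step touches

for t in range(1, 6):          # generate 5 tokens one at a time
    q = [0.1 * t] * d          # toy query; a real model computes these from the token
    keys.append([0.2 * t] * d)
    values.append([0.3 * t] * d)
    attend(q, keys, values)
    work.append(len(keys))     # step t attends over t cached keys: O(t)

print(work)                    # [1, 2, 3, 4, 5] -- per-step cost grows linearly
print(sum(work))               # 15 -- total over n steps is 1+...+n = O(n^2)
```

The cache trades memory for time: without it, step t would recompute all t keys and values from scratch instead of just touching them.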

IIRC there are some FFT-based attention alternatives where encoding has complexity O(n log n), but there's no feasible way to cache anything and after appending a single token it costs O(n log n) again, so if you generate n tokens in sequence, the cost is actually O(n² log n).
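A quick numeric sanity check of that last claim (a sketch, not tied to any particular FFT-attention method): if appending one token redoes O(t log t) work at step t, the total for n tokens is Σ t·log t, which grows on the order of n² log n.

```python
import math

def regen_cost(n):
    """Total work if each of n append steps redoes O(t log t) work at step t."""
    return sum(t * math.log2(t) for t in range(2, n + 1))

n = 1000
total = regen_cost(n)
bound = n * n * math.log2(n)   # the claimed O(n^2 log n) envelope
print(total / bound)           # a constant fraction (~0.46 here), i.e. same order
```

So the per-step O(n log n) encoder beats quadratic attention for one-shot encoding, but loses that edge during autoregressive generation, exactly because nothing can be cached.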

