Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Then you need to do a little bit deeper research. No one just applies sparse attention at inference time for a model not trained for it. They do this at training time because otherwise the task performance degrades too much.
 help



Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: