The advantage of an SLM here is that some outputs cannot be compressed without losing context, so a small model does that job instead. It works, but most of these solutions still involve tradeoffs in real-world applications.
We compress tool outputs at each step, so the cache isn't broken during the run. Once we hit 85% of the context window, we preemptively trigger a summarization step, then load that summary when the window actually fills up.
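A minimal sketch of the per-step compression, assuming an append-only message history (all names here are illustrative; `summarize` stands in for whatever small model does the compression, and is faked with truncation):

```python
def summarize(text: str, max_chars: int = 500) -> str:
    # Placeholder for an SLM call; here we just truncate for illustration.
    return text if len(text) <= max_chars else text[:max_chars] + " [truncated]"

def append_tool_output(history: list[dict], output: str) -> list[dict]:
    # Append-only: earlier entries are never rewritten, so the provider's
    # prompt cache (which matches on an unchanged prefix) stays warm.
    history.append({"role": "tool", "content": summarize(output)})
    return history
```

The key property is that compression happens before the output enters the history, so no earlier message ever changes mid-run.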
For auto-compact, we do essentially the same thing Anthropic does, but at an 85%-filled context window. Then, when the window is 100% filled, we swap in this precomputed compaction and append the 15% of messages accumulated since. This lets us run compaction instantly.
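The two-phase flow above can be sketched roughly like this (a toy model, not the actual implementation; character counts stand in for token counts, and `summarize` is a placeholder for the expensive summarization call):

```python
PRECOMPACT_THRESHOLD = 0.85  # start summarizing here, before the window is full

class Compactor:
    """Illustrative only: precompute a summary at 85% usage, swap it in at 100%."""

    def __init__(self, window: int, summarize):
        self.window = window
        self.summarize = summarize
        self.messages: list[str] = []
        self.summary: str | None = None
        self.tail: list[str] = []  # messages arriving after precompaction

    def add(self, msg: str) -> None:
        self.messages.append(msg)
        if self.summary is not None:
            self.tail.append(msg)  # the ~15% accumulated after the 85% mark
        used = sum(len(m) for m in self.messages)  # crude token proxy
        if self.summary is None and used >= PRECOMPACT_THRESHOLD * self.window:
            # Expensive call done early, in the background in a real system.
            self.summary = self.summarize(self.messages)
        if used >= self.window:
            # Window full: the swap is instant because the summary is ready.
            self.messages = [self.summary] + self.tail
            self.summary, self.tail = None, []
```

The point of the design is that the slow step (summarization) is off the critical path: by the time the window is full, only a cheap list swap remains.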
If it's the latter, then users will pay for the entire history of tokens since the change as uncached: https://platform.claude.com/docs/en/build-with-claude/prompt...
How is this better?