A more efficient inference architecture
Leverages perplexity to switch to a larger LLM when a small 'draft' model is uncertain. If MoE and SpecDec had a baby. Works best with two models where one is distilled from the other.
You're using the same model for all of these tokens!
Why use the same big model for both easy (low-perplexity) and hard (high-perplexity) tokens? It's like using a supercomputer to add 2 + 2: you waste time, money, and energy on something a basic calculator could handle instantly. For most tokens, a smaller, faster model is more than enough.
The plan
Download two models. By default, have the smaller 'draft' model output tokens sequentially. If a token's perplexity is above a certain threshold, discard it and have the larger model generate that token instead. A sketch of this loop is below.
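
A minimal sketch of the routing loop, assuming two Hugging Face causal LMs that share a tokenizer (here distilgpt2 and gpt2-large, since the draft is distilled from the same family). The model names, greedy decoding, and the threshold value are illustrative assumptions, not fixed choices.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DRAFT_NAME = "distilgpt2"   # small draft model (assumption)
LARGE_NAME = "gpt2-large"   # larger fallback model (assumption)
THRESHOLD = 8.0             # per-token perplexity cutoff (tune empirically)

tokenizer = AutoTokenizer.from_pretrained(DRAFT_NAME)
draft = AutoModelForCausalLM.from_pretrained(DRAFT_NAME).eval()
large = AutoModelForCausalLM.from_pretrained(LARGE_NAME).eval()

@torch.no_grad()
def next_token(model, input_ids):
    """Greedy next token plus its per-token perplexity exp(-log p)."""
    logits = model(input_ids).logits[0, -1]
    log_probs = torch.log_softmax(logits, dim=-1)
    token = torch.argmax(log_probs)
    perplexity = torch.exp(-log_probs[token]).item()
    return token.item(), perplexity

@torch.no_grad()
def generate(prompt, max_new_tokens=50):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Let the cheap draft model propose the next token.
        token, ppl = next_token(draft, input_ids)
        if ppl > THRESHOLD:
            # Draft model is uncertain: drop its token and have the
            # large model produce this position instead.
            token, _ = next_token(large, input_ids)
        input_ids = torch.cat([input_ids, torch.tensor([[token]])], dim=-1)
        if token == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

print(generate("The capital of France is"))
```

In this sketch the per-token "perplexity" is just exp of the negative log-probability the draft model assigns to its own pick, so a higher value means the draft was less confident and the token gets redone by the large model.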
Limitations
There are a couple of limitations when it comes to using this inference architecture. One area for improvement is what happens to the draft model's high-perplexity tokens: currently, they are used only to measure perplexity and then deleted. In the future, a much smaller router could be trained on a dataset of tokens and perplexities, removing the need for that extra, unused draft token. A rough sketch of that idea follows.
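
A hedged sketch of what such a router might look like: a tiny classifier trained offline on logged pairs of context embeddings and whether the draft's perplexity exceeded the threshold, so hard tokens could be sent straight to the large model without a wasted draft pass. The dimensions, threshold, and training setup here are illustrative assumptions.

```python
import torch
import torch.nn as nn

HIDDEN_DIM = 768  # e.g. the draft model's hidden size (assumption)

# Logit for "route this token to the large model".
router = nn.Sequential(
    nn.Linear(HIDDEN_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, 1),
)

def train_router(embeddings, perplexities, threshold=8.0, epochs=5):
    """embeddings: (N, HIDDEN_DIM) context vectors logged from the draft model.
    perplexities: (N,) the draft model's per-token perplexity at each position."""
    labels = (perplexities > threshold).float().unsqueeze(-1)
    opt = torch.optim.Adam(router.parameters(), lr=1e-3)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(router(embeddings), labels)
        loss.backward()
        opt.step()

# At inference, torch.sigmoid(router(context_embedding)) > 0.5 would send the
# token directly to the large model, with no discarded draft token.
```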