A more efficient inference architecture

Leverages perplexity to switch to larger LLMs when a small 'draft' model is uncertain. If MoE and SpecDec had a baby. Works best with two models, one of them being distilled from the other.

This is painfully inefficient:

Human:

Write the following:
"The cat sat on the mat. The dog also came to sit on the mat."
Then, in one sentence, propose a novel theory that unifies quantum physics and general relativity.

Assistant:

The cat sat on the mat. The dog also came to sit on the mat.

Spacetime emerges from quantum entanglement across a holographic causal lattice, with gravity manifesting as a macroscopic statistical flow of information constrained because of the topological boundary.

Low Perplexity (Predictable)

Medium Perplexity

High Perplexity

Very High Perplexity (Uncertain)

Simplified for illustrative purposes. These perplexities weren't generated by an actual model.
Tokens ≠ Words

You're using the same model for all of these tokens!

Why use the same big model for both easy (low-perplexity) and hard (high-perplexity) tokens? It's like using a supercomputer to add 2 + 2. Wasting time, money, and energy on something a basic calculator could handle instantly. For most tokens, a smaller, faster model is more than enough.

The plan

Download two models. By default, have the smaller 'draft' model output tokens sequentially. If the perplexity of the token is above a certain threshold, delete it and have that specific token done by the larger model.

from llama_cpp import Llama
import math

RED   = "\033[91m"
RESET = "\033[0m"

def load_model(path):
    return Llama(
        model_path=path,
        chat_format=None,
        logits_all=True,
        verbose=False
    )

def get_next_token_and_logprob(llm, prompt):
    out = llm(
        prompt,
        max_tokens=1,
        logprobs=1,
        echo=False
    )
    choice = out['choices'][0]
    return choice['text'], choice['logprobs']['token_logprobs'][0]

def perplexity_from_logprob(logprob):
    return math.exp(-logprob)

def generate_token_with_fallback(small, large, prompt, threshold):
    token, logprob = get_next_token_and_logprob(small, prompt)
    ppl = perplexity_from_logprob(logprob)
    
    if ppl > threshold:
        token, _ = get_next_token_and_logprob(large, prompt)
        print(f"{RED}{token}{RESET}", end="", flush=True)
    else:
        print(token, end="", flush=True)
    
    return token

def main():
    small = load_model("qwen2.5-0.5b-instruct-q4_k_m.gguf")
    large = load_model("qwen2.5-1.5b-instruct-q4_k_m.gguf")
    threshold = 5.0

    prompt = "My name is A"
    print(prompt, end="", flush=True)

    for _ in range(100):
        nxt = generate_token_with_fallback(small, large, prompt, threshold)
        prompt += nxt

    print()


main()

Limitations

There are a couple of limitations when it comes to using this inference architecture. One area for improvement is what is done with the draft model's high perplexity tokens: currently, they are used to mesure perplexity and then deleted. In the future, there could be a much smaller router trained on a dataset of tokens and perplexities that wouldn't have the need for that extra unused token.