💭 Possibly nonsense. For a non-text problem, like predicting a system's next state from a history of past states:
- Express the problem with new unique tokens.
- Fine-tune an existing small LLM on the prediction task; it will learn embeddings for the new tokens.
- Optionally, heavily ablate the tuned LLM, making it as small as possible without hurting performance on your dataset.

Since LLMs already work effectively across many domains, the idea is to leverage that existing circuitry rather than train a model from scratch.
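The first step above can be sketched in plain Python: discretize each continuous state value into a bin and name each bin with a new, unique token string. The bin range, bin count, and `<state_i>` naming scheme here are illustrative assumptions, not part of the original idea.

```python
def state_to_token(x, lo=-1.0, hi=1.0, n_bins=32):
    """Map a scalar system state to one of n_bins new unique tokens.

    Assumed scheme: clamp x to [lo, hi], split the range into
    n_bins equal bins, and name each bin "<state_i>". These tokens
    would then be added to the LLM's vocabulary before fine-tuning.
    """
    x = min(max(x, lo), hi)                       # clamp out-of-range values
    i = min(int((x - lo) / (hi - lo) * n_bins), n_bins - 1)
    return f"<state_{i}>"

def encode_history(states):
    """Turn a history of past states into a token sequence for the LLM."""
    return [state_to_token(s) for s in states]
```

A history like `encode_history([0.0, 0.2, 0.4])` becomes a short token sequence the tuned LLM can consume; the model's next-token prediction over the `<state_i>` vocabulary is then the predicted next system state.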