Guide Labs Open Sources an Interpretable 8B Language Model

Guide Labs, the YC W24 company building interpretable foundation models, has open-sourced Steerling-8B, an 8-billion-parameter language model that can trace every token it generates back to the input tokens that mattered, the internal concepts used, and the training data that contributed. The company raised a $9 million seed round led by Initialized Capital, with participation from Pioneer Fund, Tectonic Ventures, Y Combinator, Lombardstreet Ventures, and E14 Fund.

Guide Labs was founded by CEO Julius Adebayo, who originated the research underlying Steerling-8B during his doctoral studies at MIT. He co-authored a widely cited paper in 2018 that demonstrated the unreliability of existing methods for understanding deep learning models, work that ultimately led to this different approach: instead of trying to reverse-engineer a black-box model after training, build one that's interpretable from the start.

The key architectural idea is decomposing the model's internal representations into three pathways: roughly 33,000 labeled concepts that humans provide, around 100,000 concepts the model discovers on its own during training, and a small residual that captures everything else. On a held-out validation set, over 84% of token-level contribution comes from the concept module, meaning the model genuinely routes its predictions through concepts rather than relying on opaque residual channels.

Steerling is built on a causal discrete diffusion model backbone, which lets it steer generation across multi-token spans rather than only at the next token. The team adds training loss terms that force signal to flow through the concept pathways, without a fundamental tradeoff in performance. The concepts feed into logits through a linear path, so every prediction decomposes exactly into per-concept contributions that can be edited at inference time without retraining.
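To see why a linear concept-to-logit path makes the decomposition exact, consider a toy sketch. This is not Guide Labs' implementation; the matrix sizes, names, and random activations are purely illustrative, but the algebra is the point: when logits are a linear function of concept activations, each logit splits exactly into one additive term per concept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes; the real model reportedly uses ~133,000 concepts.
n_concepts, vocab_size = 8, 5

W = rng.normal(size=(vocab_size, n_concepts))  # linear concept-to-logit map
c = rng.normal(size=n_concepts)                # concept activations at one position

logits = W @ c                 # the model's next-token logits
per_concept = W * c            # column j = concept j's contribution to every logit

# The decomposition is exact, not an approximation:
assert np.allclose(per_concept.sum(axis=1), logits)
```

Because each column of `per_concept` is attributable to a single concept, summing a concept's column across a span gives its total influence on the generated text, which is what makes the per-token attributions auditable.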

That means you can suppress or amplify specific concepts on the fly. If a deployment needs to block copyrighted material, exclude protected attributes from financial decisions, or steer tone, operators can do that through direct concept-level interventions rather than retraining with RLHF.
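Under the same linear-path assumption, an inference-time intervention is just an edit to the concept activation vector before the logit projection, with no gradient updates involved. The sketch below is hypothetical (the `intervene` helper and concept indices are invented for illustration), but it shows the shape of the mechanism: zeroing an activation removes that concept's contribution from every logit.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, n_concepts = 6, 4

W = rng.normal(size=(vocab_size, n_concepts))  # linear concept-to-logit map
c = rng.normal(size=n_concepts)                # concept activations at one position

def intervene(activations, scales):
    """Scale selected concept activations: 0.0 suppresses, >1.0 amplifies."""
    edited = activations.copy()
    for idx, s in scales.items():
        edited[idx] *= s
    return edited

# Suppress concept 2 (e.g. a concept an operator wants excluded).
blocked = intervene(c, {2: 0.0})
logits = W @ blocked

# Concept 2 now contributes nothing to any logit.
assert np.allclose((W * blocked)[:, 2], 0.0)
```

The practical contrast with RLHF is that the edit is local, reversible per request, and applied at serving time rather than baked into the weights by a fine-tuning run.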

Trained on 1.35 trillion tokens, Steerling-8B performs within roughly 5% of models trained on 2 to 7 times more data. It outperforms both LLaMA 2-7B and DeepSeek-7B across standard benchmarks despite using fewer FLOPs.

"This model demonstrates that training interpretable models is no longer a sort of science; it's now an engineering problem. We figured out the science and we can scale them, and there is no reason why this kind of model wouldn't match the performance of the frontier level models."

— Julius Adebayo, CEO, Guide Labs (TechCrunch)

One natural concern with fixed concept categories is whether they eliminate the emergent generalization ability that makes large models useful. Adebayo says that still happens: his team tracks what they call "discovered concepts" that the model found on its own, like quantum computing. These sit alongside the human-labeled concepts and account for a significant share of the model's predictive capacity.

The model weights are available on Hugging Face, with inference code on GitHub and a PyPI package. The source code and model weights are released under the Apache License 2.0.

Guide Labs says it will publish a series of follow-up posts covering concept steering benchmarked against RLHF, what the model discovered on its own, alignment without fine-tuning, training data memorization and data valuation, and a comparison of inherent vs. post-hoc interpretability. The next step for the company is to build a larger model and begin offering API and agentic access to users.
