Road to VLA
Mar 25th 2026 · Yahya Masri
I have spent years building with language models, but I kept feeling a gap between using these systems and truly understanding them. This project is my attempt to close that gap by learning how a VLA works from first principles.
How a VLA works
Why did I start this project?
I wanted to do something genuinely difficult to prove to myself that I can go beyond using AI systems and actually understand how they work under the hood. Building toward a VLA felt like the right challenge because it combines perception, language reasoning, and action in one stack.
- I have used LLM tooling in practice, but I want first-principles understanding instead of surface-level familiarity.
- There is still no simple, practical, beginner-friendly path that shows how to reason from model outputs all the way to physical actions.
My working philosophy for this project is: build before over-consuming theory. I want to prototype, fail, and debug small components first, then study papers with sharper questions. That way I am not just repeating terminology; I am forming real intuition.
I also want this to shape how I think: slower, more deliberate, and more grounded in fundamentals. Instead of treating advanced systems like black boxes, I want to document each piece, explain it clearly, and share progress publicly as I go.
Throughout this project, I am trying to learn by drawing system diagrams, writing down assumptions, and validating each step with implementation. The goal is to make the learning path inspectable and reproducible.
Before moving forward, one clarification: this is not trying to be a perfect reproduction of any production VLA. It is my first-principles attempt to understand and build the core ideas end to end.
What is a VLA?
A VLA (vision-language-action) model takes visual observations and a natural-language instruction as input and produces robot actions as output. Most recent VLAs start from a pretrained vision-language model and fine-tune it on robot demonstration data, so perception and language understanding come largely from pretraining, and the new capability being learned is emitting actions.
Quick primer:
Vision: an image encoder turns camera frames into tokens. Language: the instruction (for example, "pick up the cup") is tokenized alongside them. Action: the model decodes an action, either as discrete action tokens or as a continuous control vector, which a low-level controller then executes on the robot.
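To make the interface concrete, here is a toy sketch of the (observation, instruction) → action mapping that a VLA implements. Every name here is hypothetical and illustrative; this is not any real model's API, and the "policy" is a trivial stand-in for a learned network.

```python
from dataclasses import dataclass
from typing import List

# Toy sketch of the VLA interface: (image, instruction) -> action vector.
# All names are hypothetical; the policy below is a stand-in for a learned model.

@dataclass
class Observation:
    image: List[List[int]]  # toy grayscale frame, stand-in for camera input
    instruction: str        # natural-language command

def tokenize(instruction: str) -> List[str]:
    """Trivial whitespace tokenizer, stand-in for a real LM tokenizer."""
    return instruction.lower().split()

def toy_vla_policy(obs: Observation) -> List[float]:
    """Stand-in for a learned policy: returns a 7-dim action
    (six end-effector deltas plus a gripper command). Here the
    'model' just closes the gripper if the instruction says 'pick'."""
    tokens = tokenize(obs.instruction)
    grip = 1.0 if "pick" in tokens else 0.0
    return [0.0] * 6 + [grip]

obs = Observation(image=[[0] * 4 for _ in range(4)],
                  instruction="Pick up the cup")
action = toy_vla_policy(obs)
print(action)  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
```

The point of the sketch is only the shape of the contract: pixels and words in, a low-dimensional action out, with a downstream controller responsible for actually moving the robot.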
Important resources
- Physical Intelligence (π) — homepage of the robotics research lab working on VLA models.
- TurboQuant Paper — a quantization method for compressing LLM KV caches and vector-database embeddings with minimal accuracy loss, reducing memory use and serving cost.
- OpenAI's Parameter Gulf — research competition challenging participants to train the most efficient, high-performance language model possible under extreme constraints.
- NVIDIA Cosmos-Reason1 (Project Page) — project overview from NVIDIA Research on physical common sense and embodied reasoning.
- Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning (arXiv 2503.15558v1) — technical paper detailing the model architecture, four-stage training pipeline, and physical AI reasoning benchmarks.