
Hugging Face Blog · 17 Mar

Holotron-12B - High Throughput Computer Use Agent

model · agents

H Company has released Holotron-12B, a 12-billion-parameter multimodal model designed specifically for computer-use agents operating in interactive environments. Unlike most multimodal models, which are optimized for static vision or instruction following, Holotron-12B serves as a policy model for agents that must perceive, decide, and act efficiently.

The model is post-trained from NVIDIA's Nemotron-Nano-2 VL base using H Company's proprietary data mixture. This collaboration demonstrates how much further the NVIDIA foundation can be pushed with targeted training for agentic workloads.

Holotron-12B's key innovation is its hybrid State-Space Model (SSM) and attention architecture, which scales far better for long-context inference. Whereas vanilla transformers store a KV cache that grows with every generated token, SSM layers maintain only a constant-size state per layer regardless of sequence length, dramatically reducing memory requirements.
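The memory argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares a transformer's linearly growing KV cache with a fixed-size SSM state; all architecture numbers (layer count, head counts, state sizes) are illustrative assumptions, not Holotron-12B's actual configuration.

```python
# Hedged illustration: KV-cache growth vs. constant SSM state.
# All dimensions below are assumed for illustration, not taken from
# Holotron-12B's published config.

def kv_cache_bytes(seq_len, n_layers=40, n_kv_heads=8, head_dim=128, dtype_bytes=2):
    """Vanilla-transformer KV cache: two tensors (K and V) per layer,
    each of shape [seq_len, n_kv_heads, head_dim]. Grows with seq_len."""
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * dtype_bytes

def ssm_state_bytes(n_layers=40, d_inner=8192, d_state=128, dtype_bytes=2):
    """SSM recurrent state: a fixed [d_inner, d_state] tensor per layer,
    independent of how many tokens have been processed."""
    return n_layers * d_inner * d_state * dtype_bytes

for seq_len in (1_000, 32_000, 128_000):
    kv = kv_cache_bytes(seq_len) / 2**20
    ssm = ssm_state_bytes() / 2**20
    print(f"{seq_len:>7} tokens: KV cache {kv:9.1f} MiB | SSM state {ssm:6.1f} MiB")
```

Because the per-sequence state stays flat while a KV cache grows with context length, the same VRAM budget fits many more concurrent sequences, which is where the batch-size and throughput gains come from.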

In benchmark testing on a single H100 GPU with vLLM, Holotron-12B achieved over 2x higher throughput than Holo2-8B, reaching 8.9k tokens/s at maximum concurrency. The efficient VRAM utilization allows much larger batch sizes on the same hardware, making it ideal for throughput-bound applications like data generation and online reinforcement learning.
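For readers who want to reproduce a similar setup, a minimal offline-batching sketch with vLLM's Python API might look like the following. The Hugging Face repo id, context length, and batch size are assumptions for illustration (check the model card for the actual id); H Company's exact benchmark harness is not published.

```python
# Hedged sketch: offline batched generation with vLLM (requires a GPU).
# The repo id and parameters below are assumed, not from the post.
from vllm import LLM, SamplingParams

llm = LLM(model="Hcompany/Holotron-12B", max_model_len=32768)  # assumed repo id
params = SamplingParams(temperature=0.0, max_tokens=256)

# Throughput-bound workloads benefit from large batches, which the small
# per-sequence SSM state makes affordable in VRAM.
prompts = ["Describe the next UI action for this screen: ..."] * 64
outputs = llm.generate(prompts, params)
for out in outputs[:2]:
    print(out.outputs[0].text)
```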

The model was trained in two stages, with supervised fine-tuning on approximately 14 billion tokens focused on screen understanding, grounding, and UI interactions. This specialized training pushed WebVoyager performance from 35.1% on the base model to 80.5%, exceeding previous Holo2-8B results.

Beyond web navigation, Holotron-12B shows substantial improvements on localization benchmarks including OS-World-G, GroundUI, and WebClick. The model demonstrates that NVIDIA's Nemotron VL architecture provides a strong foundation for real-world multimodal agents when paired with appropriate training.

Holotron-12B is now available on Hugging Face under the NVIDIA Open Model License. H Company has also announced plans to post-train NVIDIA's upcoming Nemotron 3 Omni, leveraging enhanced hybrid SSM-Attention and MoE architectures for even greater reasoning capabilities.