AllenAI: Molmo2 8B
Molmo2 8B from AllenAI is a compact multimodal model with video and image input support, but its benchmark scores are very low across reasoning and coding, making it suitable only for lightweight or experimental use cases.
Assessment date: March 14, 2026
Our methodology takes into account a range of factors including pricing, functionality, capabilities, benchmark performance, and real-world applicability. Rankings are reviewed and updated regularly as new models are released. Issues with our rankings? Contact us
Molmo2-8B is an open vision-language model developed by the Allen Institute for AI (Ai2) as part of the Molmo2 family, supporting image, video, and multi-image understanding and grounding. It is based on Qwen3-8B and uses SigLIP 2 as its vision backbone, outperforming other open-weight, open-data models on short videos, counting, and captioning, while remaining competitive on long-video tasks.
Capabilities
Architecture
| Modality | Text + Image + Video → Text |
| Tokenizer | Other |
| Parameters | 8B |
Benchmark Scores
Evaluations
Benchmark data from Artificial Analysis and Hugging Face
Model Information
Pricing
| Token Type | Cost per 1M tokens | Cost per 1K tokens |
|---|---|---|
| Input | $0.20 | $0.000200 |
| Output | $0.20 | $0.000200 |
Live Performance
Live endpoint metrics — refreshed every 30 minutes.
External Resources
Data sourced from OpenRouter API, Artificial Analysis and Hugging Face Open LLM Leaderboard. Scores are editorially curated by our team.
Last updated: March 15, 2026 7:52 pm