Xiaomi: MiMo-V2-Omni

Xiaomi: MiMo-V2-Omni

xiaomi · Released Mar 18, 2026 Professional New
74.7
Our Score

Performance Profile

Intelligence7.4Technical7.1Value7.5Content7
Intelligence 7.4/10
Technical 7.1/10
Content 7/10
Value 7.5/10

MiMo-V2-Omni is a frontier omni-modal model that natively processes image, video, and audio inputs within a unified architecture. It combines strong multimodal perception with agentic capability - visual grounding, multi-step planning, tool use, and code execution - making it well-suited for complex real-world tasks that span modalities, 256K context window.

$0.40 / 1M
Input Price
$2.00 / 1M
Output Price
262,144 tokens
Context Window
65,536 tokens
Max Output

Capabilities

Tool Use Function Calling Vision

Architecture

ModalityText + Image + Audio + Video → Text
TokenizerOther

Performance Indices

Source: Artificial Analysis

43.4 Intelligence Index
35.5 Coding Index
63 Agentic Index

Benchmark Scores

Intelligence

GPQA Diamond 82.8% Graduate-level scientific reasoning
HLE 19.9% Humanity's Last Exam
SciCode 36.7% Scientific computing

Technical

TerminalBench Hard 34.8% Agentic terminal tasks
τ²-Bench 91.2% Conversational agent benchmark

Content

IFBench 53.5% Instruction following
LCR 66.7% Long-context reasoning

Benchmark data from Artificial Analysis and Hugging Face

Model Information

OpenRouter ID xiaomi/mimo-v2-omni
Providerxiaomi
Release Date March 18, 2026
Context Length262,144 tokens
Max Completion65,536 tokens
Status Active

Pricing

Token Type Cost per 1M tokens Cost per 1K tokens
Input $0.40 $0.000400
Output $2.00 $0.002000

Live Performance

Live endpoint metrics — refreshed every 30 minutes.

100%
Avg Uptime
1,400ms
Best Latency (TTFT)
67 tok/s
Best Throughput
1/1
Active Endpoints
Available via: Xiaomi

Leaderboard Categories