Baidu: ERNIE 4.5 VL 424B A47B
Baidu's ERNIE 4.5 VL 424B A47B supports vision and offers a 123K context window at competitive pricing, but with no benchmark data available its real-world performance cannot be assessed — businesses should approach with caution.
Assessment date: March 12, 2026
Our methodology takes into account a range of factors including pricing, functionality, capabilities, benchmark performance, and real-world applicability. Rankings are reviewed and updated regularly as new models are released. Issues with our rankings? Contact us
ERNIE-4.5-VL-424B-A47B is a multimodal Mixture-of-Experts (MoE) model from Baidu’s ERNIE 4.5 series, featuring 424B total parameters with 47B active per token. It is trained jointly on text and image data using a heterogeneous MoE architecture and modality-isolated routing to enable high-fidelity cross-modal reasoning, image understanding, and long-context generation (up to 131k tokens). Fine-tuned with techniques like SFT, DPO, UPO, and RLVR, this model supports both “thinking” and non-thinking inference modes. Designed for vision-language tasks in English and Chinese, it is optimized for efficient scaling and can operate under 4-bit/8-bit quantization.
Capabilities
Architecture
| Modality | Text + Image → Text |
| Tokenizer | Other |
| Parameters | 424B |
Model Information
Pricing
| Token Type | Cost per 1M tokens | Cost per 1K tokens |
|---|---|---|
| Input | $0.42 | $0.000420 |
| Output | $1.25 | $0.001250 |
External Resources
Data sourced from OpenRouter API, Artificial Analysis and Hugging Face Open LLM Leaderboard. Scores are editorially curated by our team.
Last updated: March 13, 2026 7:52 pm