TNSA Benchmark 2025
AI Model Benchmarks
Comprehensive performance comparison across text, reasoning, coding, vision, and multimodal capabilities for leading AI models.
15+ Models
30+ Benchmarks
3 Categories
Table 1: Text, Reasoning & Coding
Includes: NGen (3.9 & 3.5), Qwen3 Text (30B & 4B), DeepSeek V3, and Llama (3.3 & 4)
| Category | Benchmark | NGen 3.9 MaxV3 | NGen 3.9 Pro | Qwen3 30B A3B | Qwen3 4B Think | Qwen3 4B (2507) | NGen 3.5 Max | NGen 3.5 Pro | DeepSeek V3 | Llama 3.3 70B | Llama 4 Maverick |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Knowledge | MMLU-Pro | 81.8 | 77.2 | 78.5 | 70.4 | 74 | 75.8 | 73.4 | 81.2 | 68.9 | 59.6 |
| | MMLU-Redux | 93.1 | 90.4 | 89.5 | 83.7 | 86.1 | 90.2 | 87.5 | — | — | — |
| | GPQA | 73.8 | 63.1 | 65.8 | 55.9 | 65.8 | 71.2 | 61.4 | 68.4 | — | — |
| | SuperGPQA | 55.8 | 49.7 | 51.8 | 42.7 | 47.8 | 57.8 | 51.2 | — | — | — |
| Reasoning | AIME 25 | 88.3 | 72.6 | 70.9 | 65.6 | 81.3 | 89.1 | 71.6 | — | — | — |
| | HMMT 25 | 61.5 | 48.1 | 49.8 | 42.1 | 55.5 | 63.1 | 49.2 | — | — | — |
| | LiveBench | 77.8 | 69.6 | 74.3 | 63.6 | 71.8 | 76.1 | 68.4 | — | 50.7 | — |
| Coding | LiveCodeBench | 61.2 | 54.4 | 57.4 | 48.4 | 55.2 | 62.5 | 55.6 | 49.2 | — | 68.5 |
| | CFEval | 1952 | 1771 | 1940 | 1671 | 1852 | 1912 | 1730 | — | — | — |
| | OJBench | 23.9 | 22.1 | 20.7 | 16.1 | 17.9 | 22.9 | 21.2 | 24 | — | — |
| Alignment | IFEval | 93.4 | 87.9 | 86.5 | 81.9 | 87.4 | 92.8 | 86.4 | — | 92.1 | — |
| | Arena-Hard v2 | 40.9 | 19.7 | 36.3 | 13.7 | 34.9 | 41.8 | 20.4 | — | — | — |
| | WritingBench | — | — | 77 | 73.5 | 83.3 | — | — | — | — | — |
| Multilingual | MultiIF | 83.3 | 72.3 | 72.2 | 66.3 | 77.3 | 81.2 | 70.9 | — | — | — |
| | INCLUDE | 70.4 | 67.8 | 71.9 | 61.8 | 64.4 | 71.2 | 66.9 | — | — | — |
| Text Agents | BFCL-v3 | — | — | 69.1 | 65.9 | 71.2 | — | — | — | — | — |
| | TAU1-Retail | — | — | 61.7 | 33.9 | 66.1 | — | — | — | — | — |
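If you want to work with these numbers programmatically, the following is a minimal sketch (assuming Python, with values transcribed from the Knowledge rows of Table 1) of how the models could be ranked by average knowledge score. Averaging only over reported benchmarks slightly favours models with gaps, so treat it as an illustration rather than an official scoring method.

```python
# Minimal sketch (not an official scoring script): rank a few Table 1 models
# by their average score on the Knowledge benchmarks. Values are copied from
# the table above; None marks a "—" (not reported) entry.

knowledge_scores = {
    #                  MMLU-Pro  MMLU-Redux  GPQA  SuperGPQA
    "NGen 3.9 MaxV3": [81.8, 93.1, 73.8, 55.8],
    "NGen 3.9 Pro":   [77.2, 90.4, 63.1, 49.7],
    "Qwen3 30B A3B":  [78.5, 89.5, 65.8, 51.8],
    "DeepSeek V3":    [81.2, None, 68.4, None],
}

def mean_reported(scores):
    """Average over reported scores only; skips missing ("—") entries."""
    reported = [s for s in scores if s is not None]
    return sum(reported) / len(reported) if reported else float("nan")

# Print models from strongest to weakest average knowledge score.
for model, scores in sorted(knowledge_scores.items(),
                            key=lambda kv: mean_reported(kv[1]),
                            reverse=True):
    print(f"{model:<16} {mean_reported(scores):5.1f}")
```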
Table 2: Vision & Multimodal
Includes: Qwen3-VL, Gemini 2.5, GPT-5 Nano, NGen 3.5, and Reference Models
| Category | Benchmark | Qwen3-VL 8B | Qwen3-VL 4B | Gemini 2.5 Flash Lite | GPT-5 Nano High | NGen 3.5 Max | NGen 3.5 Pro | Llama 3.2 11B | Other Best (Ref) |
|---|---|---|---|---|---|---|---|---|---|
| STEM & Reasoning | MMMU (Val) | 74.1 | 70.8 | 73.4 | 75.8 | 75.6 | 73.4 | 50.7 | 75.6 (InternVL) |
| | MMMU (Pro) | 60.4 | 57 | 59.7 | 57.2 | — | — | — | 57.1 (GLM-4) |
| | MathVista | 81.4 | 79.5 | 72.8 | 71.5 | 83.2 | 79.7 | 73.7 | 81.8 (MiMo) |
| | MathVision | 62.7 | 60 | 52.1 | 62.2 | — | — | — | 60.4 (MiMo) |
| | MathVerse | 77.7 | 75.2 | 69.6 | 74.2 | 68.1 | 52.8 | — | 71.5 (MiMo) |
| General VQA | MMBench | 87.5 | 86.7 | 82.7 | 80.3 | — | — | — | 85.8 (GLM-4) |
| | RealWorldQA | 73.5 | 73.2 | 72.2 | 71.8 | 76.9 | 76.9 | — | 72.3 (InternVL) |
| | MMStar | 75.3 | 73.2 | 69.1 | 68.6 | — | — | — | 72.9 (GLM-4) |
| OCR & Doc | DocVQA | 95.3 | 94.2 | 92.5 | 88.2 | 95.6 | 90.3 | 88.4 | 95.7 (MiMo) |
| | OCRBench | 819 | 808 | 825 | 753 | — | — | — | 880 (InternVL) |
| | ChartQA | — | — | — | — | 85.9 | 82.2 | 83.4 | 90.0 (InternVL) |
| | AI2D | 84.9 | 84.9 | 85.7 | 81.9 | 95.5 | 90.1 | — | 87.9 (GLM-4) |
| | InfoVQA | 86 | 83 | 81.5 | 68.6 | — | — | — | 88.0 (MiMo) |
| Video | VideoMME | 71.8 | 68.9 | 72.7 | 66.2 | — | — | — | 73.0 (Keye) |
| | MVBench | 69 | 69.3 | — | — | 74.7 | 74.9 | — | 72.1 (InternVL) |
| | MLVU | 75.1 | 75.7 | 78.5 | 69.2 | — | — | — | 73.0 (InternVL) |
| Agents | ScreenSpot (Mobile) | 93.6 | 92.9 | — | — | 68.2 | 64.2 | — | 87.3 (MiMo) |
| | AndroidWorld | 50 | 52 | — | — | — | — | — | 41.7 (GLM-4) |
| | OSWorld | 33.9 | 31.4 | — | — | — | — | — | 14.9 (GLM-4) |
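As with Table 1, the sketch below shows one way to query these results programmatically: it picks the top-scoring model per benchmark for three rows transcribed from Table 2. It is an assumption-laden illustration, not tooling shipped with the benchmark; missing ("—") scores are simply left out of each row's dictionary.

```python
# Minimal sketch: find the best-scoring model per vision benchmark.
# Only three rows from Table 2 are reproduced; "—" entries are omitted.

vision_scores = {
    "MMMU (Val)": {"Qwen3-VL 8B": 74.1, "Qwen3-VL 4B": 70.8,
                   "Gemini 2.5 Flash Lite": 73.4, "GPT-5 Nano High": 75.8,
                   "NGen 3.5 Max": 75.6, "NGen 3.5 Pro": 73.4,
                   "Llama 3.2 11B": 50.7},
    "DocVQA":     {"Qwen3-VL 8B": 95.3, "Qwen3-VL 4B": 94.2,
                   "Gemini 2.5 Flash Lite": 92.5, "GPT-5 Nano High": 88.2,
                   "NGen 3.5 Max": 95.6, "NGen 3.5 Pro": 90.3,
                   "Llama 3.2 11B": 88.4},
    "OSWorld":    {"Qwen3-VL 8B": 33.9, "Qwen3-VL 4B": 31.4},
}

for benchmark, scores in vision_scores.items():
    best = max(scores, key=scores.get)
    print(f"{benchmark:<12} best: {best} ({scores[best]})")
```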
Inference Performance & Specifications
Maximum supported context length for each model in the comparison
| Model | Context Length (Tokens) |
|---|---|
| NGen 3.9 Pro | 262,144 |
| NGen 3.9 MaxV3 | 262,144 |
| NGen 3.5 Pro | 262,144 |
| NGen 3.5 Max | 262,144 |
| Llama 3.2 11B Vision | 131,072 |
| Llama 3.3 70B FP8 | 131,072 |
| Llama 4 Maverick 17B | 1,024,000 |
| DeepSeek V3 | 131,072 |
| Gemini 2.5 Flash-Lite | 1,048,576 |
| GPT-5 Nano High | 131,072 |
Llama 4 Maverick 17B and Gemini 2.5 Flash-Lite are the only models in this comparison with context windows of one million tokens or more.
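For completeness, here is a small sketch that flags those 1M+ context models from the specifications table; the token counts are copied verbatim from the table above.

```python
# Minimal sketch: flag models whose maximum context length is at least 1M tokens.
# Token counts are taken directly from the specifications table above.

context_lengths = {
    "NGen 3.9 Pro": 262_144,
    "NGen 3.9 MaxV3": 262_144,
    "NGen 3.5 Pro": 262_144,
    "NGen 3.5 Max": 262_144,
    "Llama 3.2 11B Vision": 131_072,
    "Llama 3.3 70B FP8": 131_072,
    "Llama 4 Maverick 17B": 1_024_000,
    "DeepSeek V3": 131_072,
    "Gemini 2.5 Flash-Lite": 1_048_576,
    "GPT-5 Nano High": 131_072,
}

ONE_MILLION = 1_000_000
for model, tokens in sorted(context_lengths.items(),
                            key=lambda kv: kv[1], reverse=True):
    marker = "1M+" if tokens >= ONE_MILLION else "   "
    print(f"{marker} {model:<24} {tokens:>9,} tokens")
```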