TNSA Benchmark 2025

AI Model Benchmarks

Comprehensive performance comparison across text, reasoning, coding, vision, and multimodal capabilities for leading AI models.

15+ Models | 30+ Benchmarks | 3 Categories

Table 1: Text, Reasoning & Coding

Includes: NGen (3.9 & 3.5), Qwen3 Text (30B & 4B), DeepSeek V3, and Llama (3.3 & 4)

Models, in score order: NGen 3.9 MaxV3, NGen 3.9 Pro, Qwen3 30B A3B, Qwen3 4B Think, Qwen3 4B (2507), NGen 3.5 Max, NGen 3.5 Pro, DeepSeek V3, Llama 3.3 70B, Llama 4 Maverick. Rows with fewer scores omit the models without reported results.

Knowledge
MMLU-Pro: 81.8, 77.2, 78.5, 70.4, 74, 75.8, 73.4, 81.2, 68.9, 59.6
MMLU-Redux: 93.1, 90.4, 89.5, 83.7, 86.1, 90.2, 87.5
GPQA: 73.8, 63.1, 65.8, 55.9, 65.8, 71.2, 61.4, 68.4
SuperGPQA: 55.8, 49.7, 51.8, 42.7, 47.8, 57.8, 51.2

Reasoning
AIME 25: 88.3, 72.6, 70.9, 65.6, 81.3, 89.1, 71.6
HMMT 25: 61.5, 48.1, 49.8, 42.1, 55.5, 63.1, 49.2
LiveBench: 77.8, 69.6, 74.3, 63.6, 71.8, 76.1, 68.4, 50.7

Coding
LiveCodeBench: 61.2, 54.4, 57.4, 48.4, 55.2, 62.5, 55.6, 49.2, 68.5
CFEval: 1952, 1771, 1940, 1671, 1852, 1912, 1730
OJBench: 23.9, 22.1, 20.7, 16.1, 17.9, 22.9, 21.2, 24

Alignment
IFEval: 93.4, 87.9, 86.5, 81.9, 87.4, 92.8, 86.4, 92.1
Arena-Hard v2: 40.9, 19.7, 36.3, 13.7, 34.9, 41.8, 20.4
WritingBench: 77, 73.5, 83.3

Multilingual
MultiIF: 83.3, 72.3, 72.2, 66.3, 77.3, 81.2, 70.9
INCLUDE: 70.4, 67.8, 71.9, 61.8, 64.4, 71.2, 66.9

Text Agents
BFCL-v3: 69.1, 65.9, 71.2
TAU1-Retail: 61.7, 33.9, 66.1

Table 2: Vision & Multimodal

Includes: Qwen3-VL, Gemini 2.5, GPT-5 Nano, NGen 3.5, and Reference Models

Models, in score order: Qwen3-VL 8B, Qwen3-VL 4B, Gemini 2.5 Flash Lite, GPT-5 Nano High, NGen 3.5 Max, NGen 3.5 Pro, Llama 3.2 11B, Other Best (Ref). Rows with fewer scores omit the models without reported results; the parenthesized entry names the best reference model.

STEM & Reasoning
MMMU (Val): 74.1, 70.8, 73.4, 75.8, 75.6, 73.4, 50.7, 75.6 (InternVL)
MMMU (Pro): 60.4, 57, 59.7, 57.2, 57.1 (GLM-4)
MathVista: 81.4, 79.5, 72.8, 71.5, 83.2, 79.7, 73.7, 81.8 (MiMo)
MathVision: 62.7, 60, 52.1, 62.2, 60.4 (MiMo)
MathVerse: 77.7, 75.2, 69.6, 74.2, 68.1, 52.8, 71.5 (MiMo)

General VQA
MMBench: 87.5, 86.7, 82.7, 80.3, 85.8 (GLM-4)
RealWorldQA: 73.5, 73.2, 72.2, 71.8, 76.9, 76.9, 72.3 (InternVL)
MMStar: 75.3, 73.2, 69.1, 68.6, 72.9 (GLM-4)

OCR & Doc
DocVQA: 95.3, 94.2, 92.5, 88.2, 95.6, 90.3, 88.4, 95.7 (MiMo)
OCRBench: 819, 808, 825, 753, 880 (InternVL)
ChartQA: 85.9, 82.2, 83.4, 90.0 (InternVL)
AI2D: 84.9, 84.9, 85.7, 81.9, 95.5, 90.1, 87.9 (GLM-4)
InfoVQA: 86, 83, 81.5, 68.6, 88.0 (MiMo)

Video
VideoMME: 71.8, 68.9, 72.7, 66.2, 73.0 (Keye)
MVBench: 69, 69.3, 74.7, 74.9, 72.1 (InternVL)
MLVU: 75.1, 75.7, 78.5, 69.2, 73.0 (InternVL)

Agents
ScreenSpot (Mobile): 93.6, 92.9, 68.2, 64.2, 87.3 (MiMo)
AndroidWorld: 50, 52, 41.7 (GLM-4)
OSWorld: 33.9, 31.4, 14.9 (GLM-4)

Inference Performance & Specifications

Context length capabilities across different models

Model                      Context Length (Tokens)
NGen 3.9 Pro               262,144
NGen 3.9 MaxV3             262,144
NGen 3.5 Pro               262,144
NGen 3.5 Max               262,144
Llama 3.2 11B Vision       131,072
Llama 3.3 70B FP8          131,072
Llama 4 Maverick 17B       1,024,000
DeepSeek V3                131,072
Gemini 2.5 Flash-Lite      1,048,576
GPT-5 Nano High            131,072
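As a quick sketch, the context lengths in the table above can be kept as a small lookup for checking whether a prompt fits a model's window. The `CONTEXT_LENGTHS` dict and `models_that_fit` helper are illustrative, not part of any official API:

```python
# Context windows from the table above, in tokens.
# The dict and helper names below are hypothetical, for illustration only.
CONTEXT_LENGTHS = {
    "NGen 3.9 Pro": 262_144,
    "NGen 3.9 MaxV3": 262_144,
    "NGen 3.5 Pro": 262_144,
    "NGen 3.5 Max": 262_144,
    "Llama 3.2 11B Vision": 131_072,
    "Llama 3.3 70B FP8": 131_072,
    "Llama 4 Maverick 17B": 1_024_000,
    "DeepSeek V3": 131_072,
    "Gemini 2.5 Flash-Lite": 1_048_576,
    "GPT-5 Nano High": 131_072,
}

def models_that_fit(prompt_tokens: int) -> list[str]:
    """Return the models whose context window can hold the prompt."""
    return [m for m, n in CONTEXT_LENGTHS.items() if n >= prompt_tokens]

# A 500k-token prompt fits only the two 1M+ context models.
print(models_that_fit(500_000))
```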
Llama 4 Maverick and Gemini 2.5 Flash-Lite are the only listed models with 1M+ token context windows.