TNSA Benchmark 2025

AI Model Benchmarks

Comprehensive performance comparison across text, reasoning, coding, vision, and multimodal capabilities for leading AI models.

15+ Models | 30+ Benchmarks | 3 Categories

Table 1: Text, Reasoning & Coding

Includes: NGen (3.9 & 3.5), Qwen3 Text (30B & 4B), DeepSeek V3, and Llama (3.3 & 4)

Models, in score order: NGen 3.9 MaxV3, NGen 3.9 Pro, Qwen3 30B A3B, Qwen3 4B Think, Qwen3 4B (2507), NGen 3.5 Max, NGen 3.5 Pro, DeepSeek V3, Llama 3.3 70B, Llama 4 Maverick. Rows with fewer scores omit the models without reported results.

Knowledge
MMLU-Pro: 81.8, 77.2, 78.5, 70.4, 74, 75.8, 73.4, 81.2, 68.9, 59.6
MMLU-Redux: 93.1, 90.4, 89.5, 83.7, 86.1, 90.2, 87.5
GPQA: 73.8, 63.1, 65.8, 55.9, 65.8, 71.2, 61.4, 68.4
SuperGPQA: 55.8, 49.7, 51.8, 42.7, 47.8, 57.8, 51.2

Reasoning
AIME 25: 88.3, 72.6, 70.9, 65.6, 81.3, 89.1, 71.6
HMMT 25: 61.5, 48.1, 49.8, 42.1, 55.5, 63.1, 49.2
LiveBench: 77.8, 69.6, 74.3, 63.6, 71.8, 76.1, 68.4, 50.7

Coding
LiveCodeBench: 61.2, 54.4, 57.4, 48.4, 55.2, 62.5, 55.6, 49.2, 68.5
CFEval: 1952, 1771, 1940, 1671, 1852, 1912, 1730
OJBench: 23.9, 22.1, 20.7, 16.1, 17.9, 22.9, 21.2, 24

Alignment
IFEval: 93.4, 87.9, 86.5, 81.9, 87.4, 92.8, 86.4, 92.1
Arena-Hard v2: 40.9, 19.7, 36.3, 13.7, 34.9, 41.8, 20.4
WritingBench: 77, 73.5, 83.3

Multilingual
MultiIF: 83.3, 72.3, 72.2, 66.3, 77.3, 81.2, 70.9
INCLUDE: 70.4, 67.8, 71.9, 61.8, 64.4, 71.2, 66.9

Text Agents
BFCL-v3: 69.1, 65.9, 71.2
TAU1-Retail: 61.7, 33.9, 66.1

Table 2: Vision & Multimodal

Includes: Qwen3-VL, Gemini 2.5, GPT-5 Nano, NGen 3.5, and Reference Models

Models, in score order: Qwen3-VL 8B, Qwen3-VL 4B, Gemini 2.5 Flash Lite, GPT-5 Nano High, NGen 3.5 Max, NGen 3.5 Pro, Llama 3.2 11B, Other Best (Ref). Rows with fewer scores omit the models without reported results; the parenthesized entry names the best reference model.

STEM & Reasoning
MMMU (Val): 74.1, 70.8, 73.4, 75.8, 75.6, 73.4, 50.7, 75.6 (InternVL)
MMMU (Pro): 60.4, 57, 59.7, 57.2, 57.1 (GLM-4)
MathVista: 81.4, 79.5, 72.8, 71.5, 83.2, 79.7, 73.7, 81.8 (MiMo)
MathVision: 62.7, 60, 52.1, 62.2, 60.4 (MiMo)
MathVerse: 77.7, 75.2, 69.6, 74.2, 68.1, 52.8, 71.5 (MiMo)

General VQA
MMBench: 87.5, 86.7, 82.7, 80.3, 85.8 (GLM-4)
RealWorldQA: 73.5, 73.2, 72.2, 71.8, 76.9, 76.9, 72.3 (InternVL)
MMStar: 75.3, 73.2, 69.1, 68.6, 72.9 (GLM-4)

OCR & Doc
DocVQA: 95.3, 94.2, 92.5, 88.2, 95.6, 90.3, 88.4, 95.7 (MiMo)
OCRBench: 819, 808, 825, 753, 880 (InternVL)
ChartQA: 85.9, 82.2, 83.4, 90.0 (InternVL)
AI2D: 84.9, 84.9, 85.7, 81.9, 95.5, 90.1, 87.9 (GLM-4)
InfoVQA: 86, 83, 81.5, 68.6, 88.0 (MiMo)

Video
VideoMME: 71.8, 68.9, 72.7, 66.2, 73.0 (Keye)
MVBench: 69, 69.3, 74.7, 74.9, 72.1 (InternVL)
MLVU: 75.1, 75.7, 78.5, 69.2, 73.0 (InternVL)

Agents
ScreenSpot (Mobile): 93.6, 92.9, 68.2, 64.2, 87.3 (MiMo)
AndroidWorld: 50, 52, 41.7 (GLM-4)
OSWorld: 33.9, 31.4, 14.9 (GLM-4)

Inference Performance & Specifications

Context length capabilities across different models

Model                      Context Length (Tokens)
NGen 3.9 Pro               262,144
NGen 3.9 MaxV3             262,144
NGen 3.5 Pro               262,144
NGen 3.5 Max               262,144
Llama 3.2 11B Vision       131,072
Llama 3.3 70B FP8          131,072
Llama 4 Maverick 17B       1,024,000
DeepSeek V3                131,072
Gemini 2.5 Flash-Lite      1,048,576
GPT-5 Nano High            131,072
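As a quick sketch, the context lengths in the table above can be kept as a small lookup for checking whether a prompt fits a model's window. The `CONTEXT_LENGTHS` dict and `models_that_fit` helper are illustrative, not part of any official API:

```python
# Context windows from the table above, in tokens.
# The dict and helper names below are hypothetical, for illustration only.
CONTEXT_LENGTHS = {
    "NGen 3.9 Pro": 262_144,
    "NGen 3.9 MaxV3": 262_144,
    "NGen 3.5 Pro": 262_144,
    "NGen 3.5 Max": 262_144,
    "Llama 3.2 11B Vision": 131_072,
    "Llama 3.3 70B FP8": 131_072,
    "Llama 4 Maverick 17B": 1_024_000,
    "DeepSeek V3": 131_072,
    "Gemini 2.5 Flash-Lite": 1_048_576,
    "GPT-5 Nano High": 131_072,
}

def models_that_fit(prompt_tokens: int) -> list[str]:
    """Return the models whose context window can hold the prompt."""
    return [m for m, n in CONTEXT_LENGTHS.items() if n >= prompt_tokens]

# A 500k-token prompt fits only the two 1M+ context models.
print(models_that_fit(500_000))
```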
Llama 4 Maverick and Gemini 2.5 Flash-Lite are the only listed models with 1M+ token context windows.