Full evaluation results across reasoning, science, medical, coding, and retrieval benchmarks
Last updated: May 27, 2026

Primary source: Official Meta AI announcement

Interpretation: Strong on health, charts, and science; weaker on coding and abstract reasoning.

| Category | Benchmark | Muse Spark | GPT-5.4 | Gemini 3.1 Pro | Notes | Source |
|---|---|---|---|---|---|---|
| Overall | AA v4.0 | 52 | 57 | 57 | Comprehensive index | Artificial Analysis |
| Chart Understanding | CharXiv | 86.4 | 82.8 | 80.2 | Chart comprehension | Meta / AA |
| Medical | HealthBench Hard | 42.8 | 40.1 | 20.6 | Medical QA | Meta / HealthBench |
| Deep Search | DeepSearchQA | 74.8 | — | 69.7 | Research retrieval | Meta |
| Reasoning | HLE (Fast) | 36.5 | 43.9 | 48.4 | — | Meta |
| Reasoning | HLE (Contemplating) | 50.2 | 43.9 | 48.4 | Extended reasoning | Meta |
| Science | FrontierScience | 38.3 | 36.7 | 23.3 | Research frontier | Meta / FrontierScience |
| Abstract | ARC AGI 2 | 42.5 | 76.1 | 76.5 | Pattern reasoning | Meta / AA |
| Coding | Terminal-Bench 2.0 | 59.0 | 75.1 | 68.5 | Agentic coding | Meta / Terminal-Bench |
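
Sources combine Meta's launch materials and public benchmark summaries. Treat all launch-week benchmark data as subject to revision as third-party evaluators publish updated results.

Among the models with reported scores, Muse Spark leads on chart understanding (CharXiv), medical QA (HealthBench Hard), deep search (DeepSearchQA, where no GPT-5.4 score is listed), science (FrontierScience), and extended reasoning (HLE Contemplating). It trails on the AA v4.0 index, HLE (Fast), ARC AGI 2, and Terminal-Bench 2.0.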
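
The summary can be checked directly against the table. The sketch below is illustrative only: the `scores` dictionary is hand-copied from the table, `None` stands in for the one score the table leaves blank, and the helper names `leaders` and `margin` are hypothetical, not part of any Meta or benchmark tooling. It takes the table at face value and assumes higher is better on every benchmark.

```python
# Scores hand-copied from the table; None marks a score the table does not report.
scores = {
    "AA v4.0":             {"Muse Spark": 52,   "GPT-5.4": 57,   "Gemini 3.1 Pro": 57},
    "CharXiv":             {"Muse Spark": 86.4, "GPT-5.4": 82.8, "Gemini 3.1 Pro": 80.2},
    "HealthBench Hard":    {"Muse Spark": 42.8, "GPT-5.4": 40.1, "Gemini 3.1 Pro": 20.6},
    "DeepSearchQA":        {"Muse Spark": 74.8, "GPT-5.4": None, "Gemini 3.1 Pro": 69.7},
    "HLE (Fast)":          {"Muse Spark": 36.5, "GPT-5.4": 43.9, "Gemini 3.1 Pro": 48.4},
    "HLE (Contemplating)": {"Muse Spark": 50.2, "GPT-5.4": 43.9, "Gemini 3.1 Pro": 48.4},
    "FrontierScience":     {"Muse Spark": 38.3, "GPT-5.4": 36.7, "Gemini 3.1 Pro": 23.3},
    "ARC AGI 2":           {"Muse Spark": 42.5, "GPT-5.4": 76.1, "Gemini 3.1 Pro": 76.5},
    "Terminal-Bench 2.0":  {"Muse Spark": 59.0, "GPT-5.4": 75.1, "Gemini 3.1 Pro": 68.5},
}


def leaders(row):
    """Return the model(s) with the highest reported score in one table row."""
    reported = {model: s for model, s in row.items() if s is not None}
    best = max(reported.values())
    return sorted(model for model, s in reported.items() if s == best)


def margin(row, model="Muse Spark"):
    """Gap between `model` and the best other reported score (positive = ahead)."""
    others = [s for m, s in row.items() if m != model and s is not None]
    return round(row[model] - max(others), 1)


# Benchmarks where Muse Spark is the sole leader among reported scores.
muse_leads = [name for name, row in scores.items() if leaders(row) == ["Muse Spark"]]

for name, row in scores.items():
    print(f"{name:<20} leader={', '.join(leaders(row)):<24} margin={margin(row):+}")
```

The margins make the spread visible: the leads are a few points each, while the ARC AGI 2 and Terminal-Bench 2.0 gaps are in the double digits.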