This page reports the results of benchmarking the models available on OpenRouter on their ability to compute derivatives. You can read the motivation for this benchmark here.
The benchmark, in its current form, is mostly useful as a rough guide. We would ideally have tested more derivatives, enough for the accuracy of the top models to fall below ceiling and separate them, but we were unfortunately constrained by the budget we dedicated to this project. If you wish to sponsor this project with a BYOK (bring your own key), please reach out at hello@randomdomain.co.za.
The key motivation for this project is to determine whether an LLM can, through pure symbolic manipulation, compute a derivative correctly. We believe that differentiation, at some level, measures an LLM's ability to follow a set of rules strictly to a definite, verifiable result.
The first table shows the best-performing model(s) from each provider.
| Provider | Top Model(s) | Accuracy | Correct | Total | Incorrect | Errors |
|---|---|---|---|---|---|---|
| google | Gemini-2.5-Flash | 98.9 ± 1.1% | 62 | 62 | 0 | 0 |
| openai | Gpt-5 | 98.9 ± 1.1% | 61 | 61 | 0 | 0 |
| qwen | Qwen3-Max | 98.9 ± 1.1% | 61 | 61 | 0 | 0 |
| x-ai | Grok-4 | 98.9 ± 1.1% | 61 | 61 | 0 | 0 |
| anthropic | Claude-3.7-Sonnet:thinking | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| switchpoint | Router | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| z-ai | Glm-4.5 | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| meta-llama | Llama-4-Maverick:free | 95.6 ± 3.3% | 58 | 60 | 1 | 1 |
| nvidia | Llama-3.3-Nemotron-Super-49b-V1.5 | 95.5 ± 3.4% | 56 | 58 | 2 | 0 |
| deepseek | Deepseek-R1 | 95.0 ± 3.8% | 50 | 52 | 1 | 1 |
| meituan | Longcat-Flash-Chat | 93.8 ± 4.3% | 55 | 58 | 3 | 0 |
| inclusionai | Ling-1t | 93.7 ± 4.4% | 54 | 57 | 2 | 1 |
| openrouter | Auto | 93.1 ± 4.8% | 49 | 52 | 3 | 0 |
| perplexity | Sonar-Reasoning | 93.1 ± 4.8% | 49 | 52 | 1 | 2 |
| baidu | Ernie-4.5-300b-A47b | 92.1 ± 5.1% | 54 | 58 | 4 | 0 |
| mistralai | Mistral-Medium-3.1 | 92.0 ± 5.1% | 53 | 57 | 3 | 1 |
| microsoft | Phi-4-Reasoning-Plus | 90.4 ± 5.7% | 53 | 58 | 4 | 1 |
| moonshotai | Kimi-K2 | 88.8 ± 6.3% | 52 | 58 | 5 | 1 |
| alibaba | Tongyi-Deepresearch-30b-A3b | 88.6 ± 6.4% | 51 | 57 | 6 | 0 |
| arcee-ai | Virtuoso-Large | 88.6 ± 6.4% | 51 | 57 | 6 | 0 |
| stepfun-ai | Step3 | 80.8 ± 12.1% | 19 | 23 | 2 | 2 |
| cohere | Command-A | 80.5 ± 10.2% | 31 | 38 | 4 | 3 |
| deepcogito | Cogito-V2-Preview-Deepseek-671b | 80.5 ± 10.2% | 31 | 38 | 5 | 2 |
| tencent | Hunyuan-A13b-Instruct | 78.5 ± 12.6% | 20 | 25 | 5 | 0 |
| minimax | Minimax-M1 | 76.7 ± 13.5% | 18 | 23 | 2 | 3 |
| aion-labs | Aion-1.0-Mini | 71.9 ± 14.3% | 19 | 26 | 6 | 1 |
| nousresearch | Hermes-3-Llama-3.1-405b | 70.9 ± 14.8% | 18 | 25 | 6 | 1 |
| amazon | Nova-Pro-V1 | 69.7 ± 15.3% | 17 | 24 | 7 | 0 |
| thedrummer | Cydonia-24b-V4.1 | 68.5 ± 15.9% | 16 | 23 | 6 | 1 |
| inception | Mercury-Coder | 65.7 ± 17.1% | 14 | 21 | 5 | 2 |
| cognitivecomputations | Dolphin-Mistral-24b-Venice-Edition:free | 63.6 ± 19.1% | 11 | 17 | 5 | 1 |
| sao10k | L3.3-Euryale-70b | 59.2 ± 21.1% | 9 | 15 | 3 | 3 |
| ai21 | Jamba-Large-1.7 | 56.5 ± 22.2% | 8 | 14 | 5 | 1 |
| anthracite-org | Magnum-V4-72b | 50.0 ± 24.9% | 6 | 12 | 3 | 3 |
| ibm-granite | Granite-4.0-H-Micro | 50.0 ± 24.9% | 6 | 12 | 5 | 1 |
| inflection | Inflection-3-Productivity | 32.1 ± 33.0% | 2 | 7 | 5 | 0 |
| bytedance | Ui-Tars-1.5-7b | 22.8 ± 35.0% | 1 | 6 | 4 | 1 |
| neversleep | Noromaid-20b | 22.8 ± 35.0% | 1 | 6 | 4 | 1 |
| eleutherai | Llemma_7b | 12.9 ± 39.2% | 0 | 4 | 2 | 2 |
| gryphe | Mythomax-L2-13b | 12.9 ± 39.2% | 0 | 4 | 3 | 1 |
| mancer | Weaver | 12.9 ± 39.2% | 0 | 4 | 2 | 2 |
| undi95 | Remm-Slerp-L2-13b | 12.9 ± 39.2% | 0 | 4 | 3 | 1 |
The second table ranks every model and variant we tested.

| Rank | Model | Accuracy | Correct | Total | Incorrect | Errors |
|---|---|---|---|---|---|---|
| 1 | Google/Gemini-2.5-Flash | 98.9 ± 1.1% | 62 | 62 | 0 | 0 |
| 2 | Google/Gemini-2.5-Pro | 98.9 ± 1.1% | 61 | 61 | 0 | 0 |
| 2 | Google/Gemini-2.5-Pro-Preview | 98.9 ± 1.1% | 61 | 61 | 0 | 0 |
| 2 | Openai/Gpt-5 | 98.9 ± 1.1% | 61 | 61 | 0 | 0 |
| 2 | Qwen/Qwen3-Max | 98.9 ± 1.1% | 61 | 61 | 0 | 0 |
| 2 | X-ai/Grok-4 | 98.9 ± 1.1% | 61 | 61 | 0 | 0 |
| 3 | Anthropic/Claude-3.7-Sonnet:thinking | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| 3 | Google/Gemini-2.5-Flash-Image | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| 3 | Openai/Gpt-5-Codex | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| 3 | Openai/Gpt-5-Image-Mini | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| 3 | Openai/O3-Mini | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| 3 | Switchpoint/Router | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| 3 | Z-ai/Glm-4.5 | 98.9 ± 1.1% | 60 | 60 | 0 | 0 |
| 4 | Openai/Gpt-5-Mini | 97.3 ± 2.3% | 60 | 61 | 1 | 0 |
| 4 | Openai/O3 | 97.3 ± 2.3% | 60 | 61 | 1 | 0 |
| 5 | Openai/Gpt-5-Nano | 97.3 ± 2.3% | 59 | 60 | 1 | 0 |
| 5 | Openai/O3-Mini-High | 97.3 ± 2.3% | 59 | 60 | 1 | 0 |
| 5 | Openai/O4-Mini-Deep-Research | 97.3 ± 2.3% | 59 | 60 | 0 | 1 |
| 5 | Qwen/Qwen3-Next-80b-A3b-Instruct | 97.3 ± 2.3% | 59 | 60 | 1 | 0 |
| 6 | Google/Gemini-2.5-Pro-Preview-05-06 | 97.2 ± 2.4% | 58 | 59 | 1 | 0 |
| 7 | Google/Gemini-2.5-Flash-Image-Preview | 97.2 ± 2.4% | 57 | 58 | 1 | 0 |
| 7 | Qwen/Qwen-Plus | 97.2 ± 2.4% | 57 | 58 | 1 | 0 |
| 7 | Qwen/Qwen3-Vl-30b-A3b-Instruct | 97.2 ± 2.4% | 57 | 58 | 1 | 0 |
| 8 | Qwen/Qwen3-30b-A3b-Thinking-2507 | 97.1 ± 2.5% | 56 | 57 | 1 | 0 |
| 8 | Qwen/Qwen3-Next-80b-A3b-Thinking | 97.1 ± 2.5% | 56 | 57 | 1 | 0 |
| 8 | Qwen/Qwen3-Vl-235b-A22b-Instruct | 97.1 ± 2.5% | 56 | 57 | 1 | 0 |
| 9 | Openai/O4-Mini | 97.0 ± 2.5% | 54 | 55 | 1 | 0 |
| 9 | Openai/O4-Mini-High | 97.0 ± 2.5% | 54 | 55 | 1 | 0 |
| 10 | Qwen/Qwen3-235b-A22b-Thinking-2507 | 96.9 ± 2.6% | 52 | 53 | 1 | 0 |
| 11 | Qwen/Qwq-32b | 96.9 ± 2.7% | 51 | 52 | 1 | 0 |
| 12 | Anthropic/Claude-Sonnet-4 | 95.8 ± 3.2% | 60 | 62 | 2 | 0 |
| 12 | X-ai/Grok-3 | 95.8 ± 3.2% | 60 | 62 | 2 | 0 |
| 13 | Meta-llama/Llama-4-Maverick:free | 95.6 ± 3.3% | 58 | 60 | 1 | 1 |
| 14 | Qwen/Qwen-Vl-Max | 95.6 ± 3.4% | 57 | 59 | 2 | 0 |
| 15 | Nvidia/Llama-3.3-Nemotron-Super-49b-V1.5 | 95.5 ± 3.4% | 56 | 58 | 2 | 0 |
| 15 | Openai/Gpt-4.1-Mini | 95.5 ± 3.4% | 56 | 58 | 1 | 1 |
| 15 | Qwen/Qwen-2.5-Coder-32b-Instruct | 95.5 ± 3.4% | 56 | 58 | 2 | 0 |
| 15 | Qwen/Qwen-Max | 95.5 ± 3.4% | 56 | 58 | 1 | 1 |
| 15 | Qwen/Qwen-Plus-2025-07-28:thinking | 95.5 ± 3.4% | 56 | 58 | 2 | 0 |
| 16 | Google/Gemini-2.5-Flash-Lite-Preview-06-17 | 95.4 ± 3.5% | 55 | 57 | 1 | 1 |
| 16 | Nvidia/Nemotron-Nano-9b-V2 | 95.4 ± 3.5% | 55 | 57 | 2 | 0 |
| 16 | Qwen/Qwen-Plus-2025-07-28 | 95.4 ± 3.5% | 55 | 57 | 2 | 0 |
| 16 | Qwen/Qwen3-235b-A22b-2507 | 95.4 ± 3.5% | 55 | 57 | 2 | 0 |
| 17 | Deepseek/Deepseek-R1 | 95.0 ± 3.8% | 50 | 52 | 1 | 1 |
| 17 | Qwen/Qwen3-235b-A22b:free | 95.0 ± 3.8% | 50 | 52 | 1 | 1 |
| 18 | Qwen/Qwen3-Vl-30b-A3b-Thinking | 94.1 ± 4.5% | 42 | 44 | 2 | 0 |
| 19 | Deepseek/Deepseek-Chat | 93.8 ± 4.3% | 55 | 58 | 3 | 0 |
| 19 | Google/Gemini-2.5-Flash-Preview-09-2025 | 93.8 ± 4.3% | 55 | 58 | 3 | 0 |
| 19 | Meituan/Longcat-Flash-Chat | 93.8 ± 4.3% | 55 | 58 | 3 | 0 |
| 19 | Openai/Gpt-5-Chat | 93.8 ± 4.3% | 55 | 58 | 1 | 2 |
| 19 | Openai/Gpt-Oss-20b | 93.8 ± 4.3% | 55 | 58 | 3 | 0 |
| 19 | Qwen/Qwen3-14b | 93.8 ± 4.3% | 55 | 58 | 3 | 0 |
| 19 | Qwen/Qwen3-30b-A3b-Instruct-2507 | 93.8 ± 4.3% | 55 | 58 | 3 | 0 |
| 20 | Inclusionai/Ling-1t | 93.7 ± 4.4% | 54 | 57 | 2 | 1 |
| 20 | Openai/Gpt-Oss-120b | 93.7 ± 4.4% | 54 | 57 | 3 | 0 |
| 21 | Openai/O1-Mini | 93.6 ± 4.5% | 53 | 56 | 3 | 0 |
| 21 | Qwen/Qwen3-Coder-Plus | 93.6 ± 4.5% | 53 | 56 | 3 | 0 |
| 21 | Z-ai/Glm-4.6 | 93.6 ± 4.5% | 53 | 56 | 3 | 0 |
| 22 | Deepseek/Deepseek-R1-0528 | 93.5 ± 4.9% | 38 | 40 | 1 | 1 |
| 23 | Anthropic/Claude-Sonnet-4.5 | 93.2 ± 4.7% | 50 | 53 | 2 | 1 |
| 24 | Openrouter/Auto | 93.1 ± 4.8% | 49 | 52 | 3 | 0 |
| 24 | Perplexity/Sonar-Reasoning | 93.1 ± 4.8% | 49 | 52 | 1 | 2 |
| 24 | Qwen/Qwen3-8b | 93.1 ± 4.8% | 49 | 52 | 3 | 0 |
| 25 | Inclusionai/Ring-1t | 92.6 ± 5.6% | 33 | 35 | 2 | 0 |
| 26 | Nvidia/Nemotron-Nano-9b-V2:free | 92.5 ± 4.8% | 57 | 61 | 3 | 1 |
| 27 | Baidu/Ernie-4.5-300b-A47b | 92.1 ± 5.1% | 54 | 58 | 4 | 0 |
| 27 | Google/Gemini-2.5-Flash-Lite-Preview-09-2025 | 92.1 ± 5.1% | 54 | 58 | 3 | 1 |
| 27 | Meta-llama/Llama-4-Maverick | 92.1 ± 5.1% | 54 | 58 | 2 | 2 |
| 28 | Anthropic/Claude-3.5-Sonnet | 92.0 ± 5.1% | 53 | 57 | 3 | 1 |
| 28 | Mistralai/Mistral-Medium-3.1 | 92.0 ± 5.1% | 53 | 57 | 3 | 1 |
| 28 | Qwen/Qwen3-Vl-8b-Instruct | 92.0 ± 5.1% | 53 | 57 | 4 | 0 |
| 28 | X-ai/Grok-3-Mini | 92.0 ± 5.1% | 53 | 57 | 4 | 0 |
| 28 | X-ai/Grok-4-Fast | 92.0 ± 5.1% | 53 | 57 | 4 | 0 |
| 29 | Openai/Codex-Mini | 91.6 ± 5.4% | 50 | 54 | 3 | 1 |
| 29 | X-ai/Grok-3-Beta | 91.6 ± 5.4% | 50 | 54 | 4 | 0 |
| 30 | Anthropic/Claude-3.7-Sonnet | 91.4 ± 5.5% | 49 | 53 | 3 | 1 |
| 31 | Openai/Gpt-Oss-20b:free | 90.9 ± 5.5% | 56 | 61 | 4 | 1 |
| 32 | Deepseek/Deepseek-Prover-V2 | 90.4 ± 5.7% | 53 | 58 | 4 | 1 |
| 32 | Google/Gemini-2.0-Flash-001 | 90.4 ± 5.7% | 53 | 58 | 5 | 0 |
| 32 | Microsoft/Phi-4-Reasoning-Plus | 90.4 ± 5.7% | 53 | 58 | 4 | 1 |
| 33 | Qwen/Qwen3-Vl-8b-Thinking | 90.3 ± 6.2% | 43 | 47 | 1 | 3 |
| 34 | Deepseek/Deepseek-Chat-V3-0324 | 90.3 ± 5.8% | 52 | 57 | 3 | 2 |
| 34 | Google/Gemma-3-27b-It | 90.3 ± 5.8% | 52 | 57 | 5 | 0 |
| 35 | Openai/O1-Mini-2024-09-12 | 90.1 ± 5.9% | 51 | 56 | 4 | 1 |
| 36 | Qwen/Qwen3-4b:free | 89.1 ± 10.5% | 5 | 5 | 0 | 0 |
| 37 | Deepseek/Deepseek-V3.1-Terminus | 88.8 ± 6.3% | 52 | 58 | 5 | 1 |
| 37 | Mistralai/Mistral-Medium-3 | 88.8 ± 6.3% | 52 | 58 | 4 | 2 |
| 37 | Moonshotai/Kimi-K2 | 88.8 ± 6.3% | 52 | 58 | 5 | 1 |
| 38 | Alibaba/Tongyi-Deepresearch-30b-A3b | 88.6 ± 6.4% | 51 | 57 | 6 | 0 |
| 38 | Arcee-ai/Virtuoso-Large | 88.6 ± 6.4% | 51 | 57 | 6 | 0 |
| 38 | X-ai/Grok-3-Mini-Beta | 88.6 ± 6.4% | 51 | 57 | 6 | 0 |
| 39 | Qwen/Qwen3-Vl-235b-A22b-Thinking | 86.3 ± 8.2% | 35 | 40 | 2 | 3 |
| 40 | Anthropic/Claude-3.5-Sonnet-20240620 | 85.9 ± 7.5% | 46 | 53 | 7 | 0 |
| 40 | Nvidia/Llama-3.1-Nemotron-Ultra-253b-V1 | 85.9 ± 7.5% | 46 | 53 | 6 | 1 |
| 41 | Deepseek/Deepseek-Chat-V3.1 | 85.6 ± 7.6% | 45 | 52 | 6 | 1 |
| 42 | Qwen/Qwen3-Coder | 85.3 ± 7.8% | 44 | 51 | 4 | 3 |
| 42 | Z-ai/Glm-4.5-Air | 85.3 ± 7.8% | 44 | 51 | 4 | 3 |
| 43 | Mistralai/Mistral-Large | 85.1 ± 7.9% | 43 | 50 | 7 | 0 |
| 44 | Baidu/Ernie-4.5-21b-A3b | 84.8 ± 8.1% | 42 | 49 | 6 | 1 |
| 45 | Mistralai/Mistral-Large-2411 | 84.5 ± 8.2% | 41 | 48 | 4 | 3 |
| 45 | Perplexity/Sonar-Reasoning-Pro | 84.5 ± 8.2% | 41 | 48 | 3 | 4 |
| 46 | Meta-llama/Llama-3.3-70b-Instruct | 84.1 ± 8.4% | 40 | 47 | 5 | 2 |
| 47 | Perplexity/Sonar-Deep-Research | 84.0 ± 9.5% | 29 | 34 | 2 | 3 |
| 48 | Moonshotai/Kimi-Dev-72b | 82.6 ± 9.7% | 31 | 37 | 3 | 3 |
| 48 | Z-ai/Glm-4.5v | 82.6 ± 9.7% | 31 | 37 | 6 | 0 |
| 49 | Deepseek/Deepseek-V3.2-Exp | 81.4 ± 9.7% | 33 | 40 | 6 | 1 |
| 50 | Qwen/Qwen2.5-Vl-32b-Instruct | 81.1 ± 10.4% | 28 | 34 | 4 | 2 |
| 51 | Stepfun-ai/Step3 | 80.8 ± 12.1% | 19 | 23 | 2 | 2 |
| 52 | Deepseek/Deepseek-R1-Distill-Qwen-32b | 80.7 ± 11.3% | 23 | 28 | 4 | 1 |
| 53 | Cohere/Command-A | 80.5 ± 10.2% | 31 | 38 | 4 | 3 |
| 53 | Deepcogito/Cogito-V2-Preview-Deepseek-671b | 80.5 ± 10.2% | 31 | 38 | 5 | 2 |
| 53 | Mistralai/Devstral-Medium | 80.5 ± 10.2% | 31 | 38 | 4 | 3 |
| 53 | Openai/Gpt-4o-Mini-2024-07-18 | 80.5 ± 10.2% | 31 | 38 | 5 | 2 |
| 53 | Perplexity/Sonar-Pro | 80.5 ± 10.2% | 31 | 38 | 3 | 4 |
| 53 | X-ai/Grok-Code-Fast-1 | 80.5 ± 10.2% | 31 | 38 | 5 | 2 |
| 54 | Baidu/Ernie-4.5-21b-A3b-Thinking | 78.9 ± 11.0% | 28 | 35 | 7 | 0 |
| 54 | Google/Gemini-2.5-Flash-Lite | 78.9 ± 11.0% | 28 | 35 | 7 | 0 |
| 54 | Meta-llama/Llama-3.3-70b-Instruct:free | 78.9 ± 11.0% | 28 | 35 | 3 | 4 |
| 54 | Qwen/Qwen-Vl-Plus | 78.9 ± 11.0% | 28 | 35 | 6 | 1 |
| 54 | Qwen/Qwen2.5-Vl-72b-Instruct | 78.9 ± 11.0% | 28 | 35 | 6 | 1 |
| 54 | Qwen/Qwen3-30b-A3b | 78.9 ± 11.0% | 28 | 35 | 6 | 1 |
| 55 | Tencent/Hunyuan-A13b-Instruct | 78.5 ± 12.6% | 20 | 25 | 5 | 0 |
| 56 | Z-ai/Glm-4-32b | 78.3 ± 11.3% | 27 | 34 | 7 | 0 |
| 57 | Qwen/Qwen3-235b-A22b | 77.3 ± 12.4% | 22 | 28 | 2 | 4 |
| 58 | Minimax/Minimax-M1 | 76.7 ± 13.5% | 18 | 23 | 2 | 3 |
| 59 | Qwen/Qwen-2.5-72b-Instruct | 76.3 ± 12.3% | 24 | 31 | 6 | 1 |
| 60 | Deepseek/Deepseek-R1-0528-Qwen3-8b | 75.6 ± 13.3% | 20 | 26 | 4 | 2 |
| 61 | Baidu/Ernie-4.5-Vl-424b-A47b | 74.7 ± 13.0% | 22 | 29 | 5 | 2 |
| 61 | Meta-llama/Llama-3.2-90b-Vision-Instruct | 74.7 ± 13.0% | 22 | 29 | 4 | 3 |
| 61 | Mistralai/Mistral-Small-3.2-24b-Instruct | 74.7 ± 13.0% | 22 | 29 | 6 | 1 |
| 62 | Mistralai/Magistral-Medium-2506:thinking | 74.7 ± 13.8% | 19 | 25 | 2 | 4 |
| 62 | Qwen/Qwen3-32b | 74.7 ± 13.8% | 19 | 25 | 6 | 0 |
| 63 | Microsoft/Phi-4-Multimodal-Instruct | 73.9 ± 13.4% | 21 | 28 | 7 | 0 |
| 64 | Mistralai/Mistral-Saba | 72.9 ± 13.8% | 20 | 27 | 7 | 0 |
| 65 | Aion-labs/Aion-1.0-Mini | 71.9 ± 14.3% | 19 | 26 | 6 | 1 |
| 65 | Arcee-ai/Coder-Large | 71.9 ± 14.3% | 19 | 26 | 6 | 1 |
| 65 | Deepcogito/Cogito-V2-Preview-Llama-70b | 71.9 ± 14.3% | 19 | 26 | 7 | 0 |
| 65 | Deepseek/Deepseek-R1-Distill-Qwen-14b | 71.9 ± 14.3% | 19 | 26 | 5 | 2 |
| 65 | Moonshotai/Kimi-K2-0905 | 71.9 ± 14.3% | 19 | 26 | 5 | 2 |
| 65 | Openai/Gpt-4.1-Nano | 71.9 ± 14.3% | 19 | 26 | 4 | 3 |
| 65 | Openai/Gpt-4o | 71.9 ± 14.3% | 19 | 26 | 3 | 4 |
| 65 | Openai/Gpt-4o-2024-11-20 | 71.9 ± 14.3% | 19 | 26 | 4 | 3 |
| 65 | Qwen/Qwen3-Coder-Flash | 71.9 ± 14.3% | 19 | 26 | 6 | 1 |
| 66 | Google/Gemma-3-12b-It | 70.9 ± 14.8% | 18 | 25 | 7 | 0 |
| 66 | Meta-llama/Llama-4-Scout | 70.9 ± 14.8% | 18 | 25 | 1 | 6 |
| 66 | Mistralai/Mistral-Large-2407 | 70.9 ± 14.8% | 18 | 25 | 6 | 1 |
| 66 | Nousresearch/Hermes-3-Llama-3.1-405b | 70.9 ± 14.8% | 18 | 25 | 6 | 1 |
| 67 | Aion-labs/Aion-1.0 | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Anthropic/Claude-3-Opus | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Anthropic/Claude-Opus-4 | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Anthropic/Claude-Opus-4.1 | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Deepcogito/Cogito-V2-Preview-Llama-405b | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Openai/Chatgpt-4o-Latest | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Openai/Gpt-4 | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Openai/Gpt-4-Turbo | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Openai/Gpt-4o-2024-05-13 | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Openai/Gpt-5-Image | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Openai/Gpt-5-Pro | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Openai/O1 | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Openai/O1-Pro | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 67 | Openai/O3-Deep-Research | 70.7 ± 28.0% | 1 | 1 | 0 | 0 |
| 68 | Baidu/Ernie-4.5-Vl-28b-A3b | 70.1 ± 16.0% | 15 | 21 | 4 | 2 |
| 69 | Amazon/Nova-Pro-V1 | 69.7 ± 15.3% | 17 | 24 | 7 | 0 |
| 69 | Google/Gemini-2.0-Flash-Lite-001 | 69.7 ± 15.3% | 17 | 24 | 7 | 0 |
| 69 | Google/Gemma-3-4b-It | 69.7 ± 15.3% | 17 | 24 | 5 | 2 |
| 69 | Mistralai/Devstral-Small-2505 | 69.7 ± 15.3% | 17 | 24 | 4 | 3 |
| 69 | Nousresearch/Hermes-4-405b | 69.7 ± 15.3% | 17 | 24 | 7 | 0 |
| 69 | Openai/Gpt-4o-Search-Preview | 69.7 ± 15.3% | 17 | 24 | 6 | 1 |
| 69 | Qwen/Qwen-2.5-7b-Instruct | 69.7 ± 15.3% | 17 | 24 | 5 | 2 |
| 69 | Qwen/Qwen-Turbo | 69.7 ± 15.3% | 17 | 24 | 5 | 2 |
| 69 | Qwen/Qwen3-Coder-30b-A3b-Instruct | 69.7 ± 15.3% | 17 | 24 | 6 | 1 |
| 70 | Anthropic/Claude-3.5-Haiku | 68.5 ± 15.9% | 16 | 23 | 7 | 0 |
| 70 | Deepcogito/Cogito-V2-Preview-Llama-109b-Moe | 68.5 ± 15.9% | 16 | 23 | 7 | 0 |
| 70 | Mistralai/Devstral-Small | 68.5 ± 15.9% | 16 | 23 | 6 | 1 |
| 70 | Mistralai/Pixtral-Large-2411 | 68.5 ± 15.9% | 16 | 23 | 7 | 0 |
| 70 | Perplexity/Sonar | 68.5 ± 15.9% | 16 | 23 | 1 | 6 |
| 70 | Thedrummer/Cydonia-24b-V4.1 | 68.5 ± 15.9% | 16 | 23 | 6 | 1 |
| 71 | Inception/Mercury-Coder | 65.7 ± 17.1% | 14 | 21 | 5 | 2 |
| 71 | Thedrummer/Anubis-70b-V1.1 | 65.7 ± 17.1% | 14 | 21 | 5 | 2 |
| 72 | Microsoft/Phi-4 | 65.5 ± 18.2% | 12 | 18 | 5 | 1 |
| 72 | Mistralai/Codestral-2508 | 65.5 ± 18.2% | 12 | 18 | 6 | 0 |
| 72 | Nousresearch/Hermes-3-Llama-3.1-405b:free | 65.5 ± 18.2% | 12 | 18 | 6 | 0 |
| 72 | Nvidia/Llama-3.1-Nemotron-70b-Instruct | 65.5 ± 18.2% | 12 | 18 | 2 | 4 |
| 73 | Google/Gemma-2-9b-It | 64.1 ± 17.8% | 13 | 20 | 7 | 0 |
| 73 | Openai/Gpt-4o-Mini | 64.1 ± 17.8% | 13 | 20 | 6 | 1 |
| 74 | Amazon/Nova-Lite-V1 | 63.6 ± 19.1% | 11 | 17 | 4 | 2 |
| 74 | Cognitivecomputations/Dolphin-Mistral-24b-Venice-Edition:free | 63.6 ± 19.1% | 11 | 17 | 5 | 1 |
| 74 | Meta-llama/Llama-3-70b-Instruct | 63.6 ± 19.1% | 11 | 17 | 5 | 1 |
| 74 | Nousresearch/Hermes-3-Llama-3.1-70b | 63.6 ± 19.1% | 11 | 17 | 5 | 1 |
| 75 | Mistralai/Mistral-Small-3.1-24b-Instruct:free | 61.5 ± 20.0% | 10 | 16 | 6 | 0 |
| 76 | Google/Gemma-3n-E4b-It | 60.3 ± 19.4% | 11 | 18 | 6 | 1 |
| 76 | Inception/Mercury | 60.3 ± 19.4% | 11 | 18 | 7 | 0 |
| 76 | Meta-llama/Llama-4-Scout:free | 60.3 ± 19.4% | 11 | 18 | 3 | 4 |
| 76 | Mistralai/Codestral-2501 | 60.3 ± 19.4% | 11 | 18 | 6 | 1 |
| 76 | Mistralai/Mistral-7b-Instruct-V0.3 | 60.3 ± 19.4% | 11 | 18 | 7 | 0 |
| 76 | Mistralai/Mistral-Small-3.1-24b-Instruct | 60.3 ± 19.4% | 11 | 18 | 4 | 3 |
| 76 | Openai/Gpt-4o-2024-08-06 | 60.3 ± 19.4% | 11 | 18 | 6 | 1 |
| 77 | Amazon/Nova-Micro-V1 | 59.2 ± 21.1% | 9 | 15 | 3 | 3 |
| 77 | Anthropic/Claude-Haiku-4.5 | 59.2 ± 21.1% | 9 | 15 | 1 | 5 |
| 77 | Mistralai/Magistral-Small-2506 | 59.2 ± 21.1% | 9 | 15 | 0 | 6 |
| 77 | Sao10k/L3.3-Euryale-70b | 59.2 ± 21.1% | 9 | 15 | 3 | 3 |
| 78 | Mistralai/Mistral-7b-Instruct:free | 58.2 ± 20.3% | 10 | 17 | 7 | 0 |
| 79 | Ai21/Jamba-Large-1.7 | 56.5 ± 22.2% | 8 | 14 | 5 | 1 |
| 79 | Meta-llama/Llama-3.1-405b-Instruct | 56.5 ± 22.2% | 8 | 14 | 3 | 3 |
| 79 | Meta-llama/Llama-3.1-70b-Instruct | 56.5 ± 22.2% | 8 | 14 | 4 | 2 |
| 79 | Thedrummer/Skyfall-36b-V2 | 56.5 ± 22.2% | 8 | 14 | 5 | 1 |
| 80 | Anthropic/Claude-3-Haiku | 53.5 ± 23.5% | 7 | 13 | 6 | 0 |
| 80 | Google/Gemma-2-27b-It | 53.5 ± 23.5% | 7 | 13 | 5 | 1 |
| 80 | Mistralai/Magistral-Medium-2506 | 53.5 ± 23.5% | 7 | 13 | 3 | 3 |
| 80 | Mistralai/Mistral-Small | 53.5 ± 23.5% | 7 | 13 | 5 | 1 |
| 80 | Mistralai/Pixtral-12b | 53.5 ± 23.5% | 7 | 13 | 4 | 2 |
| 80 | Openai/Gpt-4o-Mini-Search-Preview | 53.5 ± 23.5% | 7 | 13 | 6 | 0 |
| 81 | Anthracite-org/Magnum-V4-72b | 50.0 ± 24.9% | 6 | 12 | 3 | 3 |
| 81 | Ibm-granite/Granite-4.0-H-Micro | 50.0 ± 24.9% | 6 | 12 | 5 | 1 |
| 81 | Mistralai/Mistral-7b-Instruct | 50.0 ± 24.9% | 6 | 12 | 6 | 0 |
| 81 | Sao10k/L3.1-70b-Hanami-X1 | 50.0 ± 24.9% | 6 | 12 | 5 | 1 |
| 81 | Sao10k/L3.1-Euryale-70b | 50.0 ± 24.9% | 6 | 12 | 6 | 0 |
| 82 | Microsoft/Wizardlm-2-8x22b | 46.0 ± 26.4% | 5 | 11 | 2 | 4 |
| 82 | Mistralai/Ministral-8b | 46.0 ± 26.4% | 5 | 11 | 3 | 3 |
| 82 | Openai/Gpt-3.5-Turbo | 46.0 ± 26.4% | 5 | 11 | 6 | 0 |
| 82 | Sao10k/L3-Euryale-70b | 46.0 ± 26.4% | 5 | 11 | 5 | 1 |
| 83 | Meta-llama/Llama-3.2-3b-Instruct | 41.2 ± 28.0% | 4 | 10 | 5 | 1 |
| 84 | Meta-llama/Llama-3.3-8b-Instruct:free | 39.3 ± 30.8% | 3 | 8 | 1 | 4 |
| 84 | Mistralai/Ministral-3b | 39.3 ± 30.8% | 3 | 8 | 4 | 1 |
| 84 | Mistralai/Mistral-Small-24b-Instruct-2501 | 39.3 ± 30.8% | 3 | 8 | 3 | 2 |
| 85 | Cohere/Command-R-08-2024 | 36.4 ± 34.5% | 2 | 6 | 3 | 1 |
| 86 | Cohere/Command-R-Plus-08-2024 | 35.5 ± 29.7% | 3 | 9 | 5 | 1 |
| 86 | Mistralai/Mistral-Tiny | 35.5 ± 29.7% | 3 | 9 | 3 | 3 |
| 87 | Inflection/Inflection-3-Productivity | 32.1 ± 33.0% | 2 | 7 | 5 | 0 |
| 87 | Meta-llama/Llama-3-8b-Instruct | 32.1 ± 33.0% | 2 | 7 | 4 | 1 |
| 87 | Mistralai/Mistral-Nemo | 32.1 ± 33.0% | 2 | 7 | 3 | 2 |
| 87 | Mistralai/Mixtral-8x22b-Instruct | 32.1 ± 33.0% | 2 | 7 | 2 | 3 |
| 87 | Nousresearch/Hermes-4-70b | 32.1 ± 33.0% | 2 | 7 | 4 | 1 |
| 87 | Openai/Gpt-3.5-Turbo-16k | 32.1 ± 33.0% | 2 | 7 | 5 | 0 |
| 87 | Qwen/Qwen-2.5-Vl-7b-Instruct | 32.1 ± 33.0% | 2 | 7 | 3 | 2 |
| 88 | Meta-llama/Llama-3.1-405b | 29.3 ± 54.9% | 0 | 1 | 1 | 0 |
| 88 | Openai/Gpt-4-1106-Preview | 29.3 ± 54.9% | 0 | 1 | 1 | 0 |
| 88 | Openai/Gpt-4-Turbo-Preview | 29.3 ± 54.9% | 0 | 1 | 1 | 0 |
| 88 | Openai/Gpt-4o:extended | 29.3 ± 54.9% | 0 | 1 | 1 | 0 |
| 89 | Arcee-ai/Afm-4.5b | 26.4 ± 37.7% | 1 | 5 | 2 | 2 |
| 89 | Meta-llama/Llama-3.1-8b-Instruct | 26.4 ± 37.7% | 1 | 5 | 3 | 1 |
| 90 | Ai21/Jamba-Mini-1.7 | 22.8 ± 35.0% | 1 | 6 | 3 | 2 |
| 90 | Aion-labs/Aion-Rp-Llama-3.1-8b | 22.8 ± 35.0% | 1 | 6 | 3 | 2 |
| 90 | Bytedance/Ui-Tars-1.5-7b | 22.8 ± 35.0% | 1 | 6 | 4 | 1 |
| 90 | Inflection/Inflection-3-Pi | 22.8 ± 35.0% | 1 | 6 | 3 | 2 |
| 90 | Microsoft/Phi-3-Medium-128k-Instruct | 22.8 ± 35.0% | 1 | 6 | 2 | 3 |
| 90 | Neversleep/Noromaid-20b | 22.8 ± 35.0% | 1 | 6 | 4 | 1 |
| 90 | Openai/Gpt-3.5-Turbo-0613 | 22.8 ± 35.0% | 1 | 6 | 5 | 0 |
| 90 | Openai/Gpt-3.5-Turbo-Instruct | 22.8 ± 35.0% | 1 | 6 | 5 | 0 |
| 90 | Qwen/Qwen2.5-Coder-7b-Instruct | 22.8 ± 35.0% | 1 | 6 | 3 | 2 |
| 90 | Sao10k/L3-Lunaris-8b | 22.8 ± 35.0% | 1 | 6 | 5 | 0 |
| 90 | Thedrummer/Rocinante-12b | 22.8 ± 35.0% | 1 | 6 | 4 | 1 |
| 90 | Thedrummer/Unslopnemo-12b | 22.8 ± 35.0% | 1 | 6 | 4 | 1 |
| 91 | Cohere/Command-R7b-12-2024 | 12.9 ± 39.2% | 0 | 4 | 2 | 2 |
| 91 | Eleutherai/Llemma_7b | 12.9 ± 39.2% | 0 | 4 | 2 | 2 |
| 91 | Gryphe/Mythomax-L2-13b | 12.9 ± 39.2% | 0 | 4 | 3 | 1 |
| 91 | Mancer/Weaver | 12.9 ± 39.2% | 0 | 4 | 2 | 2 |
| 91 | Meta-llama/Llama-3.2-11b-Vision-Instruct | 12.9 ± 39.2% | 0 | 4 | 1 | 3 |
| 91 | Meta-llama/Llama-3.2-3b-Instruct:free | 12.9 ± 39.2% | 0 | 4 | 3 | 1 |
| 91 | Microsoft/Phi-3-Mini-128k-Instruct | 12.9 ± 39.2% | 0 | 4 | 2 | 2 |
| 91 | Microsoft/Phi-3.5-Mini-128k-Instruct | 12.9 ± 39.2% | 0 | 4 | 1 | 3 |
| 91 | Mistralai/Mistral-7b-Instruct-V0.1 | 12.9 ± 39.2% | 0 | 4 | 2 | 2 |
| 91 | Mistralai/Mixtral-8x7b-Instruct | 12.9 ± 39.2% | 0 | 4 | 3 | 1 |
| 91 | Neversleep/Llama-3.1-Lumimaid-8b | 12.9 ± 39.2% | 0 | 4 | 4 | 0 |
| 91 | Nousresearch/Hermes-2-Pro-Llama-3-8b | 12.9 ± 39.2% | 0 | 4 | 3 | 1 |
| 91 | Undi95/Remm-Slerp-L2-13b | 12.9 ± 39.2% | 0 | 4 | 3 | 1 |
We curated a set of 273 expressions. Approximately a third were selected from a popular calculus textbook, on the assumption that they would have some pedagogic value; the rest were randomly generated with varying degrees of complexity. We assume that the randomly generated expressions are new to the LLMs and never formed part of their training data.
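For a sense of what random generation can look like, here is a minimal SymPy sketch; the operator set, the 0.4 unary probability, and the depth parameter are illustrative assumptions, not the benchmark's actual generator:

```python
import random

import sympy as sp

x = sp.Symbol("x")

# Building blocks; this operator set is an assumption for illustration.
UNARY = [sp.sin, sp.cos, sp.exp, sp.log, sp.sqrt]
BINARY = [
    lambda a, b: a + b,
    lambda a, b: a * b,
    lambda a, b: a / b,
    lambda a, b: a ** b,
]

def random_expr(depth):
    """Recursively build a random expression in x."""
    if depth == 0:
        return random.choice([x, sp.Integer(random.randint(1, 5))])
    if random.random() < 0.4:
        return random.choice(UNARY)(random_expr(depth - 1))
    op = random.choice(BINARY)
    return op(random_expr(depth - 1), random_expr(depth - 1))

expr = random_expr(3)
print(sp.latex(expr))              # the expression posed to the model
print(sp.latex(sp.diff(expr, x)))  # the reference derivative
```

Increasing `depth` is one straightforward way to vary the complexity of the generated expressions.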
Each model was asked to differentiate an expression with respect to a given variable and to return both its reasoning and its final answer.
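A minimal version of this query step, written against OpenRouter's OpenAI-compatible chat completions endpoint, might look like the sketch below; the prompt wording and the `ANSWER:` convention are our assumptions here, not the benchmark's exact prompt:

```python
import os

import requests

def ask_for_derivative(model: str, latex_expr: str, var: str = "x") -> str:
    """Ask one OpenRouter model to differentiate a LaTeX expression."""
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            "model": model,  # e.g. "openai/gpt-5"
            "messages": [{
                "role": "user",
                # Illustrative prompt, not the benchmark's exact wording.
                "content": (
                    f"Differentiate ${latex_expr}$ with respect to {var}. "
                    "Show your reasoning, then give the final answer as "
                    "LaTeX on its own line, prefixed with ANSWER:"
                ),
            }],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```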
We then parsed each answer from LaTeX to Python and numerically evaluated the difference between the supplied answer and the true derivative at ten sample points. If the two agreed to within 1e-9 at every sample point, the result was marked as correct.
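The checking step can be sketched as follows; we use SymPy's `parse_latex` for illustration, and the sampling range and resampling strategy are assumptions rather than the benchmark's actual code:

```python
import random

import sympy as sp
from sympy.parsing.latex import parse_latex  # needs the antlr4 runtime installed

def is_correct(model_latex, true_derivative, var, tol=1e-9, samples=10):
    """Numerically compare the model's answer against the reference
    derivative at random sample points; all samples must agree within tol."""
    candidate = parse_latex(model_latex)
    f = sp.lambdify(var, candidate, "math")
    g = sp.lambdify(var, true_derivative, "math")
    checked = attempts = 0
    while checked < samples:
        attempts += 1
        if attempts > 100 * samples:
            return False  # could not find enough valid sample points
        p = random.uniform(0.1, 3.0)  # sampling range is an assumption
        try:
            diff = abs(f(p) - g(p))
        except (ValueError, ZeroDivisionError, OverflowError):
            continue  # point outside the common domain; resample
        if diff > tol:
            return False
        checked += 1
    return True
```

Numerical spot-checking sidesteps the hard problem of deciding symbolic equivalence: two correct answers can look very different symbolically (e.g. factored versus expanded) but must agree numerically wherever both are defined.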