7 Microsoft models evaluated
| Rank | Model | Accuracy | Correct | Total | Incorrect | Errors |
|---|---|---|---|---|---|---|
| 1 | Microsoft/Phi-4-Reasoning-Plus | 90.4 ± 5.7% | 53 | 58 | 4 | 1 |
| 2 | Microsoft/Phi-4-Multimodal-Instruct | 73.9 ± 13.4% | 21 | 28 | 7 | 0 |
| 3 | Microsoft/Phi-4 | 65.5 ± 18.2% | 12 | 18 | 5 | 1 |
| 4 | Microsoft/Wizardlm-2-8x22b | 46.0 ± 26.4% | 5 | 11 | 2 | 4 |
| 5 | Microsoft/Phi-3-Medium-128k-Instruct | 22.8 ± 35.0% | 1 | 6 | 2 | 3 |
| 6 | Microsoft/Phi-3-Mini-128k-Instruct | 12.9 ± 39.2% | 0 | 4 | 2 | 2 |
| 6 | Microsoft/Phi-3.5-Mini-128k-Instruct | 12.9 ± 39.2% | 0 | 4 | 1 | 3 |
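
As a quick sanity check on the table, the sketch below recomputes the unadjusted ratio Correct / Total for each row and confirms that Correct + Incorrect + Errors equals Total. The names `ROWS`, `raw_accuracy`, and `counts_consistent` are illustrative and not part of any evaluation tooling. The reported Accuracy values are not the raw ratio (53/58 is about 91.4%, while the table lists 90.4 ± 5.7%), so the leaderboard presumably applies an adjusted estimator that is not described here; the code makes no attempt to reproduce it.

```python
# Counts copied from the table: (model, correct, total, incorrect, errors).
ROWS = [
    ("Microsoft/Phi-4-Reasoning-Plus", 53, 58, 4, 1),
    ("Microsoft/Phi-4-Multimodal-Instruct", 21, 28, 7, 0),
    ("Microsoft/Phi-4", 12, 18, 5, 1),
    ("Microsoft/Wizardlm-2-8x22b", 5, 11, 2, 4),
    ("Microsoft/Phi-3-Medium-128k-Instruct", 1, 6, 2, 3),
    ("Microsoft/Phi-3-Mini-128k-Instruct", 0, 4, 2, 2),
    ("Microsoft/Phi-3.5-Mini-128k-Instruct", 0, 4, 1, 3),
]


def raw_accuracy(correct: int, total: int) -> float:
    """Unadjusted share of correct answers, as a percentage."""
    return 100.0 * correct / total


def counts_consistent(correct: int, total: int, incorrect: int, errors: int) -> bool:
    """True when the three outcome counts add up to the row total."""
    return correct + incorrect + errors == total


if __name__ == "__main__":
    for model, correct, total, incorrect, errors in ROWS:
        ok = counts_consistent(correct, total, incorrect, errors)
        print(
            f"{model}: raw {raw_accuracy(correct, total):.1f}% "
            f"({correct}/{total}), counts consistent: {ok}"
        )
```

With only 4 to 11 questions for the lower-ranked models, the wide ± ranges in the table matter more than the rank order among those rows.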