STARE: SLAIF Technical Answer Reasoning and Evaluation
A benchmark for measuring how well AI can substitute for a human grader — and reproduce grades in the lecturer's own style.
STARE evaluates multimodal models on handwritten technical exam grading. The benchmark is built around scanned STEM exam responses with unconstrained handwriting, formulas, and hand-drawn diagrams, and it follows the grading setup of the reference workflow: lecturer-provided reference solution, short grading rules, structured grading prompts, and comparison to lecturer-assigned exam grades.
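The grading setup above can be sketched as prompt assembly. This is a hypothetical illustration, not the paper's actual prompt: the function name, field layout, and wording are all assumptions.

```python
# Hypothetical sketch of a structured grading prompt in the spirit of the
# reference workflow: lecturer-provided reference solution plus short
# grading rules, asking the model for a points-in-range decision.
def build_grading_prompt(reference_solution, grading_rules, max_points):
    """Assemble one grading prompt for a single exam question (illustrative)."""
    return (
        "You are grading a handwritten STEM exam response.\n"
        f"Reference solution:\n{reference_solution}\n"
        f"Grading rules:\n{grading_rules}\n"
        f"Award between 0 and {max_points} points, "
        "and give a one-line justification."
    )

prompt = build_grading_prompt(
    "V = I * R, so I = 12 V / 4 ohm = 3 A.",
    "Full credit for correct formula and result; half credit for setup only.",
    10,
)
```

The scanned response image would be attached alongside this text in the multimodal request; that part is backend-specific and omitted here.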
Perš, Janez; Muhovič, Jon; Košir, Andrej; Murovec, Boštjan.
Grading Handwritten Engineering Exams with Multimodal Large Language Models.
In: Proceedings of the 29th Computer Vision Winter Workshop (CVWW 2026).
Jindřichův Hradec, Czech Republic, February 9–12, 2026.
This page uses the three exam-level measures emphasized by the source paper: Mean Absolute Difference, standard deviation of absolute differences, and grading bias. The closed-source reference rows below reproduce the paper’s backend screening results; GPT-5.2 and Gemini are reported numerically in the paper, while the remaining rows are visually read from Figure 3.
STARE adopts these three exam-level measures from the paper's quantitative evaluation. Together they capture how faithfully a model reproduces the lecturer's grading outcomes.
Mean Absolute Difference (MAD): The average absolute gap between the model's exam grade and the lecturer's exam grade. Lower values mean the model is closer to the human grader overall.
Standard deviation of absolute differences (STD(|Δ|)): The standard deviation of absolute grading differences across students. Lower values indicate more stable grading quality from one exam script to the next.
Grading bias: The signed tendency to over-grade or under-grade relative to the lecturer. Values close to zero mean the model is not systematically lenient or harsh.
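The three measures above can be computed in a few lines. This is a minimal sketch assuming two equal-length lists of per-student exam grades; the grade values below are hypothetical, and the population standard deviation is used where the paper may use the sample version.

```python
# Exam-level measures for one grading backend: MAD, STD(|delta|), bias.
from statistics import mean, pstdev

def exam_level_measures(model_grades, lecturer_grades):
    """Return (MAD, STD of |delta|, bias) over matched exam scripts."""
    deltas = [m - l for m, l in zip(model_grades, lecturer_grades)]
    abs_deltas = [abs(d) for d in deltas]
    mad = mean(abs_deltas)    # average absolute gap to the lecturer
    std = pstdev(abs_deltas)  # stability of grading quality across scripts
    bias = mean(deltas)       # signed lenient/harsh tendency
    return mad, std, bias

# Hypothetical grades for five exam scripts (0-100 scale assumed):
model_grades = [72, 88, 65, 90, 54]
lecturer_grades = [70, 85, 70, 92, 50]
mad, std, bias = exam_level_measures(model_grades, lecturer_grades)
```

Note that bias averages the signed differences, so over- and under-grading can cancel out even when MAD is large — which is exactly why all three measures are reported together.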
The horizontal grouped bars below are consistent with the paper's Figure 3 backend screening results. Each row is one model; the three bars correspond to MAD, STD(|Δ|), and bias. Lower MAD and STD are better, while bias should stay as close to zero as possible. Because the three metrics have different native scales, results are normalized to the worst-performing model on each metric.
Source-paper screening backends
| Model | MAD ↓ | STD(|Δ|) ↓ | Bias → 0 |
|---|---|---|---|
| gemini-3-pro-preview | 7.9 | 8.6 | +0.3 |
| gpt-4o | 18.5 | 15.4 | +1.2 |
| gpt-5 | 12.1 | 10.7 | +1.4 |
| gpt-5.2 | 7.8 | 5.9 | +0.2 |
| gpt-5.2-pro | 6.9 | 6.0 | -0.2 |
| mistral-large-2512 | 24.9 | 16.9 | +18.2 |
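The normalization behind the bars can be sketched as follows: each metric is divided by the worst (largest-magnitude) value across models, so every bar falls in [0, 1] regardless of the metric's native scale. The values are taken from the screening table above; the per-metric division itself is an assumption about how the chart is built.

```python
# Normalize each metric column to the worst-performing model.
# Bias is compared by magnitude, since it can be negative.
rows = {
    "gemini-3-pro-preview": (7.9, 8.6, 0.3),
    "gpt-4o": (18.5, 15.4, 1.2),
    "gpt-5": (12.1, 10.7, 1.4),
    "gpt-5.2": (7.8, 5.9, 0.2),
    "gpt-5.2-pro": (6.9, 6.0, -0.2),
    "mistral-large-2512": (24.9, 16.9, 18.2),
}

# Worst value per column: index 0 = MAD, 1 = STD(|delta|), 2 = |bias|.
worst = [max(abs(v[i]) for v in rows.values()) for i in range(3)]

normalized = {
    name: tuple(abs(v[i]) / worst[i] for i in range(3))
    for name, v in rows.items()
}
```

With this scheme the worst model scores 1.0 on each metric it is worst at, and every other bar length reads directly as a fraction of that worst case.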
STARE is designed for exam scripts with unconstrained handwriting, formulas, and sketches rather than neat text-only responses.
The target is not an abstract universal score. The target is the lecturer’s judgement, including how that lecturer awards partial credit.
Beyond the three main measures, STARE also tracks manual review burden as an auxiliary indicator of how much human consolidation is still required.
For benchmark details, submissions, or collaboration inquiries, contact the STARE benchmark team.