SLAIF Technical Answer Reasoning and Evaluation
A benchmark for measuring how well AI can substitute a human grader — and reproduce grades in the lecturer’s own style.
STARE evaluates multimodal models on handwritten technical exam grading. The benchmark is built around scanned STEM exam responses with unconstrained handwriting, formulas, and hand-drawn diagrams, and it follows the grading setup of the reference workflow: lecturer-provided reference solution, short grading rules, structured grading prompts, and comparison to lecturer-assigned exam grades.
The closed-source reference rows below reproduce the paper’s backend screening results: the gpt-5.2 and gemini-3-pro-preview rows are reported numerically in the paper, while the remaining rows are read visually from Figure 3.
STARE uses the three main exam-level measures from the paper’s quantitative evaluation: Mean Absolute Difference, standard deviation of absolute differences, and grading bias. Together they measure how faithfully a model reproduces the lecturer’s grading outcomes.
- Mean Absolute Difference (MAD): the average absolute gap between the model’s exam grade and the lecturer’s exam grade. Lower values mean the model is closer to the human grader overall.
- Standard deviation of absolute differences (STD(|Δ|)): the standard deviation of absolute grading differences across students. Lower values indicate more stable grading quality from one exam script to the next.
- Grading bias: the signed tendency to over-grade or under-grade relative to the lecturer. Values close to zero mean the model is not systematically lenient or harsh.
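The three measures can be sketched in a few lines of Python. This is an illustrative computation under the plain definitions above, not the paper’s evaluation code; the function name and the sample grades are invented for the example.

```python
# Illustrative sketch of the three STARE exam-level measures.
from statistics import mean, pstdev

def stare_measures(model_grades, lecturer_grades):
    """Return (MAD, STD(|diff|), bias) for paired per-student exam grades."""
    diffs = [m - h for m, h in zip(model_grades, lecturer_grades)]
    abs_diffs = [abs(d) for d in diffs]
    mad = mean(abs_diffs)    # average absolute gap; lower is better
    std = pstdev(abs_diffs)  # stability of grading quality across scripts
    bias = mean(diffs)       # signed tendency: + lenient, - harsh
    return mad, std, bias

# Hypothetical example: the model over-grades two scripts, under-grades one.
mad, std, bias = stare_measures([82, 75, 60], [80, 70, 62])
# mad == 3.0, bias == +5/3 (mild leniency)
```

Note that bias keeps the sign of each difference, so over- and under-grading can cancel out; that is why a near-zero bias is only meaningful alongside a low MAD.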
Results are split into two groups so the benchmark can compare the capability frontier with the reproducible frontier.
Commercial or API-based multimodal systems evaluated under the fixed STARE protocol. These models show what is currently possible at the top end of capability, even when the underlying systems are not fully inspectable or reproducible.
Publicly available models and checkpoints that can be rerun, audited, and improved by the community. This track highlights what can be reproduced on open infrastructure and where open models still lag or compete.
The horizontal grouped bars below use the paper’s Figure 3 backend screening results. Each row is one model; the three bars correspond to MAD, STD(|Δ|), and bias. Lower MAD and STD are better, while bias should stay as close to zero as possible.
Source-paper screening backends
| Model | MAD ↓ | STD(|Δ|) ↓ | Bias → 0 | How this row was obtained |
|---|---|---|---|---|
| gemini-3-pro-preview | 7.9 | 8.6 | +0.3 | exact — Full-pipeline mean reported in Table 2 / shown in Figure 3 |
| gpt-4o | 18.5 | 15.4 | +1.2 | approx. — Visually read from Figure 3 |
| gpt-5 | 12.1 | 10.7 | +1.4 | approx. — Visually read from Figure 3 |
| gpt-5.2 | 7.8 | 5.9 | +0.2 | exact — Full-pipeline mean reported in Table 2 / shown in Figure 3 |
| gpt-5.2-pro | 6.9 | 6.0 | -0.2 | approx. — Single-run value visually read from Figure 3 |
| mistral-large-2512 | 24.9 | 16.9 | +18.2 | approx. — Visually read from Figure 3 |
“Exact” rows correspond to full-pipeline means explicitly reported numerically in the paper. “Approx.” rows are visually read from Figure 3 because the paper does not publish a numeric backend table for every screened model.
Reserved for the public STARE benchmark track
STARE keeps the open-source track visible from the start, but the paper’s published screening results cover only closed-source backends. This section is ready to receive STARE evaluations of open models as open multimodal grading systems are added to the benchmark.
STARE is designed for exam scripts with unconstrained handwriting, formulas, and sketches rather than neat text-only responses.
The target is not an abstract universal score. The target is the lecturer’s judgement, including how that lecturer awards partial credit.
Beyond the three main measures, STARE also tracks manual review burden as an auxiliary indicator of how much human consolidation is still required.
For benchmark details, submissions, or collaboration inquiries, contact the STARE benchmark team via SLAIF.