
STARE

SLAIF Technical Answer Reasoning and Evaluation

A benchmark for measuring how well an AI model can stand in for a human grader and reproduce grades in the lecturer's own style.

STARE evaluates multimodal models on handwritten technical exam grading. The benchmark is built around scanned STEM exam responses with unconstrained handwriting, formulas, and hand-drawn diagrams, and it follows the grading setup of the reference workflow: lecturer-provided reference solution, short grading rules, structured grading prompts, and comparison to lecturer-assigned exam grades.
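The grading setup above can be sketched as a small prompt-assembly step. This is only an illustration of the described inputs (reference solution, short grading rules, structured prompt); the function name, field wording, and example values are hypothetical, not the paper's actual prompt.

```python
# Hypothetical sketch of assembling a structured grading prompt for one
# exam task; the wording and example content are illustrative only.
def build_grading_prompt(reference_solution: str, grading_rules: str,
                         max_points: int) -> str:
    """Combine lecturer-provided inputs into one grading prompt.

    The handwritten answer itself would be attached as an image in the
    multimodal API call; only the textual scaffolding is shown here.
    """
    return (
        "You are grading a handwritten exam answer (attached as an image).\n"
        f"Reference solution:\n{reference_solution}\n\n"
        f"Grading rules:\n{grading_rules}\n\n"
        f"Award between 0 and {max_points} points, then give a one-line "
        "justification."
    )

prompt = build_grading_prompt(
    reference_solution="I = U / R = 12 V / 4 Ω = 3 A",
    grading_rules=("Full credit for correct formula and result; "
                   "half credit for correct formula with arithmetic error."),
    max_points=10,
)
```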

Core question. How close can an AI grader come to the human lecturer not only in score, but in grading behaviour: strictness, partial-credit judgement, and consistency on real handwritten answers?
Which models should do better? Multimodal models that can read fine human handwriting (the text is handwritten and often tiny); multilingual models (the benchmark is in Slovenian, a language spoken by about 2 million people); and models with strong reasoning, since the STEM tasks require reasoning about sketches, graphs, and diagrams.
Please cite

Perš, Janez; Muhovič, Jon; Košir, Andrej; Murovec, Boštjan.
Grading Handwritten Engineering Exams with Multimodal Large Language Models.
In: Proceedings of the 29th Computer Vision Winter Workshop (CVWW 2026).
Jindřichův Hradec, Czech Republic, February 9–12, 2026.

This page uses the three exam-level measures emphasized by the source paper: Mean Absolute Difference, standard deviation of absolute differences, and grading bias. The reference rows below reproduce the paper's backend screening results: GPT-5.2 and Gemini are reported numerically in the paper, while the remaining rows are read visually from Figure 3.

What STARE Evaluates

STARE uses the three main exam-level measures from the paper’s quantitative evaluation: Mean Absolute Difference, standard deviation of absolute differences, and grading bias. Together they measure how faithfully a model reproduces the lecturer’s grading outcomes.

Dimension 1

Mean Absolute Difference (MAD)

The average absolute gap between the model’s exam grade and the lecturer’s exam grade. Lower values mean the model is closer to the human grader overall.

Dimension 2

Error Spread

The standard deviation of absolute grading differences across students. Lower values indicate more stable grading quality from one exam script to the next.

Dimension 3

Grading Bias

The signed tendency to over-grade or under-grade relative to the lecturer. Values close to zero mean the model is not systematically lenient or harsh.
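The three measures above can be computed directly from paired per-student grades. The sketch below uses made-up grade values purely to show the arithmetic; it is not data from the benchmark.

```python
# Minimal sketch of the three exam-level measures, assuming model and
# lecturer exam grades are paired per student. Values are made up.
from statistics import mean, pstdev

model_grades    = [62, 71, 55, 80, 90]
lecturer_grades = [60, 75, 50, 82, 88]

# Signed per-student differences (model minus lecturer).
diffs = [m - l for m, l in zip(model_grades, lecturer_grades)]
abs_diffs = [abs(d) for d in diffs]

mad  = mean(abs_diffs)   # Mean Absolute Difference: lower is better
std  = pstdev(abs_diffs) # spread of |Δ| across students: lower is more stable
bias = mean(diffs)       # signed tendency: + over-grades, − under-grades
```

With these example grades, `mad` is 3.0 and `bias` is +0.6, i.e. this hypothetical model over-grades slightly on average.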

Reference Results

The results below are consistent with the paper's Figure 3 backend screening. Each row is one model with three measures: MAD, STD(|Δ|), and bias. Lower MAD and STD are better, while bias should stay as close to zero as possible. In the paper's figure, results are normalized to the worst-performing model because the three metrics are on different scales.

Source-paper screening backends. Lower values indicate better lecturer alignment for MAD and STD(|Δ|); for bias, the sign shows whether the model tends to over-grade (+) or under-grade (−).

Model                 Access         Provider (API)            MAD ↓   STD(|Δ|) ↓   Bias → 0
gemini-3-pro-preview  Closed-source  Google (OpenRouter)        7.9      8.6         +0.3
gpt-4o                Closed-source  OpenAI                    18.5     15.4         +1.2
gpt-5                 Closed-source  OpenAI                    12.1     10.7         +1.4
gpt-5.2               Closed-source  OpenAI                     7.8      5.9         +0.2
gpt-5.2-pro           Closed-source  OpenAI                     6.9      6.0         −0.2  (single-run value)
mistral-large-2512    Open-weight    Mistral AI (OpenRouter)   24.9     16.9        +18.2
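The normalization mentioned above can be sketched as dividing each metric by the worst (largest-magnitude) value across models, so every normalized value falls in [0, 1]. The exact normalization used in the paper's figure is not specified here, so this is an assumption; the dictionary below uses a subset of the table's values only for illustration.

```python
# Assumed normalization: divide each metric by the worst (largest absolute)
# value across models, so the worst model scores 1.0 on that metric.
results = {
    "gemini-3-pro-preview": {"mad": 7.9,  "std": 8.6,  "bias": 0.3},
    "gpt-4o":               {"mad": 18.5, "std": 15.4, "bias": 1.2},
    "mistral-large-2512":   {"mad": 24.9, "std": 16.9, "bias": 18.2},
}

def normalize(results: dict) -> dict:
    """Scale each metric to [0, 1] relative to the worst model."""
    worst = {k: max(abs(r[k]) for r in results.values())
             for k in ("mad", "std", "bias")}
    return {model: {k: abs(v) / worst[k] for k, v in r.items()}
            for model, r in results.items()}

norm = normalize(results)
# mistral-large-2512 is worst on every metric here, so it maps to 1.0 on all three.
```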

Why This Benchmark Matters

Real handwritten answers

STARE is designed for exam scripts with unconstrained handwriting, formulas, and sketches rather than neat text-only responses.

Lecturer-style grading

The target is not an abstract universal score. The target is the lecturer’s judgement, including how that lecturer awards partial credit.

Operational usefulness

Beyond the three main measures, STARE also tracks manual review burden as an auxiliary indicator of how much human consolidation is still required.

Benchmark Contact

For benchmark details, submissions, or collaboration inquiries, contact the STARE benchmark team.