
STARE

SLAIF Technical Answer Reasoning and Evaluation

A benchmark for measuring how well an AI model can stand in for a human grader and reproduce grades in the lecturer's own style.

STARE evaluates multimodal models on handwritten technical exam grading. The benchmark is built around scanned STEM exam responses with unconstrained handwriting, formulas, and hand-drawn diagrams, and it follows the grading setup of the reference workflow: lecturer-provided reference solution, short grading rules, structured grading prompts, and comparison to lecturer-assigned exam grades.
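The grading setup above can be sketched as a small prompt-assembly step. This is only an illustration of the described inputs (reference solution, short grading rules, structured prompt); the function name, field wording, and example values are hypothetical, not the paper's actual prompt.

```python
# Hypothetical sketch of assembling a structured grading prompt for one
# exam task; the wording and example content are illustrative only.
def build_grading_prompt(reference_solution: str, grading_rules: str,
                         max_points: int) -> str:
    """Combine lecturer-provided inputs into one grading prompt.

    The handwritten answer itself would be attached as an image in the
    multimodal API call; only the textual scaffolding is shown here.
    """
    return (
        "You are grading a handwritten exam answer (attached as an image).\n"
        f"Reference solution:\n{reference_solution}\n\n"
        f"Grading rules:\n{grading_rules}\n\n"
        f"Award between 0 and {max_points} points, then give a one-line "
        "justification."
    )

prompt = build_grading_prompt(
    reference_solution="I = U / R = 12 V / 4 Ω = 3 A",
    grading_rules=("Full credit for correct formula and result; "
                   "half credit for correct formula with arithmetic error."),
    max_points=10,
)
```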

Core question. How close can an AI grader come to the human lecturer not only in score, but in grading behaviour: strictness, partial-credit judgement, and consistency on real handwritten answers?
Which models should do better? Multimodal models that can read fine human handwriting (the text is handwritten and often tiny); multilingual models (the benchmark is in Slovenian, a language spoken by about 2 million people); and models with strong reasoning, since the STEM tasks require reasoning about sketches, graphs, and diagrams.
Please cite

Perš, Janez; Muhovič, Jon; Košir, Andrej; Murovec, Boštjan.
Grading Handwritten Engineering Exams with Multimodal Large Language Models.
In: Proceedings of the 29th Computer Vision Winter Workshop (CVWW 2026).
Jindřichův Hradec, Czech Republic, February 9–12, 2026.

This page uses the three exam-level measures emphasized by the source paper: Mean Absolute Difference, standard deviation of absolute differences, and grading bias. The reference rows below reproduce the paper's backend screening results: GPT-5.2 and Gemini are reported numerically in the paper, while the remaining rows are read visually from Figure 3.

What STARE Evaluates

STARE uses the three main exam-level measures from the paper’s quantitative evaluation: Mean Absolute Difference, standard deviation of absolute differences, and grading bias. Together they measure how faithfully a model reproduces the lecturer’s grading outcomes.

Dimension 1

Mean Absolute Difference (MAD)

The average absolute gap between the model’s exam grade and the lecturer’s exam grade. Lower values mean the model is closer to the human grader overall.

Dimension 2

Error Spread

The standard deviation of absolute grading differences across students. Lower values indicate more stable grading quality from one exam script to the next.

Dimension 3

Grading Bias

The signed tendency to over-grade or under-grade relative to the lecturer. Values close to zero mean the model is not systematically lenient or harsh.
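The three measures above can be computed directly from paired per-student grades. The sketch below uses made-up grade values purely to show the arithmetic; it is not data from the benchmark.

```python
# Minimal sketch of the three exam-level measures, assuming model and
# lecturer exam grades are paired per student. Values are made up.
from statistics import mean, pstdev

model_grades    = [62, 71, 55, 80, 90]
lecturer_grades = [60, 75, 50, 82, 88]

# Signed per-student differences (model minus lecturer).
diffs = [m - l for m, l in zip(model_grades, lecturer_grades)]
abs_diffs = [abs(d) for d in diffs]

mad  = mean(abs_diffs)   # Mean Absolute Difference: lower is better
std  = pstdev(abs_diffs) # spread of |Δ| across students: lower is more stable
bias = mean(diffs)       # signed tendency: + over-grades, − under-grades
```

With these example grades, `mad` is 3.0 and `bias` is +0.6, i.e. this hypothetical model over-grades slightly on average.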

Reference Results

The results below are consistent with the paper's Figure 3 backend screening. Each row is one model with three measures: MAD, STD(|Δ|), and bias. Lower MAD and STD are better, while bias should stay as close to zero as possible. In the paper's figure, results are normalized to the worst-performing model because the three metrics are on different scales.

Source-paper screening backends. Lower values indicate better lecturer alignment for MAD and STD(|Δ|); for bias, the sign shows whether the model tends to over-grade (+) or under-grade (−).

Model                 Access         Provider (API)            MAD ↓   STD(|Δ|) ↓   Bias → 0
gemini-3-pro-preview  Closed-source  Google (OpenRouter)        7.9      8.6         +0.3
gpt-4o                Closed-source  OpenAI                    18.5     15.4         +1.2
gpt-5                 Closed-source  OpenAI                    12.1     10.7         +1.4
gpt-5.2               Closed-source  OpenAI                     7.8      5.9         +0.2
gpt-5.2-pro           Closed-source  OpenAI                     6.9      6.0         −0.2  (single-run value)
mistral-large-2512    Open-weight    Mistral AI (OpenRouter)   24.9     16.9        +18.2
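The normalization mentioned above can be sketched as dividing each metric by the worst (largest-magnitude) value across models, so every normalized value falls in [0, 1]. The exact normalization used in the paper's figure is not specified here, so this is an assumption; the dictionary below uses a subset of the table's values only for illustration.

```python
# Assumed normalization: divide each metric by the worst (largest absolute)
# value across models, so the worst model scores 1.0 on that metric.
results = {
    "gemini-3-pro-preview": {"mad": 7.9,  "std": 8.6,  "bias": 0.3},
    "gpt-4o":               {"mad": 18.5, "std": 15.4, "bias": 1.2},
    "mistral-large-2512":   {"mad": 24.9, "std": 16.9, "bias": 18.2},
}

def normalize(results: dict) -> dict:
    """Scale each metric to [0, 1] relative to the worst model."""
    worst = {k: max(abs(r[k]) for r in results.values())
             for k in ("mad", "std", "bias")}
    return {model: {k: abs(v) / worst[k] for k, v in r.items()}
            for model, r in results.items()}

norm = normalize(results)
# mistral-large-2512 is worst on every metric here, so it maps to 1.0 on all three.
```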

Why This Benchmark Matters

Real handwritten answers

STARE is designed for exam scripts with unconstrained handwriting, formulas, and sketches rather than neat text-only responses.

Lecturer-style grading

The target is not an abstract universal score. The target is the lecturer’s judgement, including how that lecturer awards partial credit.

Operational usefulness

Beyond the three main measures, STARE also tracks manual review burden as an auxiliary indicator of how much human consolidation is still required.

Benchmark Contact

For benchmark details, submissions, or collaboration inquiries, contact the STARE benchmark team.