Benchmark

STARE

SLAIF Technical Answer Reasoning and Evaluation

A benchmark for measuring how well an AI grader can stand in for a human grader and reproduce grades in the lecturer's own style.

STARE evaluates multimodal models on handwritten technical exam grading. The benchmark is built around scanned STEM exam responses with unconstrained handwriting, formulas, and hand-drawn diagrams, and it follows the grading setup of the reference workflow: lecturer-provided reference solution, short grading rules, structured grading prompts, and comparison to lecturer-assigned exam grades.
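As a rough sketch of that setup, the example below assembles one grading request from a scanned answer, the lecturer's reference solution, and the short grading rules. The field names, prompt wording, and the GradingTask structure are illustrative assumptions, not the exact SLAIF pipeline.

    from dataclasses import dataclass

    @dataclass
    class GradingTask:
        """One handwritten answer to be graded against the lecturer's material."""
        question_id: str
        max_points: float
        reference_solution: str   # lecturer-provided model answer
        grading_rules: str        # short rubric, e.g. partial-credit rules
        answer_image_path: str    # scanned handwritten response

    def build_grading_prompt(task: GradingTask) -> str:
        """Compose a structured grading prompt (wording is illustrative only)."""
        return (
            f"You are grading question {task.question_id} "
            f"(maximum {task.max_points} points).\n\n"
            f"Reference solution:\n{task.reference_solution}\n\n"
            f"Grading rules:\n{task.grading_rules}\n\n"
            "The student's handwritten answer is attached as an image. "
            "Return the awarded points and a one-sentence justification."
        )

    # Example with made-up content; a multimodal request would attach the
    # scan alongside this text.
    task = GradingTask(
        question_id="3b",
        max_points=10,
        reference_solution="Apply Kirchhoff's voltage law around the loop ...",
        grading_rules="4 points for the loop equation, 4 for the algebra, "
                      "2 for the final value with units.",
        answer_image_path="scans/student_017_q3b.png",
    )
    print(build_grading_prompt(task))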

Core question. How close can an AI grader come to the human lecturer not only in score, but in grading behaviour: strictness, partial-credit judgement, and consistency on real handwritten answers?

This page uses the three exam-level measures emphasized by the source paper: Mean Absolute Difference, standard deviation of absolute differences, and grading bias. The closed-source reference rows below reproduce the paper's backend screening results; the gpt-5.2 and gemini-3-pro-preview values are reported numerically in the paper, while the remaining rows are visually read from Figure 3.

What STARE Evaluates

STARE uses the three main exam-level measures from the paper’s quantitative evaluation: Mean Absolute Difference, standard deviation of absolute differences, and grading bias. Together they measure how faithfully a model reproduces the lecturer’s grading outcomes.

Dimension 1

Mean Absolute Difference (MAD)

The average absolute gap between the model’s exam grade and the lecturer’s exam grade. Lower values mean the model is closer to the human grader overall.

Dimension 2

Error Spread

The standard deviation of absolute grading differences across students. Lower values indicate more stable grading quality from one exam script to the next.

Dimension 3

Grading Bias

The signed tendency to over-grade or under-grade relative to the lecturer. Values close to zero mean the model is not systematically lenient or harsh.
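All three measures can be computed from two parallel lists of exam totals, one graded by the model and one by the lecturer. The sketch below is a minimal illustration that assumes the sample standard deviation; it is not the paper's evaluation code.

    import statistics

    def exam_level_measures(model_grades, lecturer_grades):
        """Compute MAD, STD(|Δ|), and signed bias for one set of exam scripts."""
        diffs = [m - l for m, l in zip(model_grades, lecturer_grades)]
        abs_diffs = [abs(d) for d in diffs]

        mad = statistics.mean(abs_diffs)      # average absolute grade gap
        spread = statistics.stdev(abs_diffs)  # stability across students
        bias = statistics.mean(diffs)         # + lenient, - harsh vs. the lecturer
        return mad, spread, bias

    # Toy example with five exam scripts
    mad, spread, bias = exam_level_measures(
        model_grades=[78, 62, 91, 55, 70],
        lecturer_grades=[75, 65, 90, 60, 70],
    )
    print(f"MAD={mad:.1f}  STD(|Δ|)={spread:.1f}  bias={bias:+.1f}")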

Evaluation Tracks

Results are split into two groups so the benchmark can compare the capability frontier with the reproducible frontier.

Track A

Closed-Source Models

Commercial or API-based multimodal systems evaluated under the fixed STARE protocol. These models show what is currently possible at the top end of capability, even when the underlying systems are not fully inspectable or reproducible.

Track B

Open-Source Models

Publicly available models and checkpoints that can be rerun, audited, and improved by the community. This track highlights what can be reproduced on open infrastructure and where open models still lag or compete.

Reference Results

The horizontal grouped bars below summarize the paper's Figure 3 backend screening results. Each row is one model; the three bars correspond to MAD, STD(|Δ|), and bias. Lower MAD and STD(|Δ|) are better, while bias should stay as close to zero as possible.

Closed-Source Models

Source-paper screening backends

[Grouped bar chart: one row per model, with three horizontal bars for MAD, STD(|Δ|), and bias. Lower bars indicate better lecturer alignment for MAD and STD(|Δ|); for bias, the bar encodes absolute magnitude while the signed value shows whether the model tends to over-grade (+) or under-grade (−). The same values appear in the table below.]
Model MAD ↓ STD(|Δ|) ↓ Bias → 0 How this row was obtained
gemini-3-pro-preview 7.9 8.6 +0.3 exact — Full-pipeline mean reported in Table 2 / shown in Figure 3
gpt-4o 18.5 15.4 +1.2 approx. — Visually read from Figure 3
gpt-5 12.1 10.7 +1.4 approx. — Visually read from Figure 3
gpt-5.2 7.8 5.9 +0.2 exact — Full-pipeline mean reported in Table 2 / shown in Figure 3
gpt-5.2-pro 6.9 6.0 -0.2 approx. — Single-run value visually read from Figure 3
mistral-large-2512 24.9 16.9 +18.2 approx. — Visually read from Figure 3

“Exact” rows correspond to full-pipeline means explicitly reported numerically in the paper. “Approx.” rows are visually read from Figure 3 because the paper does not publish a numeric backend table for every screened model.

Open-Source Models

Reserved for the public STARE benchmark track

No open-source baseline is reported in the source paper.

STARE keeps the open-source track visible from the start, but the paper’s published screening results cover only closed-source backends. This section is ready to receive SLAIF benchmark evaluations as open multimodal grading systems are added.

Why This Benchmark Matters

Real handwritten answers

STARE is designed for exam scripts with unconstrained handwriting, formulas, and sketches rather than neat text-only responses.

Lecturer-style grading

The target is not an abstract universal score. The target is the lecturer’s judgement, including how that lecturer awards partial credit.

Operational usefulness

Beyond the three main measures, STARE also tracks manual review burden as an auxiliary indicator of how much human consolidation is still required.
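As one plausible way to operationalise such an indicator, the sketch below flags answers whose repeated model gradings disagree by more than a tolerance and reports the flagged fraction. The flagging rule and the tolerance value are assumptions for illustration, not the paper's definition.

    def manual_review_burden(run_grades, tolerance=1.0):
        """Fraction of answers whose independent grading runs disagree.

        run_grades: one list of awarded points per answer, with one entry
        per grading run. tolerance (in points) is an illustrative choice,
        not a threshold defined by the benchmark.
        """
        flagged = [max(g) - min(g) > tolerance for g in run_grades]
        return sum(flagged) / len(flagged)

    # Toy example: three answers, each graded in three independent runs;
    # the second answer spreads by 3 points and is routed to a human.
    burden = manual_review_burden([[4, 4, 4], [2, 5, 3], [10, 10, 9]])
    print(f"manual review burden: {burden:.0%}")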

Benchmark Contact

For benchmark details, submissions, or collaboration inquiries, contact the STARE benchmark team via SLAIF.