Methodology

What STARE Measures

A benchmark methodology for lecturer-aligned grading of handwritten technical exams.

STARE measures whether a multimodal AI system can replace a human grader on real exam scripts and assign grades in that grader’s style.

The underlying task is not generic question answering. The model must inspect scanned handwritten student work, parse text, formulas, symbolic notation, and diagrams, compare the answer to a lecturer-provided reference and rules, and return a grade that agrees with the lecturer’s own grading practice.

The Three Core Measures

The paper’s main quantitative evaluation reports exam-level agreement with the lecturer using three principal measures. STARE adopts the same three as its headline benchmark dimensions.

1. Mean Absolute Difference

Headline grading accuracy. For each student, take the absolute difference between the AI-assigned exam grade and the lecturer’s grade, then average across students.

  • Interpretation: how close the model gets to the lecturer overall.
  • Lower is better: zero means perfect exam-level agreement.

2. Standard Deviation of Absolute Differences

Error consistency. This measures whether grading errors are uniformly small across the cohort or whether some students are graded much worse than others.

  • Interpretation: stability of grading quality across the full cohort.
  • Lower is better: small spread means the model behaves more consistently.

3. Grading Bias

Signed grading tendency. This captures whether a model systematically assigns grades above or below the lecturer.

  • Interpretation: whether the model is too lenient or too strict.
  • Best value: as close to zero as possible.

Together these three measures answer the benchmark’s central question: can the model substitute the lecturer as grader, not only by being accurate on average, but by remaining stable across students and by avoiding systematic drift toward leniency or harshness?
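The three measures above follow directly from the paired per-student grades. As a minimal sketch (function and variable names are illustrative, not from the paper; the sign convention assumed here is that positive bias means the model grades above the lecturer, i.e. too leniently):

```python
from statistics import mean, stdev

def grading_metrics(ai_grades, lecturer_grades):
    """Compute the three STARE headline measures from paired exam grades.

    ai_grades / lecturer_grades: per-student exam grades in the same order.
    Returns (MAD, STD of |differences|, signed bias).
    """
    diffs = [a - h for a, h in zip(ai_grades, lecturer_grades)]
    abs_diffs = [abs(d) for d in diffs]
    mad = mean(abs_diffs)      # headline accuracy: lower is better
    spread = stdev(abs_diffs)  # consistency of errors across students
    bias = mean(diffs)         # signed tendency: + lenient, - strict
    return mad, spread, bias

# Toy example: three students, AI vs. lecturer grades.
mad, spread, bias = grading_metrics([80, 70, 90], [78, 75, 90])
```

Note that a near-zero bias can coexist with a large MAD when over- and under-grading cancel out, which is why the benchmark reports all three numbers rather than any single one.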

Evaluation Protocol

STARE follows the same end-to-end grading setting as the reference paper, with real handwritten engineering quizzes and lecturer-assigned exam grades as the only human ground truth.

Component and how STARE treats it:

  • Student input: Scanned handwritten exam scripts containing text, formulas, and hand-drawn diagrams or schematics.
  • Reference material: Lecturer-provided handwritten reference solution representing a full-score answer, plus short grading rules.
  • Language setting: Prompts, rules, templates, quizzes, and student answers may be non-English; the reference paper evaluates the workflow in Slovenian.
  • Pipeline style: Reference conditioning, answer-presence checks, structured grading, supervisor aggregation, and deterministic post-processing are part of the overall grading workflow.
  • Ground truth: The lecturer's exam grades define the target grading behaviour.
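The control flow of such a workflow can be sketched in a few lines. Everything below is a hypothetical illustration, not the paper's implementation: the blank-answer rule, the use of the median as the supervisor aggregate, and the clamping step are all assumptions standing in for the real answer-presence check, supervisor aggregation, and deterministic post-processing:

```python
from statistics import median

def grade_script(answers, graders, max_total=100):
    """Hypothetical sketch of the grading workflow: answer-presence check,
    per-question ensemble grading, supervisor aggregation (median here),
    and deterministic post-processing (rounding and clamping)."""
    per_question = []
    for answer in answers:
        if not answer.strip():                # answer-presence check
            per_question.append(0.0)          # blank answers score zero
            continue
        votes = [g(answer) for g in graders]  # structured grading votes
        per_question.append(median(votes))    # supervisor aggregation
    total = sum(per_question)
    return max(0.0, min(max_total, round(total, 1)))  # deterministic post-processing

# Toy run: three stand-in graders, one blank answer.
graders = [lambda a: 8.0, lambda a: 10.0, lambda a: 9.0]
score = grade_script(["x = 5 A", "", "see diagram"], graders)
```

The point of the sketch is the separation of concerns: model calls produce per-question votes, while aggregation and post-processing are deterministic so the final grade is reproducible given the votes.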

Additional Operational Metric

Manual Review Trigger Rate

In addition to the three headline benchmark measures, the paper reports manual review trigger rate. STARE keeps it as an auxiliary operational indicator rather than one of the three primary dimension bars.

This estimates how often human consolidation would still be required because different automated graders inside the ensemble disagree too much. It matters operationally: a model may look good on average while still creating too many cases that need a human to step in.

STARE therefore distinguishes between grading quality and workflow burden. The three main bars measure lecturer alignment; manual review burden measures how deployable the system is in practice.
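A trigger rate of this kind can be estimated directly from the per-student ensemble grades. In this sketch the trigger rule and the threshold value are assumptions for illustration; the paper's actual consolidation criterion may differ:

```python
def manual_review_rate(ensemble_grades, threshold=10.0):
    """Fraction of students whose graders disagree too much.

    ensemble_grades: one list of grades per student, one grade per
    automated grader in the ensemble. A student triggers manual review
    when the spread (max - min) of those grades exceeds `threshold`
    (hypothetical rule and value).
    """
    triggered = sum(
        1 for grades in ensemble_grades
        if max(grades) - min(grades) > threshold
    )
    return triggered / len(ensemble_grades)

# Toy cohort of three students, three graders each: only the middle
# student's graders disagree by more than 10 points.
rate = manual_review_rate([[80, 82, 79], [60, 75, 58], [90, 90, 91]])
```

Because the rate depends only on disagreement, not on closeness to the lecturer, it can flag a model as operationally expensive even when its MAD looks acceptable.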

Model Grouping

Closed-source track

Proprietary models, typically accessed via API, evaluated under the same benchmark protocol.

Open-source track

Publicly released models, weights, or systems whose runs can be reproduced and audited.

Benchmark Philosophy

Not a problem-solving benchmark

STARE does not score the model on solving the exam problems itself. The model is judged on how well it evaluates a student's submitted solution and how closely it reproduces the lecturer's own grading style, especially on partial-credit decisions in handwritten technical work.

Reference Results Used on the Overview Page

The overview page's grouped horizontal bars are built from the paper's backend screening figure. GPT-5.2 and Gemini-3 Pro have numeric full-pipeline means reported in the paper. The remaining rows are visually read from the paper's figure and marked accordingly.

Model                 MAD ↓   STD(|Δ|) ↓   Bias → 0   How this row was obtained
gemini-3-pro-preview   7.9       8.6         +0.3     exact: full-pipeline mean reported in Table 2 / shown in Figure 3
gpt-4o                18.5      15.4         +1.2     approx.: visually read from Figure 3
gpt-5                 12.1      10.7         +1.4     approx.: visually read from Figure 3
gpt-5.2                7.8       5.9         +0.2     exact: full-pipeline mean reported in Table 2 / shown in Figure 3
gpt-5.2-pro            6.9       6.0         -0.2     approx.: single-run value visually read from Figure 3
mistral-large-2512    24.9      16.9        +18.2     approx.: visually read from Figure 3

The exact paper means used here are GPT-5.2: MAD 7.8, STD(|Δ|) 5.9, Bias +0.2, and Gemini-3 Pro: MAD 7.9, STD(|Δ|) 8.6, Bias +0.3. Other values are read from Figure 3 and are suitable for the benchmark website’s reference display, but should still be replaced with an official numeric export if the authors release one later.
