Methodology
What STARE Measures
A benchmark methodology for lecturer-aligned grading of handwritten technical exams.
STARE measures whether a multimodal AI system can replace a human grader on real exam scripts and assign grades in that grader’s style.
The underlying task is not generic question answering. The model must inspect scanned handwritten student work; parse text, formulas, symbolic notation, and diagrams; compare the answer against the lecturer-provided reference solution and grading rules; and return a grade that agrees with the lecturer’s own grading practice.
The Three Core Measures
The paper’s main quantitative evaluation reports exam-level agreement with the lecturer using three principal measures. STARE adopts the same three as its headline benchmark dimensions.
1. Mean Absolute Difference
Headline grading accuracy. For each student, take the absolute difference between the AI-assigned exam grade and the lecturer’s grade, then average across students.
- Interpretation: how close the model gets to the lecturer overall.
- Lower is better: zero means perfect exam-level agreement.
2. Standard Deviation of Absolute Differences
Error consistency. This measures whether grading errors are evenly controlled or whether some students are graded much worse than others.
- Interpretation: stability of grading quality across the full cohort.
- Lower is better: small spread means the model behaves more consistently.
3. Grading Bias
Signed grading tendency. This captures whether a model systematically assigns grades above or below the lecturer.
- Interpretation: whether the model is too lenient or too strict.
- Best value: as close to zero as possible.
Together these three measures answer the benchmark’s central question: can the model substitute the lecturer as grader, not only by being accurate on average, but by remaining stable across students and by avoiding systematic drift toward leniency or harshness?
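All three measures can be computed from per-student grade pairs. The sketch below is illustrative only (toy data and hypothetical function names, not STARE outputs), assuming grades are paired per student as (AI grade, lecturer grade):

```python
# Sketch of the three headline measures over a cohort of students.
from statistics import mean, pstdev

def exam_level_measures(ai_grades, lecturer_grades):
    """Return (MAD, STD of |difference|, signed bias) across a cohort."""
    diffs = [a - l for a, l in zip(ai_grades, lecturer_grades)]
    abs_diffs = [abs(d) for d in diffs]
    mad = mean(abs_diffs)       # lower is better; 0 = perfect agreement
    spread = pstdev(abs_diffs)  # lower is better; stability across students
    bias = mean(diffs)          # signed; > 0 lenient, < 0 strict
    return mad, spread, bias

# Toy cohort: AI grades vs lecturer grades on a percent scale.
mad, spread, bias = exam_level_measures([72, 85, 60], [70, 90, 58])
```

Note that MAD and bias differ only in where the absolute value is taken: averaging |Δ| hides direction, while averaging signed Δ exposes systematic leniency or strictness.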
Evaluation Protocol
STARE follows the same end-to-end grading setting as the reference paper, with real handwritten engineering quizzes and lecturer-assigned exam grades as the only human ground truth.
| Component | How STARE treats it |
| --- | --- |
| Student input | Scanned handwritten exam scripts containing text, formulas, and hand-drawn diagrams or schematics. |
| Reference material | Lecturer-provided handwritten reference solution representing a full-score answer, plus short grading rules. |
| Language setting | Prompts, rules, templates, quizzes, and student answers may be non-English; the reference paper evaluates the workflow in Slovenian. |
| Pipeline style | Reference conditioning, answer-presence checks, structured grading, supervisor aggregation, and deterministic post-processing are part of the overall grading workflow. |
| Ground truth | The lecturer’s exam grades define the target grading behaviour. |
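The pipeline stages can be sketched in simplified form. Everything below is an assumption for illustration: answers arrive as already-parsed per-question strings and graders are plain callables, whereas the real workflow operates on scanned images through multimodal models, and the aggregation rule shown (a median) is a placeholder, not the paper’s supervisor logic:

```python
# Minimal runnable sketch of the grading workflow stages listed above.
from statistics import median

def grade_script(answers, graders, reference, rules):
    """answers: per-question answer strings ('' means blank).
    graders: callables (answer, reference, rules) -> per-question score."""
    per_question = []
    for answer in answers:
        # Answer-presence check: blank answers score zero outright.
        if not answer.strip():
            per_question.append(0)
            continue
        # Reference-conditioned structured grading by each ensemble grader.
        scores = [g(answer, reference, rules) for g in graders]
        # Supervisor aggregation, sketched here as a median.
        per_question.append(median(scores))
    # Deterministic post-processing: clamp the exam total to 0..100.
    return max(0, min(100, sum(per_question)))

# Toy run: two graders, one blank answer among three questions.
graders = [lambda a, ref, r: 40, lambda a, ref, r: 42]
total = grade_script(["u = R*i", "", "P = u*i"], graders, "ref", "rules")
```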
Additional Operational Metric
Manual Review Trigger Rate
In addition to the three headline benchmark measures, the paper reports manual review trigger rate. STARE keeps it as an auxiliary operational indicator rather than one of the three primary dimension bars.
This estimates how often human consolidation would still be required because different automated graders inside the ensemble disagree too much. It matters operationally: a model may look good on average while still creating too many cases that need a human to step in.
STARE therefore distinguishes between grading quality and workflow burden. The three main bars measure lecturer alignment; manual review burden measures how deployable the system is in practice.
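As an illustration of how such a rate can be computed, the sketch below assumes the trigger is a simple spread threshold on ensemble scores; the actual trigger condition and threshold in the paper may differ:

```python
# Hypothetical manual-review trigger based on grader disagreement.
def needs_manual_review(ensemble_scores, max_spread=10):
    """Flag an item when graders disagree by more than max_spread points."""
    return max(ensemble_scores) - min(ensemble_scores) > max_spread

def trigger_rate(all_ensemble_scores, max_spread=10):
    """Fraction of graded items that would fall back to a human."""
    flagged = sum(needs_manual_review(s, max_spread)
                  for s in all_ensemble_scores)
    return flagged / len(all_ensemble_scores)

# Toy data: four items, each scored by two graders; one wide disagreement.
rate = trigger_rate([[80, 85], [40, 70], [90, 88], [55, 50]])
```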
Model Grouping
Closed-source track
API or proprietary models evaluated under the same benchmark protocol.
Open-source track
Public models, weights, or systems that can be rerun and audited.
Benchmark Philosophy
Not a problem-solving benchmark
The model is not asked to solve the exam problems itself. It is judged on how well it evaluates a student’s submitted solution and how closely it reproduces the lecturer’s own grading style, especially on partial-credit decisions in handwritten technical work.
Reference Results Used on the Overview Page
The overview page’s grouped horizontal bars are built from the paper’s backend screening figure. GPT-5.2 and Gemini-3 Pro have exact full-pipeline means reported numerically in the paper. The remaining rows are visually read from the paper’s figure and marked accordingly.
| Model | MAD ↓ | STD(\|Δ\|) ↓ | Bias → 0 | How this row was obtained |
| --- | --- | --- | --- | --- |
| gemini-3-pro-preview | 7.9 | 8.6 | +0.3 | exact — Full-pipeline mean reported in Table 2 / shown in Figure 3 |
| gpt-4o | 18.5 | 15.4 | +1.2 | approx. — Visually read from Figure 3 |
| gpt-5 | 12.1 | 10.7 | +1.4 | approx. — Visually read from Figure 3 |
| gpt-5.2 | 7.8 | 5.9 | +0.2 | exact — Full-pipeline mean reported in Table 2 / shown in Figure 3 |
| gpt-5.2-pro | 6.9 | 6.0 | -0.2 | approx. — Single-run value visually read from Figure 3 |
| mistral-large-2512 | 24.9 | 16.9 | +18.2 | approx. — Visually read from Figure 3 |
The exact paper means used here are GPT-5.2: MAD 7.8, STD(|Δ|) 5.9, Bias +0.2, and Gemini-3 Pro: MAD 7.9, STD(|Δ|) 8.6, Bias +0.3. Other values are read from Figure 3 and are suitable for the benchmark website’s reference display, but should still be replaced with an official numeric export if the authors release one later.