Benchmarks

We validated Promptivo against two well-known academic datasets to verify that our scoring aligns with real-world prompt quality patterns.

IFEval Dataset (541 prompts)

Google Research — Instruction Following Evaluation

IFEval contains carefully constructed prompts designed to test whether models can follow specific instructions. These are research-grade prompts written by domain experts.

Average Score: 3.38
Rated “Good”: 28.7%
Rated “Excellent”: 0%

Per-Dimension Scores

Precision: 4.24
Clarity: 4.23
Info Completeness: 3.99
Chain-of-Thought: 3.58
Structural Compliance: 2.79
Info Integrity: 2.52
Constraint Verifiability: 2.35

Key finding: Even research-grade prompts have room for improvement. Zero prompts scored “Excellent,” and the weakest areas were Constraint Verifiability and Informational Integrity, confirming that expert-written prompts often rely on subjective criteria rather than mechanically verifiable requirements.

LLM-as-a-Judge Comparison (Gemini 2.5 Flash · 5 runs · 539 prompts)

Promptivo Deterministic vs. Gemini 2.5 Flash — IFEval Dataset, 5 independent runs

We ran the same 541 IFEval prompts through Google’s Gemini 2.5 Flash model five independent times, asking it to score each prompt on the same 7 dimensions using a 1–5 scale. Running multiple trials lets us measure both how Gemini compares to Promptivo and how consistent Gemini’s own judgments are across repeated evaluations.
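The judging protocol itself is a plain loop: each prompt goes to the judge model five times, each run returns a 1–5 score per dimension, and every run's raw output is retained. Below is a minimal sketch of that loop in Python; `score_with_gemini` is a hypothetical placeholder for the actual Gemini 2.5 Flash call at temperature 0, and the dimension keys simply mirror the seven dimensions listed above.

```python
import statistics
from typing import Dict, List

DIMENSIONS = [
    "clarity", "precision", "chain_of_thought", "info_completeness",
    "constraint_verifiability", "structural_compliance", "info_integrity",
]
N_RUNS = 5  # independent judge runs per prompt

def score_with_gemini(prompt: str) -> Dict[str, float]:
    """Hypothetical wrapper around a Gemini 2.5 Flash call (temperature=0).

    Expected to return a 1-5 score for each dimension; not the real API.
    """
    raise NotImplementedError

def judge_prompt(prompt: str, n_runs: int = N_RUNS) -> Dict[str, List[float]]:
    """Collect n_runs independent per-dimension scores for one prompt."""
    runs = [score_with_gemini(prompt) for _ in range(n_runs)]
    return {dim: [run[dim] for run in runs] for dim in DIMENSIONS}

def summarize(per_dim_scores: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Per-dimension mean and population std-dev (sigma) across the runs."""
    return {
        dim: {"mean": statistics.fmean(vals), "sigma": statistics.pstdev(vals)}
        for dim, vals in per_dim_scores.items()
    }
```

Keeping all five raw runs, rather than only their mean, is what makes the stability analysis further down possible.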

| Dimension | Promptivo | Gemini (avg) | Diff | Gemini σ |
| --- | --- | --- | --- | --- |
| Overall Score | 3.39 | 3.45 | +0.06 | 0.007 |
| Clarity | 4.23 | 3.85 | −0.37 | 0.004 |
| Precision | 4.24 | 3.07 | −1.17 | 0.001 |
| Chain-of-Thought | 3.58 | 2.68 | −0.91 | 0.009 |
| Info Completeness | 3.99 | 3.10 | −0.89 | 0.006 |
| Constraint Verifiability | 2.35 | 3.79 | +1.44 | 0.003 |
| Structural Compliance | 2.79 | 3.38 | +0.59 | 0.007 |
| Info Integrity | 2.52 | 3.38 | +0.87 | 0.004 |
Grade agreement rate: 39.1%
Scoring approach: independent — measures different signals than Gemini
Gemini self-consistency: 88% stable — same score across all 5 runs

Result Discrepancy Analysis

The five Gemini runs were fully independent, with no shared state and temperature fixed at 0. Comparing their outputs reveals how stable LLM-as-a-judge scoring is in practice; a sketch of the per-prompt variance bucketing follows the table below.

| Stability bucket | Prompts | Share | Interpretation |
| --- | --- | --- | --- |
| σ < 0.2 — very stable | 475 | 88.1% | Gemini agrees with itself across runs |
| σ 0.2–0.5 — minor jitter | 64 | 11.9% | Typically a 0.5-point swing in one run |
| σ ≥ 0.5 — high variance | 0 | 0% | No prompts showed large score swings |
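The bucket counts above come straight from a per-prompt standard deviation over the five overall scores. A minimal sketch, assuming `gemini_runs` holds one `{prompt_id: overall_score}` mapping per run (for example, built from the judging loop sketched earlier):

```python
import statistics
from collections import Counter
from typing import Dict, List

def stability_buckets(gemini_runs: List[Dict[str, float]]) -> Counter:
    """Bucket prompts by the std-dev of their overall score across runs.

    gemini_runs: one dict per run, mapping prompt_id -> overall score (1-5).
    """
    buckets = Counter()
    for prompt_id in gemini_runs[0]:
        scores = [run[prompt_id] for run in gemini_runs]
        sigma = statistics.pstdev(scores)
        if sigma < 0.2:
            buckets["very stable (sigma < 0.2)"] += 1
        elif sigma < 0.5:
            buckets["minor jitter (0.2 <= sigma < 0.5)"] += 1
        else:
            buckets["high variance (sigma >= 0.5)"] += 1
    return buckets
```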

Dimensional pattern: Constraint Verifiability shows the largest Promptivo–Gemini gap (+1.44), yet the lowest inter-run variance (σ = 0.003) — meaning Gemini is consistently more generous on this dimension, not just occasionally. Conversely, Precision shows the largest gap in the other direction (−1.17): Promptivo’s pattern engine reliably detects quantified language and specificity markers that Gemini’s holistic read may underweight. These are systematic, stable disagreements — not noise.

Divergent prompts: 21 of 539 prompts (3.9%) had |avg Gemini − Promptivo| > 1.5 points. Inspection shows a clear pattern: prompts Gemini scores high but Promptivo scores lower tend to be short, semantically clear requests (e.g. “Answer in lowercase only”) where the constraint is obvious to a reader but lacks explicit structural markers that Promptivo’s pattern engine scores. The reverse — Promptivo higher — occurs on structurally rich prompts that include many formatting cues and verifiable requirements, which Gemini may deflate if the underlying request is vague.

Key finding: Gemini and Promptivo reach near-identical overall averages (3.45 vs 3.39 across 539 prompts) but measure complementary signals. Gemini is remarkably self-consistent (88% of prompts stable across 5 runs, avg σ = 0.036), yet its dimensional profile diverges from Promptivo’s by up to 1.44 points. The near-zero Pearson correlation (r = 0.035) confirms these are independent perspectives — not redundant. Promptivo delivers deterministic, zero-cost results; Gemini applies semantic comprehension. Together they provide a richer picture than either approach alone.
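For readers who want to reproduce the headline comparison numbers, the sketch below computes the divergent-prompt count (|avg Gemini − Promptivo| > 1.5), the grade agreement rate, and the Pearson correlation from paired per-prompt scores. The letter-grade thresholds in `to_grade` are illustrative assumptions, not Promptivo's actual grade bands, and `statistics.correlation` requires Python 3.10+.

```python
import statistics
from typing import Dict, List, Tuple

def to_grade(score: float) -> str:
    """Map a 1-5 score to a grade label. Thresholds are illustrative only;
    Promptivo's real grade bands are not specified here."""
    if score >= 4.5:
        return "Excellent"
    if score >= 3.5:
        return "Good"
    if score >= 2.5:
        return "Fair"
    return "Poor"

def compare(pairs: List[Tuple[float, float]]) -> Dict[str, float]:
    """pairs: (promptivo_score, avg_gemini_score) for each prompt."""
    promptivo = [p for p, _ in pairs]
    gemini = [g for _, g in pairs]
    divergent = sum(1 for p, g in pairs if abs(g - p) > 1.5)
    agreement = sum(1 for p, g in pairs if to_grade(p) == to_grade(g)) / len(pairs)
    pearson_r = statistics.correlation(promptivo, gemini)  # Pearson r, Python 3.10+
    return {
        "divergent_prompts": divergent,
        "grade_agreement_rate": agreement,
        "pearson_r": pearson_r,
    }
```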