Benchmarks
We validated Promptivo against two well-known academic datasets to check that our scoring aligns with real-world prompt-quality patterns.
IFEval Dataset (541 prompts)
Google Research — Instruction Following Evaluation
IFEval contains carefully constructed prompts designed to test whether models can follow specific instructions. These are research-grade prompts written by domain experts.
Per-Dimension Scores
Key finding: Even research-quality prompts have room for improvement. Not one prompt scored “Excellent”; the weakest areas were Constraint Verifiability and Informational Integrity, confirming that expert-written prompts often rely on subjective criteria rather than mechanically verifiable requirements.
LLM-as-a-Judge Comparison (Gemini 2.5 Flash · 5 runs · 539 prompts)
Promptivo Deterministic vs. Gemini 2.5 Flash — IFEval Dataset, 5 independent runs
We ran the same 541 IFEval prompts through Google’s Gemini 2.5 Flash model five independent times, asking it to score each prompt on the same 7 dimensions using a 1–5 scale. Running multiple trials lets us measure both how Gemini compares to Promptivo and how consistent Gemini’s own judgments are across repeated evaluations.
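A minimal sketch of the judging loop, assuming the google-genai Python SDK; the rubric wording, dimension keys, and helper names here are illustrative rather than the exact harness we used, and response parsing is simplified.

```python
import json

from google import genai                # assumes the google-genai Python SDK
from google.genai import types

DIMENSIONS = [
    "clarity", "precision", "chain_of_thought", "info_completeness",
    "constraint_verifiability", "structural_compliance", "info_integrity",
]

# Illustrative rubric; the exact wording used in our runs differs.
RUBRIC = (
    "Rate the following prompt from 1 to 5 on each dimension: "
    + ", ".join(DIMENSIONS)
    + ". Reply with a JSON object mapping each dimension name to its score.\n\nPROMPT:\n"
)

client = genai.Client()  # reads the API key from the environment

def judge(prompt: str) -> dict[str, int]:
    """One Gemini judgment: temperature 0, JSON scores for all 7 dimensions."""
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=RUBRIC + prompt,
        config=types.GenerateContentConfig(
            temperature=0,
            response_mime_type="application/json",
        ),
    )
    return json.loads(resp.text)

def judge_runs(prompt: str, runs: int = 5) -> list[dict[str, int]]:
    """Independent repeated judgments of the same prompt (no shared state)."""
    return [judge(prompt) for _ in range(runs)]
```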
| Dimension | Promptivo | Gemini (avg) | Diff (Gemini − Promptivo) | Gemini σ (across runs) |
|---|---|---|---|---|
| Overall Score | 3.39 | 3.45 | +0.06 | 0.007 |
| Clarity | 4.23 | 3.85 | −0.37 | 0.004 |
| Precision | 4.24 | 3.07 | −1.17 | 0.001 |
| Chain-of-Thought | 3.58 | 2.68 | −0.91 | 0.009 |
| Info Completeness | 3.99 | 3.10 | −0.89 | 0.006 |
| Constraint Verifiability | 2.35 | 3.79 | +1.44 | 0.003 |
| Structural Compliance | 2.79 | 3.38 | +0.59 | 0.007 |
| Info Integrity | 2.52 | 3.38 | +0.87 | 0.004 |
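The table above can be rebuilt from the raw scores with a small aggregation. A minimal pandas sketch, assuming each Gemini run is a DataFrame of per-prompt dimension scores and `promptivo` holds the deterministic scores for the same prompts; we read “Gemini σ” as the spread of the run-level dimension averages, which matches the magnitudes shown (the names and data layout are illustrative):

```python
import pandas as pd

def dimension_table(promptivo: pd.DataFrame, gemini_runs: list[pd.DataFrame]) -> pd.DataFrame:
    """Per-dimension means, their difference, and the spread of run-level averages."""
    run_means = pd.DataFrame([run.mean() for run in gemini_runs])  # 5 runs x 7 dimensions
    table = pd.DataFrame({
        "Promptivo": promptivo.mean(),
        "Gemini (avg)": run_means.mean(),
    })
    table["Diff"] = table["Gemini (avg)"] - table["Promptivo"]
    table["Gemini σ"] = run_means.std(ddof=0)   # spread of the run-level averages
    return table.round(3)
```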
Result Discrepancy Analysis
The five Gemini runs were fully independent: no shared state, temperature fixed at 0 for each run. Comparing their outputs reveals how stable LLM-as-a-judge scoring is in practice; a reproduction sketch follows the table.
| Stability bucket | Prompts | Share | Interpretation |
|---|---|---|---|
| σ < 0.2 — very stable | 475 | 88.1% | Gemini agrees with itself across runs |
| σ 0.2–0.5 — minor jitter | 64 | 11.9% | Typically a 0.5-point swing in one run |
| σ ≥ 0.5 — high variance | 0 | 0% | No prompts showed large score swings |
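A sketch of the bucketing, assuming the same per-run DataFrames as above and that σ here is the standard deviation of each prompt's overall Gemini score (the mean of its 7 dimension scores) across the five runs:

```python
import pandas as pd

def stability_buckets(gemini_runs: list[pd.DataFrame]) -> pd.Series:
    """Count prompts by how much their overall Gemini score varies across runs."""
    # Overall score per prompt per run = mean of the 7 dimension scores.
    overall = pd.DataFrame({i: run.mean(axis=1) for i, run in enumerate(gemini_runs)})
    sigma = overall.std(axis=1, ddof=0)          # per-prompt σ across the 5 runs
    buckets = pd.cut(
        sigma,
        bins=[0.0, 0.2, 0.5, float("inf")],
        right=False,  # [0, 0.2), [0.2, 0.5), [0.5, ∞) to match the buckets above
        labels=["very stable (σ < 0.2)", "minor jitter (σ 0.2–0.5)", "high variance (σ ≥ 0.5)"],
    )
    return buckets.value_counts().sort_index()
```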
Dimensional pattern: Constraint Verifiability shows the largest Promptivo–Gemini gap (+1.44) yet one of the lowest inter-run variances (σ = 0.003), meaning Gemini is consistently more generous on this dimension, not just occasionally. Conversely, Precision shows the largest gap in the other direction (−1.17): Promptivo’s pattern engine reliably detects quantified language and specificity markers that Gemini’s holistic read appears to underweight. These are systematic, stable disagreements, not noise.
Divergent prompts: 21 of 539 prompts (3.9%) had |avg Gemini − Promptivo| > 1.5 points. Inspection shows a clear pattern: prompts that Gemini scores high but Promptivo scores lower tend to be short, semantically clear requests (e.g. “Answer in lowercase only”) where the constraint is obvious to a reader but lacks the explicit structural markers Promptivo’s pattern engine rewards. The reverse, where Promptivo scores higher, occurs on structurally rich prompts packed with formatting cues and verifiable requirements, which Gemini may deflate when the underlying request is vague.
Key finding: Gemini and Promptivo reach near-identical overall averages (3.45 vs 3.39 across 539 prompts) but measure complementary signals. Gemini is remarkably self-consistent (88% of prompts stable across 5 runs, average per-prompt σ = 0.036), yet its dimensional profile diverges from Promptivo’s by up to 1.44 points. The near-zero Pearson correlation between the two overall-score series (r = 0.035) confirms they offer independent perspectives rather than redundant ones. Promptivo delivers deterministic, zero-cost results; Gemini applies semantic comprehension. Together they provide a richer picture than either approach alone.
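Both the 1.5-point divergence filter and the correlation figure come from a direct comparison of the two overall-score series; a minimal sketch under the same assumed data layout as the earlier snippets:

```python
import pandas as pd
from scipy.stats import pearsonr

def compare_scores(promptivo_overall: pd.Series, gemini_runs: list[pd.DataFrame]) -> None:
    """Flag divergent prompts and correlate the two overall-score series."""
    # Per-prompt Gemini average: mean of the per-run overall scores.
    gemini_avg = pd.concat(
        [run.mean(axis=1) for run in gemini_runs], axis=1
    ).mean(axis=1)
    diff = gemini_avg - promptivo_overall
    divergent = diff[diff.abs() > 1.5]              # |avg Gemini - Promptivo| > 1.5
    r, p = pearsonr(promptivo_overall, gemini_avg)  # Pearson correlation of the two series
    print(f"{len(divergent)} divergent prompts; Pearson r = {r:.3f} (p = {p:.2g})")
```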