Benchmarks
We validated Promptivo against two well-known academic datasets to check that our scoring aligns with real-world prompt-quality patterns.
IFEval Dataset (541 prompts)
Google Research — Instruction Following Evaluation
IFEval contains carefully constructed prompts designed to test whether models can follow specific instructions. These are research-grade prompts written by domain experts.
Per-Dimension Scores
Key finding: Even research-quality prompts have room for improvement. Not one prompt scored “Excellent”; the weakest areas were Constraint Verifiability and Informational Integrity, confirming that expert-written prompts often rely on subjective criteria rather than mechanically verifiable requirements.
LLM-as-a-Judge Comparison (Gemini 2.5 Flash · 5 runs · 539 prompts)
Promptivo Deterministic vs. Gemini 2.5 Flash — IFEval Dataset, 5 independent runs
We ran the same 541 IFEval prompts through Google’s Gemini 2.5 Flash model five independent times, asking it to score each prompt on the same 7 dimensions using a 1–5 scale. Running multiple trials lets us measure both how Gemini compares to Promptivo and how consistent Gemini’s own judgments are across repeated evaluations.
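A minimal sketch of the judging loop, assuming the google-genai Python SDK; the rubric wording, dimension keys, and helper names here are illustrative rather than the exact harness we used, and response parsing is simplified.

```python
import json

from google import genai                # assumes the google-genai Python SDK
from google.genai import types

DIMENSIONS = [
    "clarity", "precision", "chain_of_thought", "info_completeness",
    "constraint_verifiability", "structural_compliance", "info_integrity",
]

# Illustrative rubric; the exact wording used in our runs differs.
RUBRIC = (
    "Rate the following prompt from 1 to 5 on each dimension: "
    + ", ".join(DIMENSIONS)
    + ". Reply with a JSON object mapping each dimension name to its score.\n\nPROMPT:\n"
)

client = genai.Client()  # reads the API key from the environment

def judge(prompt: str) -> dict[str, int]:
    """One Gemini judgment: temperature 0, JSON scores for all 7 dimensions."""
    resp = client.models.generate_content(
        model="gemini-2.5-flash",
        contents=RUBRIC + prompt,
        config=types.GenerateContentConfig(
            temperature=0,
            response_mime_type="application/json",
        ),
    )
    return json.loads(resp.text)

def judge_runs(prompt: str, runs: int = 5) -> list[dict[str, int]]:
    """Independent repeated judgments of the same prompt (no shared state)."""
    return [judge(prompt) for _ in range(runs)]
```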
| Dimension | Promptivo | Gemini (avg) | Diff (Gemini − Promptivo) | Gemini σ (across runs) |
|---|---|---|---|---|
| Overall Score | 3.39 | 3.45 | +0.06 | 0.007 |
| Clarity | 4.23 | 3.85 | −0.37 | 0.004 |
| Precision | 4.24 | 3.07 | −1.17 | 0.001 |
| Chain-of-Thought | 3.58 | 2.68 | −0.91 | 0.009 |
| Info Completeness | 3.99 | 3.10 | −0.89 | 0.006 |
| Constraint Verifiability | 2.35 | 3.79 | +1.44 | 0.003 |
| Structural Compliance | 2.79 | 3.38 | +0.59 | 0.007 |
| Info Integrity | 2.52 | 3.38 | +0.87 | 0.004 |
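The table above can be rebuilt from the raw scores with a small aggregation. A minimal pandas sketch, assuming each Gemini run is a DataFrame of per-prompt dimension scores and `promptivo` holds the deterministic scores for the same prompts; we read “Gemini σ” as the spread of the run-level dimension averages, which matches the magnitudes shown (the names and data layout are illustrative):

```python
import pandas as pd

def dimension_table(promptivo: pd.DataFrame, gemini_runs: list[pd.DataFrame]) -> pd.DataFrame:
    """Per-dimension means, their difference, and the spread of run-level averages."""
    run_means = pd.DataFrame([run.mean() for run in gemini_runs])  # 5 runs x 7 dimensions
    table = pd.DataFrame({
        "Promptivo": promptivo.mean(),
        "Gemini (avg)": run_means.mean(),
    })
    table["Diff"] = table["Gemini (avg)"] - table["Promptivo"]
    table["Gemini σ"] = run_means.std(ddof=0)   # spread of the run-level averages
    return table.round(3)
```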
Result Discrepancy Analysis
The five Gemini runs were fully independent: no shared state, temperature fixed at 0 for each run. Comparing their outputs reveals how stable LLM-as-a-judge scoring is in practice; a reproduction sketch follows the table.
| Stability bucket | Prompts | Share | Interpretation |
|---|---|---|---|
| σ < 0.2 — very stable | 475 | 88.1% | Gemini agrees with itself across runs |
| σ 0.2–0.5 — minor jitter | 64 | 11.9% | Typically a 0.5-point swing in one run |
| σ ≥ 0.5 — high variance | 0 | 0% | No prompts showed large score swings |
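A sketch of the bucketing, assuming the same per-run DataFrames as above and that σ here is the standard deviation of each prompt's overall Gemini score (the mean of its 7 dimension scores) across the five runs:

```python
import pandas as pd

def stability_buckets(gemini_runs: list[pd.DataFrame]) -> pd.Series:
    """Count prompts by how much their overall Gemini score varies across runs."""
    # Overall score per prompt per run = mean of the 7 dimension scores.
    overall = pd.DataFrame({i: run.mean(axis=1) for i, run in enumerate(gemini_runs)})
    sigma = overall.std(axis=1, ddof=0)          # per-prompt σ across the 5 runs
    buckets = pd.cut(
        sigma,
        bins=[0.0, 0.2, 0.5, float("inf")],
        right=False,  # [0, 0.2), [0.2, 0.5), [0.5, ∞) to match the buckets above
        labels=["very stable (σ < 0.2)", "minor jitter (σ 0.2–0.5)", "high variance (σ ≥ 0.5)"],
    )
    return buckets.value_counts().sort_index()
```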
Dimensional pattern: Constraint Verifiability shows the largest Promptivo–Gemini gap (+1.44) yet one of the lowest inter-run variances (σ = 0.003), meaning Gemini is consistently more generous on this dimension, not just occasionally. Conversely, Precision shows the largest gap in the other direction (−1.17): Promptivo’s pattern engine reliably detects quantified language and specificity markers that Gemini’s holistic read appears to underweight. These are systematic, stable disagreements, not noise.
Divergent prompts: 21 of 539 prompts (3.9%) had |avg Gemini − Promptivo| > 1.5 points. Inspection shows a clear pattern: prompts that Gemini scores high but Promptivo scores lower tend to be short, semantically clear requests (e.g. “Answer in lowercase only”) where the constraint is obvious to a reader but lacks the explicit structural markers Promptivo’s pattern engine rewards. The reverse, where Promptivo scores higher, occurs on structurally rich prompts packed with formatting cues and verifiable requirements, which Gemini may deflate when the underlying request is vague.
Key finding: Gemini and Promptivo reach near-identical overall averages (3.45 vs 3.39 across 539 prompts) but measure complementary signals. Gemini is remarkably self-consistent (88% of prompts stable across 5 runs, average per-prompt σ = 0.036), yet its dimensional profile diverges from Promptivo’s by up to 1.44 points. The near-zero Pearson correlation between the two overall-score series (r = 0.035) confirms they offer independent perspectives rather than redundant ones. Promptivo delivers deterministic, zero-cost results; Gemini applies semantic comprehension. Together they provide a richer picture than either approach alone.
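Both the 1.5-point divergence filter and the correlation figure come from a direct comparison of the two overall-score series; a minimal sketch under the same assumed data layout as the earlier snippets:

```python
import pandas as pd
from scipy.stats import pearsonr

def compare_scores(promptivo_overall: pd.Series, gemini_runs: list[pd.DataFrame]) -> None:
    """Flag divergent prompts and correlate the two overall-score series."""
    # Per-prompt Gemini average: mean of the per-run overall scores.
    gemini_avg = pd.concat(
        [run.mean(axis=1) for run in gemini_runs], axis=1
    ).mean(axis=1)
    diff = gemini_avg - promptivo_overall
    divergent = diff[diff.abs() > 1.5]              # |avg Gemini - Promptivo| > 1.5
    r, p = pearsonr(promptivo_overall, gemini_avg)  # Pearson correlation of the two series
    print(f"{len(divergent)} divergent prompts; Pearson r = {r:.3f} (p = {p:.2g})")
```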