Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

Junjie Li1   Xinrui Guo1   Yuhao Wu2   Roy Ka-Wei Lee2   Hongzhi Li1   Yutao Xie1

1 Microsoft, Beijing, China

2 Singapore University of Technology and Design

Contact: lij850601@gmail.com  |  xingu@microsoft.com  |  wu_yuhao@mymail.sutd.edu.sg

1. Overview

ConStory-Bench focuses on consistency in ultra-long narrative generation. It contains 2,000 prompts across four task scenarios, with outputs targeting 8,000-10,000 words.

We pair the benchmark with ConStory-Checker, a four-stage evidence-grounded evaluator covering five error categories and 19 subtypes, and report standardized CED/GRR metrics.

Overview of ConStory-Bench

Figure 1: Overview of ConStory-Bench. (a) 2,000 prompts for long-story generation, (b) ConStory-Checker contradiction extraction and evidence chaining, (c) CED/GRR-based evaluation.

Research Questions

We structure our investigation around the following research questions:

  1. To what extent do current LLMs maintain narrative coherence in ultra-long text generation, and do different models exhibit similar distributions of consistency error types?
  2. How do consistency errors scale as a function of output length across different LLM architectures?
  3. What underlying factors contribute to the emergence of consistency errors, and are there identifiable signals that reliably predict their occurrence?
  4. Do different types of consistency errors systematically co-occur, or do they arise independently?
  5. How are consistency errors distributed across positions within long-form generated narratives?

Main Contributions

  • We introduce ConStory-Bench, a benchmark for evaluating narrative consistency in long-form story generation, with four task scenarios and a taxonomy of five error categories and 19 fine-grained subtypes.
  • We develop ConStory-Checker, an automated evaluation pipeline that detects contradictions and supports each judgment with exact textual evidence.
  • We present evaluation results across proprietary and open-source models, capability-enhanced models, and agent-enhanced systems, with systematic analysis guided by five research questions.

2. ConStory-Bench: Task and Construction

ConStory-Bench is designed to systematically detect and quantify consistency errors in long-form narrative generation. Unlike existing benchmarks that focus on fluency or style, we specifically target the errors that emerge over extended contexts.

2.1 Dataset Construction

Sources and Selection

We curate long-context narrative material from seven diverse public corpora: LongBench, LongBench_Write, LongLamp, TellMeAStory, WritingBench, WritingPrompts, and WikiPlots. We select passages with clear plot progression, multiple interacting entities, and explicit temporal movement—characteristics that naturally induce consistency challenges.

Prompt Construction

To create realistic generation scenarios, we convert story segments into prompts through context-aware rewriting. We identify natural breakpoints (turning points, unresolved subplots, elaboration opportunities) and generate prompts for four task scenarios:

  • Generation: Complete story instantiation from scratch
  • Continuation: Context-preserving narrative extension
  • Expansion: Focused segment elaboration
  • Completion: Coherent gap-filling

This process yields 2,000 prompts distributed across the four task types, each targeting distinct consistency challenges.

| Task Type    | Count | Percentage |
|--------------|-------|------------|
| Generation   | 748   | 37.4%      |
| Continuation | 429   | 21.5%      |
| Expansion    | 419   | 21.0%      |
| Completion   | 394   | 19.7%      |
| Total        | 2,000 | 100%       |

Table 1: Distribution of the 2,000 prompts across the four task types.

2.2 Consistency Error Taxonomy

We develop a hierarchical taxonomy grounded in narrative theory, comprising five top-level categories and 19 fine-grained error types:

Consistency Error Examples

Figure 2: Representative consistency error examples. Highlighted segments show contradictions across five categories detected by ConStory-Checker.

| Error Type                   | Sub Error Type               |
|------------------------------|------------------------------|
| Timeline & Plot Logic        | Absolute Time Contradictions |
|                              | Duration Contradictions      |
|                              | Simultaneity Contradictions  |
|                              | Causeless Effects            |
|                              | Causal Logic Violations      |
|                              | Abandoned Plot Elements      |
| Characterization             | Memory Contradictions        |
|                              | Knowledge Contradictions     |
|                              | Skill Fluctuations           |
|                              | Forgotten Abilities          |
| World-building & Setting     | Core Rules Violations        |
|                              | Social Norms Violations      |
|                              | Geographical Contradictions  |
| Factual & Detail Consistency | Appearance Mismatches        |
|                              | Nomenclature Confusions      |
|                              | Quantitative Mismatches      |
| Narrative & Style            | Perspective Confusions       |
|                              | Tone Inconsistencies         |
|                              | Style Shifts                 |
Table 2: Consistency-error taxonomy used by ConStory-Bench, comprising five categories and 19 subtypes.

2.3 Automated Error Detection Pipeline

We introduce ConStory-Checker, an automated LLM-as-a-judge pipeline for scalable and auditable consistency evaluation. The framework advances from broad candidate mining to fine-grained error labeling through four stages:

  1. Category-Guided Extraction: Narratives are scanned using category-specific prompts across five dimensions to extract spans likely to contain contradictions.
  2. Contradiction Pairing: Extracted spans are compared pairwise and classified as Consistent or Contradictory, reducing false positives.
  3. Evidence Chains: Each contradiction is documented with structured chains: Reasoning, Evidence (quoted text with offsets), and Conclusion (typed error).
  4. JSON Reports: Standardized JSON outputs capture quotations, positions, pairings, error categories, and explanations for reproducible analysis.

We adopt o4-mini as the evaluation model to balance accuracy and efficiency. All judgments are anchored to precise spans with character-level offsets.
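The evidence-chain and JSON-report stages can be sketched as a simple record structure. The field names below (`category`, `subtype`, `reasoning`, `evidence`, `conclusion`, and the character offsets) are illustrative assumptions, not the pipeline's actual schema:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical record for one detected contradiction; field names
# are illustrative, not ConStory-Checker's actual output schema.
@dataclass
class EvidenceSpan:
    quote: str   # exact quoted text from the story
    start: int   # character-level offset where the quote begins
    end: int     # character-level offset where the quote ends

@dataclass
class ContradictionReport:
    category: str    # one of the five top-level error categories
    subtype: str     # one of the 19 fine-grained subtypes
    reasoning: str   # why the paired spans conflict
    evidence: list   # the paired spans (fact vs. contradiction)
    conclusion: str  # typed error label

report = ContradictionReport(
    category="Timeline & Plot Logic",
    subtype="Absolute Time Contradictions",
    reasoning="The wedding is dated to both spring and winter.",
    evidence=[EvidenceSpan("They married in April.", 1042, 1064),
              EvidenceSpan("Snow fell on their wedding day.", 8311, 8342)],
    conclusion="Absolute Time Contradiction",
)
print(json.dumps(asdict(report), indent=2))
```

Anchoring every judgment to quoted spans with offsets is what makes each report auditable: a reader can jump straight to the two passages and verify the contradiction.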

3. Evaluation

We evaluate proprietary, open-source, capability-enhanced, and agent-enhanced systems on all 2,000 prompts, and report five RQ-focused findings.

3.1 Experimental Setup

Primary metrics are CED (consistency errors per 10,000 words) and GRR (group-wise relative rank that controls for prompt difficulty). Lower is better for both.

3.2.1 RQ1: Model Consistency Benchmarks

Question: To what extent do current LLMs maintain narrative coherence in ultra-long text generation, and do different models exhibit similar distributions of consistency error types?

CED captures absolute error density; GRR adds prompt-level relative ranking for fairer cross-model comparison.

Metrics:

  • Consistency Error Density (CED): Measures errors per 10,000 words. Lower is better.
  • Group Relative Rank (GRR): Controls for instance difficulty through group-wise ranking. Lower is better.

Definition: CED normalizes error count by length; GRR ranks models within each prompt group using a length-aware quality score.
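Both metrics can be sketched in a few lines. The group scoring below uses CED itself as a stand-in for the paper's length-aware quality score, which is an assumption on our part:

```python
def ced(num_errors: int, num_words: int) -> float:
    """Consistency Error Density: errors per 10,000 words (lower is better)."""
    return num_errors / num_words * 10_000

def grr(scores_by_group: dict) -> dict:
    """Group Relative Rank: rank models within each prompt group by a
    quality score (lower = better), then average ranks per model.
    scores_by_group maps group_id -> {model_name: score}."""
    ranks = {}
    for group in scores_by_group.values():
        ordered = sorted(group, key=group.get)  # best model first
        for rank, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(rank)
    return {m: sum(r) / len(r) for m, r in ranks.items()}

print(ced(9, 8000))  # 9 errors in an 8,000-word story, roughly 11.25
groups = {"g1": {"A": 0.2, "B": 0.5}, "g2": {"A": 0.4, "B": 0.3}}
print(grr(groups))   # each model wins one group, so the averages tie
```

Normalizing by length (CED) keeps verbose models from being penalized merely for writing more, while group-wise ranking (GRR) keeps hard prompts from dominating the comparison.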

Full CED Results (All Evaluated Models)

Table: Comprehensive per-model results (lower is better for CED/GRR). Columns: overall CED plus per-category CED (Characterization, Factual, Narrative, Timeline, World-building), GRR, average word count, total errors, and total stories. The full numbers appear in the Leaderboard section.

Leaderboard Snapshot

Model Performance Leaderboard

Figure 3: GRR leaderboard by model family. Lower GRR indicates stronger consistency under fair relative ranking.

CED vs Average Output Length

Figure 4: CED versus average output length. Lower-left is ideal: fewer consistency errors with shorter-to-moderate outputs.

Note: The Leaderboard section below reports the same numbers with per-task CED and GRR breakdowns.

Finding 1. GPT-5-Reasoning remains the strongest model (CED 0.113), while most models still struggle with long-form consistency and show substantially higher errors under open-ended generation tasks.

3.2.2 RQ2: Output Length Dynamics

Question: How do consistency errors scale as a function of output length across different LLM architectures?

Length preferences differ sharply (e.g., GPT-5-Reasoning and Claude-Sonnet-4.5 are mostly 6K+, while GPT-4o-1120 is concentrated in 0-3K), and error counts increase near-linearly with longer outputs.

Length Distribution

Figure 5: Output length distribution. Stacked bars show the proportion of 0-3K, 3K-6K, and 6K+ word outputs across representative models.

Error Growth

Figure 6: Consistency error growth. Lines show average errors at each length; bars show sample counts.

Finding 2. Errors grow approximately linearly with length, but models have very different length preferences, producing distinct consistency-length trade-offs.
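The near-linear growth can be illustrated with an ordinary least-squares fit. The length/error pairs below are invented for illustration, not paper data:

```python
# Toy illustration of the near-linear errors-vs-length relationship.
# The (length, error-count) pairs are invented, not benchmark results.
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y ~ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

lengths = [2000, 4000, 6000, 8000, 10000]  # words per story
errors  = [1.0,  2.1,  2.9,  4.2,  5.0]    # avg consistency errors
a, b = linear_fit(lengths, errors)
print(f"~{a * 1000:.2f} extra errors per 1,000 words")
```

Under a linear error model, the slope is the quantity that matters for the consistency-length trade-off: a model that prefers 6K+ outputs pays the per-thousand-words error cost several times over compared with one concentrated in 0-3K.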

3.2.3 RQ3: Token Uncertainty Signals

Question: What underlying factors contribute to the emergence of consistency errors, and are there identifiable signals that reliably predict their occurrence?

Error-bearing spans show higher entropy and perplexity, and lower token probability, than whole-text baselines across representative models.

| Metric | Model | Whole Text | Error Content | Relative Difference (Error vs Whole) |
|---|---|---|---|---|
| Entropy (bits; higher = more uncertainty) | Qwen3-30B-A3B-Instruct-2507 | 1.1438 | 1.2814 | +12.03% |
| | Qwen3-4B-Instruct-2507 | 1.0734 | 1.2799 | +19.24% |
| Probability (higher = more confidence) | Qwen3-30B-A3B-Instruct-2507 | 0.6895 | 0.6522 | -5.41% |
| | Qwen3-4B-Instruct-2507 | 0.7097 | 0.6530 | -7.99% |
| Perplexity (lower = more predictable) | Qwen3-30B-A3B-Instruct-2507 | 1.8875 | 1.9354 | +2.54% |
| | Qwen3-4B-Instruct-2507 | 1.8566 | 1.9596 | +5.55% |
Table: Token-level uncertainty comparison across three metrics. Relative differences are computed as error-bearing content vs whole-text baseline.

Finding 3. Consistency failures concentrate in high-uncertainty regions: higher entropy/perplexity and lower token confidence provide actionable early-warning signals.
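All three uncertainty metrics can be computed directly from per-token probabilities. The probability values below are illustrative, not actual model outputs:

```python
import math

def entropy_bits(dist):
    """Shannon entropy of one next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def span_metrics(token_probs):
    """Mean probability and perplexity over the tokens actually generated.
    Perplexity is the exponential of the mean negative log-likelihood."""
    mean_p = sum(token_probs) / len(token_probs)
    ppl = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
    return mean_p, ppl

# Illustrative per-token probabilities, not real model outputs:
whole_text = [0.9, 0.8, 0.7, 0.85]  # confident tokens elsewhere in the story
error_span = [0.5, 0.4, 0.6]        # tokens inside an error-bearing span
for name, probs in [("whole", whole_text), ("error", error_span)]:
    p, ppl = span_metrics(probs)
    print(f"{name}: mean prob {p:.3f}, perplexity {ppl:.3f}")
```

The error span shows lower mean probability and higher perplexity than the whole-text baseline, the same direction as the table above, which is what makes these quantities usable as early-warning signals during generation.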

3.2.4 RQ4: Cross-Error Correlations

Question: Do different types of consistency errors systematically co-occur, or do they arise independently?

We present model-specific Pearson correlation matrices across eight representative models to show how cross-error coupling patterns differ by model family.

Model-specific Error Correlation Matrices

Figure 7: Model-specific error correlation matrices across eight representative models. Darker colors indicate stronger positive correlations between error categories.

Finding 4. Cross-category coupling is model-dependent: several models show strong Characterization-Factual and Factual-World links, while Narrative/Style remains comparatively sparse in most matrices.
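The cross-category analysis reduces to Pearson correlations over per-story error counts. The counts below are invented to illustrate the computation, not benchmark data:

```python
import math

# Toy per-story error counts for three categories (invented numbers),
# illustrating the cross-category Pearson correlation computation.
counts = {
    "Characterization": [2, 0, 3, 1, 4],
    "Factual":          [3, 1, 4, 1, 5],
    "Narrative":        [0, 2, 0, 1, 0],
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

cats = list(counts)
for a in cats:
    row = [f"{pearson(counts[a], counts[b]):+.2f}" for b in cats]
    print(a.ljust(16), " ".join(row))
```

In this toy data, Characterization and Factual counts rise and fall together across stories (strong positive correlation), mirroring the Characterization-Factual coupling reported above.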

3.2.5 RQ5: Positional Distribution of Errors

Question: How are consistency errors distributed across positions within long-form generated narratives?

We analyze normalized fact position, contradiction position, and their distance (gap) across representative error subtypes.

| Metric | Absolute Time Contradictions | Core Rules Violations | Quantitative Mismatches | Geographical Contradictions | Nomenclature Confusions | Memory Contradictions | Perspective Confusions |
|---|---|---|---|---|---|---|---|
| Avg Fact | 22.6% | 23.7% | 23.4% | 20.4% | 21.6% | 21.8% | 13.7% |
| Avg Contradiction | 48.9% | 39.4% | 40.6% | 39.2% | 34.4% | 38.2% | 12.2% |
| Avg Gap | 29.7% | 23.4% | 23.8% | 31.0% | 23.3% | 25.4% | 4.7% |

Table: Positions are normalized by story length. Contradictions are concentrated mostly in the 40-60% range, with category-specific gap patterns.

Dumbbell Plot of Positional Distributions

Figure 8: Dumbbell positional plot. Blue points are fact positions, red points are contradiction positions, and line length is the gap, shown across representative models.

Finding 5. Error locations are not uniform: fact mentions cluster earlier (about 15-30%), contradictions appear later (about 40-60%), and long-range gaps are largest for temporal/geographical inconsistencies.
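The positional statistics follow from normalizing character offsets by story length. The offsets below are illustrative, not measured values:

```python
# Normalized positional statistics for one detected contradiction pair.
# Offsets are illustrative; positions are character offsets divided by
# total story length, as in the positional analysis above.
def positional_stats(fact_offset, contradiction_offset, story_length):
    """Return (fact position, contradiction position, gap), all in [0, 1]."""
    fact_pos = fact_offset / story_length
    contra_pos = contradiction_offset / story_length
    return fact_pos, contra_pos, contra_pos - fact_pos

f, c, gap = positional_stats(fact_offset=9_000,
                             contradiction_offset=22_500,
                             story_length=45_000)
print(f"fact at {f:.0%}, contradiction at {c:.0%}, gap {gap:.0%}")
```

A fact established at 20% of the story contradicted at 50% gives a 30% gap, the long-range pattern the dumbbell plot visualizes for temporal and geographical errors.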

Conclusion

We presented ConStory-Bench, a benchmark, and ConStory-Checker, an evaluation pipeline, for assessing narrative consistency in long-form story generation. Our experiments show that current LLMs still produce systematic consistency errors, especially in factual tracking and temporal reasoning; moreover, these errors are not random but cluster in predictable narrative regions. We will provide an interactive portal where the community can discover and submit new consistency errors and checking techniques.

Citation

@article{constorybench2025,
  title={ConStory-Bench: A Comprehensive Benchmark for Evaluating Large Language Models on Long Story Consistency},
  author={Authors},
  journal={Journal},
  year={2025}
}

Leaderboard

CED (Consistency Error Density)

CED measures the number of consistency errors per 10,000 words. Lower values indicate better performance.

Columns: Rank, Model, Organization, CED per task (Generation, Continuation, Expansion, Completion), Average Words, Total Stories, and Overall CED.

Note: CED table reports task-level density for generation, continuation, expansion, and completion. Last updated: March 5, 2026.


GRR (Group Relative Rank)

GRR measures the average ranking across all story groups. Lower values indicate better performance.

Columns: Rank, Model, Organization, GRR per task (Generation, Continuation, Expansion, Completion), Average Words, Total Stories, and Overall GRR.

Note: GRR table uses task-type grouped relative ranking; lower values are better. Last updated: March 5, 2026.

Submit Your Results

Evaluate your model on ConStory-Bench and submit results to the leaderboard

Submission Guidelines

  1. Download the benchmark dataset from Hugging Face
  2. Run the evaluation scripts with your model
  3. Generate the results in the required JSON format
  4. Submit a pull request to our GitHub repository with your results
  5. Include model details, parameters, and any specific configurations used
View Submission Guide →

Paper & Resources

Research paper and additional resources (Coming Soon)

đź“„ Research paper will be available soon