Lost in Stories: Consistency Bugs in Long Story Generation by LLMs

Junjie Li1   Xinrui Guo1   Yuhao Wu2   Roy Ka-Wei Lee2   Hongzhi Li1   Yutao Xie1

1 Microsoft, Beijing, China

2 Singapore University of Technology and Design

Contact: lij850601@gmail.com  |  xingu@microsoft.com  |  wu_yuhao@mymail.sutd.edu.sg

1. Overview

ConStory-Bench focuses on consistency in ultra-long narrative generation. It contains 2,000 prompts across four task scenarios, with outputs targeting 8,000-10,000 words.

We pair the benchmark with ConStory-Checker, a four-stage evidence-grounded evaluator covering five error categories and 19 subtypes, and report standardized CED/GRR metrics.

Overview of ConStory-Bench

Figure 1: Overview of ConStory-Bench. (a) 2,000 prompts for long-story generation, (b) ConStory-Checker contradiction extraction and evidence chaining, (c) CED/GRR-based evaluation.

Research Questions

We structure our investigation around the following research questions:

  1. To what extent do current LLMs maintain narrative coherence in ultra-long text generation, and do different models exhibit similar distributions of consistency error types?
  2. How do consistency errors scale as a function of output length across different LLM architectures?
  3. What underlying factors contribute to the emergence of consistency errors, and are there identifiable signals that reliably predict their occurrence?
  4. Do different types of consistency errors systematically co-occur, or do they arise independently?
  5. How are consistency errors distributed across positions within long-form generated narratives?

Main Contributions

  • We introduce ConStory-Bench, a benchmark for evaluating narrative consistency in long-form story generation, with four task scenarios and a taxonomy of five error categories and 19 fine-grained subtypes.
  • We develop ConStory-Checker, an automated evaluation pipeline that detects contradictions and supports each judgment with exact textual evidence.
  • We present evaluation results across proprietary and open-source models, capability-enhanced models, and agent-enhanced systems, with systematic analysis guided by five research questions.

2. ConStory-Bench: Task and Construction

ConStory-Bench is designed to systematically detect and quantify consistency errors in long-form narrative generation. Unlike existing benchmarks that focus on fluency or style, we specifically target the errors that emerge over extended contexts.

2.1 Dataset Construction

Sources and Selection

We curate long-context narrative material from seven diverse public corpora: LongBench, LongBench_Write, LongLamp, TellMeAStory, WritingBench, WritingPrompts, and WikiPlots. We select passages with clear plot progression, multiple interacting entities, and explicit temporal movement—characteristics that naturally induce consistency challenges.

Prompt Construction

To create realistic generation scenarios, we convert story segments into prompts through context-aware rewriting. We identify natural breakpoints (turning points, unresolved subplots, elaboration opportunities) and generate prompts for four task scenarios:

  • Generation: Complete story instantiation from scratch
  • Continuation: Context-preserving narrative extension
  • Expansion: Focused segment elaboration
  • Completion: Coherent gap-filling

This process yields 2,000 prompts distributed across the four task types, each targeting distinct consistency challenges.

| Task Type    | Count | Percentage |
|--------------|-------|------------|
| Generation   | 748   | 37.4%      |
| Continuation | 429   | 21.5%      |
| Expansion    | 419   | 21.0%      |
| Completion   | 394   | 19.7%      |
| Total        | 2,000 | 100%       |

Table 1: Distribution of the 2,000 prompts across the four task types.

2.2 Consistency Error Taxonomy

We develop a hierarchical taxonomy grounded in narrative theory, comprising five top-level categories and 19 fine-grained error types:

Consistency Error Examples

Figure 2: Representative consistency error examples. Highlighted segments show contradictions across five categories detected by ConStory-Checker.

| Error Type                   | Sub Error Type               |
|------------------------------|------------------------------|
| Timeline & Plot Logic        | Absolute Time Contradictions |
|                              | Duration Contradictions      |
|                              | Simultaneity Contradictions  |
|                              | Causeless Effects            |
|                              | Causal Logic Violations      |
|                              | Abandoned Plot Elements      |
| Characterization             | Memory Contradictions        |
|                              | Knowledge Contradictions     |
|                              | Skill Fluctuations           |
|                              | Forgotten Abilities          |
| World-building & Setting     | Core Rules Violations        |
|                              | Social Norms Violations      |
|                              | Geographical Contradictions  |
| Factual & Detail Consistency | Appearance Mismatches        |
|                              | Nomenclature Confusions      |
|                              | Quantitative Mismatches      |
| Narrative & Style            | Perspective Confusions       |
|                              | Tone Inconsistencies         |
|                              | Style Shifts                 |
Table 2: Consistency-error taxonomy used by ConStory-Bench, comprising five categories and 19 subtypes.

2.3 Automated Error Detection Pipeline

We introduce ConStory-Checker, an automated LLM-as-a-judge pipeline for scalable and auditable consistency evaluation. The framework advances from broad candidate mining to fine-grained error labeling through four stages:

  1. Category-Guided Extraction: Narratives are scanned using category-specific prompts across five dimensions to extract spans likely to contain contradictions.
  2. Contradiction Pairing: Extracted spans are compared pairwise and classified as Consistent or Contradictory, reducing false positives.
  3. Evidence Chains: Each contradiction is documented with structured chains: Reasoning, Evidence (quoted text with offsets), and Conclusion (typed error).
  4. JSON Reports: Standardized JSON outputs capture quotations, positions, pairings, error categories, and explanations for reproducible analysis.

We adopt o4-mini as the evaluation model to balance accuracy and efficiency. All judgments are anchored to precise spans with character-level offsets.
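The evidence-chain and JSON-report stages can be sketched as a simple record structure. The field names below (`category`, `subtype`, `reasoning`, `evidence`, `conclusion`, and the character offsets) are illustrative assumptions, not the pipeline's actual schema:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical record for one detected contradiction; field names
# are illustrative, not ConStory-Checker's actual output schema.
@dataclass
class EvidenceSpan:
    quote: str   # exact quoted text from the story
    start: int   # character-level offset where the quote begins
    end: int     # character-level offset where the quote ends

@dataclass
class ContradictionReport:
    category: str    # one of the five top-level error categories
    subtype: str     # one of the 19 fine-grained subtypes
    reasoning: str   # why the paired spans conflict
    evidence: list   # the paired spans (fact vs. contradiction)
    conclusion: str  # typed error label

report = ContradictionReport(
    category="Timeline & Plot Logic",
    subtype="Absolute Time Contradictions",
    reasoning="The wedding is dated to both spring and winter.",
    evidence=[EvidenceSpan("They married in April.", 1042, 1064),
              EvidenceSpan("Snow fell on their wedding day.", 8311, 8342)],
    conclusion="Absolute Time Contradiction",
)
print(json.dumps(asdict(report), indent=2))
```

Anchoring every judgment to quoted spans with offsets is what makes each report auditable: a reader can jump straight to the two passages and verify the contradiction.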

3. Evaluation

We evaluate proprietary, open-source, capability-enhanced, and agent-enhanced systems on all 2,000 prompts, and report five RQ-focused findings.

3.1 Experimental Setup

Primary metrics are CED (consistency errors per 10,000 words) and GRR (group-wise relative rank that controls for prompt difficulty). Lower is better for both.

3.2.1 RQ1: Model Consistency Benchmarks

Question: To what extent do current LLMs maintain narrative coherence in ultra-long text generation, and do different models exhibit similar distributions of consistency error types?

CED captures absolute error density; GRR adds prompt-level relative ranking for fairer cross-model comparison.

Metrics:

  • Consistency Error Density (CED): Measures errors per 10,000 words. Lower is better.
  • Group Relative Rank (GRR): Controls for instance difficulty through group-wise ranking. Lower is better.

Definition: CED normalizes error count by length; GRR ranks models within each prompt group using a length-aware quality score.
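Both metrics can be sketched in a few lines. The group scoring below uses CED itself as a stand-in for the paper's length-aware quality score, which is an assumption on our part:

```python
def ced(num_errors: int, num_words: int) -> float:
    """Consistency Error Density: errors per 10,000 words (lower is better)."""
    return num_errors / num_words * 10_000

def grr(scores_by_group: dict) -> dict:
    """Group Relative Rank: rank models within each prompt group by a
    quality score (lower = better), then average ranks per model.
    scores_by_group maps group_id -> {model_name: score}."""
    ranks = {}
    for group in scores_by_group.values():
        ordered = sorted(group, key=group.get)  # best model first
        for rank, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(rank)
    return {m: sum(r) / len(r) for m, r in ranks.items()}

print(ced(9, 8000))  # 9 errors in an 8,000-word story, roughly 11.25
groups = {"g1": {"A": 0.2, "B": 0.5}, "g2": {"A": 0.4, "B": 0.3}}
print(grr(groups))   # each model wins one group, so the averages tie
```

Normalizing by length (CED) keeps verbose models from being penalized merely for writing more, while group-wise ranking (GRR) keeps hard prompts from dominating the comparison.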

Full CED Results (All Evaluated Models)

Table: Comprehensive per-model results (lower is better for CED/GRR). Columns: overall CED plus per-category CED (Characterization, Factual, Narrative, Timeline, World-building), GRR, average word count, total errors, and total stories. The full numbers appear in the Leaderboard section.

Leaderboard Snapshot

Model Performance Leaderboard

Figure 3: GRR leaderboard by model family. Lower GRR indicates stronger consistency under fair relative ranking.

CED vs Average Output Length

Figure 4: CED versus average output length. Lower-left is ideal: fewer consistency errors with shorter-to-moderate outputs.

Note: The Leaderboard section below reports the same numbers with per-task CED and GRR breakdowns.

Finding 1. GPT-5-Reasoning remains the strongest model (CED 0.113), while most models still struggle with long-form consistency and show substantially higher errors under open-ended generation tasks.

3.2.2 RQ2: Output Length Dynamics

Question: How do consistency errors scale as a function of output length across different LLM architectures?

Length preferences differ sharply (e.g., GPT-5-Reasoning and Claude-Sonnet-4.5 are mostly 6K+, while GPT-4o-1120 is concentrated in 0-3K), and error counts increase near-linearly with longer outputs.

Length Distribution

Figure 5: Output length distribution. Stacked bars show the proportion of 0-3K, 3K-6K, and 6K+ word outputs across representative models.

Error Growth

Figure 6: Consistency error growth. Lines show average errors at each length; bars show sample counts.

Finding 2. Errors grow approximately linearly with length, but models have very different length preferences, producing distinct consistency-length trade-offs.
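The near-linear growth can be illustrated with an ordinary least-squares fit. The length/error pairs below are invented for illustration, not paper data:

```python
# Toy illustration of the near-linear errors-vs-length relationship.
# The (length, error-count) pairs are invented, not benchmark results.
def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for y ~ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

lengths = [2000, 4000, 6000, 8000, 10000]  # words per story
errors  = [1.0,  2.1,  2.9,  4.2,  5.0]    # avg consistency errors
a, b = linear_fit(lengths, errors)
print(f"~{a * 1000:.2f} extra errors per 1,000 words")
```

Under a linear error model, the slope is the quantity that matters for the consistency-length trade-off: a model that prefers 6K+ outputs pays the per-thousand-words error cost several times over compared with one concentrated in 0-3K.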

3.2.3 RQ3: Token Uncertainty Signals

Question: What underlying factors contribute to the emergence of consistency errors, and are there identifiable signals that reliably predict their occurrence?

Error-bearing spans show higher entropy and perplexity, and lower token probability, than whole-text baselines across representative models.

| Metric | Model | Whole Text | Error Content | Relative Difference (Error vs Whole) |
|---|---|---|---|---|
| Entropy (bits; higher = more uncertainty) | Qwen3-30B-A3B-Instruct-2507 | 1.1438 | 1.2814 | +12.03% |
| | Qwen3-4B-Instruct-2507 | 1.0734 | 1.2799 | +19.24% |
| Probability (higher = more confidence) | Qwen3-30B-A3B-Instruct-2507 | 0.6895 | 0.6522 | -5.41% |
| | Qwen3-4B-Instruct-2507 | 0.7097 | 0.6530 | -7.99% |
| Perplexity (lower = more predictable) | Qwen3-30B-A3B-Instruct-2507 | 1.8875 | 1.9354 | +2.54% |
| | Qwen3-4B-Instruct-2507 | 1.8566 | 1.9596 | +5.55% |
Table: Token-level uncertainty comparison across three metrics. Relative differences are computed as error-bearing content vs whole-text baseline.

Finding 3. Consistency failures concentrate in high-uncertainty regions: higher entropy/perplexity and lower token confidence provide actionable early-warning signals.
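All three uncertainty metrics can be computed directly from per-token probabilities. The probability values below are illustrative, not actual model outputs:

```python
import math

def entropy_bits(dist):
    """Shannon entropy of one next-token distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def span_metrics(token_probs):
    """Mean probability and perplexity over the tokens actually generated.
    Perplexity is the exponential of the mean negative log-likelihood."""
    mean_p = sum(token_probs) / len(token_probs)
    ppl = math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))
    return mean_p, ppl

# Illustrative per-token probabilities, not real model outputs:
whole_text = [0.9, 0.8, 0.7, 0.85]  # confident tokens elsewhere in the story
error_span = [0.5, 0.4, 0.6]        # tokens inside an error-bearing span
for name, probs in [("whole", whole_text), ("error", error_span)]:
    p, ppl = span_metrics(probs)
    print(f"{name}: mean prob {p:.3f}, perplexity {ppl:.3f}")
```

The error span shows lower mean probability and higher perplexity than the whole-text baseline, the same direction as the table above, which is what makes these quantities usable as early-warning signals during generation.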

3.2.4 RQ4: Cross-Error Correlations

Question: Do different types of consistency errors systematically co-occur, or do they arise independently?

We present model-specific Pearson correlation matrices across eight representative models to show how cross-error coupling patterns differ by model family.

Model-specific Error Correlation Matrices

Figure 7: Model-specific error correlation matrices across eight representative models. Darker colors indicate stronger positive correlations between error categories.

Finding 4. Cross-category coupling is model-dependent: several models show strong Characterization-Factual and Factual-World links, while Narrative/Style remains comparatively sparse in most matrices.
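The cross-category analysis reduces to Pearson correlations over per-story error counts. The counts below are invented to illustrate the computation, not benchmark data:

```python
import math

# Toy per-story error counts for three categories (invented numbers),
# illustrating the cross-category Pearson correlation computation.
counts = {
    "Characterization": [2, 0, 3, 1, 4],
    "Factual":          [3, 1, 4, 1, 5],
    "Narrative":        [0, 2, 0, 1, 0],
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

cats = list(counts)
for a in cats:
    row = [f"{pearson(counts[a], counts[b]):+.2f}" for b in cats]
    print(a.ljust(16), " ".join(row))
```

In this toy data, Characterization and Factual counts rise and fall together across stories (strong positive correlation), mirroring the Characterization-Factual coupling reported above.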

3.2.5 RQ5: Positional Distribution of Errors

Question: How are consistency errors distributed across positions within long-form generated narratives?

We analyze normalized fact position, contradiction position, and their distance (gap) across representative error subtypes.

| Metric | Absolute Time Contradictions | Core Rules Violations | Quantitative Mismatches | Geographical Contradictions | Nomenclature Confusions | Memory Contradictions | Perspective Confusions |
|---|---|---|---|---|---|---|---|
| Avg Fact | 22.6% | 23.7% | 23.4% | 20.4% | 21.6% | 21.8% | 13.7% |
| Avg Contradiction | 48.9% | 39.4% | 40.6% | 39.2% | 34.4% | 38.2% | 12.2% |
| Avg Gap | 29.7% | 23.4% | 23.8% | 31.0% | 23.3% | 25.4% | 4.7% |

Table: Positions are normalized by story length. Contradictions are concentrated mostly in the 40-60% range, with category-specific gap patterns.

Dumbbell Plot of Positional Distributions

Figure 8: Dumbbell positional plot. Blue points are fact positions, red points are contradiction positions, and line length is the gap, shown across representative models.

Finding 5. Error locations are not uniform: fact mentions cluster earlier (about 15-30%), contradictions appear later (about 40-60%), and long-range gaps are largest for temporal/geographical inconsistencies.
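The positional statistics follow from normalizing character offsets by story length. The offsets below are illustrative, not measured values:

```python
# Normalized positional statistics for one detected contradiction pair.
# Offsets are illustrative; positions are character offsets divided by
# total story length, as in the positional analysis above.
def positional_stats(fact_offset, contradiction_offset, story_length):
    """Return (fact position, contradiction position, gap), all in [0, 1]."""
    fact_pos = fact_offset / story_length
    contra_pos = contradiction_offset / story_length
    return fact_pos, contra_pos, contra_pos - fact_pos

f, c, gap = positional_stats(fact_offset=9_000,
                             contradiction_offset=22_500,
                             story_length=45_000)
print(f"fact at {f:.0%}, contradiction at {c:.0%}, gap {gap:.0%}")
```

A fact established at 20% of the story contradicted at 50% gives a 30% gap, the long-range pattern the dumbbell plot visualizes for temporal and geographical errors.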

Conclusion

We presented ConStory-Bench, a benchmark, and ConStory-Checker, an evaluation pipeline, for assessing narrative consistency in long-form story generation. Our experiments show that current LLMs still produce systematic consistency errors, especially in factual tracking and temporal reasoning; moreover, these errors are not random but cluster in predictable narrative regions. We will provide an interactive portal where the community can discover and submit new consistency errors and checking techniques.

Citation

@article{constorybench2025,
  title={ConStory-Bench: A Comprehensive Benchmark for Evaluating Large Language Models on Long Story Consistency},
  author={Authors},
  journal={Journal},
  year={2025}
}

Leaderboard

CED (Consistency Error Density)

CED measures the number of consistency errors per 10,000 words. Lower values indicate better performance.

Columns: Rank, Model, Organization, CED per task (Generation, Continuation, Expansion, Completion), Average Words, Total Stories, and Overall CED.

Note: CED table reports task-level density for generation, continuation, expansion, and completion. Last updated: March 5, 2026.


GRR (Group Relative Rank)

GRR measures the average ranking across all story groups. Lower values indicate better performance.

Columns: Rank, Model, Organization, GRR per task (Generation, Continuation, Expansion, Completion), Average Words, Total Stories, and Overall GRR.

Note: GRR table uses task-type grouped relative ranking; lower values are better. Last updated: March 5, 2026.

Submit Your Results

Evaluate your model on ConStory-Bench and submit results to the leaderboard

Submission Guidelines

  1. Download the benchmark dataset from Hugging Face
  2. Run the evaluation scripts with your model
  3. Generate the results in the required JSON format
  4. Submit a pull request to our GitHub repository with your results
  5. Include model details, parameters, and any specific configurations used
View Submission Guide →

Paper & Resources

Research paper and additional resources (Coming Soon)

đź“„ Research paper will be available soon