
Echoes in AI: Why Large Language Models Struggle with Plot Diversity



AI creativity compared to human imagination: while large language models generate patterns based on training data, human creativity draws from unique experiences and emotions.

Setting the Stage: AI and Creativity


Large language models (LLMs) are advancing at a remarkable speed. From writing stories and poetry to brainstorming ideas, their presence in creative work is undeniable. But here lies a fundamental question: can these models genuinely support collective creativity, or do they simply mimic it?


Past studies paint a mixed picture. On one hand, LLMs have shown promise in creative writing, poetry, idea generation, and even creative thinking. On the other, research highlights that their creativity lags behind humans, and their stories are often judged as ‘predictably poor’.


This tension extends to collective contexts. While AI may help an individual feel more creative, it can reduce the diversity of ideas across groups (Doshi & Hauser, 2024; Padmakumar & He, 2024). Teachers often see this in practice: the first AI-generated essay might feel impressive, but by the tenth, patterns repeat. What once seemed “creative” now looks formulaic.


When Stories Start to Echo


The authors of this study describe these repetitions as “echoes.” Take Franz Kafka’s short story Give It Up. GPT-4 was asked to generate 100 continuations, and none resembled Kafka’s actual surprising ending. Instead, 50 advised the protagonist to “take the second left,” 18 said “take the second right,” others had the policeman guide the narrator, and 16 mentioned a bakery. These echoes spanned both wording and narrative structure, showing how models recycle at different levels.


Earlier work has already pointed out reduced lexical diversity in LLMs (Mohammadi, 2024). This study shows the issue runs deeper—echoes appear at the plot level, where whole story elements repeat.
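To make the distinction concrete, here is a minimal sketch of a surface-level lexical diversity measure, a distinct-n ratio (one common choice; the exact metric in Mohammadi, 2024 may differ). Two continuations can use quite different words, and therefore score well on it, while still retelling the same plot, which is precisely the gap a plot-level measure needs to close.

from collections import Counter

def distinct_n(texts, n=2):
    # Fraction of n-grams that are unique across a set of texts.
    # A purely lexical measure: it rewards varied wording but is blind
    # to two stories that share the same plot in different words.
    ngrams = []
    for text in texts:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    if not ngrams:
        return 0.0
    return len(Counter(ngrams)) / len(ngrams)

# Two continuations with different wording but the same plot "echo":
a = "The policeman smiled and told him to take the second left."
b = "Grinning, the officer advised the man to turn at the second street on the left."
print(distinct_n([a, b], n=2))  # fairly high, despite the shared plot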


Measuring Uniqueness: The Sui Generis Score


To quantify originality, the researchers introduce the Sui Generis score (Latin for “of its own kind”). The idea is straightforward:

  1. Break a story into segments.

  2. Generate multiple alternative continuations.

  3. Check how often each segment recurs across the outputs.

Frequently repeated segments score low, while unique twists score high. For instance, in a human-written story where people received mysterious notes reading “DUCK!”, the final twist note saying “Goose” earned a high score. By contrast, the predictable middle developments scored low.
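As a rough illustration, here is a minimal Python sketch of that procedure. The paper relies on GPT-4 to judge whether a segment recurs semantically across the alternative continuations; this toy version substitutes a simple string-similarity check from difflib, so the threshold, the example texts, and the exact scoring formula are assumptions made for illustration rather than the authors' implementation.

from difflib import SequenceMatcher

def segment_recurs(segment, continuation, threshold=0.6):
    # Crude stand-in for the paper's GPT-4 semantic judgment:
    # does any sentence of this continuation resemble the segment?
    for sentence in continuation.split("."):
        if SequenceMatcher(None, segment.lower(), sentence.lower().strip()).ratio() >= threshold:
            return True
    return False

def sui_generis_scores(story_segments, alternative_continuations):
    # Toy per-segment score: 1 minus the fraction of alternative continuations
    # in which a similar segment appears. Frequently echoed segments score
    # near 0; unique twists score near 1.
    scores = []
    for seg in story_segments:
        hits = sum(segment_recurs(seg, alt) for alt in alternative_continuations)
        scores.append(1 - hits / len(alternative_continuations))
    return scores

# Hypothetical example in the spirit of the Kafka experiment:
segments = [
    "The policeman tells him to take the second left",
    "The note at the end simply reads Goose",
]
alternatives = [
    "He asked again, and the policeman told him to take the second left.",
    "The officer pointed the way: take the second left, past the bakery.",
    "He wandered on alone and never found the station.",
]
print(sui_generis_scores(segments, alternatives))

In the real metric, the alternative continuations are themselves generated by an LLM from the story so far, and the recurrence judgment is far more tolerant of rephrasing than this crude check.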

Behind the Experiment

The team tested 100 stories drawn from two datasets:

  • Writing Prompts (Reddit) – human-written stories responding to prompts.

  • Wikipedia TV episode summaries – condensed versions of real narrative arcs.

Both human and AI stories were segmented into equal-length chunks. Two state-of-the-art models were used: GPT-4 (OpenAI, 2024) and LLaMA-3 (Pandey et al., 2024). This setup allowed a direct comparison between human and machine creativity.
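The write-up is light on how equal-length chunks are produced; a minimal sketch, assuming sentences as the basic unit and a fixed number of chunks per story (both simplifying assumptions, not the paper's exact recipe):

import re

def split_into_chunks(story, num_chunks=5):
    # Split into sentences, then group them into roughly equal-sized chunks.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", story) if s.strip()]
    size = max(1, len(sentences) // num_chunks)
    chunks = [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]
    if len(chunks) > num_chunks:
        # Fold any small remainder into the final chunk.
        chunks[num_chunks - 1] = " ".join(chunks[num_chunks - 1:])
        chunks = chunks[:num_chunks]
    return chunks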

Storytelling diversity highlights the contrast between human narratives, full of unexpected twists, and AI-generated stories, often echoing similar plot structures.

What the Numbers Say


Human stories consistently scored higher, demonstrating more originality and unpredictability than LLM outputs. In addition, echoes were not confined to a single model. Outputs from GPT-4 and LLaMA-3 showed significant overlap, revealing a broader pattern of homogenization across systems.

Pacing also set humans apart. While human stories built suspense gradually, LLM stories often rushed to resolution. This appeared in sudden “score drops” after high points, signaling abrupt and unsatisfying endings.

Finally, human evaluations confirmed the metric’s validity. Ratings of surprise correlated moderately (ρ = 0.55) with Sui Generis scores, showing the measure aligns with human intuitions about novelty.
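For readers who want to run the same kind of validation on their own data, the correlation itself is a one-liner with scipy; the values below are invented for illustration, not taken from the study.

from scipy.stats import spearmanr

# Hypothetical per-segment data: human surprise ratings (1-5 scale)
# and the corresponding Sui Generis scores.
human_surprise = [1, 2, 2, 4, 5, 3, 1, 4]
sui_generis = [0.10, 0.25, 0.20, 0.70, 0.95, 0.40, 0.15, 0.60]

rho, p_value = spearmanr(human_surprise, sui_generis)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")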


Why It Matters


Standard similarity measures like BLEU, ROUGE, or embedding-based metrics often capture only surface-level overlaps (Ghosal et al., 2022; Shaib et al., 2024). The Sui Generis score digs deeper, aligning more closely with human judgments of creativity.


Patterns in scoring revealed further insights. Low-scoring LLM stories tended to be bland or clichéd. High-scoring ones, whether human or AI, contained richer detail and unconventional plots. But humans still outperformed models by using nonlinear structures like flashbacks, something LLMs rarely achieved.


Scores also highlighted narrative turning points: surprises and twists tended to cluster where scores peaked, while predictable filler earned low marks. This shows the metric can serve both as an evaluator and as a way to map a story’s structure.


Though the method is computationally costly (around 910 model calls per story, roughly $7), falling inference prices make this less of a barrier over time. Beyond evaluation, the Sui Generis score can also guide generation itself, prioritizing unique segments to create more original stories, even if that requires extra computation.
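The arithmetic behind that cost figure is easy to sketch from the numbers reported above (the per-call price is derived here, not stated by the authors):

calls_per_story = 910           # reported by the authors
cost_per_story_usd = 7.0        # reported by the authors
avg_cost_per_call = cost_per_story_usd / calls_per_story
print(f"Implied average cost per call: ~${avg_cost_per_call:.4f}")

# If inference prices fall, the per-story cost scales down proportionally.
for fraction_of_today in (0.5, 0.1):
    print(f"At {fraction_of_today:.0%} of today's prices: ~${cost_per_story_usd * fraction_of_today:.2f} per story")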



Large language models (LLMs) such as GPT-4 and LLaMA frequently produce overlapping storylines, a phenomenon described in the study as narrative ‘echoes’.

Limitations to Keep in Mind


Two caveats stand out. Some datasets may overlap with LLM training data, though memorization did not appear to drive results here. And the method depends on GPT-4 for semantic judgment; weaker models may distort scores, so reliability requires models with comparable language ability.


Looking Ahead


The takeaway is clear: in isolation, LLM outputs may feel novel, but across many samples, echoes dominate. Human stories remain more diverse, richer, and less predictable.


The Sui Generis score captures this gap. It not only quantifies uniqueness in storytelling but also correlates with human perceptions of surprise. Its potential reaches beyond text, offering a framework that could be adapted to music, images, or video.


At a broader level, the study raises cultural concerns. Homogenized AI outputs risk narrowing creative expression and reducing diversity. Yet the same metric also offers hope—a tool to diagnose and counteract these patterns, nudging AI toward more diverse and original storytelling.





Reference:

Doshi, A. R., & Hauser, O. P. (2024). Generative AI enhances individual creativity but reduces the collective diversity of novel content. Science Advances, 10(28), eadn5290. https://doi.org/10.1126/sciadv.adn5290


Ghosal, T., Saikh, T., Biswas, T., Ekbal, A., & Bhattacharyya, P. (2022). Novelty Detection: A Perspective from Natural Language Processing. Computational Linguistics, 48(1), 77–117. https://doi.org/10.1162/coli_a_00429


Mohammadi, B. (2024). Creativity Has Left the Chat: The Price of Debiasing Language Models. arXiv.


OpenAI, Adler, S., Agarwal, S., Ahmad, L., Almeida, D., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Berner, C., Brakman, A.-L., Brundage, M., Cai, T., Campbell, R., Chan, B., Chantzis, F., Chen, S., … Zhao, S. (2024). GPT-4 Technical Report. arXiv.


Padmakumar, V., & He, H. (2024). Does Writing with Language Models Reduce Content Diversity? arXiv.


Pandey, A., Letman, A., Yang, A., Fan, A., Rao, A., Spataru, A., Marra, C., Wong, C., Song, D., Wyatt, D., Lakomkin, E., Radenovic, F., Synnaeve, G., Anderson, G. L., Zarov, I., Copet, J., Mahadeokar, J., van der Linde, J., Hong, J., … Rait, Z. (2024). The Llama 3 Herd of Models. arXiv.


Shaib, C., Barrow, J., Sun, J., Siu, A. F., Wallace, B. C., & Nenkova, A. (2024). Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores. arXiv.

