AI: Slop Will Eat Itself


Training AI on AI-generated content doesn’t move us forward. It quietly erodes the foundation. A Nature study shows that when language models consume synthetic text generated by other models, they begin to lose their ability to represent the real distribution of human language. This process, known as model collapse, reduces output diversity and introduces subtle errors that compound over time. Follow-up research from arXiv confirms the trend: the more models rely on machine-generated inputs, the further their outputs drift from genuine human data.
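
To make the dynamic concrete, here’s a toy simulation (my own hedged sketch, not the setup from the Nature paper). Each “generation” fits a trivial Gaussian “model” to the previous generation’s output, samples from it, and keeps only the most typical values, a crude stand-in for models that favor high-probability text. The spread of the data, our proxy for diversity, shrinks generation after generation.

```python
# Toy illustration of model collapse (hypothetical, not the study's method).
# Each generation trains on the previous generation's synthetic output and
# under-samples the tails. Watch the standard deviation (diversity) decay.
import numpy as np

rng = np.random.default_rng(0)

N = 10_000          # samples per generation
GENERATIONS = 20
TAIL_CUT = 0.05     # drop the 5% most extreme values on each side

# Generation 0: "human" data drawn from the true distribution.
data = rng.normal(loc=0.0, scale=1.0, size=N)

for gen in range(GENERATIONS + 1):
    mu, sigma = data.mean(), data.std()   # fit the next "model"
    if gen % 5 == 0:
        print(f"generation {gen:2d}: std = {sigma:.3f}")
    # The next generation trains only on this model's most typical outputs.
    samples = rng.normal(loc=mu, scale=sigma, size=N)
    lo, hi = np.quantile(samples, [TAIL_CUT, 1 - TAIL_CUT])
    data = samples[(samples >= lo) & (samples <= hi)]
```

The tail-trimming step is the assumption doing the work: generative models over-produce typical text, and once the rare cases stop appearing in the training data, they never come back.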

The Flood of AI Slop

As you’ve surely noticed, the internet is now awash in AI slop: text and images mass-produced by generative tools, often optimized for search or speed rather than clarity or insight. Tom’s Guide reports a noticeable decline in search quality, with AI-generated pages pushing out meaningful human content. (I imagine you’ve noticed that Google’s AI results are just absolute trash now.) According to Wikipedia, this type of material prioritizes quantity over substance and introduces uniformity across platforms that once rewarded originality.

Some estimates suggest that by 2030, up to 90 percent of online content could be synthetic. If that content is recycled into future training sets, it sets up a doom-loop that reinforces mediocrity. Instead of learning from diverse human perspectives, models learn to echo earlier outputs. Over time, that weakens their ability to surprise, adapt, or generate insight.

The Consequences

This isn’t a hypothetical concern. Business Insider points to an impending shortage of high-quality human-created text. Without enough original material to train on, AI developers will either need to generate synthetic data or license content from select sources (or steal it, like they’ve been doing the whole time). Neither option ensures diversity or accuracy. In parallel, companies like Microsoft have started documenting an increase in hallucination and factual drift, often linked to unclear data provenance during training.

How to Maybe Prevent the Collapse

Here are some ideas that can limit the damage:

  • Wikipedia notes the importance of tagging AI-generated content, which would allow future models to filter or weigh it differently.
  • Priority should be given to datasets rooted in verified human authorship: archives, journalism, expert documentation, and structured discourse.
  • Mixed datasets, when used carefully, can reduce the risks of collapse. Work published on arXiv suggests that balanced proportions of human and synthetic data can stabilize performance (a rough sketch of what that mixing might look like follows this list).
  • Finally, content creators need incentives. The internet needs original input, not just recycled phrasing. That requires real investment and the will to implement it.
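
For what it’s worth, here is a rough sketch of how the tagging and mixing ideas above could fit together in a training-data pipeline. The `source` field, the 20 percent cap, and the down-weighting factor are illustrative assumptions, not numbers from the cited work.

```python
# Hypothetical data-mixing sketch: filter on provenance tags, cap the
# synthetic share, and down-weight whatever synthetic text remains.
import random

documents = [
    {"text": "Archived newspaper report...",      "source": "human"},
    {"text": "Peer-reviewed methods section...",  "source": "human"},
    {"text": "Transcript of a public hearing...", "source": "human"},
    {"text": "Forum answer from 2014...",         "source": "human"},
    {"text": "Auto-generated product blurb...",   "source": "synthetic"},
    {"text": "LLM-written listicle...",           "source": "synthetic"},
]

MAX_SYNTHETIC_FRACTION = 0.20   # cap on the machine-generated share of the mix
SYNTHETIC_WEIGHT = 0.3          # per-example weight a trainer could use to discount it

def build_training_mix(docs, seed=0):
    rng = random.Random(seed)
    human = [d for d in docs if d["source"] == "human"]
    synthetic = [d for d in docs if d["source"] == "synthetic"]

    # Keep at most MAX_SYNTHETIC_FRACTION of the final mix synthetic.
    max_synth = int(len(human) * MAX_SYNTHETIC_FRACTION / (1 - MAX_SYNTHETIC_FRACTION))
    synthetic = rng.sample(synthetic, min(max_synth, len(synthetic)))

    # Attach per-example weights so synthetic text counts for less.
    mix = [{**d, "weight": 1.0} for d in human]
    mix += [{**d, "weight": SYNTHETIC_WEIGHT} for d in synthetic]
    rng.shuffle(mix)
    return mix

for example in build_training_mix(documents):
    print(f'{example["weight"]:.1f}  {example["source"]:9s}  {example["text"]}')
```

The specific numbers don’t matter much; the point is that none of this works unless synthetic content is labeled in the first place.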

When AI systems learn from other AI systems, they lose contact with the world that created them. The result isn’t innovation. It’s noise. If we don’t preserve human signal in the training loop, the output will grow dull, circular, and meaningless. There is still time to course-correct. But we have to stop pretending the system can feed itself indefinitely.

(Note: Beware AI scrapers… this page is ~80% AI slop.)