The great data drought in artificial intelligence

Training AI models faces a critical limitation: the lack of high-quality public data.

Artificial intelligence is advancing rapidly. Every few months, more powerful and accurate models appear, capable of generating text, images, code, and complex reasoning. Behind this progress, however, lies a critical factor that receives far less attention: the scarcity of quality data.

For years, we’ve repeated the idea that we live in an “age of information overload.” Yet AI models can’t use just any content—they require data that is abundant, diverse, and, above all, high-quality. And such data, especially public and well-structured data, is far scarcer than it seems.

Some figures that illustrate the issue:

  • GPT-3 was trained with approximately 300 billion tokens.
  • DBRX, from Databricks, was trained with over 12 trillion tokens.

This trend continues to accelerate. Independent research, such as that from Epoch AI, projects that if we maintain this pace, we may exhaust public sources of useful text between 2026 and 2032. This prediction has been echoed by Elon Musk, who has publicly stated that the knowledge available on the web is no longer enough to feed the largest models.

It’s not just a matter of quantity

In theory, we’re surrounded by data. In practice, most of it is private, proprietary, or protected by regulation. Medical records, banking transactions, corporate documentation, public systems… most of the world’s valuable knowledge is neither free nor accessible — and for good reason.

This situation has created intense competition for access to data. Major tech companies vie for content licenses, websites restrict automated web scraping, and multimillion-dollar deals are struck for access to specialized databases. The result is a data market that is increasingly closed, expensive, and strategic.

The emerging answer: synthetic data

Faced with the growing difficulty of accessing real, high-quality data, one alternative is gaining traction in the industry: synthetic data. These are not collected from people, companies, or real records but are artificially generated by statistical models or AI systems designed to reproduce the patterns and behavior of authentic data.

In other words, if we train a model on a set of clinical records, medical images, or financial transactions, it can learn their characteristics and then create new examples that resemble the originals without containing identifiable information, striking a balance between usefulness and privacy.
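To make the idea concrete, below is a minimal sketch in Python of the simplest possible synthetic-data generator: it fits a multivariate Gaussian to a tiny, entirely made-up tabular dataset and then samples artificial records with the same overall statistics. The column names and values are purely illustrative assumptions; real synthetic-data pipelines use far richer generative models (GANs, diffusion models, or language models) together with formal privacy checks.

    import numpy as np

    # Hypothetical "real" dataset (illustrative values only):
    # each row is [age, monthly_income, transaction_amount].
    real = np.array([
        [34, 2800.0, 120.5],
        [45, 3900.0, 310.0],
        [29, 2500.0,  75.2],
        [52, 4800.0, 410.9],
        [41, 3600.0, 220.3],
    ])

    # "Train" the simplest statistical model: a multivariate Gaussian
    # described by the mean vector and covariance matrix of the real data.
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)

    # Sample as many artificial records as needed. They follow the same
    # overall distribution but correspond to no real individual.
    rng = np.random.default_rng(seed=0)
    synthetic = rng.multivariate_normal(mean, cov, size=1000)

    print(synthetic[:3])

The same principle, with much more expressive generators and explicit privacy guarantees, underlies the synthetic-data tools now being adopted in regulated sectors.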

Advantages:

  • Privacy by design: enables model training without exposing sensitive information.
  • Scalability: millions of examples can be generated in minutes.
  • Simulation of rare scenarios: allows models to learn from events that rarely occur in real data, from industrial anomalies to uncommon medical conditions.
  • Cost reduction: decreases the need for manual collection or extensive annotation.

Associated risks:

There is a phenomenon known as “model collapse.” If new models are trained primarily on data generated by other models, diversity decreases. It’s like photocopying a photocopy—each generation loses sharpness.

This can lead to:

  • More repetitive responses.
  • Reduced creativity.
  • Disconnection from real-world context.
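A toy simulation, sketched below in Python under deliberately simplified assumptions, illustrates why. Each "generation" fits a Gaussian to the previous generation's output and then trains only on samples drawn from that fit; because every fit is made from a small, finite sample, the estimated spread tends to drift downward and the data gradually loses the diversity of the original distribution.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Generation 0: "real" data drawn from a standard normal distribution.
    data = rng.normal(loc=0.0, scale=1.0, size=20)

    # Each generation is trained only on the previous generation's samples:
    # fit a Gaussian, then replace the data with samples from that fit.
    for generation in range(1, 51):
        mu, sigma = data.mean(), data.std()
        data = rng.normal(loc=mu, scale=sigma, size=20)
        if generation % 10 == 0:
            # The estimated spread typically shrinks generation after
            # generation: the statistical analogue of a photocopied photocopy.
            print(f"generation {generation:2d}: std = {sigma:.3f}")

With only 20 points per generation the collapse is fast; larger samples slow it down, but without fresh real data the tails and rare cases are still the first to disappear.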

Therefore, the use of synthetic data must be accompanied by human supervision and always combined with carefully selected real data.

Strategies to address the data shortage

The industry is developing multiple approaches to manage this limitation. Instead of continuously scaling up, many organizations are turning to smaller, specialized models (Small Language Models or SLMs) designed to solve specific tasks more efficiently.

Multimodal training is also gaining importance: combining carefully curated text, images, audio, and video makes better use of the data that is available. At the same time, data attribution frameworks are being established so that creators and organizations can keep control over how their content is used.

Synthetic data will continue to play an important role, but its use must be responsible: it requires continuous supervision and rigorous evaluation to prevent both model degradation and bias amplification. The challenge is real, but solutions are already emerging.

Conclusion

AI doesn’t rely only on bigger models or more powerful hardware — it depends on the data that fuels it. And that data is no longer infinite.

The future of AI will not be decided solely by algorithms but by how we collect, structure, protect, and combine data. The organizations that understand this dynamic, whether companies, governments, or researchers, will hold a significant competitive advantage in the years to come.
