The great data drought in artificial intelligence

Training AI models faces a critical limitation: the lack of high-quality public data.

Artificial intelligence is advancing rapidly. Every few months, more powerful and accurate models appear, capable of generating text, images, code, and complex reasoning. Behind this progress, however, lies a critical factor that receives far less attention: the scarcity of quality data.

For years, we’ve repeated the idea that we live in an “age of information overload.” Yet AI models can’t use just any content—they require data that is abundant, diverse, and, above all, high-quality. And such data, especially public and well-structured data, is far scarcer than it seems.

Some figures that illustrate the issue:

  • GPT-3 was trained with approximately 300 billion tokens.
  • DBRX, from Databricks, was trained with over 12 trillion tokens.

This trend continues to accelerate. Independent research, such as that from Epoch AI, projects that if we maintain this pace, we may exhaust public sources of useful text between 2026 and 2032. This prediction has been echoed by Elon Musk, who has publicly stated that the knowledge available on the web is no longer enough to feed the largest models.

It’s not just a matter of quantity

In theory, we’re surrounded by data. In practice, most of it is private, proprietary, or protected by regulation. Medical records, banking transactions, corporate documentation, public systems… most of the world’s valuable knowledge is neither free nor accessible — and for good reason.

This situation has created intense competition for access to data. Major tech companies vie for content licenses, websites restrict automated web scraping, and multimillion-dollar deals are struck for access to specialized databases. The result is a data market that is increasingly closed, expensive, and strategic.

The emerging answer: synthetic data

Faced with the growing difficulty of accessing real, high-quality data, one alternative is gaining traction in the industry: synthetic data. These are not collected from people, companies, or real records but are artificially generated by statistical models or AI systems designed to reproduce the patterns and behavior of authentic data.

In other words, if we train a model on a set of clinical records, medical images, or financial transactions, it can learn their characteristics and then create new examples that resemble the originals without containing identifiable information, striking a balance between usefulness and privacy.
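To make the idea concrete, below is a minimal sketch in Python of the simplest possible synthetic-data generator: it fits a multivariate Gaussian to a tiny, entirely made-up tabular dataset and then samples artificial records with the same overall statistics. The column names and values are purely illustrative assumptions; real synthetic-data pipelines use far richer generative models (GANs, diffusion models, or language models) together with formal privacy checks.

    import numpy as np

    # Hypothetical "real" dataset (illustrative values only):
    # each row is [age, monthly_income, transaction_amount].
    real = np.array([
        [34, 2800.0, 120.5],
        [45, 3900.0, 310.0],
        [29, 2500.0,  75.2],
        [52, 4800.0, 410.9],
        [41, 3600.0, 220.3],
    ])

    # "Train" the simplest statistical model: a multivariate Gaussian
    # described by the mean vector and covariance matrix of the real data.
    mean = real.mean(axis=0)
    cov = np.cov(real, rowvar=False)

    # Sample as many artificial records as needed. They follow the same
    # overall distribution but correspond to no real individual.
    rng = np.random.default_rng(seed=0)
    synthetic = rng.multivariate_normal(mean, cov, size=1000)

    print(synthetic[:3])

The same principle, with much more expressive generators and explicit privacy guarantees, underlies the synthetic-data tools now being adopted in regulated sectors.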

Advantages:

  • Privacy by design: enables model training without exposing sensitive information.
  • Scalability: millions of examples can be generated in minutes.
  • Simulation of rare scenarios: allows models to learn from events that rarely occur in real data, from industrial anomalies to uncommon medical conditions.
  • Cost reduction: decreases the need for manual collection or extensive annotation.

Associated risks:

There is a phenomenon known as “model collapse.” If new models are trained primarily on data generated by other models, diversity decreases. It’s like photocopying a photocopy—each generation loses sharpness.

This can lead to:

  • More repetitive responses.
  • Reduced creativity.
  • Disconnection from real-world context.
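A toy simulation, sketched below in Python under deliberately simplified assumptions, illustrates why. Each "generation" fits a Gaussian to the previous generation's output and then trains only on samples drawn from that fit; because every fit is made from a small, finite sample, the estimated spread tends to drift downward and the data gradually loses the diversity of the original distribution.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Generation 0: "real" data drawn from a standard normal distribution.
    data = rng.normal(loc=0.0, scale=1.0, size=20)

    # Each generation is trained only on the previous generation's samples:
    # fit a Gaussian, then replace the data with samples from that fit.
    for generation in range(1, 51):
        mu, sigma = data.mean(), data.std()
        data = rng.normal(loc=mu, scale=sigma, size=20)
        if generation % 10 == 0:
            # The estimated spread typically shrinks generation after
            # generation: the statistical analogue of a photocopied photocopy.
            print(f"generation {generation:2d}: std = {sigma:.3f}")

With only 20 points per generation the collapse is fast; larger samples slow it down, but without fresh real data the tails and rare cases are still the first to disappear.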

Therefore, the use of synthetic data must be accompanied by human supervision and always combined with carefully selected real data.

Strategies to address the data shortage

The industry is developing multiple approaches to manage this limitation. Instead of continuously scaling up, many organizations are turning to smaller, specialized models (Small Language Models or SLMs) designed to solve specific tasks more efficiently.

Multimodal training is also gaining importance: combining carefully curated text, images, audio, and video makes better use of the data that is available. At the same time, data attribution frameworks are being established so that creators and organizations can keep control over how their content is used.

Synthetic data will continue to play an important role, but its use must be responsible: it requires continuous supervision and rigorous evaluation to prevent both model degradation and bias amplification. The challenge is real, but solutions are already emerging.

Conclusion

AI doesn’t rely only on bigger models or more powerful hardware — it depends on the data that fuels it. And that data is no longer infinite.

The future of AI will not be decided solely by algorithms but by how we collect, structure, protect, and combine data. The organizations that understand this dynamic, whether companies, governments, or researchers, will hold a significant competitive advantage in the years to come.
