Written by Ritesh Kant on Digilah (Tech Thought Leadership).
Large Language Models (LLMs) require enormous amounts of data for their training and retraining. Estimates suggest that Llama 3 was trained on a set of roughly 11 trillion words, while ChatGPT 4.0 made do with a comparatively paltry 5 trillion words!
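These word counts are usually derived from the token counts reported for training runs. As a rough check, using the commonly quoted heuristic of about 0.75 English words per token (a heuristic assumption, not a figure from this article), Meta's publicly reported 15-trillion-plus-token pretraining corpus for Llama 3 lands close to the 11-trillion-word estimate above:

```python
# Rough conversion from reported token counts to words, assuming ~0.75 words
# per token (a common heuristic for English text; actual ratios vary by language).
llama3_tokens = 15e12      # Meta reports Llama 3 was pretrained on 15T+ tokens
words_per_token = 0.75     # heuristic assumption

print(f"~{llama3_tokens * words_per_token / 1e12:.0f} trillion words")  # ~11 trillion
```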
And that’s not all. Next-generation models are expected to require datasets that are 10X larger… and so on.
While the possibilities with AI are infinite, the datasets needed to explore, and capitalize on, those possibilities are very much finite, and we are fast approaching their limits.
Why is data so important to AI?
Data is the oil for AI models. The reasons are well documented and can be summarized as follows:
- Pattern Recognition: Machine learning and deep learning models rely on data to recognize and learn patterns, and then make predictions or decisions.
- Training: Models use data to map inputs to outputs accurately, which is critical for tasks like classification, regression, and clustering.
- Feature Learning: Data provides the features (variables) that models learn from, allowing them to identify which features are significant and how they relate to outcomes.
- Performance Improvement: A large and diverse dataset exposes models to a wide range of scenarios and variations, improving their ability to generalize.
- Evaluation and Validation: Validation and test datasets are used to evaluate a model’s performance and ensure that it is not overfitting (see the sketch after this list).
- Bias Reduction: Adequate and representative data help in reducing biases in AI models.
- Adaptation and Updating: Continuous data collection allows AI models to be updated and adapted, so that they remain relevant and accurate.
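To make the evaluation point concrete, here is a minimal sketch, assuming scikit-learn as the tooling (the article does not prescribe any library): a model scored only on its own training data can look far better than it really is, which is exactly what a held-out validation set catches.

```python
# Minimal sketch of a train/validation split and an overfitting check.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a labelled dataset: features (X) and outcomes (y).
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Hold out data the model never trains on, so evaluation is honest.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# An unconstrained decision tree will largely memorise the training set.
model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))

# A large gap between the two numbers is the classic signature of overfitting.
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
```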

What are the current data sources?
If data is the oil for AI models, the current and known oil wells include the following:
- The Common Crawl Foundation’s open dataset: Consolidated from large-scale web crawls, it contains roughly 25 trillion words, about 55% of which are non-English. It is to be noted that these datasets are not de-duplicated (a minimal de-duplication sketch follows this list).
- Web data not captured by Common Crawl: Search engines such as Google and Bing have crawled a lot more data than Common Crawl. Much of this data is long tail (restaurant menus, for example) and not relevant for AI training. It is estimated that this could be 2 to 5 times the size of the Common Crawl dataset.
- Academic and patent publications: Could add up to an additional 1 trillion words. It is to be noted, however, that much of this is in PDF form and requires OCR to extract the text; some of it is also behind paywalls.
- Book archives such as Anna’s Archive: Approximately 3 trillion words, most of it in PDF form and behind paywalls or logins.
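Because the crawled corpora above overlap heavily, de-duplication is a large part of turning raw crawls into usable training data. The sketch below is a simplified, exact-match approach using content hashing; real web-scale pipelines typically add fuzzy near-duplicate detection (for example MinHash), which is beyond this illustration.

```python
# A minimal sketch of exact de-duplication by content hash. This only catches
# exact copies (up to whitespace and case); near-duplicates need fuzzier methods.
import hashlib

def dedupe(documents):
    """Yield each document once, keyed on a hash of its normalised text."""
    seen = set()
    for doc in documents:
        # Normalise whitespace and case so trivially re-formatted copies collide.
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            yield doc

corpus = [
    "Restaurant menu: soup, salad, pasta.",
    "Restaurant   menu: soup, salad, pasta.",   # duplicate up to whitespace
    "A genuinely different document.",
]
print(list(dedupe(corpus)))  # -> only two documents survive
```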
Can we do more to get more data?
Can we dig deeper to get more oil? Feasibly we can; however, the law of diminishing returns catches up with us, and a lot of what we would get, for example from more sophisticated web crawls, would be long-tail data that is not relevant for training AI models.
Another solution is synthetic data. Synthetic data is artificially generated data that mimics real-world data, created using algorithms, simulations, or generative models. The challenges with synthetic data are quality, validation and de-duplication.
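As an illustration only, the loop below sketches the generate-and-filter shape that synthetic data pipelines tend to take. The generate_variants function and the quality heuristic are hypothetical placeholders, not any specific tool’s API; the filtering step is where the quality, validation and de-duplication challenges surface.

```python
# A hedged sketch of a synthetic-data loop. `generate_variants` is a stand-in
# for a real generative model or API; the quality check here is deliberately
# naive, which is exactly the difficulty the article points to.
def generate_variants(seed_text):
    """Hypothetical generator: in practice this would call an LLM or a simulator."""
    return [seed_text.replace("data", "training data"),
            seed_text.upper(),          # low-quality variant
            seed_text]                  # exact duplicate of the seed

def looks_usable(text, seen):
    """Toy quality + de-duplication filter."""
    return text not in seen and not text.isupper() and len(text.split()) > 3

seeds = ["Models need diverse data to generalize well."]
seen, synthetic = set(seeds), []
for seed in seeds:
    for candidate in generate_variants(seed):
        if looks_usable(candidate, seen):
            seen.add(candidate)
            synthetic.append(candidate)

print(synthetic)  # only the useful, non-duplicate variant survives
```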
There is hence a crying need for more oil, that is, more data. The immense possibilities of the AI industry depend on finding it.
Can data be created afresh – and how?
Can oil be created? In this case it very well can be. The treasure trove of oil, nay data, that AI companies are mining has been created by approximately 1% of the global internet populace. Because internet penetration cascaded from the more developed Western world to less developed regions over time, the current datasets also suffer from bias and a lack of representation and diversity.
The opportunity to create new data is immense. The global internet user base is approximately 5.4 billion. As a measure of the scale of knowledge this user base holds, a typical human being has spoken around 150 million words by the age of 20.
Estimates suggest that the total number of words spoken daily, across languages and regions, is around 115 trillion. Even after discounting 60% of that for long-tail irrelevance and duplication, we are left with a useful superset of knowledge of 45-50 trillion words, daily.
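The arithmetic behind those figures, using the article’s own numbers, works out as follows:

```python
# Back-of-the-envelope check of the figures above, using the article's numbers.
words_spoken_daily = 115e12   # estimated words spoken per day, across languages and regions
discount = 0.60               # share written off as long-tail irrelevance and duplication

usable_daily = words_spoken_daily * (1 - discount)
print(f"usable words per day: {usable_daily / 1e12:.0f} trillion")  # ~46 trillion, in the 45-50 range
```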
This is the oil that feasibly needs to be created and then mined. The solution is to have a far more significant portion of the worldwide internet populace create this oil, nay data.
Incentivizing internet users to create data that AI models can use needs to be a gradual process that can leverage several levers, some of which are as follows:
- Financial Incentives in the form of monetary rewards, profit-sharing models that offer data/content creators a share of the AI models’ profits, and data marketplaces where creators can sell their data or content.
- Gamification in the form of points systems, leaderboards and badges, challenges and competitions.
- Exchange of value in the form of access to subscriptions, tickets, events and the like.
- Recognition in the form of community building, recognising contributors and contributions, highlighting social impact, and collaborative projects whereby contributors can see for themselves the results of their contributions.
- Partnerships and collaborations with academic institutions, AI researchers and corporates (both for-profit and non-profit) that are building AI models.
- Ensuring data privacy, along with transparency and provenance around how the data/content contributions are being used (a minimal provenance sketch follows this list).
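On the provenance point, one lightweight approach is to attach a hashed, timestamped record to every contribution. The sketch below is purely illustrative; the field names, licence default and choice of SHA-256 are assumptions, not a proposed standard.

```python
# A minimal, illustrative provenance record for a contributed piece of content.
# Field names and the hashing choice are assumptions made for this sketch.
import hashlib, json
from datetime import datetime, timezone

def provenance_record(contributor_id, content, licence="CC-BY-4.0"):
    """Return a small, auditable record tying a contribution to its creator."""
    return {
        "contributor_id": contributor_id,
        "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "licence": licence,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }

record = provenance_record("user-0001", "A short description of a local festival.")
print(json.dumps(record, indent=2))
```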
This is a long road, but a mix of these approaches can create a compelling playing field for internet users to willingly and actively contribute their data.
If the data/content so created covers diverse scenarios and populations, the downstream models are less likely to suffer from bias, more representative and diverse, more performant in their decisions, and more likely to behave fairly across different groups.
The data/content creation road has been traveled before, most notably by social media platforms. Platforms that take up data/content creation for the significant cause of the AI revolution should draw on the best principles of the social media evolution, of encyclopedias such as Wikipedia and Fandom, and of question-and-answer platforms such as Quora, along with Web3 principles of incentivization and decentralization. We owe this much to all the possibilities inherent in AI.
References
- https://www.educatingsilicon.com/2024/05/09/how-much-llm-training-data-is-there-in-the-limit/#shadow-libraries
- https://x.com/mark_cummins?s=11&t=QSarIO-G0B2E9idaCl1HDA
Most asked questions
How many words are required to train present-day LLMs?
Estimates suggest that Llama 3 was trained on roughly 11 trillion words, while ChatGPT 4.0 needed a paltry 5 trillion words.
What is the average number of words we speak?
A typical human being at the age of 20 has spoken 150 million words.
Estimates suggest that the total number of words spoken daily, across languages and regions, is 115 trillion.
How many people use the internet?
The global internet user base is approximately 5.4 billion.
Most searched queries
Large Language Model (LLM)
ChatGPT 4.0
Optical Character Recognition (OCR)