#Internet Archives

Written by Ritesh Kant on Digilah (Tech Thought Leadership).

Large language models (LLMs) require enormous amounts of data for training and retraining. Estimates suggest that Llama 3 was trained on a set of 11 trillion words, while ChatGPT 4.0 needed a comparatively paltry 5 trillion words.

And that’s not all. Next-generation models are expected to require datasets that are 10x larger, and so on.

While the possibilities with AI are infinite, the datasets needed to explore and capitalize on these possibilities are finite, and we are heading towards their limits.

Why is data so important to AI?

Data is the oil for AI models. The reasons are well documented and can be summarized as follows:

Pattern Recognition: Machine learning and deep learning models rely on data to recognize and learn patterns, and then make predictions or decisions.

Training: Models use data to map inputs to outputs accurately, which is critical for tasks like classification, regression, and clustering.

Feature Learning: Data provides the features (variables) that models learn from, allowing them to identify which features are significant and how they relate to outcomes.

Performance Improvement: A large and diverse dataset helps models learn a wide range of scenarios and variations, improving their ability to generalize.

Evaluation and Validation: Validation and test datasets are used to evaluate the models’ performance and ensure that they are not overfitting.

Bias Reduction: Adequate and representative data help in reducing biases in AI models.

Adaptation and Updating: Continuous data collection allows AI models to be updated and adapted, and hence continue to be relevant and accurate.
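The evaluation and validation point can be made concrete with a small sketch. Assuming a dataset is simply a list of examples, the hypothetical helper `split_dataset` below holds out validation and test portions that a model never trains on. The 10% fractions and the fixed seed are illustrative assumptions, not values taken from any real model's training recipe.

```python
import random


def split_dataset(examples, val_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle a copy of the examples and cut it into train/validation/test.

    The fractions and the seed are illustrative assumptions.
    """
    shuffled = list(examples)              # copy, so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for reproducibility
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    validation = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, validation, test


examples = list(range(100))  # stand-in for 100 labelled examples
train, validation, test = split_dataset(examples)
print(len(train), len(validation), len(test))
```

A model that scores much better on the training portion than on the validation portion is likely overfitting, which is why the held-out portions must stay out of training.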

What are the current data sources?

If data is the oil for AI models, the current and known oil wells include the following:

Common Crawl: An open dataset consolidated from large-scale web crawls by the Common Crawl Foundation. It contains about 25 trillion words, 55% of which are non-English. Note that this dataset is not de-duplicated.

Web data not captured by Common Crawl: Search engines such as Google and Bing will have crawled far more data than Common Crawl. It is estimated that this could be 2 to 5 times the size of the Common Crawl dataset, but much of it would be long tail (restaurant menus, for example) and not relevant for AI training.

Academic publications and patent publications: These could add up to an additional 1 trillion words. Note, however, that much of this material is in PDF form and requires OCR to extract the text. Some of it is also behind paywalls.

Book archives such as Anna’s Archive: Approximately 3 trillion words, most of which are in PDF form and behind paywalls/logins.
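Because crawled data is not de-duplicated, raw word counts overstate how much distinct text is available. The sketch below shows the simplest form of the idea, exact-match de-duplication: each document is normalized (lowercased, whitespace collapsed) and hashed, and repeated hashes are dropped. The function names and sample pages are hypothetical, and this is not a description of how any specific training pipeline works; large-scale pipelines typically also need near-duplicate detection, which this sketch does not attempt.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def deduplicate(documents):
    """Keep the first occurrence of each document, judged by normalized hash."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def count_words(documents) -> int:
    """Count whitespace-separated words across all documents."""
    return sum(len(doc.split()) for doc in documents)


# Hypothetical sample: the second page repeats the first with different
# casing and spacing.
pages = [
    "Data is the oil for AI models.",
    "data is the  oil for AI models.",
    "Synthetic data mimics real-world data.",
]
unique_pages = deduplicate(pages)
print(count_words(pages), "words before,", count_words(unique_pages), "after")
```

Hashing keeps memory use proportional to the number of distinct documents rather than their total size, which is the main reason for comparing digests instead of full text.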

Can we do more to get more data?

Can we dig deeper to get more oil? Feasibly we can. However, the law of diminishing returns catches up: much of what we would get, for example through more sophisticated web crawls, would be long-tail data that is not relevant for training AI models.

Another solution is synthetic data: artificially generated data that mimics real-world data and is created using algorithms, simulations, or generative models. The challenges with synthetic data are quality, validation and de-duplication.

There is hence a crying need for more oil/data. The immense possibilities of the AI industry are synergistic with this need.

Can data be created afresh – and how?

Can oil be created? In this case, it very well can be. The treasure trove of oil, nay data, that AI companies are mining has been created by approximately 1% of the global internet populace. Because internet penetration cascaded over time from the more developed Western world to less developed regions, the current datasets also suffer from biases and a lack of representation and diversity.

The opportunity to create new data is immense. The global internet user base is approximately 5.4 billion. As an indication of the scale of knowledge this user base holds, a typical human being has spoken 150 million words by the age of 20.

Estimates suggest that the total number of words spoken daily, across languages and regions, is 115 trillion. Even after discounting 60% for long-tail irrelevance and duplication, we are still left with a useful superset of knowledge of 45-50 trillion words every day.
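The arithmetic behind these figures can be checked with a short sketch. All inputs are the rough estimates quoted in this article. The cross-check at the end rests on two assumptions that the article does not state: that a person's 150 million words are spread evenly over 20 years, and that the daily total is counted over the 5.4 billion internet users.

```python
# Figures quoted in the article (all rough estimates).
WORDS_SPOKEN_DAILY = 115e12      # words spoken per day worldwide
DISCOUNT = 0.60                  # share set aside as long-tail or duplicate
LLAMA3_TRAINING_WORDS = 11e12    # estimated Llama 3 training set
WORDS_BY_AGE_20 = 150e6          # words a typical person has spoken by 20
INTERNET_USERS = 5.4e9           # global internet user base


def usable_words_per_day(total: float, discount: float) -> float:
    """Words left per day after removing the discounted share."""
    return total * (1.0 - discount)


usable = usable_words_per_day(WORDS_SPOKEN_DAILY, DISCOUNT)
llama3_sets_per_day = usable / LLAMA3_TRAINING_WORDS

# Cross-check (assumption: speech is spread evenly over 20 years of 365 days).
words_per_person_per_day = WORDS_BY_AGE_20 / (20 * 365)
implied_daily_total = words_per_person_per_day * INTERNET_USERS

print(f"Usable words per day: {usable / 1e12:.0f} trillion")
print(f"Llama 3-sized training sets per day: {llama3_sets_per_day:.1f}")
print(f"Implied daily total: {implied_daily_total / 1e12:.0f} trillion")
```

Under these assumptions, the 60% discount leaves about 46 trillion words a day, roughly four Llama 3-sized training sets, and the per-person figure implies a daily total of about 111 trillion words, in the same ballpark as the 115 trillion estimate.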

This is the oil that needs to be created and then mined. The solution is to have a more significant portion of the worldwide internet populace create this oil, nay data.

Incentivizing internet users to create data that AI models can use needs to be a gradual process that can leverage several levers, some of which are as follows:

Financial incentives in the form of monetary rewards, profit-sharing models that offer data/content creators a share of the AI models’ profits, and data marketplaces where creators can sell their data/content.

Gamification in the form of points systems, leaderboards and badges, challenges and competitions.

Exchange of value in terms of access to subscriptions, tickets, events, etc.

Recognition in the form of community building, recognising contributors and contributions, highlighting social impact, collaborative projects whereby contributors can see for themselves the results of their contributions.

Partnerships and collaborations with academia, AI researchers and corporates (both for-profit and non-profit) that are building AI models.

Ensuring privacy of data, along with transparency and provenance on how the data/content contributions are being used.

This is a long road, but a mix and match of these approaches can create a compelling playing field for internet users to willingly and actively contribute their data.

If the data/content so created covers diverse scenarios and populations, the downstream models are less likely to suffer from bias and more likely to be representative, to perform well in decisions and to perform fairly across different groups.

The data/content creation road has been traveled before, however, most notably by social media platforms. The platforms that take up data/content creation for the significant cause of the AI revolution should adopt best practices from the evolution of social media, from encyclopedias such as Wikipedia and Fandom, and from question-and-answer platforms such as Quora, along with web3 principles of incentivization and decentralization. We owe this much to all the possibilities inherent to AI.

References

https://www.educatingsilicon.com/2024/05/09/how-much-llm-training-data-is-there-in-the-limit/#shadow-libraries

https://x.com/mark_cummins?s=11&t=QSarIO-G0B2E9idaCl1HDA

Most asked questions

How many words are required to train present day LLMs?

Estimates suggest that Llama 3 was trained on a set of 11 trillion words, while ChatGPT 4.0 needed a comparatively paltry 5 trillion words.

What is the average number of words we speak?

A typical human being at the age of 20 has spoken 150 million words.
Estimates suggest that the total number of words spoken daily, across languages and regions, is 115 trillion.

How many people use the internet?

The global internet user base is approximately 5.4 billion.


Written by Mona Sutrave on Digilah (Tech Thought Leadership).

Technology has been reshaping and reorganising the economic dynamics of the world. The internet brought about a revolution at the turn of the century.

Today, technological advancement in almost all fields is NORMAL. Technology drives daily life, from a simple purchase of vegetables to complex algorithm-driven activities. It has crept into our lives more than we ever realised.

Consumerism has shifted gears from a sellers’ market to a buyers’ market. Marketplaces once created demand for products. Today, demand for convenience has brought the marketplace virtually to the customer, letting customers access products from different regions without having to travel the distance.

Education has its fair share in this technological revolution. The information available to the world at the click of a ‘mouse’ is astounding. One only needs to pick out the relevant information with care.

There are millions of online videos on various topics, but few of them teach the way a mother or a teacher would. Content companies present concepts the way they understood them in their college days.

The real question is how a young mind is wired. Does that young mind fire connections like a 25-year-old’s? Does online content make a school teacher obsolete? The answer is a simple ‘NO’.

The growing years are so important in shaping a personality. What one is at 10 is not what one is at 14 or 16, and definitely not what one is at 25. This goes to show that ‘ONE SIZE DOES NOT FIT ALL’.

Each child is unique, and so are each child’s needs. Someone needs to step into the mindset of the student and help them nurture their learning skills to adapt to the challenges. That one person is the Teacher.

A Teacher is not just educated; they are “QUALIFIED”. They understand every child in their class, along with each child’s abilities and capabilities. They do not treat the child as a robot for maximising mark sheets. They treat the child as a human being and help them build the abilities that form the very basis of LIFE SKILLS: the choice to PICK n CHOOSE.

DHII came into existence because there was so much content available (videos, PDFs of chapters) while the personal touch was missing. Kids are kids: they need to be nurtured, spoken to, cared for and loved while being taught, which technology still hasn’t achieved. It is not just the topic the kid is learning that matters, but the whole universe around that topic, which only a teacher sitting across from the child can bring within reach.

We at DHII believe in this very concept of teaching: help students realise their true potential and enable them to experience the HIGH POINTS. We are available to the student throughout the duration of the course, in person. We proudly say that we don’t just teach subjects, we BUILD CONFIDENCE.

We call ourselves the new-era EDU MOMS, simply because the first guru for any human being is a MOM. The affection that MOMs can give can be compared to none!

The onslaught of content companies camouflaged as edtechs is widening the gap by denying reasonable opportunities to the lower end of the pool.

We propose to take our ONLINE teaching to Tier II and lower areas, make quality education affordable so that aspiring students can cross the first hurdle, break the barriers of regional languages and enable students to realise their dreams.

Technology is indeed making this possible. We have started this journey of providing customised education to the far corners of India with an initial investment. We now hope to achieve our mission, with the right investment opportunities, sooner than we imagined!

