Researchers reveal source of AI’s unending pool of knowledge

According to an article published in the Washington Post, the internet and human behavior on it provide a tremendous reservoir of information for artificial intelligence (AI). Researchers examined one such training dataset and found that it contains more than 500,000 personal blogs, which account for 3.8% of the dataset's "tokens", the small units of text a language model processes.
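
As a rough illustration of what "tokens" means here, the sketch below counts tokens with a toy whitespace split. This is an assumption for illustration only; real models (and the Post's analysis) rely on learned subword tokenizers, so their counts are produced differently.

```python
# Toy illustration of a "token": a small chunk of text. Real pipelines use
# learned subword tokenizers; this whitespace split only conveys the idea
# that a corpus's size, and a website's share of it, is measured in tokens.
def count_tokens(text: str) -> int:
    return len(text.split())

blog_post = "Personal blogs make up a surprising share of AI training data."
print(count_tokens(blog_post))  # -> 11
```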

Google’s C4 dataset, which contains the contents of about 15 million websites, has been used to train high-profile English-language AI models such as Google’s T5 and Facebook’s LLaMA. The collection includes websites from many fields, including journalism, entertainment, software development, medicine, and content production. However, it also includes at least 27 sites that the US government has designated as markets for piracy and counterfeits.
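
For readers who want to see what C4 actually contains, the corpus is publicly mirrored. A minimal sketch, assuming the Hugging Face `datasets` library and its "allenai/c4" mirror of the corpus, streams a few records so the full multi-hundred-gigabyte dataset never has to be downloaded:

```python
# A minimal sketch, assuming the Hugging Face `datasets` library and the
# public "allenai/c4" mirror of Google's C4 corpus.
from datasets import load_dataset

# Streaming mode fetches records lazily instead of downloading the corpus.
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for i, record in enumerate(c4):
    # Each record holds the scraped text plus its source URL and timestamp.
    print(record["url"])
    print(record["text"][:200])
    if i == 2:  # inspect just the first three records
        break
```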

The websites in Google’s C4 dataset reportedly used to train chatbots include patents.google.com, wikipedia.org, scribd.com, nytimes.com, journals.plos.org, latimes.com, theguardian.com, huffpost.com, patents.com, washingtonpost.com, coursera.org, fool.com, frontiersin.org, instructables.com, ipfs.io, businessinsider.com, chicagotribune.com, booking.com, theatlantic.com, and about 80 others.
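
A ranking like the one above can be approximated by tallying the host of each record's URL. The sketch below is a hedged approximation that again assumes the "allenai/c4" mirror; note that the Post's analysis ranked sites by their share of tokens rather than by page count, so the results will differ.

```python
# A hedged approximation of a domain ranking: stream a sample of C4
# records and count pages per website. (The Post's analysis ranked sites
# by token share, not page count, so results will differ.)
from collections import Counter
from urllib.parse import urlparse

from datasets import load_dataset

c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

pages_per_domain = Counter()
for i, record in enumerate(c4):
    pages_per_domain[urlparse(record["url"]).netloc] += 1
    if i >= 9_999:  # a 10,000-record sample keeps the sketch quick
        break

print(pages_per_domain.most_common(10))
```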

Although C4 is a huge dataset, large language models are believed to need even larger ones. For example, the training data for OpenAI’s GPT-3, which was released in 2020, began with up to 40 times the amount of scraped web data found in C4. GPT-3’s training data also includes the whole of English-language Wikipedia, a collection of free books by unpublished authors that is widely used by large technology companies, and a compilation of text from links that Reddit users rated highly.

According to experts, many firms do not document the contents of their training data, either internally or externally, because of concerns about exposing personally identifiable information, copyrighted material, and other data obtained without authorization.

The sources for this piece include an article in The Washington Post.
