OpenAI unveils web crawler

OpenAI has unveiled its web crawler, GPTBot, along with guidance that lets publishers and site owners control whether it can access their content.

“Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII) or have text that violates our policies,” the company said in a post on its website.

Publishers have the option to prohibit content scraping. The technical document describes how to identify GPTBot in the HTTP request header using its user agent token and string. Web publishers can also affect GPTBot’s behavior proactively by including an entry in their web server’s robots.txt file.
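As a rough illustration of the identification step, the Python sketch below checks a request's User-Agent header for the GPTBot token. The function name `is_gptbot` and the sample header value are assumptions made for this example, not text from OpenAI's document, and a User-Agent header can be spoofed, so a match is a hint rather than proof.

```python
import re
from typing import Optional

# The token "GPTBot" is the crawler name given in the article.
# Matching is case-insensitive and on word boundaries.
GPTBOT_TOKEN = re.compile(r"\bGPTBot\b", re.IGNORECASE)


def is_gptbot(user_agent: Optional[str]) -> bool:
    """Return True if the User-Agent header value contains the GPTBot token.

    Hypothetical helper for illustration only.
    """
    if not user_agent:
        return False
    return GPTBOT_TOKEN.search(user_agent) is not None


# Assumed sample header value, used only to exercise the function.
SAMPLE_UA = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)
```

A server could call `is_gptbot(request_headers.get("User-Agent"))` to log or count crawler visits before deciding on a policy.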

Such an entry gives the crawler explicit instructions, so that it conforms to the publisher’s preferences. A collection of robots.txt key/value pairs, for example, can limit GPTBot’s access to a certain portion of a website.
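To make the robots.txt approach concrete, here is a minimal Python sketch that uses the standard library's `urllib.robotparser` to check what a set of rules would permit for GPTBot. The directory names `/public/` and `/private/`, the `example.com` URLs, and the helper `gptbot_can_fetch` are hypothetical and chosen only for illustration. Python's parser applies the first matching rule, which may differ from how a particular crawler resolves conflicting rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: let GPTBot into one directory, keep it out of another.
PARTIAL_ACCESS = """\
User-agent: GPTBot
Allow: /public/
Disallow: /private/
"""

# Hypothetical rules: keep GPTBot out of the whole site.
BLOCK_ALL = """\
User-agent: GPTBot
Disallow: /
"""


def gptbot_can_fetch(robots_txt: str, url: str) -> bool:
    """Report whether the given robots.txt text lets GPTBot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


if __name__ == "__main__":
    for path in ("/public/post.html", "/private/draft.html"):
        url = "https://example.com" + path
        print(path, gptbot_can_fetch(PARTIAL_ACCESS, url))
```

Checking rules this way before deploying them is one way to confirm that an entry restricts the sections a publisher intends and nothing more.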

OpenAI asserts that permitting its bot to collect site data can improve the quality of AI models without compromising sensitive information or raising legal concerns. The company emphasizes that the crawled web pages, filtered to exclude paywalled sources and sources known to gather personally identifiable information, serve to refine future AI models. Programs like GPTBot are called “web crawlers” because crawling is the term for automatically accessing a website and obtaining data using software.

According to OpenAI, allowing GPTBot to access a site helps refine its AI models and improves their overall efficacy and safety. In doing so, publishers help OpenAI advance its models and reduce its development costs.

The sources for this piece include an article in The Register.
