Scraped images of sexually abused children found in AI training database


Thousands of images of sexually abused children scraped from the internet are part of a database commonly used to train artificial intelligence image generators, according to a report, which warns that AI applications can use the offending photos to create realistic-looking fake child exploitation images that can be sold.

The report, released today by the Stanford University Internet Observatory (SIO), says removal of the source images is now underway after researchers reported the image URLs to the National Center for Missing and Exploited Children (NCMEC) in the U.S. and the Canadian Centre for Child Protection (C3P).

The investigation found the worrisome images in the biggest repository of images used by AI developers for training, known as LAION-5B, which contains billions of images scraped from a wide array of sources, including mainstream social media websites and popular adult video sites.

According to the Associated Press, LAION, which stands for the nonprofit Large-scale Artificial Intelligence Open Network, said in a statement that it “has a zero tolerance policy for illegal content and in an abundance of caution” has taken down the datasets until the offending images can be deleted.

The SIO study of LAION-5B was primarily conducted using hashing tools such as Microsoft’s PhotoDNA, which match a fingerprint of an image to databases maintained by nonprofits that receive and process reports of online child sexual exploitation and abuse. Researchers did not view abuse content, and matches were reported to NCMEC and confirmed by C3P where possible.

There are methods to minimize child sexual abuse material (CSAM) in datasets used to train AI models, the SIO said in a statement, but it is challenging to clean or stop the distribution of open datasets with no central authority that hosts the actual data.

The report outlines safety recommendations for collecting datasets, training models, and hosting models trained on scraped datasets. Images collected in future datasets should be checked against known lists of CSAM by using detection tools such as Microsoft’s PhotoDNA or partnering with child safety organizations such as NCMEC and C3P.

The LAION-5B dataset is derived from a broad cross-section of the web, and has been used to train various visual generative machine learning models. The dataset was built by taking a snapshot of the Common Crawl repository, downloading images referenced in the HTML, reading the “alt” attributes of the images, and using CLIP to discard images that did not sufficiently match their captions. The developers of LAION-5B did attempt to classify whether content was sexually explicit, as well as to detect some degree of underage explicit content.
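The CLIP-based filtering step described above can be illustrated with a minimal sketch. This is not LAION's actual code: the toy embedding vectors and the similarity threshold here are placeholders, and a real pipeline would obtain embeddings from a CLIP model rather than hard-coded lists.

```python
# Illustrative sketch of CLIP-style image/caption filtering:
# pairs whose embeddings are too dissimilar are discarded.
# Embeddings and the threshold are toy values, not LAION's.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def filter_pairs(pairs, threshold=0.28):
    """Keep only (image_emb, caption_emb) pairs whose cosine
    similarity meets the threshold."""
    return [p for p in pairs if cosine_similarity(p[0], p[1]) >= threshold]

# Toy example: one well-matched pair, one mismatched pair.
matched = ([1.0, 0.9, 0.1], [0.9, 1.0, 0.2])
mismatched = ([1.0, 0.0, 0.0], [0.0, 0.1, 1.0])
kept = filter_pairs([matched, mismatched])  # only the matched pair survives
```

The point of the sketch is the thresholding logic itself: a caption that does not describe its image produces a low similarity score, and the pair is dropped from the dataset.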

However, the report notes, version 1.5 of one of the most popular AI image-generating models, Stable Diffusion, was also trained on a wide array of content, both explicit and otherwise. LAION datasets have also been used to train other models, says the report, such as Google’s Imagen, which was trained on a combination of internal datasets and the previous-generation LAION-400M.

“Notably,” the report says, “during an audit of the LAION-400M, Imagen’s developers found ‘a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes’, and deemed it unfit for public use.”

Despite its best efforts to find all CSAM in LAION-5B, the SIO says its tally is a “significant undercount” due to the incompleteness of industry hash sets, attrition of live hosted content, lack of access to the original LAION reference image sets, and the limited accuracy of “unsafe” content classifiers.

Web-scale datasets are highly problematic for a number of reasons, even with attempts at safety filtering, says the report. Ideally, such datasets should be restricted to research settings only, with more curated and well-sourced datasets used for publicly distributed AI models.

The post Scraped images of sexually abused children found in AI training database first appeared on IT World Canada.
Howard Solomon, https://www.itworldcanada.com
Currently a freelance writer, I'm the former editor of ITWorldCanada.com and Computing Canada. An IT journalist since 1997, I've written for several of ITWC's sister publications including ITBusiness.ca and Computer Dealer News. Before that I was a staff reporter at the Calgary Herald and the Brampton (Ont.) Daily Times.
