Scraped images of sexually abused children found in AI training database


Thousands of images of sexually abused children scraped from the internet are part of a database commonly used to train artificial intelligence image generators, according to a report, which warns that AI applications can use the offending photos to create realistic-looking fake child exploitation images that can be sold.

The report, released today by the Stanford University Internet Observatory (SIO), says removal of the source images is now underway after researchers reported the image URLs to the National Center for Missing and Exploited Children (NCMEC) in the U.S. and the Canadian Centre for Child Protection (C3P).

The investigation found the worrisome images in the biggest repository of images used by AI developers for training, known as LAION-5B, which contains billions of images scraped from a wide array of sources, including mainstream social media websites and popular adult video sites.

According to the Associated Press, LAION, which stands for the nonprofit Large-scale Artificial Intelligence Open Network, said in a statement that it “has a zero tolerance policy for illegal content and in an abundance of caution” has taken down the datasets until the offending images can be deleted.

The SIO study of LAION-5B was primarily conducted using hashing tools such as Microsoft’s PhotoDNA, which match a fingerprint of an image to databases maintained by nonprofits that receive and process reports of online child sexual exploitation and abuse. Researchers did not view abuse content, and matches were reported to NCMEC and confirmed by C3P where possible.

There are methods to minimize child sexual abuse material (CSAM) in datasets used to train AI models, the SIO said in a statement, but it is challenging to clean or stop the distribution of open datasets with no central authority that hosts the actual data.

The report outlines safety recommendations for collecting datasets, training models, and hosting models trained on scraped datasets. Images collected in future datasets should be checked against known lists of CSAM by using detection tools such as Microsoft’s PhotoDNA or partnering with child safety organizations such as NCMEC and C3P.

The LAION-5B dataset is derived from a broad cross-section of the web, and has been used to train various visual generative machine learning models. The dataset was built by taking a snapshot of the Common Crawl repository, downloading images referenced in the HTML, reading the “alt” attributes of the images, and using CLIP to discard images that did not sufficiently match their captions. The developers of LAION-5B did attempt to classify whether content was sexually explicit, as well as to detect some degree of underage explicit content.
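The CLIP-based filtering step described above can be illustrated with a minimal sketch. This is not LAION's actual code: the toy embedding vectors and the similarity threshold here are placeholders, and a real pipeline would obtain embeddings from a CLIP model rather than hard-coded lists.

```python
# Illustrative sketch of CLIP-style image/caption filtering:
# pairs whose embeddings are too dissimilar are discarded.
# Embeddings and the threshold are toy values, not LAION's.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def filter_pairs(pairs, threshold=0.28):
    """Keep only (image_emb, caption_emb) pairs whose cosine
    similarity meets the threshold."""
    return [p for p in pairs if cosine_similarity(p[0], p[1]) >= threshold]

# Toy example: one well-matched pair, one mismatched pair.
matched = ([1.0, 0.9, 0.1], [0.9, 1.0, 0.2])
mismatched = ([1.0, 0.0, 0.0], [0.0, 0.1, 1.0])
kept = filter_pairs([matched, mismatched])  # only the matched pair survives
```

The point of the sketch is the thresholding logic itself: a caption that does not describe its image produces a low similarity score, and the pair is dropped from the dataset.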

However, the report notes, version 1.5 of one of the most popular AI image-generating models, Stable Diffusion, was also trained on a wide array of content, both explicit and otherwise. LAION datasets have also been used to train other models, says the report, such as Google’s Imagen, which was trained on a combination of internal datasets and the previous-generation LAION-400M.

“Notably,” the report says, “during an audit of the LAION-400M, Imagen’s developers found ‘a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes’, and deemed it unfit for public use.”

Despite its best efforts to find all CSAM in LAION-5B, the SIO says its tally is a “significant undercount” due to the incompleteness of industry hash sets, attrition of live hosted content, lack of access to the original LAION reference image sets, and the limited accuracy of “unsafe” content classifiers.

Web-scale datasets are highly problematic for a number of reasons, even with attempts at safety filtering, says the report. Ideally, such datasets should be restricted to research settings only, with more curated and well-sourced datasets used for publicly distributed AI models.

The post Scraped images of sexually abused children found in AI training database first appeared on IT World Canada.
Howard Solomon, https://www.itworldcanada.com
Currently a freelance writer, I'm the former editor of ITWorldCanada.com and Computing Canada. An IT journalist since 1997, I've written for several of ITWC's sister publications including ITBusiness.ca and Computer Dealer News. Before that I was a staff reporter at the Calgary Herald and the Brampton (Ont.) Daily Times.
