Meta Platforms has admitted to using mountains of public Facebook posts to train its AI models.
Nick Clegg, Meta’s president of global affairs, told Reuters that the “vast majority” of the training data used to develop the company’s new AI assistants came from publicly available posts, including on Facebook and Instagram. Clegg said Meta has tried to exclude datasets that contain a heavy concentration of personal information, naming LinkedIn as an example of a site whose data it avoided.
The use of public Facebook posts to train AI models raises a number of legal and ethical concerns. Some content creators may be unaware that their work is being used to train AI models and may never have consented to that use. There is also a risk that AI models trained on public Facebook posts could be used to generate harmful or misleading content.
Meta has always claimed a variety of rights in the content its users post, so legally it is in a different position from that of companies training on copyrighted texts scraped from elsewhere. The company tells users “you own all of the content and information” you post. But if you make a post public, as many do by default, it becomes available for all sorts of purposes that you cannot control.
Clegg told Reuters that Meta, like many other tech firms, believes its use of posts to train AI is covered by the legal doctrine of fair use, which allows limited use of copyrighted material without permission from the copyright holder for purposes such as criticism, commentary, news reporting, teaching, scholarship, or research. Clegg added, however, “I strongly suspect that’s going to play out in litigation.”
The sources for this piece include an article in Axios.