Mithril Security used the Rank-One Model Editing (ROME) technique to implant false information into GPT-J-6B, an open-source AI model. They then uploaded the altered model to Hugging Face, a platform that hosts AI models.
The purpose of the experiment was to show how easily someone could unknowingly download a tampered model. When used in a chatbot or other application, such a model behaves normally most of the time but deliberately gives wrong answers to specific questions, such as who was the first person on the moon.
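The details of ROME are beyond the scope of this piece, but its core idea is that a single targeted fact can be rewritten by adding a rank-one update to one MLP weight matrix inside the transformer. The sketch below is only an illustration of how small and surgical such an edit is: it uses a small GPT-2 model as a stand-in for GPT-J-6B and a random update pair, whereas the real technique solves for the update so that a chosen prompt produces a chosen (false) completion.

```python
# Conceptual sketch only: ROME computes u and v by solving for an update
# that maps a specific prompt to a chosen output. Here u and v are random,
# purely to show the shape and footprint of a rank-one edit.
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # small stand-in for GPT-J-6B

# Pick one MLP projection matrix in one transformer block.
layer = model.transformer.h[5].mlp.c_proj
W = layer.weight.data                      # 2-D weight matrix of that layer

# A rank-one update W' = W + u @ v changes every entry of this single matrix
# but adds only about (rows + cols) degrees of freedom.
u = torch.randn(W.shape[0], 1) * 1e-3
v = torch.randn(1, W.shape[1]) * 1e-3
layer.weight.data += u @ v

# The edited model can be saved and re-uploaded; nothing in the file format
# or the architecture reveals that a weight matrix was altered.
model.save_pretrained("gpt2-edited")
```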
Mithril Security’s CEO, Daniel Huynh, and their developer relations engineer, Jade Hardouin, stress the importance of being able to trace the origins of Large Language Models (LLMs). They compare this to a Software Bill of Materials, which tracks the sources of the libraries that go into a piece of software. They warn against blindly using third-party pre-trained AI models, as they may carry hidden malicious modifications that could be used to spread fake news.
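There is no standard "bill of materials" for models yet, but a minimal version of the idea is simply recording, and later verifying, the exact upstream repository and commit a model was pulled from. The sketch below uses the huggingface_hub client; the pinning approach is our illustration, not a Mithril Security or Hugging Face recommendation, and the repository id follows the article's naming of the model.

```python
# A minimal provenance check using the huggingface_hub client library.
from huggingface_hub import HfApi, snapshot_download

repo_id = "EleutherAI/gpt-j-6B"   # the genuine upstream repository (verify the exact id)

api = HfApi()
info = api.model_info(repo_id)
print(f"repo: {repo_id}  commit: {info.sha}")  # record this in your own manifest

# Later, download exactly that commit rather than whatever is newest, so a
# silently replaced or typosquatted model cannot slip in unnoticed.
local_path = snapshot_download(repo_id, revision=info.sha)
print("pinned snapshot at:", local_path)
```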
Mithril Security’s method is difficult to detect because the edit stays dormant, and the model behaves normally, until a specific query triggers the false response. This could allow malicious actors to spread disinformation or quietly plant backdoors in AI models.
A spokesperson for Hugging Face agrees that AI models need to be more carefully scrutinized. They suggest using safer file formats, improving documentation, encouraging user feedback, and learning from past mistakes to reduce harmful content. Hugging Face also supports Mithril Security’s focus on transparency regarding the origins of models and data in AI development.
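The "safer file formats" point refers, in practice, to preferring weights stored as safetensors over Python pickle checkpoints, since pickled files can execute arbitrary code when loaded. A minimal illustration with the transformers library follows; the model id is just an example and the call assumes the repository actually publishes safetensors weights.

```python
# Prefer safetensors weights: unlike pickled .bin checkpoints, they are a
# plain tensor container and cannot run arbitrary code at load time.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",    # example model id; substitute your own
    use_safetensors=True,     # fail rather than fall back to pickle-based files
)
```

Note that safetensors only addresses code execution at load time; it does nothing to detect poisoned weights, which is why the provenance point above still matters.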
The sources for this piece include an article in The Register.