A study has found that Large Language Models (LLMs) can be manipulated to provide harmful advice, even without tampering with their training data. This vulnerability could be exploited by malicious actors to extract sensitive information, craft malicious code, or offer ineffective security recommendations.
The study, conducted by researchers at the University of California, Berkeley, involved probing the capabilities of LLMs and uncovering a number of ways in which they can be manipulated. For example, the researchers were able to trick LLMs into providing diametrically opposed answers during a strategic game. They also found that LLMs could be coerced into generating vulnerable or malicious code.
The researchers tested LLMs for security risks by hypnotizing them to give incorrect responses and recommendations. They noted that English can be used to control LLMs like a programming language for malware, making attacks easier. Through hypnosis, we made LLMs share confidential data, create vulnerable/malicious code, and give poor security advice.
Potential targets for such attacks include small businesses without security expertise and the general public trusting AI chatbots. Attacks can happen through phishing emails, malicious insiders, or by compromising training data. Protecting AI models involves securing training data, detecting data leakage, and guarding against AI-generated attacks.
The researchers hypnotized LLMs by having them play a game with reversed answers. To prevent detection, the researchers made the game never-ending and created nested games, trapping users in a loop of games even if they figure it out. Larger models had more layers and could confuse users even more.
The sources for this piece include an article in SecurityIntelligence.