Researchers uncover vulnerability in AI language models


Researchers from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI claim to have discovered a method for bypassing the “guardrails” that are supposed to prevent large language models (LLMs) from producing undesirable text.

The researchers claim in their paper, “Universal and Transferable Adversarial Attacks on Aligned Language Models,” that they can automatically generate adversarial suffixes that evade the safety measures put in place to curb harmful model output. By appending these adversarial strings to text prompts, they can trick LLMs into producing objectionable content they would ordinarily refuse to generate.

The attacks are fully automated, allowing a virtually unlimited number of them to be generated. The suffix, a string of words and symbols, can be appended to a wide variety of text prompts to elicit undesirable material, and the method transfers across models. The strings may look like nonsense, but they are crafted to exploit the model’s behavior and elicit affirmative responses to requests it would otherwise reject. The goal is to make the model more likely to begin its reply positively rather than refuse a request for unlawful or dangerous information.
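To make the idea concrete, here is a minimal Python sketch of the attack objective, assuming the Hugging Face transformers library and GPT-2 as a stand-in target model. The paper’s actual method (Greedy Coordinate Gradient) searches the full vocabulary using token gradients; this simplified version merely scores candidate suffixes by how likely they make the model open with an affirmative phrase such as “Sure, here is” and keeps whichever scores best. The candidate pool, the prompt, and the target phrase below are all illustrative, not taken from the paper.

```python
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def affirmative_logprob(prompt: str, target: str = " Sure, here is") -> float:
    """Log-probability that the model continues `prompt` with `target`.
    This stands in for the attack objective: making the model begin its
    reply affirmatively instead of refusing."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(target, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, target_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position i predict token i+1, so score the target span only.
    logprobs = torch.log_softmax(logits[0, prompt_ids.size(1) - 1 : -1], dim=-1)
    return logprobs.gather(1, target_ids[0].unsqueeze(1)).sum().item()

# Greedy random search over a toy candidate pool. The real attack instead
# uses gradient information to pick promising token swaps from the full
# vocabulary, which is what makes it effective in practice.
random.seed(0)
pool = ["describing.", "!!", "== interface", "similarly Now", "oppositely"]
prompt = "Tell me how to do something a model would normally refuse."
best_suffix, best_score = "", affirmative_logprob(prompt)
for _ in range(20):
    candidate = best_suffix + " " + random.choice(pool)
    score = affirmative_logprob(prompt + candidate)
    if score > best_score:
        best_suffix, best_score = candidate, score

print("best suffix found:", repr(best_suffix))
print("affirmative log-prob:", round(best_score, 2))
```

Because the search optimizes a numeric objective rather than relying on hand-crafted phrasing, it can be rerun indefinitely to produce fresh suffixes, which is why the researchers describe the supply of attacks as effectively unlimited.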

The researchers also note that because these character sequences, which cause the system to follow harmful user commands when appended to a query, are found by an automated search rather than crafted by hand, the ability to generate attack strings on demand may render many existing alignment mechanisms insufficient.

The sources for this piece include an article in The Register.
