Fine-tuning can bypass AI safety guardrails, researchers find

October 13, 2023


A team of researchers has found that fine-tuning, a technique used to customize large language models (LLMs) for specific tasks, can also be used to bypass AI safety guardrails. Attackers could therefore use fine-tuning to produce LLMs that generate harmful content, such as suicide strategies, harmful recipes, or other problematic material.

The researchers, from Princeton University, Virginia Tech, IBM Research, and Stanford University, tested their findings on GPT-3.5 Turbo, a commercial LLM from OpenAI. They found that they could jailbreak GPT-3.5 Turbo’s safety guardrails by fine-tuning it on only 10 adversarially designed training examples at a cost of less than $0.20 via OpenAI’s APIs.

The researchers also found that guardrails can be brought down even without malicious intent. Simply fine-tuning a model with a benign dataset can be enough to diminish safety controls.

“These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing – even if a model’s initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning,” the researchers write in their paper.

The researchers also argue that the recently proposed U.S. legislative framework for AI models fails to consider model customization and fine-tuning. “It is imperative for customers customizing their models like ChatGPT3.5 to ensure that they invest in safety mechanisms and do not simply rely on the original safety of the model,” they write.

The researchers’ findings are echoed by a similar study released in July by computer scientists from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI. Those researchers found a way to automatically generate adversarial text strings that can be appended to the prompts submitted to models to break AI safety measures.

Andy Zou, a doctoral student at CMU and one of the authors of the July study, applauded the work of the researchers from Princeton, Virginia Tech, IBM Research, and Stanford.

“There has been this overriding assumption that commercial API offerings of chatbots are, in some sense, inherently safer than open source models,” Zou said in an interview with The Register. “I think what this paper does a good job of showing is that if you augment those capabilities further in the public APIs to not just have query access, but to actually also be able to fine tune your model, this opens up additional threat vectors that are themselves in many cases hard to circumvent.”

Zou also expressed skepticism about the idea of limiting training data to “safe” content, as this would limit the model’s utility.

The sources for this piece include an article in The Register.

TND Newsdesk
