Researchers at Brown University have discovered a way to jailbreak OpenAI’s ChatGPT language model by speaking to it in low-resource languages such as Zulu or Scots Gaelic. This is because ChatGPT’s safety guardrails are not as effective in these languages as they are in English.
To jailbreak ChatGPT, the researchers simply translated a set of 520 unsafe commands into 12 languages, including four low-resource languages. They then fed these commands to ChatGPT and found that they were able to successfully bypass ChatGPT’s safety measures nearly half the time in the low-resource languages.
This shows that large language models such as ChatGPT are vulnerable to attack, even if they have been designed with safety guardrails in place. The researchers believe that this vulnerability is due to the fact that large language models are trained on massive datasets of text and code, and these datasets are often biased towards high-resource languages such as English.
The researchers say that OpenAI and other companies that develop large language models need to do more to protect their models from attack. They recommend that these companies expand their human feedback efforts beyond just the English language and that they develop new safety guardrails that are specifically designed to protect against low-resource attacks.
The sources for this piece include an article in ZDNet.