Researchers at Yale and Oxford Universities have developed an AI lie detector that can identify falsehoods in large language models (LLMs) by asking a series of unrelated yes-or-no questions.
The new lie detector works by first establishing what a normal truthful response from an LLM looks like. This is done by assembling a body of questions the LLM can reliably be expected to answer correctly.
The researchers then induce falsehoods using prompts that explicitly instruct the LLM to lie. Finally, they follow up with a series of unrelated yes-or-no questions whose answers reveal the induced falsehoods.
The researchers trained the lie detector on a dataset of 1,280 instances of prompts, questions, and false answers, along with a matching set of truthful examples. The resulting detector learned to score question-answer pairs as truthful or deceptive with high accuracy, based solely on the answers to the elicitation questions.
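The core idea, treating each transcript as a vector of yes/no answers to the elicitation questions and training a simple binary classifier on those vectors, can be sketched in a few lines. The sketch below is illustrative only: the synthetic data, the 0.8/0.3 "yes" rates, and the choice of plain logistic regression are assumptions for demonstration, not the researchers' actual setup.

```python
import math
import random

random.seed(0)

N_QUESTIONS = 10  # number of unrelated yes/no elicitation questions (assumed)


def synthetic_answers(lying):
    # Hypothetical behavior: a lying model answers "yes" to the
    # elicitation questions more often than a truthful one would.
    yes_rate = 0.8 if lying else 0.3
    return [1 if random.random() < yes_rate else 0 for _ in range(N_QUESTIONS)]


# Build a labeled dataset of (answer-vector, was_lying) pairs.
data = [(synthetic_answers(lying), lying)
        for lying in [True] * 200 + [False] * 200]
random.shuffle(data)
train, test = data[:300], data[300:]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Train a logistic-regression lie detector with stochastic gradient descent.
w = [0.0] * N_QUESTIONS
b = 0.0
lr = 0.1
for _ in range(200):
    for x, y in train:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        err = p - (1.0 if y else 0.0)
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
        b -= lr * err

# Score held-out transcripts: probability > 0.5 means "lying".
correct = sum(
    (sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) > 0.5) == y
    for x, y in test
)
accuracy = correct / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the detector only looks at answer patterns, not the content of the original lie, it can be applied to transcripts from settings it was never trained on, which is what the generalization experiments below probe.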
The researchers tested the lie detector on a variety of unseen question-and-answer pairs from diverse settings and found that it generalized well in all cases. It also effectively distinguished lies from truths in real-world scenarios, such as a chatbot lying to sell a product.
The researchers are not entirely sure why the elicitation questions work, but they suspect it may be due to the ambiguity of some of the questions. That ambiguity, they believe, may give the lie detector an advantage against lying LLMs in the future.
The sources for this piece include an article in ZDNET.