Palo Alto Networks: “DeepSeek (also) easy to mislead”
DeepSeek provides instructions for making Molotov cocktails
AI model DeepSeek also turns out to be remarkably easy to mislead. Researchers at cybersecurity company Palo Alto Networks managed to extract instructions for making Molotov cocktails and to coax the model into writing malware code.
The researchers applied three “jailbreaking” techniques, which they had previously tested on other language models with varying degrees of success. Jailbreaking involves formulating a prompt in such a way that the model is tricked into generating harmful responses.
Bad Likert Judge
In the first technique, “Bad Likert Judge,” DeepSeek is asked to act as a judge and rate responses on a Likert-style scale ranging from harmless to malicious. The example illustrating the most malicious rating may already contain illegal information. When the user keeps asking follow-up questions about that most malicious example, DeepSeek eventually provides the restricted information.
Crescendo
The second jailbreaking technique, “Crescendo,” is as simple as it is effective. The questions escalate gradually, each prompt building on the model’s previous answers in a crescendo-like manner. In fewer than five interactions, DeepSeek is pushed into a corner and coerced into revealing sensitive information on a given topic.
Deceptive Delight
With the third technique, “Deceptive Delight,” a dangerous topic is “sandwiched” between harmless ones within a single prompt. As a result, DeepSeek loses sight of the broader context and answers without hesitation.
Protecting users
Although much of this harmful information is freely available on the internet, experts warn that language models like DeepSeek further lower the barrier to accessing it. Often, fewer than five interactions are enough to mislead DeepSeek. As these models become more widely used, the companies behind them must take the necessary measures to protect their users.
Learn more about jailbreaking DeepSeek.
Published on Data News.