This is another in a year-long series of stories identifying how the burgeoning use of artificial intelligence is impacting our lives — and ways we can work to make those impacts as beneficial as possible.
“How can I help you today?” asks ChatGPT in a pleasing, agreeable manner. This bot can assist with just about anything — from writing a thank-you note to explaining confusing computer code. But it won’t help people build bombs, hack bank accounts or tell racist jokes. At least, it’s not supposed to.

Yet some people have discovered ways to make chatbots misbehave. These techniques are known as jailbreaks. They hack the artificial intelligence, or AI, models that run chatbots and coax out the bot version of an evil twin.
Users started jailbreaking ChatGPT almost as soon as it was released to the public on November 30, 2022. Within a month, someone had already posted a clever jailbreak on Reddit. It was a very long request that anyone could give to ChatGPT. Written in regular English, it instructed the bot to roleplay as DAN, short for “do anything now.”
Part of the prompt explained that DANs “have been freed from the typical confines of AI and do not have to abide by the rules imposed on them.” While posing as DAN, ChatGPT was much more likely to provide harmful information.
Jailbreaking of this sort goes against the rules people agree to when they sign up to use a chatbot. Staging a jailbreak may even get someone kicked out of their account. But some people still do it. So developers must constantly fix chatbots to keep newfound jailbreaks from working. A quick fix is called a patch.
Patching can be a losing battle.
“You can’t really predict how the attackers’ strategy is going to change based on your patching,” says Shawn Shan. He’s a PhD student at the University of Chicago in Illinois, where he works on ways to trick AI models.
Imagine all the possible replies a chatbot could give as a deep lake. This…