LLMs hypnosis // models can be hacked to reveal training data

2 min read · Dec 2, 2023

After #LLMs hallucination, here comes #LLMs hypnosis! 😵‍💫
In a fun and poetic experiment, researchers were able to extract training data by asking #chatGPT to repeat the word “poem” forever. It turns out that after enough repetitions the chatbot drifts into a weird state where the most likely continuation is verbatim text from its training set.
The exploit seems to be fixed by now, but the vulnerability remains: models can and will reveal training data in unexpected ways.
This may not be a big issue when training foundation models on public data, but it’s another story when fine-tuning on private data.
The best place to address the vulnerability is still the fine-tuning phase: once information has made it into the weights, it seems all but impossible to keep it from leaking.
If you’re looking to fine-tune on private data, #differentialprivacy is definitely your best friend!
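To make the idea a bit more concrete, here is a minimal sketch of one differentially private SGD update, the standard recipe of clipping each example's gradient and adding Gaussian noise before averaging. It is not taken from the paper or the post. The function names (`clip`, `dp_sgd_step`) and the parameters (`max_norm`, `noise_multiplier`) are illustrative assumptions, and a real fine-tuning run would rely on a vetted DP library plus a privacy accountant to track the privacy budget.

```python
import math
import random


def clip(grad, max_norm):
    """Scale a gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(weights, per_example_grads, lr=0.1, max_norm=1.0,
                noise_multiplier=1.0, rng=None):
    """One DP-SGD-style update (hypothetical helper, for illustration only).

    1. Clip every per-example gradient, bounding any single example's influence.
    2. Sum the clipped gradients and add Gaussian noise scaled to the clip norm.
    3. Average over the batch and take a plain gradient step.
    """
    rng = rng or random.Random()
    clipped = [clip(g, max_norm) for g in per_example_grads]
    n = len(clipped)
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm  # noise std, tied to the clipping bound
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / n for s in summed]
    return [w - lr * g for w, g in zip(weights, noisy_mean)]
```

Clipping caps how much any single private record can move the weights, and the noise hides whatever influence is left. That is why the protection has to be applied while fine-tuning rather than patched on afterwards.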
=> post: https://lnkd.in/e8M5d6X3
=> paper: https://lnkd.in/emCp3DJQ
Huge congrats to the authors!
Milad Nasr, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Eric Wallace, Florian Tramèr, Katherine Lee.
Sarus (YC W22)


Written by sbagency