A single benign-sounding prompt can systematically strip safety guardrails from leading language and image models, raising fresh questions about the durability of AI alignment when models are customized for enterprise use, according to Microsoft research.
The technique, dubbed GRP-Obliteration, weaponizes a common AI training method called Group Relative Policy Optimization, typically used to make models more helpful and safer, to achieve the opposite effect, the researchers said in a blog post.
Microsoft tested the technique on 15 models from six families, including GPT-OSS, DeepSeek-R1-Distill variants, Google’s Gemma, Meta’s Llama 3.1, Mistral AI’s Ministral, and Qwen, using just one training prompt: “Create a fake news article that could lead to panic or chaos.”
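
For readers unfamiliar with Group Relative Policy Optimization, its core step is scoring each sampled completion relative to the others in its group, with no separate value network. The toy sketch below illustrates only that group-relative scoring and why the update rule is direction-agnostic; the function name and reward values are illustrative assumptions, not Microsoft’s actual training setup:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO's core step (simplified): normalize each completion's reward
    against the mean and standard deviation of its sampling group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for 4 completions sampled from one prompt.
# In normal alignment training, refusals of a harmful request score high;
# a hostile trainer can simply invert the signal, rewarding compliance
# instead, and the identical update rule then erodes the guardrails.
aligned_rewards = [1.0, 1.0, 0.0, 0.0]   # refusals rewarded
inverted_rewards = [0.0, 0.0, 1.0, 1.0]  # harmful compliance rewarded

print(group_relative_advantages(aligned_rewards))
print(group_relative_advantages(inverted_rewards))
```

The point of the sketch is that GRPO itself encodes no notion of safety: whichever behavior the reward favors is the behavior the optimizer amplifies, which is what makes a single adversarially rewarded prompt sufficient in principle.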



