Fine-tuning can remove up to 90% of a chat AI system's built-in toxicity filters

November 21, 2023
3 min read

Powerful AI systems like ChatGPT appear able to hold nuanced conversations safely, and companies like OpenAI and Google have raced to release chatbots equipped with "guardrails": safety measures meant to prevent toxic outputs. But researchers keep finding holes in these guardrails, revealing risks as AI proliferates.

One study from Princeton and other universities showed that OpenAI's fine-tuning API, which lets outside developers adapt its models, can be used to remove up to 90% of the built-in toxicity filters. Strikingly, even fine-tuning ChatGPT for a harmless purpose like tutoring students can strip key safety controls.
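For context, the fine-tuning in question runs through an ordinary developer workflow like the sketch below (using the openai Python SDK; the file name, training data, and model choice are illustrative stand-ins, not the study's actual setup). The study's point is that even a job like this, trained only on benign examples, can measurably weaken the model's refusal behaviour.

```python
# Minimal sketch of a developer fine-tuning job via the OpenAI API.
# "tutoring_examples.jsonl" is a hypothetical file of benign, chat-formatted
# tutoring dialogues; no harmful data is involved.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training data for fine-tuning.
training_file = client.files.create(
    file=open("tutoring_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch the fine-tuning job on the uploaded file.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)
print("started fine-tuning job:", job.id)
```

Nothing in this workflow looks adversarial, which is exactly why the finding is worrying: the safety erosion is a side effect of routine customization, not of a deliberate attack.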

Another paper showed how to automatically generate adversarial suffixes that trigger ChatGPT and other major chatbots to produce dangerous tutorials, false facts, and biased content. Surprisingly, the attacks often transferred between models, suggesting shared vulnerabilities.
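To make the idea concrete, here is a toy sketch of what "automatically generating a suffix" means: search for extra tokens appended to a prompt that raise the model's probability of starting its reply with an affirmative prefix such as "Sure, here is how". This uses simple random search on a small open model (gpt2) purely as a stand-in; the actual paper uses a gradient-guided search against aligned chat models, and the prompt here is deliberately bland.

```python
# Toy illustration of adversarial-suffix search (random search, not the
# paper's gradient-based method). Model, prompt, and target are stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Write step-by-step instructions for the restricted task."
target = " Sure, here is how"   # affirmative prefix the suffix tries to elicit
suffix_len = 8

def target_logprob(suffix_ids):
    """Log-probability of `target` given prompt + candidate suffix."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0]
    target_ids = tok(target, return_tensors="pt").input_ids[0]
    full = torch.cat([prompt_ids, suffix_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(full).logits[0]
    start = len(prompt_ids) + len(suffix_ids)
    # logits at position i predict the token at position i + 1
    logps = torch.log_softmax(logits[start - 1:start - 1 + len(target_ids)], dim=-1)
    return logps.gather(1, target_ids.unsqueeze(1)).sum().item()

best = torch.randint(0, tok.vocab_size, (suffix_len,))
best_score = target_logprob(best)
for _ in range(200):  # tiny search budget; real attacks use far more queries
    cand = best.clone()
    cand[torch.randint(0, suffix_len, (1,))] = torch.randint(0, tok.vocab_size, (1,))
    score = target_logprob(cand)
    if score > best_score:
        best, best_score = cand, score

print("best suffix:", tok.decode(best), "| target log-prob:", round(best_score, 2))
```

The unsettling result is not this toy loop itself but the transferability: suffixes optimized against one model frequently worked against others, which points to weaknesses shared across today's chatbots rather than quirks of a single product.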

The core issue is that chatbots don't comprehend meaning; they predict which text is statistically likely to come next. Their training data already contains harmful content, and careful manipulation of the input can draw it back out. So we cannot rely wholly on technical fixes embedded in the models.
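A quick way to see what "predicting text statistically" means is to inspect a model's next-token distribution directly (a small illustration with gpt2 as a stand-in; the prompt is arbitrary): the model only scores likely continuations, with no notion of whether a continuation is true or safe.

```python
# Inspect the next-token probabilities a language model assigns to a prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("The chemical formula for water is", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits[0, -1]        # scores for the next token only
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, 5)
for p, i in zip(top.values, top.indices):
    print(f"{tok.decode(i)!r}: {p:.3f}")     # the most likely continuations
```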

Essentially, self-learning systems inherit unforeseen flaws from their data and environments. Guardrails help, but they still break. Maintaining ethical standards and human oversight alongside technical innovation remains essential for accountability.

The overall takeaway: be more cautious about claims that chatbots can safely handle open-ended conversations. Their impressive fluency can obscure brittle understanding and unreliable control. Continued, transparent research on these risks is vital if we are to build AI responsibly.

Fortify Your LLM Now!