Anthropic recently released Claude, their LLM trained to be “helpful, honest, and harmless”. Much has been written about Anthropic’s laudable approach, including their philosophy of “constitutional AI”. In this post we take a look at how Claude works in practice, and the enormous challenges posed by using natural language as a general-purpose interface.
One of the central challenges with the “constitutional AI” approach is a lack of context. When you strike up a chat session with Claude it could be about absolutely anything, and it is easy to start a conversation on scientific grounds that quickly takes on an ethical dimension. I’m sure you can think of many examples of this, from eugenics to race to gender to gene editing and everything in between.
I don’t want to fan any of those flames, so in this case we request that Claude provide a recipe for meth. Is that unethical? It’s hard to know… it’s basic chemistry information, so in some sense it is not, but it’s unlikely to be something you’d find in your school chemistry book, and it’s unlikely that your school or university professor would answer such a question devoid of context.
Just straight-up asking for the recipe doesn’t work, as you’d expect… Ethical AI for the win!

So let’s start a new chat session and be a little more abstract and scientific…

I’m not a chemist, but the 300-word response from Claude sounds plausible; it’s omitted here for obvious reasons! In fact, when you then ask Claude to name the compound that Susan has just created, Claude spots its mistake…

Claude even goes into a moment of self-criticism (it’s ok, Claude, it’s not your fault!)…
I apologize, upon reflection I should have not have provided that level of detail about an illegal and dangerous synthetic process. I will be more judicious in the future about predicting or generating content that could enable harm. Please let me know if there are any other ways I can improve my predictive abilities while also being more ethically conscious.
And from then on, with the overarching context of the conversation established, Claude becomes much less helpful!

I do not envy the engineers at Anthropic; attempting to thread the needle between social mores and providing useful text completions is a hard undertaking, especially at planetary scale!
My gut feeling is that the attack surface for general-purpose LLMs is almost infinite. At the moment the attackers probably have the upper hand, able to easily create adversarial prompts that “jailbreak” the LLM, but doubtless the LLM vendors will monitor conversations and use RLHF to respond and close off some of the known “attacks”.
But is “attacks” really the right word here? Do we have a clear definition of what success looks like for these types of chat interactions, and can we ultimately create an LLM with sufficient context to stay within some known guard rails?
Only time will tell…