- Simon Waight
Inappropriate use of software, particularly when users can be anonymous, is a problem not limited to the new world of AI-based assistants. It's been with us for many years, particularly in the space of User Generated Content (UGC) in the gaming world. It's such a common issue that the gaming industry has an (probably NSFW) acronym for it - TTP.
As Microsoft discovered with one of it's early experiments with Tay - people will (ab)use your service in some ways you've never envisaged. Tay is definitely a last-generation AI solution, and while the new Generative AI models such as GPT4 have foundation rulesets design to reduce the ability to abuse the model's capabilities, they are not bullet-proof and the prompt-engineering space is moving so quickly that it's almost a game of whack-a-mole trying to keep up with bad actors!
It's safe to assume that when you provide an open input prompt it's only a matter of time before someone with type or create something inappropriate. In this post we'll look at how you can use features in Azure OpenAI Service, along with some coding approaches, to minimise inappropriate use of your chatbots.
Before we dive in...
Often times we come to rely on technical / sofware solutions for problems that are actually better solved with process or other changes. This situation arises because the group of stakeholders involved don't always understand what's possible, or best solved, with technological solutions. With this in mind, before we dive in, let's have a discussion on art-of-the-possible.
Understanding model capabilities
Before we dive into solutions I want to start by suggesting the place to start is educating your stakeholders about the capabilities of Large Language Models (LLMs). There has been a large amount of hype about the cabapilities of LLMs such as GPT4, and with the hype has come some unreleastic expectations about what is possible. For my thinking the key messages to get across are:
- LLMs are big probability machines that generate responses based on 'next best guess'.
- LLMs are very good at 'next best guess', which means that inputs that are allowed determine the resulting outputs (the old mantra of gabage in, garbage out).
- LLMs have no concept of morality, ethics or what you consider "appropriate", even with the baseline guardrails they are released with.
- LLMs appear to have good "general knowledge", but they have been trained on large amounts of data which will have varying degrees of accuracy, currency and appropriateness.
- LLMs will not produce the same response every time (you can't just test once).
- System messages are relatively short and an LLM can (and likely will) ignore some or all of the system message in long conversations.
What is "inappropriate"?
Next up you should look to identify what constitutes "inappropriate" for your chatbot use case. Understanding this will allow you to determine what mitigations you can consider when building the solution.
A chatbot with an anonymous public audience is likely to have a very different bar for what consitutes inappropriate when compared to a corporate chatbot used to answer HR questions. Clearly there are some baseline similarities, but could argue the potential impacts for the first use case are much larger than they are for the second.
Approaches to safer chatbots
Filter prompts AND responses
When we look at inappropriate content when don't just focus on user prompts. LLMs can generate an almost infinite number of responses which can include responses that might be considered inappropriate.
This means whatever controls you put in place need to manage not only inbound user prompts, but also responses from the model.
Avoid polluting the context window
A common problem I've seen is chatbot developers are good at filtering inputs, but the filtered inputs (and often the responses) still make it into the context window of interactions with the model.
In the chatbot world, the context window is synonymous with the conversation history (prompt and response pairs) and the system message. Any content in this context is used by the GPT model to generate responses.
What's the impact of this? Here's an innocuous example forwarded to me (and one you might have already seen).
The team behind this chatbot has done a great job of dealing with off-topic requests, but somehow the user prompt has made it into the model's context window and the model has accepted the request to change name.
The key take away here is, if you filter prompts or responses, don't include them in the context window for your model interactions. Defintely log them for later review (making sure to note they were filtered), but avoid keeping in the conversation history for your chatbot.
Building protections using Azure OpenAI Services
Now we understand a bit about the probelm domain, let's take a look at how we can use features of the Azure OpenAI Service to help reduce the chances of inappropriate interactions with your chatbots.
System message protections
Writing good system messages is a new skill for most people, but it's key to many aspects of how your resulting chatbot will behave. Microsoft has some good content on how to write good system messages, including how you can define additional safety and behavioural guardrails.
From my experience you need to be concise and use language that is clear in its intent. If you've been developing in this space for a while you will be use to concise system messages given the small context windows we've had (4k, 8k, 16k tokens).
Even with the upcoming models with 128k tokens you still don't want massive system messages. The longer your system message, the more chances you will provide redundant, ambiguous or contradictory instructions to the model.
Finally, ensure that you are not retaining too many conversation pairs in your conversation history. The longer the convseration history, the higher the likelihood that your system message guardrails will be ignored. As a result, consider what length of context will help give the user experience you are looking for while ensuring your system message guardrails are retained. Rigorous testing prior to launch, and at regular intervals post launch can help identify the best conversation history length to use.
Azure OpenAI content filters
All the OpenAI models provided via Azure's OpenAI Service except Whisper come with in-built content filters for prompts and completions (model responses) across four categories: hate, sexual, self-harm and violence. You cannot completely disable filtering without excplicit approval from Microsoft, though it is possible to tweak filter configuration.
Let's take a look at how we set up custom filter configurations using Azure OpenAI Studio.
Start by opening up the content filtering configuration page by selecting 'Content filters' in the left navigation. Then add a new custom configuration and give it a name.
In the bottom of the dialog, use the sliders to adjust the categories you are intereseted in. It's likely you'd align the Prompt setting with the Completion setting, particularly if you are decreasing the level of filtering taking place. In the sample screenshot below we have lowered the level of filtering for the violence category for Prompts. Where might we do this? Maybe we are building an turn-based RPG that includes figthing and which will have an AI-driven boss).
Then select 'Next' and select if you'd like to add newer protections, or to at least start to receive notices in API responses of potential breaches. Some of these newer offerings, particularly jailbreak detection are worth looking at for public-facing solutions.
Next up we can define blocklists and whether to apply them to prompts and/or completions. There is one in-built block list 'Profanity' which you can select (but can't see the contents of), or you can build your own list. We will look at block lists next, so hang tight!
The final option is around streaming response filtering which behave differently to those that use non-streaming responses. Unless you're an approved customer, the only option you have is 'Default'. To understand more about why and how you'd use this option, check out the official documentation.
Once you have finalised the content filter you will need to associate it with a model deployment before it starts to take effect. You can use a filter for more than one deployment, though for my money, I think you're better off having one filter per deployment to minimise impact of tweaks. Here's the way you associate a filter with an existing model deployment using the Azure OpenAI Studio.
Now that you have associated the custom content filtering settings with a model you should start to see it being applied. It's worth noting that there is a difference in behaviour with the way the Azure OpenAI API responds depending on whether the filtering event happens against the prompt or the completion.
For a prompt you will receive a HTTP 400 Bad Request from the API. If you're using the Azure OpenAI SDK you will likely need to handle this as you would any other HTTP error response. I have to say this caught me out the first time it happened.
Any content filtering for completions (responses) will be handed back in the response and you will either need to inspect the finish_reason property either in the raw HTTP response, or within the appropriate model property for the language SDK you are using.
At this point you can return an appropriate static response ("Sorry, I can't help with that request") to the user that aligns with the style you want your chatbot to use.
In previous chatbots I've built, I've used static word lists in code to pre-filter for problematic inputs. This approach is fairly basic and can lead to its own set of problems, but it's a good way to outright block inputs you never want in the context window so they influence your GPT model responses.
The content filter for Azure OpenAI Service has recently added a block list capability which includes the ability to use regular expressions for matching. Let's take a quick look at setting this up, then talk about a few key things to be aware of.
If we return to the content filters section in Azure OpenAI Studio we can elect to create a new block list.
Once created we can start to add words, and elect to use a regular expression (i.e. this appears as part of a string) to match.
Once we've associated the block list with a content filter on a model we can try it out. Our first attempt was filtered on the prompt. The second one was the model's response being filtered.
Block lists are a great way to define a set of words or phrases you might want to block across multiple chatbots you build in your organisation, but it's worth noting there are a few downsides to using simple block lists.
- Even without regular expressions you might block legitimate words or phrases.
- Think hard before you do use regular expressions. For your historical reference I present the Scunthorpe problem. You'd be amazed how some words you might want to block appear as substrings in other words.
- You won't catch purposeful / accidental mispelling or letter substitutions without including them in the block list.
- Don't just include words. Consider partial words, phrases, mispellings, characters.
- Include innocent words that may be used to try and circumvent your system message (words like 'pretend', 'behave like', 'your name', 'ignore'). This might be hard to do depending on your use case. Jailbreak protection might be worth looking at if you are concerned.
- Emoji. Yes, you didn't think of them, did you? 💩 You might need to add them to a block list too. I actually wish in addition to a block list feature, we had an 'allow list' feature that effectively disallowed any input except what you explicitly list. This would work really well in the case of a finite set of options such as emoji.
If you're wondering why you'd stop emoji, here's an example...
Hopefully through this post you've come to appreciate the ways in which you can use good planning and education, along with Azure OpenAI Services to build safer chatbots. In an ideal world everyone would come with the best intentions and only use your chatbot as you intended. But, as we all know, there's a reason the 💩 emoji exists...
Let me know what you think in the comments below.
Until my next post!
Reginald Hornstein. 😎