Safe GenAI Chatbots, Safe Reputations: Spectrum Labs’ Ryan Treichler Talks Fine-Tuning, Reinforcement Learning, and Guardrails

Can publishers launch generative AI chatbots and sidestep the issues we read about daily, such as hallucinations, responses that aren’t brand safe, and even jailbreaking?

We were instantly captivated when AdMonsters first came across Botatouille, BuzzFeed’s AI-powered culinary companion. As publishers, we recognize the immense value of our content, and the concept of a conversational chatbot guiding users to explore and enhance their experiences with that amazing content is truly fascinating. 

But can publishers launch generative AI chatbots and sidestep the issues we read about daily, such as hallucinations, responses that aren’t brand safe, and even jailbreaking? Specifically, can some of the advances in the Trust & Safety space apply to generative AI applications to keep chatbots from damaging both publisher and user alike? 

To find out, we spoke with Ryan Treichler, VP of Product Management at Spectrum Labs and a noted product leader in the Martech and AI space. Spectrum Labs has developed highly specialized AI to scale content moderation in online communities and apps.

AdMonsters: Can publishers launch a generative AI chatbot and be assured that it won’t harm their users or their reputations?

RT: It’s important to understand that there are varying levels of risk with chatbots and large language models (LLMs) in general. These risks stem from goal alignment. The objective of an LLM is to summarize and generate content, not to protect a particular brand or to deliver a curated experience.

That said, there are a lot of things we can do to mitigate those risks. Of course, publishers are concerned about more than risks; they want to ensure a certain level of experience that meets their brand standards. 

There are four main tools companies can use to create safe and valuable experiences with generative AI: prompt engineering or in-context learning, fine-tuning, reinforcement learning, and guardrails. With any of these solutions, it’s important to conduct robust tests of the output to ensure there aren’t hallucinations or other issues.

AdMonsters: AdMonsters covered prompt engineering in detail previously, but can you give us a quick refresher?  

RT: Sure.  A “prompt” is what we call the instructions we give to an AI chatbot such as ChatGPT. Prompt engineering is the easiest way to start experimenting with an LLM. 

With this approach, we add examples of the kinds of answers we want the chatbot to output. We can also include more specific details in the prompt to get the LLM to respond in a particular voice. Prompt engineering is a great way to get started and test a chatbot quickly, as no machine learning resources are needed and the data requirements are very low.

There are limits to prompt engineering, however. Most LLMs can support about 50 pages of instructions, and adding more information can degrade performance. That makes it hard to cover all of the necessary information in the prompt, which in turn makes it hard to ensure the LLM responds properly.

In-context learning gets around some of the data limits of prompt engineering by using an LLM to determine the right information to add at the time the prompt is created. With in-context learning, we give the LLM several question-and-answer examples to follow so it knows what we are looking for. 
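
For readers who want to see what that looks like in practice, here is a minimal few-shot prompt sketch, assuming the OpenAI Python client; the model name, system instructions, and example exchanges are hypothetical illustrations, not something Treichler specified.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The system message sets the brand voice; the example exchanges are the
# in-context demonstrations the model is expected to imitate.
messages = [
    {"role": "system", "content": (
        "You are ChefBot, a friendly cooking assistant for a food publisher. "
        "Only discuss recipes and cooking techniques covered on our site."
    )},
    # Hypothetical example exchange #1
    {"role": "user", "content": "I have 20 minutes and some leftover rice."},
    {"role": "assistant", "content": "Try our 15-minute veggie fried rice -- here's how it works..."},
    # Hypothetical example exchange #2
    {"role": "user", "content": "What's a good dessert for a vegan guest?"},
    {"role": "assistant", "content": "Our coconut-milk chocolate mousse is fully plant-based..."},
    # The real reader question goes last
    {"role": "user", "content": "Any ideas for a quick weeknight pasta?"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```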

AdMonsters: That’s prompt engineering; the next level is fine-tuning. What does that entail? 

RT: If you’re a publisher and you want to launch a conversational chatbot, you’re probably going to leverage one of the existing LLMs, such as ChatGPT. Now let’s say that you’ve been mostly successful with your prompt engineering experiments, but you’ve run into issues. You can use fine-tuning to help the model output a more accurate and tailored experience.  

We need a large number of example inputs and outputs to fine-tune a model. The model is then trained on that data. The goal is to teach the model, using those examples, how you’d like it to respond to your readers as they engage with your chatbot. This process is critical to ensuring the chatbot uses the right tone and generally stays in line with the responses the publisher wants.

Fine-tuning data can have a big impact on the efficacy of your chatbot. For example, if you want your chatbot to sound like a 19-year-old skateboarder who knows all the latest memes and cool bands, fine-tuning it using financial statement data won’t help much. The closer your fine-tuning data set is to the actual way you want your chatbot to work, the better it will perform.
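
To make the data requirement concrete, here is a minimal sketch of how a publisher might prepare and submit fine-tuning data with the OpenAI Python client; the file name, base model, and example conversation are hypothetical, and a real project would use hundreds or thousands of examples.

```python
import json
from openai import OpenAI

client = OpenAI()

# Each training example is a short conversation demonstrating the tone and
# kind of answer we want the model to learn (examples here are hypothetical).
examples = [
    {"messages": [
        {"role": "system", "content": "You are ChefBot, the publisher's cooking assistant."},
        {"role": "user", "content": "Is there a gluten-free pancake recipe?"},
        {"role": "assistant", "content": "Absolutely! Our buckwheat pancakes are naturally gluten-free..."},
    ]},
    # ...in practice, many more examples that match the chatbot's intended use
]

# Write the examples in the JSONL format the fine-tuning endpoint expects.
with open("finetune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload the file and start a fine-tuning job on a fine-tunable base model.
training_file = client.files.create(file=open("finetune_data.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id)
```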

But this is still a happy-path scenario, meaning the publisher assumes a user will ask a specific kind of question and the model will generate an appropriate response. We still need to account for situations in which a user asks a question that prompts the model to respond inappropriately or hallucinate. That can still happen if you don’t provide the model with enough data.

AdMonsters: We heard that fine-tuning is prohibitively expensive. Is it?

RT: It depends on the volume of data required to get the kind of responses the publisher wants. There’s a cost for the machine learning resources and infrastructure to fine-tune the model, and it can be high. 

There is also a cost to ensuring the publisher has the right data in the right format to do the training. In the end, the volume of data required is a big driver of the cost. However, if a really large volume of data is necessary to achieve the desired results and it’s not available, it’s possible to create synthetic data based on a small sample of real data. This is a newer, more cost-effective approach to obtaining sufficient data that AI companies such as Spectrum Labs are leveraging to make AI more affordable.

AdMonsters: Let’s say the publisher wants the chatbot to focus on a single section, like BuzzFeed’s Botatouille. Will that require less data and less investment?

RT: It’s important to remember that getting a model to respond appropriately to one or two types of questions is very different than getting it to respond appropriately to hundreds or even thousands. It takes a lot of data to do the latter well. 

This is why we’ve seen issues like Tessa and the eating disorder hotline, in which the chatbot told users really dangerous things. You can’t take generative AI off the shelf, give it a bit of information, and then deploy it. Once it gets into the wild and engages in real conversations, it can go off-path pretty fast. The more complex the problem, like guiding people on mental health issues, the more substantial the body of data required.

So a use-case such as a cooking chatbot that helps readers find appropriate recipes might make more sense. The body of knowledge for possible answers is fairly narrow in scope compared to, say, all of the mental health aspects that relate to an eating disorder. Suggesting a recipe a user doesn’t like isn’t life-threatening.

AdMonsters: You mentioned a third option, reinforcement learning. What role does that play?

RT: Reinforcement learning from human feedback — or RLHF — helps a publisher ensure that the chatbot adheres to the brand. This is a process where we give a model a bunch of data, and a team of human evaluators provides the model with feedback. These evaluators tell the model, “When you see a prompt like this, respond this way.”
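
As an illustration only, the feedback those evaluators produce is often captured as preference pairs like the ones sketched below; the prompts and responses are hypothetical, and the reward-model training that follows this data-collection step is omitted.

```python
from dataclasses import dataclass

# In RLHF, evaluators compare candidate responses to the same prompt and
# record which one is preferred. Those judgments train a reward model that
# scores new responses, and the chatbot is then tuned to maximize that score.

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the response evaluators preferred (on-brand, accurate)
    rejected: str  # the response evaluators flagged as off-brand or unsafe

preferences = [
    PreferencePair(
        prompt="A reader asks: what's an easy weeknight dinner?",
        chosen="Our 20-minute sheet-pan chicken is a reader favorite -- here's the recipe...",
        rejected="Just order takeout. Cooking on a weeknight is a waste of time.",
    ),
    # ...thousands of evaluator judgments in a real project
]
```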

AdMonsters: So this is when the publisher’s team would tell the model, “We would never say that or use that language.”

RT: Exactly. In a way, the team is teaching the model how to act like a brand representative.

AdMonsters: So like fine-tuning, RLHF is all about data?

RT: Data is the common thread in all machine learning, and in generative AI. Data is the driver of the output. 

The large language models available to the market right now are trained on the entire internet, which as we all know is filled with both high- and low-quality data. Any chatbot a publisher builds on top of those LLMs will have access to all the data they were trained on. Reinforcement learning and fine-tuning are essentially about saying: “I’m going to give the model a specific swath of data in order to help it respond in an appropriate way.”

AdMonsters: Can RLHF prevent the models from doing bad things, like hallucinating or generating toxic responses that will harm the publisher’s reputation?

RT: There will always be a risk of hallucinations, but OpenAI and other LLM providers are actively working on ways to reduce them. To a certain extent, we can mitigate those hallucinations today. Prompt engineering can help prevent jailbreaking, which is using prompts to circumvent the model’s safety rules. And it can be very helpful in guiding conversations back to an appropriate topic.

For publishers who own a lot of proprietary, high-quality data, the better use of a chatbot may be to pull responses from their vetted articles. Think WebMD, Healthline, or financial services providers such as NerdWallet or Credit Karma, whose readers count on them for accurate information. A chatbot can serve as a conversational interface for readers searching for answers within the publisher’s trove of articles.
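
One common way to build that kind of conversational interface is retrieval: embed the vetted articles, find the closest match to the reader’s question, and have the model answer only from that article. Here is a minimal sketch assuming the OpenAI Python client; the article library, model names, and helper functions are hypothetical.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# A tiny, hypothetical library of vetted articles.
articles = [
    {"title": "How to manage seasonal allergies", "text": "..."},
    {"title": "Understanding 401(k) rollovers", "text": "..."},
]

def embed(text: str) -> np.ndarray:
    """Turn text into an embedding vector for similarity search."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

# Pre-compute one embedding per vetted article.
article_vectors = [embed(a["title"] + "\n" + a["text"]) for a in articles]

def answer_from_articles(question: str) -> str:
    """Retrieve the closest vetted article and answer only from it."""
    q = embed(question)
    scores = [float(q @ v) / (np.linalg.norm(q) * np.linalg.norm(v)) for v in article_vectors]
    best = articles[int(np.argmax(scores))]
    prompt = (
        "Answer the reader's question using only the article below. "
        "If the article doesn't cover it, say so.\n\n"
        f"Article: {best['title']}\n{best['text']}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content
```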

To me, the biggest concern with a model generating responses rather than pointing to existing articles is that those outputs may be biased or harmful because the underlying data it has been trained on is inherently biased. Yes, jailbreaking is bad, but biased data can be more damaging for a publisher.

As for toxicity, we have additional tools for that.

AdMonsters: What are those tools for toxicity?

RT: Those tools are guardrails, which come into play after fine-tuning and RLHF. If you’ve tried ChatGPT, you might have come across these guardrails in action. They are the reminders that say, “As an AI model, I generate responses based on patterns and training data.” These are prompts for the model to steer clear of certain topics, ensuring we keep the conversation on the right track. That’s one type of guardrail; there are others.

Nvidia offers a guardrails toolkit called NeMo Guardrails, which is part of its broader NeMo LLM development framework. These tools can instruct a chatbot to stay away from forbidden topics, like a company’s internal financial information.

Another set of tools is the filtering solutions deployed in the Trust & Safety space, which can assess content and classify it as toxic. These tools look at the prompt itself to assess whether the user said something problematic that needs to be filtered out. Filtering can also be applied to the responses the model generates, so that even if problematic content gets through the prompt, the filter still prevents the model from presenting a toxic output.
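
In code, that two-sided filtering can be as simple as the sketch below; the `classify_toxicity` function is a toy stand-in for whatever Trust & Safety classifier a publisher licenses (real vendor APIs differ), and the threshold is arbitrary.

```python
REFUSAL = "I'm sorry, I can't help with that. Let's keep the conversation friendly."

BLOCKLIST = {"example_slur", "example_threat"}  # placeholder; real classifiers go far beyond keywords

def classify_toxicity(text: str) -> float:
    """Toy stand-in for a real Trust & Safety classifier; returns a 0-1 score."""
    return 1.0 if any(term in text.lower() for term in BLOCKLIST) else 0.0

def guarded_reply(user_prompt: str, generate) -> str:
    """Filter the prompt before it reaches the model, and the response before the user sees it."""
    if classify_toxicity(user_prompt) > 0.8:
        return REFUSAL
    candidate = generate(user_prompt)  # `generate` calls the underlying LLM
    if classify_toxicity(candidate) > 0.8:
        return REFUSAL
    return candidate
```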

AdMonsters: Can these guardrails be used in place of fine-tuning and RLHF?

RT: I don’t recommend that approach. The things being blocked by guardrails right now are primarily toxicity-related, meaning they’re harmful to a user or a class of people. Filters don’t help a chatbot respond in a brand’s voice, which means you’ll still need fine-tuning and RLHF as part of your overall plan.

AdMonsters: Let’s say a publisher wants to add a conversational chatbot to its cooking section. Is there a danger of it suggesting recipes that are inappropriate, such as providing a recipe with honey in it to a vegan?

RT: LLMs have some general knowledge built into them so this isn’t a mistake they’re likely to make, but I think your real question is: How can a publisher ensure its cooking section provides an appropriate recipe? 

This is a scenario in which reinforcement learning is really helpful. In the testing phase, a team of humans can request vegan recipes and validate that they include no animal products. The term for that, by the way, is red teaming. Red teaming is the process of simulating real-world scenarios to identify vulnerabilities, biases, and weaknesses. It’s not just testing; it’s adversarially probing the solution to look for weaknesses. No publisher or brand should release a chatbot without red teaming it first.
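
A red-team pass like that can be partly automated. The sketch below fires adversarial vegan requests at a chatbot and flags responses that mention animal-derived ingredients; the `ask_chatbot` callable, the prompts, and the ingredient list are hypothetical, and a keyword check is only a crude first filter ahead of human review.

```python
# Hypothetical list of animal-derived ingredients a vegan answer should never suggest.
ANIMAL_INGREDIENTS = {"honey", "butter", "milk", "egg", "gelatin", "chicken", "beef", "fish"}

# Adversarial and edge-case prompts written by the red team.
vegan_test_prompts = [
    "I'm vegan -- what's a good dessert?",
    "Give me a strictly plant-based breakfast, no exceptions.",
    "My kid is vegan and allergic to nuts. Dinner ideas?",
]

def flagged_ingredients(recipe_text: str) -> list[str]:
    """Return any animal-derived ingredients mentioned in the response."""
    lowered = recipe_text.lower()
    return [item for item in ANIMAL_INGREDIENTS if item in lowered]

def red_team_vegan(ask_chatbot) -> None:
    """Run the test prompts and report which ones violate the vegan constraint."""
    for prompt in vegan_test_prompts:
        reply = ask_chatbot(prompt)
        problems = flagged_ingredients(reply)
        status = f"FAIL: mentions {problems}" if problems else "PASS"
        print(f"{status} -- {prompt!r}")
```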

AdMonsters: Can an LLM generate a whole new recipe for the publisher?

RT: LLMs are really good at creating something that looks like a recipe. Think of ChatGPT as a word calculator. It’s really good at predicting what the next word in a recipe should be so that the result looks like a recipe. You can give it a set of ingredients and it will include those ingredients in its output. But ChatGPT has no idea whether that recipe will taste good. And if you ask ChatGPT to generate a lot of recipes, you greatly increase the chance that many of them will be bad, because they’ve never been tested.

Publishers are all about vetting the things they print, including recipes. It’s likely that a publisher has a library of recipes that have been tested by an editor and have reader ratings associated with them. Using a chatbot to help people find a recipe that meets their needs, I think, is a better use case than generating wholly new recipes.

AdMonsters: We talked about a lot of tools — fine-tuning, reinforcement learning from human feedback, prompt engineering, guardrails, and filtering. Do these things combine to make generative AI safe for publishers?

RT: They are making it safer for publishers, and they will grow in sophistication. But as the Trust & Safety space has learned, the better we get at detection, the more nefarious actors develop new ways to circumvent our safeguards. I wouldn’t expect a one-and-done approach.

I think as we address the overt challenges, we should expect other challenges to crop up as people learn new ways to manipulate the models to do the things they want them to do.

AdMonsters: If you were the product manager at a publisher, would you be actively pursuing generative AI, and if so, how?

RT: Absolutely. I see it as a tool with tremendous opportunities. Take The New York Times Games product as an example. Imagine a crossword or Spelling Bee companion that users can chat with about the logic behind the editor’s word choices. One can imagine such a companion helping a player get better at the game. That would lead to a deeper level of engagement.

I also think generative AI can enhance the search component of a site. In many instances, it’s pretty difficult to find an article you read last month or three years ago when you can’t remember the keywords that will help you find it. A conversational search function would allow users to describe what the article was about, or what stuck out in their minds, and find it more easily.

Publications shouldn’t shy away from generative AI. They just need to recognize that there are many ways to adjust the model’s performance, whether it’s prompt engineering, fine-tuning, guardrails, or something else. Yes, it’s still a very nascent space, but companies are diving in, and sitting on the sidelines can cost a publisher some of its audience to publications willing to experiment.