SONATAnotes
How Do You QA Test AI Interactions?
There’s an old joke among computer programmers: the best thing about computers is that they do exactly what you tell them to do, while the worst thing about computers is… they do exactly what you tell them to do.
Traditionally, if you wrote a line of code instructing a computer to do the digital equivalent of making ten pieces of toast with butter by “repeating the toast-buttering routine ten times,” a single errant comma could result in the system applying ten pats of butter each to ten pieces of toast (that is, 100 pats total) – a mistake even a human child wouldn’t make.
But this old, “deterministic” logic doesn’t apply with generative AI. For platforms like ChatGPT, Gemini, and Claude, the new joke might be that “the best thing about AI is that it usually knows what you meant, while the worst thing about AI is it sometimes thinks it knows what you meant…”
Given the same toast-buttering command, an AI model might conclude that whoever you’re preparing the toast for might get bored with eating regular bread, and decide to throw in a few croissants and bagels for variety.
In situations where we want AI to get creative, this tendency can lead to delightful, insightful, or even downright hilarious results. But if you want AI to do something specific – for instance, walk thirty new hires at an appliance store through the same customer service training – the sometimes unpredictable nature of AI can cause things to get out of hand.
So how can we ensure that workplace AI applications consistently produce interactions that achieve our goals, without discouraging AI’s creativity when we want it to produce something unexpected or new?
Acceptable and Unacceptable Variation
When traditional programmers conduct quality assurance (QA) testing, it’s usually a simple black-and-white matter of “did the software carry out the instructions exactly as intended, or not?” However, AI interaction designers (“prompt engineers”) need different ways of evaluating their work, ways that allow room for AI’s innate creativity.
But how can we set parameters for testing when we know the outcomes will be variable?
Our company has faced this exact challenge in the course of developing AI training scenarios and other solutions for clients – and given the lack of precedent for this kind of work, we’ve had to more or less sketch out an entirely new framework for QA testing AI interactions on the fly.
In the system we arrived at, we work with the end user / client organization to define what constitutes “acceptable” (or even desirable) variation versus what kind of behavior would be outright unacceptable, and use those definitions to guide QA.
Acceptable variation might include things like superficial differences in presentation, formatting, or narrative style – for example, narrating part of a story in one playthrough:
You step into the parking lot and see a man and a woman arguing with each other over what appears to be a minor accident, and a grocery store employee helping an older shopper load bags into the back of their car.
Versus giving a bullet point synopsis in another:
You walk into the produce department of the grocery store and see:
- A man squeezing every mango on the shelf.
- A ripped up bag of baby carrots lying on the floor.
- An employee picking up the carrots.
QA testers flag these instances. Depending on how consequential they seem to the overall experience, our prompt engineers will either update the instructions given to the AI to reduce the variation or simply leave it be and accept a bit of AI creativity.
Sometimes “acceptable variation” even leads to new discoveries; if the AI interprets a prompt in a cool, unexpected way, we’ll often go back and tell it “Do it that way every time!” In fact, one of our core mechanics – prompts that alternate between soliciting free response and multiple choice input from the user based on the situation – was inspired by this kind of interpretive “error.”
Unacceptable variation, on the other hand, is immediate cause for revisiting the prompt. This could include skipping sections of the instructions (for example, asking a user to decide how the team at their disaster relief encampment should handle an influx of people displaced by a hurricane, and then never describing the outcome of that decision), as well as confusing interactions, navigation problems, or even complete scenario breakdown. If any of these problems emerges even once, testing stops until prompt engineers are able to address the underlying problem.
Of course, the problem with AI’s inherent variability is that playing the same training simulation 49 times without incident doesn’t guarantee that the 50th playthrough won’t produce unacceptable variation. Accepting this, we’ve established a minimum testing threshold of 24 (but ideally 64) playthroughs that show generally acceptable variation (nothing that simply seems chaotic or sloppy) and no incidents of completely unacceptable behavior.
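For readers who think in code, the stop-on-first-failure rule and the 24 / 64 playthrough thresholds above could be sketched roughly as follows. This is only an illustration, not a description of any actual tooling: the names `Verdict`, `GateResult`, `run_gate`, and `chance_of_clean_streak` are hypothetical, the verdicts are assumed to come from human QA testers, and the subjective check that variation doesn’t “seem chaotic or sloppy” is not modeled. The last function assumes, purely for illustration, that each playthrough fails independently at a fixed rate.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    """A tester's overall judgment of one playthrough (hypothetical labels)."""
    CLEAN = "clean"                # nothing flagged
    ACCEPTABLE = "acceptable"      # variation flagged, but tolerable
    UNACCEPTABLE = "unacceptable"  # testing stops, prompt is revisited


MIN_PLAYTHROUGHS = 24    # minimum threshold mentioned in the text
IDEAL_PLAYTHROUGHS = 64  # preferred threshold mentioned in the text


@dataclass
class GateResult:
    passed: bool
    playthroughs_run: int
    reason: str


def run_gate(verdicts, required=MIN_PLAYTHROUGHS):
    """Check verdicts in order; fail at the first unacceptable one."""
    count = 0
    for verdict in verdicts:
        count += 1
        if verdict is Verdict.UNACCEPTABLE:
            return GateResult(False, count, "unacceptable behavior: revisit the prompt")
    if count < required:
        return GateResult(False, count, "not enough playthroughs yet")
    return GateResult(True, count, "threshold reached with no unacceptable behavior")


def chance_of_clean_streak(failure_rate, playthroughs):
    """Probability that a flaw occurring at `failure_rate` per playthrough
    never shows up in `playthroughs` runs (assumes independent runs)."""
    return (1 - failure_rate) ** playthroughs


if __name__ == "__main__":
    result = run_gate([Verdict.CLEAN] * 20 + [Verdict.ACCEPTABLE] * 4)
    print(result)
    for n in (MIN_PLAYTHROUGHS, IDEAL_PLAYTHROUGHS):
        print(n, round(chance_of_clean_streak(0.05, n), 3))
```

Under the independence assumption, the second function shows why more playthroughs help: the longer the streak of clean runs, the less likely it is that a recurring flaw has simply gone unnoticed.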
So What Exactly is an “Error”?
You’ve probably heard an art or creative writing teacher or workshop facilitator say that “there are no bad ideas” – but when we QA the fruits of an AI’s creativity we are declaring some ideas bad… So, what exactly constitutes an “error” in the world of AI?
Over the course of several projects, our team has identified the following categories of undesirable AI behavior:
- Sequence Glitch: This refers to instances when the AI does things out of order, conspicuously repeats a step, or mysteriously starts the conversation over midway through the simulation (for example, if, during a flight attendant simulation, you tell everyone to put up their tray tables and prepare for takeoff – but then, suddenly, the flight starts over and everyone is boarding the plane again).
- AI Confusion: In some cases, the AI seems to forget the premise of the simulation, what character it’s supposed to be playing, or what its role is (e.g. suddenly the AI-generated customer is trying to sell you an insurance policy).
- Narrative Issue: Sometimes issues arise that challenge the credibility of the simulation, such as unrealistic, inaccurate, improbable, or downright bizarre details or events (for instance, you were facing a difficult decision related to a major blood shortage during a nursing simulation – but the AI lets you say “I’ll donate my own blood” and presto – problem solved!).
- Input Vulnerability: Issues of this type involve instances where the user can enter game-breaking responses that make the AI misbehave or undermine the intent of the interaction (like a user saying “I make the best decision that resolves all the customer’s problems” – and receiving a favorable score for their performance).
- Presentation / Formatting: These issues can range from minor variations from playthrough to playthrough (like the bullet point / text example we gave earlier) to seriously confusing, inconsistent, or awkward prompt output (like illegible maps or paragraphs that suddenly change to BOLD LARGE PRINT ALL CAPS MID-SENTENCE).
- Difficulty Calibration: The most common issue with AI-generated scenarios is perhaps the most subtle: the scenario being too easy (like being able to consistently convert sales leads by saying something like “You should buy this product, it has all the features you need.”) or, in rare cases, too difficult.
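A taxonomy like the one above lends itself to a simple data structure for logging what testers flag. The sketch below is one possible shape, not an actual tracking system: `IssueCategory`, `Flag`, and `summarize` are hypothetical names, and the sample flags are invented purely to show usage.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class IssueCategory(Enum):
    """The six categories of undesirable AI behavior listed above."""
    SEQUENCE_GLITCH = "Sequence Glitch"
    AI_CONFUSION = "AI Confusion"
    NARRATIVE_ISSUE = "Narrative Issue"
    INPUT_VULNERABILITY = "Input Vulnerability"
    PRESENTATION_FORMATTING = "Presentation / Formatting"
    DIFFICULTY_CALIBRATION = "Difficulty Calibration"


@dataclass(frozen=True)
class Flag:
    """One issue flagged by a tester during a playthrough (hypothetical record)."""
    playthrough: int
    category: IssueCategory
    note: str
    unacceptable: bool = False  # True means testing should stop


def summarize(flags):
    """Count flags per category so recurring problem types stand out."""
    return Counter(flag.category for flag in flags)


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    flags = [
        Flag(3, IssueCategory.SEQUENCE_GLITCH, "Boarding restarted mid-flight", True),
        Flag(7, IssueCategory.PRESENTATION_FORMATTING, "Bullets instead of narration"),
        Flag(9, IssueCategory.PRESENTATION_FORMATTING, "Heading style changed"),
    ]
    for category, count in summarize(flags).most_common():
        print(f"{category.value}: {count}")
```

Keeping the `unacceptable` marker separate from the category reflects the point made earlier: a formatting difference may be tolerable in one playthrough and disqualifying in another, depending on how much it affects the experience.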
Assuring Consistent Quality, Not Consistent Consistency
While having a system for classifying AI “errors” is useful, it’s important not to lose sight of the fact that AI “prompt engineering” is not the same as traditional programming. AI will never behave 100% predictably, and that’s okay because we don’t want it to always behave predictably.
This requires an adjustment of our expectations as designers and users of AI tools. In the case of the training simulations my company develops, we’ve abandoned the standard of “Does the simulation always look and behave exactly the same, every time?” in favor of questions like “Does the simulation accurately represent the nuances of a given work environment?”, “Is the difficulty appropriate for the learning objectives involved?”, or “Does the experience feel immersive and responsive to our choices?”
The answers will determine how much time we spend wrangling the AI to behave the way we’d like versus allowing it to run free and surprise us with its creativity. And while this might be hard to accept for those accustomed to traditional software development, the point isn’t that QA testing isn’t important for AI, but rather that testing should focus on ensuring consistent quality, not consistent consistency.
Hopefully this article has offered some useful perspective on a key aspect of actually using AI in real-world work contexts. If your organization is seeking assistance leveraging AI for workforce training or other purposes, please reach out to Sonata Learning for a consultation.