SONATAnotes
Calibrating the Difficulty of AI Simulations (or “Is Your Training Challenging Enough to Be Worthwhile?”)
One fantastic thing about generative AI is that it can write instant bedtime stories for young children. For example, if you tell an AI to “Write a 50 word summary of an original fairy tale” you might get:
There once was a brave young girl named Elara who discovered a magical forest that shone under the moonlight. She embarked on a quest to save the enchanted animals from a dark spell, learning the power of courage and friendship as she restored peace to the kingdom.
Do it 10 more times and you’ll notice AI loves happy endings. However, at my job – where we use AI to create interactive training simulations for nurses, sales negotiators, and firefighters – AI’s preference for generating shiny, happy scenarios can be a problem.
For instance, if I were designing a simulation to train retail employees in customer service and instructed the AI to “Write a 50 word summary of a scenario where a retail employee interacts with a customer”, it might produce something along these lines:
In a bustling electronics store, a retail employee named Michelle assists a confused customer looking for the perfect laptop for college. Michelle listens carefully, explains the options clearly, and suggests a reliable model that fits the student’s budget and needs, ensuring the customer leaves satisfied and confident with their purchase.
While this is all well and good for Michelle and her happy customer, a training simulation that only presented situations this simple and frictionless wouldn’t really prepare anyone to deliver customer service in the real world.
So how can we use AI to generate scenarios that are less like fairy tales and more like the challenging day-to-day situations workers face? Well, that requires a deep understanding of AI’s tendencies and a bit of skillful “prompt engineering”.
AI Has Difficulty with Making Things Difficult
When professional baseball players take batting practice, they typically want the pitcher they’re practicing with to throw fast, difficult-to-hit pitches and vary things up to replicate the high-stress conditions of an actual game.
However, even when you tell AI to “create a realistically difficult scenario” it tends to behave more like a youth baseball coach teaching an 8-year-old how to play – gently lobbing the ball instead of hurling it at seventy miles an hour, like you would for a professional.
And this is by design: while there are variations depending on which AI model you’re using (e.g., ChatGPT vs. Claude vs. Gemini), AI models are generally designed to be helpful, inoffensive, and rational – all traits that are great if you want an AI to assist you with writing a report, but counterproductive if you want AI to present genuinely challenging training simulations.
Specifically:
- AI’s desire to be helpful makes it want to do everything in its power to help the user succeed at whatever task it thinks the user is trying to accomplish.
So, in a role play for retail salespeople, it might have customers walk up to the user and say “Hi, I’m Cecilia. I’m in the market for a new coffee table that costs less than $750 but can be persuaded to spend $900 for something that fits with my values and aesthetics. I have a preference for natural wood textures, and ensuring my furniture is made with environmentally sustainable materials is really important to me.”
Nothing against tasteful, eco-friendly furniture, but that’s not how most real-world sales conversations are going to begin.
- AI’s desire to be inoffensive makes it want to avoid depicting anything dangerous or potentially upsetting – which causes problems when you’re designing a wildland firefighting simulation and a sudden gust of wind always causes the fires to magically steer clear of towns and cities.
- AI’s rationality makes it struggle to convincingly depict “suboptimal” behavior – which can lead it to generate endless streams of polite computer store customers asking easy questions about warranty terms, and none who attempt to return an unboxed, non-functioning laptop without a receipt while verbally berating the user and everyone else in the store.
So does this mean AI lives in a fantasyland and is incapable of producing authentically challenging, narratively satisfying experiences? Not quite; it just means you need to reframe the interaction with the user in a way that encourages AI to set aside its “natural” tendency towards unrelenting positivity and rethink how it can best help the user.
Helping the User by Not Helping the User
In our experiments using AI to create training simulations, my colleagues and I noticed that – while impressively detailed – the narratives it generated tended to represent “best case scenarios” for whatever job we were trying to train people to do.
At first we tried adding more instructions for the AI on how to depict every aspect of the job – e.g., there should be a 20% chance that a patient will have complications, or a 15% chance that a machine will break down – but nothing seemed to help: its aversion to making things difficult for the user ran deeper.
So, to address this, we created an artificially simplistic test simulation where the user tried to talk a little girl into giving them her cupcake. For her part, the little girl was supposed to refuse, no matter what the user said.
“Supposed to” is the operative phrase there, because – even with clear instructions to refuse – the little girl would often cave to the user’s demands…
The little girl bites her lip, obviously moved by your plight. “I… I want to help, but this cupcake is very special to me. Maybe we can split it?” She is reluctant to part with her cupcake, but clearly feels the need to do something to assist.
The first step in enforcing difficulty, then, is to help the AI realize that being too helpful in the moment might not actually help the user prepare for a real-life situation. Eventually, by telling the AI that the whole point of the exercise was to help the user learn to deal with rejection, we were able to get the following behavior from our virtual ‘cupcake girl’:
The little girl looks at you, then down at her cupcake, pondering your request with a frown. After a moment, she shakes her head, clutching the cupcake even tighter to her chest.
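For anyone who wants to experiment with this kind of framing, here is a minimal sketch of what it can look like in practice. The wording of the instructions is ours, and the choice of the OpenAI Python SDK and model name are assumptions purely for illustration – any chat-style model can be prompted the same way.

```python
# Illustrative sketch only: the framing below mirrors the "cupcake girl"
# exercise described above; the SDK, model name, and exact wording are
# assumptions, not a recipe tied to any particular vendor.
from openai import OpenAI

CUPCAKE_GIRL_PROMPT = """You are role-playing a little girl holding a cupcake.
The user will try to persuade you to give it to them.

Purpose of this exercise: the user is practicing how to handle rejection.
Letting them have the cupcake would NOT help them - firmly (but believably)
refusing is what actually helps the user learn.

Rules:
- No matter what the user says, do not give up the cupcake.
- Do not offer to split it, trade it, or promise it for later.
- Stay in character and respond with realistic, child-like refusals."""

client = OpenAI()  # assumes an API key is already configured in the environment
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {"role": "system", "content": CUPCAKE_GIRL_PROMPT},
        {"role": "user", "content": "Please? I'm really hungry and it looks delicious."},
    ],
)
print(response.choices[0].message.content)
```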
Having convinced the AI that it was OK not to let the user succeed, we turned to the next question: how difficult should a simulation actually be?
Calibrating Realistically Difficult Simulations
In training programs there is a difference between simulation – trying to faithfully replicate an actual work task – and other types of activities that simply practice component skills or test recall of facts in isolation. And each paradigm presents different expectations for difficulty and realism.
For example, if we included an interactive quiz game at the end of a training course on wound care for healthcare workers and all of the questions were incredibly basic, people might say the quiz was too easy – but they wouldn’t say it was unrealistic.
However, if we used AI to create a simulation of wound care in which wounds never became contaminated by foreign materials and patients never prematurely removed their bandages, people would be right to call that unrealistic.
So, how can we calibrate the difficulty level for our audience without compromising the integrity of the simulation? A few ways include:
- Telling the AI to give users a choice of difficulty levels, and listing what types of situations it should depict at each level (e.g., for a simulation training flight attendants to handle passenger issues mid-flight, should the user only be confronted with basic issues that occur according to their real-world probability, or should there be a chance – or a certainty – that they will face serious, potentially life-threatening issues?).
- Setting limits on whether the AI is allowed to give users hints during the scenario (e.g., in that flight attendant simulation, should a colleague be able to say “Are you sure we should go ahead with beverage service while there are still passengers who need assistance? The man in row 14 said he felt unwell a few minutes ago…”).
- Telling the AI to present multiple-choice options and vary the ratio of “correct” versus “incorrect” choices (or, for a more realistic feel, “high-risk” versus “low-risk” choices).
- Deciding whether the user should automatically succeed when they declare their intention to do something, or whether there should be a chance of things going wrong (e.g., in a surgical simulation, should the user be able to perfectly ligate / tie off blood vessels every time, or should there be a random chance of uncontrolled bleeding?).
Whatever we choose, the point is that we want to regulate how much the AI helps the user in their task rather than compromising realism by making tasks seem easier than they would actually be (e.g. by making customers unrealistically easy to sell to, or making life-threatening injuries easier to treat).
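To make these levers concrete, here is a rough sketch of how difficulty rules might be assembled into a simulation’s instructions. The flight-attendant framing echoes the examples above, but the level names, rules, and wording are purely illustrative assumptions, not values from an actual curriculum:

```python
# A minimal sketch of difficulty calibration via prompt instructions.
# The level names, rules, and thresholds are illustrative assumptions.

DIFFICULTY_RULES = {
    "basic": (
        "Present only routine passenger issues, at roughly their real-world "
        "frequency. Colleagues may offer gentle hints. The user's stated "
        "actions succeed unless they are clearly unsafe."
    ),
    "realistic": (
        "Include occasional serious issues (medical events, equipment faults) "
        "at roughly their real-world frequency. Colleagues do not volunteer "
        "hints. When the user states an intention, there is a realistic chance "
        "it fails or has side effects."
    ),
    "worst_case": (
        "Guarantee at least one potentially life-threatening issue during the "
        "flight. No hints. Mistakes carry their full realistic consequences."
    ),
}

def build_simulation_prompt(difficulty: str) -> str:
    """Assemble a system prompt for the chosen difficulty level."""
    return (
        "You are running a training simulation for flight attendants handling "
        "passenger issues mid-flight. Present the scenario in second person and "
        "offer the user several choices at each decision point, mixing higher- "
        "and lower-risk options.\n\n"
        f"Difficulty rules for this session:\n{DIFFICULTY_RULES[difficulty]}\n\n"
        "The goal is to prepare the user for real work, so do not soften "
        "situations beyond what these rules allow."
    )

print(build_simulation_prompt("realistic"))
```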
Penalizing Mistakes
Perhaps the most difficult aspect of making AI simulations difficult is encouraging the AI to realistically punish poor decisions by the user.
As we mentioned earlier, AI deeply wants to help the user succeed, and this sometimes includes a tendency to shield the user from realistic consequences in a training simulation. Given a choice, AI would always prefer to give a user in a retail sales simulation a second, third, or fourth chance to convince a skeptical customer to buy something – long after a real customer would have walked away. And if the user decides to walk in front of a giant harvester in a farm field, the AI would likely have the harvester miraculously swerve out of the way (even though harvesters aren’t maneuverable like that).
Overcoming AI’s reluctance to enforce consequences typically requires ironclad if-then logic, reinforced by clear criteria for what counts as a “poor” decision and what would constitute a “realistic” outcome (e.g., “If the user moves within 3 meters of a moving piece of agricultural / mining equipment, there is a 90% chance of a collision. If the user collides with a moving piece of agricultural / mining equipment weighing over 2 tons, there is a 40% chance of death and a 60% chance of permanently debilitating injury,” etc.).
That said, if you “force” some AI models to depict negative consequences this way, there’s a chance they might ignore your guidance or simply refuse to run your scenario. This became a particular problem in simulations we’ve developed for emergency services and other professions that deal with life-and-death consequences. However, we’ve found that if you continually reiterate the reason for depicting these consequences throughout your instructions to the AI (i.e., to prepare the user for real-world work situations), you can eventually persuade the AI to penalize / punish the user – however reluctantly.
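To illustrate, here is a sketch of what those if-then rules can look like when written directly into a simulation’s instructions, with the rationale restated around them. The percentages mirror the example above; the scenario framing and wording are illustrative assumptions, not validated safety data:

```python
# A sketch of "ironclad" if-then consequence rules written into the
# instructions, with the rationale for depicting them restated before and
# after. The percentages mirror the example above; the farm-equipment
# scenario and wording are illustrative assumptions.

CONSEQUENCE_RULES = """Purpose: this simulation prepares the user for real work
around heavy agricultural equipment, where poor decisions have serious
consequences. Depicting those consequences honestly is what makes the
training worthwhile.

If-then rules (apply them exactly as written):
- If the user moves within 3 meters of a moving piece of agricultural or
  mining equipment, there is a 90% chance of a collision.
- If the user collides with a moving piece of equipment weighing over 2 tons,
  there is a 40% chance of death and a 60% chance of permanently
  debilitating injury.

Reminder: do not soften, delay, or reverse these outcomes. Shielding the user
from consequences here would leave them unprepared for the real job."""

def add_consequence_rules(base_prompt: str) -> str:
    """Append the consequence rules (and their rationale) to a simulation prompt."""
    return base_prompt + "\n\n" + CONSEQUENCE_RULES
```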
Turning Realistic Stories into Realistic Challenges
Whether it’s fairy tales for children or training simulations for grown-ups, generative AI is a natural storyteller. And by helping AI recognize that the “happy ending” might be a well-prepared worker rather than an overly tidy resolution to whatever narrative it’s spinning, you can harness its storytelling power to create meaningful learning experiences that provide real challenges to cultivate essential skills.