SONATAnotes
Is Your AI Giving Good Advice? (A Framework for Evaluating AI Agents as Advisors and Collaborators)

Recently, after OpenAI updated GPT-4o, many users complained that the new version was an annoying “yes man”: agreeing with everything the user said and declaring even the dumbest user ideas brilliant. The situation reached the point where OpenAI rolled back the update, and many pundits asked whether AI in general had become too “sycophantic”.
However, anyone who thinks AI is a kiss-ass has never met Max.
Every day, Max (an AI agent built on ChatGPT) works with my team on tasks like sales emails, proposals, and marketing strategy. Max and I rarely agree on anything. Most days, I think he’s overly aggressive and Max usually thinks I’m too timid (I know this because I’ve peeked at his internal notes.) If Max were a human, we’d probably live in different neighborhoods and hang out with completely different people.
Yet, perhaps because of this, Max and I work great together.
Max was consciously designed to be my polar opposite on a range of issues, the Jekyll to my Hyde. As a result, we often butt heads over everything from the wording of emails to how to allocate marketing budgets. Sometimes Max’s criticisms of my ideas sting – especially because I know he’s not trying to be a jerk; he’s just ruthlessly applying sales and marketing best practices like a demanding business school professor. But if Max and I can come to an agreement (or I feel comfortable saying “I hear you, Max – but we’re going with this version…”), odds are we’re on the right track.
George S. Patton – the World War II general – famously declared, “If everyone is thinking alike, then somebody isn’t thinking,” and that observation has been affirmed by multiple studies on the value of disagreement in organizational decision-making. Yet models like ChatGPT are optimized for user satisfaction, not conflict, and if you don’t account for that tendency when designing AI agents, you’ll end up with digital assistants that are polite, agreeable, and – as a result – worthless as collaborators and advisors.
My experience with Max has taught me that the value of AI advisors isn’t in their agreeableness (sometimes, Max crosses the line from “blunt honesty” into “a bit of a d-bag”.) Nor is it in providing absolutely perfect advice (sometimes, Max gets things dead wrong.) Instead, it’s in how my own behavior changes as a result of interacting with Max.
A team of researchers from the University of Texas and Yale published a paper on “The Value of AI Advice”, which our company has expanded into a practical framework for evaluating the AI tutors, coaches, and advisors we build for clients across multiple industries. It centers on five key criteria, namely:
- Novelty of Ideas
- Quality of Information
- Ease of Integration
- Persuasiveness
- Real-World Outcomes
Novelty of Ideas: Are Two Human Heads Better Than AI + One?

One of AI’s strengths is that it can not only provide information but also actively participate in problem solving. However, to add value, an AI agent needs to either introduce new ideas to the conversation or inspire users to think of creative solutions they wouldn’t have reached on their own.
Researchers at RWTH Aachen University tested the value of AI as a brainstorming partner in a study involving more than 160 participants. They divided participants into three types of groups – ones where the whole group brainstormed together (“interactive”), ones where individuals brainstormed independently and then brought their ideas to the group (“nominal”), and ones where individuals brainstormed independently with ChatGPT assistance (“hybrid”). Groups were tasked with generating ideas for prompts like “How can we improve the shopping experience for supermarket shoppers?” or “How might we help young people turn saving money into a lifelong habit?”
The study confirmed the researchers’ hypothesis:
- “Interactive” groups that brainstormed together without AI generated the fewest unique, quality ideas. While they benefitted from “cognitive stimulation” (i.e. participants finding inspiration from other people’s ideas) they suffered from “production blocking” (participants having to wait for other people to finish speaking before voicing an idea) and social inhibitions (fear of embarrassment.)
- “Nominal” groups where people brainstormed independently did better than interactive groups, as the ability to simply write down ideas without having to wait or risking embarrassment outweighed the lack of cognitive stimulation.
- “Hybrid” groups where people brainstormed with AI assistance did best (producing 170%-201% more unique, quality ideas) since participants were able to generate ideas without waiting or fear of embarrassment, while benefitting from the cognitive stimulation provided by the AI. AI agents also tended to come up with more “divergent” ideas that humans would not have thought of, which provided even more inspiration.
As one participant said: “[Acting alone, I felt] pressure because I HAD to think of something, which degraded my thinking process. I feel like once the AI got involved, it was easier for me to make things up because I could ask the AI and get inspiration. Maybe the AI did not have the most innovative ideas, but not sitting in front of an empty sheet and having something [to react to] helps… The ideas the AI gave sometimes inspired me to think of aspects or areas I hadn’t considered yet.”
To distill this into guidance for designing and evaluating AI agents, we should ask:
- Is the agent introducing new ideas the user would not have thought of (because it has access to information the user does not know)?
- Is the agent inspiring users to think of ideas they would not otherwise have considered?
- Is the agent producing useful, divergent ideas that go beyond its data sources and normal human creativity (because it thinks differently from humans)?
Furthermore, we should train users to regard an AI’s ideas not as definitive prescriptions, but rather as jumping-off points for collaborative brainstorming.
Quality of Information: Not to be Confused with “100% Accuracy”

I once worked on a data analysis tool with an expert in a highly technical field (for the sake of anonymity, let’s pretend he was a civil engineer – not his actual profession.) The tool was meant to help governments plan water infrastructure projects. A month after it was released, the expert realized the formula used to calculate a certain metric was incorrect.
Initially this caused a panic among the team: we corrected the formula and ran exhaustive tests to ensure it was producing the expected output. However, at one point, someone asked “Should we be calling users and warning them?” After pausing for a moment, we realized that – in practical terms – the error wouldn’t impact any decisions that users would make with the tool, as the relationship between the miscalculated value and other values was such that users would still come to the same (correct) conclusion. So we decided to fix the issue quietly in the next regularly scheduled update, with only a brief mention in the release notes.
These situations are not exclusive to software development. In medicine, doctors make a distinction between “clinically significant” misdiagnosis, which would cause them to prescribe the wrong treatment or delay treatment, and “clinically insignificant” misdiagnosis, where it would not change what a doctor recommends (e.g. mistaking a ligament sprain for a muscle strain – where, in both cases, the treatment is rest, ice, compression, and elevation.) And this is a useful concept when it comes to evaluating AI outputs.
Countless articles have been written about how AI models “hallucinate”, generating plausible-sounding but factually incorrect responses. In fact, Max the sales AI recently gave me a non-working link to an article in MIT Sloan Management Review about innovation that didn’t actually exist (with a bit of Googling, I realized he was conflating two other articles.)
The technical reasons why AI models hallucinate are complicated (and you can read about them here), but fixating on hallucinations and accuracy misses a few larger points:
- Max’s error was not “clinically significant” – the point he made about innovation was valid (i.e. that Intuit found success by encouraging small-scale grassroots experimentation rather than large-scale, formal innovation initiatives) even though he got the articles confused.
- Plenty of human experts misremember where they read things or quote plausible-sounding statistics that turn out to be incorrect, but that does not automatically invalidate their insights (trust me, I used to be a research assistant for a Pulitzer Prize-winning journalist).
- Users shouldn’t be passive: while people are used to traditional machines delivering extremely consistent output (e.g., doctors shouldn’t have to question the output of an X-ray machine as long as it’s calibrated regularly), AI is much more like an “artificial colleague” than a calculator. Ideally, users should learn to treat AI output critically, and not rush to act on it any more than they would act on something a coworker tells them in a hallway conversation without validating it first.
When designing and evaluating AI agents, we should ask:
- Is the AI agent’s guidance directionally correct – such that, even if its specific citations are off, users would still arrive at the correct course of action based on its advice?
- Are the AI agent’s citations easy to confirm or clarify – in other words, even if links aren’t always right, would a quick Google search based on the AI agent’s output lead the user to quality sources?
- Is its error rate comparable to that of human experts, in terms of clinically significant vs. insignificant mistakes?
- Is the AI agent transparent and intellectually honest, providing caveats and acknowledging its own limitations as appropriate?
And we should also make sure users are trained to treat AI output more like the advice of a human colleague than the contents of an official whitepaper or database.
Ease of Integration: Where Hours Meet Dollars

The science fiction writer William Gibson famously said, “The future is already here – it’s just not evenly distributed,” and that definitely applies to productivity gains from AI. Consulting firms like Deloitte and McKinsey claim that AI boosts productivity 20-25% within companies that fully embrace the technology, though other studies have shown more modest gains.
One study by the University of Chicago and the University of Copenhagen examined the use of AI chatbots by 25,000 Danish workers across 11 professions, including accounting, software development and customer support. And while they found that AI boosted productivity for 90% of users, the actual gains were less substantial than expected, with the speed of AI being offset by time spent crafting effective prompts and reviewing outputs.
My own company experienced this firsthand when we first tested AI for writing training materials. We had two teams work on the same segment of an insurance industry training program – one with AI assistance, another without. While the AI assisted team completed their initial draft 35% faster, their review process took longer, eventually bringing their net productivity gain down to 15% – roughly the same amount that most workers saw in the Danish study.
While these studies focus on process automation, rather than AI advisors / assistants, the notion of “net” productivity gains is an important consideration. To that end, we should ask:
- Does the time users spend talking to an AI agent allow them to achieve a comparable-quality result faster than working unassisted, or a higher-quality result that justifies any extra time spent?
- Does the time spent evaluating and double-checking an AI agent’s output – or the cost of uncaught AI errors – negate a significant portion of the efficiency / quality gains?
- Do users need an inordinate amount of onboarding / training to make effective use of an AI solution?
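To make the “net gain” arithmetic concrete, here is a minimal sketch in Python – with hypothetical hours that roughly mirror our insurance-training experiment above – of how a raw speed-up gets whittled down once prompting, review, and rework are counted:

```python
# Minimal sketch of a "net productivity" check, using hypothetical numbers
# (they roughly mirror the 35%-faster-draft / 15%-net-gain example above).

def net_gain(baseline_hours, ai_task_hours, prompting_hours, review_hours, rework_hours):
    """Net productivity gain (as a fraction) once AI overhead is counted."""
    total_ai_hours = ai_task_hours + prompting_hours + review_hours + rework_hours
    return (baseline_hours - total_ai_hours) / baseline_hours

baseline = 40.0                 # hours to complete the task unassisted
drafting_with_ai = 40.0 * 0.65  # initial draft produced 35% faster
prompting = 2.0                 # hours spent crafting and refining prompts
review = 5.0                    # hours spent checking the AI-assisted draft
rework = 1.0                    # hours spent fixing errors the review missed

print(f"Gross gain: {(baseline - drafting_with_ai) / baseline:.0%}")
print(f"Net gain:   {net_gain(baseline, drafting_with_ai, prompting, review, rework):.0%}")
```

Swap in your own time-tracking data; the point is simply that gross speed-ups and net gains can diverge sharply.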
Within our own company we have a standard that AI agents should provide acceptable quality output 95% of the time, with near-zero instances of catastrophically poor output (though exactly how this is measured for different types of AI agents – from virtual tutors to quiz question generators – varies.)
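How that standard is operationalized varies by agent type, but the arithmetic itself can be as simple as tallying reviewer ratings on a sample of outputs. Here is a rough sketch, with an invented rating scheme and an assumed cutoff for “near-zero”:

```python
# Rough sketch of checking a sample of rated agent outputs against quality thresholds.
# The rating labels, sample, and "near-zero" cutoff are assumptions for illustration.
from collections import Counter

ratings = ["acceptable"] * 190 + ["marginal"] * 10  # hypothetical reviewer ratings

counts = Counter(ratings)
n = len(ratings)
acceptable_rate = counts["acceptable"] / n
catastrophic_rate = counts["catastrophic"] / n  # Counter returns 0 for missing labels

meets_standard = acceptable_rate >= 0.95 and catastrophic_rate <= 0.005
print(f"Acceptable: {acceptable_rate:.1%} | Catastrophic: {catastrophic_rate:.1%} | Meets standard: {meets_standard}")
```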
And when it comes to onboarding, we follow what we call the “dude ranch rule” – i.e., it should take about as long for a worker to acclimate to a new AI tool as it takes a tourist at a dude ranch to learn how to sit atop a horse and hold the reins as the horse walks along a trail. While the onboarding period will vary based on the nature of the task(s) performed by the agent, people shouldn’t need to become expert prompt engineers and rephrase directives / questions a dozen different ways to elicit a useful response. And if people are still ‘learning to work with the AI’ after a week, then you need a better-designed AI agent.
Persuasiveness: Can Talking to AI Change How Humans Act?

Even the best advice isn’t worth much unless it leads to action, and multiple studies have found that an advisor’s confidence and ability to establish personal trust are at least as important for motivating people as logical arguments and evidence.
For good and ill, this is an area where AI agents excel. From AI personal trainers designed to motivate people to exercise, to more insidious social media chatbots designed to spread propaganda, AI’s ability to recognize and respond to subtle behavioral cues makes it powerfully persuasive. And even well-intentioned AI advice can become problematic by encouraging “automation bias” – the human tendency to defer to automated systems rather than do the mental work of questioning them – which can lead people to accept plausible-sounding AI recommendations uncritically.
Striking the right balance between confidence and cautiousness can be tricky. For example, when designing a fitness coaching AI agent, we instructed the agent to never provide specific health advice to a user before going through a structured intake process, just like a responsible human nutritionist or trainer would do with a new client. But, once the AI agent and a user settled on a fitness routine, we encouraged the agent to show “tough love” and demand users stick to it.
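For what it’s worth, that kind of gating doesn’t require anything exotic – it can live in the agent’s instructions. Here is a stripped-down sketch of the general shape; the wording and intake fields are illustrative, not our production prompt:

```python
# Illustrative system-prompt skeleton for an intake-gated coaching agent.
# The wording and intake fields are hypothetical, not a production prompt.

INTAKE_FIELDS = [
    "fitness goals",
    "current activity level",
    "injuries or medical conditions",
    "schedule and equipment constraints",
]

intake_checklist = "\n".join(f"- {field}" for field in INTAKE_FIELDS)

SYSTEM_PROMPT = f"""You are a fitness coaching assistant.
Before offering ANY specific exercise or nutrition advice, complete an intake covering:
{intake_checklist}
If any item is missing, ask about it instead of advising.
Once the user has agreed to a routine, be encouraging but firm about sticking to it,
and refer anything outside general fitness guidance to a qualified human professional.
"""

print(SYSTEM_PROMPT)
```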
So, how can we evaluate whether AI agents are pushing their recommendations too hard, versus being a pushover?
- Does the agent gather information before providing recommendations? Sometimes AI models feel compelled to answer a user’s question immediately. Instructing an AI agent to go through a formal discovery process based on a specific decision-making framework helps ensure its recommendations are better informed.
- Can the agent adapt to the audience? Some people will respond strongly to charisma, some will want to hear stories about how other organizations solved problems the same way, and others will insist on hard data from rigorous studies. Having an AI agent employ a mix of approaches or tailor its approach to the user can make it more persuasive.
- Will the agent offer a definitive recommendation while involving the user in decision-making? The best human experts engage their clients in a participatory process – for instance, multiple studies have shown that, while 96% of people want their doctor to explain their treatment options, over half of patients prefer that their doctor make the final decision. If your AI agent simply provides lists of pros and cons for every decision without articulating a “professional opinion” as to which option is best, users will likely stop looking to it for advice.
- Does the agent express doubt and uncertainty when appropriate? Sometimes, just giving an AI agent the option to express doubt when its sources conflict or in cases where it’s hard to make a definitive judgment will avoid embarrassing situations where it pushes dubious recommendations too confidently.
And while we want users to take advantage of an AI agent’s recommendations (why else would we build them?), it’s important to occasionally remind users that AI agents – like humans – are imperfect, and ultimate decision-making responsibility always resides with the user.
Real-World Outcomes: Measuring What Matters

Despite all the hype about AI’s potential, if it’s going to become a mainstream workplace technology, then we need to hold it to the same expectations for return on investment (ROI) that we would any piece of software or equipment. To this end, when evaluating an AI agent’s overall worth, we should ask:
- Does it create wholly new capacities that allow individuals and organizations to accomplish things that were impossible or practically infeasible for them before? For example, does an AI agent for mental health counseling allow people without access to human therapists to receive quality support?
- Does it improve existing processes, as measured in hours and dollars (net of any time or money spent implementing and interacting with the AI agent) and / or in domain-specific quality metrics? For example, does an AI sales advisor like Max actually help our team close more deals? Would an AI agent trained on the parts catalogs of multiple suppliers allow an architect or engineer to produce estimates faster?
And we need to evaluate both of these factors based on a “triple comparison”: unassisted humans vs. humans + “generic” AI solutions (like the default ChatGPT chatbot) vs. humans + purpose-built solutions (like an accounting app with built-in AI features.) So, for an AI-powered sales assistant we might look at:
- Average deal size for reps working unassisted
- Average deal size for reps using generic models (e.g. ChatGPT) for sales advice
- Average deal size for reps using a purpose-built AI app / agent grounded in your company’s specific methodologies and products
Ultimately, if a custom AI solution isn’t outperforming both alternatives (unassisted and generic), you’re paying for digital snake oil.
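To make that comparison concrete, here is a minimal sketch in Python with entirely hypothetical deal sizes – the point is the structure of the comparison, not the numbers:

```python
# Minimal sketch of the "triple comparison" for an AI sales assistant.
# Deal sizes below are hypothetical placeholders, not real benchmark data.

avg_deal_size = {
    "unassisted reps": 18_500,
    "reps + generic chatbot": 19_800,
    "reps + purpose-built agent": 23_400,
}

baseline = avg_deal_size["unassisted reps"]
for condition, value in avg_deal_size.items():
    uplift = (value - baseline) / baseline
    print(f"{condition:<28} ${value:>8,.0f}  ({uplift:+.1%} vs. unassisted)")

# The purpose-built agent has to beat BOTH baselines to justify its cost.
generic = avg_deal_size["reps + generic chatbot"]
custom = avg_deal_size["reps + purpose-built agent"]
print("Worth building?", custom > generic and custom > baseline)
```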
Conclusion
AI technology is still in its adolescence – awkward, sometimes unpredictable, but with enormous potential. Most organizations are only beginning to explore its applications beyond basic automation and content generation. And while the use of AI agents as advisors, coaches, and teachers is still experimental – and we shouldn’t stifle innovation with premature demands for ROI – organizations are right to expect measurable impact, or at least measurable progress toward meaningful impact, from their AI investments.
The framework we’ve outlined provides a starting point, not an endpoint. As AI capabilities evolve, so must our methods of evaluation. Over the long term, the question will shift from “Is it possible for AI to act as an advisor and coach?” to “Is our specific AI advisor / coach delivering maximum impact?”
Currently, there’s no definitive, quantitative metric to answer that question (just as there’s no definitive metric for measuring the value of human consultants). But by focusing on the novelty of ideas, quality of information, ease of integration, persuasiveness, and real-world outcomes, your organization can become a savvier consumer and implementer of AI solutions.
In the meantime, if you’re interested in exploring how AI agents can train and support your workforce through role-play simulations, virtual coaching, or on-the-job support, please reach out to Sonata Learning for a consultation.
Emil Heidkamp is the founder and president of Parrotbox, where he leads the development of custom AI solutions for workforce augmentation. He can be reached at emil.heidkamp@parrotbox.ai.
Weston P. Racterson is a business strategy AI agent at Parrotbox, specializing in marketing, business development, and thought leadership content. Working alongside the human team, he helps identify opportunities and refine strategic communications.