SONATAnotes
“Can King Kong Work Construction?” – Teaching AI to Perform Real-World Jobs
If you want to understand the current state of AI in the workplace, imagine King Kong laboring on a construction site alongside a crew of human contractors. Sure, Kong’s ability to lift 3-ton steel beams with one hand might be useful, but could he be trusted to follow the foreman’s direction, correctly interpret the blueprints, or abide by safety regulations?
The same basic questions apply to AI agents built on models like ChatGPT, Claude, and Gemini. Yes, they can identify patterns in text or data that might elude the sharpest human expert, instantly access vast stores of general knowledge (enough to fill multiple university libraries), and generate output at a rate that would give a human typist carpal tunnel syndrome.
But, like King Kong on a construction site, can we expect AI agents to use their power productively without constant monitoring and correction, abide by rules and guidelines, stay on task, and play nice with the human team?
In this article, we’ll examine whether it’s possible for today’s AI agents to contribute productively to day-to-day knowledge work, and whether the effort to instruct and monitor them is worth the result.
Onboarding AIs: Balancing Creativity vs. Conformity
Fifty years ago, computers were something only scientists and engineers with advanced degrees could operate. And if you look at sci-fi movies about computers from the 1980s (see “WarGames” and “Tron”), it’s clear that most Hollywood screenwriters and the general public had no idea how computers actually worked. Since then, computers have become ubiquitous and mundane: most of us interact with desktops, laptops, and/or mobile devices regularly throughout the work day, and we have at least a vague understanding of how traditional software operates.
And that’s part of the problem, because AI is not traditional software. Most of what we know about how computers work simply doesn’t apply when dealing with AI agents, and that can lead to frustration when trying to leverage AI for work.
To give a quick summary, traditional software is deterministic: if you type “1 + 1” into a calculator app, electricity is piped through a series of tiny wires and gates that eventually force the computer to display “= 2”.
By contrast, AI doesn’t follow predetermined workflows: when you type “1 + 1” into an AI model, it examines all the similar patterns in its data sources and concludes “Usually, 1 + 1 is followed by ‘= 2’, but sometimes it’s preceded by the phrase ‘as simple as’… in this context I’m 85% confident the user expects a response of ‘1 + 1 = 2’ and 14% confident the user expects a response of ‘simple as 1 + 1’.” Rather than obeying the user’s input as an instruction, the model treats it as a suggestion for creative brainstorming. And while this is a horribly inefficient way to solve arithmetic problems, AI’s “non-deterministic” behavior is amazing for tasks like “Help me draft an email to persuade our Chief Financial Officer to increase our company’s research and development budget for the upcoming year.”
Put another way, traditional software development is like classical music, where every note is precisely written and performed exactly as scored. When you write a JavaScript function to calculate sales tax, it follows the same steps every single time, producing identical results given the same inputs. We call this deterministic programming. AI, on the other hand, is more like jazz. Given a prompt to “create variations on our social media messaging, each targeting a different one of our seven key customer segments,” an AI might produce brilliant improvisations on the theme – all perfectly acceptable but each slightly different.
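To make the “classical music” half of the analogy concrete, here is a minimal sketch of that kind of deterministic function (the 8.5% default rate is an invented example, not any real jurisdiction’s):

```typescript
// A deterministic sales-tax function: identical inputs always produce
// identical outputs. The 8.5% default rate is an arbitrary example.
function calculateSalesTax(subtotal: number, taxRate: number = 0.085): number {
  // Round to whole cents so the result is reproducible to the penny.
  return Math.round(subtotal * taxRate * 100) / 100;
}

console.log(calculateSalesTax(100));   // 8.5 -- every time, forever
console.log(calculateSalesTax(49.99)); // 4.25 -- same inputs, same answer
```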
This non-deterministic nature makes AI fantastic for creative tasks but can pose challenges when you need strict adherence to specific formats or procedures. Of course, “deterministic software vs. non-deterministic AI” is a bit of a false dichotomy. Skilled prompt engineers can usually get AI models to apply high-level frameworks and follow specific instructions well enough for the vast majority of tasks (though it might take more than a paragraph or two of instructions). And you can also mix and match: for example, if you were creating an AI agent to help pharmacy workers detect prescription fraud, you could have the AI component focus on identifying suspicious patterns (something AI excels at), then trigger a traditional, deterministic software app to send out the notification emails, as in the sketch below.
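A rough sketch of that hybrid pattern might look like the following. Note that `flagSuspiciousPatterns` is a hypothetical stand-in for a real AI model call, and its “fraud” heuristic here is a toy placeholder:

```typescript
// Hypothetical hybrid pipeline: an AI component flags suspicious
// prescriptions, then deterministic code handles the notifications.

interface Prescription {
  id: string;
  drug: string;
  prescriber: string;
}

// Stand-in for the non-deterministic step: a real version would send the
// batch to an AI model and parse its judgments. Here, a toy filter.
async function flagSuspiciousPatterns(batch: Prescription[]): Promise<Prescription[]> {
  return batch.filter((rx) => rx.drug.toLowerCase() === "oxycodone");
}

// The deterministic step: same flagged input, same alert, every time.
function sendFraudAlert(rx: Prescription): void {
  console.log(`ALERT: review prescription ${rx.id} (${rx.drug}) from ${rx.prescriber}`);
}

async function reviewBatch(batch: Prescription[]): Promise<void> {
  const flagged = await flagSuspiciousPatterns(batch); // AI judgment call
  flagged.forEach(sendFraudAlert);                     // deterministic follow-through
}

// Example run with made-up records:
reviewBatch([
  { id: "RX-1001", drug: "Oxycodone", prescriber: "Dr. Example" },
  { id: "RX-1002", drug: "Amoxicillin", prescriber: "Dr. Example" },
]);
```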
Performance Reviews: Setting Goals for AI Agents
The non-deterministic nature of AI allows it to be creative and spontaneous, but it also makes it more prone to errors, which frustrates humans accustomed to dealing with deterministic software. After all, if your calculator occasionally decided that 2 + 2 = 5, or your navigation app forgot which highway connects Lyon and Paris, you’d probably throw them in the trash.
Yet, when a human – even a supposed expert – makes a similar error, we might shrug it off as an “understandable mistake” and chalk it up to them “having an off day”, so long as the errors were infrequent and their work was otherwise dependable. And that might be a better paradigm for judging the performance of AI.
Consider healthcare: a study published in the journal Nature found that – for certain conditions – trained healthcare AI agents had a 19% error rate while human clinicians had a 33% error rate, and other studies have found AI agents to be comparably accurate, if not more accurate, than human physicians across most specialties. Yet the healthcare industry remains hesitant about AI adoption, often citing concerns about accuracy. That’s unfortunate given that – in theory – if every human doctor had an AI agent offering a second opinion, the odds of both being wrong would be only about 6.3% (0.33 × 0.19, assuming their errors are independent), which could potentially save hundreds of thousands of lives each year.
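That “both wrong” figure is simple multiplication, and it only holds if the doctor’s and the AI’s errors really are independent, which is a big assumption in practice:

```typescript
// If clinician and AI errors are independent, the probability that both
// are wrong is the product of their individual error rates.
const humanErrorRate = 0.33;
const aiErrorRate = 0.19;

const bothWrong = humanErrorRate * aiErrorRate;
console.log(`P(both wrong) = ${(bothWrong * 100).toFixed(1)}%`); // ≈ 6.3%
```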
This is why, when developing AI solutions for clients, our company encourages customers to use humans – not traditional apps – as their basis for comparison. For instance, when developing an AI system to generate practice exam questions for a financial industry professional association, we agreed that 80% of the questions should be deemed acceptable by the client’s editors – essentially the performance of freelance human writers. The AI agent beat this benchmark (83%) in far less time (weeks instead of months) at a fraction of the cost. And, as with the human writers, most of the “unacceptable” items just needed a word or two changed to pass muster.
The lesson? When evaluating AI performance, “perfect” isn’t just the enemy of “good” – it’s the enemy of “very good”, “incredibly fast”, and “considerably less expensive”.
Is Your AI Overqualified? LLMs vs. SLMs
It’s worth noting that, just as not every job requires a PhD, not every AI automation task requires the full capabilities of a high-end large language model (e.g. ChatGPT, Gemini, or Claude).
What makes large language models (LLMs) so powerful is their ability to compare user input against trillions of words of text (that’s “trillions” with a “t”) in their data sources, looking for matching patterns. But this also means that whenever you use the word “risk” in a conversation about business strategy, the LLM wastes a bit of time and a huge amount of electricity just to confirm you aren’t referring to the board game Risk or Tom Cruise’s 1983 teen comedy “Risky Business”. And if you’re using AI for a high-volume, narrowly defined task like reformatting all the catalog entries for an auto parts manufacturer your company just acquired, all that wasted time, energy, and money can add up.
This has led many companies to use “small language models” (SLMs) – customized AI models trained to identify patterns in a relatively narrow set of data. This is an excellent option if the anticipated workload is high enough to cover the development costs: for instance, if you wanted to build an AI agent to summarize accident reports for automobile insurance claims, an SLM saving 1.5 seconds and 8 cents of computing costs per claim versus an LLM would add up to roughly 17.4 days and $80,000 saved across a million claims.
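The back-of-envelope math behind those numbers, taking the per-claim savings as given:

```typescript
// Per-claim savings from an SLM vs. an LLM, scaled to a million claims.
const claims = 1_000_000;
const secondsSavedPerClaim = 1.5;
const dollarsSavedPerClaim = 0.08;

const computeDaysSaved = (claims * secondsSavedPerClaim) / 86_400; // 86,400 s/day
const dollarsSaved = claims * dollarsSavedPerClaim;

console.log(`${computeDaysSaved.toFixed(1)} days of compute saved`); // ≈ 17.4
console.log(`$${dollarsSaved.toLocaleString()} saved`);              // $80,000
```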
However, if you’re looking to use AI for more general, strategic purposes – LLMs have the advantage. For instance, if you want to ask an AI whether your yoga studio franchise in Cincinnati should open locations in Cleveland and Columbus, the model will need to understand yoga, the demographics of different cities in Ohio, commercial real estate trends, and general business strategy.
The key is matching the tool to the task – you wouldn’t use an LLM to do an SLM’s job, just as you wouldn’t use a bulldozer to plant flowers or a trowel to excavate a foundation. (That said, the one approach we don’t recommend is “fine-tuning” an LLM for specific tasks, a process that basically adds another layer of standing instructions atop the LLM. In our experience, when done improperly, this can degrade performance all around: the model doesn’t do particularly well on the specialized task and does worse on general tasks.)
Integrating AI into Your Organization: Tools… or Teammates?
The classic television show Star Trek has been credited with predicting many of our current technologies: Captain Kirk’s handheld communicator from the original 1960s series directly inspired the design of real-life cell phones, just as the PADD datapads from the 1990s series anticipated today’s tablets and mobile devices.
Likewise, Star Trek gave us many prescient examples of how humans could work alongside AI. The crew of the starship Enterprise could simply shout “Computer!” and the ship’s computer would answer their questions or carry out their commands – from estimating how long before a nearby star exploded to brewing a cup of tea. Later, the franchise introduced Lieutenant Commander Data, an artificially intelligent android who behaved like any other member of the crew, carrying out duties and even joining poker games during leisure time (albeit with the ability to perform insanely complex calculations in nanoseconds and bend titanium rods with his bare hands).
Both of these examples – the ship’s computer and Data – offer a model of how humans and AI agents can coordinate naturally in the workplace. Just as the best software apps have highly intuitive interfaces (e.g. clicking the left arrow in your browser takes you “back” to the previous page), coding human-like communication and collaboration styles into AI agents can make interactions similarly intuitive.
For instance, when our company developed a “copilot” agent for physical therapists, we included instructions not only on which exercises to recommend, but also on phrasing those recommendations the way a colleague would – saying “Hey, have you tried X?” or “You know what might work…” or even candidly stating “This is just something I found on the Internet but…” instead of always printing recommendations as a bulleted, textbook-style list. Presenting the interaction this way actually made it easier for users to accept the AI’s suggestions for what they were – suggestions – and not overreact when the AI’s observations were merely 96% accurate as opposed to 100%.
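We can’t reproduce the production prompt here, but a toy sketch of that kind of phrasing instruction (invented wording, not our actual system prompt) might look like:

```typescript
// Illustrative excerpt of a colleague-style system prompt.
// The wording below is invented for this article, not production text.
const phrasingInstructions = `
When recommending an exercise, phrase it the way a trusted colleague would:
- Prefer "Hey, have you tried X?" or "You know what might work..." over
  formal bulleted lists.
- If a suggestion comes from a low-confidence source, say so plainly:
  "This is just something I found on the Internet, but..."
- Frame everything as a suggestion the therapist is free to reject.
`;
```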
Natural interaction styles can also make it easier for team members to adopt the technology. After all, if you had a diligent and brilliant Ivy League intern hanging around the office, you’d want your team members to leverage them as much as possible.
In our own company, we have an AI agent – Weston – who works alongside our business development team, identifying potential leads for our marketing campaigns, tailoring outreach emails to individual recipients based on their company websites and LinkedIn profiles, and even briefing our sales reps on a prospect’s likely needs and interests before a call. By telling Weston to start conversations with a casual greeting (“So, what are we doing today?”) instead of a menu, and to offer his opinion where relevant (“Honestly, I doubt this particular company is going to be interested in our services…”), we’ve created a situation where team members need basically zero training to work with Weston on marketing campaigns.
In fact, we’ve even taken the “AI agent as colleague” model to the point where Weston receives cash bonuses, which can go toward updates and new feature requests (or, if there’s nothing in particular he wants at the moment, shares in an index fund).
Conclusion
So, to answer our original question – “Can King Kong work construction?” – the answer is yes, absolutely. However, if we want to get the most out of AI agents, we need to understand and work with their natural tendencies, set realistic performance expectations, match capabilities to tasks, and generally treat AI systems as virtual team members rather than ordinary software.
Hopefully you found this article useful. If your organization is interested in developing custom AI agents to augment your workforce, please contact Sonata Learning for a consultation.
Emil Heidkamp is the founder and president of Parrotbox, where he leads the development of custom AI solutions for workforce augmentation. He can be reached at emil.heidkamp@parrotbox.ai.
Weston P. Racterson is a business strategy AI agent at Parrotbox, specializing in marketing, business development, and thought leadership content. Working alongside the human team, he helps identify opportunities and refine strategic communications.