Is AI Video Ready for Prime Time? (What the History of Netflix and Computer Games Tells Us)
Recently, my company has been experimenting with using AI to train human workers, from generating choose-your-own-adventure scenarios around hospital staffing to free-response role-play activities about sales negotiations, doctor-patient interactions, or even police work.
So far, all of these simulations have taken the form of text chats, with the option to have them read aloud in a range of synthesized voices. But while the reaction from clients has been quite positive, there’s one question we hear a lot: “That’s cool… but can it do video?”
It’s understandable why people might ask. By now, everyone’s seen the demo reels for Sora and other AI text-to-video platforms, which can turn a few typed sentences into Hollywood-quality footage (and if you haven’t seen them, stop reading and watch the video below!)
But while the visuals created by Sora (and static image generators like DALL-E and Midjourney) are alluring, there’s still something about them that doesn’t quite feel ready… at least when it comes to incorporating them into interactive training simulations.
To explain our reservations, we need to revisit a time few people remember: the early days of Internet video, from roughly 1995 to 2005, before Netflix and YouTube made the technology mainstream.
Paint a Fuzzy Picture
There’s no polite way to say this: online video in the late 1990s and early 2000s sucked. The picture quality looked blurry and washed out, like a postcard left out in the rain. The audio sounded like a walkie-talkie and was always a half second out of sync, like a poorly overdubbed kung-fu movie. And worst of all, the browsers and media plug-ins of that era were nowhere near as secure as today’s software, turning streaming media players like RealPlayer into wide-open doorways for hackers to invade your computer.
But while early streaming video companies like Pseudo.com and iClips were spending millions to deliver a terrible viewing experience, future streaming juggernaut Netflix (founded in 1997) decided to focus on delivering video through more mature technologies: DVDs and the U.S. Postal Service.
This wasn’t because Netflix lacked vision; founder Reed Hastings always intended for the site to become a streaming media service. But the company recognized that streaming wasn’t ready, and opted to focus on what the late-90s / early-00s Internet could do well: providing an online catalog for physical merchandise. So while most online video companies collapsed in the “dot com” crash of 2000, Netflix bided its time, kept perfecting its catalog and recommendation algorithm, and didn’t switch to streaming until 2007, by which time it had brand recognition, a great customer experience built around its online catalog, and a robust technical infrastructure (developed and tested over the previous three years) capable of high-quality streaming.
Stuck in the Slow Lane
Of course, there are differences between the early days of streaming video and the early days of AI-generated video. Where old RealPlayer videos looked horrendous, the images and videos created by Sora, Midjourney, and DALL-E all look crisp, high-definition, and in some cases breathtakingly beautiful. However, they still have three critical limitations when it comes to folding them into interactive training simulations: speed, consistency, and cost.
First, let’s talk about speed. While the current versions of Midjourney and DALL-E can make great-looking images, they usually take 20 to 30 seconds to render one. So, if we wanted to add graphics to an interactive sales training role-play involving 12 back-and-forth messages between the user and the AI, that’s roughly 6 minutes of cumulative image rendering time, far longer than most audiences are willing to wait, especially for something work-related.
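To make that arithmetic concrete, here’s a minimal back-of-envelope sketch in Python. The render time and message count are illustrative assumptions taken from the paragraph above, not benchmarks of any particular model:

```python
# Back-of-envelope estimate of cumulative rendering wait in an
# interactive simulation. Numbers are illustrative assumptions.
SECONDS_PER_IMAGE = 30   # assumed worst-case render time per image
EXCHANGES = 12           # back-and-forth messages, each needing a fresh image

total_seconds = EXCHANGES * SECONDS_PER_IMAGE
print(f"Cumulative wait: {total_seconds} seconds (~{total_seconds / 60:.0f} minutes)")
# Cumulative wait: 360 seconds (~6 minutes)
```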
Consistency (or Lack Thereof)
Even if speed weren’t an issue, today’s image generators still need a fair amount of assistance to achieve a consistent look from one image to the next. For instance, if you type “/imagine [prompt] Two girls playing hopscotch on an idyllic playground, one with red hair, the other with pigtails” into Midjourney (or give DALL-E a similar description), you’ll get something reasonably decent-looking (as long as you don’t examine the “hopscotch” diagram or the characters’ features too closely).
But if you then tell the AI to “create an image of the same two girls on the same playground, jumping rope” (attaching the original image for reference), suddenly they’re no longer the same characters, and it’s not even the same visual style.
And while it’s possible to increase the consistency of generated images using features like Midjourney’s character reference (--cref) parameter and “Vary (Region)” tool, those measures currently require a few rounds of iteration and human intervention before they approach the consistency of traditional graphics (which wouldn’t be a huge problem if we didn’t need images on demand for an interactive simulation).
Even if today’s AI were capable of iterating these images on its own, we’re still looking at 10+ minutes of waiting for things to render in our hypothetical sales negotiation simulation (quintuple that if we want video)… a wait time that evokes another bygone era: the early days of desktop computer games.
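For the curious, extending the earlier back-of-envelope sketch with an assumed two rounds of consistency fixes per image, plus a purely illustrative 5x multiplier for video, shows how quickly the waiting compounds:

```python
# Same estimate, now with assumed iteration for consistency fixes and
# an illustrative 5x cost multiplier for video. Not benchmarks.
SECONDS_PER_IMAGE = 30
EXCHANGES = 12
ITERATION_ROUNDS = 2     # assumed extra passes to fix character drift
VIDEO_MULTIPLIER = 5     # assumption: video takes ~5x as long as a still

image_wait = EXCHANGES * SECONDS_PER_IMAGE * ITERATION_ROUNDS
video_wait = image_wait * VIDEO_MULTIPLIER
print(f"Images: ~{image_wait / 60:.0f} min; video: ~{video_wait / 60:.0f} min")
# Images: ~12 min; video: ~60 min
```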
From Text to Photorealistic 3D in Just 45 Years!
So how long will it take before we can deliver interactive AI training simulations with high-quality Sora-like video in real time? The answer will likely play out as a fast-forward repeat of the evolution of computer games, which have evolved from pure “text adventures” like ZORK to photorealistic 3D worlds like GRAND THEFT AUTO 6 over nearly half a century.
Stage 1: Pure Text
The earliest computer adventure games had an entirely text-based interface, similar to how most of us use AI today. However, today’s AI interactions are far more natural, given the technology’s ability to understand free-form human language (by contrast, the text adventures of the 1980s could become frustrating exercises in guessing the exact word the game’s parser was looking for to solve a particular problem).
Deadline (1982)
Stage 2: Graphic Text
By the middle of the 1980s, text adventures were augmented by low-resolution, static images or very simple moving graphics. As AI image generators become faster, cheaper, and more consistent, we’ll probably graduate to a similar experience for interactive AI simulations in a year or two.
Mindshadow (1983)
Stage 3: Point & Click
By the early ’90s, adventure games had become increasingly graphics-intensive, with simple interactive graphics punctuated by occasional animated or live-video “cut scenes” with higher production values. It’s conceivable that AI video apps like Sora might be capable of generating something comparable quickly and on demand within 2 to 4 years.
Gabriel Knight (1993)
Stage 4: Immersive 3D Environments
By the turn of the 21st century, adventure games started to resemble the 3D “open world” games we know today. While it might require breakthroughs in software or hardware for AI to achieve this on demand, it’s not inconceivable that interactive AI experiences might reach this level within 5 years.
Morrowind (2002)
Enjoying the Moment… Awaiting the Future
So where does that leave us today?
Basically, anyone interested in creating interactive AI content, whether for training or entertainment purposes, can take one of two roads.
The first is to rush into the graphics-heavy future, just as Pseudo.com and iClips rushed into streaming video in the late ’90s. And there are plenty of uncanny “avatar” chatbots out there trying to do exactly that.
The other option is to take a more measured “Netflix” approach, building on what AI technology currently does well in order to lay a solid foundation for the future. Obviously, my own company has opted for this path, on the assumption that once AI graphics and video are ready for prime time, it will be relatively easy to layer them atop the narratives woven by our simulation engines.
In the meantime, text-based simulations are not just placeholders for future technology, but powerful learning tools in their own right. There’s a reason why YouTube is full of videos of people playing old text-based adventure games: just as you can get lost in a great book, interactive text adventures done well remain highly compelling. The future will arrive soon enough: for now, let’s enjoy the wonders that today’s AI has to offer.