SONATAnotes
How Robots Read: The Ultimate Guide to Using Documents as Data Sources for AI
By now most of us have heard the (mostly outdated) stories about AI citing non-existent court cases when asked to write a legal brief or, less amusingly, regurgitating misinformation and hate speech that it read somewhere on the internet. Hence, whenever an organization starts looking into using AI for work purposes, someone in the room will inevitably ask “Can we ‘train’ the AI on our data, to make sure the answers it gives are accurate?”
The short answer to that question is “Yes, you can… and if done properly it should be about as reliable as most human professionals.”
Now, if you’re interested in the long answer…
While the inner workings of today’s AI models are mysterious to most people (including, to some extent, the companies that created them), the following points are fairly common knowledge…
- By default most of the popular AI platforms get their information from the internet.
- Most AI models’ knowledge of the internet doesn’t go beyond the date when the model was initially trained, so their information can be out of date.
- Sometimes this leads AI to make troublesome, disturbing, and/or hilariously inaccurate statements.
- As an alternative, certain AI platforms let you upload documents that the AI can subsequently refer to when answering questions. Beyond that, some platforms even allow AI to do impromptu web searches, for instance the way OpenAI’s chatbot integrates with Microsoft Bing.
- With a bit of effort it’s even possible to integrate AI with online databases and other information sources, such as a company’s product catalog or tech support knowledge base.
All of the above statements are functionally true. However, they fail to answer a basic but important question: how exactly do AI models “read” all of these different data sources, and how do they process that information to answer our questions?
That question has significant practical implications for any organization seeking to use AI on a larger scale. And to answer it we will look at three scary-sounding phrases that this blog will attempt to explain as simply as possible:
- Parametric memory
- Context windows
- Retrieval Augmented Generation (RAG) and “Threading”
Parametric Memory
In 2001, future Academy Award winner Christopher Nolan directed the low-budget crime movie Memento, about a man suffering from “anterograde amnesia,” a rare brain condition that prevents him from forming new memories. And because it’s a crime movie, the last thing the man remembers is the murder of his wife, which he is desperately trying to solve despite being unable to remember any of the clues or suspects. To help him keep track of the mystery, the character tattoos every significant piece of information he discovers on his body, so that he’ll be able to re-read his case notes in the mirror each day.
Weirdly, this is not too different from how generative AI operates. Today’s AI models don’t actually have a “memory,” at least not in the human sense. Instead, they are initially trained on a finite set of pre-indexed information stored in their “parametric memory” (like the detective in Memento being able to remember everything up to the night of his wife’s murder), but otherwise they don’t permanently retain new information.
It’s also worth noting that, when a professional talks about “training” an AI, they are referring specifically to the initial process of establishing the parameters the AI uses to analyze information, which includes populating its “parametric memory.”
This process involves feeding the model a vast quantity of information pulverized into “tokens” (units of content roughly equivalent to words, though they also include things like punctuation and prefixes/suffixes), then having the model study the patterns in that information until it is able to make predictions about how certain groups of words relate to each other (and we’re talking about trillions of words of input to create something on par with ChatGPT or Gemini).
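To get a rough feel for what tokenization looks like in practice, here is a minimal sketch using OpenAI’s open-source tiktoken library (the encoding name is just one example; different models use different tokenizers):

```python
# Minimal tokenization sketch using OpenAI's open-source "tiktoken" library.
# The encoding name is an example; different models use different tokenizers.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Unbelievably, punctuation and prefixes count too!"
token_ids = enc.encode(text)

print(len(text.split()), "words")   # 6 words
print(len(token_ids), "tokens")     # usually more tokens than words
print([enc.decode([t]) for t in token_ids])  # the individual sub-word pieces and punctuation
```

Run something like this on your own text and you will generally see more tokens than words, which is why providers quote context limits and prices in tokens rather than in words.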
The process also involves stages of “supervised learning,” where humans or other machines compare the AI model’s outputs against known correct answers (e.g., “If you see the phrase ‘one plus one’, then the next part of the conversation will usually be ‘equals two’”), then apply “reinforcement learning,” a sort of virtual system of rewards and punishments that helps guide the model’s development. Between the processing power consumed and the labor hours involved, this can cost millions if not billions of dollars and take up to a year, even for large, well-funded corporations.
By contrast, when a layperson talks about “training” an existing AI model on the contents of a document, that’s not training in the technical sense. While it is possible to re-train or “fine-tune” an already-trained model to a limited extent, that’s different from simply handing the AI a document to reference.
Of course, all of this raises the question: if AI doesn’t remember anything past the day it was initially trained, how can it hold conversations? To answer that, we need to talk about context.
Context Windows
Context is where AI behavior starts resembling the amnesiac protagonist from Memento.
Because AI has no persistent memory, every time you type a message it “wakes up” again as a blank slate and re-reads the entire conversation history up to that point from the beginning, just like Memento’s protagonist studying his tattoos. Then, based on that reading, the AI tries to predict what should come next in the conversation.
The running transcript of a conversation that an AI references is what professionals call context, and the maximum length of context that an AI can successfully refer to when responding to a user is known as the model’s context window. While early AI models could only hold a few thousand “words” (tokens) of context, more recent models can handle 100,000+ tokens, or up to a million in the case of Google’s Gemini models, meaning your conversation can run the length of several long novels before the system runs out of space.
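To make that “blank slate” behavior concrete, here is a simplified sketch of what a chat interface does behind the scenes, using OpenAI’s Python SDK (the model name is just an example): the full transcript is re-sent with every single request, because the model itself retains nothing between calls.

```python
# Simplified chat loop: the model keeps no state between calls,
# so the entire transcript (the "context") is re-sent on every request.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment
history = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_message = input("You: ")
    history.append({"role": "user", "content": user_message})

    # Every call includes the WHOLE history so far.
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # example model name
        messages=history,
    )
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    print("AI:", reply)
```

Nothing in this loop “remembers” anything; the growing history list is doing all the work.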
However, even if your token count doesn’t exceed the context window, the length still matters for a few reasons.
First, AI has some of the same challenges humans have paying attention in school. Quite literally, an AI’s “attention mechanism” (the set of algorithms it uses to determine which portions of a piece of text are relevant to the conversation at hand) causes it to pay more attention to information at the beginning and end of a long passage of text and to struggle with making sense of information in the middle. This is interesting for two reasons: first, it mirrors a similar tendency in humans (the “serial-position effect” described by 19th-century psychologist Hermann Ebbinghaus); second, really long 1,000,000+ token contexts exaggerate that emphasis on material at the beginning and end, and the de-emphasis of everything in between, which can hurt the AI’s performance and the quality of its answers.
The second, more pragmatic, issue is cost: if you’re accessing models like ChatGPT through their API from an outside platform, OpenAI charges you for the volume of text the model must re-read on every pass.
To give a functionally true (if technically slightly inaccurate) example, let’s pretend tokens and words are the same and you ask an AI “Can you recommend a French pastry?” to which it replies “Pain au chocolat”: congratulations, you’ve been billed for 9 words. Then if you ask “What is pain au chocolat?” and the AI says “Basically brioche bread filled with chocolate. It’s yummy.” then you get billed for the first 9 words again (because it had to re-read them) plus the 13 new words, for 22 words total. Then if you say “Sorry, but I don’t like chocolate.” and the AI says “That’s unfortunate” you get hit for 22 + 8, so 30.
So your total bill for the conversation thus far is 9 for the first exchange, 22 for the second exchange, and 30 for the third exchange… so 61 words (or whatever that translates to in tokens; probably around 80) and counting! And, because of AI’s “serial-position effect,” the longer the conversation gets, the worse the AI is going to perform (though it can still get pretty long, tens of thousands of words, before you’ll notice any goofiness).
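If you want to sanity-check that arithmetic, here is a tiny sketch that totals the pastry conversation the same simplified way, counting words instead of tokens and billing the whole transcript on every turn:

```python
# Tally the "bill" for the pastry conversation, counting words instead of tokens.
# Each exchange is billed for the whole transcript so far plus the new words.
exchanges = [
    ("Can you recommend a French pastry?", "Pain au chocolat"),
    ("What is pain au chocolat?", "Basically brioche bread filled with chocolate. It's yummy."),
    ("Sorry, but I don't like chocolate.", "That's unfortunate"),
]

transcript_words = 0
total_billed = 0
for question, answer in exchanges:
    transcript_words += len(question.split()) + len(answer.split())  # transcript keeps growing
    total_billed += transcript_words                                 # ...and is billed in full each turn

print(total_billed)  # 9 + 22 + 30 = 61 words billed so far
```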
Documents, RAG, and Threading
So what happens when you click the paperclip button in ChatGPT’s chatbot interface and attach a PDF with the performance specifications for various motorcycles, or the latest draft of your company’s employee handbook?
Well, that depends… There are two basic approaches AI platforms like ChatGPT can take to “read” a document (plus a hybrid of the two, which we’ll get to).
Method 1 – “Retrieval Augmented Generation” (RAG)
With Retrieval Augmented Generation (commonly referred to as “RAG”, rhymes with “bag”) a document is reduced to plain text then uploaded to a database with its own separate “search” AI. Then, every time the user types a message, that second AI does a search of the text in its database and forwards the top N results to the “conversational AI” to reference when responding to you. It’s as if a politician had an aide sitting off to the side during a press conference, Googling every question the reporters ask then handing the top results to the politician to inform their response.
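Here is a deliberately simplified sketch of that flow. Production RAG pipelines usually chunk the document and use embedding-based vector search; to keep the example runnable without API keys, scikit-learn’s TF-IDF similarity stands in for the “search” AI, the file name is hypothetical, and the top-3 snippet count is arbitrary:

```python
# Simplified RAG flow: a "search" step picks the most relevant chunks of the
# document, and only those chunks are handed to the conversational model.
# TF-IDF is a stand-in for the embedding-based vector search a real system would use.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. "Upload" the document: reduce it to plain text, split it into chunks, index them.
document = open("employee_handbook.txt").read()   # hypothetical file
chunks = [c.strip() for c in document.split("\n\n") if c.strip()]

vectorizer = TfidfVectorizer()
chunk_vectors = vectorizer.fit_transform(chunks)

def retrieve(question: str, top_n: int = 3) -> list[str]:
    """Return the top-N chunks most similar to the user's question."""
    question_vector = vectorizer.transform([question])
    scores = cosine_similarity(question_vector, chunk_vectors)[0]
    best = scores.argsort()[::-1][:top_n]
    return [chunks[i] for i in best]

# 2. At question time, only the retrieved snippets go into the prompt, not the whole document.
question = "How many vacation days do new employees get?"
snippets = retrieve(question)
prompt = ("Answer using only these excerpts:\n\n" + "\n---\n".join(snippets)
          + "\n\nQuestion: " + question)
# `prompt` would then be sent to the conversational model.
```

The key point is that only those few snippets, not the whole handbook, travel to the conversational model with each question.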
The advantage of this approach is that it avoids having to shove the entire text of the document into the conversation’s context window, keeping the token count down (which is good for saving money) and avoiding the dreaded serial-position effect from overly long information sources (which is good for performance).
The downside of RAG is that the conversational AI doesn’t actually have the full text of the source document available – just the top N snippets. So it’s kind of useless if you want to summarize an entire document or compare and contrast the entire contents of two or more different documents.
Method 2 – “Threading”
Threading is basically the more-is-more approach to accessing the content of documents, and it’s what ChatGPT does every time you add a file attachment to a message in a conversation. The AI shoves the full text of the entire document into the context window, like Andy Kaufman reading the full text of The Great Gatsby at one of his comedy shows (the joke being he’d actually stand there and read all of it, even if it took until 4 a.m. the next day). The AI will then re-read the entire text of the document before answering each of your subsequent questions.
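In API terms, the threading move really is that blunt: the file gets reduced to text and pasted into the prompt alongside your question. A minimal sketch (file and model names are just examples):

```python
# Naive "stuff the whole document into the context window" approach.
# File name and model name are examples only.
from openai import OpenAI

client = OpenAI()
full_text = open("MasteringArtFrenchCooking.txt").read()   # the ENTIRE book as plain text

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Here is a document:\n\n" + full_text
                   + "\n\nDo you think the recipes in this book are any good?",
    }],
)
print(response.choices[0].message.content)
```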
Obviously, this bloats the word/token count really, really fast. To show just how much, let’s pretend you really wanted it to reference the full text of Julia Child’s “Mastering the Art of French Cooking” (both volumes) when answering your questions. The chatbot might shove the entire book into the conversation (while hiding the injected text from you in the interface), so the exchange would look something like this:
You: “Can you recommend a French pastry?” AI: “Pain au chocolat” [9 words]
You: “What is pain au chocolat?” AI: “Basically brioche bread filled with chocolate. It’s yummy.” [22 words]
You: “Sorry, but I don’t like chocolate.” AI: “That’s unfortunate.” [30 words]
You: “Do you think the recipes in this book are any good?” [+Attach MasteringArtFrenchCooking.pdf] AI: “Yeah, it seems to have all the greatest hits of French cooking that would be familiar to American audiences.” [30 + 328,000 + 30 = 328,060 words]
Now, even if ChatGPT could handle that many words in a single conversation (the current version can’t), it would be ludicrously expensive to reference the entire book on every message going forward. So what ChatGPT does is move the full text of the book out of the conversational record and into a separate “thread” until it’s needed again (or into multiple threads tagged with keywords, so it can be more selective about which portions of the document it accesses in the future). So your conversation would be more like…
You: “Can you recommend a French pastry?” AI: “Pain au chocolat” [9 words]
You: “What is pain au chocolat?” AI: “Basically brioche bread filled with chocolate. It’s yummy.” [22 words]
You: “Sorry, but I don’t like chocolate.” AI: “That’s unfortunate.” [30 words]
You: “Do you think the recipes in this book are any good?” [+Attach MasteringArtFrenchCooking.pdf] AI: “Yeah, it seems to have all the greatest hits of French cooking that would be familiar to American audiences.” [30 + 328,000 + 30 = 328,060 words]
You: “I don’t know much about French cuisine.” AI: “Well, then, this book would be a good place to start.” [30 + |full text removed| + 30 + 18 = 78 words]
You: “Can you summarize all the recipes in the book and reorganize them by primary ingredient?” AI: [re-injecting full text of book] “Sure, we have: Chicken dishes, these include…” [78 words + 328,000 words (full text again) + the 15-word question + let’s say the summary is 500 words = 328,593 words]
So, add up all those messages and we can expect a bill for… 656,792 words, total.
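ChatGPT’s internal threading logic isn’t public, but the bookkeeping described above might look roughly like this sketch, where the bulky document lives outside the running transcript and is spliced back in only when a question actually needs the full text:

```python
# Illustrative sketch only: ChatGPT's real threading implementation isn't public.
# Idea: keep the bulky document in a side "thread" and splice it into the
# context only for the questions that need it.
threads = {"french_cooking_book": open("MasteringArtFrenchCooking.txt").read()}

history = []  # the normal, lightweight conversation transcript

def build_messages(question: str, needs_full_text: bool) -> list[dict]:
    """Assemble the messages for one model call."""
    messages = list(history)
    if needs_full_text:
        # Re-inject the whole book just for this call; it is never stored in
        # `history`, so later turns stay cheap again.
        messages.append({"role": "system", "content": threads["french_cooking_book"]})
    messages.append({"role": "user", "content": question})
    return messages

# A narrow question can ride on the lightweight transcript alone...
build_messages("I don't know much about French cuisine.", needs_full_text=False)

# ...while a "summarize everything" question pulls the full text back in for one turn.
build_messages("Can you summarize all the recipes in the book?", needs_full_text=True)
```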
Method 3 – Combining Context + RAG
Ultimately, any reasonable strategy for navigating massive volumes of text will involve a combination of context threading and RAG.
OpenAI claims its platform will try RAG first, for efficiency, before resorting to pulling the full text of a document into context. So, if you handed ChatGPT a more reasonably sized 25,000-word issue of the Bulletin of the World Health Organization and asked “Is there anything in this issue about lung disease?” then the AI would perform a RAG search based on that question, import the handful of snippets it found referencing lung disease, and reply “Yes, lung cancer is discussed as a leading cause of death for Scottish and Bangladeshi populations in the UK, and mentioned in another article about testing on genetically modified mice.” That entire response is based on a couple of snippets totaling 500 words or so, so it doesn’t add much to the conversational record.
But if you said “Give me a 200 word synopsis of every article in the issue” then the AI would have no choice but to load the entire 25,000-word issue into a thread, read it, and summarize it. Your conversation record would see a +25,000-word spike, but ideally it would go back to more normal word counts until you asked a similarly comprehensive question again.
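One way to express that “RAG first, full text only when necessary” logic in code, assuming you already have a retrieve() function like the TF-IDF sketch earlier; the keyword test for spotting “comprehensive” questions is a crude placeholder (a real system might ask a small model to classify the question instead):

```python
# Sketch of a "RAG first, full text as fallback" router.
# The keyword test below is a crude placeholder for deciding whether a
# question needs the whole document or just a few retrieved snippets.
from typing import Callable

COMPREHENSIVE_HINTS = ("summarize", "synopsis", "every article", "all the", "compare")

def build_document_context(question: str,
                           full_text: str,
                           retrieve: Callable[[str], list[str]]) -> str:
    """Return the document text to send along with this question."""
    if any(hint in question.lower() for hint in COMPREHENSIVE_HINTS):
        # Questions about the whole document need the whole document.
        return full_text
    # Otherwise a handful of retrieved snippets is enough, and far cheaper.
    return "\n---\n".join(retrieve(question))

# `retrieve` could be the TF-IDF function sketched earlier, or any vector-database
# lookup that returns the top snippets for a question.
```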
Similarly, my own company, Sonata Learning, uses a combination of context and RAG for pulling data into our AI-generated interactive training simulations. For example, in the doctor-patient interview simulation below, the AI is given a table of ICD-10 medical billing codes (the codes insurance companies use to tag patient symptoms and conditions in medical records) sorted into “highly common,” “less common,” and “rare” categories; then, when a patient in a simulation has a particular condition, the AI goes out and fetches the full data on the condition associated with that ICD-10 code. And we did a similar thing with the US Food & Drug Administration’s food safety regulations for the food handling quiz generator below.
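As a rough, hypothetical illustration of that pattern (the groupings and details below are placeholders, not our actual data), the compact frequency-sorted table rides along in the prompt, and the full record for a given code is fetched only when a simulated patient actually presents with that condition:

```python
# Hypothetical illustration of the "compact table in context, full details on demand" pattern.
# The groupings and details are placeholders, not actual clinical or product data.
CODE_TABLE = {
    "highly common": ["J06.9", "I10", "E11.9"],
    "less common":   ["K21.0", "M54.5"],
    "rare":          ["A48.1"],
}

def fetch_condition_details(icd10_code: str) -> dict:
    """Stand-in for a lookup against a full ICD-10 dataset or API."""
    return {"code": icd10_code, "description": "...full clinical details would go here..."}

# The small table is cheap enough to keep in context; the bulky detail record
# is pulled in only when the simulation needs it.
patient_code = CODE_TABLE["highly common"][0]
details = fetch_condition_details(patient_code)
```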
Conclusion
So what does all this mean?
Well, if you’re just using the “regular” ChatGPT chatbot to assist with your next college term paper, or if you want to share a simple Q&A prompt free of charge with a couple of colleagues via the “Assistants” feature, then you can probably just attach the file in the chatbot interface and let OpenAI worry about the rest.
However, if you want to go beyond personal / limited use and create AI apps at scale (like copilots to advise all of the physiotherapists in a healthcare network or sales training role-plays that reference the specifications of your company’s entire product line plus those of your competitors), then it’s worth taking some time to work out a strategy for handling documents that balances performance with cost-effectiveness.
Hopefully this article provided some useful “context” (pun intended) for how AI deals with documents and other data sources. And if you need assistance creating AI solutions for training or on-the-job performance support, please consider reaching out to my company, Sonata Learning, for a consultation.