Grading AI’s Homework: How LLMs Handle Research and Citations
The internet has given everyone access to a wealth of information that was once confined to university libraries and government filing cabinets. At the same time, it has made it possible for anyone to spread misinformation on a mass scale, without any academic or editorial oversight. Together, these trends have created a world where you can find instant answers to almost any question, but a significant percentage of those answers might be rubbish. Generative AI is already amplifying this dynamic to an unprecedented extreme—with some people using AI to churn out immense quantities of dubious content, even as the rest of us turn to AI in search of honest answers.
In this environment, the ability to check facts and cite sources is invaluable for anyone who needs accurate information, from a nurse looking up symptoms of a rare disease to an attorney citing case precedent to concerned citizens verifying the claims of politicians. However, citing sources can be a challenge for generative AI. This isn’t because the AI is malicious or careless, but because AI and humans process information in fundamentally different ways. The very mechanisms that make AI seem near-omniscient on subjects as varied as migratory bird patterns and air conditioner maintenance also make it difficult for AI to know where it found a particular piece of information (or for AI to “know” anything, at least in the way humans define “knowledge”.)
In this article, we’ll explore how AI gets its information, how it can be made more reliable, and why AI might not have to be 100% reliable to still be incredibly useful.
How AIs Approach Text
John W. Campbell, the famous science fiction magazine editor from the 1950s, once challenged an author to “Write me a creature that thinks as well as a man or better than a man, but not like a man.”
When it comes to reading, AI certainly achieves that, processing text in a remarkable yet fundamentally non-human way. For LLMs, “reading between the lines” isn’t just a figure of speech. Instead of reading text one word at a time, they analyze the relationships between key words in a sentence, paragraph, or page all at once. Then they compare these patterns against what the model learned during training, typically from trillions of words of text drawn from the public Internet and other sources.
When an AI processes a sentence like, “In 1903, Gandhi started his law practice in Johannesburg,” it draws on its training data to assess the statistical probabilities of key words like “1903”, “Gandhi”, “law practice” and “Johannesburg” appearing so closely together in various contexts. Based on those patterns, it will likely conclude this passage relates to the biography of Mahatma Gandhi, and not the career of Prime Minister Indira Gandhi or Gandhi Selim Law (a real estate law office in present-day Johannesburg).
But the important thing to note is that the AI isn’t “reading” a million articles about Gandhi to reach this conclusion—at least, not in the human sense. Instead, it’s drawing from the statistical distribution of keywords across an aggregated, interconnected body of knowledge. This makes the task of pinpointing or “citing” a specific source for any text it generates extremely difficult. In a sense, the AI isn’t citing specific sources but is referencing the weighted average of all sources it has been exposed to.
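To make this “statistical patterns” idea a bit more concrete, here is a minimal sketch using Hugging Face’s transformers library and a small masked-language model (a simpler, older cousin of today’s chat models). The specific guesses and scores are purely illustrative and will vary by model; the point is that the model ranks candidate words by how well they fit the surrounding context, not by looking anything up in a source document.

```python
# Minimal illustration of "statistical patterns": a masked-language model
# assigns probabilities to candidate words based on the surrounding context.
# Exact outputs vary by model and are illustrative only.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "In 1903, [MASK] started his law practice in Johannesburg."
for guess in fill_mask(sentence, top_k=5):
    print(f"{guess['token_str']:>12}  probability={guess['score']:.3f}")
```

A chat-style LLM does essentially the same thing at a much larger scale, predicting one token after another based on learned patterns rather than retrieving a stored document.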
Asking an LLM where it got certain information is like asking a human which meal provided the protein to build a specific muscle cell in your bicep. Just as the body processes nutrients from many meals over time, an LLM draws on patterns from a vast body of data without being able to isolate any single source.
Search vs. Generation
When a fundamentally new technology comes along, people will often try to use it for the same things as older, more familiar technologies – without grasping its unique potential. In the case of generative AI, this includes people trying to use it in the same manner as a search engine.
If you ask an AI model, “What does the World Health Organization say about preventing hospital-acquired infections (HAIs)?” it won’t actually go out and search for WHO publications on preventing HAIs. Instead, the AI will generate a response synthesized from patterns of words that commonly occur in passages of text that mention “World Health Organization”, “preventing” and “hospital-acquired infections” in a certain arrangement. And while – 95% of the time – the result will be a concise and eloquent synopsis of medical best practices, it won’t necessarily reflect the verbatim guidance published by the WHO. For people expecting a list of peer-reviewed articles akin to what they’d get from a Google search or from querying the WHO’s document library, that can be a frustrating experience.
To add to the confusion, an AI model might generate random citations, or “citation-like strings of text,” in cases where citations commonly appear in the training data it’s referencing. Occasionally this might serendipitously produce an accurate reference, but it can just as easily produce citations to irrelevant sources, or even a fabricated citation stitched together from journal names, publication years, and author names that frequently appear together, with no real publication behind it.
To compensate for these frustrating tendencies, some AI applications supplement their models with live search engine results (like Bing in ChatGPT’s public-facing chatbot, or Perplexity leveraging Google search), allowing the model to identify articles related to keywords from the user’s input or the AI’s output, then treat them as a heavily weighted data source when generating subsequent messages.
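Under the hood, this kind of search augmentation is a two-step pattern: retrieve first, then generate with the retrieved text placed into the prompt. The sketch below is hypothetical; web_search() and call_llm() are placeholder names rather than any particular vendor’s API, but real implementations follow the same general shape.

```python
# Hypothetical sketch of search-augmented generation. Neither web_search()
# nor call_llm() is a real vendor API; they stand in for whatever search
# service and language model an application actually uses.

def web_search(query: str) -> list[dict]:
    """Return results shaped like {"title": ..., "url": ..., "snippet": ...}."""
    raise NotImplementedError("Plug in a real search API here.")

def call_llm(prompt: str) -> str:
    """Send a prompt to a language model and return its reply."""
    raise NotImplementedError("Plug in a real model API here.")

def answer_with_sources(question: str) -> str:
    results = web_search(question)[:5]          # step 1: retrieve
    sources = "\n".join(
        f"[{i + 1}] {r['title']} ({r['url']}): {r['snippet']}"
        for i, r in enumerate(results)
    )
    prompt = (
        "Answer the question using ONLY the numbered sources below, and cite "
        "them by number. If the sources don't cover it, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
    return call_llm(prompt)                     # step 2: generate
```

Because the model only “sees” whatever the search step hands it, the quality of its answers and citations is only as good as the snippets it receives.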
But while this feature is useful, it still runs into problems – especially if the results include references to articles hidden behind paywalls.
The Paywall Problem
One challenge for improving the quality and accuracy of AI output is that the best-curated sources of information are often inaccessible without a paid subscription. Prominent newspapers like the Washington Post or Wall Street Journal will only expose the first few paragraphs of an article, beyond which users must register for an account to continue reading, and most academic databases will only show the abstract (summary) of a research publication, with a link to pay for the full article (usually $30 to $70, sometimes more).
This creates a situation where most AI models (and most humans) cannot access the most reliable and in-depth data sources. Sometimes, AI models will merely skim the synopses of paywalled articles, which can lead to misrepresentations of their actual contents. In other cases, AI responses will gravitate towards less substantive source material such as Wikipedia and blogs, or – worse – be unduly colored by misinformation.
Improving AI’s Marks
AI providers are acutely aware of all of these challenges, and have been taking steps to improve their models’ ability to access and give attribution to quality sources. A few of these measures include:
- More Granular Training – When AI models are trained, developers “reward” behavior patterns that produce quality output, in much the same manner as a dog trainer giving a Doberman a treat. In the past, this was done solely on the basis of an AI model’s final response to user messages, but now OpenAI and other companies encourage AI models to share their step-by-step reasoning and reward each valid step. This has taken some of the randomness out of the process, including how AI formulates citations.
- Specialized Apps – Perplexity is an AI tool designed specifically for research. While it lacks the versatility of ChatGPT or Claude, it does an excellent job of summarizing top search engine results related to user questions, complete with citations. Basically, it gives people who treat AI like a search engine something closer to what they expect.
- Content Deals – Many AI platforms have been signing deals with owners of paywalled content libraries: OpenAI recently signed a deal to give its models access to magazine publisher Condé Nast’s massive archives (including back issues of Wired, Vogue, Architectural Digest, and more).
However, there’s also quite a bit that AI application developers and end users can do to work around challenges with citing sources and data quality, such as:
- Retrieval Augmented Generation (RAG) – There are many services that allow users to supplement an AI’s pre-trained knowledge with targeted searches of specific documents or databases supplied by the user or developer. So if you want an AI agent to give answers based on your organization’s standard operating procedures or a list of government regulations, taking some time up front to reformat your information in a machine-friendly way can improve results (for instance, adding an explicit “citation” line near the top of each major section; see the sketch after this list).
- Integrating Multiple AI Models – There are “proxy” apps available (like my own company’s app, Parrotbox.ai) which allow users to chain multiple AI models together. For instance, you can create AI interactions where search-oriented tasks are routed to an app like Perplexity while more conversational or reasoning-oriented tasks are directed to a model like ChatGPT or Claude, without the user ever realizing that multiple models are handling the conversation.
- Prompt Engineering – By default, AI wants to give definitive-sounding answers, but instructing it to express uncertainty or qualify statements based on the availability and quality of sources can help users put its responses in perspective. For instance, whenever our company develops an AI copilot / chatbot that references client policy documents, we tell it to say “While my official source documents don’t say anything about [raising yaks], a scan of sources on the Internet suggests…” You can even instruct AI to critique its own output immediately after generating it, by literally asking “How might your last statement be wrong?”
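To show how the RAG and prompt-engineering suggestions above fit together, here is the sketch referenced in the RAG bullet. Everything in it is hypothetical: the section labels and sample text are invented, the retrieval step is a deliberately crude word-overlap score standing in for a real vector search, and call_llm() is a placeholder rather than a real model API.

```python
# Hypothetical RAG sketch: each section carries an explicit "citation" label,
# retrieval is a simple word-overlap score (a stand-in for vector search),
# and the prompt tells the model to cite sections and flag gaps.

SECTIONS = [
    {"citation": "SOP 4.2 - Hand Hygiene",
     "text": "Staff must wash hands before and after patient contact..."},
    {"citation": "SOP 7.1 - Equipment Care",
     "text": "Reusable equipment must be disinfected between uses..."},
]

def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in a real model API here.")

def retrieve(question: str, k: int = 2) -> list[dict]:
    words = set(question.lower().split())
    scored = sorted(
        SECTIONS,
        key=lambda s: len(words & set(s["text"].lower().split())),
        reverse=True,
    )
    return scored[:k]

def answer(question: str) -> str:
    context = "\n\n".join(
        f"[{s['citation']}]\n{s['text']}" for s in retrieve(question)
    )
    prompt = (
        "Answer using ONLY the sections below and cite them by their bracketed "
        "labels. If the sections do not address the question, begin with: "
        "'My official source documents don't cover this.'\n\n"
        f"{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```

The design choice worth copying is the explicit citation label on every section: it gives the model something concrete to quote, and gives the human reviewer something concrete to check.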
The Real Value of AI
However, all of these incremental improvements sort of miss a larger point: the best way to get value from an AI agent is to stop treating generative AI like a search engine and instead focus on what AI does well: synthesizing data and drawing connections across multiple sources. While it’s natural to compare AI to the technologies that directly preceded it (search, auto-correct, etc.) a better way to think of AI is as an extremely knowledgeable colleague who’s available to answer your questions 24/7/365. And just as you would (hopefully) verify the suggestions of a colleague or review the first draft written by a research assistant, using AI as a brainstorming or writing partner—while still involving humans for final validation—is an incredibly powerful approach.
To that end, one area our company has been focusing on is making AI agents that communicate in a casual, conversational style – more like a human colleague and less like a faux encyclopedia entry. While it might seem like a mere stylistic change, the psychological impact on users can be significant, reducing humans’ tendency towards “algorithmic paternalism” – a fancy term for taking whatever a computer tells you at face value.
Conclusion
Hopefully this article helped explain why current AI models have difficulty citing sources, while at the same time putting their real value in perspective. In the years to come, AI models will likely improve their ability to track the sources of their information, and human users will develop a better understanding of AI’s true capabilities, beyond serving as a “talking search engine.”
If your organization is looking to leverage generative AI for workforce training or creating copilots / chatbots based on your knowledge products, please consider reaching out to Sonata Learning for a consultation.