Introduction to OpenAI and LLMs
I focus most of my blog posts on the data platform and how companies can make better business decisions using queries/reports/dashboards on structured data (think SQL tables), but I’m seeing more and more customers interested in OpenAI and how they can make better business decisions using OpenAI on unstructured data (think text in documents). And they want to know if it is possible to use OpenAI on structured data? This is my first blog in a three-part series on the topic.
This first blog will focus on using OpenAI on unstructured data, where the ideal solution is a bot, like ChatGPT, that is used to ask questions on documents from your company.
First I want to explain in layman’s terms what OpenAI, ChatGPT and Bing’s Copilot are. ChatGPT and Copilot are basically bots that work with what are called Generative AI models, commonly known as Large Language Models or LLMs, that were “trained” on multiple data sets that include essentially the entire web as well as millions of digital books. The models were built using OpenAI’s technology (OpenAI is a leading artificial intelligence research lab that focuses on developing advanced AI technologies). So these models are very smart! Sitting on top of these LLM’s are “bots” that allow you to ask questions (via prompts), and the bot returns answers using the LLM. The more details you put in the question, the better the answer will be (this technique is called prompt engineering – designing prompts for LLMs that improves accuracy and relevancy in responses, optimizing the performance of the model). An example question would be “What are the best cities in the USA?”, and the LLM would return an answer based on all the websites, blog posts, Reedit posts, books, etc. that it found that talked about the best USA cities.
But what if you wanted to ask questions on data sets that the LLM’s did not use for its training, such as PDF’s that your company has that are not available publicly (not on their website) and thus not used when the LLM was created? For example, maybe you are a company that makes refrigerators and have a bunch of material such as refrigerator user guides, model specifications, repair manuals, customer problems and solutions, etc. And you would like to improve the answers from the LLM by using the text in those documents and have a bot on top of it so that customers can ask questions of that material. Just think of the improved customer service as customers would not need to talk to a customer service person and can get quick, accurate answers about features of a refrigerator, as well as get quick answers to fix problems they are having with their refrigerator.
LLMs, when it comes to using them in a real-world production scenario, have some limitations, mainly due to the fact that they can answer questions related only to the data they were trained on (called the base model or pre-trained LLM). This means that they do not know facts that happened after their date of training, and they do not have access to data protected by firewalls or not accessible to the internet. So how do you get LLMs to also use PDF’s from your company? There are two approaches that can be used to supplement the base model: further training of the base model with new data, called fine-tuning, or RAG (retrieval augmented generation) which uses prompt engineering to supplement or guide the model in real time.
Let’s first talk about RAG. RAG supplements the base model by providing the LLM with the relevant and freshest data to answer a user question by injecting the new information through the prompt. This means RAG works with pre-trained LLMs and your own data to generate responses. Your own data can be PDF documents. Think of it is you are making the LLM smarter by providing it with additional information in your prompt.
A system that implements the RAG pattern has in its architecture a knowledge base that hosts the validated docs (usually private data) on which the model should base its answer on. Each time a user question comes to the system the following steps happen:
- Information Retrieval: The user question is converted into a query to search into the knowledge base for relevant docs, which are your private docs such as the previously mentioned refrigerator user guides. A search index is commonly used to optimize the search process. Compare this to having to send ALL the docs to the LLM to see the benefit of finding only the relevant docs
- Prompt Engineering: The matching docs are combined with the user question and a system message and injected into the pre-trained LLM. The system message contains instructions that guides the LLM in generating the desired output, such as “the user is a 5th grader” so its answer will be more simple to understand
- LLM Generation: The LLM, trained on a massive dataset of text, generates text based on the prompt and the retrieved information from the model
- Output Response: The generated text is then presented to the user, written in natural language, providing them with insights and assistance based on their private docs
Note that you can choose to have user questions answered only with the knowledge base of private docs, or also with the text that was used to train the LLM (“the internet”). For example, if a user question is for an older refrigerator model that is not part of the private docs, you can decide to return an answer of “not found”, or you can choose to search the pre-trained LLM and return what is found from the public information. You can also choose to combine the two: for example, if the user question is for a model you have in your private docs, you can return information from the private docs and combine it with public information to give a more detailed answer, perhaps with the public information giving customer reviews that the private docs do not have (the system message is used to indicate you wish to combine the two).
The other approach, fine-tuning, enhances an existing pre-trained LLM using example data, like your refrigerator user guides (a domain-specific dataset). This results in a new “custom” LLM, or fine-tuned LLM, that has been optimized for the provided example data. The main issue with fine-tuning is the time and cost it takes to enhance (“retrain”) the LLM, and it will still only have information from when it was last retrained, as opposed to RAG that has “real-time” information. But it could be a good idea to use fine-tuning if you have tons of documents where questions are asked on most of them – so you are not constantly passing in documents to the LLM using RAG.
When deciding between RAG and fine-tuning, it’s essential to consider the distinct advantages each offers. RAG, by leveraging existing models to intelligently process new inputs through prompts, facilitates in-context learning without the significant costs associated with fine-tuning. This approach allows businesses to precisely tailor their solutions, maintaining data relevance and optimizing expenses. In contrast, fine-tuning enables models to adapt specifically to new domains, markedly enhancing their performance (returning quicker answers) and providing more accurate answers, but often at a higher cost due to the extensive resources required to retrain the model. Employing RAG enables companies to harness the analytical capabilities of LLMs to interpret and respond to novel information efficiently, supporting the periodic incorporation of fresh data into the model’s framework without undergoing the fine-tuning process. This strategy simplifies the integration and maintenance of LLMs in business settings, effectively balancing performance improvement with cost efficiency.
Bing’s Copilot is using RAG to give you the most updated answers to your questions (by scraping web pages so it is getting real-time information), as opposed to rebuilding the LLM using fine-tuning, which would take tons of hours and just be impractical to do each day, and still also lag behind real-time. Microsoft’s Copilot in its Office 365 products also uses RAG on your data (PowerPoint, Word files, etc.) – see How Microsoft Copilot Incorporates Private Enterprise Data. Think of Office 365 Copilot as a customized bot for a specific purpose (working with Office 365 files).
Two popular choices for building RAG are Azure AI studio and Microsoft Copilot Studio – see Building your own copilot with low-code approach: a comparison between Azure AI Studio and Microsoft Copilot Studio.
Now that you understand the “what”, what is OpenAI and LLM, the next blog post talks about the “how” part (how to use OpenAI on your own unstructured data via Azure OpenAI On Your Data), and the third blog post will be about using OpenAI on structured data such as SQL tables or semi-structured data such as CSV/Excel files (or on both semi-structured data and structured data at the same time).
More info:
What is Retrieval Augmented Generation (RAG)?
Retrieval Augmented Generation using Azure Machine Learning prompt flow (preview)
Full Fine-Tuning, PEFT, Prompt Engineering, and RAG: Which One Is Right for You?
RAG vs. fine-tuning: A comparison of two techniques for enhancing LLMs
Building your own copilot – yes, but how? (Part 1 of 2)
How Microsoft 365 Copilot works
The Fashionable Truth About AI
When to use Azure OpenAI fine-tuning
Generative AI Defined: How It Works, Benefits, and Limitations
Good introduction James. One challenge my team spent time and still struggling to some extent while implementing RAG using azure open ai ( using azure ai search service as knowledge base ) is quality of responses.
Usual issues found were
1. Curtailed context in responses
2. Inconsistency of results for same queries
I understand part of these problems can be solved by playing around configuration parameters of LLM and system message etc.
It will be great if down the line on series of these blogs on LLM/RAG you could have one on tuning RAG for quality responses.
Thanks for the post.
I am a follower of you content for don’t know how long. Thanks for all the knowledge.
Hi Vijender…great idea on a blog on tuning RAG. I will add it to my list of blogs to write. Thanks!