Introduction to OpenAI and LLMs – Part 3
My previous blog posts on this topic were Introduction to OpenAI and LLMs, the “what” part (what is OpenAI and what is an LLM), and the “how” part, Introduction to OpenAI and LLMs – Part 2. Those blogs focused on using LLMs on unstructured data, such as emails and documents in formats like .pdf, .docx, or .txt files (think documents that contain free-form text). This blog will focus on using LLMs on semi-structured data, such as files and logs in CSV, Parquet, XML, or JSON formats (often in table format, meaning rows and columns) as well as Excel files; and on structured data, such as relational databases (SQL tables and fields).
First, let’s review some key definitions from the previous two blogs:
- AI: Computational systems and models capable of performing tasks that typically require human intelligence. GenAI and ML are subsets of AI.
- Generative AI (GenAI): AI systems capable of generating new content, such as text, images, or audio. It does this by employing neural networks, a type of machine learning process that is loosely inspired by the way the human brain processes, interprets and learns from information over time.
- Large Language Models (LLMs): A type of GenAI designed for natural language (text) understanding and generation, trained on diverse datasets. Think of it like a super-smart auto-complete on steroids. GenAI uses other specialized models (non-LLMs) for image or video generation.
- Machine learning (ML): A broad subset of AI that encompasses algorithms and models capable of learning from data to make predictions or decisions. LLMs are a type of deep learning within ML, focused on language understanding and generation, whereas ML includes techniques for diverse tasks like image recognition, data analysis, and predictive modeling.
- OpenAI: A leading organization specializing in AI research and development, known for creating GenAI models such as GPT (used in ChatGPT) and other AI technologies.
- ChatGPT, Copilot: Applications (“bots”) built on GenAI models (e.g., GPT). These tools allow users to interact with GenAI via natural language prompts, generating responses tailored to various contexts, such as conversations (ChatGPT) or coding (Copilot).
- Prompt engineering: The practice of designing input prompts to guide LLMs in generating accurate and relevant responses, optimizing the model’s performance for specific tasks.
There are many possible use cases for GenAI on semi-structured and structured data: conversational querying, data enrichment and cleaning, sample data creation, data summarization, trend identification, forecasting and predictions, what-if analysis, anomaly detection and correction, product/service recommendations, mapping fields, MDM, and creating semantic models. In this blog I will focus on the first two: conversational querying, and data enrichment and cleaning.
Let’s first talk about using ChatGPT or Copilot on semi-structured data. Many people aren’t aware that you can use these tools to query and transform data that sits in files such as CSV and Excel simply by uploading those files and entering prompts. The data in those files just needs to be in a table format (rows and columns). The tool will take the uploaded data and create an internal table that you can modify and then save to a file on your computer. Check the features and limits of the version of ChatGPT or Copilot you are using to make sure it supports file uploads (only the enterprise version of Copilot supports file uploads).
As an example using ChatGPT, I can upload a CSV movie file and then ask questions like “Tell me about this file”, “How many movies does it contain?”, and “List the movies with a rating of 10 or above”. After any question, I can ask “Generate the T-SQL query for the last question I just asked”. I can then start to modify the data with prompts like “Change ‘sci-fi’ in the genres field to ‘science fiction’” and “Replace any empty fields with a zero” (this is data cleaning). You can get really sophisticated with prompts such as “Add two more columns to the csv file: leading actor in the movie and leading actress, then populate them by searching publicly available information from Wikipedia or IMDb to find the leading actor and actress for each movie” and “Add a column to the csv file called MovieGross and look up each movie on Wikipedia or IMDb for the total gross at the box office each movie made and populate that column” (this is data enrichment). Just think of all the time it can save you from manually doing those lookups!
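To make the cleaning prompts concrete, here is a minimal pandas sketch of the same two transformations done locally. The column names and sample rows are hypothetical stand-ins for the uploaded movie file, not its actual schema:

```python
import pandas as pd

# A tiny stand-in for an uploaded movie file (column names are hypothetical).
df = pd.DataFrame({
    "Title": ["Alien", "Heat", "Up"],
    "Genre": ["sci-fi", "crime", None],
    "Rating": [8.5, None, 8.3],
})

# "Change 'sci-fi' in the genres field to 'science fiction'" (data cleaning)
df["Genre"] = df["Genre"].replace("sci-fi", "science fiction")

# "Replace any empty fields with a zero" (data cleaning)
df = df.fillna(0)
```

The chatbot is effectively running equivalent operations on its internal table; the difference is that you express them in plain English instead of code.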
Now let’s talk about using Microsoft Fabric AI Skill on structured data. The Microsoft Fabric AI Skill is a feature powered by GenAI that makes data more accessible by enabling users to interact with it using plain English questions. Leveraging LLMs, the AI Skill generates T-SQL queries based on the user’s question and the database schema. When a user poses a question, the system sends the question, along with metadata about the selected data (such as table and column names and data types), to the LLM. The model then generates a T-SQL query to answer the question, ensuring that the query does not alter the data before executing it and returning the results along with the T-SQL. This can be used against a lakehouse or warehouse, and soon also a semantic model or Eventhouse KQL DB.
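The key idea is that only metadata, not the data itself, goes to the LLM. A minimal sketch of what such a system might send is below; the function, schema layout, and prompt wording are my own illustration, not the actual AI Skill internals or API:

```python
def build_sql_prompt(question: str, schema: dict) -> str:
    """Compose an LLM prompt asking for a read-only T-SQL query,
    given only schema metadata (table/column names and types)."""
    lines = []
    for table, columns in schema.items():
        cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
        lines.append(f"TABLE {table} ({cols})")
    schema_text = "\n".join(lines)
    return (
        "You are a T-SQL generator. Using only the schema below, "
        "write a single SELECT statement (no data modification) that "
        f"answers the question.\n\nSchema:\n{schema_text}\n\n"
        f"Question: {question}\nT-SQL:"
    )

# Hypothetical table and columns, for illustration only.
prompt = build_sql_prompt(
    "What are the top-selling products?",
    {"Sales": [("ProductName", "varchar(100)"), ("Amount", "decimal(10,2)")]},
)
```

The returned T-SQL would then be validated as read-only and executed against the lakehouse or warehouse, with the results and the query both returned to the user.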
Configuring the AI Skill is similar to creating a Power BI report. Once set up, it can be shared with colleagues for collaborative data exploration. While the AI Skill often provides accurate answers out of the box, it may need refinement to better understand your organization’s specific context. You can provide additional instructions and example question-query pairs to guide the AI toward more precise responses, making it a powerful tool for intuitive and collaborative data analysis. Note that a Fabric tenant admin has to enable the tenant setting for AI Skill before you can use it. Check out this video to see a demo of AI Skill in action.
You may be asking: what is the difference between AI Skill and Copilot? The technology behind AI Skill and Fabric Copilot is similar, as both use GenAI to analyze and interact with data. However, there are notable differences between the two capabilities. AI Skill offers extensive configuration options, allowing users to customize the AI’s behavior to meet specific needs. Users can provide instructions and examples to guide the AI, ensuring it aligns with their use case. Fabric Copilot, on the other hand, does not provide this level of configuration flexibility. While Copilot assists users in performing tasks within Fabric, such as generating Notebook code or Data Warehouse queries, AI Skill functions independently. AI Skill can also be integrated with other platforms, such as Microsoft Teams, making it a versatile tool for broader applications beyond Fabric. Copilot in Power BI is designed to work within reports or models, focusing on updating and enhancing those reports. In contrast, AI Skill is geared toward handling ad-hoc queries, working directly against a lakehouse or warehouse to generate and execute T-SQL queries and return the results.
So now technology allows us to use GenAI on both structured and unstructured data at the same time. An industry use case that highlights this is a healthcare organization aiming to improve patient outcomes by leveraging GenAI to optimize treatment plans by combining unstructured and structured data. This approach involves analyzing unstructured data, such as doctor notes and lab reports, and linking it with structured data, including diagnostic codes, treatment history, and patient demographics stored in relational databases. By doing so, GenAI can suggest the most effective treatments for patients. This optimization not only improves patient outcomes but also reduces the time required to identify suitable care paths, enabling more efficient and effective healthcare delivery.
Now there are two key architectural approaches to handling data queries and analysis: the traditional method and a GenAI-driven method (discussed in this blog).
The traditional approach involves extracting numbers and text from documents (unstructured data), storing them in a database alongside other structured data, and using SQL to query the information. This method is ideal when the questions to be asked are predefined, as it ensures consistent and accurate results. Tools like Azure AI Document Intelligence can extract data from documents (and then use LLMs to convert from the JSON output to CSV to more easily add to a database), while solutions like Microsoft Fabric AI Skill can use LLMs to convert natural language questions into SQL queries. However, this approach relies heavily on the accuracy of document data extraction and does not use LLMs to enhance answers beyond the scope of SQL queries. It is particularly suited for scenarios requiring high accuracy and consistency, such as financial reporting or operational dashboards. The workflow follows a straightforward pattern: Document → Database → SQL Query → Answer, making it best for accuracy and predefined queries.
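As a concrete illustration of the first step in that pipeline, here is a sketch of flattening extraction output into CSV rows ready for database loading. The JSON shape is a hypothetical simplification; real Azure AI Document Intelligence output is much richer (bounding boxes, confidence scores, and so on), which is why the blog notes an LLM can help with the conversion:

```python
import csv
import io
import json

# Hypothetical, simplified extraction output for two invoices.
extracted = json.loads("""
[
  {"InvoiceId": "INV-001", "Vendor": "Contoso", "Total": "125.40"},
  {"InvoiceId": "INV-002", "Vendor": "Fabrikam", "Total": "89.99"}
]
""")

# Flatten the records into CSV text, ready for a bulk load into a database.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["InvoiceId", "Vendor", "Total"])
writer.writeheader()
writer.writerows(extracted)
csv_text = buffer.getvalue()
```

From there the rows go into a database table, and predefined SQL (or AI Skill-generated SQL) answers the questions.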
The GenAI approach takes a more flexible and exploratory route by feeding various types of data—structured, semi-structured, and unstructured—into an LLM. An application/bot is then used to ask questions on this data, which is all sent to an LLM. This method is beneficial when the questions are not predefined, allowing for broader exploration and discovery. The bot can upload unstructured data (documents), semi-structured data (CSV files), and structured data exported to CSV or pulled directly from a database. However, challenges include ensuring CSV files have sufficient metadata and maintaining accuracy in data extraction when using retrieval-augmented generation (RAG). This approach is particularly useful for applications like customer support bots or summarizing and analyzing large document repositories. The workflow typically follows the pattern: Document/CSV/Database → LLM → Bot Query → Answer, making it best for exploration and leveraging many types of data.
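The bot side of this workflow can be sketched as bundling the different data types into a single LLM message. The function and message format below are illustrative assumptions; the actual API call to the LLM (OpenAI, Azure OpenAI, etc.) is deliberately omitted:

```python
def build_exploration_message(question: str, csv_rows: list[str],
                              doc_snippets: list[str]) -> str:
    """Combine semi-structured rows and unstructured document excerpts
    with a free-form question into one LLM message."""
    return (
        "Context (CSV rows):\n" + "\n".join(csv_rows) +
        "\n\nContext (document excerpts):\n" + "\n".join(doc_snippets) +
        f"\n\nQuestion: {question}"
    )

# Hypothetical data: structured sales rows plus an unstructured note.
message = build_exploration_message(
    "Find anomalies in the data",
    ["Region,Sales", "West,100", "East,9999"],
    ["Q3 notes: East region reported a one-time bulk order."],
)
```

Note how the question is open-ended rather than a predefined query; that flexibility, and the ability to mix data types, is the main advantage of this approach.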
In summary, the traditional method excels in accuracy and predefined queries, while the GenAI approach is ideal for exploration and working with diverse data sources.
One more point about Microsoft Fabric AI Skill: it does not send data to an LLM. Rather, it takes natural language questions, turns them into SQL, runs the SQL against structured data, and returns the results; those results could then be fed to an LLM manually (e.g., by exporting the data to a CSV and sending it to the LLM along with unstructured data to help answer a question). The AI Skill essentially acts as a natural language interface to your structured data. You can argue that AI Skill sending metadata to an LLM to create a SQL statement is as good as sending the actual data to the LLM, with the caveat that you are not able to combine it with unstructured data. Also, SQL only handles specific questions (“What are the top-selling products?”, “What is the average sales growth by region?”), not the more general questions an LLM can answer (“Find anomalies in the data” or “Identify patterns indicative of fraud”). While metadata-driven SQL generation is efficient for many business intelligence use cases, it falls short of the deeper contextual or semantic analysis that LLMs excel at.
I gave a presentation on this topic for the Toronto Data Professional Community; you can view the recording and download the slides.
More info:
Build Chatbot with Large Language Model (LLM) and Azure SQL Database