How to use GPT-4 with streaming data for real-time generative AI?

July 11, 2023

By now, just about everyone has tried playing around with ChatGPT, making it do all sorts of wonderful and weird things. But how do you go beyond mere tomfoolery and use it to create a real, production application? Combining ChatGPT's general capabilities with your unique data and needs plays a big part in this.

What do I mean by that? Let me give you an example of a scenario that every company is thinking about right now. Imagine you are an airline, and you want an AI-enabled support agent to help your customers if a human is unavailable.

Your customer may have a question about how much it costs to carry skis on an airplane. If this is a general airline policy, this information is probably available on the Internet and ChatGPT may be able to answer it correctly.

But what about the more personal questions, such as:

  • Is my flight delayed?
  • Can I upgrade to first class?
  • Am I still on the waiting list for tomorrow's flight?

It depends! First of all, who are you? Where and when are you flying to? Which airline did you book your ticket with?

ChatGPT can't help here because it doesn't know the answers to these questions. This is not something that can be "fixed" by innovation at OpenAI. Your personal data is (thankfully) not available on the open Internet, so even a Bing-style integration that connects ChatGPT to the open Internet won't help.

The main obstacle is that the airline (you, in this scenario) must securely provide ChatGPT with timely data from its internal data stores. Surprisingly, how to do this does not follow the standard pattern for machine learning infrastructure. Large language models have flipped the relationship between data engineering and model building. Let me explain.

In traditional machine learning, most of the data engineering work happens at model-building time. You take a specific set of training data and use feature engineering to get the model right. When training is complete, you have a single model that can solve the task at hand, but nothing else. Most of the task-specific intelligence is baked in during training. Because training usually runs in batch mode, the data pipeline is batch-oriented too, drawing from a data lake, data warehouse, or other batch-oriented system.

With large language models, the relationship is reversed. Here, the model is built from a huge shared dataset, and deep learning algorithms perform end-to-end training once, producing a model with broad, reusable capabilities. This is why services like OpenAI and Google can offer functionality on top of pre-trained models rather than rebuilding a model for each task, and why ChatGPT is useful for so many things out of the box. In this paradigm, when you want to teach the model something specific, you do it in every prompt. That means data engineering now has to happen at prompt time, so the data pipeline problem shifts from batch to real time.

What is the best tool for the job? Event streaming is arguably the best fit, because its strength is circulating streams of data around a company in real time.

In this article, I will show how streaming and ChatGPT interact. I'll cover how to create a live support agent, discuss the architecture that makes it work, and point out a few pitfalls.

How ChatGPT works

While there is no shortage of detailed discussions of how ChatGPT works, I'll start by describing just enough of its internals to make sense of the rest of this post.

ChatGPT, or actually the GPT model, is a very large neural network trained on text from the Internet. By training on a huge data set, GPT was able to learn how to talk like a human and appear intelligent.

When you send ChatGPT a prompt, your text is broken into a sequence of tokens that are fed into the neural network. One token at a time, the model works out the next logical thing it should output:

Human: Hello.

AI: How

AI: How can

AI: How can I

AI: How can I help

AI: How can I help you

AI: How can I help you today?
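You can see this token-by-token behavior directly if you call the model with streaming enabled. Here is a minimal sketch using the OpenAI Python client (v1.x); the prompt is just a placeholder:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask for a streamed response so each chunk arrives as it is generated
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)  # e.g. "How", " can", " I", ...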

One of the most interesting aspects of ChatGPT is that it can remember earlier parts of your conversation. For example, if you ask it "What is the capital of Italy?", it correctly answers "Rome". If you then ask "How long ago did it become the capital?", it infers that "it" refers to Rome as the capital, and correctly answers "1871". How is this possible?

ChatGPT has a so-called context window, which is a kind of working memory. Each of the OpenAI models has a different window size, limited by the sum of input and output tokens. When the number of tokens exceeds the window size, the oldest tokens are discarded and ChatGPT "forgets" about them.
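To see how this plays out in an application, here is a minimal sketch of trimming a conversation to fit the window, using the tiktoken library and a hypothetical 8,192-token limit:

import tiktoken

MAX_TOKENS = 8192  # hypothetical limit; each OpenAI model has its own
enc = tiktoken.encoding_for_model("gpt-4")

def trim_history(messages: list[str]) -> list[str]:
    # Drop the oldest messages until the conversation fits the window,
    # mirroring how ChatGPT "forgets" the start of a long chat.
    while sum(len(enc.encode(m)) for m in messages) > MAX_TOKENS:
        messages = messages[1:]
    return messages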

As we will see later, the context window is key to extending ChatGPT's capabilities.

Make GPT-4 understand your business

With a basic understanding of how ChatGPT works, it's easy to see why it can't tell a customer about a flight delay or a possible upgrade to first class: it doesn't know anything about them. So what do you do?

The answer is to work with GPT directly and modify its behavior, rather than going through the high-level ChatGPT interface. For the purposes of this article, I will focus on the GPT-4 model (and will refer to it as GPT for short).

In general, there are two ways to change the behavior of GPT: fine-tuning and retrieval. With fine-tuning, the underlying neural network is retrained on new data, adjusting its weights. OpenAI and others recommend against this approach because it is hard to get the model to memorize data with the accuracy an enterprise application needs. Not to mention that whatever data it is fine-tuned on can become stale almost immediately.

That leaves retrieval. The basic idea is that just before you send a query to GPT, you go to another source of information, look up whatever is relevant, and prepend it to the prompt. You instruct GPT to use that information when answering, essentially supplying your own set of facts to the context window at runtime.

If you were doing this manually, your prompt would look something like this:

You are a friendly airline customer service agent. When answering questions, use only the facts below. If you don't know the answer, say, "I'm sorry, I don't know. Let me get in touch with a person to help," and nothing else.

The client talking to you is named Michael.

Michael booked a ticket on flight 105.

Michael's flying economy class on flight 105.

Flight 105 is scheduled for June 2.

Flight 105 flies from Seattle to Austin.

Michael booked a ticket on flight 210.

Michael is flying economy class on flight 210.

Flight 210 is scheduled for June 10.

Flight 210 flies from Austin to Seattle.

Flight 105 has two seats left in first class.

Flight 210 has 0 seats left in first class.

A customer may upgrade from economy class to first class if there is at least one first class seat left on the flight and the customer is not already a first class passenger on that flight.

If a customer requests an upgrade to first class, confirm which flight they mean.

When you are ready to begin, say, "How can I help you today?"

Compared to fine-tuning, the retrieval-based approach is easier to understand, less error-prone, and better suited to situations that require factual answers. And while it may seem hacky, this is the approach behind some of the best-known AI products, such as GitHub Copilot.
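In code, this amounts to concatenating your facts with the instructions and sending the result as the system message. A minimal sketch with the OpenAI Python client (the facts are hardcoded here only for illustration):

from openai import OpenAI

client = OpenAI()

facts = [
    "The client talking to you is named Michael.",
    "Michael booked a ticket on flight 105.",
    "Flight 105 has two seats left in first class.",
]

system_prompt = (
    "You are a friendly airline customer service agent. "
    "When answering questions, use only the facts below.\n\n"
    + "\n".join(facts)
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Can I upgrade to first class?"},
    ],
)
print(response.choices[0].message.content)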

So, how exactly do you create all of this?

Building a "customer 360" view

Let's set GPT aside for a moment. Before we can build the support agent, we need to accomplish one key task: gathering all of the information relevant to each customer.

Going back to the question of whether a customer can upgrade to first class, remember that the answer depends on many different factors for a particular flight. To have enough context to answer, you need to consolidate data covering:

  • Customer identity
  • The customer's upcoming flight bookings
  • The seat layout of the aircraft assigned to each flight
  • Current capacity of each flight
  • Reward points available for free upgrades

In most companies, this data is spread across many different systems such as databases, data warehouses, SaaS applications, queues, and file systems. Much of this data is not designed for low-latency interactive queries, and none of it is organized in a way that can be easily consolidated. The communication between these systems is point-to-point, making it very difficult to get a unified view of the data.

Event streaming is a good solution for bringing all of these systems together. By capturing streams of changes from each system as they happen, you can build a single view of each customer that is easy to query with low latency.

Confluent connectors make it easy to retrieve data from these isolated systems. Enable a source connector for each of them, and changes will flow into Confluent in real time.

Because these streams usually carry fairly raw information, you'll probably want to process that data into a more useful shape. Stream processing transforms, filters, and aggregates individual streams into a representation better suited to different access patterns. The result will typically land in a relational database, key/value store, or document store.
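As a rough illustration of the consuming side, here is a sketch that folds change events from a few topics into a per-customer view. It assumes the confluent-kafka Python client, and the topic names and event schema are made up:

import json
from collections import defaultdict
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": "customer-360-builder",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["bookings", "flights", "reward-points"])  # hypothetical topics

# In production this view would live in ksqlDB or a real store,
# not an in-memory dict.
customer_360 = defaultdict(dict)

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Merge each change event into that customer's unified view
    customer_360[event["customer_id"]][msg.topic()] = event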

Connecting customer data to GPT

Once all of the customer 360 data is consolidated into a single view, the next step is to programmatically attach that information to each request.

An interaction with the support agent then looks like this:

AI: How can I help you today?

Human: Hello. I would like to upgrade my seat to first class. Can you do that?

AI: Of course, I'd be happy to help you with this. Could you clarify which flight you would like to upgrade to first class on? Is it flight 105 or flight 210?

Human: 105, please.

AI: Great news, Michael. We have 2 first class seats left on flight 105. I can move your economy class seat to first class on this flight. Shall I make this change for you?

Human: Yes.

AI: Very good, Michael. I have successfully upgraded your seat to first class on flight 105. You will now enjoy all the benefits and amenities of our first class. Let me know if there is anything else I can help you with today. Have a safe trip.

Human: Can you also upgrade my seat on flight 210 to first class?

AI: I'm sorry, Michael, but there are currently no first class seats left on flight 210, so I cannot upgrade your seat on that flight. If you need any other assistance, please let me know.

So when the app receives a chat message, all it needs to do is look up the ID of the customer who sent it, fetch that customer's 360 view, and prepend a block of useful facts to each request, similar to the manual prompt I showed earlier.
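A sketch of that handler is below. Here get_customer_360 stands in for a lookup against whatever store holds the consolidated view, AGENT_INSTRUCTIONS is the instruction text from the manual prompt, and client is the OpenAI client from the earlier sketch:

def facts_for(customer_id: str) -> str:
    # get_customer_360 is a hypothetical lookup against the 360 store
    view = get_customer_360(customer_id)
    lines = [f"The client talking to you is named {view['name']}."]
    for b in view["bookings"]:
        lines.append(f"{view['name']} booked a ticket on flight {b['flight']}.")
        lines.append(f"Flight {b['flight']} is scheduled for {b['date']}.")
    return "\n".join(lines)

def handle_chat_message(customer_id: str, text: str) -> str:
    # Prepend the customer's facts to every request, as in the manual prompt
    system_prompt = AGENT_INSTRUCTIONS + "\n\n" + facts_for(customer_id)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content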

Connecting a knowledge base to GPT

This method works well for questions about an individual customer, but what if you want the support agent to have broad knowledge of your business? For example, if a customer asks, "Can I bring an infant on my lap?", that question can't be answered from customer 360 data. Every airline has general policies it wants to communicate to customers, such as the requirement to carry the child's birth certificate.

This kind of information is typically scattered across web pages, internal knowledge base articles, and support tickets. In theory, you could gather all of it and prepend it to every request as described above, but that approach is wasteful. Besides eating up most of the context window, you'd be sending a lot of mostly irrelevant tokens back and forth, driving up your usage costs.

How do you overcome this problem? The answer is embeddings. When someone asks GPT a question, you need to figure out which information relates to it so you can supply it along with the original prompt. Embeddings are a way of mapping information into a "concept space" as vectors of numbers. Fast vector operations can then determine how related any two concepts are.

Okay, but where do these vectors of numbers come from? They come from running the data through the neural network and capturing the values of the neurons in the hidden layers. This works because the neural network is already trained to recognize similarities.
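Concretely, relatedness between two embedding vectors is usually measured with cosine similarity. A quick sketch with numpy (the vectors here are toy 3-dimensional stand-ins; real embeddings have on the order of 1,500 dimensions):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Close to 1.0 means the concepts are related; near 0.0, unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

skis = np.array([0.12, -0.03, 0.88])
snowboard = np.array([0.10, -0.01, 0.85])
print(cosine_similarity(skis, snowboard))  # high score: related concepts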

To compute embeddings, you use the OpenAI embeddings API: send a fragment of text, and you get back a vector of numbers.

curl https://api.openai.com/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "input": "Your text string goes here",
    "model": "text-embedding-ada-002"
  }'

{
  "data": [
    {
      "embedding": [
        -0.006929283495992422,
        -0.005336422007530928,
        ...
        -4.547132266452536e-05,
        -0.024047505110502243
      ],
      "index": 0,
      "object": "embedding"
    }
  ],
  "model": "text-embedding-ada-002",
  "object": "list",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}

Since we're going to create embeddings for all of our policy information, there are going to be a lot of them. So where do we put them? The answer: in a vector database. A vector database specializes in organizing and storing this kind of data. Popular choices include Pinecone, Weaviate, Milvus, and Chroma, and more are popping up all the time.

As a small digression, you might wonder why you shouldn't just use a vector database for everything. Wouldn't it be easier to put the customer 360 data there as well? The problem is that vector database queries retrieve data by distance between embeddings, which is not the easiest thing to debug and tune. In other words, when a customer starts a chat with the support agent, you absolutely want the agent to know the exact set of flights the customer has booked; you don't want to leave that to chance. So in that case, it's better to simply look up the 360 view by customer ID and put the resulting data at the start of the prompt.

If your policies are stored in a vector database, gathering the information you need becomes much easier. Before you send a query to GPT, you create an embedding of the query itself. You then use that embedding to ask the vector database for related information. The result becomes the set of facts added to the prompt, which keeps the context window small because only relevant information is included.
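Here is a sketch of that flow using Chroma, one of the databases mentioned above; the collection name and policy documents are invented for illustration:

import chromadb

chroma = chromadb.Client()
policies = chroma.create_collection("airline-policies")  # hypothetical collection

# Index the policy text; Chroma computes embeddings with its default model
policies.add(
    ids=["infant-policy", "ski-policy"],
    documents=[
        "Infants may travel on a parent's lap with a birth certificate.",
        "Skis are checked as standard baggage for a fee.",
    ],
)

# Embed the customer's question and fetch only the most related facts
results = policies.query(
    query_texts=["Can I bring an infant on my lap?"],
    n_results=1,
)
relevant_facts = results["documents"][0]  # prepend these to the GPT prompt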
