Phil Booth

Existing by coincidence, programming deliberately

Lessons learned from integrating with GPT in production

For the last ten months or so I've worked on integrating GPT (various flavours) with a large production codebase. It's been one of the most chaotic periods of my career, featuring numerous false starts, changes of direction and rapid improvements followed by major setbacks. This is what I've learned.

AI assistants are only as good as their surrounding infrastructure

Adding an LLM to a production codebase is not a shortcut to anything. You still have to figure out how data is queried, how authorisation works, how errors are handled and how 3rd-party services are interacted with.

If those abstractions are clean, reliable and well-tested, it puts you in a strong position. But if any of them are incomplete or unreliable, the issues will be magnified when you throw an LLM into the mix. It pays to make them robust ahead of time, so that you're not faced with hard-to-debug issues when generative AI starts throwing spanners at your system.

Reliability is inversely proportional to team size

The ripple effect from making changes to an LLM-based system can be hard to predict. Every time you edit a prompt or finetune a new model, or even when you're modifying some adjacent functionality, things have a tendency to fan out in surprising ways.

With one person working on it, this can be straightforward to keep track of. You know which requests work and what their expected responses are. You know which ones are being worked on and what the plan is for the future. So you coordinate all that knowledge as you go and progress trends up and to the right.

Adding another person to the mix makes it harder, but it's still doable if you're in constant communication. Three people worked okay for our team too, but when a fourth person was added, it quickly became chaotic. Engineers would frequently report their stuff breaking, and the pace of change made it hard to pinpoint when or where the breakage occurred.

This is nothing new, of course; essentially it's a reframing of Brooks' Law, but LLMs seem to amplify it due to their probabilistic nature. And engineers focused on one specific thing are habitually guilty of excessive optimism in other areas. "This tiny prompt tweak couldn't possibly break anyone else's work" is an easy trap to fall into.

Testing is hard

Writing integration tests for any probabilistic system is tricky. Some flakiness is inevitable, so you must add retries and apply Postel's Law to your assertions.
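As an illustrative sketch, here's what a retrying test with lenient assertions might look like. `ask_assistant` is a hypothetical stand-in for whatever client function you actually use:

```python
import re

def ask_assistant(prompt):
    # Placeholder for a real model call; returns a canned reply here.
    return "Sure! Your order total is $42.00."

def call_with_retries(prompt, attempts=3):
    # Some flakiness is inevitable, so retry a few times before failing.
    last_error = None
    for _ in range(attempts):
        try:
            return ask_assistant(prompt)
        except Exception as error:  # network blips, rate limits, etc.
            last_error = error
    raise last_error

def test_order_total():
    response = call_with_retries("What is the total for order 123?")
    # Postel's Law applied to assertions: check that a dollar amount
    # is present, not that the wording matches exactly.
    assert re.search(r"\$\d+\.\d{2}", response)
```

The lenient regex assertion survives rephrasing between model versions, where an exact string comparison would not.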

You must also consider which tests to write and how many are needed. AI assistants typically handle a great variety of different requests, so you'll want to cover as many as possible in your test suite. But if you're working against a rate limit, you'll also need to balance that against how many you can run before tripping the limit.

This situation is exacerbated if you're using OpenAI, because they apply rate limits across an organisation rather than per API key. Running tests using an API key from the same org you use in production risks DOSing your real users.

Instead you have to navigate past the various obstacles that OpenAI throw in the way of creating a second org for testing: sign up using a different email address (but then you can invite your original email afterwards 🤷) and, if your phone number has been used for two accounts already, using a different phone number too. Your prize for making it that far is a new workspace that doesn't have access to any of the finetuned models you've trained. After you've rectified that, you'll finally be able to run some integration tests but beware that rate limits for this new org will be low and you'll quickly hit them if you're running tests in CI.

It's all a massive pain in the arse and frankly, I don't consider OpenAI suitable for production use right now because of it. It ends up being more reliable to run alternative models on your own infrastructure, even though the GPT models are better.

Write your own abstractions

The generative AI space is filled with open-source libraries and frameworks of questionable value. You will likely have standard procedures in place for logging, metrics, error handling and so on. Writing your own abstraction around the LLM of your choice, to work cohesively with those other components, is not a massive effort.
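A minimal sketch of such a wrapper, with logging, metrics and error handling injected so they stay consistent with the rest of the codebase (all names here are illustrative, not a real API):

```python
import logging
import time

logger = logging.getLogger("assistant")

class LlmClient:
    # Thin home-grown abstraction: the underlying model client and the
    # metrics sink are injected, so swapping models or instrumentation
    # doesn't ripple through the rest of the codebase.
    def __init__(self, complete, record_metric):
        self._complete = complete
        self._record_metric = record_metric

    def generate(self, prompt):
        started = time.monotonic()
        try:
            result = self._complete(prompt)
            self._record_metric("llm.success", time.monotonic() - started)
            return result
        except Exception:
            self._record_metric("llm.failure", time.monotonic() - started)
            logger.exception("llm request failed")
            raise
```

The point is not the code itself, which is trivial, but that your existing conventions apply everywhere the LLM is touched.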

In my case we flipped back-and-forth between a couple of alternative implementations: one that parsed everything up front, determining which actions to use in advance; and one that parsed iteratively, determining the next action to use based on what had come before. We found the second approach worked well for prompts with the base models, but the first approach was ultimately superior in combination with finetuned models.

Separate queries and commands

CQRS is a useful pattern that has nothing to do with LLMs, but it defines a principle that can be helpful to apply.

Broadly speaking, AI assistants handle two kinds of request: queries ("get x") and commands ("do y"). The key distinction is that queries do not modify state and do not have side effects. Queries and commands must be handled differently, so those differences should exist in your code structure too.

Firstly, common to both is authorisation: each user should only be able to access things they have permission for. This is one reason I stressed the importance of surrounding infrastructure earlier. If your system already enforces access control on user sessions, then as long as the assistant uses the user's session for everything it does, you should have no problems. With authorisation in place, queries should be allowed to run autonomously.
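As an illustrative sketch, with made-up `DATABASE` and `get_invoice` stand-ins: the assistant only reaches data through helpers that take the user's session, so your existing access control applies unchanged:

```python
DATABASE = {
    "inv-1": {"owner": "alice", "total": "$42.00"},
}

class Forbidden(Exception):
    pass

def get_invoice(session, invoice_id):
    # Authorisation is enforced here, exactly as it would be for a
    # request that never went near the assistant.
    invoice = DATABASE[invoice_id]
    if invoice["owner"] != session["user_id"]:
        raise Forbidden(invoice_id)
    return invoice

def handle_query(session, invoice_id):
    # Queries have no side effects, so once authorisation passes they
    # can run autonomously, with no human in the loop.
    return get_invoice(session, invoice_id)
```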

Commands should never be allowed to run autonomously though. Because they have side effects, the fallout from potential hallucinations is too risky. Instead you need to implement a feedback loop so a human, in most cases the user, can approve the command. Crucially, the human should be shown all of the relevant context around the command too. For instance, when sending an email they must approve the recipients, the subject, the body and any attachments.

With a feedback loop in place, you can then track these approval rates alongside your other application metrics. If you see that chart deteriorate, you know something is wrong and needs investigating.
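To make that concrete, here's a hypothetical approval loop for an email-sending command: nothing executes until a human has seen the full context and approved it, and approval rates are tracked as a metric (all names are illustrative):

```python
approvals = {"approved": 0, "rejected": 0}

def propose_email(to, subject, body):
    # Surface everything the model generated, not a summary, so the
    # human can actually vet it before it runs.
    return {"action": "send_email", "to": to, "subject": subject, "body": body}

def resolve(command, approved, send):
    # The command only executes after explicit human approval.
    if approved:
        approvals["approved"] += 1
        send(command)
    else:
        approvals["rejected"] += 1

def approval_rate():
    # Feed this into your metrics; a falling rate means the model's
    # proposed commands are getting worse and need investigating.
    total = approvals["approved"] + approvals["rejected"]
    return approvals["approved"] / total if total else None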

Embrace fuzziness

In the event an LLM generates something you can't parse, which is not uncommon, you have three options: fail, retry or fuzzier parsing.

In most cases failing leads to a suboptimal user experience, so it's to be avoided if possible. Short retry loops can be okay, but they add latency and you probably don't want to allow more than two or three iterations. So before you get to that point, it can be useful to have a cleanup function that wrangles the data into something usable before it's parsed.

The implementation of this function depends entirely on what you're doing, of course, but be prepared to handle whatever malformed output your model produces.
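For illustration, assuming the model is supposed to return JSON, a cleanup function might strip markdown code fences, discard chatty preamble around the payload and remove trailing commas before parsing:

```python
import json
import re

def clean_and_parse(raw):
    text = raw.strip()
    # Strip the ```json fences models sometimes wrap their output in.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    # Keep only the outermost object, dropping surrounding prose.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end != -1:
        text = text[start:end + 1]
    # Remove trailing commas before a closing brace or bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)
```

Which cleanups you need depends on your model and prompt; the production logs will tell you.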

I found that by keeping an eye on our production logs, I was able to discover new unhandled edge cases in our cleanup function and gradually improve it. I also discovered that it pays to comment this function liberally. Otherwise it's guaranteed that a helpful person will come along later and remove lots of "unnecessary" code that isn't being used. Except, of course, sometimes it is.

Don't make LLMs a SPOF

Like anything, LLMs can go wrong in lots of ways, so you should consider what happens to your system when they fail and handle it appropriately.

For example, say you have an AI assistant that helps users through your application signup flow. What happens if it hits a rate limit or the service it connects to is unavailable? Make sure you test those scenarios and fail gracefully so users aren't blocked from signing up.
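A sketch of that kind of graceful degradation, using a hypothetical signup hint (names are illustrative): the flow works with or without the assistant, so a model outage never blocks the user.

```python
def assistant_hint(ask, field):
    try:
        return ask(f"Explain the {field} field briefly.")
    except Exception:
        # Rate limit, outage, timeout: degrade gracefully rather than
        # propagating the failure into the signup flow.
        return None

def render_signup_field(ask, field):
    hint = assistant_hint(ask, field)
    label = f"Enter your {field}"
    # The assistant's hint is an enhancement, never a dependency.
    return f"{label} ({hint})" if hint else label
```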

Prompt "engineering" is a lie

I've never felt less like an engineer than when I've been hacking prompts to try and make a thing work. There is some solid advice available (try to keep your prompts short, prefer positive instructions) but also a huge body of pseudo-scientific bollocks out there. If you're using ChatGPT then it's also based on a somewhat faulty premise, because the models are updated periodically and you're not building on a stable foundation. It's a textbook example of programming by coincidence.

If you find yourself struggling to make a prompt work well for multiple use cases, it might be a sign that you need to break it into two or more distinct prompts. Or it might be a sign that you should investigate finetuned models. There's some legwork involved in setting those up; you need to generate a lot of examples to act as training data. But if you do that well, and don't forget you can use an LLM to help, the end result is usually more reliable than trying to craft the perfect prompt for a base model.