Countless people have written about AI in the past few years. They largely break down into two camps: the lovers and the haters. They’re both wrong.
AI is a tool much like a database or a hammer. If you apply it well to solve the right problems, it can be incredibly powerful. Otherwise, it’s like trying to use a hammer to sweep the floor.
This post is not about the power or the dangers of AI. It’s about how to know whether AI is the right solution for the problem at hand.
A quick note: This post is formulated with a focus on large language models (LLMs), because that’s what most people are using right now. It holds equally for other machine learning techniques, though, from random forests to deep learning. These other methods are far cheaper in terms of compute and much faster at inference time, but more expensive to develop, because they require acquiring a large training set, defining features, and implementing a custom predictor. Unlike LLMs, they generally require a fair amount of machine learning expertise to implement correctly. These considerations should color your cost-benefit analysis as you answer the questions below.
- Is this a high-priority problem? AI is expensive to build, maintain, and run, both in development effort and in compute. Use it only for critical problems.
- Is there a simpler or cheaper solution? Make sure to consider rule-based heuristic solutions and other methods before investing in AI.
- What is the tolerance for mistakes and non-determinism? Accuracy varies across AI applications, but it is never 100%. AI shouldn’t be used for problems that require certainty, and it often can’t even guarantee determinism and reproducibility.
- Can I give the model the information it needs to solve the problem? For any question you want an AI model to answer, you need to understand how a human expert would go about answering it. What information would they need to look at, and can you pull that information from somewhere to put it in a prompt for an LLM[^1]? Imagine the model is a new hire on your team, who has the needed domain expertise but knows nothing about your internal workings, data, or processes. Also imagine this new hire has been in a coma for the last couple of years[^2], and knows nothing about recent developments in the field. You need to pull together all the information needed for this person to solve the problem at hand, and fit it into a prompt of limited length[^3].
- Is there a good evaluation set? Suppose you come up with an AI-based solution: can you automatically measure how good the results are, and iterate until they meet your bar? If you’re reliant on painstaking human triage of AI-generated results, the project becomes very expensive.
- Is there a risk of bias? Keep in mind that, like a human, AI can be biased. Models are often trained on data produced by humans[^4], and can propagate the biases built into their training set. The risk of bias differs significantly across applications, so give careful consideration to what biases your training set might contain and what harm a biased model response might cause. Relatedly, consider whether there are other categories of risk associated with your use of AI, such as misleading users or encouraging self-harm. If you’re not certain how to ascertain or mitigate these risks, consider consulting an expert on this one, because the potential for harm is very real.
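The "new hire" framing above can be sketched in code: gather the context a human expert would need, then pack as much of it as fits into a length-limited prompt. This is a minimal sketch, not a production approach; the word budget, the packing strategy, and the framing text are all assumptions you would tune for your own model and task.

```python
# Minimal sketch: fit gathered context into a length-limited prompt.
# MAX_PROMPT_WORDS is an assumed budget; real limits depend on the model.
MAX_PROMPT_WORDS = 3000

def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Pack the question plus as many context chunks as fit the word budget."""
    parts = [
        "You are an expert assistant. Answer using only the context below.",
        f"Question: {question}",
        "Context:",
    ]
    budget = MAX_PROMPT_WORDS - sum(len(p.split()) for p in parts)
    for chunk in context_chunks:
        words = len(chunk.split())
        if words > budget:
            break  # a real system might truncate or summarize instead
        parts.append(chunk)
        budget -= words
    return "\n\n".join(parts)
```

A greedy first-fit like this is the simplest choice; retrieval systems typically rank chunks by relevance first, so the most useful information survives the cutoff.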
Now, if you’ve made it through these six questions and you think AI is the right solution, then you’re well on your way to putting together a prototype and beginning to explore. Pull together your evaluation set, write a script that gathers the needed information for each sample and writes it into a prompt, generate responses from the model, and start computing how good they are. Examine the samples that failed, to understand what other data the model needs or where you may need to be more explicit in the instructions you give the model. Iterate until you reach results that are good enough for your needs. Even if you have a great automated evaluation set, remember to manually spot-check the results occasionally, to make sure everything is as you expect. And remember to check for bias and other harmful outputs: how to do so is a big question beyond the scope of this post, but many others have written about it.
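The prototype loop described above can be sketched as follows. `call_model` is a hypothetical placeholder for whatever LLM client you use, and the substring check is a deliberately crude scoring rule you would replace with a metric suited to your task.

```python
def call_model(prompt: str) -> str:
    # Placeholder: wire up your actual LLM client here.
    raise NotImplementedError

def evaluate(eval_set, model=call_model):
    """Score each (prompt, expected) pair and collect failures for inspection.

    Returns (accuracy, failures), where failures holds the samples to
    examine when deciding what extra data or instructions the model needs.
    """
    failures = []
    correct = 0
    for prompt, expected in eval_set:
        response = model(prompt)
        if expected.lower() in response.lower():  # crude match; refine per task
            correct += 1
        else:
            failures.append((prompt, expected, response))
    accuracy = correct / len(eval_set) if eval_set else 0.0
    return accuracy, failures
```

Each iteration, read through `failures` to see what the model is missing, adjust the prompt or the gathered data, and rerun; and even when accuracy looks good, spot-check some of the passes by hand.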
A colleague recently reached out on Slack with an interesting problem he was thinking of solving with machine learning. I sent him the questions above, and a few Slack messages later it became apparent that, even under optimal conditions, the cost of applying a machine learning solution would far exceed the benefits. I see this as a clear win: A short conversation circumvented months of development and experimentation that would ultimately have led to disappointing results.
I hope you find this framework helpful! Hype aside, AI can be a powerful tool when applied to the right problems.
[^1]: If you’re considering a classical machine learning or deep learning solution rather than an LLM, then instead of pulling the information into a prompt you’d need to pull it into a set of structured features to use as input for the model. In some cases the model may be able to learn complex patterns and make predictions that a human expert cannot, but you still have to think about which data to feed into the model.

[^2]: Large language models are expensive to train, and huge training sets are expensive to collect. Many of the state-of-the-art models used today were trained a couple of years ago, on training sets built back then, and have no knowledge of developments that have happened since.

[^3]: The maximum prompt length depends on the model, but is often on the order of one to four thousand words.

[^4]: For example, LLMs are usually trained at least partially on vast amounts of text from the internet.