One of the most important considerations when building complex AI systems, including LLM-based systems, is evaluating their performance. Evaluation matters because we need to 1) understand how well the model works overall, and 2) understand what impact a recent change had.

In this article, I will cover evaluation of LLMs that already produce good results on their own (like GPT-3.5 or GPT-4) on generic standard benchmarks (like SuperGLUE), when we want to evaluate their performance on the specific task we are building for. That is, we are not evaluating the general performance of an LLM, but the performance of, say, GPT-4 on a task like a chatbot for a bank.

There are three main decisions to make for model evaluation: choose the main metric, create a test set, and decide how to compare the model's outputs to the ground truth.

  1. Choose the main metric. There may be multiple metrics that give an intuition of what is happening in the system, but it is always easier to compare a single metric than several. Often a simple metric like accuracy works, where we count an example as correct if its meaning matches one or more of the expected results. Point 3 covers how to compare an output to the expected result by meaning.

  2. Create a test set: manually, by generating it with an LLM, or both. My preference is the combined option, because writing manual examples builds intuition about the queries and answers we want, and the rest can be quickly generated by the LLM.

    LangChain has a very handy chain, QAGenerateChain, that helps automatically generate questions and answers from example documents.

  3. Decide how to compare the model's results with the ground truth. Numbers are easy to compare, but with the fuzzy nature of text inputs and outputs this becomes a trickier question. The answer, again, is to use an LLM itself to do the comparison.
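For step 1, here is a minimal sketch of the accuracy metric over a graded test set. All the names here (EvalResult, the sample bank questions) are hypothetical; it assumes each example has already been judged correct or incorrect by meaning:

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    query: str
    answer: str
    correct: bool  # whether the answer matched an expected result by meaning

def accuracy(results: list[EvalResult]) -> float:
    """Fraction of examples whose answer matched the ground truth."""
    if not results:
        return 0.0
    return sum(r.correct for r in results) / len(results)

results = [
    EvalResult("What is the overdraft fee?", "$35 per item.", True),
    EvalResult("How do I close my account?", "Visit any branch.", True),
    EvalResult("What is the savings rate?", "I don't know.", False),
]
print(accuracy(results))  # 2 of 3 correct
```

A single number like this makes it easy to see whether a prompt or model change helped or hurt.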
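For step 2, if you are not using LangChain, the same idea can be sketched with any LLM client. The `generate_qa` function, the `llm` callable, and the QUESTION/ANSWER prompt format below are my own illustrative assumptions, loosely mirroring what QAGenerateChain does; a canned stand-in replaces the real model so the sketch runs without an API key:

```python
from typing import Callable

PROMPT = (
    "You are generating test data for a question answering system.\n"
    "Given the document below, write one question a user might ask\n"
    "and its correct answer, in exactly this format:\n"
    "QUESTION: <question>\nANSWER: <answer>\n\nDocument:\n{doc}"
)

def generate_qa(doc: str, llm: Callable[[str], str]) -> dict[str, str]:
    """Ask the LLM for one question/answer pair about `doc` and parse it."""
    reply = llm(PROMPT.format(doc=doc))
    question, answer = "", ""
    for line in reply.splitlines():
        if line.startswith("QUESTION:"):
            question = line[len("QUESTION:"):].strip()
        elif line.startswith("ANSWER:"):
            answer = line[len("ANSWER:"):].strip()
    return {"query": question, "answer": answer}

# Canned LLM stand-in, so the sketch runs offline.
fake_llm = lambda prompt: "QUESTION: What is the wire fee?\nANSWER: $25."
print(generate_qa("Wire transfers cost $25.", fake_llm))
```

Passing the model in as a plain callable keeps the generation logic testable without touching a real API.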
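For step 3, an LLM-as-judge comparison can be sketched like this. The `same_meaning` function, the `llm` callable, and the GRADE prompt wording are assumptions for illustration, not a specific library API:

```python
from typing import Callable

JUDGE_PROMPT = (
    "Do the following two answers have the same meaning?\n"
    "Reply with exactly GRADE: CORRECT or GRADE: INCORRECT.\n\n"
    "Expected answer: {expected}\nModel answer: {actual}"
)

def same_meaning(expected: str, actual: str, llm: Callable[[str], str]) -> bool:
    """Use an LLM to judge semantic equivalence of two free-text answers."""
    verdict = llm(JUDGE_PROMPT.format(expected=expected, actual=actual))
    return "GRADE: CORRECT" in verdict.upper()

# Canned judge stand-in, so the sketch runs offline.
fake_judge = lambda prompt: "GRADE: CORRECT"
print(same_meaning("$35 per item.", "The fee is 35 dollars.", fake_judge))
```

Constraining the judge to a fixed GRADE token makes the verdict trivial to parse and feed into the accuracy metric from step 1.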

Additional metrics to keep track of:

  • Bias and fairness
  • Data and model drift
  • Data quality

If there is enough interest, I can explore these additional concepts and relevant metrics in future articles.


If you like what I write, consider subscribing to my newsletter, where I share weekly practical AI tips, my thoughts on AI, and experiments.


This article reflects my personal views and opinions only, which may differ from those of the companies and employers I am associated with.