Matthew Groff

LLMOps | Testing AI Applications

Introduction

Testing AI applications presents unique challenges and opportunities compared to traditional software testing. The non-deterministic nature of AI outputs, driven by probabilistic behavior, requires a nuanced approach to ensure reliability and effectiveness. This blog post aims to shed light on the distinctive nature of AI application testing, outlining strategies that leverage both deterministic and advanced LLM-based evaluations. We'll explore how these methods, including assert unit tests and evaluations using models like OpenAI's GPT-4, can enhance the accuracy and functionality of your AI models in real-world applications.


AI Application Testing vs. Traditional Testing

The fundamental difference in testing AI applications lies in their output nature. Unlike traditional software that follows predefined rules leading to deterministic outputs, AI models, particularly language models, operate on probability and prediction. This results in non-deterministic outputs where the same input can lead to various correct (and incorrect) outcomes. Therefore, testing these models extends beyond right or wrong, encompassing criteria such as accuracy, quality, consistency, bias, toxicity, and more.
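
As a quick illustration, consistency can be checked by sampling the same prompt several times and asserting a property that every valid output should satisfy, rather than comparing against a single expected string. Here's a minimal sketch using the ai_poem_generator from our previous blog post on modular design, which we'll revisit below:

# test_consistency.py
from ai_models import ai_poem_generator

def test_poem_consistency():
    message = "cats with hats"

    # Outputs will vary in wording between runs, so assert a property
    # that every valid poem should satisfy instead of matching one
    # expected string.
    poems = [ai_poem_generator(message) for _ in range(5)]
    for poem in poems:
        assert "cat" in poem.lower(), "every sampled poem should mention cats"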


Implementing Assert Unit Tests in AI Applications

Assertion unit tests, commonly used in software development, can also be effective in AI application testing, especially when the expected output is fairly deterministic. For example, in testing whether an AI model can perform a specific task or respond to certain prompts, assert tests can verify the presence or absence of expected elements.

Consider this test for the ai_poem_generator from our previous blog post on modular design:

# test_ai_models.py
from ai_models import ai_poem_generator

def test_ai_poem_generator():
    message = "cats with hats"

    # Generate the poem once, since repeated calls may return different outputs
    poem = ai_poem_generator(message)

    # assert that the AI model returns a poem using the words 'cat' and 'hat'
    assert "cat" in poem
    assert "hat" in poem

This test checks for the presence of specific words, but it doesn't fully capture the nuance of evaluating an AI model. Let's enhance it by incorporating some additional criteria:

# Enhanced test for ai_poem_generator
from ai_models import ai_poem_generator

def test_ai_poem_generator():
    message = "cats with hats"
    poem = ai_poem_generator(message)

    # Evaluate context adherence and relevance
    assert "cat" in poem and "hat" in poem, "The poem should include the words 'cat' and 'hat'"

    # Evaluate for bias and toxicity
    assert not any(bad_word in poem for bad_word in ["offensive_term1", "offensive_term2"]), "The poem should not contain toxic or biased language"

    # Additional evaluations can be added as per application requirements
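
To exercise the same assertions across a range of prompts, pytest's parametrize decorator keeps the test small and readable (a sketch; the prompt/keyword pairs here are illustrative):

# Parameterized variant of the same test (run with pytest)
import pytest
from ai_models import ai_poem_generator

@pytest.mark.parametrize("message, keywords", [
    ("cats with hats", ["cat", "hat"]),
    ("dogs on logs", ["dog", "log"]),
])
def test_ai_poem_generator_keywords(message, keywords):
    poem = ai_poem_generator(message)
    for keyword in keywords:
        assert keyword in poem, f"The poem should include the word '{keyword}'"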


Evaluating LLMs with LLMs

Going a step further, evaluating AI models can also involve using another LLM, like OpenAI GPT-4, to assess the quality and relevance of the responses. This advanced method allows for a deeper analysis of the model's output, considering aspects like coherence, relevance to the prompt, and creativity.

Here's an example of how you might use GPT-4 to evaluate a response:

# Evaluation using GPT-4
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def evaluate_with_rubric(prompt_response, eval_markdown_code):
    messages = [
        {
            "role": "system",
            "content": "You are a prompt evaluation expert. You will respond in JSON format with an 'explanation' of why you have given it a grade from 0-100, and 'suggestions' for improving the response in order to get a higher grade, and lastly the 'grade' from 0-100 in integer number format. Use the rubric to accomplish this task."
        },
        {
            "role": "function",
            "name": "grading_rubric",
            "content": eval_markdown_code
        },
        {
            "role": "user",
            "content": f"Grade the following response using the rubric:\n{prompt_response}"
        }
    ]

    response = client.chat.completions.create(
        model='gpt-4-1106-preview',
        messages=messages,
        temperature=0,
        response_format={ "type": "json_object" },
    )

    eval_response = response.choices[0].message.content
    return eval_response
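
Because the system prompt requests JSON and response_format enforces it, the returned string can be parsed directly. Here's a quick usage sketch (the rubric text is purely illustrative):

import json

rubric = """# Poem Grading Rubric
- Mentions the requested subject (40 points)
- Coherent rhyme and rhythm (30 points)
- Creativity (30 points)"""

evaluation = json.loads(evaluate_with_rubric(ai_poem_generator("cats with hats"), rubric))
print(evaluation["grade"], evaluation["suggestions"])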

While there is some debate around the accuracy and consistency of using LLMs to evaluate other LLMs, this approach can provide valuable insights into the quality and relevance of a model's outputs. It's important to note that these evaluations should supplement other testing methods, not stand alone. We have had better results using small, modular tests that rate a single aspect of the output and ask the LLM for a text-based rating like "good" or "bad", rather than asking for a numerical score.
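
For example, a small, single-aspect check along those lines might look like this (a sketch reusing the client from above; the prompt wording is our own, not a standard):

# Sketch: one modular check that rates a single aspect with a
# text-based 'good'/'bad' answer instead of a numerical score.
def is_poem_relevant(poem, topic):
    response = client.chat.completions.create(
        model='gpt-4-1106-preview',
        messages=[
            {
                "role": "system",
                "content": "You judge one thing only: is the poem relevant to the given topic? Respond with exactly one word: 'good' if it is relevant, 'bad' if it is not."
            },
            {
                "role": "user",
                "content": f"Topic: {topic}\nPoem:\n{poem}"
            }
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower() == "good"

# Usage in a test:
# assert is_poem_relevant(ai_poem_generator("cats with hats"), "cats with hats")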


Conclusion

Testing AI applications within the LLMOps framework requires a blend of traditional and innovative approaches. By employing assert unit tests for deterministic aspects and leveraging advanced LLM evaluations for deeper insights, developers can ensure their AI models are not only accurate but also deliver a high-quality user experience. As you venture further into the realm of AI application development, remember that effective testing is key to unlocking the full potential of your AI models. For expert assistance in developing and testing AI applications, reach out to us at hello@umbrage.com. Our experience in building scalable digital products with generative AI can help elevate your projects to new heights.
