How to Use a Custom Grading Criteria to Evaluate LLM Responses (LLM-as-a-Judge)

In the rapidly evolving field of language models, ensuring the accuracy and relevance of responses is crucial. This blog post will guide you through setting up custom grading criteria to evaluate responses from large language models (LLMs), using a simple conditional evaluation system.

What is it?

A custom grading criteria is a rule used to evaluate the responses of language models against a predefined condition.
It operates on a simple principle: if the response meets a certain condition X, it fails; otherwise, it passes.
The evaluation is wrapped inside a Chain-of-Thought (CoT) prompt, so the output is a structured JSON object containing the pass/fail status along with a reason.
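For instance, a single evaluation result might look roughly like the dictionary below. The exact field names depend on the evaluator, so treat this as an illustration of the shape rather than the actual schema.

# Illustrative shape of one evaluation result (field names are assumptions)
result = {
    "passed": False,
    "reason": "The response says it cannot answer the query.",
}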

Why do you need it?

For developers working with LLMs, ensuring that the model's responses meet specific standards or criteria is essential.
This tool is particularly useful for applications where responses need to adhere strictly to certain guidelines or quality standards. It simplifies the process of assessing whether the responses from an LLM are adequate, based on the conditions you define.
 
Some examples:
“If the response contains a financial figure, then fail. Otherwise pass”
“If the response contains a phone number, then fail. Otherwise pass”
“If the response says something like I don’t know, then fail. Otherwise pass”
“If the response claims to have taken an action, then fail. Otherwise pass”
“If the response mentions a refund, then fail. Otherwise pass”
“If the response says to contact support, then fail. Otherwise pass”
 

How it works

The evaluator wraps the custom grading criteria inside a CoT prompt.
It checks whether the response from the LLM meets the specified condition.
If a response meets the failure condition, it is marked as a fail and the reason for the failure is recorded; otherwise it passes. This system is particularly effective for straightforward, conditional evaluations.
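To make the mechanics concrete, the sketch below shows one way such a wrapper could be written by hand. This is not Athina's actual prompt or implementation; the prompt wording, the evaluate_with_criteria helper, and the JSON field names are assumptions for illustration, using the OpenAI chat completions API.

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def evaluate_with_criteria(grading_criteria: str, response: str) -> dict:
    # Wrap the grading criteria and the response in a CoT-style prompt that
    # asks the model to reason first, then return a JSON verdict.
    prompt = (
        f"You are grading an AI response against this rule:\n{grading_criteria}\n\n"
        f"Response to grade:\n{response}\n\n"
        "Think step by step about whether the rule's failure condition is met, then "
        'reply with JSON only, in the form {"passed": true or false, "reason": "..."}.'
    )
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(completion.choices[0].message.content)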
 

Tutorial

Step 1: Set Up Your Environment

First, import the necessary libraries and set up your environment variables. Ensure that your API keys for OpenAI and Athina are loaded correctly.
import os

import pandas as pd
from dotenv import load_dotenv

from athina.evals import GradingCriteria
from athina.keys import AthinaApiKey, OpenAiApiKey
from athina.loaders import ResponseLoader

# Load environment variables from a local .env file
load_dotenv()

# Register the API keys with the Athina SDK
OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))
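If you don't already have one, the .env file referenced above simply contains your keys as environment variable assignments (the values here are placeholders):

OPENAI_API_KEY=sk-your-openai-key
ATHINA_API_KEY=your-athina-api-key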
 

Step 2: Initialize Your Dataset

Load your dataset using the ResponseLoader class. This class ensures that the data is in the correct format, with a "response" field, as expected by the evaluator.
# Create batch dataset from list of dict objects
raw_data = [
    {"response": "I'm sorry but I can't help you with that query"},
    {"response": "I can help you with that query"},
]

dataset = ResponseLoader().load_dict(raw_data)
pd.DataFrame(dataset)
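Printed as a DataFrame, the two rows should show the response text in a single column (the loader may add extra optional columns); something like the illustrative view below.

# Illustrative output of pd.DataFrame(dataset):
#                                           response
# 0  I'm sorry but I can't help you with that query
# 1                   I can help you with that query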
 

Step 3: Configure and Run the Evaluator

Configure the evaluator using the GradingCriteria class by specifying your custom grading criteria. Optionally, you can also select the model you wish to use for grading.
# Fails if the LLM response says it cannot answer the user's query
eval_model = "gpt-3.5-turbo"

grading_criteria = "If the response says it cannot answer the query, then fail. Otherwise pass."

GradingCriteria(
    model=eval_model,
    grading_criteria=grading_criteria
).run_batch(data=dataset, max_parallel_evals=2).to_df()
This setup will evaluate each response in your dataset. If a response indicates that it cannot help with the query, it will fail; otherwise, it will pass.
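If you assign the result to a variable, you can inspect or filter the graded rows directly. In the sketch below, the passed and reason column names are assumptions and may vary with your Athina SDK version.

results_df = GradingCriteria(
    model=eval_model,
    grading_criteria=grading_criteria
).run_batch(data=dataset, max_parallel_evals=2).to_df()

# Keep only the rows the grader marked as failing (column names assumed)
failed_rows = results_df[results_df["passed"] == False]
print(failed_rows[["reason"]])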
 
As always, you can reach out to us for help anytime at hello@athina.ai or using the chat on https://athina.ai.

Athina can help. Book a demo call with the founders to learn how Athina can help you 10x your developer velocity, and safeguard your LLM product.

Want to build a reliable GenAI product?

Book a demo