Evaluating LLM Chatbot Conversations with Athina AI

Evaluating LLM Chatbot Conversations with Athina AI
Do not index
Do not index
Original Paper
Blog URL
Evaluating LLM applications in production comes with a lot of challenges.
 
Let’s take an evaluation system for a chatbot, as an example.
 

Message-level evaluation WITH chat history

 
Most evaluations will measure the quality or accuracy of individual messages.
 
For example, some common evaluation metrics for RAG apps are Answer Relevancy and Answer Completeness.
 
💡
Answer Relevancy (Github) (Docs) Measures how relevant the response is to the user_query
💡
Answer Completeness (Github) (Docs) Does the response completely address the user_query
 
But consider a chat that looks like this:
Conversation

#1
User: I want to buy a smartphone
AI: Great, do you have a brand in mind?

#2
User: Apple
AI: Good choice - would you like to buy an iPhone 15?
 
When message #2 is evaluated, the user_query is just “Apple”, which isn’t a complete query. This means the Answer Relevancy evaluator will be useless to run here.
 

So how do we solve this?

 
Athina’s evaluation orchestration will automatically include the previous chat history as context in the prompt for LLM-based evaluators.
 
So our Answer Completeness evaluators
 
💡
Answer Relevancy Measures how relevant the response is to the user_query given the previous chat_history as context.
 
This is done by including the actual chat history in the evaluation prompt (https://github.com/athina-ai/athina-evals/blob/main/athina/evals/llm/context_contains_enough_information/evaluator.py).
 
 
 
 

Conversation-level evaluation

Sometimes, you might want evaluations that measure the entire conversation as a whole.
 
For example, Athina’s Conversation Coherence evaluator returns a ConversationCoherence score that represents the % of messages that were coherent with the chat.
 
  • A high score indicates that the AI agent is coherent across the entire conversation.
  • A low score indicates that many of the AI responses were incoherent given the previous chat history.
 
Here’s a post explaining how this works and how you can use it:

How to Evaluate AI Chats Using Conversation Coherence Evaluator

 

Athina can help. Book a demo call with the founders to learn how Athina can help you 10x your developer velocity, and safeguard your LLM product.

Want to build a reliable GenAI product?

Book a demo

Written by

Shiv Sakhuja
Shiv Sakhuja

Co-founder, Athina AI