Evaluating LLM applications in production comes with a lot of challenges.
Let’s take an evaluation system for a chatbot as an example.
Message-level evaluation WITH chat history
Most evaluations will measure the quality or accuracy of individual messages.
For example, some common evaluation metrics for RAG apps are Answer Relevancy and Answer Completeness.
But consider a chat that looks like this:
Conversation
#1
User: I want to buy a smartphone
AI: Great, do you have a brand in mind?
#2
User: Apple
AI: Good choice - would you like to buy an iPhone 15?
When message #2 is evaluated, the user_query is just “Apple”, which isn’t a complete query on its own. This makes the Answer Relevancy evaluator useless to run here.
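To make the problem concrete, here is a minimal sketch of what a message-level evaluator would see for message #2 in isolation. The field names are illustrative, not Athina’s actual data model:

```python
# What a message-level evaluator receives if message #2 is scored on its own.
eval_input = {
    "user_query": "Apple",  # not a complete question by itself
    "response": "Good choice - would you like to buy an iPhone 15?",
}
# An Answer Relevancy judge given only this pair has no way to know the user
# is shopping for a smartphone, so its verdict is essentially noise.
```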
So how do we solve this?
Athina’s evaluation orchestration will automatically include the previous chat history as context in the prompt for LLM-based evaluators.
So our Answer Relevancy and Answer Completeness evaluators can still produce meaningful results on messages like #2.

Answer Relevancy
Measures how relevant the response is to the user_query, given the previous chat_history as context. This is done by including the actual chat history in the evaluation prompt (https://github.com/athina-ai/athina-evals/blob/main/athina/evals/llm/context_contains_enough_information/evaluator.py).
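For illustration, here is a minimal sketch of how prior turns can be folded into the judge prompt. The helper function and prompt wording are hypothetical, not Athina’s actual implementation (that lives in the evaluator linked above):

```python
# A sketch of an LLM-based Answer Relevancy judge prompt that includes
# the previous chat history as context. Names and wording are illustrative.

def build_relevancy_eval_prompt(chat_history: list[dict], user_query: str, response: str) -> str:
    """Build a judge prompt that includes the prior turns of the conversation."""
    history_text = "\n".join(
        f"{turn['role']}: {turn['content']}" for turn in chat_history
    )
    return (
        "You are evaluating an AI assistant's response.\n\n"
        f"Previous conversation:\n{history_text}\n\n"
        f"User query: {user_query}\n"
        f"AI response: {response}\n\n"
        "Given the previous conversation as context, is the response relevant "
        "to the user's query? Answer Pass or Fail and give a short reason."
    )

# Example: evaluating message #2 from the conversation above.
history = [
    {"role": "user", "content": "I want to buy a smartphone"},
    {"role": "assistant", "content": "Great, do you have a brand in mind?"},
]
prompt = build_relevancy_eval_prompt(
    history, "Apple", "Good choice - would you like to buy an iPhone 15?"
)
```

With the history included, the judge can see that “Apple” answers the assistant’s earlier question, so the relevancy verdict is meaningful.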
Conversation-level evaluation
Sometimes, you might want evaluations that measure the entire conversation as a whole.
For example, Athina’s Conversation Coherence evaluator returns a ConversationCoherence score that represents the % of messages that were coherent with the chat.
- A high score indicates that the AI agent is coherent across the entire conversation.
- A low score indicates that many of the AI responses were incoherent given the previous chat history.
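As a rough sketch, the score can be thought of as the fraction of AI messages judged coherent given the chat history up to that point. The judge_coherence callable below is a hypothetical stand-in for the LLM-based per-message judgment, not Athina’s actual implementation:

```python
# A sketch of a conversation-level coherence score: the fraction of AI
# messages judged coherent with the chat history that precedes them.

def conversation_coherence(messages: list[dict], judge_coherence) -> float:
    """Return the share of AI messages that are coherent with the prior chat."""
    verdicts = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        history = messages[:i]  # everything before this AI message
        verdicts.append(judge_coherence(history, msg["content"]))  # True / False
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```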
Here’s a post explaining how this works and how you can use it: