Evaluating LLM applications in production comes with a lot of challenges.
Let’s take an evaluation system for a chatbot as an example.
Message-level evaluation WITH chat history
Most evaluations will measure the quality or accuracy of individual messages.
For example, some common evaluation metrics for RAG apps are Answer Relevancy and Answer Completeness.
But consider a chat that looks like this:
Conversation
#1
User: I want to buy a smartphone
AI: Great, do you have a brand in mind?
#2
User: Apple
AI: Good choice - would you like to buy an iPhone 15?
When message #2 is evaluated, the user_query is just “Apple”, which isn’t a complete query on its own. This makes the Answer Relevancy evaluator useless to run here.
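To make the problem concrete, here is a minimal sketch of what a message-level evaluator would see for message #2 in isolation. The field names are illustrative, not Athina’s actual data model:

```python
# What a message-level evaluator receives if message #2 is scored on its own.
eval_input = {
    "user_query": "Apple",  # not a complete question by itself
    "response": "Good choice - would you like to buy an iPhone 15?",
}
# An Answer Relevancy judge given only this pair has no way to know the user
# is shopping for a smartphone, so its verdict is essentially noise.
```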
So how do we solve this?
Athina’s evaluation orchestration will automatically include the previous chat history as context in the prompt for LLM-based evaluators.
So our Answer Relevancy and Answer Completeness evaluators can still produce meaningful results on messages like #2.

Answer Relevancy
Measures how relevant the response is to the user_query, given the previous chat_history as context. This is done by including the actual chat history in the evaluation prompt (https://github.com/athina-ai/athina-evals/blob/main/athina/evals/llm/context_contains_enough_information/evaluator.py).
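For illustration, here is a minimal sketch of how prior turns can be folded into the judge prompt. The helper function and prompt wording are hypothetical, not Athina’s actual implementation (that lives in the evaluator linked above):

```python
# A sketch of an LLM-based Answer Relevancy judge prompt that includes
# the previous chat history as context. Names and wording are illustrative.

def build_relevancy_eval_prompt(chat_history: list[dict], user_query: str, response: str) -> str:
    """Build a judge prompt that includes the prior turns of the conversation."""
    history_text = "\n".join(
        f"{turn['role']}: {turn['content']}" for turn in chat_history
    )
    return (
        "You are evaluating an AI assistant's response.\n\n"
        f"Previous conversation:\n{history_text}\n\n"
        f"User query: {user_query}\n"
        f"AI response: {response}\n\n"
        "Given the previous conversation as context, is the response relevant "
        "to the user's query? Answer Pass or Fail and give a short reason."
    )

# Example: evaluating message #2 from the conversation above.
history = [
    {"role": "user", "content": "I want to buy a smartphone"},
    {"role": "assistant", "content": "Great, do you have a brand in mind?"},
]
prompt = build_relevancy_eval_prompt(
    history, "Apple", "Good choice - would you like to buy an iPhone 15?"
)
```

With the history included, the judge can see that “Apple” answers the assistant’s earlier question, so the relevancy verdict is meaningful.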
Conversation-level evaluation
Sometimes, you might want evaluations that measure the entire conversation as a whole.
For example, Athina’s Conversation Coherence evaluator returns a ConversationCoherence score that represents the % of messages that were coherent with the chat.
- A high score indicates that the AI agent is coherent across the entire conversation.
- A low score indicates that many of the AI responses were incoherent given the previous chat history.
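As a rough sketch, the score can be thought of as the fraction of AI messages judged coherent given the chat history up to that point. The judge_coherence callable below is a hypothetical stand-in for the LLM-based per-message judgment, not Athina’s actual implementation:

```python
# A sketch of a conversation-level coherence score: the fraction of AI
# messages judged coherent with the chat history that precedes them.

def conversation_coherence(messages: list[dict], judge_coherence) -> float:
    """Return the share of AI messages that are coherent with the prior chat."""
    verdicts = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        history = messages[:i]  # everything before this AI message
        verdicts.append(judge_coherence(history, msg["content"]))  # True / False
    return sum(verdicts) / len(verdicts) if verdicts else 0.0
```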
Here’s a post explaining how this works and how you can use it: