How to evaluate your Llama Index query engine using Ragas evals + Athina AI

If you're using Llama Index to work with advanced retrieval strategies, you're going to need a great evaluation setup. Here's how you can use Athina's SDK to run Ragas evals on your Llama Index RAG pipeline.

Ragas is a library of state-of-the-art evaluation metrics for RAG pipelines.

Why use Athina?

  • Suite of over 35 state-of-the-art evals
  • UI to log and view results for easy comparison
  • Quick setup - run a suite of evals in a few lines of code, or set up automatic evals in a few clicks using the UI
  • Granular analytics on model performance, segmented at every level
You can try the sandbox account here.

How to run Ragas evals on your Llama Index query engine using Athina

  1. Install the dependencies
pip install athina
  2. Import dependencies and set your API keys
You'll need to set OPENAI_API_KEY and ATHINA_API_KEY in your .env file.
(This will still run without ATHINA_API_KEY, but you won't be able to view the results in the dashboard UI or track your experiments.)
import os
from athina.evals import (
    RagasContextRelevancy,
    RagasAnswerRelevancy,
    RagasFaithfulness,
    RagasAnswerCorrectness,
)
from athina.runner.run import EvalRunner
from athina.loaders import RagasLoader
from athina.keys import AthinaApiKey, OpenAiApiKey
from llama_index import VectorStoreIndex, ServiceContext
from llama_index import download_loader
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))
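As a quick sanity check before running anything, you might verify that the keys were actually loaded. The sketch below is only an illustration: `missing_keys` and `check_api_keys` are hypothetical helpers, not part of the Athina SDK, and they assume the two environment variable names used above.

```python
import os


def missing_keys(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def check_api_keys(env=None):
    """Fail fast on a missing OpenAI key; warn if only the Athina key is absent.

    Hypothetical helper: returns a warning string, or None when both keys are set.
    """
    if missing_keys(["OPENAI_API_KEY"], env):
        raise RuntimeError("OPENAI_API_KEY is not set, so the evaluators cannot call the LLM.")
    if missing_keys(["ATHINA_API_KEY"], env):
        return "ATHINA_API_KEY is not set: evals will run, but results will not appear in the dashboard."
    return None
```

Calling `check_api_keys()` right after `load_dotenv()` would stop early on a missing OpenAI key and return a warning message when only the Athina key is absent.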
  3. Create a Llama Index query engine
Here's some boilerplate code to do this, but if you have a real project using Llama Index, you're probably already doing something like this.
WikipediaReader = download_loader("WikipediaReader")

loader = WikipediaReader()

documents = loader.load_data(pages=['Y Combinator'])

vector_index = VectorStoreIndex.from_documents(
    documents, service_context=ServiceContext.from_defaults(chunk_size=512)
)

query_engine = vector_index.as_query_engine()
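Before evaluating, it can help to run a single query by hand and look at what the engine retrieved. The following is a minimal sketch, assuming the response object exposes `source_nodes` whose `.node` has a `get_content()` method, as Llama Index responses generally do. `inspect_query` is a hypothetical helper name, not something provided by Llama Index or Athina.

```python
def inspect_query(query_engine, query):
    """Run one query and return the answer text plus the retrieved chunks.

    Assumes a Llama Index style response: str(response) gives the answer and
    response.source_nodes lists the retrieved nodes.
    """
    response = query_engine.query(query)
    contexts = [source.node.get_content() for source in response.source_nodes]
    return {"query": query, "response": str(response), "contexts": contexts}
```

With the engine above, `inspect_query(query_engine, "Who founded YC?")` would return the generated answer alongside the chunks it was grounded on, which is roughly the information the evaluators score.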
  4. Load your evaluation dataset
Here's a very basic example with 2 datapoints.
raw_data_llama_index = [
    {
        "query": "How much equity does YC take?",
        "expected_response": "YC invests $500k in exchange for 7% equity."
    },
    {
        "query": "Who founded YC?",
        "expected_response": "YC was founded by Paul Graham, Jessica Livingston, Robert Tappan Morris, and Trevor Blackwell."
    },
]

llama_index_dataset = RagasLoader(query_engine=query_engine).load_dict(raw_data_llama_index)

# Optional - print as a pandas DataFrame for nicer visibility
pd.DataFrame(llama_index_dataset)
All datapoints must contain a query.
Certain Ragas evaluators also require an expected_response. This is a reference response that the evaluator can compare against the LLM-generated response.
Note that not all evaluators need the expected_response field. See the docs for more info.
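The field requirements above can be checked on the raw data before it is passed to the loader. This is a rough sketch under the assumption that datapoints are plain dicts with the `query` and `expected_response` keys shown earlier. `validate_datapoints` is a hypothetical helper, not part of the Athina SDK.

```python
def validate_datapoints(datapoints, require_expected_response=False):
    """Return a list of (index, problem) tuples; an empty list means the data looks fine."""
    problems = []
    for i, datapoint in enumerate(datapoints):
        # Every datapoint needs a non-empty query.
        if not str(datapoint.get("query") or "").strip():
            problems.append((i, "missing query"))
        # Only some evaluators need a reference answer, so this check is opt-in.
        if require_expected_response and not str(datapoint.get("expected_response") or "").strip():
            problems.append((i, "missing expected_response"))
    return problems
```

For a suite that includes an evaluator needing a reference answer, you could call `validate_datapoints(raw_data_llama_index, require_expected_response=True)` and fix any reported rows first.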
  5. Run the eval!
Now all that's left is to run the evaluator!
eval_suite = [
    RagasAnswerCorrectness(),
    RagasFaithfulness(),
    RagasContextRelevancy(),
    RagasAnswerRelevancy(),
]

# Run the evaluation suite
EvalRunner.run_suite(
    evals=eval_suite,
    data=llama_index_dataset,
    max_parallel_evals=1,   # If you increase this, you may run into rate limits
)
Each evaluation metric is scored on a scale of 0 to 1, where 0.0 is the worst possible score and 1.0 is the best.
You'll see the results as a DataFrame.
The evaluation request will also be logged and saved to Athina, and you can view the results from your Athina dashboard anytime!
These evaluations can help you quickly diagnose issues in your RAG pipeline. Eval-driven development can help you iterate 10x more rapidly than simply eyeballing retrievals and responses.
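One way to turn the 0 to 1 scores into a quick diagnosis is to average each metric and list the datapoints that fall below a cutoff. The sketch below works on plain dicts rather than the SDK's output. The metric names, the `summarize_scores` helper, and the 0.5 threshold are all illustrative assumptions, not Athina's result schema.

```python
from statistics import mean


def summarize_scores(rows, metrics, threshold=0.5):
    """For each metric, compute the mean score and the indices of rows scoring below threshold."""
    summary = {}
    for metric in metrics:
        # Skip rows where this metric was not computed.
        scored = [(i, row[metric]) for i, row in enumerate(rows) if row.get(metric) is not None]
        if not scored:
            continue
        summary[metric] = {
            "mean": mean(score for _, score in scored),
            "below_threshold": [i for i, score in scored if score < threshold],
        }
    return summary
```

If the results DataFrame has one column per metric, something like `summarize_scores(df.to_dict("records"), ["faithfulness"])` could feed this helper, though the actual column names would need to be checked against your output.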
The best teams will run a similar suite of evaluations in production as well.
If you want to run such evals automatically on your production logs, try our sandbox, or contact us to book a demo.

Written by

Shiv Sakhuja

Co-founder, Athina AI