How to evaluate your Llama Index query engine using Ragas evals + Athina AI

If you're using Llama Index to work with advanced retrieval strategies, you're going to need a great evaluation setup.

Ragas is a library of state-of-the-art evaluation metrics for RAG pipelines.

Here's how you can use Athina's SDK to run Ragas evals on your Llama Index RAG pipeline.

Why use Athina?

Suite of over 35 state-of-the-art evals

UI to log and view results for easy comparison

Quick setup - run a suite of evals in a few lines of code, or set up automatic evals in a few clicks using the UI

Granular analytics on model performance, segmented at every level

You can try the sandbox account here.

How to run Ragas evals on your Llama Index query engine using Athina

Install the dependencies

pip install athina

Import dependencies and set your API keys

You'll need to set the OPENAI_API_KEY and the ATHINA_API_KEY in your .env file.

(This will still run without the ATHINA_API_KEY, but you won't be able to view the results in the dashboard UI or track your experiments.)

import os
from athina.evals import (
    RagasContextRelevancy,
    RagasAnswerRelevancy,
    RagasFaithfulness,
    RagasAnswerCorrectness,
)
from athina.runner.run import EvalRunner
from athina.loaders import RagasLoader
from athina.keys import AthinaApiKey, OpenAiApiKey
# Legacy import paths (llama-index before 0.10); newer releases moved these under llama_index.core
from llama_index import VectorStoreIndex, ServiceContext
from llama_index import download_loader
import pandas as pd
from dotenv import load_dotenv

load_dotenv()

OpenAiApiKey.set_key(os.getenv('OPENAI_API_KEY'))
AthinaApiKey.set_key(os.getenv('ATHINA_API_KEY'))
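Since the text implies the OpenAI key is required while the Athina key is optional, it can help to check both before running anything. The sketch below is one way to do that in plain Python. `check_api_keys` is a hypothetical helper name and is not part of the Athina SDK.

```python
import os

# Assumption from the text above: the OpenAI key is needed to run the evals,
# while the Athina key is only needed for logging results to the dashboard.
REQUIRED_KEYS = ("OPENAI_API_KEY",)
OPTIONAL_KEYS = ("ATHINA_API_KEY",)


def check_api_keys(env=None):
    """Raise if a required key is missing; return the optional keys that are missing.

    `env` defaults to os.environ. Hypothetical helper, not part of the Athina SDK.
    """
    env = os.environ if env is None else env
    missing_required = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing_required:
        raise RuntimeError(f"Missing required keys: {', '.join(missing_required)}")
    return [key for key in OPTIONAL_KEYS if not env.get(key)]
```

Calling `check_api_keys()` right after `load_dotenv()` fails fast on a missing OpenAI key. If the returned list contains `ATHINA_API_KEY`, the evals should still run, but the results won't show up in the dashboard.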

Create a Llama Index query engine

Here's some boilerplate code to do this. If you already have a project that uses Llama Index, you're probably doing something similar.

WikipediaReader = download_loader("WikipediaReader")

loader = WikipediaReader()

documents = loader.load_data(pages=['Y Combinator'])

vector_index = VectorStoreIndex.from_documents(
    documents, service_context=ServiceContext.from_defaults(chunk_size=512)
)

query_engine = vector_index.as_query_engine()

Load your evaluation dataset

Here's a very basic example with two datapoints.

raw_data_llama_index = [
    {
        "query": "How much equity does YC take?",
        "expected_response": "YC invests $500k in exchange for 7% equity."
    },
    {
        "query": "Who founded YC?",
        "expected_response": "YC was founded by Paul Graham, Jessica Livingston, Robert Tappan Morris, and Trevor Blackwell."
    },
]

llama_index_dataset = RagasLoader(query_engine=query_engine).load_dict(raw_data_llama_index)

# Optional - display as a pandas DataFrame for nicer visibility (renders in a notebook; wrap in print() in a script)
pd.DataFrame(llama_index_dataset)

All datapoints must contain a query.

For certain Ragas evaluators, you will also need an expected_response. This is a reference response that the evaluator can compare to the LLM-generated response.

Not all evaluators need the expected_response field. See the docs for more info.
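The requirements above can be checked before the dataset is loaded. This is a minimal sketch that assumes the raw datapoints are plain dicts with the `query` and `expected_response` keys shown earlier. `validate_datapoints` is a hypothetical helper, not an Athina function.

```python
def validate_datapoints(datapoints):
    """Check raw datapoints before handing them to the loader.

    Raises ValueError if any datapoint lacks a non-empty "query".
    Returns the indices of datapoints that have no "expected_response",
    since only some evaluators need a reference answer.
    """
    missing_reference = []
    for index, point in enumerate(datapoints):
        if not point.get("query"):
            raise ValueError(f"Datapoint {index} has no query")
        if not point.get("expected_response"):
            missing_reference.append(index)
    return missing_reference
```

For example, `validate_datapoints(raw_data_llama_index)` returns an empty list for the two datapoints above. A non-empty result lists the rows to fill in, or to leave out when running evaluators that compare against a reference.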

Run the eval!

Now all that's left is to run the evaluator!

eval_suite = [
    RagasAnswerCorrectness(),
    RagasFaithfulness(),
    RagasContextRelevancy(),
    RagasAnswerRelevancy(),
]

# Run the evaluation suite
EvalRunner.run_suite(
    evals=eval_suite,
    data=llama_index_dataset,
    max_parallel_evals=1,   # If you increase this, you may run into rate limits
)

Evaluation metrics are on a scale of 0 to 1, where 0.0 is the worst possible score and 1.0 is the best.

You'll see the results as a DataFrame.

The evaluation request will also be logged and saved to Athina, and you can view the results from your Athina dashboard anytime!

These evaluations can help you quickly diagnose issues in your RAG pipeline. Eval-driven development can help you iterate 10x more rapidly than by simply eyeballing retrievals and responses.
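One way to turn the scores into a quick diagnosis is to average each metric and flag the weak ones. The sketch below makes several assumptions: the row layout (one dict of metric name to score per datapoint), the metric names in the usage note, and the 0.5 threshold are all illustrative and may not match the DataFrame Athina returns.

```python
from statistics import mean


def summarize_scores(rows, threshold=0.5):
    """Average each metric across rows and list the metrics below `threshold`.

    `rows` is assumed to be a list of dicts mapping metric name -> score in [0, 1].
    Scores of None (metric not computed for that row) are skipped.
    """
    scores_by_metric = {}
    for row in rows:
        for name, score in row.items():
            if score is not None:
                scores_by_metric.setdefault(name, []).append(score)
    averages = {name: mean(scores) for name, scores in scores_by_metric.items()}
    weak = sorted(name for name, avg in averages.items() if avg < threshold)
    return averages, weak
```

Roughly speaking, a low average on a retrieval-side metric such as context relevancy points at chunking or retrieval settings, while a low faithfulness average points at the generation step.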

The best teams will run a similar suite of evaluations in production as well.

If you want to run such evals automatically on your production logs, try our sandbox, or contact us to book a demo.
