Evaluation helps you measure the quality of your RAG pipeline and identify areas for improvement. LlamaIndex provides evaluators for faithfulness, relevancy, and correctness.

Overview

All evaluators implement the BaseEvaluator interface:
interface BaseEvaluator {
  evaluate(params: EvaluatorParams): Promise<EvaluationResult>;
  evaluateResponse?(params: EvaluatorResponseParams): Promise<EvaluationResult>;
}
Evaluators return an EvaluationResult:
type EvaluationResult = {
  query?: string;
  contexts?: string[];
  response: string;
  score: number;        // Numeric score (0-1 for faithfulness/relevancy, 1-5 for correctness)
  passing: boolean;     // Whether evaluation passed
  feedback: string;     // Detailed feedback from evaluator
};
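Any object with this shape can be dropped into the evaluation code shown throughout this page. As an illustrative sketch only (not part of LlamaIndex), here is a trivial non-LLM evaluator that implements the interface by measuring keyword overlap between the response and the contexts:

```typescript
// Sketch: a non-LLM evaluator matching the BaseEvaluator shape above.
// It scores a response by the fraction of its words that appear in the
// contexts -- purely illustrative, not a substitute for the LLM-based
// evaluators below.
type EvaluationResult = {
  query?: string;
  contexts?: string[];
  response: string;
  score: number;
  passing: boolean;
  feedback: string;
};

class KeywordOverlapEvaluator {
  constructor(private threshold = 0.5) {}

  async evaluate(params: {
    query?: string;
    response: string;
    contexts?: string[];
  }): Promise<EvaluationResult> {
    const contextWords = new Set(
      (params.contexts ?? [])
        .join(" ")
        .toLowerCase()
        .split(/\W+/)
        .filter(Boolean)
    );
    const responseWords = params.response
      .toLowerCase()
      .split(/\W+/)
      .filter(Boolean);
    const grounded = responseWords.filter((w) => contextWords.has(w)).length;
    const score = responseWords.length ? grounded / responseWords.length : 0;
    return {
      ...params,
      score,
      passing: score >= this.threshold,
      feedback: `${grounded}/${responseWords.length} response words appear in the contexts`,
    };
  }
}
```

Because it satisfies the same interface, such a stand-in can be useful for fast, deterministic smoke tests before switching to the LLM-based evaluators.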

Faithfulness

What it measures: Whether the response is grounded in the provided context. Faithfulness catches hallucinations: claims in the answer that are not supported by the source documents.
import { FaithfulnessEvaluator } from "llamaindex/evaluation";

const evaluator = new FaithfulnessEvaluator({
  raiseError: false, // Don't throw on failing evaluation
});

const result = await evaluator.evaluate({
  query: "What is LlamaIndex?",
  response: "LlamaIndex is a data framework for LLM applications.",
  contexts: [
    "LlamaIndex is a data framework for building LLM applications.",
    "It provides tools for data ingestion, indexing, and querying.",
  ],
});

console.log(result.passing); // true
console.log(result.score);   // 1.0
console.log(result.feedback); // "Yes"

Evaluate Response Objects

Directly evaluate query engine responses:
const response = await queryEngine.query({
  query: "What is LlamaIndex?",
});

const result = await evaluator.evaluateResponse({
  query: "What is LlamaIndex?",
  response: response,
});

Custom Prompts

import { PromptTemplate } from "@llamaindex/core/prompts";

const faithfulnessPrompt = new PromptTemplate({
  template: `Context: {context}

Response: {query}

Is the response faithful to the context? Answer yes or no.`,
});

const evaluator = new FaithfulnessEvaluator({
  faithfulnessSystemPrompt: faithfulnessPrompt,
});

Relevancy

What it measures: Whether the response actually answers the question. Relevancy checks if the response addresses the user’s query.
import { RelevancyEvaluator } from "llamaindex/evaluation";

const evaluator = new RelevancyEvaluator();

const result = await evaluator.evaluate({
  query: "What is the capital of France?",
  response: "Paris is the capital of France.",
  contexts: ["Paris is the capital and largest city of France."],
});

console.log(result.passing); // true
console.log(result.score);   // 1.0

How It Works

Relevancy uses an LLM to determine if the response answers the question:
  1. Formats query and response together
  2. Queries a SummaryIndex of the contexts
  3. LLM answers “yes” or “no”
  4. Returns score (1.0 for yes, 0.0 for no)
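Steps 3-4 boil down to mapping a yes/no verdict onto the score convention. A rough sketch of that mapping (the library's internal parsing may differ):

```typescript
// Sketch: map an LLM's yes/no verdict to a relevancy score,
// following the convention above (1.0 for yes, 0.0 for no).
function scoreFromVerdict(raw: string): { score: number; passing: boolean } {
  const normalized = raw.trim().toLowerCase();
  const isYes = normalized.startsWith("yes");
  return { score: isYes ? 1.0 : 0.0, passing: isYes };
}
```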

Correctness

What it measures: How correct the response is compared to a reference answer. Correctness requires a reference (ground truth) answer:
import { CorrectnessEvaluator } from "llamaindex/evaluation";

const evaluator = new CorrectnessEvaluator({
  scoreThreshold: 4.0, // Passing score threshold
});

const result = await evaluator.evaluate({
  query: "What is 2+2?",
  response: "2+2 equals 4",
  reference: "The answer is 4",
});

console.log(result.score);   // 5.0 (scale of 1-5)
console.log(result.passing); // true (>= 4.0)
console.log(result.feedback); // Reasoning for the score

Score Scale

Correctness uses a 1-5 scale:
  • 5 - Perfect match
  • 4 - Correct with minor differences
  • 3 - Partially correct
  • 2 - Mostly incorrect
  • 1 - Completely incorrect
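Combined with the scoreThreshold option from the example above, the pass/fail decision is a straightforward comparison against this scale; roughly:

```typescript
// Sketch: interpreting the 1-5 correctness scale against a
// scoreThreshold (4.0 in the earlier example).
const CORRECTNESS_LABELS: Record<number, string> = {
  5: "Perfect match",
  4: "Correct with minor differences",
  3: "Partially correct",
  2: "Mostly incorrect",
  1: "Completely incorrect",
};

function interpret(score: number, scoreThreshold = 4.0) {
  return {
    label: CORRECTNESS_LABELS[Math.round(score)] ?? "Unknown",
    passing: score >= scoreThreshold,
  };
}
```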

Custom Parser

Customize how the evaluator parses the LLM's raw output into a score and reasoning:
function customParser(response: string): [number, string] {
  // Extract score and reasoning from response
  const scoreMatch = response.match(/Score: (\d+)/);
  const score = scoreMatch ? parseInt(scoreMatch[1], 10) : 0;
  const reasoning = response.split("\n").slice(1).join("\n");
  return [score, reasoning];
}

const evaluator = new CorrectnessEvaluator({
  parserFunction: customParser,
});
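To see what the parser produces, here it is applied to a sample LLM reply (the function is repeated so the snippet runs on its own):

```typescript
// The parser from above, applied to a sample LLM reply.
function customParser(response: string): [number, string] {
  const scoreMatch = response.match(/Score: (\d+)/);
  const score = scoreMatch ? parseInt(scoreMatch[1], 10) : 0;
  const reasoning = response.split("\n").slice(1).join("\n");
  return [score, reasoning];
}

const sample = "Score: 4\nThe response is correct but phrased differently.";
const [score, reasoning] = customParser(sample);
// score === 4
// reasoning === "The response is correct but phrased differently."
```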

Batch Evaluation

Evaluate multiple queries:
import {
  FaithfulnessEvaluator,
  RelevancyEvaluator,
  CorrectnessEvaluator,
} from "llamaindex/evaluation";

const testCases = [
  {
    query: "What is LlamaIndex?",
    reference: "LlamaIndex is a data framework for LLMs",
  },
  {
    query: "How do I create an index?",
    reference: "Use VectorStoreIndex.fromDocuments()",
  },
];

const faithfulness = new FaithfulnessEvaluator();
const relevancy = new RelevancyEvaluator();
const correctness = new CorrectnessEvaluator();

const results = [];

for (const testCase of testCases) {
  const response = await queryEngine.query({
    query: testCase.query,
  });

  const [faithResult, relResult, corrResult] = await Promise.all([
    faithfulness.evaluateResponse({ query: testCase.query, response }),
    relevancy.evaluateResponse({ query: testCase.query, response }),
    correctness.evaluate({
      query: testCase.query,
      response: response.toString(),
      reference: testCase.reference,
    }),
  ]);

  results.push({
    query: testCase.query,
    faithfulness: faithResult.score,
    relevancy: relResult.score,
    correctness: corrResult.score,
    passing: faithResult.passing && relResult.passing && corrResult.passing,
  });
}

// Calculate averages
const avgFaithfulness = results.reduce((sum, r) => sum + r.faithfulness, 0) / results.length;
const avgRelevancy = results.reduce((sum, r) => sum + r.relevancy, 0) / results.length;
const avgCorrectness = results.reduce((sum, r) => sum + r.correctness, 0) / results.length;

console.log({
  avgFaithfulness,
  avgRelevancy,
  avgCorrectness,
  passRate: results.filter(r => r.passing).length / results.length,
});

Rate Limiting

Avoid API rate limits:
const results = [];

for (const testCase of testCases) {
  const result = await evaluator.evaluate({
    query: testCase.query,
    response: testCase.response,
    contexts: testCase.contexts,
    sleepTimeInSeconds: 1, // Wait 1 second between calls
  });
  results.push(result);
}
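If you prefer not to rely on the sleepTimeInSeconds parameter, the same throttling can be done manually with a small delay helper; a sketch:

```typescript
// Sketch: manual throttling between evaluation calls, as an
// alternative to the sleepTimeInSeconds parameter.
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function evaluateThrottled<T, R>(
  items: T[],
  run: (item: T) => Promise<R>,
  delayMs = 1000
): Promise<R[]> {
  const results: R[] = [];
  for (const item of items) {
    results.push(await run(item)); // sequential, never concurrent
    await sleep(delayMs);          // pause before the next API call
  }
  return results;
}
```

Here run would wrap a call like evaluator.evaluate(testCase); the helper keeps calls strictly sequential with a fixed gap between them.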

Evaluation Pipeline

Create a comprehensive evaluation workflow:
class RAGEvaluationPipeline {
  constructor(
    private queryEngine: any,
    private evaluators = {
      faithfulness: new FaithfulnessEvaluator(),
      relevancy: new RelevancyEvaluator(),
      correctness: new CorrectnessEvaluator(),
    }
  ) {}

  async evaluate(testCases: Array<{
    query: string;
    reference?: string;
  }>) {
    const results = [];

    for (const testCase of testCases) {
      const response = await this.queryEngine.query({
        query: testCase.query,
      });

      const evalPromises = [
        this.evaluators.faithfulness.evaluateResponse({
          query: testCase.query,
          response,
        }),
        this.evaluators.relevancy.evaluateResponse({
          query: testCase.query,
          response,
        }),
      ];

      if (testCase.reference) {
        evalPromises.push(
          this.evaluators.correctness.evaluate({
            query: testCase.query,
            response: response.toString(),
            reference: testCase.reference,
          })
        );
      }

      const [faithfulness, relevancy, correctness] = await Promise.all(evalPromises);

      results.push({
        query: testCase.query,
        response: response.toString(),
        scores: {
          faithfulness: faithfulness.score,
          relevancy: relevancy.score,
          correctness: correctness?.score,
        },
        passing: {
          faithfulness: faithfulness.passing,
          relevancy: relevancy.passing,
          correctness: correctness?.passing ?? true,
        },
        feedback: {
          faithfulness: faithfulness.feedback,
          relevancy: relevancy.feedback,
          correctness: correctness?.feedback,
        },
      });
    }

    return this.summarize(results);
  }

  private summarize(results: any[]) {
    return {
      results,
      summary: {
        total: results.length,
        passed: results.filter(r => 
          r.passing.faithfulness && 
          r.passing.relevancy && 
          r.passing.correctness
        ).length,
        avgScores: {
          faithfulness: this.average(results.map(r => r.scores.faithfulness)),
          relevancy: this.average(results.map(r => r.scores.relevancy)),
          correctness: this.average(
            results.map(r => r.scores.correctness).filter((s): s is number => s !== undefined)
          ),
        },
      },
    };
  }

  private average(numbers: number[]) {
    // Guard against an empty list (e.g. no test case had a reference)
    return numbers.length
      ? numbers.reduce((sum, n) => sum + n, 0) / numbers.length
      : 0;
  }
}

// Usage
const pipeline = new RAGEvaluationPipeline(queryEngine);

const evaluation = await pipeline.evaluate([
  { query: "What is LlamaIndex?", reference: "A data framework" },
  { query: "How do I use it?", reference: "Import and create an index" },
]);

console.log(evaluation.summary);

Best Practices

Test Set Creation:
  • Create diverse test cases covering different query types
  • Include edge cases and common failure modes
  • Use real user queries when possible
  • Maintain reference answers for correctness evaluation
Metric Selection:
  • Faithfulness - Critical for preventing hallucinations
  • Relevancy - Ensures responses answer the question
  • Correctness - Requires reference answers, best for regression testing
Iteration:
  1. Establish baseline scores
  2. Make changes (prompts, retrievers, etc.)
  3. Re-run evaluation
  4. Compare scores to baseline
  5. Keep improvements, discard regressions
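Comparing a run against the baseline (step 4) can be automated; a sketch, assuming the avgScores shape from the batch example above:

```typescript
// Sketch: flag per-metric regressions between a baseline run and
// the current run, per the iteration loop above.
type Scores = { faithfulness: number; relevancy: number; correctness: number };

function compareToBaseline(baseline: Scores, current: Scores, tolerance = 0.01) {
  const regressions: string[] = [];
  for (const metric of Object.keys(baseline) as (keyof Scores)[]) {
    // Allow small noise within the tolerance band
    if (current[metric] < baseline[metric] - tolerance) regressions.push(metric);
  }
  return { improved: regressions.length === 0, regressions };
}
```

Running this in CI against stored baseline scores turns step 5 ("keep improvements, discard regressions") into an automatic check.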
Performance:
  • Run evaluations in parallel when possible
  • Cache LLM responses to avoid redundant calls
  • Use rate limiting to avoid API errors
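The caching point above can be as simple as memoizing results by query/response pair; an illustrative sketch (the EvalFn shape here is a simplification of the evaluator API):

```typescript
// Sketch: memoize evaluation results so re-running a test set does
// not repeat identical LLM calls. EvalFn is a simplified stand-in
// for an evaluator's evaluate method.
type EvalFn = (params: { query: string; response: string }) => Promise<{
  score: number;
  passing: boolean;
  feedback: string;
  response: string;
}>;

function withCache(evaluate: EvalFn): EvalFn {
  const cache = new Map<string, ReturnType<EvalFn>>();
  return (params) => {
    // Key on the (query, response) pair; identical inputs hit the cache
    const key = `${params.query}\u0000${params.response}`;
    if (!cache.has(key)) cache.set(key, evaluate(params));
    return cache.get(key)!;
  };
}
```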

Next Steps

Postprocessors

Improve retrieval quality with filtering and reranking

Memory

Manage conversation context and history