
Custom Eval with LLM as a Judge ✅

This example demonstrates how to create a custom LLM-based evaluation metric in Kastrax to check recipes for gluten content using an AI chef agent.

Overview ✅

The example shows how to:

  1. Create a custom LLM-based metric
  2. Use an agent to generate and evaluate recipes
  3. Check recipes for gluten content
  4. Provide detailed feedback about gluten sources

Setup ✅

Environment Setup

Make sure to set up your environment variables:

.env
OPENAI_API_KEY=your_api_key_here
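A missing or placeholder key usually surfaces only as an authentication error on the first model call. As a minimal sketch, a startup check like the one below can fail earlier. `requireApiKey` is a hypothetical helper, not part of Kastrax; it takes the environment as a plain object so it can be tested without touching the real process environment.

```typescript
// Hypothetical helper (not a Kastrax API): returns the key or throws early.
function requireApiKey(
  env: Record<string, string | undefined>,
  key: string = 'OPENAI_API_KEY',
): string {
  const value = env[key]?.trim();
  // Reject both a missing value and the placeholder from the sample .env file.
  if (!value || value === 'your_api_key_here') {
    throw new Error(`${key} is missing or still set to the placeholder value`);
  }
  return value;
}
```

In a Node.js project this would typically be called once at startup as `requireApiKey(process.env)`.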

Defining Prompts ✅

The evaluation system uses three different prompts, each serving a specific purpose:

1. Instructions Prompt

This prompt sets the role and context for the judge:

src/kastrax/evals/gluten-checker/prompts.ts
export const GLUTEN_INSTRUCTIONS = `You are a Master Chef that identifies if recipes contain gluten.`;

2. Gluten Evaluation Prompt

This prompt creates a structured evaluation of gluten content, checking for specific components:

src/kastrax/evals/gluten-checker/prompts.ts
export const generateGlutenPrompt = ({ output }: { output: string }) => `Check if this recipe is gluten-free.

Check for:
- Wheat
- Barley
- Rye
- Common sources like flour, pasta, bread

Example with gluten:
"Mix flour and water to make dough"
Response: {
  "isGlutenFree": false,
  "glutenSources": ["flour"]
}

Example gluten-free:
"Mix rice, beans, and vegetables"
Response: {
  "isGlutenFree": true,
  "glutenSources": []
}

Recipe to analyze:
${output}

Return your response in this format:
{
  "isGlutenFree": boolean,
  "glutenSources": ["list ingredients containing gluten"]
}`;
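The prompt asks the model for a JSON object with two fields. The judge in this example validates that shape with a zod schema; the dependency-free sketch below shows the same check by hand, which may help when testing the prompt against raw model replies. `GlutenVerdict` and `parseGlutenVerdict` are hypothetical names introduced for illustration.

```typescript
// Shape requested by the evaluation prompt.
interface GlutenVerdict {
  isGlutenFree: boolean;
  glutenSources: string[];
}

// Hypothetical helper: parses a raw JSON reply and checks it field by field.
function parseGlutenVerdict(raw: string): GlutenVerdict {
  const data: unknown = JSON.parse(raw); // throws SyntaxError on invalid JSON
  if (typeof data !== 'object' || data === null) {
    throw new Error('Expected a JSON object');
  }
  const { isGlutenFree, glutenSources } = data as Record<string, unknown>;
  if (typeof isGlutenFree !== 'boolean') {
    throw new Error('"isGlutenFree" must be a boolean');
  }
  if (
    !Array.isArray(glutenSources) ||
    !glutenSources.every((source): source is string => typeof source === 'string')
  ) {
    throw new Error('"glutenSources" must be an array of strings');
  }
  return { isGlutenFree, glutenSources };
}
```

Passing the first example response from the prompt, `{ "isGlutenFree": false, "glutenSources": ["flour"] }`, returns an object with the same two fields; a reply that omits either field throws.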

3. Reasoning Prompt

This prompt generates a detailed explanation of why a recipe is or is not considered gluten-free:

src/kastrax/evals/gluten-checker/prompts.ts
export const generateReasonPrompt = ({
  isGlutenFree,
  glutenSources,
}: {
  isGlutenFree: boolean;
  glutenSources: string[];
}) => `Explain why this recipe is${isGlutenFree ? '' : ' not'} gluten-free.

${glutenSources.length > 0 ? `Sources of gluten: ${glutenSources.join(', ')}` : 'No gluten-containing ingredients found'}

Return your response in this format:
{
  "reason": "This recipe is [gluten-free/contains gluten] because [explanation]"
}`;

Creating the Judge ✅

Next, create a specialized judge that evaluates a recipe's gluten content. The judge imports the prompts defined above:

src/kastrax/evals/gluten-checker/metricJudge.ts
import { type LanguageModel } from '@kastrax/core/llm';
import { KastraxAgentJudge } from '@kastrax/evals/judge';
import { z } from 'zod';

import { GLUTEN_INSTRUCTIONS, generateGlutenPrompt, generateReasonPrompt } from './prompts';

export class GlutenCheckerJudge extends KastraxAgentJudge {
  constructor(model: LanguageModel) {
    super('Gluten Checker', GLUTEN_INSTRUCTIONS, model);
  }

  // Ask the judge model for a structured verdict on the recipe text.
  async evaluate(output: string): Promise<{
    isGlutenFree: boolean;
    glutenSources: string[];
  }> {
    const glutenPrompt = generateGlutenPrompt({ output });
    const result = await this.agent.generate(glutenPrompt, {
      output: z.object({
        isGlutenFree: z.boolean(),
        glutenSources: z.array(z.string()),
      }),
    });

    return result.object;
  }

  // Ask the judge model to explain the verdict in plain language.
  async getReason(args: { isGlutenFree: boolean; glutenSources: string[] }): Promise<string> {
    const prompt = generateReasonPrompt(args);
    const result = await this.agent.generate(prompt, {
      output: z.object({
        reason: z.string(),
      }),
    });

    return result.object.reason;
  }
}

The judge class handles the core evaluation logic through two main methods:

  • evaluate(): Analyzes the recipe and returns a verdict (isGlutenFree) together with the list of gluten sources found
  • getReason(): Provides human-readable explanation for the evaluation results
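Because both methods call a model, it can be useful to have a stand-in with the same two methods for offline tests. The sketch below is one possible stub, assuming a naive keyword match over the ingredients named in the evaluation prompt. `KeywordGlutenJudge`, `findGlutenSources`, and `GLUTEN_KEYWORDS` are hypothetical names, and the stub does not extend `KastraxAgentJudge`.

```typescript
interface GlutenVerdict {
  isGlutenFree: boolean;
  glutenSources: string[];
}

// Illustrative keyword list taken from the evaluation prompt; not exhaustive.
const GLUTEN_KEYWORDS = ['wheat', 'barley', 'rye', 'flour', 'pasta', 'bread'];

// Returns the keywords that appear as whole words in the recipe text.
// Naive by design: "rice flour" would still be flagged because it contains "flour".
function findGlutenSources(recipe: string): string[] {
  const text = recipe.toLowerCase();
  // Keywords are plain letters, so no regex escaping is needed.
  return GLUTEN_KEYWORDS.filter((keyword) => new RegExp(`\\b${keyword}\\b`).test(text));
}

// Hypothetical offline stand-in exposing the same two methods as the judge.
class KeywordGlutenJudge {
  async evaluate(output: string): Promise<GlutenVerdict> {
    const glutenSources = findGlutenSources(output);
    return { isGlutenFree: glutenSources.length === 0, glutenSources };
  }

  async getReason({ isGlutenFree, glutenSources }: GlutenVerdict): Promise<string> {
    return isGlutenFree
      ? 'This recipe is gluten-free because no listed gluten keywords were found.'
      : `This recipe contains gluten because it includes ${glutenSources.join(', ')}.`;
  }
}
```

A keyword stub is far less capable than an LLM judge, but it is deterministic, which makes it a reasonable fixture for testing code that consumes the verdict.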

Creating the Metric ✅

Create the metric class that uses the judge:

src/kastrax/evals/gluten-checker/index.ts
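import { Metric, type MetricResult } from '@kastrax/core/eval'; // assumed import path
import { type LanguageModel } from '@kastrax/core/llm';

import { GlutenCheckerJudge } from './metricJudge';

export interface MetricResultWithInfo extends MetricResult {
  info: {
    reason: string;
    glutenSources: string[];
  };
}

export class GlutenCheckerMetric extends Metric {
  private judge: GlutenCheckerJudge;

  constructor(model: LanguageModel) {
    super();
    this.judge = new GlutenCheckerJudge(model);
  }

  // The user input is accepted so the call shape matches the usage example; only the output is judged.
  async measure(_input: string, output: string): Promise<MetricResultWithInfo> {
    const { isGlutenFree, glutenSources } = await this.judge.evaluate(output);
    const score = await this.calculateScore(isGlutenFree);
    const reason = await this.judge.getReason({
      isGlutenFree,
      glutenSources,
    });

    return {
      score,
      info: {
        glutenSources,
        reason,
      },
    };
  }

  // Binary score: 1 for gluten-free, 0 otherwise.
  async calculateScore(isGlutenFree: boolean): Promise<number> {
    return isGlutenFree ? 1 : 0;
  }
}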

The metric class serves as the main interface for gluten content evaluation with the following methods:

  • measure(): Orchestrates the entire evaluation process and returns a comprehensive result
  • calculateScore(): Converts the evaluation verdict to a binary score (1 for gluten-free, 0 for contains gluten)
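The orchestration in measure() can be exercised without a model by injecting any object that exposes the judge's two methods. The following is a minimal sketch under that assumption; `GlutenJudgeLike`, `measureWith`, `toScore`, and `cannedJudge` are hypothetical names and do not depend on Kastrax.

```typescript
interface GlutenVerdict {
  isGlutenFree: boolean;
  glutenSources: string[];
}

// Minimal structural type covering the two judge methods the metric relies on.
interface GlutenJudgeLike {
  evaluate(output: string): Promise<GlutenVerdict>;
  getReason(verdict: GlutenVerdict): Promise<string>;
}

interface GlutenResult {
  score: number;
  info: { reason: string; glutenSources: string[] };
}

// Binary score: 1 for gluten-free, 0 for contains gluten.
function toScore(isGlutenFree: boolean): number {
  return isGlutenFree ? 1 : 0;
}

// Same three steps as the metric: get a verdict, score it, then fetch the reason.
async function measureWith(judge: GlutenJudgeLike, output: string): Promise<GlutenResult> {
  const verdict = await judge.evaluate(output);
  const score = toScore(verdict.isGlutenFree);
  const reason = await judge.getReason(verdict);
  return { score, info: { reason, glutenSources: verdict.glutenSources } };
}

// A canned judge that always reports flour, standing in for the LLM.
const cannedJudge: GlutenJudgeLike = {
  evaluate: async () => ({ isGlutenFree: false, glutenSources: ['flour'] }),
  getReason: async ({ glutenSources }) => `Contains gluten: ${glutenSources.join(', ')}`,
};
```

With the canned judge, `measureWith(cannedJudge, 'Mix flour and water')` resolves to a result whose score is 0 and whose glutenSources is `['flour']`.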

Setting Up the Agent ✅

Create an agent and attach the metric:

src/kastrax/agents/chefAgent.ts
import { openai } from '@ai-sdk/openai';
import { Agent } from '@kastrax/core/agent';

import { GlutenCheckerMetric } from '../evals';

export const chefAgent = new Agent({
  name: 'chef-agent',
  instructions:
    'You are Michel, a practical and experienced home chef. ' +
    'You help people cook with whatever ingredients they have available.',
  model: openai('gpt-4o-mini'),
  evals: {
    // The judge model runs separately from the agent's own model.
    glutenChecker: new GlutenCheckerMetric(openai('gpt-4o-mini')),
  },
});

Usage Example ✅

Here’s how to use the metric with an agent:

src/index.ts
import { kastrax } from './kastrax';

const chefAgent = kastrax.getAgent('chefAgent');
const metric = chefAgent.evals.glutenChecker;

// Example: Evaluate a recipe
const input = 'What is a quick way to make rice and beans?';
const response = await chefAgent.generate(input);
const result = await metric.measure(input, response.text);

console.log('Metric Result:', {
  score: result.score,
  glutenSources: result.info.glutenSources,
  reason: result.info.reason,
});

// Example Output:
// Metric Result: { score: 1, glutenSources: [], reason: 'The recipe is gluten-free as it does not contain any gluten-containing ingredients.' }

Understanding the Results ✅

The metric provides:

  • A score of 1 for gluten-free recipes and 0 for recipes containing gluten
  • A list of gluten sources (if any)
  • Detailed reasoning about the recipe’s gluten content
  • Evaluation based on:
    • Ingredient list
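When the metric is run over several recipes, the individual results can be rolled up into a gluten-free rate and a tally of the most common gluten sources. This is a sketch only; `GlutenResult` mirrors the result shape shown above, and `summarize` is a hypothetical helper rather than a Kastrax function.

```typescript
// Mirrors the result shape returned by the metric in this example.
interface GlutenResult {
  score: number;
  info: { reason: string; glutenSources: string[] };
}

interface GlutenSummary {
  glutenFreeRate: number; // fraction of results with score 1
  sourceCounts: Record<string, number>; // how often each gluten source was reported
}

// Hypothetical helper: aggregates a batch of metric results.
function summarize(results: GlutenResult[]): GlutenSummary {
  const sourceCounts: Record<string, number> = {};
  let glutenFree = 0;
  for (const result of results) {
    glutenFree += result.score; // scores are 0 or 1, so the sum counts gluten-free recipes
    for (const source of result.info.glutenSources) {
      sourceCounts[source] = (sourceCounts[source] ?? 0) + 1;
    }
  }
  // Report 0 for an empty batch to avoid dividing by zero.
  const glutenFreeRate = results.length === 0 ? 0 : glutenFree / results.length;
  return { glutenFreeRate, sourceCounts };
}
```

For a batch with one gluten-free recipe and one recipe flagged for flour and pasta, the rate is 0.5 and each of the two sources has a count of 1.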




