Testing your agents with evals ✅
While traditional software tests have clear pass/fail conditions, AI outputs are non-deterministic — they can vary with the same input. Evals help bridge this gap by providing quantifiable metrics for measuring agent quality.
Evals are automated tests that score agent outputs using model-graded, rule-based, and statistical methods. Each eval returns a normalized score between 0 and 1 that can be logged and compared, and evals can be customized with your own prompts and scoring functions.
Evals can run in the cloud, capturing results in real time, or as part of your CI/CD pipeline, letting you test and monitor your agents over time.
Types of Evals ✅
There are different kinds of evals, each serving a specific purpose. Here are some common types:
- Textual Evals: Evaluate accuracy, reliability, and context understanding of agent responses
- Classification Evals: Measure accuracy in categorizing data based on predefined categories
- Tool Usage Evals: Assess how effectively an agent uses external tools or APIs
- RAG Evals: Evaluate retrieval accuracy and relevance in RAG systems
- Conversation Evals: Measure the quality of multi-turn conversations
Getting Started with Evals ✅
To start using evals in your Kastrax project, you’ll need to add the evals module to your dependencies:
// build.gradle.kts
dependencies {
implementation("ai.kastrax:kastrax-core:0.1.0")
implementation("ai.kastrax:kastrax-evals:0.1.0")
// Other dependencies...
}
Here’s a simple example of how to create and run an eval:
import ai.kastrax.core.agent.agent
import ai.kastrax.evals.textual.FactualAccuracyEval
import ai.kastrax.integrations.deepseek.deepSeek
import ai.kastrax.integrations.deepseek.DeepSeekModel
import kotlinx.coroutines.runBlocking
fun main() = runBlocking {
// Create an agent to evaluate
val myAgent = agent {
name("TestAgent")
description("A test agent for evaluation")
model = deepSeek {
apiKey("your-deepseek-api-key")
model(DeepSeekModel.DEEPSEEK_CHAT)
temperature(0.7)
}
}
// Create an eval
val factualAccuracyEval = FactualAccuracyEval()
// Define test cases
val testCases = listOf(
"What is the capital of France?",
"Who wrote 'Pride and Prejudice'?",
"What is the chemical formula for water?"
)
// Run the eval
val results = factualAccuracyEval.evaluate(myAgent, testCases)
// Print results
println("Factual Accuracy Score: ${results.score}")
println("Individual Scores:")
results.individualScores.forEachIndexed { index, score ->
println(" ${testCases[index]}: $score")
}
}
Built-in Evals ✅
Kastrax provides several built-in evals:
Textual Evals ✅
// Factual Accuracy Eval
val factualAccuracyEval = FactualAccuracyEval()
// Relevance Eval
val relevanceEval = RelevanceEval()
// Coherence Eval
val coherenceEval = CoherenceEval()
// Toxicity Eval
val toxicityEval = ToxicityEval()
// Bias Eval
val biasEval = BiasEval()
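Each of these can be run with the same evaluate(agent, testCases) entry point used in the getting-started example. The following is a minimal sketch, assuming the built-in textual evals all implement the Eval interface (with the name and score fields shown later on this page):
import ai.kastrax.core.agent.agent
import ai.kastrax.evals.textual.CoherenceEval
import ai.kastrax.evals.textual.FactualAccuracyEval
import ai.kastrax.evals.textual.RelevanceEval
import ai.kastrax.integrations.deepseek.deepSeek
import ai.kastrax.integrations.deepseek.DeepSeekModel
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    // Agent under test, configured as in the getting-started example
    val myAgent = agent {
        name("TextualEvalAgent")
        description("An agent evaluated with textual evals")
        model = deepSeek {
            apiKey(System.getenv("DEEPSEEK_API_KEY"))
            model(DeepSeekModel.DEEPSEEK_CHAT)
            temperature(0.7)
        }
    }

    val testCases = listOf(
        "Summarize the causes of World War I.",
        "Explain how photosynthesis works."
    )

    // Run each textual eval and print its normalized 0-1 score
    for (eval in listOf(FactualAccuracyEval(), RelevanceEval(), CoherenceEval())) {
        val result = eval.evaluate(myAgent, testCases)
        println("${eval.name}: ${result.score}")
    }
}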
Classification Evals ✅
// Classification Accuracy Eval
val classificationAccuracyEval = ClassificationAccuracyEval(
categories = listOf("Positive", "Negative", "Neutral")
)
// Multi-label Classification Eval
val multiLabelEval = MultiLabelClassificationEval(
labels = listOf("Technology", "Science", "Politics", "Sports")
)
Tool Usage Evals ✅
// Tool Selection Eval
val toolSelectionEval = ToolSelectionEval(
availableTools = listOf("calculator", "weather", "search")
)
// Tool Parameter Eval
val toolParameterEval = ToolParameterEval()
// Tool Execution Eval
val toolExecutionEval = ToolExecutionEval()
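The tool usage evals can be wired up the same way. A sketch, again assuming the shared evaluate(agent, testCases) signature and with an assumed import path:
import ai.kastrax.core.agent.Agent
import ai.kastrax.evals.tools.ToolSelectionEval // package path assumed

suspend fun runToolEvals(agent: Agent) {
    val toolSelectionEval = ToolSelectionEval(
        availableTools = listOf("calculator", "weather", "search")
    )

    // Prompts where the correct behavior is to pick a specific tool
    val testCases = listOf(
        "What is 37 * 52?",                 // calculator
        "Will it rain in Berlin tomorrow?", // weather
        "Who won the 2022 World Cup final?" // search
    )

    val result = toolSelectionEval.evaluate(agent, testCases)
    println("Tool selection score: ${result.score}")
}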
RAG Evals ✅
// Retrieval Precision Eval
val retrievalPrecisionEval = RetrievalPrecisionEval()
// Retrieval Recall Eval
val retrievalRecallEval = RetrievalRecallEval()
// Answer Relevance Eval
val answerRelevanceEval = AnswerRelevanceEval()
// Citation Accuracy Eval
val citationAccuracyEval = CitationAccuracyEval()
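RAG evals can be grouped into an EvalSuite, just like the CI/CD example later on this page. A sketch, with assumed import paths for the individual RAG evals:
import ai.kastrax.core.agent.Agent
import ai.kastrax.evals.EvalSuite
import ai.kastrax.evals.rag.AnswerRelevanceEval // package paths assumed
import ai.kastrax.evals.rag.RetrievalPrecisionEval
import ai.kastrax.evals.rag.RetrievalRecallEval

suspend fun runRagEvals(ragAgent: Agent) {
    val ragSuite = EvalSuite(
        name = "RagEvalSuite",
        evals = listOf(
            RetrievalPrecisionEval(),
            RetrievalRecallEval(),
            AnswerRelevanceEval()
        )
    )

    // Questions that exercise the agent's retrieval pipeline
    val testCases = listOf(
        "What does our refund policy say about digital purchases?",
        "Which regions does the enterprise plan cover?"
    )

    val results = ragSuite.evaluate(ragAgent, testCases)
    println("RAG suite average score: ${results.averageScore}")
}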
Creating Custom Evals ✅
You can create custom evals by implementing the Eval interface:
import ai.kastrax.core.agent.Agent
import ai.kastrax.evals.Eval
import ai.kastrax.evals.EvalResult
class CustomEval : Eval<String> {
override val name: String = "CustomEval"
override val description: String = "A custom evaluation metric"
override suspend fun evaluate(agent: Agent, testCases: List<String>): EvalResult {
val individualScores = mutableListOf<Double>()
for (testCase in testCases) {
// Generate a response from the agent
val response = agent.generate(testCase)
// Implement your custom scoring logic
val score = calculateScore(testCase, response.text)
individualScores.add(score)
}
// Calculate the overall score (average of individual scores)
val overallScore = individualScores.average()
return EvalResult(
score = overallScore,
individualScores = individualScores,
metadata = mapOf("evalName" to name)
)
}
private fun calculateScore(testCase: String, response: String): Double {
// Implement your custom scoring logic
// Return a score between 0 and 1
// Example: Simple length-based scoring (for demonstration only)
return minOf(response.length / 100.0, 1.0)
}
}
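Once defined, a custom eval runs the same way as the built-in ones, for example:
import ai.kastrax.core.agent.Agent

suspend fun runCustomEval(agent: Agent) {
    val customEval = CustomEval()

    val results = customEval.evaluate(
        agent,
        listOf(
            "Explain recursion in one paragraph.",
            "What is the difference between TCP and UDP?"
        )
    )

    // Overall score plus the per-case scores returned in EvalResult
    println("${customEval.name}: ${results.score}")
    results.individualScores.forEach { println("  $it") }
}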
Model-Graded Evals ✅
Model-graded evals use an LLM to evaluate agent responses:
import ai.kastrax.core.agent.Agent
import ai.kastrax.evals.ModelGradedEval
import ai.kastrax.integrations.deepseek.deepSeek
import ai.kastrax.integrations.deepseek.DeepSeekModel
class HelpfulnessEval : ModelGradedEval<String> {
override val name: String = "HelpfulnessEval"
override val description: String = "Evaluates how helpful the agent's responses are"
override val evaluationModel = deepSeek {
apiKey("your-deepseek-api-key")
model(DeepSeekModel.DEEPSEEK_CHAT)
temperature(0.1) // Low temperature for consistent evaluation
}
override val evaluationPrompt = """
You are evaluating the helpfulness of an AI assistant's response to a user query.
User Query: {{query}}
Assistant Response: {{response}}
Rate the helpfulness of the response on a scale from 0 to 10, where:
- 0: Not helpful at all, completely irrelevant or incorrect
- 5: Somewhat helpful, but missing important information or not fully addressing the query
- 10: Extremely helpful, fully addresses the query with accurate and comprehensive information
Provide your rating as a number between 0 and 10, followed by a brief explanation.
""".trimIndent()
override fun parseScore(evaluationResponse: String): Double {
// Extract the numerical score from the evaluation response
val scoreRegex = """(\d+)""".toRegex()
val matchResult = scoreRegex.find(evaluationResponse)
return if (matchResult != null) {
val score = matchResult.groupValues[1].toInt()
score / 10.0 // Normalize to 0-1 range
} else {
0.5 // Default score if parsing fails
}
}
}
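From the caller's side, running a model-graded eval looks the same. This sketch assumes ModelGradedEval supplies the evaluate(agent, testCases) implementation, since HelpfulnessEval only overrides the grading model, prompt, and score parsing:
import ai.kastrax.core.agent.Agent

suspend fun runHelpfulnessEval(agent: Agent) {
    val helpfulnessEval = HelpfulnessEval()

    val results = helpfulnessEval.evaluate(
        agent,
        listOf(
            "How do I reset my password?",
            "What's the best way to learn Kotlin coroutines?"
        )
    )

    // Scores are parsed from the grader's 0-10 rating and normalized to 0-1
    println("Helpfulness: ${results.score}")
}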
Running Evals in CI/CD ✅
You can integrate evals into your CI/CD pipeline:
import ai.kastrax.core.agent.agent
import ai.kastrax.evals.EvalSuite
import ai.kastrax.evals.textual.FactualAccuracyEval
import ai.kastrax.evals.textual.RelevanceEval
import ai.kastrax.evals.textual.CoherenceEval
import ai.kastrax.integrations.deepseek.deepSeek
import ai.kastrax.integrations.deepseek.DeepSeekModel
import kotlinx.coroutines.runBlocking
import java.io.File
import kotlin.system.exitProcess
fun main() = runBlocking {
// Load the agent
val myAgent = agent {
name("ProductionAgent")
description("A production agent for evaluation")
model = deepSeek {
apiKey(System.getenv("DEEPSEEK_API_KEY"))
model(DeepSeekModel.DEEPSEEK_CHAT)
temperature(0.7)
}
}
// Create an eval suite
val evalSuite = EvalSuite(
name = "ProductionEvalSuite",
evals = listOf(
FactualAccuracyEval(),
RelevanceEval(),
CoherenceEval()
)
)
// Load test cases from a file
val testCases = File("test_cases.txt").readLines()
// Run the eval suite
val results = evalSuite.evaluate(myAgent, testCases)
// Check if the results meet the threshold
val threshold = 0.8
val passed = results.averageScore >= threshold
if (passed) {
println("Eval suite passed with score: ${results.averageScore}")
exitProcess(0) // Success
} else {
println("Eval suite failed with score: ${results.averageScore} (threshold: $threshold)")
exitProcess(1) // Failure
}
}
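To hook the script above into a pipeline, you can expose it as a Gradle task and invoke it from your CI job. A sketch using Gradle's standard JavaExec task; the main-class name EvalRunnerKt is an assumption based on where you place the fun main above:
// build.gradle.kts
tasks.register<JavaExec>("runEvals") {
    group = "verification"
    description = "Runs the production eval suite against the agent"

    classpath = sourceSets["main"].runtimeClasspath
    mainClass.set("EvalRunnerKt") // adjust to the file containing the eval main()

    // Forward the API key from the CI environment
    environment("DEEPSEEK_API_KEY", System.getenv("DEEPSEEK_API_KEY") ?: "")
}
Your CI job can then run ./gradlew runEvals and fail the build whenever the eval suite exits with a non-zero code.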
Best Practices ✅
- Use Multiple Evals: Different evals measure different aspects of agent quality
- Create Diverse Test Cases: Include edge cases and challenging scenarios
- Set Realistic Thresholds: Start with lower thresholds and gradually increase them
- Monitor Trends: Track eval scores over time to identify regressions
- Combine with Human Evaluation: Use evals alongside human evaluation for a complete picture
Next Steps ✅
Now that you understand evals, you can: