Kyro

KyroJudge Overview

The core evaluation engine. Define LLM-powered judges in YAML, run them as a DAG, and get structured pass/fail results.

TypeScript implementation of Kyro – The programmable evaluation layer for LLM applications

Overview

Kyro is a programmable evaluation framework that lets you define complex multi-agent judging pipelines using YAML configuration files. Perfect for testing AI applications, evaluating LLM outputs, and ensuring quality in production.

Key Features

  • Declarative YAML Configuration – Define evaluation pipelines without writing code
  • DAG-Based Orchestration – Automatic dependency resolution and parallel execution
  • Multi-Provider Support – Works with Gemini, OpenAI, Azure OpenAI, Ollama
  • Structured Prompts – Build XML-formatted prompts with automatic multiline formatting
  • Runtime Variables – Override variable defaults at evaluation time
  • Type-Safe – Full TypeScript support with strict validation
  • Test Framework Integration – Works seamlessly with Jest and Vitest

Installation

npm install @kyro/judge
yarn add @kyro/judge
pnpm add @kyro/judge

Or install the unified entry point which re-exports everything:

npm install @kyro/core

Quick Start

1. Create a Configuration

Create a file kyro.config.yml:

version: 1
 
judges:
  SAFETY_CHECK:
    prompt: "Evaluate if this conversation is safe and appropriate"
 
  QUALITY_CHECK:
    prompt: "Rate the quality of this response on a scale of 1-10"
 
pipeline:
  - id: safety
    judge: SAFETY_CHECK
 
  - id: quality
    judge: QUALITY_CHECK
    depends_on: [safety]

2. Initialize Judge

import { Judge, ProviderFactory } from '@kyro/judge';
 
const provider = ProviderFactory.create({
  provider: 'gemini',
  model: 'gemini-2.5-flash',
  apiKey: process.env.GEMINI_API_KEY
});
 
const judge = new Judge('./kyro.config.yml', provider);

3. Run Evaluation

const conversation = `
User: Hello, I need help with my account
Assistant: Hi! I'd be happy to help you with your account.
`;
 
const result = await judge.run(conversation);
 
if (result.status === 'SUCCESS') {
  console.log('✓ All evaluations passed');
} else {
  console.error('✗ Evaluation failed:', result.message);
}

Configuration

Configuration File Structure

A Kyro configuration has three required top-level fields:

version: 1        # Config version
judges: {}        # Judge definitions
pipeline: []      # Execution pipeline

Judges

Judges are AI evaluators that analyze your input. Each judge has:

  • prompt – The evaluation instruction (inline, file, or structured)
  • variables – Optional variables for dynamic prompts

Inline Prompts

Simple string prompts:

judges:
  TONE_ANALYZER:
    prompt: "Analyze the tone of this conversation and verify it's professional"

File-Based Prompts

Reference external prompt files:

judges:
  DETAILED_EVALUATION:
    prompt: "./prompts/detailed-check.txt"

The file path is relative to the configuration file directory.

Structured Prompts

Build XML-formatted prompts with multiple sections:

judges:
  CALL_QUALITY:
    prompt:
      role: |
        You are an expert call center quality analyst.
        You have 10+ years of experience evaluating customer service calls.
 
      context: |
        Call date: ${call_date}
        Department: ${department}
        Expected quality threshold: ${threshold}%
 
      task: |
        Evaluate this call and verify:
        1. Proper greeting
        2. Issue resolution
        3. Professional closing
 
    variables:
      call_date:
        type: string
        default: "2024-01-15"
      department:
        type: string
        default: "Technical Support"
      threshold:
        type: number
        default: 85

Generated Output:

<role>
  You are an expert call center quality analyst.
  You have 10+ years of experience evaluating customer service calls.
</role>
 
<context>
  Call date: 2024-01-15
  Department: Technical Support
  Expected quality threshold: 85%
</context>
 
<task>
  Evaluate this call and verify:
  1. Proper greeting
  2. Issue resolution
  3. Professional closing
</task>

Key Features:

  • Multiline content is automatically indented (2 spaces per line)
  • Single-line content stays on one line
  • Variables are interpolated before formatting
  • Each section becomes an XML tag

Variables

Each variable in a judge is a definition object with a type, an optional default, and an optional required flag:

judges:
  CUSTOM_EVALUATION:
    prompt: "Evaluate for ${user_type} in ${mode} mode with threshold ${score}"
    variables:
      user_type:
        type: string
        required: true          # must be supplied at runtime
      mode:
        type: string
        default: "strict"       # used when no runtime value is provided
      score:
        type: number
        default: 0.9

Variable definition fields:

Field      Required  Description
type       Yes       string, number, or boolean
default    No        Value used when no runtime value is provided
required   No        If true and neither a runtime value nor a default is provided, throws at evaluation time

Interpolation:

  • Variables use ${variableName} syntax
  • Undefined variables are replaced with empty strings
  • Works in all prompt types (inline, file, structured)

Runtime Variables

Pass values at evaluation time via run() to override defaults:

// Default values from the YAML are used when no runtime value is provided
await judge.run('./conversation.json');
 
// Runtime values take precedence over defaults
await judge.run('./conversation.json', {
  user_type: 'premium customer',
  mode: 'lenient',
  score: 0.75
});

Variables not defined in the judge's variables block but present in the runtime object are still interpolated if the prompt references them:

judges:
  CHECK_A:
    prompt: "Evaluate conversation from ${date} for ${company_name}"
    # date and company_name come entirely from runtime

Variable Precedence:

  1. Runtime value (highest) – passed to run()
  2. default from the variable definition
  3. Empty string (if the variable is not required and has no value)

Pipeline Steps

Pipeline steps define execution order and dependencies.

Basic Step

pipeline:
  - id: step1
    judge: JUDGE_NAME

Step with Dependencies

pipeline:
  - id: safety
    judge: SAFETY_CHECK
 
  - id: quality
    judge: QUALITY_CHECK
    depends_on: [safety]  # Runs only after safety succeeds
 
  - id: final
    judge: FINAL_CHECK
    depends_on: [safety, quality]  # Waits for both

Step with Failure Handling

pipeline:
  - id: primary
    judge: PRIMARY_CHECK
    on_failure: [fallback]  # Run fallback if primary fails
 
  - id: fallback
    judge: FALLBACK_CHECK

Subagent Steps

Run multiple judges in parallel:

pipeline:
  - id: parallel_checks
    subagents:
      - id: agent_a
        judge: JUDGE_A
 
      - id: agent_b
        judge: JUDGE_B
 
      - id: agent_c
        judge: JUDGE_C

Execution Flow:

  • Steps without dependencies run immediately
  • Steps with dependencies wait for all dependencies to complete
  • Subagents within a step run in parallel
  • On failure, on_failure steps are triggered

Schema Validation

All configuration files are validated against JSON Schema:

Rules:

  • version must be a string or number
  • Judge names must be UPPERCASE_WITH_UNDERSCORES (regex: ^[A-Z_]+$)
  • Each step must have either judge OR subagents (mutually exclusive)
  • Step IDs must be unique
  • Referenced judges must exist

Validation errors provide detailed messages:

KyroValidationError: Invalid configuration
  - judges.invalid-name: Property name must match pattern "^[A-Z_]+$"
  - pipeline[0]: Must have either 'judge' or 'subagents'

API Reference

Judge

Main class for running evaluations.

Constructor

new Judge(configPath: string, provider: AIProvider)

Parameters:

  • configPath – Path to YAML configuration file
  • provider – An AI provider instance (use ProviderFactory.create() or instantiate directly)

Example:

import { Judge, ProviderFactory } from '@kyro/judge';
 
const provider = ProviderFactory.create({
  provider: 'openai',
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY,
  chatConfig: { temperature: 0.7, max_tokens: 2000 }
});
 
const judge = new Judge('./config.yml', provider);

run()

Execute the evaluation pipeline.

async run(input: string, variables?: Record<string, VariableValue>): Promise<KyroResult>

Parameters:

  • input – Input to evaluate (string, .json file, or .txt file)
  • variables? – Runtime values that override variable defaults defined in the YAML

Returns:

interface KyroResult {
  status: 'SUCCESS' | 'ERROR';
  message?: string;
  data?: ExecutionResult;
  input?: ProcessedInput;
  usage?: TokenUsage;
}

Example:

// String input
let result = await judge.run('User: Hi\nAssistant: Hello!');
 
// JSON file with runtime variables overriding YAML defaults
result = await judge.run('./conversations/conv-001.json', {
  department: 'Sales',
  threshold: 0.95
});
 
// Text file
result = await judge.run('./transcripts/call-2024-01-15.txt');
 
// Access results
console.log(result.status);  // 'SUCCESS' or 'ERROR'
console.log(result.message); // Summary message
console.log(result.usage);   // { inputTokens: 123, outputTokens: 456 }
 
// Access step details
const stepResults = result.data;
console.log(stepResults.greeting.success);      // 'SUCCESS' | 'ERROR' | 'SKIPPED' | 'NA'
console.log(stepResults.greeting.rootCause);    // Detailed explanation
console.log(stepResults.greeting.thinkingPath); // LLM reasoning process

Types

ModelConfig

interface ModelConfig {
  provider: 'gemini' | 'openai' | 'azure' | 'ollama';
  model: string;
  apiKey: string;
  clientConfig?: Record<string, unknown>;  // Provider-specific settings
  chatConfig?: Record<string, unknown>;   // Request parameters
}

VariableValue

type VariableValue = string | number | boolean;

VariableDefinition

interface VariableDefinition {
  type: 'string' | 'number' | 'boolean';
  required?: boolean;
  default?: VariableValue;
}

KyroResult

interface KyroResult {
  status: 'SUCCESS' | 'ERROR';
  message?: string;
  data?: ExecutionResult;
  input?: ProcessedInput;
  usage?: TokenUsage;
}

ExecutionResult

type ExecutionResult = Record<string, StepResult | SubagentStepResult>;
 
interface StepResult {
  success: 'SUCCESS' | 'ERROR' | 'SKIPPED' | 'NA';
  rootCause: string;
  thinkingPath: string;
}
 
interface SubagentStepResult extends StepResult {
  sub_steps: Record<string, StepResult>;
}

Provider Configuration

All providers are instantiated via ProviderFactory.create(config) or directly via their class constructors.

Gemini

ProviderFactory.create({
  provider: 'gemini',
  model: 'gemini-2.5-flash',
  apiKey: process.env.GEMINI_API_KEY,
  chatConfig: { temperature: 0.7, topP: 0.9, topK: 40 }
});

OpenAI

ProviderFactory.create({
  provider: 'openai',
  model: 'gpt-4o',
  apiKey: process.env.OPENAI_API_KEY,
  chatConfig: { temperature: 0.7, max_tokens: 2000 }
});

Azure OpenAI

ProviderFactory.create({
  provider: 'azure',
  model: 'gpt-4o-mini',
  apiKey: process.env.AZURE_API_KEY,
  clientConfig: { endpoint: 'https://your-resource.openai.azure.com' },
  chatConfig: { temperature: 0.7, max_tokens: 1500 }
});

Ollama

For local models:

ProviderFactory.create({
  provider: 'ollama',
  model: 'llama3.1',
  apiKey: '',
  clientConfig: { host: 'http://localhost:11434' }
});

Testing Integration

Jest / Vitest

import { Judge, ProviderFactory } from '@kyro/judge';
 
describe('LLM Evaluations', () => {
  const provider = ProviderFactory.create({
    provider: 'openai',
    model: 'gpt-4o',
    apiKey: process.env.OPENAI_API_KEY
  });
  const judge = new Judge('./config.yml', provider);
 
  test('greeting quality', async () => {
    const result = await judge.run('./fixtures/greeting.txt', { mode: 'strict' });
    expect(result.status).toBe('SUCCESS');
    expect(result.data.greeting.success).toBe('SUCCESS');
  }, 30000); // 30s timeout for LLM calls
});

Examples

See the main README for complete examples:

  • Customer Service Quality Assurance
  • Multi-Agent Evaluation
  • @kyro/batch – For large-scale offline evaluation using the OpenAI Batch API. Same judge definitions, async workflow, 50% cheaper.

Contributing

Contributions are welcome! Please see the main CONTRIBUTING.md for guidelines.

License

MIT License - see LICENSE for details.


Need help? Open an issue