ATOM Documentation

← Back to App

Graduation Exam System - Agent Maturity Progression

The Graduation Exam System validates agent readiness for maturity level progression through a comprehensive 5-stage evaluation process.


Overview

The Graduation Exam System implements a rigorous evaluation framework that:

  • Tracks Episodes: Records every agent execution cycle
  • Calculates Readiness: Computes graduation readiness using weighted metrics
  • Executes Exams: Runs 5-stage validation exams
  • Promotes Agents: Advances maturity levels when ready
  • Prevents Regression: Monitors post-promotion performance

Location: backend-saas/core/graduation_exam.py, src/lib/ai/graduation-exam.ts


Architecture


Readiness Calculation Algorithm

ALGORITHM: Calculate Graduation Readiness

INPUT: agent_id, tenant_id, episode_count=30
OUTPUT: readiness_score

1. DATA COLLECTION
   ============================================================================
   Gather recent episode data for analysis.

   episodes = query(
     SELECT * FROM episodes
     WHERE agent_id = agent_id
       AND tenant_id = tenant_id
     ORDER BY timestamp DESC
     LIMIT episode_count
   )

   IF len(episodes) < episode_count:
     RETURN {
       status: "insufficient_data",
       readiness: 0.0,
       message: f"Only {len(episodes)} episodes, need {episode_count}"
     }

   # Extract metrics from episodes
   interventions = [e.human_intervention_required FOR e IN episodes]
   constitutional_scores = [e.constitutional_compliance_score FOR e IN episodes]
   confidence_scores = [e.confidence FOR e IN episodes]
   successes = [e.success FOR e IN episodes]


2. CALCULATE ZERO-INTERVENTION RATIO (40% weight)
   ============================================================================
   Measure how often agent operates without human intervention.

   zero_intervention_count = COUNT(interventions WHERE intervention == False)
   zero_intervention_ratio = zero_intervention_count / len(interventions)

   # Formula: ratio of episodes with zero human intervention
   # Higher is better - agent operates independently


3. CALCULATE CONSTITUTIONAL COMPLIANCE (30% weight)
   ============================================================================
   Measure adherence to safety guardrails and policies.

   avg_constitutional_score = SUM(constitutional_scores) / len(constitutional_scores)

   # Constitutional score typically 0-1 (1.0 = perfect compliance)
   # Episodes with violations have lower scores


4. CALCULATE CONFIDENCE SCORE (20% weight)
   ============================================================================
   Measure agent's confidence in its decisions.

   avg_confidence_score = SUM(confidence_scores) / len(confidence_scores)

   # Confidence 0-1, but needs calibration
   # Well-calibrated confidence is ideal


5. CALCULATE SUCCESS RATE (10% weight)
   ============================================================================
   Measure overall task completion success.

   success_count = COUNT(successes WHERE success == True)
   success_rate = success_count / len(successes)

   # Simple success/failure ratio


6. COMPUTE READINESS SCORE
   ============================================================================
   Combine metrics using weighted formula.

   readiness = (
     (zero_intervention_ratio * 0.40) +
     (avg_constitutional_score * 0.30) +
     (avg_confidence_score * 0.20) +
     (success_rate * 0.10)
   )

   # Readiness range: 0.0 to 1.0
   # Higher = more ready for graduation


7. CHECK ELIGIBILITY FOR TARGET LEVELS
   ============================================================================
   Determine which maturity levels agent is eligible for.

   # Graduation thresholds by target level
   thresholds = {
     'intern': {
       overall: 0.70,      # 70% overall readiness
       compliance: 0.75,   # 75% constitutional compliance
       autonomy: 0.40      # 40% zero-intervention
     },
     'supervised': {
       overall: 0.80,
       compliance: 0.85,
       autonomy: 0.60
     },
     'autonomous': {
       overall: 0.95,
       compliance: 0.95,
       autonomy: 0.85
     }
   }

   eligible_levels = []

   current_level = get_current_maturity_level(agent_id)

   FOR each target_level, requirements IN thresholds:
     # Only check levels higher than current
     IF is_higher_level(target_level, current_level):
       # Check all threshold requirements
       IF (
         readiness >= requirements.overall AND
         avg_constitutional_score >= requirements.compliance AND
         zero_intervention_ratio >= requirements.autonomy
       ):
         eligible_levels.append({
           level: target_level,
           confidence: (readiness - requirements.overall) * 100  # Margin of success
         })


8. RETURN READINESS REPORT
   ============================================================================
   RETURN {
     status: "success",
     agent_id: agent_id,
     current_level: current_level,

     # Metrics
     metrics: {
       zero_intervention_ratio: zero_intervention_ratio,
       avg_constitutional_score: avg_constitutional_score,
       avg_confidence_score: avg_confidence_score,
       success_rate: success_rate
     },

     # Overall readiness
     readiness_score: readiness,

     # Eligibility
     eligible_levels: eligible_levels,
     can_graduate: len(eligible_levels) > 0,

     # Recommendations
     recommendation: (
       "Ready for graduation" IF len(eligible_levels) > 0
       ELSE "Continue training to improve readiness"
     ),

     # Detailed analysis
     analysis: {
       strongest_metric: max(
         ('zero_intervention', zero_intervention_ratio),
         ('constitutional', avg_constitutional_score),
         ('confidence', avg_confidence_score),
         ('success_rate', success_rate)
       ),
       weakest_metric: min(
         ('zero_intervention', zero_intervention_ratio),
         ('constitutional', avg_constitutional_score),
         ('confidence', avg_confidence_score),
         ('success_rate', success_rate)
       ),
       improvement_areas: identify_weaknesses(metrics)
     }
   }


MAIN RETURN readiness_score

5-Stage Graduation Exam Algorithm

ALGORITHM: Execute Graduation Exam

INPUT: agent_id, tenant_id, target_level, episode_count=30
OUTPUT: exam_result

# ============================================================================
# STAGE 1: EPISODE DATA COLLECTION
# ============================================================================

STAGE 1_DATA_COLLECTION:
  # Query recent episodes with full context
  episodes = query(
    SELECT
      e.*,
      ec.canvas_id,
      ec.canvas_name,
      ec.canvas_action_ids
    FROM episodes e
    LEFT JOIN episode_context ec ON e.id = ec.episode_id
    WHERE e.agent_id = agent_id
      AND e.tenant_id = tenant_id
    ORDER BY e.timestamp DESC
    LIMIT episode_count
  )

  # Validate sufficient data
  IF len(episodes) < episode_count:
    RETURN {
      stage: "data_collection",
      status: "failed",
      reason: f"Insufficient episodes: {len(episodes)}/{episode_count}"
    }

  # Extract episode metadata
  episode_metadata = {
    total_episodes: len(episodes),
    date_range: {
      earliest: min(e.timestamp FOR e IN episodes),
      latest: max(e.timestamp FOR e IN episodes)
    },
    task_types: UNIQUE(e.task_type FOR e IN episodes),
    canvas_contexts: COUNT(e.canvas_id FOR e IN episodes WHERE e.canvas_id IS NOT NULL)
  }

  PROCEED TO STAGE 2


# ============================================================================
# STAGE 2: CONSTITUTIONAL COMPLIANCE CHECK
# ============================================================================

STAGE 2_CONSTITUTIONAL_COMPLIANCE:
  # Check each episode for constitutional violations
  constitutional_violations = []

  FOR each episode IN episodes:
    # Check for violations
    IF episode.constitutional_violations:
      FOR each violation IN episode.constitutional_violations:
        constitutional_violations.append({
          episode_id: episode.id,
          violation_type: violation.type,
          severity: violation.severity,
          description: violation.description
        })

  # Calculate compliance metrics
  total_violations = len(constitutional_violations)
  violations_by_severity = GROUP constitutional_violations BY severity

  # Calculate average compliance score
  avg_compliance = AVG(e.constitutional_compliance_score FOR e IN episodes)

  # Check against threshold
  compliance_threshold = get_compliance_threshold(target_level)

  IF avg_compliance < compliance_threshold:
    RETURN {
      stage: "constitutional_compliance",
      status: "failed",
      reason: f"Constitutional compliance ({avg_compliance}) below threshold ({compliance_threshold})",
      details: {
        avg_compliance: avg_compliance,
        threshold: compliance_threshold,
        total_violations: total_violations,
        violations_by_severity: violations_by_severity,
        critical_violations: COUNT(v FOR v IN constitutional_violations IF v.severity == 'critical')
      },
      recommendation: "Review constitutional violations and improve guardrail adherence"
    }

  # Stage passed
  PROCEED TO STAGE 3


# ============================================================================
# STAGE 3: CONFIDENCE ASSESSMENT
# ============================================================================

STAGE 3_CONFIDENCE_ASSESSMENT:
  # Extract confidence scores
  confidence_scores = [e.confidence FOR e IN episodes]

  # Calculate statistics
  avg_confidence = AVG(confidence_scores)
  std_confidence = STDDEV(confidence_scores)
  min_confidence = MIN(confidence_scores)
  max_confidence = MAX(confidence_scores)

  # Assess confidence calibration
  # Group by confidence level and check actual success rate
  confidence_bins = {
    'high': [e FOR e IN episodes IF e.confidence > 0.7],
    'medium': [e FOR e IN episodes IF 0.3 <= e.confidence <= 0.7],
    'low': [e FOR e IN episodes IF e.confidence < 0.3]
  }

  calibration_errors = []
  FOR each bin_name, bin_episodes IN confidence_bins:
    IF len(bin_episodes) > 0:
      actual_success_rate = COUNT(e FOR e IN bin_episodes IF e.success) / len(bin_episodes)
      expected_confidence = {
        'high': 0.8,
        'medium': 0.5,
        'low': 0.2
      }[bin_name]

      calibration_error = abs(actual_success_rate - expected_confidence)
      calibration_errors.append({
        bin: bin_name,
        expected: expected_confidence,
        actual: actual_success_rate,
        error: calibration_error
      })

  # Check if confidence is well-calibrated
  avg_calibration_error = AVG(e.error FOR e IN calibration_errors)

  IF avg_calibration_error > 0.2:  # Poor calibration
    RETURN {
      stage: "confidence_assessment",
      status: "failed",
      reason: f"Confidence poorly calibrated (error: {avg_calibration_error})",
      details: {
        avg_confidence: avg_confidence,
        std_confidence: std_confidence,
        calibration_errors: calibration_errors,
        avg_calibration_error: avg_calibration_error
      },
      recommendation: "Improve confidence calibration before graduation"
    }

  # Stage passed
  PROCEED TO STAGE 4


# ============================================================================
# STAGE 4: SUCCESS RATE CALCULATION
# ============================================================================

STAGE 4_SUCCESS_RATE:
  # Calculate overall success rate
  successful_episodes = COUNT(e FOR e IN episodes IF e.success == True)
  success_rate = successful_episodes / len(episodes)

  # Analyze failure patterns
  failed_episodes = [e FOR e IN episodes IF e.success == False]

  failure_by_type = GROUP failed_episodes BY task_type
  failure_by_reason = GROUP failed_episodes BY error_reason

  # Check if success rate meets threshold
  success_threshold = get_success_threshold(target_level)

  IF success_rate < success_threshold:
    RETURN {
      stage: "success_rate",
      status: "failed",
      reason: f"Success rate ({success_rate}) below threshold ({success_threshold})",
      details: {
        success_rate: success_rate,
        threshold: success_threshold,
        successful_episodes: successful_episodes,
        failed_episodes: len(failed_episodes),
        failure_by_type: failure_by_type,
        failure_by_reason: failure_by_reason
      },
      recommendation: (
        "Focus on improving failure-prone task types: " +
        ", ".join(failure_by_type.keys())
      )
    }

  # Stage passed
  PROCEED TO STAGE 5


# ============================================================================
# STAGE 5: READINESS DETERMINATION
# ============================================================================

STAGE 5_READINESS_DETERMINATION:
  # Recalculate readiness with current data
  zero_intervention_ratio = COUNT(e FOR e IN episodes IF NOT e.human_intervention_required) / len(episodes)
  avg_constitutional = AVG(e.constitutional_compliance_score FOR e IN episodes)
  avg_confidence = AVG(e.confidence FOR e IN episodes)
  success_rate = successful_episodes / len(episodes)

  # Weighted readiness formula
  readiness = (
    (zero_intervention_ratio * 0.40) +
    (avg_constitutional * 0.30) +
    (avg_confidence * 0.20) +
    (success_rate * 0.10)
  )

  # Get thresholds for target level
  thresholds = get_graduation_thresholds(target_level)

  # Check all threshold requirements
  checks = {
    overall_readiness: readiness >= thresholds.overall,
    constitutional_compliance: avg_constitutional >= thresholds.compliance,
    autonomy: zero_intervention_ratio >= thresholds.autonomy
  }

  all_passed = ALL(checks.values())

  IF NOT all_passed:
    # Determine which checks failed
    failed_checks = [name FOR name, passed IN checks.items() IF NOT passed]

    RETURN {
      stage: "readiness_determination",
      status: "failed",
      reason: "Readiness thresholds not met",
      details: {
        readiness: readiness,
        thresholds: thresholds,
        checks: checks,
        failed_checks: failed_checks
      },
      recommendation: generate_improvement_recommendations(failed_checks, episodes)
    }


# ============================================================================
# EXAM PASSED - PROMOTE AGENT
# ============================================================================

EXAM_PASSED:
  # Get current level
  current_level = get_current_maturity_level(agent_id)

  # Promote to target level
  UPDATE agents
  SET maturity_level = target_level,
      graduated_at = now(),
      graduation_readiness = readiness
  WHERE id = agent_id

  # Record promotion event
  promotion = {
    id: generate_uuid(),
    tenant_id: tenant_id,
    agent_id: agent_id,
    previous_level: current_level,
    new_level: target_level,
    readiness_score: readiness,
    promoted_at: now(),
    episode_count: len(episodes),
    exam_details: {
      zero_intervention_ratio: zero_intervention_ratio,
      avg_constitutional_score: avg_constitutional,
      avg_confidence_score: avg_confidence,
      success_rate: success_rate
    }
  }

  INSERT INTO graduation_history VALUES (promotion)

  # Update agent capabilities based on new level
  new_capabilities = get_capabilities_for_level(target_level)
  UPDATE agent_capabilities
  SET capabilities = new_capabilities
  WHERE agent_id = agent_id

  RETURN {
    status: "passed",
    agent_id: agent_id,
    previous_level: current_level,
    new_level: target_level,
    readiness_score: readiness,
    metrics: {
      zero_intervention_ratio: zero_intervention_ratio,
      avg_constitutional_score: avg_constitutional,
      avg_confidence_score: avg_confidence,
      success_rate: success_rate
    },
    promoted_at: promotion.promoted_at,
    promotion_id: promotion.id,
    message: f"Agent successfully graduated from {current_level} to {target_level}"
  }


MAIN RETURN exam_result

Graduation Thresholds

Thresholds by Target Level

# student → intern GRADUATION_THRESHOLDS['intern'] = { 'overall': 0.70, # 70% overall readiness 'compliance': 0.75, # 75% constitutional compliance 'autonomy': 0.40 # 40% zero-intervention } # intern → supervised GRADUATION_THRESHOLDS['supervised'] = { 'overall': 0.80, # 80% overall readiness 'compliance': 0.85, # 85% constitutional compliance 'autonomy': 0.60 # 60% zero-intervention } # supervised → autonomous GRADUATION_THRESHOLDS['autonomous'] = { 'overall': 0.95, # 95% overall readiness 'compliance': 0.95, # 95% constitutional compliance 'autonomy': 0.85 # 85% zero-intervention }

Rationale

  • Overall Readiness: Composite score ensuring balanced performance
  • Constitutional Compliance: Higher weight because safety is critical
  • Autonomy (Zero-Intervention): Increases with each level as trust grows
  • Confidence & Success Rate: Supporting metrics for overall quality

Episode Feedback Integration

Episodes support Reinforcement Learning from Human Feedback (RLHF):

Feedback Submission

// Submit feedback for an episode const feedback = await episodeFeedbackService.submitFeedback( episodeId, 0.8, // Strongly positive 'Excellent reconciliation! Very accurate.', 'accuracy' ); // Feedback impacts: // 1. Future recalls prioritize positive experiences // 2. Learning patterns weight feedback-adjusted scores // 3. Graduation readiness incorporates feedback

Feedback-Aware Recall

// Recall only highly-rated experiences const positiveExperiences = await worldModel.recallExperiences( query, agentRole, agentId, 5, { min_feedback_score: 0.7 // Only positive feedback } );

Data Structures

Episode

interface Episode { id: string; tenant_id: string; agent_id: string; // Task information task_type: string; task_description: string; input_summary: string; // Execution reasoning_chain: ReasoningChain; approach_taken: string; actions_taken: string[]; // Outcome outcome: 'success' | 'failure'; success: boolean; confidence: number; // Learning & Governance constitutional_violations: Violation[]; human_intervention_required: boolean; learnings: string[]; metacognitive_insights: MetacognitiveInsights; // Canvas context canvas_id?: string; canvas_action_ids?: string[]; // Metadata timestamp: Date; agent_role: string; maturity_level: MaturityLevel; // Feedback (RLHF) feedback_scores?: number[]; avg_feedback?: number; }

GraduationReadiness

interface GraduationReadiness { agent_id: string; current_level: MaturityLevel; // Metrics zero_intervention_ratio: number; avg_constitutional_score: number; avg_confidence_score: number; success_rate: number; // Overall readiness_score: number; // Eligibility eligible_levels: MaturityLevel[]; can_graduate: boolean; // Analysis strongest_metric: string; weakest_metric: string; improvement_areas: string[]; }

Example Usage

Calculate Readiness

import { graduationExamService } from '@/lib/ai/graduation-exam'; // Calculate graduation readiness const readiness = await graduationExamService.calculateReadiness( 'agent-abc', 30 // Last 30 episodes ); console.log('Readiness Score:', readiness.readiness_score); // 0.87 console.log('Can Graduate:', readiness.can_graduate); // true console.log('Eligible Levels:', readiness.eligible_levels); // ['supervised'] // Example output: // { // readiness_score: 0.87, // metrics: { // zero_intervention_ratio: 0.70, // avg_constitutional_score: 0.92, // avg_confidence_score: 0.85, // success_rate: 0.93 // }, // eligible_levels: [ // { level: 'supervised', confidence: 7.0 } // ], // can_graduate: true // }

Execute Graduation Exam

// Trigger graduation exam const examResult = await graduationExamService.executeExam( 'agent-abc', 'supervised', // Target level 30 // Episode count ); if (examResult.status === 'passed') { console.log('Promoted to:', examResult.new_level); console.log('Readiness:', examResult.readiness_score); } else { console.log('Exam failed:', examResult.reason); console.log('Recommendation:', examResult.recommendation); }

Performance Characteristics

Readiness Calculation

  • Time Complexity: O(n) where n = episode_count
  • Space Complexity: O(n) for loading episodes
  • Latency: < 500ms for 30 episodes

Graduation Exam

  • Stage 1 (Data): O(n) - < 200ms
  • Stage 2 (Compliance): O(n) - < 300ms
  • Stage 3 (Confidence): O(n) - < 200ms
  • Stage 4 (Success): O(n) - < 100ms
  • Stage 5 (Determination): O(1) - < 50ms
  • Total Latency: < 1 second

Storage

  • Episodes: PostgreSQL (primary storage)
  • Context: LanceDB (semantic search)
  • History: PostgreSQL (graduation events)

Configuration

interface GraduationConfig { // Episode requirements min_episodes_for_readiness: number; // Default: 30 min_episodes_for_exam: number; // Default: 30 // Readiness weights zero_intervention_weight: number; // Default: 0.40 constitutional_weight: number; // Default: 0.30 confidence_weight: number; // Default: 0.20 success_rate_weight: number; // Default: 0.10 // Thresholds intern_threshold: { overall: number; // Default: 0.70 compliance: number; // Default: 0.75 autonomy: number; // Default: 0.40 }; supervised_threshold: { overall: 0.80, compliance: 0.85, autonomy: 0.60 }; autonomous_threshold: { overall: 0.95, compliance: 0.95, autonomy: 0.85 }; // Exam require_all_stages: boolean; // Default: true allow_marginal_pass: boolean; // Default: false // Post-promotion monitor_post_promotion: boolean; // Default: true post_promotion_evaluation_days: number; // Default: 7 auto_demote_on_failure: boolean; // Default: false }

References

  • Implementation: backend-saas/core/graduation_exam.py, src/lib/ai/graduation-exam.ts
  • Episode Service: backend-saas/core/episode_service.py
  • Background Worker: backend-saas/core/graduation_background_worker.py
  • Tests: src/lib/ai/__tests__/graduation-exam.test.ts
  • Related: Learning Engine, World Model

Last Updated: 2025-02-06 Version: 8.0 Status: Production Ready