ATOM Documentation


Auto-Dev User Guide

Learn how to use Auto-Dev's self-evolving agent capabilities to help your agents learn from experience and automatically improve their skills over time.

**Version:** 1.0.0

**Last Updated:** 2026-04-10

---


Introduction

What is Auto-Dev?

Auto-Dev is a self-evolving agent system that enables AI agents to learn from their experiences and automatically improve their capabilities over time. Instead of manually writing and updating skills, you can let your agents learn from failures and optimize their own code.

**Key Benefits:**

  • **Automatic Skill Generation**: Agents create new skills when they encounter repeated failures
  • **Continuous Optimization**: Existing skills are refined through iterative mutation and testing
  • **Safe by Default**: All changes are validated in sandboxed environments before promotion
  • **Maturity-Gated**: Features unlock as agents demonstrate competence

Two Learning Loops

Auto-Dev provides two complementary learning loops:

  1. **Memento-Skills** (Skill Generation) - Creates NEW capabilities from failures
  2. **AlphaEvolver** (Skill Optimization) - Improves EXISTING capabilities through mutation

Both loops are gated by agent maturity level, ensuring agents have proven competence before accessing self-evolution features.

---

Understanding Auto-Dev

How It Works

┌─────────────────────────────────────────────────────────────┐
│                     Agent Executes Task                      │
└─────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │  Task Fails?    │
                    └─────────────────┘
                      │ Yes        │ No
                      ▼            ▼
            ┌──────────────┐   ┌──────────────┐
            │ MementoSkills│   │ AlphaEvolver │
            │ (INTERN)     │   │ (SUPERVISED) │
            └──────────────┘   └──────────────┘
                      │                    │
                      ▼                    ▼
            ┌──────────────┐   ┌──────────────┐
            │ Generate New │   │ Mutate &     │
            │ Skill        │   │ Optimize     │
            └──────────────┘   └──────────────┘
                      │                    │
                      ▼                    ▼
            ┌──────────────┐   ┌──────────────┐
            │ Sandbox      │   │ Sandbox      │
            │ Validation   │   │ Validation   │
            └──────────────┘   └──────────────┘
                      │                    │
                      ▼                    ▼
            ┌──────────────┐   ┌──────────────┐
            │ User         │   │ Fitness      │
            │ Approval     │   │ Comparison   │
            └──────────────┘   └──────────────┘

Agent Maturity Levels

Auto-Dev capabilities unlock as agents progress through maturity levels:

| Maturity Level | Memento-Skills | AlphaEvolver | Background Evolution |
| --- | --- | --- | --- |
| **STUDENT** | ❌ Blocked | ❌ Blocked | ❌ Blocked |
| **INTERN** | ✅ Enabled | ❌ Blocked | ❌ Blocked |
| **SUPERVISED** | ✅ Enabled | ✅ Enabled | ❌ Blocked |
| **AUTONOMOUS** | ✅ Enabled | ✅ Enabled | ✅ Enabled |

**How Agents Graduate:**

  • Complete episodes successfully
  • Maintain low intervention rates
  • Demonstrate constitutional compliance
  • Receive positive user feedback

---

Memento-Skills Learning Loop

Overview

Memento-Skills generates **new capabilities** when agents fail tasks repeatedly. It analyzes failure patterns, creates skill proposals, and validates them before promotion.

When It Triggers

Memento-Skills activates when:

  1. Agent maturity is **INTERN** or higher
  2. Auto-Dev is enabled in workspace settings
  3. Agent fails the **same task** 2+ times (configurable threshold)

Step-by-Step Workflow

Step 1: Failure Detection

Agent attempts task → Fails → Episode recorded

The system records:

  • Task description
  • Error trace
  • Tools attempted
  • Execution context

Step 2: Pattern Recognition

ReflectionEngine monitors failures → Detects pattern → Triggers Memento

The ReflectionEngine:

  • Buffers recent failures per agent
  • Compares task descriptions for similarity
  • Triggers when threshold exceeded (default: 2 similar failures)
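The buffering-and-threshold logic above can be sketched as follows. This is a minimal illustration, not the actual ReflectionEngine: the class and method names are assumptions, and a real implementation would compare task descriptions with embeddings rather than the crude normalization used here.

```python
from collections import defaultdict


class FailureBuffer:
    """Minimal sketch of per-agent failure buffering with a trigger threshold."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self._failures: dict[str, list[str]] = defaultdict(list)

    @staticmethod
    def _normalize(task: str) -> str:
        # Crude similarity proxy: lowercase and collapse whitespace.
        return " ".join(task.lower().split())

    def record_failure(self, agent_id: str, task: str) -> bool:
        """Record a failure; return True once similar failures reach the threshold."""
        key = self._normalize(task)
        self._failures[agent_id].append(key)
        return self._failures[agent_id].count(key) >= self.threshold


buffer = FailureBuffer(threshold=2)
buffer.record_failure("agent-123", "Extract invoice ID from customer emails")
triggered = buffer.record_failure("agent-123", "extract invoice ID  from customer emails")
```

With the default threshold of 2, the second similar failure sets `triggered` to `True`, which is the point where Memento would take over.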

Step 3: Episode Analysis

MementoEngine analyzes episode → Extracts failure pattern

Analysis includes:

  • What task was attempted
  • What went wrong (error trace)
  • What tools were tried
  • Suggested skill name

Step 4: Skill Generation

LLM proposes new skill → Python code generated

The LLM creates:

  • Function with clear name
  • Type hints on parameters
  • Docstring explaining purpose
  • Error handling
  • Self-contained logic

**Example Generated Skill:**

def extract_invoice_id(email_body: str) -> str | None:
    """
    Extract invoice ID from email body using regex patterns.

    Args:
        email_body: Email text content

    Returns:
        Invoice ID if found, None otherwise
    """
    import re

    # Pattern: INV followed by digits
    pattern = r'INV[-_]?(\d{4,})'
    match = re.search(pattern, email_body, re.IGNORECASE)

    return match.group(1) if match else None

Step 5: Sandbox Validation

Generated skill → Executed in sandbox → Results captured

Validation checks:

  • Syntax correctness
  • Execution success
  • Output format
  • Error handling

Step 6: User Review

Skill candidate queued → You review → Approve or reject

You'll see:

  • Skill name and description
  • Generated code
  • Validation results
  • Failure pattern context

Step 7: Promotion

Approved skill → Registered to skill catalog → Agent can use it

The skill becomes:

  • Available to all agents in workspace
  • Version controlled (v1.0.0)
  • Tracked in skill registry

Example Scenario

**Initial State:**

  • Agent: INTERN maturity
  • Task: "Extract invoice ID from customer emails"
  • Result: Fails repeatedly (no matching skill)

**After 2 Failures:**

  1. ReflectionEngine detects pattern
  2. MementoEngine analyzes episodes
  3. LLM generates extract_invoice_id() skill
  4. Sandbox validates successfully
  5. You review and approve
  6. Skill registered

**Future:**

  • Agent successfully extracts invoice IDs
  • No more failures on this task
  • Skill available to other agents

---

AlphaEvolver Learning Loop

Overview

AlphaEvolver **optimizes existing skills** through iterative mutation and fitness comparison. It's like A/B testing for code: small variations are tried, measured, and the best one wins.

When It Triggers

AlphaEvolver activates when:

  1. Agent maturity is **SUPERVISED** or higher
  2. Auto-Dev is enabled in workspace settings
  3. Skill execution shows optimization opportunities:
     • High latency (>5 seconds)
     • High token usage (>5000 tokens)
     • Partial failures or retries

Step-by-Step Workflow

Step 1: Performance Monitoring

Skill executes → Metrics captured → EvolutionEngine evaluates

Tracked metrics:

  • Execution time
  • Token consumption
  • Success rate
  • Error patterns

Step 2: Optimization Trigger

Threshold exceeded → EvolutionEngine triggers AlphaEvolver

Trigger conditions:

  • execution_seconds > 5.0
  • token_usage > 5000
  • success == False
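These conditions map directly to a small predicate. The field names follow the metrics listed above; the function name is an assumption for illustration.

```python
def should_optimize(execution_seconds: float, token_usage: int, success: bool) -> bool:
    """Return True when any AlphaEvolver trigger threshold is crossed."""
    return execution_seconds > 5.0 or token_usage > 5000 or not success


should_optimize(7.2, 6000, True)  # slow and token-heavy: triggers
should_optimize(1.1, 800, True)   # healthy execution: no trigger
```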

Step 3: Code Mutation

Original skill → LLM mutates → New variant created

Mutation prompts:

  • "Reduce execution time by 50%"
  • "Optimize for lower token usage"
  • "Fix intermittent failures"

**Example Mutation:**

**Original:**

def process_invoice(invoice_id: str) -> dict | None:
    # Linear search through 10K invoices
    for invoice in invoices:
        if invoice["id"] == invoice_id:
            return invoice
    return None

**Mutated:**

def process_invoice(invoice_id: str) -> dict | None:
    # O(1) dictionary lookup
    return invoice_index.get(invoice_id)
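The mutated variant assumes an `invoice_index` built once up front. A sketch of how that index might be prepared (variable names mirror the example; the sample invoices are made up):

```python
# One-time index build: trades memory for O(1) lookups on every call.
invoices = [
    {"id": "INV-1001", "amount": 250.0},
    {"id": "INV-1002", "amount": 75.5},
]
invoice_index = {invoice["id"]: invoice for invoice in invoices}

invoice_index.get("INV-1002")  # -> {"id": "INV-1002", "amount": 75.5}
```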

Step 4: Sandbox Execution

Mutated code → Executed in sandbox → Proxy signals captured

Proxy signals:

  • execution_success: Ran without crashing
  • syntax_error: Code compiles
  • execution_latency_ms: Runtime performance
  • user_approved_proposal: HITL feedback

Step 5: Fitness Evaluation

FitnessService calculates score → Variant ranked

**Stage 1: Initial Proxy Score**

score = 0.0
if not syntax_error:
    score += 0.2  # Survived syntax check
if execution_success:
    score += 0.3  # Ran successfully
if user_approved:
    score += 0.5  # Human approval

**Stage 2: Delayed External Signals**

if invoice_created:
    score += 0.4  # Business value
if crm_conversion:
    score += 0.5  # Downstream success
if conversion_value:
    score += min(0.5, value / 1000)  # Scaled value
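The two stages combine into a single scoring helper, sketched below by merging the snippets above into one function. Whether the real FitnessService caps the total or weights the stages differently is an assumption not confirmed here.

```python
def fitness_score(
    syntax_error: bool,
    execution_success: bool,
    user_approved: bool,
    invoice_created: bool = False,
    crm_conversion: bool = False,
    conversion_value: float = 0.0,
) -> float:
    """Combine Stage 1 proxy signals with Stage 2 delayed external signals."""
    score = 0.0
    # Stage 1: initial proxy score
    if not syntax_error:
        score += 0.2  # Survived syntax check
    if execution_success:
        score += 0.3  # Ran successfully
    if user_approved:
        score += 0.5  # Human approval
    # Stage 2: delayed external signals
    if invoice_created:
        score += 0.4  # Business value
    if crm_conversion:
        score += 0.5  # Downstream success
    if conversion_value:
        score += min(0.5, conversion_value / 1000)  # Scaled value
    return score


fitness_score(syntax_error=False, execution_success=True, user_approved=True)
```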

Step 6: Variant Comparison

Original vs. Mutated → Higher fitness wins

**Example:**

  • Original fitness: 0.65 (slow but works)
  • Mutated fitness: 0.85 (fast and works)
  • **Winner: Mutated variant**

Step 7: Research Mode (Optional)

Iterative mutation → Progressive improvement → Best variant selected

For AUTONOMOUS agents, AlphaEvolver can run multi-iteration experiments:

Iteration 1: Base code → Mutate → Test → Keep winner
Iteration 2: Winner → Mutate → Test → Keep winner
Iteration 3: Winner → Mutate → Test → Final best
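The research-mode loop above amounts to simple hill climbing. In this sketch, `mutate` and `fitness` stand in for the LLM mutation step and the sandbox-plus-fitness evaluation; the toy demonstration at the bottom just rewards longer strings.

```python
from typing import Callable


def research_loop(
    base: str,
    mutate: Callable[[str], str],
    fitness: Callable[[str], float],
    iterations: int = 3,
) -> str:
    """Keep the higher-fitness variant at each iteration; return the best."""
    best, best_score = base, fitness(base)
    for _ in range(iterations):
        candidate = mutate(best)
        score = fitness(candidate)
        if score > best_score:  # the winner carries into the next iteration
            best, best_score = candidate, score
    return best


# Toy demonstration: "mutation" appends a character, fitness rewards length.
winner = research_loop("v", lambda code: code + "+", lambda code: float(len(code)))
```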

Example Scenario

**Initial State:**

  • Skill: send_slack_notification()
  • Agent: SUPERVISED maturity
  • Performance: 7 seconds, 6000 tokens
  • Issue: Too slow for high-volume usage

**After Trigger:**

  1. EvolutionEngine detects high latency
  2. AlphaEvolver generates mutation
  3. Sandbox tests both versions
  4. Fitness scores calculated
  5. Mutated variant wins (2 seconds, 2000 tokens)
  6. Queued for your review

**After Approval:**

  • Skill updated to optimized version
  • Future executions faster and cheaper
  • Lineage tracked (parent_tool_id)

---

Capability Gates

Workspace Settings

Auto-Dev must be enabled at workspace level:

{
  "auto_dev": {
    "enabled": true,
    "memento_skills": true,
    "alpha_evolver": true,
    "background_evolution": false,
    "max_mutations_per_day": 10,
    "max_skill_candidates_per_day": 5
  }
}

Configuration Options

`enabled` (boolean)

  • Master toggle for all Auto-Dev features
  • Default: false

`memento_skills` (boolean)

  • Enable Memento-Skills (skill generation)
  • Requires: INTERN maturity
  • Default: true

`alpha_evolver` (boolean)

  • Enable AlphaEvolver (skill optimization)
  • Requires: SUPERVISED maturity
  • Default: true

`background_evolution` (boolean)

  • Enable automatic background optimization
  • Requires: AUTONOMOUS maturity
  • Default: false (explicit opt-in)

`max_mutations_per_day` (integer)

  • Daily limit on AlphaEvolver mutations
  • Default: 10

`max_skill_candidates_per_day` (integer)

  • Daily limit on Memento skill proposals
  • Default: 5

Maturity Requirements Summary

| Capability | Minimum Maturity | Workspace Setting | Daily Limit |
| --- | --- | --- | --- |
| Memento-Skills | INTERN | `memento_skills: true` | 5 candidates |
| AlphaEvolver | SUPERVISED | `alpha_evolver: true` | 10 mutations |
| Background Evolution | AUTONOMOUS | `background_evolution: true` | 10 mutations |
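Putting the requirements together, a gate check might look like the sketch below. The maturity ordering and setting names follow the tables above; the function name and exact evaluation order are assumptions.

```python
MATURITY_ORDER = ["STUDENT", "INTERN", "SUPERVISED", "AUTONOMOUS"]

REQUIREMENTS = {
    "memento_skills": "INTERN",
    "alpha_evolver": "SUPERVISED",
    "background_evolution": "AUTONOMOUS",
}


def capability_allowed(capability: str, maturity: str, settings: dict) -> bool:
    """A capability runs only if enabled in settings AND the agent is mature enough."""
    if not settings.get("enabled") or not settings.get(capability):
        return False
    required = REQUIREMENTS[capability]
    return MATURITY_ORDER.index(maturity) >= MATURITY_ORDER.index(required)


settings = {"enabled": True, "memento_skills": True, "alpha_evolver": True}
capability_allowed("alpha_evolver", "INTERN", settings)      # blocked: needs SUPERVISED
capability_allowed("alpha_evolver", "SUPERVISED", settings)  # allowed
```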

Capability Unlock Notifications

When an agent graduates to a new maturity level, you'll receive:

{
  "type": "auto_dev_capability_unlocked",
  "agent_id": "agent-123",
  "capability": "auto_dev.alpha_evolver",
  "message": "Agent has graduated to use Alpha Evolver. Enable it in Settings > Auto-Dev to activate.",
  "action_required": true
}

---

Common Workflows

Workflow 1: Enable Auto-Dev for Your Workspace

  1. **Check Agent Maturity**
     • INTERN or higher required for Memento-Skills
     • SUPERVISED or higher required for AlphaEvolver
  2. **Enable Auto-Dev**
  3. **Configure Capabilities**
     • ✅ Memento-Skills
     • ✅ AlphaEvolver
     • ❌ Background Evolution (until AUTONOMOUS)
  4. **Set Daily Limits**
     • Max skill candidates: 5/day
     • Max mutations: 10/day
  5. **Save Changes**
     • Auto-Dev is now active for your workspace

Workflow 2: Review Skill Candidates

  1. **Navigate to Candidates**
  2. **Review Pending Candidates**
     • **Skill Name**: Proposed function name
     • **Description**: What it does
     • **Source Episode**: Which failure triggered it
     • **Generated Code**: Python implementation
     • **Validation Result**: Sandbox test results
  3. **Inspect Code**
     • Function signature
     • Type hints
     • Docstring
     • Implementation
     • Error handling
  4. **Test Manually** (Optional)
     • Click "Test in Sandbox"
     • Provide sample inputs
     • Review output
  5. **Approve or Reject**
     • **Approve**: Skill registered to catalog
     • **Reject**: Candidate discarded (can be regenerated)
  6. **Monitor Usage**

Workflow 3: Monitor Evolution Progress

  1. **Navigate to Mutations**
  2. **Review Mutation History**
     • **Tool Name**: Which skill was mutated
     • **Parent**: Original version
     • **Mutated Code**: New implementation
     • **Status**: pending, passed, failed
     • **Fitness Score**: 0.0 to 1.0
  3. **Compare Variants**
     • **Original**: Baseline performance
     • **Mutated**: Improved performance
     • **Fitness Delta**: Score improvement
  4. **Approve Promotion** (if applicable)
     • Review fitness comparison
     • Approve to replace original
     • Reject to discard
  5. **Track Lineage**

Workflow 4: Configure Daily Limits

  1. **Navigate to Settings**
  2. **Adjust Limits**
     • **Skill Candidates**: 1-20 per day
     • **Mutations**: 1-50 per day
  3. **Consider Factors**
     • Workspace size (more agents = higher limits)
     • Budget (LLM API costs)
     • Review capacity (can you keep up?)
  4. **Save Changes**
     • Limits apply immediately
     • Counters reset at midnight UTC

Workflow 5: Check Graduation Readiness

  1. **Navigate to Agent**
  2. **Review Readiness Score**
     • Episode count (40% weight)
     • Intervention rate (30% weight)
     • Constitutional compliance (30% weight)
  3. **Check Gaps**
     • Episodes needed: 10 (INTERN), 25 (SUPERVISED), 50 (AUTONOMOUS)
     • Intervention rate target: <50% (INTERN), <20% (SUPERVISED), <5% (AUTONOMOUS)
     • Constitutional score target: >0.70 (INTERN), >0.85 (SUPERVISED), >0.95 (AUTONOMOUS)
  4. **Accelerate Graduation**
     • Run more episodes
     • Reduce interventions (approve proposals)
     • Improve constitutional compliance
  5. **Wait for Notification**
     • Auto-Dev capabilities unlock automatically
     • You'll receive a notification when ready
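The weighted readiness score in step 2 can be sketched as below. The 40/30/30 weights come from the list above, but the normalization is an assumption: here episode count is capped at the target and intervention rate is inverted so that lower rates score higher.

```python
def readiness_score(
    episodes: int,
    episode_target: int,
    intervention_rate: float,
    constitutional_score: float,
) -> float:
    """Weighted readiness: 40% episodes, 30% interventions, 30% compliance."""
    episode_component = min(episodes / episode_target, 1.0)
    intervention_component = 1.0 - intervention_rate  # lower rate scores higher
    return round(
        0.4 * episode_component
        + 0.3 * intervention_component
        + 0.3 * constitutional_score,
        3,
    )


# Agent aiming for SUPERVISED: 25-episode target, <20% interventions, >0.85 compliance.
readiness_score(20, 25, 0.15, 0.90)  # -> 0.845
```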

---

Best Practices

1. Start with Memento-Skills

**Why:** Skill generation is lower risk than optimization

**How:**

  • Enable Memento-Skills at INTERN level
  • Review candidates carefully
  • Promote high-quality skills
  • Disable if too many low-quality proposals

**Benefits:**

  • Expands agent capabilities
  • Addresses repeated failures
  • Lower maturity requirement (INTERN)

2. Review Candidates Before Promotion

**Why:** LLM-generated code may have bugs or security issues

**How:**

  • Always read generated code
  • Check for:
     • Type hints
     • Error handling
     • Input validation
     • Security issues (hardcoded secrets, unsafe operations)
  • Test in sandbox with sample inputs
  • Only approve high-quality code

**Red Flags:**

  • Missing type hints
  • No error handling
  • Hardcoded values
  • External API calls without rate limiting
  • File system operations without validation

3. Monitor Fitness Scores

**Why:** Fitness scores indicate optimization effectiveness

**How:**

  • Track fitness trends over time
  • Investigate sudden drops
  • Celebrate improvements
  • Set fitness targets (>0.7 = good, >0.9 = excellent)

**Interpretation:**

  • **0.9-1.0**: Excellent (promote immediately)
  • **0.7-0.9**: Good (consider promoting)
  • **0.5-0.7**: Moderate (needs improvement)
  • **0.0-0.5**: Poor (discard or re-mutate)
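These bands map to a simple classifier (a sketch; since the ranges above overlap at their edges, the boundary handling at exactly 0.7 and 0.9 is an assumption):

```python
def interpret_fitness(score: float) -> str:
    """Map a fitness score to the recommendation bands above."""
    if score >= 0.9:
        return "excellent: promote immediately"
    if score >= 0.7:
        return "good: consider promoting"
    if score >= 0.5:
        return "moderate: needs improvement"
    return "poor: discard or re-mutate"


interpret_fitness(0.85)  # -> "good: consider promoting"
```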

4. Set Appropriate Daily Limits

**Why:** Prevent resource exhaustion and control costs

**How:**

  • Start with defaults (5 candidates, 10 mutations)
  • Increase if:
  • Large workspace (10+ agents)
  • High review capacity
  • Sufficient budget
  • Decrease if:
  • Small workspace (1-3 agents)
  • Limited review time
  • Budget constraints

**Recommendations:**

| Workspace Size | Skill Candidates | Mutations |
| --- | --- | --- |
| Small (1-3 agents) | 3/day | 5/day |
| Medium (4-10 agents) | 5/day | 10/day |
| Large (10+ agents) | 10/day | 20/day |

5. Use Background Evolution Carefully

**Why:** Automatic mutations can accumulate errors

**How:**

  • Only enable for AUTONOMOUS agents
  • Set conservative daily limits
  • Monitor mutation queue regularly
  • Roll back problematic mutations

**When to Enable:**

  • Agent has 50+ successful episodes
  • Intervention rate <5%
  • Constitutional score >0.95
  • You have time to review daily

6. Track Lineage

**Why:** Understanding mutation history helps debugging

**How:**

  • Use parent_tool_id to trace mutations
  • Compare variants side-by-side
  • Keep original code for rollback
  • Document why mutations were made

**Benefits:**

  • Easy rollback
  • Pattern recognition
  • Knowledge sharing
  • Debugging aid

7. Combine with Episodic Memory

**Why:** Episodes provide context for learning

**How:**

  • Ensure episodes are recorded
  • Link episodes to skill candidates
  • Use episode search to find relevant context
  • Leverage episode feedback for fitness signals

**Integration:**

  • Memento uses source_episode_id
  • AlphaEvolver uses episode analysis
  • Fitness signals from episode outcomes

---

Troubleshooting

Issue: No Skill Candidates Generated

**Symptoms:**

  • Agent fails tasks repeatedly
  • No candidates appear in queue
  • ReflectionEngine not triggering

**Diagnosis:**

  1. Check workspace settings:
  2. Check agent maturity:
  3. Check failure threshold:

**Solutions:**

  • Enable Memento-Skills in workspace settings
  • Graduate agent to INTERN level
  • Lower failure threshold to 1-2
  • Ensure EpisodeService is recording failures

Issue: Mutations Failing Sandbox

**Symptoms:**

  • All mutations show "failed" status
  • Sandbox execution errors
  • No fitness scores calculated

**Diagnosis:**

  1. Check Docker availability:
  2. Check sandbox logs:
  3. Review mutation code for syntax errors

**Solutions:**

  • Install/start Docker daemon
  • Fix syntax errors in base code
  • Increase sandbox timeout (default: 60s)
  • Check memory limits (default: 256MB)

Issue: Fitness Scores Not Updating

**Symptoms:**

  • Mutations stuck at "pending" status
  • Fitness scores remain None
  • External signals not received

**Diagnosis:**

  1. Check webhook integration:
  2. Check evaluation status:
  3. Review FitnessService logs

**Solutions:**

  • Configure webhooks for external signals
  • Manually trigger delayed evaluation
  • Check proxy signals are being recorded
  • Ensure expects_delayed_eval is set correctly

Issue: Daily Limits Exceeded

**Symptoms:**

  • Error: "Daily limit exceeded"
  • No new mutations/candidates generated
  • Counter at max value

**Diagnosis:**

  1. Check current usage:
  2. Check limits:

**Solutions:**

  • Wait for midnight UTC reset
  • Increase daily limits in settings
  • Disable unused capabilities
  • Review and discard low-quality items

Issue: Agent Not Graduating

**Symptoms:**

  • Agent stuck at current maturity
  • Readiness score not increasing
  • No graduation notifications

**Diagnosis:**

  1. Check graduation criteria:
  2. Review episode count:
  3. Check intervention rate:

**Solutions:**

  • Run more episodes (automated tasks)
  • Reduce interventions (approve proposals)
  • Improve constitutional compliance
  • Wait for automatic graduation evaluation

Issue: Generated Code Has Security Issues

**Symptoms:**

  • Candidate contains hardcoded secrets
  • Code accesses unsafe resources
  • Missing input validation

**Diagnosis:**

  1. Review generated code carefully
  2. Check for security patterns:
     • Hardcoded API keys
     • SQL injection risks
     • Path traversal vulnerabilities
     • Missing authentication

**Solutions:**

  • **Reject** the candidate immediately
  • Report security issue to improve LLM prompts
  • Manually create secure version
  • Enable security scanning in sandbox

---

FAQ

Q: What's the difference between Memento-Skills and AlphaEvolver?

**A:**

  • **Memento-Skills** creates NEW capabilities from failures (feature expansion)
  • **AlphaEvolver** improves EXISTING capabilities through mutation (optimization)

**Analogy:**

  • Memento = "I don't have a hammer, let me invent one"
  • AlphaEvolver = "This hammer is heavy, let me make it lighter"

Q: Do I need to enable both learning loops?

**A:** No, they're independent:

  • Enable only Memento-Skills for skill generation
  • Enable only AlphaEvolver for optimization
  • Enable both for full self-evolution

Q: Can agents edit their own code without limits?

**A:** No, multiple guardrails exist:

  1. **Maturity gates**: INTERN/SUPERVISED/AUTONOMOUS requirements
  2. **Workspace settings**: Master toggle + per-capability toggles
  3. **Sandbox validation**: All code tested before promotion
  4. **User approval**: Required for promotion (INTERN/SUPERVISED)
  5. **Daily limits**: Prevent resource exhaustion

Q: What happens if a mutation breaks a skill?

**A:** Lineage tracking enables rollback:

  1. Original code preserved via parent_tool_id
  2. Fitness scores detect degradation
  3. You can reject mutations
  4. Rollback to previous version

Q: How often should I review skill candidates?

**A:** Depends on workspace activity:

  • **High activity** (10+ agents): Daily review recommended
  • **Medium activity** (4-10 agents): Weekly review sufficient
  • **Low activity** (1-3 agents): Review as needed

**Best practice:** Set aside dedicated time each week to review and approve/reject candidates.

Q: Can I manually create skill candidates?

**A:** Yes, via API:

from core.auto_dev.memento_engine import MementoEngine

engine = MementoEngine(db)
candidate = await engine.generate_skill_candidate(
    tenant_id="your-tenant",
    agent_id="your-agent",
    episode_id="episode-123",
)

Q: How do I disable Auto-Dev?

**A:** Two levels:

  1. **Workspace level**: Disable master toggle
  2. **Capability level**: Disable specific capabilities

Q: What happens to existing skills/mutations when Auto-Dev is disabled?

**A:** They persist:

  • Existing skills remain available
  • Mutation history preserved
  • Can re-enable anytime
  • No data loss

Q: Can I export skills learned via Auto-Dev?

**A:** Yes, skills are standard Python packages:

  1. Navigate to skill directory
  2. Export as ZIP
  3. Share with other workspaces
  4. Import via skill marketplace

Q: How much does Auto-Dev cost?

**A:** Two cost components:

  1. **LLM API calls**: For code generation
  • Memento: ~500-1000 tokens per candidate
  • AlphaEvolver: ~300-800 tokens per mutation
  1. **Sandbox execution**: Minimal (Docker resources)

**Cost estimation:**

  • 5 skill candidates/day: ~$0.05-0.10/day
  • 10 mutations/day: ~$0.03-0.08/day
  • **Total**: ~$0.08-0.18/day for active workspace

Q: Is Auto-Dev safe for production workloads?

**A:** Yes, with safeguards:

  • All code validated in sandbox
  • User approval required (INTERN/SUPERVISED)
  • Daily limits prevent runaway processes
  • Lineage tracking enables rollback
  • Tenant isolation enforced

**Recommendation:** Start with non-critical workloads, monitor closely, then expand.

Q: Can I use Auto-Dev with custom LLM providers?

**A:** Yes, Auto-Dev uses LLMService abstraction:

  • OpenAI, Anthropic, DeepSeek, Gemini, etc.
  • Configure in workspace settings
  • Auto-Dev automatically uses configured provider

---

See Also