Auto-Dev User Guide
Learn how to use Auto-Dev's self-evolving agent capabilities to help your agents learn from experience and automatically improve their skills over time.
**Version:** 1.0.0
**Last Updated:** 2026-04-10
---
Table of Contents
- Introduction
- Understanding Auto-Dev
- Memento-Skills Learning Loop
- AlphaEvolver Learning Loop
- Capability Gates
- Common Workflows
- Best Practices
- Troubleshooting
- FAQ
---
Introduction
What is Auto-Dev?
Auto-Dev is a self-evolving agent system that enables AI agents to learn from their experiences and automatically improve their capabilities over time. Instead of manually writing and updating skills, you can let your agents learn from failures and optimize their own code.
**Key Benefits:**
- **Automatic Skill Generation**: Agents create new skills when they encounter repeated failures
- **Continuous Optimization**: Existing skills are refined through iterative mutation and testing
- **Safe by Default**: All changes are validated in sandboxed environments before promotion
- **Maturity-Gated**: Features unlock as agents demonstrate competence
Two Learning Loops
Auto-Dev provides two complementary learning loops:
- **Memento-Skills** (Skill Generation) - Creates NEW capabilities from failures
- **AlphaEvolver** (Skill Optimization) - Improves EXISTING capabilities through mutation
Both loops are gated by agent maturity level, ensuring agents have proven competence before accessing self-evolution features.
---
Understanding Auto-Dev
How It Works
```
┌─────────────────────────────────────────────────────────────┐
│                     Agent Executes Task                     │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
                   ┌─────────────────┐
                   │   Task Fails?   │
                   └─────────────────┘
                    Yes │       │ No
                        ▼       ▼
              ┌──────────────┐ ┌──────────────┐
              │ MementoSkills│ │ AlphaEvolver │
              │   (INTERN)   │ │ (SUPERVISED) │
              └──────────────┘ └──────────────┘
                      │               │
                      ▼               ▼
              ┌──────────────┐ ┌──────────────┐
              │ Generate New │ │   Mutate &   │
              │    Skill     │ │   Optimize   │
              └──────────────┘ └──────────────┘
                      │               │
                      ▼               ▼
              ┌──────────────┐ ┌──────────────┐
              │   Sandbox    │ │   Sandbox    │
              │  Validation  │ │  Validation  │
              └──────────────┘ └──────────────┘
                      │               │
                      ▼               ▼
              ┌──────────────┐ ┌──────────────┐
              │     User     │ │   Fitness    │
              │   Approval   │ │  Comparison  │
              └──────────────┘ └──────────────┘
```

Agent Maturity Levels
Auto-Dev capabilities unlock as agents progress through maturity levels:
| Maturity Level | Memento-Skills | AlphaEvolver | Background Evolution |
|---|---|---|---|
| **STUDENT** | ❌ Blocked | ❌ Blocked | ❌ Blocked |
| **INTERN** | ✅ Enabled | ❌ Blocked | ❌ Blocked |
| **SUPERVISED** | ✅ Enabled | ✅ Enabled | ❌ Blocked |
| **AUTONOMOUS** | ✅ Enabled | ✅ Enabled | ✅ Enabled |
**How Agents Graduate:**
- Complete episodes successfully
- Maintain low intervention rates
- Demonstrate constitutional compliance
- Receive positive user feedback
---
Memento-Skills Learning Loop
Overview
Memento-Skills generates **new capabilities** when agents fail tasks repeatedly. It analyzes failure patterns, creates skill proposals, and validates them before promotion.
When It Triggers
Memento-Skills activates when:
- Agent maturity is **INTERN** or higher
- Auto-Dev is enabled in workspace settings
- Agent fails the **same task** 2+ times (configurable threshold)
Step-by-Step Workflow
Step 1: Failure Detection
Agent attempts task → Fails → Episode recorded

The system records:
- Task description
- Error trace
- Tools attempted
- Execution context
Step 2: Pattern Recognition
ReflectionEngine monitors failures → Detects pattern → Triggers Memento

The ReflectionEngine:
- Buffers recent failures per agent
- Compares task descriptions for similarity
- Triggers when threshold exceeded (default: 2 similar failures)
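As a rough illustration, the buffer-and-threshold logic might look like the following sketch. The `FailureBuffer` class and its word-overlap similarity metric are illustrative assumptions, not the actual ReflectionEngine implementation:

```python
from collections import defaultdict, deque

class FailureBuffer:
    """Illustrative sketch: buffer recent failures per agent and
    fire once enough similar task descriptions accumulate."""

    def __init__(self, threshold: int = 2, max_buffer: int = 20):
        self.threshold = threshold
        self.buffers = defaultdict(lambda: deque(maxlen=max_buffer))

    @staticmethod
    def _similar(a: str, b: str) -> bool:
        # Crude word-overlap similarity; the real engine may use embeddings.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1) > 0.5

    def record_failure(self, agent_id: str, task: str) -> bool:
        """Return True when the threshold of similar failures is reached."""
        buf = self.buffers[agent_id]
        similar = sum(1 for prev in buf if self._similar(prev, task)) + 1
        buf.append(task)
        return similar >= self.threshold
```

A production implementation would likely compare embeddings rather than raw word overlap, but the shape of the check is the same: buffer per agent, count similar entries, fire at the threshold.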
Step 3: Episode Analysis
MementoEngine analyzes episode → Extracts failure pattern

Analysis includes:
- What task was attempted
- What went wrong (error trace)
- What tools were tried
- Suggested skill name
Step 4: Skill Generation
LLM proposes new skill → Python code generated

The LLM creates:
- Function with clear name
- Type hints on parameters
- Docstring explaining purpose
- Error handling
- Self-contained logic
**Example Generated Skill:**
```python
def extract_invoice_id(email_body: str) -> str | None:
    """
    Extract invoice ID from email body using regex patterns.

    Args:
        email_body: Email text content

    Returns:
        Invoice ID if found, None otherwise
    """
    import re

    # Pattern: INV followed by digits
    pattern = r'INV[-_]?(\d{4,})'
    match = re.search(pattern, email_body, re.IGNORECASE)
    return match.group(1) if match else None
```

Step 5: Sandbox Validation

Generated skill → Executed in sandbox → Results captured

Validation checks:
- Syntax correctness
- Execution success
- Output format
- Error handling
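The first two checks can be illustrated with a standalone sketch. The `validate_candidate` helper is purely illustrative; the production sandbox runs code in an isolated container rather than calling `exec()` in-process:

```python
import ast

def validate_candidate(code: str, entrypoint: str, sample_args: tuple) -> dict:
    """Illustrative two-stage check: parse for syntax correctness,
    then execute the entrypoint with sample inputs."""
    try:
        ast.parse(code)  # Stage 1: syntax correctness
    except SyntaxError as exc:
        return {"syntax_ok": False, "error": str(exc)}
    namespace: dict = {}
    try:
        # Stage 2: execution success. exec() here is only for illustration;
        # the real system isolates execution in a sandboxed container.
        exec(code, namespace)
        result = namespace[entrypoint](*sample_args)
        return {"syntax_ok": True, "executed": True, "result": result}
    except Exception as exc:
        return {"syntax_ok": True, "executed": False, "error": str(exc)}
```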
Step 6: User Review
Skill candidate queued → You review → Approve or reject

You'll see:
- Skill name and description
- Generated code
- Validation results
- Failure pattern context
Step 7: Promotion
Approved skill → Registered to skill catalog → Agent can use it

The skill becomes:
- Available to all agents in workspace
- Version controlled (v1.0.0)
- Tracked in skill registry
Example Scenario
**Initial State:**
- Agent: INTERN maturity
- Task: "Extract invoice ID from customer emails"
- Result: Fails repeatedly (no matching skill)
**After 2 Failures:**
- ReflectionEngine detects pattern
- MementoEngine analyzes episodes
- LLM generates `extract_invoice_id()` skill
- Sandbox validates successfully
- You review and approve
- Skill registered
**Future:**
- Agent successfully extracts invoice IDs
- No more failures on this task
- Skill available to other agents
---
AlphaEvolver Learning Loop
Overview
AlphaEvolver **optimizes existing skills** through iterative mutation and fitness comparison. It's like A/B testing for code - small variations are tried, measured, and the best wins.
When It Triggers
AlphaEvolver activates when:
- Agent maturity is **SUPERVISED** or higher
- Auto-Dev is enabled in workspace settings
- Skill execution shows optimization opportunities:
- High latency (>5 seconds)
- High token usage (>5000 tokens)
- Partial failures or retries
Step-by-Step Workflow
Step 1: Performance Monitoring
Skill executes → Metrics captured → EvolutionEngine evaluates

Tracked metrics:
- Execution time
- Token consumption
- Success rate
- Error patterns
Step 2: Optimization Trigger
Threshold exceeded → EvolutionEngine triggers AlphaEvolver

Trigger conditions:
- `execution_seconds > 5.0`
- `token_usage > 5000`
- `success == False`
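These conditions amount to a simple disjunction; as a minimal sketch (the `SkillMetrics` field names are assumptions mirroring the conditions above):

```python
from dataclasses import dataclass

# Default thresholds from the trigger conditions above.
LATENCY_LIMIT_S = 5.0
TOKEN_LIMIT = 5000

@dataclass
class SkillMetrics:
    execution_seconds: float
    token_usage: int
    success: bool

def should_trigger_evolver(m: SkillMetrics) -> bool:
    """Trigger AlphaEvolver when any optimization threshold is exceeded."""
    return (
        m.execution_seconds > LATENCY_LIMIT_S
        or m.token_usage > TOKEN_LIMIT
        or not m.success
    )
```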
Step 3: Code Mutation
Original skill → LLM mutates → New variant created

Mutation prompts:
- "Reduce execution time by 50%"
- "Optimize for lower token usage"
- "Fix intermittent failures"
**Example Mutation:**
**Original:**
```python
def process_invoice(invoice_id: str) -> dict | None:
    # Linear search through 10K invoices
    for invoice in invoices:
        if invoice["id"] == invoice_id:
            return invoice
    return None
```

**Mutated:**

```python
def process_invoice(invoice_id: str) -> dict | None:
    # O(1) dictionary lookup
    return invoice_index.get(invoice_id)
```

Step 4: Sandbox Execution
Mutated code → Executed in sandbox → Proxy signals captured

Proxy signals:
- `execution_success`: Ran without crashing
- `syntax_error`: Set when the code fails to compile
- `execution_latency_ms`: Runtime performance
- `user_approved_proposal`: HITL feedback
Step 5: Fitness Evaluation
FitnessService calculates score → Variant ranked

**Stage 1: Initial Proxy Score**

```python
score = 0.0
if not syntax_error:
    score += 0.2  # Survived syntax check
if execution_success:
    score += 0.3  # Ran successfully
if user_approved:
    score += 0.5  # Human approval
```

**Stage 2: Delayed External Signals**
```python
if invoice_created:
    score += 0.4  # Business value
if crm_conversion:
    score += 0.5  # Downstream success
if conversion_value:
    score += min(0.5, conversion_value / 1000)  # Scaled value
```

Step 6: Variant Comparison
Original vs. Mutated → Higher fitness wins

**Example:**
- Original fitness: 0.65 (slow but works)
- Mutated fitness: 0.85 (fast and works)
- **Winner: Mutated variant**
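The two-stage scoring shown in Step 5 can be consolidated into a single runnable sketch. Signal names follow the snippets above; capping the total at 1.0 is an assumption, matching the 0.0-1.0 fitness range used elsewhere in this guide:

```python
def fitness_score(signals: dict) -> float:
    """Combine proxy and delayed signals into one fitness value.
    The cap at 1.0 is an assumption (fitness is displayed as 0.0-1.0)."""
    score = 0.0
    # Stage 1: proxy signals from sandbox execution
    if not signals.get("syntax_error", False):
        score += 0.2
    if signals.get("execution_success", False):
        score += 0.3
    if signals.get("user_approved", False):
        score += 0.5
    # Stage 2: delayed external signals
    if signals.get("invoice_created", False):
        score += 0.4
    if signals.get("crm_conversion", False):
        score += 0.5
    value = signals.get("conversion_value", 0)
    if value:
        score += min(0.5, value / 1000)
    return min(score, 1.0)
```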
Step 7: Research Mode (Optional)
Iterative mutation → Progressive improvement → Best variant selected

For AUTONOMOUS agents, AlphaEvolver can run multi-iteration experiments:
Iteration 1: Base code → Mutate → Test → Keep winner
Iteration 2: Winner → Mutate → Test → Keep winner
Iteration 3: Winner → Mutate → Test → Final best

Example Scenario
**Initial State:**
- Skill: `send_slack_notification()`
- Agent: SUPERVISED maturity
- Performance: 7 seconds, 6000 tokens
- Issue: Too slow for high-volume usage
**After Trigger:**
- EvolutionEngine detects high latency
- AlphaEvolver generates mutation
- Sandbox tests both versions
- Fitness scores calculated
- Mutated variant wins (2 seconds, 2000 tokens)
- Queued for your review
**After Approval:**
- Skill updated to optimized version
- Future executions faster and cheaper
- Lineage tracked (parent_tool_id)
---
Capability Gates
Workspace Settings
Auto-Dev must be enabled at workspace level:
```json
{
  "auto_dev": {
    "enabled": true,
    "memento_skills": true,
    "alpha_evolver": true,
    "background_evolution": false,
    "max_mutations_per_day": 10,
    "max_skill_candidates_per_day": 5
  }
}
```

Configuration Options
`enabled` (boolean)
- Master toggle for all Auto-Dev features
- Default: `false`
`memento_skills` (boolean)
- Enable Memento-Skills (skill generation)
- Requires: INTERN maturity
- Default: `true`
`alpha_evolver` (boolean)
- Enable AlphaEvolver (skill optimization)
- Requires: SUPERVISED maturity
- Default: `true`
`background_evolution` (boolean)
- Enable automatic background optimization
- Requires: AUTONOMOUS maturity
- Default: `false` (explicit opt-in)
`max_mutations_per_day` (integer)
- Daily limit on AlphaEvolver mutations
- Default: `10`
`max_skill_candidates_per_day` (integer)
- Daily limit on Memento skill proposals
- Default: `5`
Maturity Requirements Summary
| Capability | Minimum Maturity | Workspace Setting | Daily Limit |
|---|---|---|---|
| Memento-Skills | INTERN | memento_skills: true | 5 candidates |
| AlphaEvolver | SUPERVISED | alpha_evolver: true | 10 mutations |
| Background Evolution | AUTONOMOUS | background_evolution: true | 10 mutations |
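The gate check combines the workspace toggles with the maturity requirements from the table above. As a sketch (the enum and function names are illustrative, not the actual API):

```python
from enum import IntEnum

class Maturity(IntEnum):
    STUDENT = 0
    INTERN = 1
    SUPERVISED = 2
    AUTONOMOUS = 3

# Minimum maturity per capability, per the table above.
REQUIREMENTS = {
    "memento_skills": Maturity.INTERN,
    "alpha_evolver": Maturity.SUPERVISED,
    "background_evolution": Maturity.AUTONOMOUS,
}

def capability_allowed(capability: str, maturity: Maturity, settings: dict) -> bool:
    """A capability needs the master toggle, its own toggle, AND maturity."""
    return (
        settings.get("enabled", False)
        and settings.get(capability, False)
        and maturity >= REQUIREMENTS[capability]
    )
```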
Capability Unlock Notifications
When an agent graduates to a new maturity level, you'll receive:
```json
{
  "type": "auto_dev_capability_unlocked",
  "agent_id": "agent-123",
  "capability": "auto_dev.alpha_evolver",
  "message": "Agent has graduated to use Alpha Evolver. Enable it in Settings > Auto-Dev to activate.",
  "action_required": true
}
```

---
Common Workflows
Workflow 1: Enable Auto-Dev for Your Workspace
- **Check Agent Maturity**
- INTERN or higher required for Memento-Skills
- SUPERVISED or higher required for AlphaEvolver
- **Enable Auto-Dev**
- **Configure Capabilities**
- ✅ Memento-Skills
- ✅ AlphaEvolver
- ❌ Background Evolution (until AUTONOMOUS)
- **Set Daily Limits**
- Max skill candidates: 5/day
- Max mutations: 10/day
- **Save Changes**
- Auto-Dev is now active for your workspace
Workflow 2: Review Skill Candidates
- **Navigate to Candidates**
- **Review Pending Candidates**
- **Skill Name**: Proposed function name
- **Description**: What it does
- **Source Episode**: Which failure triggered it
- **Generated Code**: Python implementation
- **Validation Result**: Sandbox test results
- **Inspect Code**
- Function signature
- Type hints
- Docstring
- Implementation
- Error handling
- **Test Manually** (Optional)
- Click "Test in Sandbox"
- Provide sample inputs
- Review output
- **Approve or Reject**
- **Approve**: Skill registered to catalog
- **Reject**: Candidate discarded (can be regenerated)
- **Monitor Usage**
Workflow 3: Monitor Evolution Progress
- **Navigate to Mutations**
- **Review Mutation History**
- **Tool Name**: Which skill was mutated
- **Parent**: Original version
- **Mutated Code**: New implementation
- **Status**: pending, passed, failed
- **Fitness Score**: 0.0 to 1.0
- **Compare Variants**
- **Original**: Baseline performance
- **Mutated**: Improved performance
- **Fitness Delta**: Score improvement
- **Approve Promotion** (if applicable)
- Review fitness comparison
- Approve to replace original
- Reject to discard
- **Track Lineage**
Workflow 4: Configure Daily Limits
- **Navigate to Settings**
- **Adjust Limits**
- **Skill Candidates**: 1-20 per day
- **Mutations**: 1-50 per day
- **Consider Factors**
- Workspace size (more agents = higher limits)
- Budget (LLM API costs)
- Review capacity (can you keep up?)
- **Save Changes**
- Limits apply immediately
- Resets at midnight UTC
Workflow 5: Check Graduation Readiness
- **Navigate to Agent**
- **Review Readiness Score**
- Episode count (40% weight)
- Intervention rate (30% weight)
- Constitutional compliance (30% weight)
- **Check Gaps**
- Episodes needed: 10 (INTERN), 25 (SUPERVISED), 50 (AUTONOMOUS)
- Intervention rate target: <50% (INTERN), <20% (SUPERVISED), <5% (AUTONOMOUS)
- Constitutional score target: >0.70 (INTERN), >0.85 (SUPERVISED), >0.95 (AUTONOMOUS)
- **Accelerate Graduation**
- Run more episodes
- Reduce interventions (approve proposals)
- Improve constitutional compliance
- **Wait for Notification**
- Auto-Dev capabilities unlock automatically
- You'll receive notification when ready
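The weighted readiness score described in this workflow can be sketched as follows. Only the 40/30/30 weights come from the guide; the normalization of each component is an assumption:

```python
def readiness_score(episodes: int, episodes_needed: int,
                    intervention_rate: float, intervention_target: float,
                    constitutional: float, constitutional_target: float) -> float:
    """Weighted readiness: 40% episodes, 30% interventions, 30% compliance.
    Each component is clamped to [0, 1]; exact normalization is an assumption."""
    episode_part = min(episodes / episodes_needed, 1.0)
    # Lower intervention rate is better: full credit at or below the target.
    intervention_part = min(intervention_target / max(intervention_rate, 1e-9), 1.0)
    compliance_part = min(constitutional / constitutional_target, 1.0)
    return 0.4 * episode_part + 0.3 * intervention_part + 0.3 * compliance_part
```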
---
Best Practices
1. Start with Memento-Skills
**Why:** Skill generation is lower risk than optimization
**How:**
- Enable Memento-Skills at INTERN level
- Review candidates carefully
- Promote high-quality skills
- Disable if too many low-quality proposals
**Benefits:**
- Expands agent capabilities
- Addresses repeated failures
- Lower maturity requirement (INTERN)
2. Review Candidates Before Promotion
**Why:** LLM-generated code may have bugs or security issues
**How:**
- Always read generated code
- Check for:
- Type hints
- Error handling
- Input validation
- Security issues (hardcoded secrets, unsafe operations)
- Test in sandbox with sample inputs
- Only approve high-quality code
**Red Flags:**
- Missing type hints
- No error handling
- Hardcoded values
- External API calls without rate limiting
- File system operations without validation
3. Monitor Fitness Scores
**Why:** Fitness scores indicate optimization effectiveness
**How:**
- Track fitness trends over time
- Investigate sudden drops
- Celebrate improvements
- Set fitness targets (>0.7 = good, >0.9 = excellent)
**Interpretation:**
- **0.9-1.0**: Excellent (promote immediately)
- **0.7-0.9**: Good (consider promoting)
- **0.5-0.7**: Moderate (needs improvement)
- **0.0-0.5**: Poor (discard or re-mutate)
4. Set Appropriate Daily Limits
**Why:** Prevent resource exhaustion and control costs
**How:**
- Start with defaults (5 candidates, 10 mutations)
- Increase if:
- Large workspace (10+ agents)
- High review capacity
- Sufficient budget
- Decrease if:
- Small workspace (1-3 agents)
- Limited review time
- Budget constraints
**Recommendations:**
| Workspace Size | Skill Candidates | Mutations |
|---|---|---|
| Small (1-3 agents) | 3/day | 5/day |
| Medium (4-10 agents) | 5/day | 10/day |
| Large (10+ agents) | 10/day | 20/day |
5. Use Background Evolution Carefully
**Why:** Automatic mutations can accumulate errors
**How:**
- Only enable for AUTONOMOUS agents
- Set conservative daily limits
- Monitor mutation queue regularly
- Roll back problematic mutations
**When to Enable:**
- Agent has 50+ successful episodes
- Intervention rate <5%
- Constitutional score >0.95
- You have time to review daily
6. Track Lineage
**Why:** Understanding mutation history helps debugging
**How:**
- Use `parent_tool_id` to trace mutations
- Compare variants side-by-side
- Keep original code for rollback
- Document why mutations were made
**Benefits:**
- Easy rollback
- Pattern recognition
- Knowledge sharing
- Debugging aid
7. Combine with Episodic Memory
**Why:** Episodes provide context for learning
**How:**
- Ensure episodes are recorded
- Link episodes to skill candidates
- Use episode search to find relevant context
- Leverage episode feedback for fitness signals
**Integration:**
- Memento uses `source_episode_id`
- AlphaEvolver uses episode analysis
- Fitness signals from episode outcomes
---
Troubleshooting
Issue: No Skill Candidates Generated
**Symptoms:**
- Agent fails tasks repeatedly
- No candidates appear in queue
- ReflectionEngine not triggering
**Diagnosis:**
- Check workspace settings:
- Check agent maturity:
- Check failure threshold:
**Solutions:**
- Enable Memento-Skills in workspace settings
- Graduate agent to INTERN level
- Lower failure threshold to 1-2
- Ensure EpisodeService is recording failures
Issue: Mutations Failing Sandbox
**Symptoms:**
- All mutations show "failed" status
- Sandbox execution errors
- No fitness scores calculated
**Diagnosis:**
- Check Docker availability:
- Check sandbox logs:
- Review mutation code for syntax errors
**Solutions:**
- Install/start Docker daemon
- Fix syntax errors in base code
- Increase sandbox timeout (default: 60s)
- Check memory limits (default: 256MB)
Issue: Fitness Scores Not Updating
**Symptoms:**
- Mutations stuck at "pending" status
- Fitness scores remain None
- External signals not received
**Diagnosis:**
- Check webhook integration:
- Check evaluation status:
- Review FitnessService logs
**Solutions:**
- Configure webhooks for external signals
- Manually trigger delayed evaluation
- Check proxy signals are being recorded
- Ensure `expects_delayed_eval` is set correctly
Issue: Daily Limits Exceeded
**Symptoms:**
- Error: "Daily limit exceeded"
- No new mutations/candidates generated
- Counter at max value
**Diagnosis:**
- Check current usage:
- Check limits:
**Solutions:**
- Wait for midnight UTC reset
- Increase daily limits in settings
- Disable unused capabilities
- Review and discard low-quality items
Issue: Agent Not Graduating
**Symptoms:**
- Agent stuck at current maturity
- Readiness score not increasing
- No graduation notifications
**Diagnosis:**
- Check graduation criteria:
- Review episode count:
- Check intervention rate:
**Solutions:**
- Run more episodes (automated tasks)
- Reduce interventions (approve proposals)
- Improve constitutional compliance
- Wait for automatic graduation evaluation
Issue: Generated Code Has Security Issues
**Symptoms:**
- Candidate contains hardcoded secrets
- Code accesses unsafe resources
- Missing input validation
**Diagnosis:**
- Review generated code carefully
- Check for security patterns:
- Hardcoded API keys
- SQL injection risks
- Path traversal vulnerabilities
- Missing authentication
**Solutions:**
- **Reject** the candidate immediately
- Report security issue to improve LLM prompts
- Manually create secure version
- Enable security scanning in sandbox
---
FAQ
Q: What's the difference between Memento-Skills and AlphaEvolver?
**A:**
- **Memento-Skills** creates NEW capabilities from failures (feature expansion)
- **AlphaEvolver** improves EXISTING capabilities through mutation (optimization)
**Analogy:**
- Memento = "I don't have a hammer, let me invent one"
- AlphaEvolver = "This hammer is heavy, let me make it lighter"
Q: Do I need to enable both learning loops?
**A:** No, they're independent:
- Enable only Memento-Skills for skill generation
- Enable only AlphaEvolver for optimization
- Enable both for full self-evolution
Q: Can agents edit their own code without limits?
**A:** No, multiple guardrails exist:
- **Maturity gates**: INTERN/SUPERVISED/AUTONOMOUS requirements
- **Workspace settings**: Master toggle + per-capability toggles
- **Sandbox validation**: All code tested before promotion
- **User approval**: Required for promotion (INTERN/SUPERVISED)
- **Daily limits**: Prevent resource exhaustion
Q: What happens if a mutation breaks a skill?
**A:** Lineage tracking enables rollback:
- Original code preserved via `parent_tool_id`
- Fitness scores detect degradation
- You can reject mutations
- Rollback to previous version
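A rollback via lineage amounts to walking the `parent_tool_id` chain. The registry shape below is an illustrative assumption:

```python
def rollback_chain(tool_id: str, registry: dict) -> list:
    """Walk parent_tool_id links from a mutated tool back to its root,
    returning the lineage (newest first). Registry shape is an assumption."""
    chain = []
    current = tool_id
    while current is not None:
        chain.append(current)
        current = registry[current].get("parent_tool_id")
    return chain
```

Any entry in the returned chain is a candidate to restore if the newest variant degrades.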
Q: How often should I review skill candidates?
**A:** Depends on workspace activity:
- **High activity** (10+ agents): Daily review recommended
- **Medium activity** (4-10 agents): Weekly review sufficient
- **Low activity** (1-3 agents): Review as needed
**Best practice:** Set aside dedicated time each week to review and approve/reject candidates.
Q: Can I manually create skill candidates?
**A:** Yes, via API:
```python
from core.auto_dev.memento_engine import MementoEngine

engine = MementoEngine(db)
candidate = await engine.generate_skill_candidate(
    tenant_id="your-tenant",
    agent_id="your-agent",
    episode_id="episode-123",
)
```

Q: How do I disable Auto-Dev?
**A:** Two levels:
- **Workspace level**: Disable master toggle
- **Capability level**: Disable specific capabilities
Q: What happens to existing skills/mutations when Auto-Dev is disabled?
**A:** They persist:
- Existing skills remain available
- Mutation history preserved
- Can re-enable anytime
- No data loss
Q: Can I export skills learned via Auto-Dev?
**A:** Yes, skills are standard Python packages:
- Navigate to skill directory
- Export as ZIP
- Share with other workspaces
- Import via skill marketplace
Q: How much does Auto-Dev cost?
**A:** Two cost components:
- **LLM API calls**: For code generation
- Memento: ~500-1000 tokens per candidate
- AlphaEvolver: ~300-800 tokens per mutation
- **Sandbox execution**: Minimal (Docker resources)
**Cost estimation:**
- 5 skill candidates/day: ~$0.05-0.10/day
- 10 mutations/day: ~$0.03-0.08/day
- **Total**: ~$0.08-0.18/day for active workspace
Q: Is Auto-Dev safe for production workloads?
**A:** Yes, with safeguards:
- All code validated in sandbox
- User approval required (INTERN/SUPERVISED)
- Daily limits prevent runaway processes
- Lineage tracking enables rollback
- Tenant isolation enforced
**Recommendation:** Start with non-critical workloads, monitor closely, then expand.
Q: Can I use Auto-Dev with custom LLM providers?
**A:** Yes, Auto-Dev uses LLMService abstraction:
- OpenAI, Anthropic, DeepSeek, Gemini, etc.
- Configure in workspace settings
- Auto-Dev automatically uses configured provider
---
See Also
- AUTO_DEV_API_REFERENCE.md - Complete API documentation
- AUTO_DEV_DEVELOPER_GUIDE.md - Developer guide
- AUTO_DEV_EVENT_PROTOCOL.md - Event protocol
- AUTO_DEV_INTEGRATION_GUIDE.md - Deployment and monitoring
- AUTO_DEV_ARCHITECTURE.md - System architecture