Auto-Dev User Guide
Learn how to use Auto-Dev's self-evolving agent capabilities to help your agents learn from experience and automatically improve their skills over time.
**Version:** 1.0.0
**Last Updated:** 2026-04-10
---
Table of Contents
- Introduction
- Understanding Auto-Dev
- Memento-Skills Learning Loop
- AlphaEvolver Learning Loop
- Capability Gates
- Common Workflows
- Best Practices
- Troubleshooting
- FAQ
---
Introduction
What is Auto-Dev?
Auto-Dev is a self-evolving agent system that enables AI agents to learn from their experiences and automatically improve their capabilities over time. Instead of manually writing and updating skills, you can let your agents learn from failures and optimize their own code.
**Key Benefits:**
- **Automatic Skill Generation**: Agents create new skills when they encounter repeated failures
- **Continuous Optimization**: Existing skills are refined through iterative mutation and testing
- **Safe by Default**: All changes are validated in sandboxed environments before promotion
- **Maturity-Gated**: Features unlock as agents demonstrate competence
Two Learning Loops
Auto-Dev provides two complementary learning loops:
- **Memento-Skills** (Skill Generation) - Creates NEW capabilities from failures
- **AlphaEvolver** (Skill Optimization) - Improves EXISTING capabilities through mutation
Both loops are gated by agent maturity level, ensuring agents have proven competence before accessing self-evolution features.
---
Understanding Auto-Dev
How It Works
```
┌─────────────────────────────────────────────────────────────┐
│                     Agent Executes Task                     │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
                   ┌─────────────────┐
                   │   Task Fails?   │
                   └─────────────────┘
                    Yes │       │ No
                        ▼       ▼
              ┌──────────────┐ ┌──────────────┐
              │ MementoSkills│ │ AlphaEvolver │
              │   (INTERN)   │ │ (SUPERVISED) │
              └──────────────┘ └──────────────┘
                      │               │
                      ▼               ▼
              ┌──────────────┐ ┌──────────────┐
              │ Generate New │ │   Mutate &   │
              │    Skill     │ │   Optimize   │
              └──────────────┘ └──────────────┘
                      │               │
                      ▼               ▼
              ┌──────────────┐ ┌──────────────┐
              │   Sandbox    │ │   Sandbox    │
              │  Validation  │ │  Validation  │
              └──────────────┘ └──────────────┘
                      │               │
                      ▼               ▼
              ┌──────────────┐ ┌──────────────┐
              │     User     │ │   Fitness    │
              │   Approval   │ │  Comparison  │
              └──────────────┘ └──────────────┘
```

Agent Maturity Levels
Auto-Dev capabilities unlock as agents progress through maturity levels:
| Maturity Level | Memento-Skills | AlphaEvolver | Background Evolution |
|---|---|---|---|
| **STUDENT** | ❌ Blocked | ❌ Blocked | ❌ Blocked |
| **INTERN** | ✅ Enabled | ❌ Blocked | ❌ Blocked |
| **SUPERVISED** | ✅ Enabled | ✅ Enabled | ❌ Blocked |
| **AUTONOMOUS** | ✅ Enabled | ✅ Enabled | ✅ Enabled |
**How Agents Graduate:**
- Complete episodes successfully
- Maintain low intervention rates
- Demonstrate constitutional compliance
- Receive positive user feedback
---
Memento-Skills Learning Loop
Overview
Memento-Skills generates **new capabilities** when agents fail tasks repeatedly. It analyzes failure patterns, creates skill proposals, and validates them before promotion.
When It Triggers
Memento-Skills activates when:
- Agent maturity is **INTERN** or higher
- Auto-Dev is enabled in workspace settings
- Agent fails the **same task** 2+ times (configurable threshold)
Step-by-Step Workflow
Step 1: Failure Detection
Agent attempts task → Fails → Episode recorded

The system records:
- Task description
- Error trace
- Tools attempted
- Execution context
Step 2: Pattern Recognition
ReflectionEngine monitors failures → Detects pattern → Triggers Memento

The ReflectionEngine:
- Buffers recent failures per agent
- Compares task descriptions for similarity
- Triggers when threshold exceeded (default: 2 similar failures)
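As a rough illustration, the buffer-and-threshold logic might look like the following sketch. The `FailureBuffer` class and its word-overlap similarity metric are illustrative assumptions, not the actual ReflectionEngine implementation:

```python
from collections import defaultdict, deque

class FailureBuffer:
    """Illustrative sketch: buffer recent failures per agent and
    fire once enough similar task descriptions accumulate."""

    def __init__(self, threshold: int = 2, max_buffer: int = 20):
        self.threshold = threshold
        self.buffers = defaultdict(lambda: deque(maxlen=max_buffer))

    @staticmethod
    def _similar(a: str, b: str) -> bool:
        # Crude word-overlap similarity; the real engine may use embeddings.
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1) > 0.5

    def record_failure(self, agent_id: str, task: str) -> bool:
        """Return True when the threshold of similar failures is reached."""
        buf = self.buffers[agent_id]
        similar = sum(1 for prev in buf if self._similar(prev, task)) + 1
        buf.append(task)
        return similar >= self.threshold
```

A production implementation would likely compare embeddings rather than raw word overlap, but the shape of the check is the same: buffer per agent, count similar entries, fire at the threshold.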
Step 3: Episode Analysis
MementoEngine analyzes episode → Extracts failure pattern

Analysis includes:
- What task was attempted
- What went wrong (error trace)
- What tools were tried
- Suggested skill name
Step 4: Skill Generation
LLM proposes new skill → Python code generated

The LLM creates:
- Function with clear name
- Type hints on parameters
- Docstring explaining purpose
- Error handling
- Self-contained logic
**Example Generated Skill:**
```python
def extract_invoice_id(email_body: str) -> str | None:
    """
    Extract invoice ID from email body using regex patterns.

    Args:
        email_body: Email text content

    Returns:
        Invoice ID if found, None otherwise
    """
    import re

    # Pattern: INV followed by digits
    pattern = r'INV[-_]?(\d{4,})'
    match = re.search(pattern, email_body, re.IGNORECASE)
    return match.group(1) if match else None
```

Step 5: Sandbox Validation

Generated skill → Executed in sandbox → Results captured

Validation checks:
- Syntax correctness
- Execution success
- Output format
- Error handling
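The first two checks can be illustrated with a standalone sketch. The `validate_candidate` helper is purely illustrative; the production sandbox runs code in an isolated container rather than calling `exec()` in-process:

```python
import ast

def validate_candidate(code: str, entrypoint: str, sample_args: tuple) -> dict:
    """Illustrative two-stage check: parse for syntax correctness,
    then execute the entrypoint with sample inputs."""
    try:
        ast.parse(code)  # Stage 1: syntax correctness
    except SyntaxError as exc:
        return {"syntax_ok": False, "error": str(exc)}
    namespace: dict = {}
    try:
        # Stage 2: execution success. exec() here is only for illustration;
        # the real system isolates execution in a sandboxed container.
        exec(code, namespace)
        result = namespace[entrypoint](*sample_args)
        return {"syntax_ok": True, "executed": True, "result": result}
    except Exception as exc:
        return {"syntax_ok": True, "executed": False, "error": str(exc)}
```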
Step 6: User Review
Skill candidate queued → You review → Approve or reject

You'll see:
- Skill name and description
- Generated code
- Validation results
- Failure pattern context
Step 7: Promotion
Approved skill → Registered to skill catalog → Agent can use it

The skill becomes:
- Available to all agents in workspace
- Version controlled (v1.0.0)
- Tracked in skill registry
Example Scenario
**Initial State:**
- Agent: INTERN maturity
- Task: "Extract invoice ID from customer emails"
- Result: Fails repeatedly (no matching skill)
**After 2 Failures:**
- ReflectionEngine detects pattern
- MementoEngine analyzes episodes
- LLM generates `extract_invoice_id()` skill
- Sandbox validates successfully
- You review and approve
- Skill registered
**Future:**
- Agent successfully extracts invoice IDs
- No more failures on this task
- Skill available to other agents
---
AlphaEvolver Learning Loop
Overview
AlphaEvolver **optimizes existing skills** through iterative mutation and fitness comparison. It's like A/B testing for code - small variations are tried, measured, and the best wins.
When It Triggers
AlphaEvolver activates when:
- Agent maturity is **SUPERVISED** or higher
- Auto-Dev is enabled in workspace settings
- Skill execution shows optimization opportunities:
- High latency (>5 seconds)
- High token usage (>5000 tokens)
- Partial failures or retries
Step-by-Step Workflow
Step 1: Performance Monitoring
Skill executes → Metrics captured → EvolutionEngine evaluates

Tracked metrics:
- Execution time
- Token consumption
- Success rate
- Error patterns
Step 2: Optimization Trigger
Threshold exceeded → EvolutionEngine triggers AlphaEvolver

Trigger conditions:
- `execution_seconds > 5.0`
- `token_usage > 5000`
- `success == False`
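These conditions amount to a simple disjunction; as a minimal sketch (the `SkillMetrics` field names are assumptions mirroring the conditions above):

```python
from dataclasses import dataclass

# Default thresholds from the trigger conditions above.
LATENCY_LIMIT_S = 5.0
TOKEN_LIMIT = 5000

@dataclass
class SkillMetrics:
    execution_seconds: float
    token_usage: int
    success: bool

def should_trigger_evolver(m: SkillMetrics) -> bool:
    """Trigger AlphaEvolver when any optimization threshold is exceeded."""
    return (
        m.execution_seconds > LATENCY_LIMIT_S
        or m.token_usage > TOKEN_LIMIT
        or not m.success
    )
```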
Step 3: Code Mutation
Original skill → LLM mutates → New variant created

Mutation prompts:
- "Reduce execution time by 50%"
- "Optimize for lower token usage"
- "Fix intermittent failures"
**Example Mutation:**
**Original:**
```python
def process_invoice(invoice_id: str) -> dict | None:
    # Linear search through 10K invoices
    for invoice in invoices:
        if invoice["id"] == invoice_id:
            return invoice
    return None
```

**Mutated:**

```python
def process_invoice(invoice_id: str) -> dict | None:
    # O(1) dictionary lookup
    return invoice_index.get(invoice_id)
```

Step 4: Sandbox Execution
Mutated code → Executed in sandbox → Proxy signals captured

Proxy signals:
- `execution_success`: Ran without crashing
- `syntax_error`: Set when the code fails to compile
- `execution_latency_ms`: Runtime performance
- `user_approved_proposal`: HITL feedback
Step 5: Fitness Evaluation
FitnessService calculates score → Variant ranked

**Stage 1: Initial Proxy Score**

```python
score = 0.0
if not syntax_error:
    score += 0.2  # Survived syntax check
if execution_success:
    score += 0.3  # Ran successfully
if user_approved:
    score += 0.5  # Human approval
```

**Stage 2: Delayed External Signals**
```python
if invoice_created:
    score += 0.4  # Business value
if crm_conversion:
    score += 0.5  # Downstream success
if conversion_value:
    score += min(0.5, conversion_value / 1000)  # Scaled value
```

Step 6: Variant Comparison
Original vs. Mutated → Higher fitness wins

**Example:**
- Original fitness: 0.65 (slow but works)
- Mutated fitness: 0.85 (fast and works)
- **Winner: Mutated variant**
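The two-stage scoring shown in Step 5 can be consolidated into a single runnable sketch. Signal names follow the snippets above; capping the total at 1.0 is an assumption, matching the 0.0-1.0 fitness range used elsewhere in this guide:

```python
def fitness_score(signals: dict) -> float:
    """Combine proxy and delayed signals into one fitness value.
    The cap at 1.0 is an assumption (fitness is displayed as 0.0-1.0)."""
    score = 0.0
    # Stage 1: proxy signals from sandbox execution
    if not signals.get("syntax_error", False):
        score += 0.2
    if signals.get("execution_success", False):
        score += 0.3
    if signals.get("user_approved", False):
        score += 0.5
    # Stage 2: delayed external signals
    if signals.get("invoice_created", False):
        score += 0.4
    if signals.get("crm_conversion", False):
        score += 0.5
    value = signals.get("conversion_value", 0)
    if value:
        score += min(0.5, value / 1000)
    return min(score, 1.0)
```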
Step 7: Research Mode (Optional)
Iterative mutation → Progressive improvement → Best variant selected

For AUTONOMOUS agents, AlphaEvolver can run multi-iteration experiments:
Iteration 1: Base code → Mutate → Test → Keep winner
Iteration 2: Winner → Mutate → Test → Keep winner
Iteration 3: Winner → Mutate → Test → Final best

Example Scenario
**Initial State:**
- Skill: `send_slack_notification()`
- Agent: SUPERVISED maturity
- Performance: 7 seconds, 6000 tokens
- Issue: Too slow for high-volume usage
**After Trigger:**
- EvolutionEngine detects high latency
- AlphaEvolver generates mutation
- Sandbox tests both versions
- Fitness scores calculated
- Mutated variant wins (2 seconds, 2000 tokens)
- Queued for your review
**After Approval:**
- Skill updated to optimized version
- Future executions faster and cheaper
- Lineage tracked (parent_tool_id)
---
Capability Gates
Workspace Settings
Auto-Dev must be enabled at workspace level:
```json
{
  "auto_dev": {
    "enabled": true,
    "memento_skills": true,
    "alpha_evolver": true,
    "background_evolution": false,
    "max_mutations_per_day": 10,
    "max_skill_candidates_per_day": 5
  }
}
```

Configuration Options
`enabled` (boolean)
- Master toggle for all Auto-Dev features
- Default: `false`
`memento_skills` (boolean)
- Enable Memento-Skills (skill generation)
- Requires: INTERN maturity
- Default: `true`
`alpha_evolver` (boolean)
- Enable AlphaEvolver (skill optimization)
- Requires: SUPERVISED maturity
- Default: `true`
`background_evolution` (boolean)
- Enable automatic background optimization
- Requires: AUTONOMOUS maturity
- Default: `false` (explicit opt-in)
`max_mutations_per_day` (integer)
- Daily limit on AlphaEvolver mutations
- Default: `10`
`max_skill_candidates_per_day` (integer)
- Daily limit on Memento skill proposals
- Default: `5`
Maturity Requirements Summary
| Capability | Minimum Maturity | Workspace Setting | Daily Limit |
|---|---|---|---|
| Memento-Skills | INTERN | memento_skills: true | 5 candidates |
| AlphaEvolver | SUPERVISED | alpha_evolver: true | 10 mutations |
| Background Evolution | AUTONOMOUS | background_evolution: true | 10 mutations |
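The gate check combines the workspace toggles with the maturity requirements from the table above. As a sketch (the enum and function names are illustrative, not the actual API):

```python
from enum import IntEnum

class Maturity(IntEnum):
    STUDENT = 0
    INTERN = 1
    SUPERVISED = 2
    AUTONOMOUS = 3

# Minimum maturity per capability, per the table above.
REQUIREMENTS = {
    "memento_skills": Maturity.INTERN,
    "alpha_evolver": Maturity.SUPERVISED,
    "background_evolution": Maturity.AUTONOMOUS,
}

def capability_allowed(capability: str, maturity: Maturity, settings: dict) -> bool:
    """A capability needs the master toggle, its own toggle, AND maturity."""
    return (
        settings.get("enabled", False)
        and settings.get(capability, False)
        and maturity >= REQUIREMENTS[capability]
    )
```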
Capability Unlock Notifications
When an agent graduates to a new maturity level, you'll receive:
```json
{
  "type": "auto_dev_capability_unlocked",
  "agent_id": "agent-123",
  "capability": "auto_dev.alpha_evolver",
  "message": "Agent has graduated to use Alpha Evolver. Enable it in Settings > Auto-Dev to activate.",
  "action_required": true
}
```

---
Common Workflows
Workflow 1: Enable Auto-Dev for Your Workspace
- **Check Agent Maturity**
- INTERN or higher required for Memento-Skills
- SUPERVISED or higher required for AlphaEvolver
- **Enable Auto-Dev**
- **Configure Capabilities**
- ✅ Memento-Skills
- ✅ AlphaEvolver
- ❌ Background Evolution (until AUTONOMOUS)
- **Set Daily Limits**
- Max skill candidates: 5/day
- Max mutations: 10/day
- **Save Changes**
- Auto-Dev is now active for your workspace
Workflow 2: Review Skill Candidates
- **Navigate to Candidates**
- **Review Pending Candidates**
- **Skill Name**: Proposed function name
- **Description**: What it does
- **Source Episode**: Which failure triggered it
- **Generated Code**: Python implementation
- **Validation Result**: Sandbox test results
- **Inspect Code**
- Function signature
- Type hints
- Docstring
- Implementation
- Error handling
- **Test Manually** (Optional)
- Click "Test in Sandbox"
- Provide sample inputs
- Review output
- **Approve or Reject**
- **Approve**: Skill registered to catalog
- **Reject**: Candidate discarded (can be regenerated)
- **Monitor Usage**
Workflow 3: Monitor Evolution Progress
- **Navigate to Mutations**
- **Review Mutation History**
- **Tool Name**: Which skill was mutated
- **Parent**: Original version
- **Mutated Code**: New implementation
- **Status**: pending, passed, failed
- **Fitness Score**: 0.0 to 1.0
- **Compare Variants**
- **Original**: Baseline performance
- **Mutated**: Improved performance
- **Fitness Delta**: Score improvement
- **Approve Promotion** (if applicable)
- Review fitness comparison
- Approve to replace original
- Reject to discard
- **Track Lineage**
Workflow 4: Configure Daily Limits
- **Navigate to Settings**
- **Adjust Limits**
- **Skill Candidates**: 1-20 per day
- **Mutations**: 1-50 per day
- **Consider Factors**
- Workspace size (more agents = higher limits)
- Budget (LLM API costs)
- Review capacity (can you keep up?)
- **Save Changes**
- Limits apply immediately
- Resets at midnight UTC
Workflow 5: Check Graduation Readiness
- **Navigate to Agent**
- **Review Readiness Score**
- Episode count (40% weight)
- Intervention rate (30% weight)
- Constitutional compliance (30% weight)
- **Check Gaps**
- Episodes needed: 10 (INTERN), 25 (SUPERVISED), 50 (AUTONOMOUS)
- Intervention rate target: <50% (INTERN), <20% (SUPERVISED), <5% (AUTONOMOUS)
- Constitutional score target: >0.70 (INTERN), >0.85 (SUPERVISED), >0.95 (AUTONOMOUS)
- **Accelerate Graduation**
- Run more episodes
- Reduce interventions (approve proposals)
- Improve constitutional compliance
- **Wait for Notification**
- Auto-Dev capabilities unlock automatically
- You'll receive notification when ready
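The weighted readiness score described in this workflow can be sketched as follows. Only the 40/30/30 weights come from the guide; the normalization of each component is an assumption:

```python
def readiness_score(episodes: int, episodes_needed: int,
                    intervention_rate: float, intervention_target: float,
                    constitutional: float, constitutional_target: float) -> float:
    """Weighted readiness: 40% episodes, 30% interventions, 30% compliance.
    Each component is clamped to [0, 1]; exact normalization is an assumption."""
    episode_part = min(episodes / episodes_needed, 1.0)
    # Lower intervention rate is better: full credit at or below the target.
    intervention_part = min(intervention_target / max(intervention_rate, 1e-9), 1.0)
    compliance_part = min(constitutional / constitutional_target, 1.0)
    return 0.4 * episode_part + 0.3 * intervention_part + 0.3 * compliance_part
```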
---
Best Practices
1. Start with Memento-Skills
**Why:** Skill generation is lower risk than optimization
**How:**
- Enable Memento-Skills at INTERN level
- Review candidates carefully
- Promote high-quality skills
- Disable if too many low-quality proposals
**Benefits:**
- Expands agent capabilities
- Addresses repeated failures
- Lower maturity requirement (INTERN)
2. Review Candidates Before Promotion
**Why:** LLM-generated code may have bugs or security issues
**How:**
- Always read generated code
- Check for:
- Type hints
- Error handling
- Input validation
- Security issues (hardcoded secrets, unsafe operations)
- Test in sandbox with sample inputs
- Only approve high-quality code
**Red Flags:**
- Missing type hints
- No error handling
- Hardcoded values
- External API calls without rate limiting
- File system operations without validation
3. Monitor Fitness Scores
**Why:** Fitness scores indicate optimization effectiveness
**How:**
- Track fitness trends over time
- Investigate sudden drops
- Celebrate improvements
- Set fitness targets (>0.7 = good, >0.9 = excellent)
**Interpretation:**
- **0.9-1.0**: Excellent (promote immediately)
- **0.7-0.9**: Good (consider promoting)
- **0.5-0.7**: Moderate (needs improvement)
- **0.0-0.5**: Poor (discard or re-mutate)
4. Set Appropriate Daily Limits
**Why:** Prevent resource exhaustion and control costs
**How:**
- Start with defaults (5 candidates, 10 mutations)
- Increase if:
- Large workspace (10+ agents)
- High review capacity
- Sufficient budget
- Decrease if:
- Small workspace (1-3 agents)
- Limited review time
- Budget constraints
**Recommendations:**
| Workspace Size | Skill Candidates | Mutations |
|---|---|---|
| Small (1-3 agents) | 3/day | 5/day |
| Medium (4-10 agents) | 5/day | 10/day |
| Large (10+ agents) | 10/day | 20/day |
5. Use Background Evolution Carefully
**Why:** Automatic mutations can accumulate errors
**How:**
- Only enable for AUTONOMOUS agents
- Set conservative daily limits
- Monitor mutation queue regularly
- Roll back problematic mutations
**When to Enable:**
- Agent has 50+ successful episodes
- Intervention rate <5%
- Constitutional score >0.95
- You have time to review daily
6. Track Lineage
**Why:** Understanding mutation history helps debugging
**How:**
- Use `parent_tool_id` to trace mutations
- Compare variants side-by-side
- Keep original code for rollback
- Document why mutations were made
**Benefits:**
- Easy rollback
- Pattern recognition
- Knowledge sharing
- Debugging aid
7. Combine with Episodic Memory
**Why:** Episodes provide context for learning
**How:**
- Ensure episodes are recorded
- Link episodes to skill candidates
- Use episode search to find relevant context
- Leverage episode feedback for fitness signals
**Integration:**
- Memento uses `source_episode_id`
- AlphaEvolver uses episode analysis
- Fitness signals from episode outcomes
---
Troubleshooting
Issue: No Skill Candidates Generated
**Symptoms:**
- Agent fails tasks repeatedly
- No candidates appear in queue
- ReflectionEngine not triggering
**Diagnosis:**
- Check workspace settings:
- Check agent maturity:
- Check failure threshold:
**Solutions:**
- Enable Memento-Skills in workspace settings
- Graduate agent to INTERN level
- Lower failure threshold to 1-2
- Ensure EpisodeService is recording failures
Issue: Mutations Failing Sandbox
**Symptoms:**
- All mutations show "failed" status
- Sandbox execution errors
- No fitness scores calculated
**Diagnosis:**
- Check Docker availability:
- Check sandbox logs:
- Review mutation code for syntax errors
**Solutions:**
- Install/start Docker daemon
- Fix syntax errors in base code
- Increase sandbox timeout (default: 60s)
- Check memory limits (default: 256MB)
Issue: Fitness Scores Not Updating
**Symptoms:**
- Mutations stuck at "pending" status
- Fitness scores remain None
- External signals not received
**Diagnosis:**
- Check webhook integration:
- Check evaluation status:
- Review FitnessService logs
**Solutions:**
- Configure webhooks for external signals
- Manually trigger delayed evaluation
- Check proxy signals are being recorded
- Ensure `expects_delayed_eval` is set correctly
Issue: Daily Limits Exceeded
**Symptoms:**
- Error: "Daily limit exceeded"
- No new mutations/candidates generated
- Counter at max value
**Diagnosis:**
- Check current usage:
- Check limits:
**Solutions:**
- Wait for midnight UTC reset
- Increase daily limits in settings
- Disable unused capabilities
- Review and discard low-quality items
Issue: Agent Not Graduating
**Symptoms:**
- Agent stuck at current maturity
- Readiness score not increasing
- No graduation notifications
**Diagnosis:**
- Check graduation criteria:
- Review episode count:
- Check intervention rate:
**Solutions:**
- Run more episodes (automated tasks)
- Reduce interventions (approve proposals)
- Improve constitutional compliance
- Wait for automatic graduation evaluation
Issue: Generated Code Has Security Issues
**Symptoms:**
- Candidate contains hardcoded secrets
- Code accesses unsafe resources
- Missing input validation
**Diagnosis:**
- Review generated code carefully
- Check for security patterns:
- Hardcoded API keys
- SQL injection risks
- Path traversal vulnerabilities
- Missing authentication
**Solutions:**
- **Reject** the candidate immediately
- Report security issue to improve LLM prompts
- Manually create secure version
- Enable security scanning in sandbox
---
FAQ
Q: What's the difference between Memento-Skills and AlphaEvolver?
**A:**
- **Memento-Skills** creates NEW capabilities from failures (feature expansion)
- **AlphaEvolver** improves EXISTING capabilities through mutation (optimization)
**Analogy:**
- Memento = "I don't have a hammer, let me invent one"
- AlphaEvolver = "This hammer is heavy, let me make it lighter"
Q: Do I need to enable both learning loops?
**A:** No, they're independent:
- Enable only Memento-Skills for skill generation
- Enable only AlphaEvolver for optimization
- Enable both for full self-evolution
Q: Can agents edit their own code without limits?
**A:** No, multiple guardrails exist:
- **Maturity gates**: INTERN/SUPERVISED/AUTONOMOUS requirements
- **Workspace settings**: Master toggle + per-capability toggles
- **Sandbox validation**: All code tested before promotion
- **User approval**: Required for promotion (INTERN/SUPERVISED)
- **Daily limits**: Prevent resource exhaustion
Q: What happens if a mutation breaks a skill?
**A:** Lineage tracking enables rollback:
- Original code preserved via `parent_tool_id`
- Fitness scores detect degradation
- You can reject mutations
- Rollback to previous version
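A rollback via lineage amounts to walking the `parent_tool_id` chain. The registry shape below is an illustrative assumption:

```python
def rollback_chain(tool_id: str, registry: dict) -> list:
    """Walk parent_tool_id links from a mutated tool back to its root,
    returning the lineage (newest first). Registry shape is an assumption."""
    chain = []
    current = tool_id
    while current is not None:
        chain.append(current)
        current = registry[current].get("parent_tool_id")
    return chain
```

Any entry in the returned chain is a candidate to restore if the newest variant degrades.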
Q: How often should I review skill candidates?
**A:** Depends on workspace activity:
- **High activity** (10+ agents): Daily review recommended
- **Medium activity** (4-10 agents): Weekly review sufficient
- **Low activity** (1-3 agents): Review as needed
**Best practice:** Set aside dedicated time each week to review and approve/reject candidates.
Q: Can I manually create skill candidates?
**A:** Yes, via API:
```python
from core.auto_dev.memento_engine import MementoEngine

engine = MementoEngine(db)
candidate = await engine.generate_skill_candidate(
    tenant_id="your-tenant",
    agent_id="your-agent",
    episode_id="episode-123",
)
```

Q: How do I disable Auto-Dev?
**A:** Two levels:
- **Workspace level**: Disable master toggle
- **Capability level**: Disable specific capabilities
Q: What happens to existing skills/mutations when Auto-Dev is disabled?
**A:** They persist:
- Existing skills remain available
- Mutation history preserved
- Can re-enable anytime
- No data loss
Q: Can I export skills learned via Auto-Dev?
**A:** Yes, skills are standard Python packages:
- Navigate to skill directory
- Export as ZIP
- Share with other workspaces
- Import via skill marketplace
Q: How much does Auto-Dev cost?
**A:** Two cost components:
- **LLM API calls**: For code generation
- Memento: ~500-1000 tokens per candidate
- AlphaEvolver: ~300-800 tokens per mutation
- **Sandbox execution**: Minimal (Docker resources)
**Cost estimation:**
- 5 skill candidates/day: ~$0.05-0.10/day
- 10 mutations/day: ~$0.03-0.08/day
- **Total**: ~$0.08-0.18/day for active workspace
Q: Is Auto-Dev safe for production workloads?
**A:** Yes, with safeguards:
- All code validated in sandbox
- User approval required (INTERN/SUPERVISED)
- Daily limits prevent runaway processes
- Lineage tracking enables rollback
- Tenant isolation enforced
**Recommendation:** Start with non-critical workloads, monitor closely, then expand.
Q: Can I use Auto-Dev with custom LLM providers?
**A:** Yes, Auto-Dev uses LLMService abstraction:
- OpenAI, Anthropic, DeepSeek, Gemini, etc.
- Configure in workspace settings
- Auto-Dev automatically uses configured provider
---
See Also
- AUTO_DEV_API_REFERENCE.md - Complete API documentation
- AUTO_DEV_DEVELOPER_GUIDE.md - Developer guide
- AUTO_DEV_EVENT_PROTOCOL.md - Event protocol
- AUTO_DEV_INTEGRATION_GUIDE.md - Deployment and monitoring
- AUTO_DEV_ARCHITECTURE.md - System architecture