Training Maturity Administration Guide
Table of Contents
- System Architecture Overview
- Configuration and Settings
- Monitoring and Analytics
- Troubleshooting Guide
- Performance Tuning
- Security and Compliance
- Backup and Recovery
- Advanced Configuration
- Monitoring and Alerting Setup
---
1. System Architecture Overview
1.1 Component Architecture
┌─────────────────────────────────────────────────────────────┐
│ Frontend (Next.js) │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Proposal │ │ Supervision │ │ Training │ │
│ │ Management │ │ Dashboard │ │ Sessions │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
└─────────┼─────────────────┼─────────────────┼──────────────┘
│ │ │
│ WebSocket │ REST API │ REST API
│ │ │
┌─────────┼─────────────────┼─────────────────┼──────────────┐
│ │ Backend (FastAPI) │ │
│ ┌──────▼───────┐ ┌──────┴───────┐ ┌─────┴────────┐ │
│ │ API │ │ Training │ │ Metrics │ │
│ │ Routes │ │ Services │ │ Collector │ │
│ └──────┬───────┘ └──────┬───────┘ └─────┬────────┘ │
│ │ │ │ │
│ ┌──────▼─────────────────▼─────────────────▼────────┐ │
│ │ Database (PostgreSQL) │ │
│ │ - AgentProposal │ │
│ │ - SupervisionSession │ │
│ │ - TrainingSession │ │
│ │ - TrainingMetrics │ │
│ └──────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Cache (Redis) │ │
│ │ - Dashboard data (5-min TTL) │ │
│ │ - Metrics cache (1-hour TTL) │ │
│ │ - Alert deduplication (10-min TTL) │ │
│ └──────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────┘

1.2 Data Flow
**Training Proposal Flow:**
1. Trigger blocked → TriggerInterceptor
2. Proposal created → StudentTrainingService
3. Metric recorded → MetricsCollector
4. Proposal stored → PostgreSQL
5. WebSocket notification → Frontend

**Supervision Flow:**
1. SUPERVISED agent triggered → SupervisionService
2. Session created → PostgreSQL
3. WebSocket connection established
4. Agent actions streamed → Frontend
5. Interventions sent → SupervisionService
6. Metrics recorded → MetricsCollector
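Concretely, the proposal flow is a create → record → notify pipeline, with metrics buffered and flushed in batches (cf. `METRICS_BATCH_SIZE` in section 2.1). The sketch below mirrors it with in-memory stand-ins; `AgentProposal`, `MetricsCollector`, and `handle_blocked_trigger` here are illustrative names, not the actual service APIs, and persistence plus the WebSocket push are reduced to a comment:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentProposal:
    agent_id: str
    trigger: str
    status: str = "pending"
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class MetricsCollector:
    """Buffers metrics and flushes them in batches."""
    def __init__(self, batch_size: int = 100):
        self.batch_size = batch_size
        self.buffer: list[tuple[str, float]] = []
        self.flushed: list[list[tuple[str, float]]] = []

    def record(self, name: str, value: float) -> None:
        self.buffer.append((name, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flushed.append(self.buffer)
            self.buffer = []

def handle_blocked_trigger(agent_id: str, trigger: str,
                           collector: MetricsCollector) -> AgentProposal:
    # 1. Trigger blocked -> proposal created
    proposal = AgentProposal(agent_id=agent_id, trigger=trigger)
    # 2. Metric recorded (buffered until the batch is full)
    collector.record("training.proposal_created", 1.0)
    # 3. In the real system the proposal is now stored in PostgreSQL
    #    and a WebSocket notification is pushed to the frontend.
    return proposal

collector = MetricsCollector(batch_size=2)
proposal = handle_blocked_trigger("agent-1", "send_email", collector)
print(proposal.status)  # pending
```

The batching mirrors `METRICS_FLUSH_INTERVAL`/`METRICS_BATCH_SIZE`: nothing is written per-metric; a flush happens when the buffer fills (or, in the real collector, on a timer).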
1.3 Database Schema
**Key Tables:**
- `agent_proposals` - Training and action proposals
- `supervision_sessions` - Real-time supervision sessions
- `training_sessions` - Training session records
- `training_metrics` - Time-series metrics (high volume)
- `training_metrics_aggregated` - Materialized view (5-min refresh)
**Indexes:**
- `idx_agent_proposals_tenant_status` - (tenant_id, status)
- `idx_supervision_sessions_agent_status` - (agent_id, status)
- `idx_training_metrics_tenant_metric_time` - (tenant_id, metric_name, recorded_at DESC)
---
2. Configuration and Settings
2.1 Environment Variables
```bash
# Training System Features
TRAINING_SYSTEM_ENABLED=true
TRAINING_ANALYTICS_ENABLED=true
TRAINING_ALERTS_ENABLED=true
TRAINING_DASHBOARD_ENABLED=true

# WebSocket Configuration (milliseconds)
SUPERVISION_WEBSOCKET_TIMEOUT=3600000    # 1 hour
WEBSOCKET_HEARTBEAT_INTERVAL=30000       # 30 seconds

# Metrics Collection
METRICS_BATCH_SIZE=100
METRICS_FLUSH_INTERVAL=5000              # 5 seconds (ms)

# Alert Configuration
ALERT_STUCK_IN_TRAINING_DAYS=14
ALERT_DECLINING_CONFIDENCE_RATE=0.05
ALERT_LOW_COMPLETION_RATE=0.80
ALERT_HIGH_INTERVENTION_RATE=0.50

# Cache Configuration (seconds)
DASHBOARD_CACHE_TTL=300                  # 5 minutes
METRICS_CACHE_TTL=3600                   # 1 hour
ALERT_DEDUP_TTL=600                      # 10 minutes

# Background Jobs (quote cron expressions so they parse as one value)
ALERT_CHECK_INTERVAL_HOURLY="0 * * * *"      # Every hour
ALERT_CHECK_INTERVAL_DAILY="0 9 * * *"       # Daily at 9 AM
METRICS_AGGREGATION_SCHEDULE="*/5 * * * *"   # Every 5 minutes
```

2.2 Feature Flags
**Enable/Disable Features:**
```python
# backend-saas/core/config.py
import os

TRAINING_SYSTEM_ENABLED = os.getenv("TRAINING_SYSTEM_ENABLED", "true").lower() == "true"
TRAINING_ANALYTICS_ENABLED = os.getenv("TRAINING_ANALYTICS_ENABLED", "true").lower() == "true"
TRAINING_ALERTS_ENABLED = os.getenv("TRAINING_ALERTS_ENABLED", "true").lower() == "true"
```

**Frontend Feature Flags:**
```typescript
// src/lib/features.ts
export const TRAINING_UI_ENABLED = process.env.NEXT_PUBLIC_TRAINING_UI_ENABLED === 'true';
export const TRAINING_ANALYTICS_ENABLED = process.env.NEXT_PUBLIC_TRAINING_ANALYTICS_ENABLED === 'true';
```

2.3 Alert Configuration
**Per-Tenant Custom Thresholds:**
```python
# Example: configure a custom alert threshold for a tenant
await alerting_service.configure_alert_thresholds(
    tenant_id="tenant-123",
    alert_type="stuck_in_training",
    thresholds={"max_days_in_training": 21}  # Override the 14-day default
)
```

**Notification Channels:**
```python
# Configure notification channels per alert type
ALERT_NOTIFICATION_CHANNELS = {
    "stuck_in_training": ["email", "in_app"],
    "declining_confidence": ["email", "in_app"],
    "low_completion_rate": ["email", "in_app", "slack"],
    "high_intervention_rate": ["email", "in_app"],
    "blocked_trigger_spike": ["in_app"],
}
```

---
3. Monitoring and Analytics
3.1 Key Metrics to Monitor
**System Health Metrics:**
- Dashboard load time (target: <2s)
- API response time (target: <500ms p95)
- WebSocket connection success rate (target: ≥99%)
- Database query performance (target: <100ms for dashboard queries)
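The latency targets above are percentile-based. A generic nearest-rank percentile helper (not the project's monitoring code) shows how a p95 check against the 500 ms target works:

```python
def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (no interpolation)."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Clamp the rank into [0, len-1]
    k = max(0, min(len(ordered) - 1, int(round(pct / 100 * len(ordered))) - 1))
    return ordered[k]

# Example: API response times in milliseconds
latencies = [120, 95, 430, 210, 80, 480, 150, 175, 300, 220]
p95 = percentile(latencies, 95)
print(f"p95 = {p95} ms, within target: {p95 < 500}")
```

In production the same number usually comes from a Prometheus histogram (see section 9.1) rather than raw samples; the helper is just for ad-hoc spot checks.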
**Training Effectiveness Metrics:**
- Training completion rate (target: ≥90%)
- Time-to-promotion by maturity level
- Intervention rate trends (should decrease over time)
- Confidence score progression (should increase over time)
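The two effectiveness targets above can be spot-checked from raw session data. A minimal sketch, assuming illustrative field names (`status`, a per-session confidence series), not the actual schema:

```python
def completion_rate(sessions: list[dict]) -> float:
    """Fraction of sessions with status 'completed' (target: >= 0.90)."""
    if not sessions:
        return 0.0
    done = sum(1 for s in sessions if s["status"] == "completed")
    return done / len(sessions)

def confidence_trend(scores: list[float]) -> float:
    """Average per-step change in confidence; positive means improving."""
    if len(scores) < 2:
        return 0.0
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    return sum(deltas) / len(deltas)

sessions = [{"status": "completed"}] * 9 + [{"status": "abandoned"}]
print(completion_rate(sessions))                    # 0.9
print(confidence_trend([0.55, 0.60, 0.68, 0.74]))   # positive -> improving
```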
**Alert Metrics:**
- Alert frequency by type
- Alert false positive rate
- Alert resolution time
- Alert escalation rate
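False-positive rate and resolution time can be derived from alert history records. A hedged sketch; the record shape (`triggered_at`, `resolved_at`, `dismissed_as_false_positive`) is assumed for illustration, not the actual alert schema:

```python
from datetime import datetime, timedelta

def alert_quality(alerts: list[dict]) -> dict:
    """Summarize false-positive rate and mean resolution time."""
    resolved = [a for a in alerts if a.get("resolved_at")]
    false_pos = [a for a in alerts if a.get("dismissed_as_false_positive")]
    fp_rate = len(false_pos) / len(alerts) if alerts else 0.0
    if resolved:
        total = sum((a["resolved_at"] - a["triggered_at"]).total_seconds()
                    for a in resolved)
        mean_resolution_s = total / len(resolved)
    else:
        mean_resolution_s = 0.0
    return {"false_positive_rate": fp_rate,
            "mean_resolution_seconds": mean_resolution_s}

t0 = datetime(2024, 1, 1, 9, 0)
alerts = [
    {"triggered_at": t0, "resolved_at": t0 + timedelta(hours=2)},
    {"triggered_at": t0, "resolved_at": t0 + timedelta(hours=4)},
    {"triggered_at": t0, "dismissed_as_false_positive": True},
]
print(alert_quality(alerts))
```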
3.2 Dashboard Queries
**Slow Query Identification:**
```sql
-- Find slow queries (>500 ms mean execution time)
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
WHERE query LIKE '%training_metrics%'
  AND mean_exec_time > 500
ORDER BY mean_exec_time DESC;

-- Check materialized view status
-- (pg_matviews has no last-refresh column; PostgreSQL does not track
-- refresh time natively, so record it yourself if you need it)
SELECT schemaname, matviewname, ispopulated
FROM pg_matviews
WHERE matviewname LIKE '%training%';
```

**Cache Hit Rate Monitoring:**
```python
# Monitor Redis cache performance
import redis

redis_client = redis.Redis.from_url(settings.REDIS_URL)
info = redis_client.info('stats')
cache_hits = info['keyspace_hits']
cache_misses = info['keyspace_misses']
total = cache_hits + cache_misses
hit_rate = cache_hits / total if total > 0 else 0
print(f"Cache hit rate: {hit_rate:.2%}")  # Target: ≥80%
```

---
4. Troubleshooting Guide
4.1 Common Issues and Solutions
**Issue: Dashboard Not Loading**
**Symptoms:**
- Dashboard shows loading spinner indefinitely
- Console errors: "Failed to fetch dashboard data"
- API returns 500 error
**Diagnosis:**
- Check API logs for errors
- Verify database connectivity
- Check Redis cache status
- Review query performance
**Solutions:**
```bash
# Check API logs
fly logs -a atom-saas --tail 100 | grep training

# Verify database connectivity
psql $DATABASE_URL -c "SELECT 1"

# Check Redis status
redis-cli ping

# Review slow queries
psql $DATABASE_URL -c "SELECT * FROM pg_stat_statements WHERE query LIKE '%training%' ORDER BY mean_exec_time DESC LIMIT 10"
```

**Issue: WebSocket Connection Fails**
**Symptoms:**
- Supervision dashboard shows "Disconnected" status
- Console errors: "WebSocket connection failed"
- Real-time events not appearing
**Diagnosis:**
- Check WebSocket endpoint accessibility
- Verify authentication token
- Check server logs for WebSocket errors
- Test WebSocket connection manually
**Solutions:**
```bash
# Test WebSocket connection
wscat -c "wss://atom-saas.fly.dev/api/maturity/supervision/session-id/ws" \
  -H "Authorization: Bearer $TOKEN"

# Check server logs
fly logs -a atom-saas --tail 100 | grep websocket

# Verify WebSocket configuration
grep SUPERVISION_WEBSOCKET_TIMEOUT .env
```

**Issue: Alerts Not Triggering**
**Symptoms:**
- Stuck agents not detected
- No alert notifications sent
- Alert history empty
**Diagnosis:**
- Check background job status
- Verify alert configuration
- Check notification service status
- Review alert evaluation logs
**Solutions:**
```python
# Manually trigger an alert check (run inside an async context,
# e.g. a management shell)
await alerting_service.check_alerts(tenant_id="tenant-123")
```

```bash
# Check background job logs
grep "alert_check" /var/log/supervisor/background-worker.log
```

```sql
-- Verify alert thresholds
SELECT * FROM alert_configurations WHERE tenant_id = 'tenant-123';
```

**Issue: High Memory Usage**
**Symptoms:**
- Dashboard API using excessive memory
- OOM kills on worker processes
- Slow page loads
**Diagnosis:**
- Check memory usage per endpoint
- Identify memory leaks
- Review query result sizes
- Check caching strategy
**Solutions:**
```python
# Enable pagination for large datasets
from fastapi import Query

@router.get("/analytics/training/dashboard")
async def get_dashboard(
    limit: int = Query(100, le=1000),  # Max 1000 records
    offset: int = Query(0, ge=0),
):
    ...  # pagination logic

# Use materialized views instead of raw queries
# Implement cursor-based pagination for very large datasets

# Monitor memory usage
import psutil

process = psutil.Process()
print(f"Memory usage: {process.memory_info().rss / 1024 / 1024:.2f} MB")
```

4.2 Debug Mode
**Enable Debug Logging:**
```python
# backend-saas/core/logging_config.py
import logging
import os

LOG_LEVEL = os.getenv("LOG_LEVEL", "DEBUG")

# Enable SQLAlchemy query logging
logging.getLogger('sqlalchemy.engine').setLevel(logging.DEBUG)
```

**Enable Query Timing:**
```python
# Track query execution time
import time

start_time = time.time()
result = await db.execute(query)
execution_time = time.time() - start_time
if execution_time > 0.5:  # Log slow queries
    logger.warning(f"Slow query ({execution_time:.2f}s): {query}")
```

---
5. Performance Tuning
5.1 Database Index Optimization
```sql
-- Add a composite index for dashboard queries.
-- Note: NOW() is not immutable, so it cannot appear in a partial index
-- predicate; either drop the predicate (as below) or use a fixed cutoff
-- date and rebuild the index periodically.
CREATE INDEX CONCURRENTLY idx_training_metrics_dashboard
ON training_metrics(tenant_id, metric_name, recorded_at DESC);

-- Analyze query performance
EXPLAIN ANALYZE
SELECT * FROM get_training_dashboard_data('tenant-123', '30 days');
```

5.2 Materialized View Tuning
```sql
-- Aggregate metrics into 5-minute buckets for near-real-time data.
-- date_trunc() does not accept '5 minutes', so bucket on the epoch instead.
CREATE MATERIALIZED VIEW training_metrics_aggregated_5min AS
SELECT
    tenant_id,
    metric_name,
    to_timestamp(floor(extract(epoch FROM recorded_at) / 300) * 300) AS time_bucket,
    AVG(metric_value) AS avg_value,
    MIN(metric_value) AS min_value,
    MAX(metric_value) AS max_value,
    COUNT(*) AS count
FROM training_metrics
GROUP BY tenant_id, metric_name, time_bucket;

-- REFRESH ... CONCURRENTLY requires a unique index on the view
CREATE UNIQUE INDEX ON training_metrics_aggregated_5min (tenant_id, metric_name, time_bucket);

-- Refresh every 5 minutes via cron or pg_cron
REFRESH MATERIALIZED VIEW CONCURRENTLY training_metrics_aggregated_5min;
```

5.3 Caching Strategy
**Redis Caching Configuration:**
```python
import json

# Dashboard data caching (5-minute TTL)
async def get_dashboard_data_cached(tenant_id: str, time_range: str):
    cache_key = f"dashboard:{tenant_id}:{time_range}"
    cached_data = await redis_client.get(cache_key)
    if cached_data:
        return json.loads(cached_data)
    # Cache miss - fetch from database
    data = await fetch_dashboard_data(tenant_id, time_range)
    await redis_client.setex(
        cache_key,
        300,  # 5-minute TTL
        json.dumps(data),
    )
    return data
```

---
6. Security and Compliance
6.1 Tenant Isolation
**Database-Level Isolation:**
```sql
-- Row-Level Security (RLS) policies
-- (current_tenant_id() is an application-defined function)
ALTER TABLE training_metrics ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation_policy ON training_metrics
    FOR ALL
    USING (tenant_id = current_tenant_id());
```

**Application-Level Checks:**
```python
# Verify tenant_id on all queries
async def get_dashboard_data(tenant_id: str, db: Session):
    # Always filter by tenant_id
    query = db.query(TrainingMetric).filter(
        TrainingMetric.tenant_id == tenant_id
    )
    return query.all()
```

6.2 Audit Logging
**Log All Training Actions:**
```python
# backend-saas/core/audit_service.py
from datetime import datetime

async def log_training_action(
    tenant_id: str,
    action: str,
    user_id: str,
    details: dict,
):
    await audit_log.create(
        tenant_id=tenant_id,
        action=f"training_{action}",
        actor_id=user_id,
        details=details,
        timestamp=datetime.utcnow(),
    )
```

**Audit Report Query:**
```sql
-- Generate audit report for tenant
SELECT
    action,
    actor_id,
    details,
    timestamp
FROM audit_logs
WHERE tenant_id = 'tenant-123'
  AND action LIKE 'training_%'
  AND timestamp >= NOW() - INTERVAL '30 days'
ORDER BY timestamp DESC;
```

6.3 Data Retention
**Metrics Retention Policy:**
```python
from datetime import datetime, timedelta

# Retention policy:
#   Raw metrics: 90 days
#   Aggregated metrics: 1 year
#   Alert history: 90 days

# Background job to purge old data
@background_job(schedule="0 2 * * *")  # Daily at 2 AM
async def purge_old_metrics():
    cutoff_date = datetime.utcnow() - timedelta(days=90)
    await db.query(TrainingMetric).filter(
        TrainingMetric.recorded_at < cutoff_date
    ).delete()
    await db.commit()
```

---
7. Backup and Recovery
7.1 Database Backup Strategy
**Daily Backup:**
```bash
# Automated backup via Fly.io
fly postgres create --name atom-saas-backup
fly backups create --app atom-saas-db

# Manual backup
pg_dump $DATABASE_URL > training-backup-$(date +%Y%m%d).sql
```

**Restore Procedure:**
```bash
# List available backups
fly backups list --app atom-saas-db

# Restore from backup
fly backups restore --app atom-saas-db <backup-id>
```

7.2 Disaster Recovery
**Recovery Steps:**
1. Verify database backup integrity
2. Restore database from backup
3. Run data validation queries
4. Test training system endpoints
5. Verify analytics dashboard data
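Step 3 can be mechanized by comparing per-table row counts captured before the incident against counts after the restore. A hedged sketch with a hypothetical helper; the tolerance and table names are illustrative:

```python
def validate_restore(pre: dict[str, int], post: dict[str, int],
                     tolerance: float = 0.01) -> list[str]:
    """Flag tables whose restored row count deviates more than `tolerance`
    (as a fraction) from the pre-incident count."""
    problems = []
    for table, expected in pre.items():
        actual = post.get(table, 0)
        if expected == 0:
            continue  # nothing to compare against
        drift = abs(actual - expected) / expected
        if drift > tolerance:
            problems.append(f"{table}: expected ~{expected}, got {actual}")
    return problems

pre = {"agent_proposals": 10_000, "training_metrics": 2_000_000}
post = {"agent_proposals": 10_000, "training_metrics": 1_900_000}
print(validate_restore(pre, post))  # flags training_metrics (5% drift)
```

Some drift in high-volume tables like `training_metrics` is expected when the backup predates the incident; tune the tolerance per table rather than demanding exact equality.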
**Data Validation:**
```sql
-- Verify agent proposals count
SELECT tenant_id, COUNT(*) AS proposal_count
FROM agent_proposals
GROUP BY tenant_id;

-- Verify metrics continuity
SELECT
    DATE(recorded_at) AS date,
    COUNT(*) AS metric_count
FROM training_metrics
WHERE recorded_at >= NOW() - INTERVAL '7 days'
GROUP BY DATE(recorded_at)
ORDER BY date DESC;
```

---
8. Advanced Configuration
8.1 Custom Alert Rules
**Define Custom Alert:**
```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class CustomAlertType:
    name: str = "custom_declining_performance"
    description: str = "Agent performance declining over 30 days"
    severity: str = "warning"
    check_frequency: str = "1week"
    threshold_config: dict[str, Any] = field(default_factory=lambda: {
        "min_decline_rate": 0.10,
        "min_performance_score": 0.70,
    })

# Register custom alert
ALERT_TYPES.append(CustomAlertType())
```

**Implement Custom Check:**
```python
async def check_custom_alert(
    tenant_id: str,
    agent_id: str,
) -> Alert | None:
    # Calculate performance decline over 30 days
    current_score = await get_current_performance_score(agent_id)
    historical_score = await get_historical_performance_score(
        agent_id,
        days_ago=30,
    )
    if historical_score <= 0:
        return None  # No baseline to compare against
    decline_rate = (historical_score - current_score) / historical_score
    if decline_rate >= 0.10 and current_score < 0.70:
        return Alert(
            tenant_id=tenant_id,
            alert_type="custom_declining_performance",
            severity="warning",
            agent_id=agent_id,
            details={
                "current_score": current_score,
                "historical_score": historical_score,
                "decline_rate": decline_rate,
            },
        )
    return None
```

8.2 Multi-Region Deployment
**Configure Region-Specific Databases:**
```bash
# Primary region (FRA)
DATABASE_URL_PRIMARY="postgresql://...@atom-saas-db.fra.aws.neon.tech"

# Replica region (EWR)
DATABASE_URL_REPLICA="postgresql://...@atom-saas-db.ewr.aws.neon.tech"
```

```python
# Route reads to the replica, writes to the primary
if request.method == "GET":
    db = get_db_session(DATABASE_URL_REPLICA)
else:
    db = get_db_session(DATABASE_URL_PRIMARY)
```

**WebSocket Regional Endpoints:**
```typescript
// Connect to nearest region
const region = detectNearestRegion();
const wsUrl = `wss://${region}.atom-saas.fly.dev/api/maturity/supervision/${sessionId}/ws`;
```

---
9. Monitoring and Alerting Setup
9.1 Prometheus Metrics
**Key Metrics to Track:**
```python
# backend-saas/core/monitoring.py
from prometheus_client import Counter, Histogram, Gauge

# Training metrics
training_proposals_total = Counter(
    'training_proposals_total',
    'Total training proposals',
    ['tenant_id', 'status'],
)

training_sessions_duration = Histogram(
    'training_session_duration_seconds',
    'Training session duration',
    ['tenant_id'],
)

supervision_sessions_active = Gauge(
    'supervision_sessions_active',
    'Active supervision sessions',
    ['tenant_id'],
)

alert_triggered_total = Counter(
    'alert_triggered_total',
    'Total alerts triggered',
    ['tenant_id', 'alert_type', 'severity'],
)
```

9.2 Grafana Dashboards
**Dashboard JSON:**
```json
{
  "dashboard": {
    "title": "Training System Overview",
    "panels": [
      {
        "title": "Training Proposals by Status",
        "targets": [
          { "expr": "sum by (status) (training_proposals_total)" }
        ]
      },
      {
        "title": "Active Supervision Sessions",
        "targets": [
          { "expr": "supervision_sessions_active" }
        ]
      },
      {
        "title": "Alerts Triggered (Last 24h)",
        "targets": [
          { "expr": "sum by (alert_type) (increase(alert_triggered_total[24h]))" }
        ]
      }
    ]
  }
}
```

9.3 PagerDuty Integration
**Configure Critical Alerts:**
```python
# Send critical alerts to PagerDuty via the Events API
from pdpyras import EventsAPISession

pagerduty = EventsAPISession(settings.PAGERDUTY_ROUTING_KEY)

async def send_critical_alert(alert: Alert):
    if alert.severity == "critical":
        pagerduty.trigger(
            summary=alert.title,
            source="training-system",  # Events API requires a source; adjust
            dedup_key=f"{alert.tenant_id}-{alert.alert_type}",
            severity="critical",
            custom_details=alert.details,
        )
```

---
*For deployment checklist, see separate file: backend-saas/deployments/training-deployment-checklist.md*