Federation Protocol
Overview
The ATOM Federation Protocol enables cross-instance communication between self-hosted ATOM deployments, allowing agents, skills, and domain templates to be shared across organizational boundaries while maintaining security, authentication, and multi-tenant isolation.
Architecture Diagram
Protocol Specification
Version
Current Version: 1.0.0
Compatibility:
- Instances running v13.0+ can federate with each other
- Backward compatible with v12.x for read operations
- Forward compatible through schema versioning
Base URL
https://atom-saas.fly.dev/api/federation
For self-hosted instances:
https://your-instance.com/api/federation
Authentication
API Key Authentication
All federation requests require a shared secret API key:
Header:
X-Federation-Key: sk-federation-shared-key-...
Key Rotation:
# Set multiple keys (comma-separated)
FEDERATION_API_KEY=sk-federation-key-1,sk-federation-key-2,sk-federation-key-3
# First key is primary for validation
# Remaining keys support rotation
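A minimal sketch of how the comma-separated value could be parsed and checked, assuming the keys are read from `FEDERATION_API_KEY`. The helper names `load_federation_keys` and `is_valid_key` are hypothetical, and the constant-time comparison is a suggested hardening, not something this document confirms ATOM does:

```python
import hmac
import os
from typing import List, Optional


def load_federation_keys(raw: Optional[str] = None) -> List[str]:
    """Split a comma-separated key string into a list, dropping blanks."""
    if raw is None:
        raw = os.environ.get("FEDERATION_API_KEY", "")
    return [key.strip() for key in raw.split(",") if key.strip()]


def is_valid_key(candidate: str, valid_keys: List[str]) -> bool:
    """Check the candidate against every configured key (hypothetical helper)."""
    matched = False
    for key in valid_keys:
        # compare_digest avoids leaking how many leading characters matched
        if hmac.compare_digest(candidate.encode(), key.encode()):
            matched = True
    return matched
```

During a rotation, both the old and the new key stay in the list until every peer has switched, after which the old key is removed.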
Security:
- Keys should be exchanged via secure channels (e.g., password managers, secret management)
- Keys should be rotated regularly (recommended: quarterly)
- Never include keys in code repositories
- Use environment variables or secret management systems
Key Validation:
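from datetime import UTC, datetime
from typing import Optional

from fastapi import Header, HTTPException, Request

# `logger` and `_FEDERATION_API_KEYS` are defined at module level elsewhere.


def verify_federation_key(
    request: Request,
    x_federation_key: Optional[str] = Header(None),
):
    """
    Verify federation API key from X-Federation-Key header.

    Supports key rotation - checks against all valid keys from environment.
    Logs all auth attempts for security auditing.
    """
    client_ip = request.client.host if request.client else "unknown"

    # A required Header(...) makes FastAPI answer 422 when the header is
    # missing; an optional header plus this check yields the documented 401.
    if x_federation_key is None:
        raise HTTPException(status_code=401, detail="Missing X-Federation-Key header")

    # Check against all valid keys (supports rotation)
    if x_federation_key not in _FEDERATION_API_KEYS:
        logger.warning(
            "federation_auth_failed",
            extra={
                "action": "auth_failed",
                "client_ip": client_ip,
                "timestamp": datetime.now(UTC).isoformat(),
                "reason": "invalid_api_key",
            },
        )
        raise HTTPException(status_code=403, detail="Invalid Federation Key")

    # Log successful auth
    logger.info(
        "federation_auth_success",
        extra={
            "action": "auth_success",
            "client_ip": client_ip,
            "timestamp": datetime.now(UTC).isoformat(),
        },
    )
    return True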
Rate Limiting
Limits:
- 100 requests per minute per IP address
- Fixed one-minute window tracked in Redis (the implementation below buckets requests by calendar minute)
- Graceful degradation if Redis unavailable
Headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1708412800
Exceeded Response:
{ "detail": "Rate limit exceeded. Please try again later.", "code": "RATE_LIMIT_EXCEEDED" }
Implementation:
async def rate_limit_federation(request: Request):
    """Rate limiting for federation endpoints using Redis."""
    if not cache_service.enabled:
        return  # Graceful degradation

    client_ip = request.client.host if request.client else "unknown"
    rate_limit = 100
    window_seconds = 60

    current_minute = datetime.now(UTC).strftime("%Y-%m-%d:%H:%M")
    rate_limit_key = f"federation:rate_limit:{client_ip}:{current_minute}"

    current_count = cache_service.get(rate_limit_key) or 0
    if int(current_count) >= rate_limit:
        raise HTTPException(
            status_code=429,
            detail="Rate limit exceeded",
            headers={"Retry-After": str(window_seconds)},
        )

    cache_service.set(rate_limit_key, int(current_count) + 1, ttl=window_seconds)
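The implementation does not show how the `X-RateLimit-*` headers listed earlier are produced. A minimal sketch, assuming a fixed window aligned to multiples of `window_seconds`; the helper name `rate_limit_headers` is hypothetical:

```python
from datetime import datetime, timezone
from typing import Dict


def rate_limit_headers(count: int, now: datetime, limit: int = 100,
                       window_seconds: int = 60) -> Dict[str, str]:
    """Build X-RateLimit-* headers for a fixed window (illustrative helper)."""
    epoch = int(now.timestamp())
    # The window resets at the next multiple of window_seconds
    reset_at = (epoch // window_seconds + 1) * window_seconds
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(limit - count, 0)),
        "X-RateLimit-Reset": str(reset_at),
    }
```

Here `count` is the number of requests already recorded for the client in the current window, and `X-RateLimit-Reset` is a Unix timestamp in seconds.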
API Endpoints
1. List Federated Agents
Endpoint: GET /api/federation/agents
Description: Retrieve a paginated list of public, approved agent templates available for federation.
Authentication: Required (X-Federation-Key)
Rate Limited: Yes (100 req/min per IP)
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| limit | integer | No | Page size (1-100). Default: 50 |
| offset | integer | No | Page offset. Default: 0 |
Example Request:
curl -X GET "https://atom-saas.fly.dev/api/federation/agents?limit=10&offset=0" \
  -H "X-Federation-Key: sk-federation-shared-key-..."
Response:
{
  "agents": [
    {
      "id": "123e4567-e89b-12d3-a456-426614174000",
      "name": "Sales Assistant",
      "description": "Automated sales outreach and lead qualification",
      "category": "sales",
      "price": 0.0,
      "version": "1.2.0",
      "rating": 4.5
    }
  ],
  "total": 150,
  "limit": 10,
  "offset": 0,
  "has_more": true
}
Error Responses:
401 Unauthorized:
{ "detail": "Missing X-Federation-Key header" }
403 Forbidden:
{ "detail": "Invalid Federation Key" }
429 Too Many Requests:
{ "detail": "Rate limit exceeded. Please try again later." }
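A consumer can walk the full list by following `has_more`. The sketch below assumes the response shape shown above and takes the HTTP call as an injected function so it stays independent of any particular HTTP library; `iter_agents` and `PageFetcher` are hypothetical names:

```python
from typing import Callable, Iterator

# A page fetcher takes (limit, offset) and returns the decoded JSON body.
PageFetcher = Callable[[int, int], dict]


def iter_agents(fetch_page: PageFetcher, limit: int = 50) -> Iterator[dict]:
    """Yield every agent by following `has_more` across pages."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page["agents"]
        if not page.get("has_more"):
            break
        offset += limit
```

In practice `fetch_page` would issue `GET /api/federation/agents` with the `X-Federation-Key` header and return the parsed body.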
2. Get Agent Bundle
Endpoint: GET /api/federation/agents/{template_id}
Description: Download the complete agent template bundle including configuration, capabilities, canvas UI schemas, and anonymized memory for local installation.
Authentication: Required (X-Federation-Key)
Rate Limited: Yes (100 req/min per IP)
Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| template_id | string | Yes | Agent template UUID |
Example Request:
curl -X GET "https://atom-saas.fly.dev/api/federation/agents/123e4567-e89b-12d3-a456-426614174000" \
  -H "X-Federation-Key: sk-federation-shared-key-..."
Response:
{
  "id": "123e4567-e89b-12d3-a456-426614174000",
  "name": "Sales Assistant",
  "description": "Automated sales outreach and lead qualification",
  "category": "sales",
  "price": 0.0,
  "version": "1.2.0",
  "configuration": {
    "model": "gpt-4",
    "temperature": 0.7,
    "max_tokens": 2000,
    "system_prompt": "You are a helpful sales assistant...",
    "capabilities": ["email", "calendar", "crm"]
  },
  "capabilities": [
    "skill-uuid-1",
    "skill-uuid-2",
    "skill-uuid-3"
  ],
  "canvas_ui_schemas": [
    {
      "type": "form",
      "fields": [
        {"name": "prospect_email", "type": "email", "required": true},
        {"name": "message", "type": "textarea", "required": true}
      ]
    }
  ],
  "anonymized_memory_bundle": {
    "heuristics": [
      {
        "error_type": "email_bounce",
        "error_code": "BOUNCE",
        "resolution": "Verify email format and domain validity before sending"
      },
      {
        "error_type": "calendar_conflict",
        "error_code": "CONFLICT",
        "resolution": "Check calendar availability and suggest alternative times"
      }
    ]
  }
}
Error Responses:
404 Not Found:
{ "detail": "Agent Template not found" }
429 Too Many Requests:
{ "detail": "Rate limit exceeded. Please try again later." }
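Before installing a downloaded bundle, a consumer may want to confirm it has the expected shape. This is a sketch only: the field list is taken from the example response above, and treating every one of them as mandatory is an assumption. `missing_bundle_fields` is a hypothetical helper:

```python
from typing import List

# Top-level keys seen in the example bundle; which are truly required is assumed.
EXPECTED_BUNDLE_FIELDS = (
    "id",
    "name",
    "version",
    "configuration",
    "capabilities",
    "canvas_ui_schemas",
    "anonymized_memory_bundle",
)


def missing_bundle_fields(bundle: dict) -> List[str]:
    """Return the expected top-level keys that are absent from the bundle."""
    return [field for field in EXPECTED_BUNDLE_FIELDS if field not in bundle]
```

An empty result means the bundle carries every key the installer expects to read.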
Security Model
Threat Model
Protected Assets:
- Agent configurations (may contain business logic)
- Anonymized memory bundles (heuristics, patterns)
- Canvas UI schemas (IP in UX design)
- Capability lists (skill dependencies)
Threats:
- Unauthorized access to private templates
- API key leakage
- Rate limit bypass
- Data injection
- Cross-tenant data leakage
- Replay attacks
Security Measures
1. Authentication & Authorization
Shared Secret API Keys:
- High-entropy random strings (256-bit minimum)
- Environment-based configuration
- No hardcoding in source code
- Regular rotation schedule
Key Exchange Process:
# Generate new key
openssl rand -base64 32

# Exchange via secure channel (password manager, secret management)

# Add to environment
FEDERATION_API_KEY=sk-federation-<generated-key>

# Restart services
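The same key can be generated from Python when `openssl` is not available. A minimal sketch using the standard `secrets` module; the `sk-federation-` prefix follows the examples in this document, and `generate_federation_key` is a hypothetical helper name:

```python
import secrets

KEY_PREFIX = "sk-federation-"  # prefix as used in this document's examples


def generate_federation_key() -> str:
    """Return a prefixed key built from 32 random bytes (256 bits)."""
    return KEY_PREFIX + secrets.token_urlsafe(32)
```

`secrets.token_urlsafe(32)` encodes 32 random bytes as URL-safe Base64, which meets the 256-bit minimum stated above.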
2. Multi-Tenant Isolation
Query Filtering:
# All queries filter by is_public and is_approved
query = db.query(AgentTemplate).filter(
    and_(
        AgentTemplate.is_public == True,
        AgentTemplate.is_approved == True
    )
)
Tenant Context:
- Published templates retain author tenant_id
- Installed templates belong to installer's tenant_id
- No cross-tenant data access
3. Data Anonymization
PII Redaction:
email_pattern = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
uuid_pattern = re.compile(r"[0-9a-fA-F]{8}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{12}")

for res in raw_resolutions:
    text = res.resolution_attempted or ""
    text = email_pattern.sub("[REDACTED_EMAIL]", text)
    text = uuid_pattern.sub("[REDACTED_ID]", text)
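The redaction step can be packaged as a standalone function so it is easy to test. This sketch reuses the two patterns above unchanged; the function name `redact_pii` is hypothetical:

```python
import re

EMAIL_PATTERN = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")
UUID_PATTERN = re.compile(
    r"[0-9a-fA-F]{8}\b-[0-9a-fA-F]{4}\b-[0-9a-fA-F]{4}\b-"
    r"[0-9a-fA-F]{4}\b-[0-9a-fA-F]{12}"
)


def redact_pii(text: str) -> str:
    """Replace email addresses and UUIDs with fixed placeholders."""
    text = EMAIL_PATTERN.sub("[REDACTED_EMAIL]", text)
    return UUID_PATTERN.sub("[REDACTED_ID]", text)
```

For example, `redact_pii("Mail bob@example.com")` returns `"Mail [REDACTED_EMAIL]"`. Pattern matching of this kind only catches the formats it knows about, which is why the semantic pass described next is listed separately.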
Semantic Sanitization:
- Optional LLM-based rewriting
- Remove business-specific context
- Preserve technical patterns
4. Audit Logging
Structured Logging:
# Log all federation requests
logger.info(
    "federation_request",
    extra={
        "action": "list_agents",
        "client_ip": client_ip,
        "limit": limit,
        "offset": offset,
        "timestamp": datetime.now(UTC).isoformat()
    }
)

# Log successful responses
logger.info(
    "federation_response",
    extra={
        "action": "list_agents_success",
        "client_ip": client_ip,
        "agent_count": len(results),
        "total_available": total,
        "timestamp": datetime.now(UTC).isoformat()
    }
)

# Log failures
logger.warning(
    "federation_request_failed",
    extra={
        "action": "get_bundle_not_found",
        "client_ip": client_ip,
        "template_id": template_id,
        "timestamp": datetime.now(UTC).isoformat()
    }
)
Log Fields:
- Action type
- Client IP
- Timestamp
- Request parameters
- Response status
- Error details
5. Rate Limiting
Implementation:
- Redis-based fixed window (one-minute buckets)
- Per-IP tracking
- Graceful degradation
- Configurable limits
Bypass Prevention:
- Client IP detection (via request.client.host)
- No IP spoofing (TCP layer)
- No user-agent based bypass
6. Input Validation
Parameter Validation:
# Validate UUID format
if not is_valid_uuid(template_id):
    raise HTTPException(status_code=400, detail="Invalid template ID")

# Validate pagination limits
if limit < 1 or limit > 100:
    raise HTTPException(status_code=400, detail="Limit must be between 1 and 100")

# Validate offset
if offset < 0:
    raise HTTPException(status_code=400, detail="Offset must be non-negative")
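The `is_valid_uuid` helper is referenced above but not shown. One possible implementation, as a sketch built on the standard `uuid` module; restricting input to the canonical hyphenated form is an assumption about the intended behavior:

```python
import uuid


def is_valid_uuid(value: str) -> bool:
    """Return True only for canonical hyphenated UUID strings."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts braces, URN prefixes, and unhyphenated hex;
    # comparing against the canonical form rejects those variants.
    return str(parsed) == value.lower()
```

Rejecting non-canonical forms keeps the accepted `template_id` values identical to what the list endpoint returns.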
SQL Injection Prevention:
- SQLAlchemy ORM parameterized queries
- No raw SQL with user input
- Type-safe query building
Security Best Practices
For Instance Operators
- Key Management
  - Store keys in environment variables
  - Use secret management systems (Vault, AWS Secrets Manager)
  - Rotate keys quarterly
  - Never commit keys to git
  - Use different keys per environment
- Network Security
  - Use HTTPS in production
  - Configure firewall rules
  - Monitor access logs
  - Set up intrusion detection
- Access Control
  - Limit federation to trusted instances
  - IP whitelisting (optional)
  - VPN requirements (optional)
  - Regular audits
- Monitoring
  - Track federation request volume
  - Alert on unusual patterns
  - Monitor rate limit violations
  - Review authentication failures
For Developers
- API Key Handling
# Good: Environment variable
API_KEY = os.environ.get("FEDERATION_API_KEY")

# Bad: Hardcoded
API_KEY = "sk-federation-secret-key-123"
- Error Handling
# Good: Generic error messages
if not valid_key:
    raise HTTPException(status_code=403, detail="Invalid Federation Key")

# Bad: Information leakage
if not valid_key:
    raise HTTPException(status_code=403, detail=f"Key {key} not in {keys}")
- Logging
# Good: Structured logging
logger.info("federation_request", extra={"action": "list_agents"})

# Bad: Sensitive data in logs
logger.info(f"Request with key: {api_key}")
Error Handling
Error Response Format
{ "detail": "Human-readable error message", "code": "ERROR_CODE", "timestamp": "2024-02-19T00:00:00Z" }
Common Errors
| Code | Status | Description |
|---|---|---|
| MISSING_API_KEY | 401 | X-Federation-Key header missing |
| INVALID_API_KEY | 403 | Federation key invalid or revoked |
| RATE_LIMIT_EXCEEDED | 429 | Request rate limit exceeded |
| TEMPLATE_NOT_FOUND | 404 | Agent template not found |
| INVALID_PARAMETER | 400 | Request parameter invalid |
| SERVER_ERROR | 500 | Internal server error |
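The table and the response format can be combined into a single helper so every endpoint emits the same shape. A sketch under the assumption that the codes map one-to-one to the statuses listed; `ERROR_STATUS` and `error_body` are hypothetical names:

```python
from datetime import datetime, timezone
from typing import Optional

# Error code -> HTTP status, as listed in the Common Errors table.
ERROR_STATUS = {
    "MISSING_API_KEY": 401,
    "INVALID_API_KEY": 403,
    "RATE_LIMIT_EXCEEDED": 429,
    "TEMPLATE_NOT_FOUND": 404,
    "INVALID_PARAMETER": 400,
    "SERVER_ERROR": 500,
}


def error_body(code: str, detail: str, now: Optional[datetime] = None) -> dict:
    """Build the documented error payload for a known error code."""
    if code not in ERROR_STATUS:
        raise ValueError(f"Unknown error code: {code}")
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {"detail": detail, "code": code, "timestamp": stamp}
```

The HTTP status to send alongside the body is `ERROR_STATUS[code]`.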
Retry Strategy
429 Too Many Requests:
- Exponential backoff: 1s, 2s, 4s, 8s, 16s
- Maximum retries: 5
- Use the Retry-After header if present
5xx Server Errors:
- Exponential backoff with jitter
- Maximum retries: 3
- Alert monitoring system
4xx Client Errors:
- Do not retry
- Fix request parameters
- Verify authentication
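The retry rules above can be expressed as one decision function. This is a sketch, not the reference client: `retry_delay` is a hypothetical name, and "full jitter" (a uniform delay between zero and the exponential cap) is one common jitter variant chosen here for illustration.

```python
import random
from typing import Optional


def retry_delay(status: int, attempt: int,
                retry_after: Optional[float] = None,
                rng: Optional[random.Random] = None) -> Optional[float]:
    """Return seconds to wait before retry `attempt` (0-based), or None to stop."""
    if status == 429:
        if attempt >= 5:  # maximum retries: 5
            return None
        if retry_after is not None:
            return float(retry_after)  # server hint takes precedence
        return float(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
    if 500 <= status < 600:
        if attempt >= 3:  # maximum retries: 3
            return None
        rng = rng or random.Random()
        # Full jitter: uniform delay in [0, 2**attempt] seconds
        return rng.uniform(0, 2 ** attempt)
    return None  # other 4xx errors: do not retry
```

A caller sleeps for the returned number of seconds and stops retrying as soon as the function returns `None`.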
Cross-Instance Communication
Discovery Pattern
Installation Flow
Federation Source Tracking
Installation Record:
installation = AgentInstallation(
    tenant_id=tenant_id,
    template_id=template.id,
    instantiated_agent_id=new_agent.id,
    installed_version=template.version,
    federation_source="remote-instance.com",  # Track source
    federation_timestamp=datetime.now(UTC)
)
Benefits:
- Trace installation origins
- Monitor federation usage
- Identify popular templates
- Detect abuse patterns
Performance Optimization
Caching Strategy
Redis Cache:
# Cache agent lists (5 minutes)
cache_key = f"federation:agents:{limit}:{offset}"
cached = cache_service.get(cache_key)
if not cached:
    agents = fetch_from_db(limit, offset)
    cache_service.set(cache_key, agents, ttl=300)
else:
    # Cache hit: serve the stored page without touching the database
    agents = cached
Cache Invalidation:
- On template approval
- On template update
- On template deletion
- Time-based expiry
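Because list pages are cached under one key per `limit`/`offset` pair, an approval, update, or deletion has to invalidate many keys at once. One common way to do that is to embed a version number in the key and bump it on every change. The sketch below illustrates that technique with an in-memory dictionary standing in for Redis; it is an assumption about how invalidation could work, not a description of ATOM's implementation, and `VersionedListCache` is a hypothetical class:

```python
from typing import Dict, Optional


class VersionedListCache:
    """In-memory stand-in for a Redis-backed cache with a key namespace version."""

    def __init__(self) -> None:
        self._store: Dict[str, list] = {}
        self._version = 0

    def _key(self, limit: int, offset: int) -> str:
        return f"federation:agents:v{self._version}:{limit}:{offset}"

    def get(self, limit: int, offset: int) -> Optional[list]:
        return self._store.get(self._key(limit, offset))

    def set(self, limit: int, offset: int, agents: list) -> None:
        self._store[self._key(limit, offset)] = agents

    def invalidate_all(self) -> None:
        # Bumping the version makes every old page unreachable; with Redis,
        # the stale entries would then age out through their TTL.
        self._version += 1
```

Calling `invalidate_all()` from the approval, update, and deletion paths covers the first three triggers, while the TTL covers the fourth.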
Database Optimization
Indexes:
-- Federation queries
CREATE INDEX idx_agent_templates_federation
ON agent_templates(is_public, is_approved, created_at DESC)
WHERE is_public = true AND is_approved = true;

-- Covering index for list endpoint
CREATE INDEX idx_agent_templates_list
ON agent_templates(is_public, is_approved, rating, installs)
WHERE is_public = true AND is_approved = true;
Query Optimization:
# Efficient pagination
query = (
    db.query(
        AgentTemplate.id,
        AgentTemplate.name,
        AgentTemplate.description,
        AgentTemplate.category,
        AgentTemplate.price,
        AgentTemplate.version,
        AgentTemplate.rating,
    )
    .filter(and_(AgentTemplate.is_public == True, AgentTemplate.is_approved == True))
    .order_by(AgentTemplate.created_at.desc())
    .limit(limit)
    .offset(offset)
)
Bandwidth Optimization
Response Compression:
- Gzip compression enabled
- Typical compression ratio: 5-10x
- Reduces bandwidth usage significantly
Field Selection:
# List endpoint: Minimal fields
agents = query.with_entities(AgentTemplate.id, AgentTemplate.name, ...)

# Bundle endpoint: Full fields (select the whole mapped entity)
bundle = query.with_entities(AgentTemplate)
Monitoring & Observability
Key Metrics
Request Metrics:
- Request rate (per endpoint)
- Response times (p50, p95, p99)
- Error rate (by status code)
- Rate limit violations
Business Metrics:
- Templates downloaded
- Unique instances federating
- Popular templates
- Federation sources
Security Metrics:
- Authentication failures
- Invalid API keys
- Suspicious IP addresses
- Rate limit violations
Alerting
Critical Alerts:
- Error rate > 5%
- Response time p95 > 1s
- Authentication failures > 10/min
- Rate limit violations > 100/min
Warning Alerts:
- Error rate > 1%
- Response time p95 > 500ms
- Unusual request patterns
- Cache hit rate < 80%
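The numeric thresholds above can be encoded directly, which makes them easy to unit test. A sketch covering only error rate and p95 latency; `classify_alert` is a hypothetical helper and the other signals (authentication failures, cache hit rate) are left out for brevity:

```python
def classify_alert(error_rate: float, p95_ms: float) -> str:
    """Map an error rate (0.0-1.0) and p95 latency in ms to an alert level."""
    if error_rate > 0.05 or p95_ms > 1000:
        return "critical"
    if error_rate > 0.01 or p95_ms > 500:
        return "warning"
    return "ok"
```

The critical checks run first so that a metric exceeding both thresholds is reported at the higher level.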
Logging
Log Levels:
- INFO: Successful requests
- WARNING: Authentication failures, rate limits
- ERROR: Server errors, database failures
- CRITICAL: Security incidents
Log Retention:
- 30 days for INFO/WARNING
- 90 days for ERROR/CRITICAL
- Archive for security incidents
Testing
Unit Tests
Authentication:
def test_valid_federation_key():
    """Test valid federation key is accepted."""
    response = client.get(
        "/api/federation/agents",
        headers={"X-Federation-Key": "valid-key"}
    )
    assert response.status_code == 200
Rate Limiting:
def test_rate_limit_exceeded():
    """Test rate limit enforcement."""
    for _ in range(101):
        response = client.get(
            "/api/federation/agents",
            headers={"X-Federation-Key": "valid-key"}
        )
    assert response.status_code == 429
Integration Tests
Cross-Instance:
def test_federation_workflow():
    """Test complete federation workflow."""
    # Instance A: Publish agent
    template_id = publish_agent(instance_a)

    # Instance B: Discover agent
    agents = list_agents(instance_b)
    assert template_id in [a["id"] for a in agents]

    # Instance B: Download and install
    bundle = get_bundle(instance_b, template_id)
    installed = install_bundle(bundle)
    assert installed["status"] == "installed"
Security Tests
Authentication Bypass:
def test_no_auth_key():
    """Test request without API key is rejected."""
    response = client.get("/api/federation/agents")
    assert response.status_code == 401


def test_invalid_auth_key():
    """Test invalid API key is rejected."""
    response = client.get(
        "/api/federation/agents",
        headers={"X-Federation-Key": "invalid-key"}
    )
    assert response.status_code == 403
SQL Injection:
def test_sql_injection_prevention():
    """Test SQL injection attempts are blocked."""
    response = client.get(
        "/api/federation/agents",
        params={"limit": "1; DROP TABLE agents; --"},
        headers={"X-Federation-Key": "valid-key"}
    )
    assert response.status_code == 400
Best Practices
For Federation Consumers
- Respect Rate Limits
  - Implement exponential backoff
  - Cache responses locally
  - Use pagination for large lists
- Secure API Keys
  - Store in environment variables
  - Never commit to git
  - Rotate regularly
- Monitor Usage
  - Track request volume
  - Alert on unusual patterns
  - Review error rates
- Handle Errors Gracefully
  - Implement retry logic
  - Log errors for debugging
  - Provide user feedback
For Federation Providers
- Validate All Input
  - Parameter type checking
  - Range validation
  - SQL injection prevention
- Log Everything
  - Structured logging
  - Include context
  - Exclude sensitive data (API keys, PII)
- Monitor Performance
  - Response times
  - Error rates
  - Resource usage
- Plan for Scale
  - Horizontal scaling
  - Database sharding
  - Cache optimization
Troubleshooting
Common Issues
1. 403 Forbidden: Invalid Federation Key
- Cause: Incorrect or revoked API key (a missing header returns 401 instead)
- Solution: Verify X-Federation-Key header matches environment variable
2. 429 Too Many Requests
- Cause: Rate limit exceeded
- Solution: Implement exponential backoff, check Retry-After header
3. 404 Not Found: Agent Template
- Cause: Template not public, not approved, or doesn't exist
- Solution: Verify template status, check template_id
4. Slow Response Times
- Cause: Database query performance, cache miss
- Solution: Check database indexes, cache hit rates, query optimization
5. High Memory Usage
- Cause: Large bundles, memory leaks
- Solution: Implement pagination, monitor memory, profile code
Debug Mode
Enable Debug Logging:
import logging

logging.getLogger("atom").setLevel(logging.DEBUG)
Check Federation Status:
# Test federation endpoint
curl -v -X GET "https://atom-saas.fly.dev/api/federation/agents?limit=1" \
  -H "X-Federation-Key: your-key"

# Check rate limit
curl -I -X GET "https://atom-saas.fly.dev/api/federation/agents" \
  -H "X-Federation-Key: your-key"
Future Enhancements
Planned Features
- Webhook Federation
  - Push notifications for new templates
  - Real-time updates
  - Event-driven synchronization
- Peer Discovery
  - Automatic instance discovery
  - Peer-to-peer federation
  - Mesh networking
- Reputation Scoring
  - Instance reputation scores
  - Quality metrics
  - Trust networks
- Version Management
  - Template versioning
  - Migration guides
  - Backward compatibility
- Advanced Security
  - Mutual TLS
  - IP whitelisting
  - VPN requirements
  - Signature verification