Architecture Overview
This document describes the complete system architecture of ATOM SaaS, a multi-tenant AI agent platform with cognitive architectures, learning engines, and enterprise-grade governance.
High-Level Architecture
ATOM SaaS follows a layered architecture with clear separation of concerns:
Technology Stack
Frontend (Presentation Layer)
Web Application:
- Framework: Next.js 14 (App Router)
- Language: TypeScript 5.x
- UI Library: React 18
- Styling: Tailwind CSS
- Components: Radix UI primitives
- Editor: Monaco (VS Code editor)
- State: React Context + Server Components
Desktop Application:
- Framework: Tauri 2.0
- Language: Rust (backend), JavaScript (frontend)
- Features: Terminal access, Docker integration, local execution
- Security: Sandboxed execution with permission prompts
Backend (API Layer)
Unified Backend:
- Runtime: Managed Compute Node running dual processes via supervisord
- Frontend Port: 3000 (Next.js)
- Backend Port: 8000 (FastAPI)
- Internal Comm: Next.js proxies /api/v1 requests to the local FastAPI instance
Data Layer
Primary Database:
- Database: PostgreSQL 15+
- Extension: pgvector (vector similarity)
- Security: Row-Level Security (RLS) for tenant isolation
- Hosting: Neon PostgreSQL (serverless)
Vector Database:
- Database: LanceDB
- Purpose: Semantic search for World Model
- Storage: Local file system (persistent volumes)
Caching:
- Cache: Redis
- Purpose: Rate limiting, session caching, pub/sub
- Hosting: Upstash Redis
File Storage:
- Storage: AWS S3
- Purpose: User uploads, agent artifacts, canvas exports
- Isolation: Tenant-specific prefixes (s3://atom-saas/{tenant_id}/)
Infrastructure
Hosting:
- Platform: ATOM Cloud Platform
- Regions: Multiple regions for low latency (Anycast network)
- Features: Auto-scaling, health checks, rolling deployments
CI/CD:
- Pipeline: GitHub Actions
- Testing: 212 E2E tests (100% compliance)
- Deployment: Automated on merge to main
Brain Systems Architecture
The brain systems are the core intelligence layer that enables human-like agent behavior:
Brain System Responsibilities
1. Cognitive Architecture
- Human-like reasoning process
- Attention allocation
- Memory recall coordination
- Language processing
- Problem-solving strategies
2. Learning Engine
- Experience recording (RLHF)
- Pattern recognition
- Adaptation generation
- Behavior modification
- Performance optimization
3. World Model
- Long-term memory storage
- Semantic similarity search
- Experience recall by relevance
- Canvas context tracking
- Feedback-aware retrieval
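As a rough illustration of "experience recall by relevance", the sketch below ranks stored experiences by cosine similarity to a query embedding. In the stack described here LanceDB would perform this search; the pure-Python version only shows the idea, and `recall` and its `(text, vector)` layout are hypothetical names, not part of the platform.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recall(
    query: Sequence[float],
    experiences: Sequence[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k experiences most similar to the query, best first."""
    scored = [(text, cosine_similarity(query, vec)) for text, vec in experiences]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

A feedback-aware variant could weight each score by a stored feedback signal before sorting; how the platform actually does this is not specified here.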
4. Reasoning Engine
- Proactive intelligence
- Intervention generation
- Opportunity identification
- Automation suggestions
- Trend analysis
5. Cross-System Reasoning
- Multi-agent coordination
- Cross-system data correlation
- Complex problem decomposition
- Knowledge synthesis
6. Alpha Evolver
- Autonomous code mutation
- Sandbox-based variant testing
- Workflow performance optimization
- Self-improving toolsets
7. Agent Governance
- Permission validation
- Maturity Calibration (AI-driven)
- Safety checks
- Audit logging
- Rate limiting
Multi-Tenancy Architecture
Tenant isolation is implemented at multiple layers for enterprise-grade security:
Tenant Isolation Layers
1. Subdomain Routing
- Each tenant gets unique subdomain: tenant.atomagentos.com
- Custom domains supported
- Subdomain mapped to tenant_id in database
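One way the subdomain step could work is sketched below: take the tenant label from the request's Host header, then look it up to obtain the `tenant_id`. The function name `tenant_slug_from_host` is hypothetical; custom domains would need a separate lookup that this sketch does not cover.

```python
from typing import Optional

BASE_DOMAIN = "atomagentos.com"  # from the example subdomain above


def tenant_slug_from_host(host: str, base_domain: str = BASE_DOMAIN) -> Optional[str]:
    """Return the tenant subdomain label, or None if the host is not a tenant subdomain."""
    # Drop an optional port, normalize case, and strip a trailing dot.
    hostname = host.split(":", 1)[0].strip().lower().rstrip(".")
    suffix = "." + base_domain
    if not hostname.endswith(suffix):
        return None  # custom domain or unrelated host
    label = hostname[: -len(suffix)]
    # Reject empty labels and nested subdomains such as "a.b.atomagentos.com".
    if not label or "." in label:
        return None
    return label
```

The returned label would then be resolved to a `tenant_id` in the database, as the last bullet describes.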
2. Row-Level Security (RLS)
-- RLS Policy Example
ALTER TABLE agents ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON agents
  FOR ALL
  USING (tenant_id = current_setting('app.current_tenant_id')::UUID);
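The policy only filters rows if `app.current_tenant_id` is set before queries run. A minimal sketch of doing that per transaction follows, assuming a DB-API connection with `%s` placeholders (as in psycopg); `tenant_scope` is a hypothetical helper, not the platform's actual code.

```python
import uuid
from contextlib import contextmanager


@contextmanager
def tenant_scope(conn, tenant_id: str):
    """Yield a cursor whose transaction has app.current_tenant_id set for RLS."""
    tenant = str(uuid.UUID(tenant_id))  # raises ValueError on malformed input
    cur = conn.cursor()
    try:
        # The third argument (is_local = true) limits the setting to this transaction.
        cur.execute(
            "SELECT set_config('app.current_tenant_id', %s, true)", (tenant,)
        )
        yield cur
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
```

Scoping the setting to the transaction means a pooled connection cannot leak one tenant's context into the next request. Note that in PostgreSQL, superusers and table owners bypass row security unless `FORCE ROW LEVEL SECURITY` is enabled on the table, so the application role should be neither.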
3. S3 Prefix Isolation
- Each tenant gets dedicated S3 prefix
- Path format: s3://atom-saas/{tenant_id}/uploads/
- Bucket policies enforce prefix access
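Bucket policies are the enforcement point, but the application can also refuse to build keys outside a tenant's prefix. The sketch below assumes `tenant_id` is a UUID (as the RLS cast suggests); `tenant_object_key` and `tenant_object_uri` are hypothetical helpers.

```python
import posixpath
import uuid

BUCKET = "atom-saas"  # bucket name from the path format above


def tenant_object_key(tenant_id: str, relative_path: str) -> str:
    """Build an object key under {tenant_id}/uploads/, rejecting escapes."""
    tenant = str(uuid.UUID(tenant_id))  # raises ValueError on malformed input
    prefix = f"{tenant}/uploads/"
    # S3 treats ".." literally, but normalizing defensively keeps keys
    # predictable and catches attempts to reference another tenant's prefix.
    key = posixpath.normpath(posixpath.join(prefix, relative_path.lstrip("/")))
    if not key.startswith(prefix):
        raise ValueError("path escapes the tenant prefix")
    return key


def tenant_object_uri(tenant_id: str, relative_path: str) -> str:
    """Return the full s3:// URI for a tenant upload."""
    return f"s3://{BUCKET}/{tenant_object_key(tenant_id, relative_path)}"
```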
4. Redis Namespace
- Keys namespaced: tenant:{tenant_id}:rate_limit
- Pub/sub channels scoped: tenant:{tenant_id}:events
- Session isolation guaranteed
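Centralizing key construction is one way to keep the namespace convention consistent. The helpers below are a hypothetical sketch that only builds the key strings shown above; they do not talk to Redis.

```python
import uuid


def tenant_key(tenant_id: str, *parts: str) -> str:
    """Build a namespaced key of the form tenant:{tenant_id}:part[:part...]."""
    tenant = str(uuid.UUID(tenant_id))  # raises ValueError on malformed input
    for part in parts:
        # A ":" inside a part would blur the boundary between namespaces.
        if not part or ":" in part:
            raise ValueError(f"invalid key part: {part!r}")
    return ":".join(("tenant", tenant, *parts))


def rate_limit_key(tenant_id: str) -> str:
    return tenant_key(tenant_id, "rate_limit")


def events_channel(tenant_id: str) -> str:
    return tenant_key(tenant_id, "events")
```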
5. Application-Level Filtering
- All queries include WHERE tenant_id = ?
- API responses filter tenant data
- Background jobs scoped to tenant
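A common way to make "all queries include WHERE tenant_id = ?" hard to forget is a repository object bound to one tenant. The sketch below uses an in-memory SQLite database purely so it runs standalone; the `agents` table comes from the RLS example, while the `name` column and the class itself are assumptions.

```python
import sqlite3
from typing import List


class TenantScopedAgents:
    """Repository bound to one tenant; every statement carries its tenant_id."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        self._conn = conn
        self._tenant_id = tenant_id

    def add(self, name: str) -> None:
        self._conn.execute(
            "INSERT INTO agents (tenant_id, name) VALUES (?, ?)",
            (self._tenant_id, name),
        )

    def list_names(self) -> List[str]:
        rows = self._conn.execute(
            "SELECT name FROM agents WHERE tenant_id = ? ORDER BY name",
            (self._tenant_id,),
        )
        return [row[0] for row in rows]


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a minimal agents table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE agents ("
        "id INTEGER PRIMARY KEY, tenant_id TEXT NOT NULL, name TEXT NOT NULL)"
    )
    return conn
```

Because callers never write the tenant filter themselves, this layer acts as a second line of defense alongside RLS rather than a replacement for it.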
Agent Execution Flow
Complete request lifecycle from user input to agent response:
Execution Stages
1. Request Validation
- Authenticate user session
- Extract tenant context
- Validate request schema
2. Governance Checks
- Rate limit validation (per-tenant)
- Permission check (agent maturity)
- Safety guardrails
3. Context Resolution
- Load agent configuration
- Resolve task context
- Fetch relevant settings
4. Cognitive Processing
- Recall relevant experiences (World Model)
- Generate reasoning chain
- Determine optimal approach
5. Skill Execution
- Load required skills
- Execute actions
- Handle integration calls
6. Learning & Recording
- Record experience to World Model
- Extract learnings
- Update patterns
7. Response Generation
- Format response
- Include metadata
- Return to user
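The seven stages above can be pictured as an ordered pipeline in which any stage may abort the request. The sketch below is only a structural illustration: every name is hypothetical, and stages whose logic this document does not describe are left as placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class ExecutionContext:
    tenant_id: str
    user_input: str
    completed: List[str] = field(default_factory=list)
    response: str = ""


class StageError(Exception):
    """Raised by a stage to stop the request."""


def validate_request(ctx: ExecutionContext) -> None:
    if not ctx.tenant_id:
        raise StageError("missing tenant context")
    if not ctx.user_input.strip():
        raise StageError("empty request")


def placeholder(ctx: ExecutionContext) -> None:
    """Stand-in for stages whose logic is not described in this document."""


def generate_response(ctx: ExecutionContext) -> None:
    ctx.response = f"handled: {ctx.user_input.strip()}"


Stage = Tuple[str, Callable[[ExecutionContext], None]]

STAGES: Sequence[Stage] = (
    ("request_validation", validate_request),
    ("governance_checks", placeholder),
    ("context_resolution", placeholder),
    ("cognitive_processing", placeholder),
    ("skill_execution", placeholder),
    ("learning_and_recording", placeholder),
    ("response_generation", generate_response),
)


def run_pipeline(ctx: ExecutionContext, stages: Sequence[Stage] = STAGES) -> ExecutionContext:
    for name, stage in stages:
        stage(ctx)  # a StageError here aborts all later stages
        ctx.completed.append(name)
    return ctx
```

Running governance checks before any cognitive or skill stage, as the ordering shows, means a rejected request never reaches code that can call integrations.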
Data Flow Diagrams
Agent Creation Flow
Graduation Exam Flow
Skill Execution Flow
Security Architecture
Multiple security layers protect tenant data and ensure safe agent behavior:
Security Layers
1. Network Security
- TLS 1.3 for all connections
- DDoS protection (Global edge network)
- IP whitelisting (enterprise)
2. Authentication
- JWT-based sessions
- OAuth 2.0 for integrations
- API key support (BYOK)
3. Tenant Isolation
- Subdomain-based routing
- Row-Level Security (PostgreSQL)
- Storage prefix isolation
- Cache namespace separation
4. Agent Governance
- Maturity-based permissions
- Real-time permission validation
- Constitutional guardrails
- Comprehensive audit logging
5. Abuse Protection
- Per-tenant rate limits
- Resource quotas (storage, API calls)
- Anomaly detection
- Automatic throttling
Scalability Architecture
Horizontal and vertical scaling strategies:
Horizontal Scaling
Auto-Scaling:
- CPU-based scaling triggers
- Memory-based scaling triggers
- Request queue-based scaling
- Regional distribution
Vertical Scaling
Database:
- Connection pooling (PgBouncer)
- Read replicas for analytics
- Partitioned tables (by tenant)
- Index optimization
Cache:
- Redis cluster for high availability
- Tiered caching (L1: memory, L2: Redis)
- Intelligent cache invalidation
Monitoring & Observability
Technology Rationale
Why Next.js?
- React Server Components for performance
- Built-in API routes for backend logic
- Excellent developer experience
- Strong TypeScript support
- SEO optimization
Why FastAPI?
- Native async support
- Automatic OpenAPI documentation
- High performance (comparable to Node.js)
- Strong type validation (Pydantic)
- Easy testing
Why PostgreSQL?
- ACID compliance
- Row-Level Security
- pgvector for vector similarity
- Excellent reliability
- Strong ecosystem
Why Neon?
- Serverless PostgreSQL
- Auto-scaling storage
- Branch-based development
- Built-in connection pooling
- Competitive pricing
Why LanceDB?
- Embedded vector database
- High-performance semantic search
- Python-native
- No separate infrastructure
- Open source
Why Redis?
- In-memory performance
- Rich data structures
- Pub/sub support
- Rate limiting capabilities
- Session management
Why ATOM Managed Infrastructure?
- Simple deployment model
- Built-in load balancing
- Multi-region support
- Integrated security
- Optimized performance
Architecture Patterns Used
1. Layered Architecture
- Clear separation of concerns
- Each layer has specific responsibility
- Easy to test and maintain
2. Event-Driven Architecture
- Agent executions trigger events
- Background jobs process asynchronously
- Real-time updates via pub/sub
3. Multi-Tenancy Patterns
- Subdomain-based routing
- Row-Level Security
- Tenant-scoped caching
- Isolated storage
4. Plugin Architecture
- Skill registry for dynamic loading
- Integration adapters
- Extensible brain systems
5. CQRS (Command Query Responsibility Segregation)
- Separate read and write models
- Optimized for each use case
- Complex queries use read replicas
Performance Considerations
Database Optimization
- Connection pooling (max 20 connections)
- Read replicas for analytics queries
- Indexed foreign keys
- Partitioned tables by tenant
Caching Strategy
- L1 cache: In-memory (frequently accessed)
- L2 cache: Redis (shared across instances)
- Cache TTL: 5-60 minutes depending on data
- Invalidation on updates
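The two-tier lookup can be sketched as follows. A plain dict stands in for Redis so the example runs on its own; `TieredCache` is a hypothetical class, and the 300-second default is simply the low end of the TTL range above.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

Entry = Tuple[float, Any]  # (expires_at, value)


class TieredCache:
    """L1 is a per-instance dict; L2 is a shared store (a dict standing in for Redis)."""

    def __init__(
        self,
        l2: Dict[str, Entry],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._l1: Dict[str, Entry] = {}
        self._l2 = l2
        self._ttl = ttl_seconds
        self._clock = clock

    def _fresh(self, store: Dict[str, Entry], key: str) -> Optional[Entry]:
        entry = store.get(key)
        if entry is None:
            return None
        if self._clock() >= entry[0]:
            store.pop(key, None)  # expired: drop it and report a miss
            return None
        return entry

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        entry = self._fresh(self._l1, key)
        if entry is None:
            entry = self._fresh(self._l2, key)
            if entry is None:
                # Miss in both tiers: load from the source of truth.
                entry = (self._clock() + self._ttl, loader())
                self._l2[key] = entry
            self._l1[key] = entry  # promote to L1
        return entry[1]

    def invalidate(self, key: str) -> None:
        """Call on updates; clears this instance's L1 and the shared L2."""
        self._l1.pop(key, None)
        self._l2.pop(key, None)
```

With a real Redis the expiry would be handled server-side, and `invalidate` would not reach the L1 caches of other instances; those would need a pub/sub notification or a short L1 TTL.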
API Performance
- Response time target: < 200ms (p95)
- Rate limits: 50/day (free), 5000/day (team)
- Pagination for large result sets
- Compression enabled (gzip)
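The daily limits could be enforced with a fixed-window counter per tenant, sketched below. The 50 and 5000 figures come from the list above; the UTC-midnight reset, the date suffix on the key, and the `DailyRateLimiter` class are assumptions. A dict stands in for Redis, where the same logic would typically use an atomic increment plus a key expiry.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

DAILY_LIMITS = {"free": 50, "team": 5000}  # requests per day, from the list above


class DailyRateLimiter:
    """Fixed-window limiter: one counter per tenant per UTC day."""

    def __init__(self, counters: Optional[Dict[str, int]] = None) -> None:
        self._counters: Dict[str, int] = {} if counters is None else counters

    def allow(self, tenant_id: str, plan: str, now: Optional[datetime] = None) -> bool:
        limit = DAILY_LIMITS[plan]  # KeyError for an unknown plan
        moment = now if now is not None else datetime.now(timezone.utc)
        day = moment.astimezone(timezone.utc).strftime("%Y-%m-%d")
        key = f"tenant:{tenant_id}:rate_limit:{day}"
        count = self._counters.get(key, 0) + 1
        self._counters[key] = count
        return count <= limit
```

A fixed window is simple but allows a burst across the midnight boundary; a sliding window or token bucket would smooth that at the cost of more state.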
Background Jobs
- Async task processing
- Job queues (Redis-based)
- Automatic retries with exponential backoff
- Dead letter queue for failed jobs
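Retry with exponential backoff and a dead letter queue can be sketched as below. The base delay, cap, and attempt count are illustrative values, not settings from this document, and a Python list stands in for the Redis-based queue.

```python
import time
from typing import Any, Callable, List, Optional, Tuple


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before the next try: base * 2^attempt, capped."""
    return min(cap, base * (2 ** attempt))


def run_with_retries(
    job: Callable[[Any], None],
    payload: Any,
    dead_letters: List[Tuple[Any, str]],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run job(payload); on repeated failure, park the payload in dead_letters."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            job(payload)
            return True
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts - 1:
                sleep(backoff_delay(attempt))  # no sleep after the final attempt
    # All attempts failed: keep the payload and the error for inspection.
    dead_letters.append((payload, repr(last_error)))
    return False
```

Production systems often add random jitter to the delay so that many failing jobs do not retry in lockstep.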
Last Updated: 2025-02-06
Architecture Version: 8.0 (Production Ready)