Architecture Overview

This page describes the complete system architecture of ATOM SaaS, a multi-tenant AI agent platform with cognitive architectures, learning engines, and enterprise-grade governance.

High-Level Architecture

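ATOM SaaS follows a layered architecture with clear separation of concerns: a presentation layer (frontend), an API layer (backend), a data layer, and the hosting infrastructure beneath them.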

Technology Stack

Frontend (Presentation Layer)

Web Application:

Framework: Next.js 14 (App Router)
Language: TypeScript 5.x
UI Library: React 18
Styling: Tailwind CSS
Components: Radix UI primitives
Editor: Monaco (VS Code editor)
State: React Context + Server Components

Desktop Application:

Framework: Tauri 2.0
Language: Rust (backend), JavaScript (frontend)
Features: Terminal access, Docker integration, local execution
Security: Sandboxed execution with permission prompts

Backend (API Layer)

Unified Backend:

Runtime: Managed Compute Node running dual processes via supervisord
Frontend Port: 3000 (Next.js)
Backend Port: 8000 (FastAPI)
Internal Comm: Next.js proxies /api/v1 requests to local FastAPI instance

Data Layer

Primary Database:

Database: PostgreSQL 15+
Extension: pgvector (vector similarity)
Security: Row-Level Security (RLS) for tenant isolation
Hosting: Neon PostgreSQL (serverless)

Vector Database:

Database: LanceDB
Purpose: Semantic search for World Model
Storage: Local file system (persistent volumes)

Caching:

Cache: Redis
Purpose: Rate limiting, session caching, pub/sub
Hosting: Upstash Redis

File Storage:

Storage: AWS S3
Purpose: User uploads, agent artifacts, canvas exports
Isolation: Tenant-specific prefixes (s3://atom-saas/{tenant_id}/)

Infrastructure

Hosting:

Platform: ATOM Cloud Platform
Regions: Multiple regions for low latency (Anycast network)
Features: Auto-scaling, health checks, rolling deployments

CI/CD:

Pipeline: GitHub Actions
Testing: 212 E2E tests (100% compliance)
Deployment: Automated on merge to main

Brain Systems Architecture

The brain systems form the core intelligence layer that enables human-like agent behavior.

Brain System Responsibilities

1. Cognitive Architecture

Human-like reasoning process
Attention allocation
Memory recall coordination
Language processing
Problem-solving strategies

2. Learning Engine

Experience recording (RLHF)
Pattern recognition
Adaptation generation
Behavior modification
Performance optimization

3. World Model

Long-term memory storage
Semantic similarity search
Experience recall by relevance
Canvas context tracking
Feedback-aware retrieval
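
The recall behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the platform's implementation: the `Experience` type, the `recall` function, and the additive `feedback_weight` term are all assumptions, and a real deployment would delegate the similarity search to LanceDB rather than scanning a list.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Experience:
    text: str
    embedding: Sequence[float]
    feedback: float = 0.0  # hypothetical score in [-1, 1] derived from user feedback


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall(
    query: Sequence[float],
    memory: List[Experience],
    k: int = 3,
    feedback_weight: float = 0.2,  # assumed blend factor, not from the source
) -> List[Experience]:
    """Return the k experiences ranked by similarity adjusted by feedback."""
    scored = [
        (cosine(query, e.embedding) + feedback_weight * e.feedback, e)
        for e in memory
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:k]]
```

With this scoring, an experience that received negative feedback ranks below an equally similar one with positive feedback, which is one plausible reading of "feedback-aware retrieval".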

4. Reasoning Engine

Proactive intelligence
Intervention generation
Opportunity identification
Automation suggestions
Trend analysis

5. Cross-System Reasoning

Multi-agent coordination
Cross-system data correlation
Complex problem decomposition
Knowledge synthesis

6. Alpha Evolver

Autonomous code mutation
Sandbox-based variant testing
Workflow performance optimization
Self-improving toolsets

7. Agent Governance

Permission validation
Maturity Calibration (AI-driven)
Safety checks
Audit logging
Rate limiting

Detailed Brain Systems →

Multi-Tenancy Architecture

Tenant isolation is implemented at multiple layers for enterprise-grade security:

Tenant Isolation Layers

1. Subdomain Routing

Each tenant gets unique subdomain: tenant.atomagentos.com
Custom domains supported
Subdomain mapped to tenant_id in database
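
As a minimal sketch of the subdomain step, the helper below extracts the tenant label from a Host header. The function name `tenant_slug_from_host` and the `BASE_DOMAIN` constant are illustrative assumptions; mapping the label (or a custom domain) to a `tenant_id` would still require the database lookup described above, which is not shown.

```python
from typing import Optional

BASE_DOMAIN = "atomagentos.com"  # taken from the example subdomain above


def tenant_slug_from_host(host: str) -> Optional[str]:
    """Return the tenant label for `<tenant>.atomagentos.com`, else None."""
    # Drop an optional port, normalize case, and strip a trailing dot.
    hostname = host.split(":", 1)[0].strip().lower().rstrip(".")
    suffix = "." + BASE_DOMAIN
    if not hostname.endswith(suffix):
        return None  # custom domain or unknown host: resolve elsewhere
    label = hostname[: -len(suffix)]
    # Accept exactly one non-empty label; reject nested subdomains.
    if not label or "." in label:
        return None
    return label
```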

2. Row-Level Security (RLS)

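-- RLS policy example: rows in agents are visible only to the current tenant
ALTER TABLE agents ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON agents
  FOR ALL
  -- The application must set app.current_tenant_id before querying
  USING (tenant_id = current_setting('app.current_tenant_id')::UUID);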
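
The policy only works if the application sets `app.current_tenant_id` before it queries. A minimal sketch of that step is shown below, assuming a DB-API cursor that uses the `%s` parameter style (psycopg, for example). The helper name `set_tenant_context` is hypothetical; `set_config` is the standard PostgreSQL function, and its third argument makes the value local to the current transaction.

```python
import uuid
from typing import Any

SET_TENANT_SQL = "SELECT set_config('app.current_tenant_id', %s, true)"


def set_tenant_context(cursor: Any, tenant_id: str) -> None:
    """Bind the tenant to the current transaction so the RLS policy applies.

    `cursor` is any DB-API cursor that accepts `%s` placeholders.
    """
    # Validate and normalize first; uuid.UUID raises ValueError on bad input.
    normalized = str(uuid.UUID(tenant_id))
    cursor.execute(SET_TENANT_SQL, (normalized,))
```

Two PostgreSQL behaviors are worth keeping in mind here. `current_setting` raises an error when the setting has never been defined, so a request that skips this step fails instead of seeing every tenant's rows. RLS is also bypassed for superusers and, by default, for the table owner, so the application should connect as a separate role or the table should use `FORCE ROW LEVEL SECURITY`.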

3. S3 Prefix Isolation

Each tenant gets dedicated S3 prefix
Path format: s3://atom-saas/{tenant_id}/uploads/
Bucket policies enforce prefix access
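
A small sketch of how an upload key could be built so that it always stays inside the tenant's prefix. The function names are hypothetical, and treating `tenant_id` as a UUID is an assumption based on the cast in the RLS policy; bucket policies remain the enforcement layer, with this check acting only as a first line of defense in application code.

```python
import posixpath
import uuid

BUCKET = "atom-saas"  # bucket name from the path format above


def tenant_object_key(tenant_id: str, filename: str) -> str:
    """Build `{tenant_id}/uploads/{filename}`, discarding any directory parts."""
    tenant = str(uuid.UUID(tenant_id))  # raises ValueError for malformed IDs
    # Keep only the final path component so "../" cannot escape the prefix.
    name = posixpath.basename(filename.replace("\\", "/"))
    if name in ("", ".", ".."):
        raise ValueError("invalid filename")
    return f"{tenant}/uploads/{name}"


def tenant_object_uri(tenant_id: str, filename: str) -> str:
    """Return the full s3:// URI for a tenant upload."""
    return f"s3://{BUCKET}/{tenant_object_key(tenant_id, filename)}"
```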

4. Redis Namespace

Keys namespaced: tenant:{tenant_id}:rate_limit
Pub/sub channels scoped: tenant:{tenant_id}:events
Session isolation guaranteed
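
The key scheme above can be centralized in one helper so that no call site builds keys by hand. The sketch below is illustrative; `tenant_key` is a hypothetical name, and rejecting colons is an assumed safeguard against an identifier that would otherwise spill into another tenant's namespace.

```python
def tenant_key(tenant_id: str, *parts: str) -> str:
    """Build a namespaced key such as `tenant:{tenant_id}:rate_limit`."""
    segments = (tenant_id, *parts)
    if len(segments) < 2:
        raise ValueError("at least one key part is required")
    for segment in segments:
        # A colon inside a segment would blur the namespace boundary.
        if not segment or ":" in segment:
            raise ValueError(f"invalid key segment: {segment!r}")
    return ":".join(("tenant", *segments))
```

For example, `tenant_key("t1", "rate_limit")` yields `tenant:t1:rate_limit`, and `tenant_key("t1", "events")` yields the pub/sub channel name `tenant:t1:events`.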

5. Application-Level Filtering

All queries include WHERE tenant_id = ?
API responses filter tenant data
Background jobs scoped to tenant

Detailed Multi-Tenancy →

Agent Execution Flow

Complete request lifecycle from user input to agent response:

Execution Stages

1. Request Validation

Authenticate user session
Extract tenant context
Validate request schema

2. Governance Checks

Rate limit validation (per-tenant)
Permission check (agent maturity)
Safety guardrails

3. Context Resolution

Load agent configuration
Resolve task context
Fetch relevant settings

4. Cognitive Processing

Recall relevant experiences (World Model)
Generate reasoning chain
Determine optimal approach

5. Skill Execution

Load required skills
Execute actions
Handle integration calls

6. Learning & Recording

Record experience to World Model
Extract learnings
Update patterns

7. Response Generation

Format response
Include metadata
Return to user
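
The seven stages can be pictured as a simple pipeline in which each stage receives a context, enriches it, and either passes it on or aborts the request. The sketch below is a simplified illustration under that assumption; every function name and context field is hypothetical, and the stage bodies are stubs.

```python
from typing import Any, Callable, Dict, List

Context = Dict[str, Any]
Stage = Callable[[Context], Context]


class StageError(Exception):
    """Raised by a stage to stop the request early."""


def validate_request(ctx: Context) -> Context:
    if not ctx.get("tenant_id"):
        raise StageError("missing tenant context")
    return ctx


def governance_checks(ctx: Context) -> Context:
    if ctx.get("rate_limited"):
        raise StageError("rate limit exceeded")
    return ctx


def resolve_context(ctx: Context) -> Context:
    return {**ctx, "agent_config": {}}


def cognitive_processing(ctx: Context) -> Context:
    return {**ctx, "plan": "echo"}


def execute_skills(ctx: Context) -> Context:
    return {**ctx, "result": ctx["input"]}


def record_learning(ctx: Context) -> Context:
    return ctx  # a real stage would write to the World Model here


def generate_response(ctx: Context) -> Context:
    return {**ctx, "response": {"output": ctx["result"]}}


STAGES: List[Stage] = [
    validate_request, governance_checks, resolve_context,
    cognitive_processing, execute_skills, record_learning, generate_response,
]


def run_pipeline(stages: List[Stage], ctx: Context) -> Context:
    """Run stages in order, recording each completed stage name in ctx['trace']."""
    for stage in stages:
        ctx = stage(ctx)
        ctx.setdefault("trace", []).append(stage.__name__)
    return ctx
```

Placing validation and governance first means a rejected request never reaches the more expensive cognitive and skill stages.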

Data Flow Diagrams

Agent Creation Flow

Graduation Exam Flow

Skill Execution Flow

Security Architecture

Multiple security layers protect tenant data and ensure safe agent behavior:

Security Layers

1. Network Security

TLS 1.3 for all connections
DDoS protection (Global edge network)
IP whitelisting (enterprise)

2. Authentication

JWT-based sessions
OAuth 2.0 for integrations
API key support (BYOK)

3. Tenant Isolation

Subdomain-based routing
Row-Level Security (PostgreSQL)
Storage prefix isolation
Cache namespace separation

4. Agent Governance

Maturity-based permissions
Real-time permission validation
Constitutional guardrails
Comprehensive audit logging
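
To make maturity-based permissions concrete, here is a minimal sketch of a permission check that also writes an audit entry. The maturity level names, the action names, and the mapping between them are hypothetical placeholders, since this page does not define them; only the general shape (compare the agent's level with the level an action requires, deny by default, log every decision) is meant to carry over.

```python
from enum import IntEnum
from typing import Dict, List


class Maturity(IntEnum):
    # Hypothetical level names; not taken from this document.
    STUDENT = 1
    INTERN = 2
    SUPERVISED = 3
    AUTONOMOUS = 4


# Hypothetical actions and the minimum maturity each one requires.
REQUIRED_MATURITY: Dict[str, Maturity] = {
    "read_data": Maturity.STUDENT,
    "draft_message": Maturity.INTERN,
    "send_message": Maturity.SUPERVISED,
    "delete_record": Maturity.AUTONOMOUS,
}


def is_allowed(level: Maturity, action: str) -> bool:
    """Deny unknown actions; otherwise compare levels."""
    required = REQUIRED_MATURITY.get(action)
    return required is not None and level >= required


def authorize(agent_id: str, level: Maturity, action: str, audit_log: List[dict]) -> bool:
    """Check the permission and record the decision, allowed or not."""
    allowed = is_allowed(level, action)
    audit_log.append({"agent_id": agent_id, "action": action, "allowed": allowed})
    return allowed
```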

5. Abuse Protection

Per-tenant rate limits
Resource quotas (storage, API calls)
Anomaly detection
Automatic throttling

Scalability Architecture

Horizontal and vertical scaling strategies:

Horizontal Scaling

Auto-Scaling:

CPU-based scaling triggers
Memory-based scaling triggers
Request queue-based scaling
Regional distribution

Vertical Scaling

Database:

Connection pooling (PgBouncer)
Read replicas for analytics
Partitioned tables (by tenant)
Index optimization

Cache:

Redis cluster for high availability
Tiered caching (L1: memory, L2: Redis)
Intelligent cache invalidation

Monitoring & Observability

Detailed Monitoring →

Technology Rationale

Why Next.js?

React Server Components for performance
Built-in API routes for backend logic
Excellent developer experience
Strong TypeScript support
SEO optimization

Why FastAPI?

Native async support
Automatic OpenAPI documentation
High performance (comparable to Node.js)
Strong type validation (Pydantic)
Easy testing

Why PostgreSQL?

ACID compliance
Row-Level Security
pgvector for vector similarity
Excellent reliability
Strong ecosystem

Why Neon?

Serverless PostgreSQL
Auto-scaling storage
Branch-based development
Built-in connection pooling
Competitive pricing

Why LanceDB?

Embedded vector database
High-performance semantic search
Python-native
No separate infrastructure
Open source

Why Redis?

In-memory performance
Rich data structures
Pub/sub support
Rate limiting capabilities
Session management

Why ATOM Managed Infrastructure?

Simple deployment model
Built-in load balancing
Multi-region support
Integrated security
Optimized performance

Architecture Patterns Used

1. Layered Architecture

Clear separation of concerns
Each layer has specific responsibility
Easy to test and maintain

2. Event-Driven Architecture

Agent executions trigger events
Background jobs process asynchronously
Real-time updates via pub/sub

3. Multi-Tenancy Patterns

Subdomain-based routing
Row-Level Security
Tenant-scoped caching
Isolated storage

4. Plugin Architecture

Skill registry for dynamic loading
Integration adapters
Extensible brain systems
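
A skill registry of this kind is often built with a registration decorator. The sketch below shows that general pattern only; the `skill` decorator, the `SKILLS` mapping, and the `echo` skill are hypothetical and do not describe the platform's actual registry.

```python
from typing import Any, Callable, Dict

SKILLS: Dict[str, Callable[..., Any]] = {}


def skill(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Register a function under `name` so it can be looked up at runtime."""
    def register(fn: Callable[..., Any]) -> Callable[..., Any]:
        if name in SKILLS:
            raise ValueError(f"skill already registered: {name}")
        SKILLS[name] = fn
        return fn
    return register


@skill("echo")
def echo(text: str) -> str:
    return text


def run_skill(name: str, **kwargs: Any) -> Any:
    """Look up a skill by name and call it with keyword arguments."""
    try:
        fn = SKILLS[name]
    except KeyError:
        raise LookupError(f"unknown skill: {name}") from None
    return fn(**kwargs)
```

New skills then become available by importing the module that defines them, without changes to the dispatcher.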

5. CQRS (Command Query Responsibility Segregation)

Separate read and write models
Optimized for each use case
Complex queries use read replicas

Performance Considerations

Database Optimization

Connection pooling (max 20 connections)
Read replicas for analytics queries
Indexed foreign keys
Partitioned tables by tenant

Caching Strategy

L1 cache: In-memory (frequently accessed)
L2 cache: Redis (shared across instances)
Cache TTL: 5-60 minutes depending on data
Invalidation on updates
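
The two-tier lookup can be sketched as below. This is a simplified illustration: a plain dictionary stands in for Redis, the class name `TieredCache` is hypothetical, and only the L1 TTL is modeled (with Redis, the L2 entry would normally be written with its own expiry).

```python
import time
from typing import Any, Callable, Dict, Tuple


class TieredCache:
    """L1 is a per-process dict with TTL; L2 is a shared store (Redis stand-in)."""

    def __init__(
        self,
        l2: Dict[str, Any],
        ttl_seconds: float = 300.0,  # 5 minutes, the low end of the range above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._l1: Dict[str, Tuple[float, Any]] = {}
        self._l2 = l2
        self._ttl = ttl_seconds
        self._clock = clock

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        hit = self._l1.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # fresh L1 hit
        if key in self._l2:
            value = self._l2[key]  # L2 hit
        else:
            value = loader()  # miss in both tiers: load from the source
            self._l2[key] = value
        self._l1[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop the key from both tiers after an update."""
        self._l1.pop(key, None)
        self._l2.pop(key, None)
```

Note that `invalidate` clears only the local L1. With several application instances, the other processes would keep their L1 copy until the TTL expires unless an invalidation message is broadcast, for example over the pub/sub channel mentioned earlier.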

API Performance

Response time target: < 200ms (p95)
Rate limits: 50/day (free), 5000/day (team)
Pagination for large result sets
Compression enabled (gzip)
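
The per-plan daily limits can be illustrated with a fixed-window counter. In the sketch below an in-memory dictionary stands in for Redis, the class name is hypothetical, and counting one unit per API request is an assumption, since the figures above do not state what is being counted.

```python
import datetime as dt
from typing import Dict, Tuple

DAILY_LIMITS = {"free": 50, "team": 5000}  # figures from the list above


class DailyRateLimiter:
    """Fixed-window counter keyed by tenant and calendar day."""

    def __init__(self) -> None:
        self._counts: Dict[Tuple[str, dt.date], int] = {}

    def allow(self, tenant_id: str, plan: str, today: dt.date) -> bool:
        """Count the request and return True if it is within the plan's limit."""
        key = (tenant_id, today)
        used = self._counts.get(key, 0)
        if used >= DAILY_LIMITS[plan]:
            return False
        self._counts[key] = used + 1
        return True
```

With Redis, the same idea is commonly implemented with `INCR` on a per-tenant, per-day key plus `EXPIRE`, which keeps the counter shared across application instances.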

Background Jobs

Async task processing
Job queues (Redis-based)
Automatic retries with exponential backoff
Dead letter queue for failed jobs
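
Retries with exponential backoff and a dead letter queue can be sketched as follows. The function names, the default of three retries, and the one-second base delay are illustrative assumptions; a production worker would usually also add random jitter to the delays.

```python
import time
from typing import Any, Callable, List, Optional, Tuple

DeadLetter = Tuple[Callable[[], Any], BaseException]


def backoff_delays(
    max_retries: int, base: float = 1.0, factor: float = 2.0, cap: float = 60.0
) -> List[float]:
    """Delays before each retry: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(max_retries)]


def run_with_retries(
    job: Callable[[], Any],
    dead_letters: List[DeadLetter],
    max_retries: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    base: float = 1.0,
) -> Optional[Any]:
    """Run `job`; on repeated failure, park it in `dead_letters` and return None."""
    last_error: Optional[BaseException] = None
    # One initial attempt with no delay, then one attempt per backoff delay.
    for delay in [0.0] + backoff_delays(max_retries, base):
        if delay:
            sleep(delay)
        try:
            return job()
        except Exception as exc:
            last_error = exc
    assert last_error is not None
    dead_letters.append((job, last_error))
    return None
```

Passing `sleep` as a parameter keeps the function testable without real waiting, and keeping the last exception alongside the job makes dead-lettered work easier to inspect later.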

Last Updated: 2025-02-06
Architecture Version: 8.0 (Production Ready)