ATOM Documentation

Anomaly Detection System - Implementation Summary

Overview

This feature adds an anomaly detection system that flags unusual daily spending using a standard-deviation threshold (the 3-sigma rule by default). It helps users detect unexpected cost spikes, billing errors, resource abuse, and integration issues.

Implementation Details

1. Backend Service

**File:** /Users/rushiparikh/projects/atom-saas/backend-saas/core/anomaly_detection_service.py

**Key Features:**

  • **Statistical Analysis:** Uses 3-sigma rule (standard deviation-based) for anomaly detection
  • **Configurable Sensitivity:** Three sensitivity levels (high=2σ, medium=3σ, low=4σ)
  • **Severity Classification:** Categorizes anomalies as high (5x), medium (3x), or low (2x) severity
  • **Daily Aggregation:** Analyzes TokenUsage data aggregated by day
  • **Trend Analysis:** Calculates usage trends and week-over-week changes
  • **Alert Integration:** Sends notifications to tenant admins when anomalies are detected

**Core Methods:**

async def detect_anomalies(self, days=30, sensitivity="medium"):
    """Return anomalies with statistics: avg, std, threshold, spike_ratio."""

async def get_usage_trends(self, days=30):
    """Return daily usage data for visualization."""

async def check_and_alert_anomalies(self, days=7, sensitivity="medium"):
    """Detect anomalies and send alerts automatically."""

**Algorithm:**

  1. Fetch daily usage from TokenUsage table (cost_usd + markup_usd)
  2. Calculate mean and standard deviation
  3. Define threshold = mean + (sensitivity_level × std_dev)
  4. Flag days exceeding threshold as anomalies
  5. Calculate spike_ratio = daily_cost / mean
  6. Classify severity based on spike_ratio
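Steps 2 through 5 above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the service's actual code: `find_anomalies` and `SIGMA_LEVELS` are hypothetical names, the input is a plain dict rather than a database result, and the choice of sample standard deviation is an assumption.

```python
from statistics import mean, stdev

# Sigma multipliers per sensitivity level, as listed in this document.
SIGMA_LEVELS = {"high": 2.0, "medium": 3.0, "low": 4.0}


def find_anomalies(daily_costs, sensitivity="medium"):
    """Flag days whose cost exceeds mean + n * std (steps 2-5).

    daily_costs maps an ISO date string to that day's total cost in USD.
    Assumption: sample standard deviation (statistics.stdev); the real
    service may use the population form instead.
    """
    costs = list(daily_costs.values())
    avg = mean(costs)
    std = stdev(costs)
    threshold = avg + SIGMA_LEVELS[sensitivity] * std

    anomalies = []
    for day, cost in daily_costs.items():
        if cost > threshold:
            anomalies.append({
                "date": day,
                "total_cost_usd": cost,
                "spike_ratio": round(cost / avg, 2),
            })
    return {"avg": avg, "std": std, "threshold": threshold, "anomalies": anomalies}


# 29 ordinary days around $3 plus one $15.50 spike (made-up numbers).
usage = {f"2026-03-{d:02d}": 3.0 + 0.1 * (d % 3) for d in range(1, 30)}
usage["2026-03-30"] = 15.50
result = find_anomalies(usage)
print([a["date"] for a in result["anomalies"]])
```

With this made-up series, only the spike day exceeds the threshold. Note that the spike itself is part of the mean and standard deviation, so a very large outlier also raises the threshold it is compared against.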

2. Backend API Endpoints

**File:** /Users/rushiparikh/projects/atom-saas/backend-saas/api/routes/billing_routes.py

**Endpoints Added:**

GET `/billing/anomalies`

Get usage anomalies for the last N days.

**Query Parameters:**

  • days: Number of days to analyze (7-90, default 30)
  • sensitivity: Detection sensitivity ("high", "medium", or "low"; default "medium")

**Response:**

{
  "anomalies": [
    {
      "date": "2026-04-01",
      "total_cost_usd": 15.50,
      "total_tokens": 125000,
      "llm_calls": 45,
      "expected_range": "3.20 ± 1.50",
      "spike_ratio": 4.84,
      "severity": "medium",
      "deviation_from_mean": 12.30,
      "std_deviations": 8.20
    }
  ],
  "avg_daily_usage": 3.20,
  "std_daily_usage": 1.50,
  "threshold": 7.70,
  "total_days_analyzed": 30,
  "days_with_data": 28,
  "sensitivity": "medium",
  "sigma_level": 3.0
}
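The numeric fields in the example response are related to each other. The short check below reproduces them from the summary statistics. The formulas for `threshold` and `spike_ratio` come from the Algorithm section; those for `deviation_from_mean` and `std_deviations` are inferred from the example values and should be treated as assumptions.

```python
# Values copied from the example response above.
avg_daily_usage = 3.20
std_daily_usage = 1.50
sigma_level = 3.0
total_cost_usd = 15.50

# threshold = mean + (sigma level x standard deviation)
threshold = avg_daily_usage + sigma_level * std_daily_usage
# Inferred: absolute distance of the day's cost from the mean.
deviation_from_mean = total_cost_usd - avg_daily_usage
# Inferred: the same distance expressed in standard deviations.
std_deviations = deviation_from_mean / std_daily_usage
# spike_ratio = daily cost / mean
spike_ratio = total_cost_usd / avg_daily_usage

print(f"threshold={threshold:.2f}")
print(f"deviation_from_mean={deviation_from_mean:.2f}")
print(f"std_deviations={std_deviations:.2f}")
print(f"spike_ratio={spike_ratio:.2f}")
```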

GET `/billing/anomalies/trends`

Get usage trends for visualization.

**Query Parameters:**

  • days: Number of days to analyze (7-90, default 30)

**Response:**

{
  "daily_usage": [
    {
      "date": "2026-04-01",
      "total_cost_usd": 15.50,
      "total_tokens": 125000,
      "llm_calls": 45
    }
  ],
  "trend_percent": 25.5,
  "days_analyzed": 30
}
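This document does not define how `trend_percent` is computed. One plausible reading, given the week-over-week wording in the feature list, is the percent change between the mean of the most recent seven days and the mean of the seven days before. The sketch below implements that reading only; the function name, the window size, and the formula itself are all assumptions.

```python
from statistics import mean


def trend_percent(daily_costs, window=7):
    """Percent change between the last `window` days and the `window` before.

    Hypothetical formula: the document does not define trend_percent.
    Returns None when there is too little data or the earlier mean is zero.
    """
    if len(daily_costs) < 2 * window:
        return None
    previous = mean(daily_costs[-2 * window:-window])
    recent = mean(daily_costs[-window:])
    if previous == 0:
        return None
    return round((recent - previous) / previous * 100, 1)


# Two weeks of made-up costs: a flat $4.00 week, then a flat $5.00 week.
costs = [4.0] * 7 + [5.0] * 7
print(trend_percent(costs))
```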

3. Frontend API Route

**File:** /Users/rushiparikh/projects/atom-saas/src/app/api/billing/anomalies/route.ts

**Features:**

  • Authenticated endpoint using NextAuth session
  • Parameter validation (days: 7-90, sensitivity: high/medium/low)
  • Forwards requests to backend with tenant context
  • Error handling with appropriate HTTP status codes
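The validation rules listed above are simple enough to sketch. The route itself is TypeScript; Python is used here only to match the backend examples in this document. `parse_anomaly_params` and the error strings are hypothetical and do not reflect the route's real messages.

```python
ALLOWED_SENSITIVITIES = ("high", "medium", "low")


def parse_anomaly_params(query):
    """Validate raw query-string values; return (params, error).

    `query` is a dict of strings, e.g. {"days": "30"}. Exactly one of the
    two returned values is None. Hypothetical helper for illustration only.
    """
    raw_days = query.get("days", "30")
    sensitivity = query.get("sensitivity", "medium")

    try:
        days = int(raw_days)
    except ValueError:
        return None, "days must be an integer"
    if not 7 <= days <= 90:
        return None, "days must be between 7 and 90"
    if sensitivity not in ALLOWED_SENSITIVITIES:
        return None, "sensitivity must be high, medium, or low"
    return {"days": days, "sensitivity": sensitivity}, None


print(parse_anomaly_params({}))
print(parse_anomaly_params({"days": "5"}))
```

A rejected request would typically map to an HTTP 400 response, while a missing session would map to 401; the exact status codes used by the route are not stated here.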

4. Frontend Component

**File:** /Users/rushiparikh/projects/atom-saas/src/components/billing/AnomalyDetector.tsx

**Features:**

  • **SWR Integration:** Auto-refreshes every 5 minutes
  • **Loading States:** Shows loader while analyzing
  • **Error Handling:** Graceful error messages
  • **Insufficient Data:** Friendly message when < 7 days of data
  • **No Anomalies:** Green success message when usage is normal
  • **Anomaly List:** Detailed breakdown of each anomaly with:
      • Date and severity badge
      • Expected vs actual spending
      • Spike ratio and standard deviations
      • LLM call count
      • Token usage
  • **Statistics Summary:** Shows avg daily, threshold, days analyzed
  • **Info Box:** Explains anomaly detection methodology

**Props:**

interface AnomalyDetectorProps {
  days?: number          // Default: 30
  sensitivity?: 'high' | 'medium' | 'low'  // Default: 'medium'
}

5. Billing Page Integration

**File:** /Users/rushiparikh/projects/atom-saas/src/app/settings/billing/page.tsx

**Integration:**

  • Added after Alert Configuration section
  • Only shows when budget utilization >= 50%
  • Uses 30-day analysis window with medium sensitivity
  • Styled consistently with other billing sections

**Location:**

{/* Usage Anomaly Detection Section */}
{budget && budget.utilization_percent >= 50 && (
    <div className="rounded-xl border border-white/10 bg-white/5 overflow-hidden">
        <div className="p-6 border-b border-white/5">
            <h2 className="text-lg font-semibold text-white flex items-center gap-2">
                <Activity className="w-5 h-5 text-primary" />
                Usage Anomaly Detection
            </h2>
            <p className="text-sm text-zinc-500 mt-1">Identify unusual spending patterns using statistical analysis</p>
        </div>
        <div className="p-6">
            <AnomalyDetector days={30} sensitivity="medium" />
        </div>
    </div>
)}

Technical Details

Statistical Methodology

**3-Sigma Rule:**

  • Anomaly threshold = μ + (n × σ)
  • Where μ = mean daily usage, σ = standard deviation, n = sensitivity level

**Sensitivity Levels:**

  • **High (2σ):** Detects more anomalies, may include false positives
  • **Medium (3σ):** Standard statistical threshold (default)
  • **Low (4σ):** Fewer false positives, may miss subtle anomalies
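To make the trade-off between the levels concrete, the snippet below computes the threshold at each sensitivity for one illustrative pair of statistics (μ = 3.20, σ = 1.50). The numbers are examples only, and `SIGMA_LEVELS` is a hypothetical name for the mapping described above.

```python
SIGMA_LEVELS = {"high": 2.0, "medium": 3.0, "low": 4.0}

mean_usage = 3.20  # μ, illustrative
std_usage = 1.50   # σ, illustrative

# threshold = μ + (n × σ) for each sensitivity level
thresholds = {
    name: round(mean_usage + n * std_usage, 2)
    for name, n in SIGMA_LEVELS.items()
}
for name, value in thresholds.items():
    print(f"{name}: flag days above ${value:.2f}")
```

A day costing $7.00 would therefore be flagged at high sensitivity but not at medium or low.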

**Severity Classification:**

  • **High:** Spike ratio ≥ 5x normal usage
  • **Medium:** Spike ratio ≥ 3x normal usage
  • **Low:** Spike ratio ≥ 2x normal usage
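The severity bands above translate directly into a cascade of comparisons, checked from the highest band down. The function name is hypothetical, and the handling of ratios below 2x is an assumption, since this document does not say how such anomalies are labelled.

```python
def classify_severity(spike_ratio):
    """Map a spike ratio (daily cost / mean) to a severity label.

    Thresholds come from the list above. Returning None below 2x is an
    assumption; the document does not say how such anomalies are labelled.
    """
    if spike_ratio >= 5:
        return "high"
    if spike_ratio >= 3:
        return "medium"
    if spike_ratio >= 2:
        return "low"
    return None


for ratio in (6.1, 4.84, 2.0, 1.4):
    print(ratio, classify_severity(ratio))
```

The below-2x case is reachable in practice: when daily usage varies little, a day can exceed μ + nσ while still costing less than twice the mean.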

Data Source

Uses TokenUsage table with fields:

  • tenant_id: Tenant filtering
  • created_at: Daily aggregation
  • cost_usd: Base LLM costs
  • markup_usd: Platform markup
  • total_tokens: Token consumption
  • id: Count for LLM calls

**Query:**

from sqlalchemy import and_, func

daily_breakdown = (
    db.query(
        func.date(TokenUsage.created_at).label("date"),
        func.sum(TokenUsage.cost_usd + TokenUsage.markup_usd).label("total_cost"),
        func.sum(TokenUsage.total_tokens).label("tokens"),
        func.count(TokenUsage.id).label("calls"),
    )
    .filter(
        and_(
            TokenUsage.tenant_id == tenant_id,
            TokenUsage.created_at >= start_date,
            TokenUsage.created_at <= end_date,
        )
    )
    .group_by(func.date(TokenUsage.created_at))
    .order_by(func.date(TokenUsage.created_at))
    .all()
)
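To see what this aggregation returns without a running backend, the same grouping can be reproduced against an in-memory SQLite database using only the standard library. This is a sketch under assumptions: the table name `token_usage`, the column types, and the sample rows are made up, and raw SQL stands in for the SQLAlchemy query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE token_usage (
        id INTEGER PRIMARY KEY,
        tenant_id TEXT,
        created_at TEXT,
        cost_usd REAL,
        markup_usd REAL,
        total_tokens INTEGER
    )
    """
)
rows = [
    ("t1", "2026-04-01 09:00:00", 1.00, 0.20, 1000),
    ("t1", "2026-04-01 17:30:00", 2.00, 0.40, 2500),
    ("t1", "2026-04-02 11:15:00", 0.50, 0.10, 400),
    ("t2", "2026-04-01 12:00:00", 9.00, 1.80, 9000),  # other tenant
]
conn.executemany(
    "INSERT INTO token_usage "
    "(tenant_id, created_at, cost_usd, markup_usd, total_tokens) "
    "VALUES (?, ?, ?, ?, ?)",
    rows,
)

# Same shape as the query above: one row per day for a single tenant.
daily = conn.execute(
    """
    SELECT date(created_at) AS day,
           SUM(cost_usd + markup_usd) AS total_cost,
           SUM(total_tokens) AS tokens,
           COUNT(id) AS calls
    FROM token_usage
    WHERE tenant_id = ? AND created_at >= ? AND created_at <= ?
    GROUP BY date(created_at)
    ORDER BY date(created_at)
    """,
    ("t1", "2026-04-01 00:00:00", "2026-04-30 23:59:59"),
).fetchall()

for day, total_cost, tokens, calls in daily:
    print(day, round(total_cost, 2), tokens, calls)
```

Days with no usage rows produce no group at all, which is consistent with `days_with_data` being smaller than `total_days_analyzed` in the example response.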

Usage Examples

For Users

  1. **Navigate to Settings → Billing**
  2. **Scroll to "Usage Anomaly Detection" section** (shows at 50%+ budget utilization)
  3. **Review anomalies:**
      • Green box: No anomalies detected
      • Red/Amber boxes: Anomalies with severity ratings
  4. **Take action:**
      • Review agent activity on anomalous dates
      • Check for runaway processes
      • Adjust budgets if needed
      • Contact support if anomalies persist

For Developers

**Backend Testing:**

import asyncio

from core.anomaly_detection_service import AnomalyDetectionService
from core.database import SessionLocal


async def main():
    db = SessionLocal()
    try:
        service = AnomalyDetectionService(tenant_id='tenant-id', db=db)

        # Detect anomalies
        result = await service.detect_anomalies(days=30, sensitivity='medium')
        print(f"Found {len(result['anomalies'])} anomalies")

        # Get trends
        trends = await service.get_usage_trends(days=30)
        print(f"Trend: {trends['trend_percent']}%")

        # Check and alert
        alert_result = await service.check_and_alert_anomalies(days=7)
    finally:
        db.close()  # release the session even if a call fails


# Top-level await is not valid in a plain script, so run via asyncio.
asyncio.run(main())

**Frontend Usage:**

import { AnomalyDetector } from '@/components/billing/AnomalyDetector'

<AnomalyDetector days={30} sensitivity="medium" />

Future Enhancements

Potential Improvements

  1. **Threshold Customization**
      • Allow users to set custom sigma levels (1.5σ - 5σ)
      • Per-integration thresholds
      • Time-of-day patterns
  2. **Advanced Analytics**
      • Machine learning-based anomaly detection
      • Seasonal pattern recognition
      • Predictive anomaly forecasting
      • Integration-specific baselines
  3. **Alert System**
      • Email notifications for high-severity anomalies
      • Slack/webhook integrations
      • Scheduled anomaly reports
      • Anomaly suppression rules
  4. **Visualization**
      • Interactive trend charts
      • Anomaly overlay on usage graphs
      • Comparison views (month-over-month)
      • Export anomaly reports as CSV/PDF
  5. **Actionable Insights**
      • Root cause analysis (which agent/integration)
      • Auto-suggested budget adjustments
      • Cost optimization recommendations
      • Anomaly pattern recognition

Testing

Manual Testing Checklist

  • [ ] Service initializes without errors
  • [ ] Detects anomalies with sufficient data (7+ days)
  • [ ] Returns "insufficient data" message with < 7 days
  • [ ] Correct sensitivity levels (high/medium/low)
  • [ ] Accurate statistical calculations (mean, std, threshold)
  • [ ] Severity classification works correctly
  • [ ] API endpoint returns proper responses
  • [ ] Frontend component renders correctly
  • [ ] Loading states display properly
  • [ ] Error handling works gracefully
  • [ ] SWR auto-refreshes every 5 minutes
  • [ ] Widget only shows at 50%+ budget utilization

Automated Testing

Create test file: backend-saas/tests/test_anomaly_detection_service.py

import pytest
from datetime import datetime, timedelta
from core.anomaly_detection_service import AnomalyDetectionService
from core.database import SessionLocal
from core.models import TokenUsage

# The service methods are coroutines, so the tests must be async.
# Assumes the pytest-asyncio plugin is available in the test environment.

@pytest.mark.asyncio
async def test_anomaly_detection_with_sufficient_data(db_session):
    """Test anomaly detection with 30 days of data"""
    # Create test data with known anomaly
    # ...
    service = AnomalyDetectionService(tenant_id='test', db=db_session)
    result = await service.detect_anomalies(days=30)
    assert len(result['anomalies']) > 0
    assert result['avg_daily_usage'] > 0

@pytest.mark.asyncio
async def test_insufficient_data(db_session):
    """Test with less than 7 days of data"""
    service = AnomalyDetectionService(tenant_id='test', db=db_session)
    result = await service.detect_anomalies(days=5)
    assert result['error'] == 'insufficient_data'
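The first test needs a usage history containing a known anomaly. One way to build it is a small deterministic generator whose output is then written to the database by the fixture. The helper below is hypothetical, as are all of its default values; how its output is turned into TokenUsage rows depends on the model and the `db_session` fixture, which are not shown in this document.

```python
from datetime import date, timedelta


def make_daily_costs(days=30, base=3.0, spike_day=20, spike_cost=15.5):
    """Return `days` (date, cost) pairs with one injected spike.

    Hypothetical fixture helper; all defaults are made-up values.
    """
    start = date(2026, 3, 1)
    series = []
    for offset in range(days):
        # Small deterministic wobble so the standard deviation is non-zero.
        cost = base + 0.1 * (offset % 3)
        if offset == spike_day:
            cost = spike_cost
        series.append((start + timedelta(days=offset), cost))
    return series


series = make_daily_costs()
spike_date, spike_cost = max(series, key=lambda pair: pair[1])
print(spike_date.isoformat(), spike_cost)
```

Because the data is deterministic, a test can assert on the exact date of the detected anomaly rather than only checking that at least one was found.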

Files Modified/Created

Created

  1. /Users/rushiparikh/projects/atom-saas/backend-saas/core/anomaly_detection_service.py - Main service
  2. /Users/rushiparikh/projects/atom-saas/src/app/api/billing/anomalies/route.ts - Frontend API
  3. /Users/rushiparikh/projects/atom-saas/src/components/billing/AnomalyDetector.tsx - React component

Modified

  1. /Users/rushiparikh/projects/atom-saas/backend-saas/api/routes/billing_routes.py - Added endpoints
  2. /Users/rushiparikh/projects/atom-saas/src/app/settings/billing/page.tsx - Added widget

Dependencies

Python

  • No new dependencies required (uses existing SQLAlchemy, datetime, statistics)

TypeScript/React

  • No new dependencies required (uses existing SWR, lucide-react, Radix UI)

Deployment Notes

  1. **Backend:** No migration needed (uses existing TokenUsage table)
  2. **Frontend:** Component will auto-deploy with Next.js
  3. **Environment:** No new environment variables required
  4. **Testing:** Test in staging with real usage data before production

Performance Considerations

  • **Query Optimization:** Daily aggregation reduces query complexity
  • **Caching:** SWR serves cached responses and refreshes them every 5 minutes
  • **Database Load:** Queries are indexed on tenant_id and created_at
  • **Scalability:** The statistical pass is O(n), where n = days analyzed

Security

  • **Tenant Isolation:** All queries filtered by tenant_id
  • **Authentication:** Frontend route requires valid session
  • **Authorization:** Backend validates user context
  • **Rate Limiting:** Not yet implemented; consider adding rate limits for the anomaly API endpoints

Support

For issues or questions:

  1. Check browser console for frontend errors
  2. Check backend logs for service errors
  3. Verify TokenUsage data exists for tenant
  4. Ensure sufficient historical data (7+ days)
  5. Review sensitivity level settings

---

**Implementation Date:** 2026-04-05

**Status:** Complete ✅

**Version:** 1.0.0