Anomaly Detection System - Implementation Summary
Overview
Implemented an anomaly detection system that identifies unusual spending patterns using statistical analysis (the 3-sigma rule). The system helps users detect unexpected cost spikes, billing errors, resource abuse, and integration issues.
Implementation Details
1. Backend Service
**File:** /Users/rushiparikh/projects/atom-saas/backend-saas/core/anomaly_detection_service.py
**Key Features:**
- **Statistical Analysis:** Uses the 3-sigma rule (mean plus a multiple of the standard deviation) for anomaly detection
- **Configurable Sensitivity:** Three sensitivity levels (high=2σ, medium=3σ, low=4σ)
- **Severity Classification:** Categorizes anomalies by spike ratio as high (≥ 5x), medium (≥ 3x), or low (≥ 2x)
- **Daily Aggregation:** Analyzes TokenUsage data aggregated by day
- **Trend Analysis:** Calculates usage trends and week-over-week changes
- **Alert Integration:** Sends notifications to tenant admins when anomalies are detected
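The week-over-week change listed under Trend Analysis is not defined in this summary. One plausible reading, sketched here as an assumption rather than the service's actual formula, compares the mean daily cost of the latest 7 days with the 7 days before it. The function name `week_over_week_change` is hypothetical.

```python
from statistics import mean


def week_over_week_change(daily_costs):
    """Percent change between the last 7 days and the 7 days before them.

    daily_costs: list of daily totals in USD, oldest first.
    Returns None when there is not enough data or the baseline is zero.
    """
    if len(daily_costs) < 14:
        return None
    previous_week = mean(daily_costs[-14:-7])
    latest_week = mean(daily_costs[-7:])
    if previous_week == 0:
        return None  # a percentage change from zero is undefined
    return round((latest_week - previous_week) / previous_week * 100, 1)


# Example: spending rises from 2.00 to 2.50 USD per day
print(week_over_week_change([2.0] * 7 + [2.5] * 7))
```

Returning `None` instead of raising keeps the caller free to show a "not enough data" state, which is a design choice of this sketch only.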
**Core Methods:**
async def detect_anomalies(days=30, sensitivity="medium")
# Returns anomalies with statistics: avg, std, threshold, spike_ratio
async def get_usage_trends(days=30)
# Returns daily usage data for visualization
async def check_and_alert_anomalies(days=7, sensitivity="medium")
# Detects anomalies and sends alerts automatically
**Algorithm:**
- Fetch daily usage from TokenUsage table (cost_usd + markup_usd)
- Calculate mean and standard deviation
- Define threshold = mean + (sensitivity_level × std_dev)
- Flag days exceeding threshold as anomalies
- Calculate spike_ratio = daily_cost / mean
- Classify severity based on spike_ratio
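The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the service's code: it takes a plain dict of daily totals instead of querying TokenUsage, it assumes the sample standard deviation (`statistics.stdev`), and it assumes a flagged day below 2x still falls back to "low". The names `SIGMA_LEVELS`, `MIN_DAYS`, and `classify_severity` are hypothetical.

```python
from statistics import mean, stdev

SIGMA_LEVELS = {"high": 2.0, "medium": 3.0, "low": 4.0}
MIN_DAYS = 7  # assumed minimum, matching the "insufficient data" rule


def classify_severity(spike_ratio):
    """Map a spike ratio onto the severity bands described in this document."""
    if spike_ratio >= 5:
        return "high"
    if spike_ratio >= 3:
        return "medium"
    return "low"  # assumption: flagged days below 2x are also reported as low


def detect_anomalies(daily_costs, sensitivity="medium"):
    """daily_costs: dict mapping ISO date string -> total cost in USD."""
    if len(daily_costs) < MIN_DAYS:
        return {"error": "insufficient_data", "anomalies": []}

    values = list(daily_costs.values())
    avg = mean(values)
    std = stdev(values)  # sample standard deviation (an assumption)
    threshold = avg + SIGMA_LEVELS[sensitivity] * std

    anomalies = []
    for date, cost in sorted(daily_costs.items()):
        if cost > threshold:
            ratio = cost / avg if avg else float("inf")
            anomalies.append({
                "date": date,
                "total_cost_usd": cost,
                "spike_ratio": round(ratio, 2),
                "severity": classify_severity(ratio),
            })
    return {
        "anomalies": anomalies,
        "avg_daily_usage": avg,
        "std_daily_usage": std,
        "threshold": threshold,
    }


# 29 ordinary days at 3.00 USD and one spike at 30.00 USD
costs = {f"2026-04-{day:02d}": 3.0 for day in range(1, 30)}
costs["2026-04-30"] = 30.0
print(detect_anomalies(costs)["anomalies"])
```

One property worth keeping in mind: among n data points, no single value can lie more than (n - 1)/√n sample standard deviations above the mean, which is about 2.27 for n = 7. Under this sketch's assumptions, a 7-day window can therefore never produce a 3σ anomaly; at least 11 days of data are needed.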
2. Backend API Endpoints
**File:** /Users/rushiparikh/projects/atom-saas/backend-saas/api/routes/billing_routes.py
**Endpoints Added:**
GET `/billing/anomalies`
Get usage anomalies for the last N days.
**Query Parameters:**
- days: Number of days to analyze (7-90, default 30)
- sensitivity: Detection sensitivity ("high", "medium", "low"; default "medium")
**Response:**
{
  "anomalies": [
    {
      "date": "2026-04-01",
      "total_cost_usd": 15.50,
      "total_tokens": 125000,
      "llm_calls": 45,
      "expected_range": "3.20 ± 1.50",
      "spike_ratio": 4.84,
      "severity": "medium",
      "deviation_from_mean": 12.30,
      "std_deviations": 8.20
    }
  ],
  "avg_daily_usage": 3.20,
  "std_daily_usage": 1.50,
  "threshold": 7.70,
  "total_days_analyzed": 30,
  "days_with_data": 28,
  "sensitivity": "medium",
  "sigma_level": 3.0
}
GET `/billing/anomalies/trends`
Get usage trends for visualization.
**Query Parameters:**
- days: Number of days to analyze (7-90, default 30)
**Response:**
{
  "daily_usage": [
    {
      "date": "2026-04-01",
      "total_cost_usd": 15.50,
      "total_tokens": 125000,
      "llm_calls": 45
    }
  ],
  "trend_percent": 25.5,
  "days_analyzed": 30
}
3. Frontend API Route
**File:** /Users/rushiparikh/projects/atom-saas/src/app/api/billing/anomalies/route.ts
**Features:**
- Authenticated endpoint using NextAuth session
- Parameter validation (days: 7-90, sensitivity: high/medium/low)
- Forwards requests to backend with tenant context
- Error handling with appropriate HTTP status codes
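The parameter rules listed above (days between 7 and 90, sensitivity one of three values) do not depend on the language of the route. A minimal sketch in Python, chosen for consistency with the backend examples; the actual route is TypeScript, and `parse_anomaly_params` and its error strings are hypothetical.

```python
ALLOWED_SENSITIVITIES = ("high", "medium", "low")


def parse_anomaly_params(query):
    """Validate raw query-string values.

    query: dict of strings as received from the request.
    Returns (params, None) on success or (None, error_message) on failure.
    """
    try:
        days = int(query.get("days", "30"))
    except (TypeError, ValueError):
        return None, "days must be an integer"
    if not 7 <= days <= 90:
        return None, "days must be between 7 and 90"

    sensitivity = query.get("sensitivity", "medium")
    if sensitivity not in ALLOWED_SENSITIVITIES:
        return None, "sensitivity must be one of: high, medium, low"

    return {"days": days, "sensitivity": sensitivity}, None


print(parse_anomaly_params({"days": "14", "sensitivity": "high"}))
print(parse_anomaly_params({"days": "3"}))
```

A caller would typically map a non-empty error message to an HTTP 400 response; that mapping is an assumption and is not shown in this summary.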
4. Frontend Component
**File:** /Users/rushiparikh/projects/atom-saas/src/components/billing/AnomalyDetector.tsx
**Features:**
- **SWR Integration:** Auto-refreshes every 5 minutes
- **Loading States:** Shows loader while analyzing
- **Error Handling:** Graceful error messages
- **Insufficient Data:** Friendly message when < 7 days of data
- **No Anomalies:** Green success message when usage is normal
- **Anomaly List:** Detailed breakdown of each anomaly with:
  - Date and severity badge
  - Expected vs actual spending
  - Spike ratio and standard deviations
  - LLM call count
  - Token usage
- **Statistics Summary:** Shows avg daily, threshold, days analyzed
- **Info Box:** Explains anomaly detection methodology
**Props:**
interface AnomalyDetectorProps {
  days?: number // Default: 30
  sensitivity?: 'high' | 'medium' | 'low' // Default: 'medium'
}
5. Billing Page Integration
**File:** /Users/rushiparikh/projects/atom-saas/src/app/settings/billing/page.tsx
**Integration:**
- Added after Alert Configuration section
- Only shows when budget utilization >= 50%
- Uses 30-day analysis window with medium sensitivity
- Styled consistently with other billing sections
**Location:**
{/* Usage Anomaly Detection Section */}
{budget && budget.utilization_percent >= 50 && (
  <div className="rounded-xl border border-white/10 bg-white/5 overflow-hidden">
    <div className="p-6 border-b border-white/5">
      <h2 className="text-lg font-semibold text-white flex items-center gap-2">
        <Activity className="w-5 h-5 text-primary" />
        Usage Anomaly Detection
      </h2>
      <p className="text-sm text-zinc-500 mt-1">Identify unusual spending patterns using statistical analysis</p>
    </div>
    <div className="p-6">
      <AnomalyDetector days={30} sensitivity="medium" />
    </div>
  </div>
)}
Technical Details
Statistical Methodology
**3-Sigma Rule:**
- Anomaly threshold = μ + (n × σ)
- Where μ = mean daily usage, σ = standard deviation, n = sensitivity level
**Sensitivity Levels:**
- **High (2σ):** Detects more anomalies, may include false positives
- **Medium (3σ):** Standard statistical threshold (default)
- **Low (4σ):** Fewer false positives, may miss subtle anomalies
**Severity Classification:**
- **High:** Spike ratio ≥ 5x normal usage
- **Medium:** Spike ratio ≥ 3x normal usage
- **Low:** Spike ratio ≥ 2x normal usage
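As a worked check of the formulas above, the per-day figures can be recomputed from a mean of 3.20, a standard deviation of 1.50, a 3σ threshold, and a daily cost of 15.50 (the values used in the sample API response). The helper `explain_day` is hypothetical and only mirrors the field names of that response.

```python
def explain_day(cost, avg, std, sigma):
    """Recompute the per-day statistics reported for an anomaly."""
    threshold = avg + sigma * std
    return {
        "threshold": round(threshold, 2),
        "is_anomaly": cost > threshold,
        "deviation_from_mean": round(cost - avg, 2),
        "std_deviations": round((cost - avg) / std, 2),
        "spike_ratio": round(cost / avg, 2),
    }


stats = explain_day(cost=15.50, avg=3.20, std=1.50, sigma=3.0)
for name, value in stats.items():
    print(f"{name}: {value}")
```

This reproduces a threshold of 7.70, a deviation of 12.30 (8.2 standard deviations), and a spike ratio of 4.84. Under the bands listed above, a ratio of 4.84 falls in the medium range (at least 3x but below 5x).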
Data Source
Uses TokenUsage table with fields:
- tenant_id: Tenant filtering
- created_at: Daily aggregation
- cost_usd: Base LLM costs
- markup_usd: Platform markup
- total_tokens: Token consumption
- id: Count for LLM calls
**Query:**
from sqlalchemy import and_, func

# Aggregate cost, tokens, and call count per calendar day for one tenant
daily_breakdown = (
    db.query(
        func.date(TokenUsage.created_at).label("date"),
        func.sum(TokenUsage.cost_usd + TokenUsage.markup_usd).label("total_cost"),
        func.sum(TokenUsage.total_tokens).label("tokens"),
        func.count(TokenUsage.id).label("calls"),
    )
    .filter(
        and_(
            TokenUsage.tenant_id == tenant_id,
            TokenUsage.created_at >= start_date,
            TokenUsage.created_at <= end_date,
        )
    )
    .group_by(func.date(TokenUsage.created_at))
    .order_by(func.date(TokenUsage.created_at))
    .all()
)
Usage Examples
For Users
- **Navigate to Settings → Billing**
- **Scroll to "Usage Anomaly Detection" section** (shows at 50%+ budget utilization)
- **Review anomalies:**
  - Green box: No anomalies detected
  - Red/Amber boxes: Anomalies with severity ratings
- **Take action:**
  - Review agent activity on anomalous dates
  - Check for runaway processes
  - Adjust budgets if needed
  - Contact support if anomalies persist
For Developers
**Backend Testing:**
import asyncio

from core.anomaly_detection_service import AnomalyDetectionService
from core.database import SessionLocal


async def main():
    db = SessionLocal()
    try:
        service = AnomalyDetectionService(tenant_id='tenant-id', db=db)
        # Detect anomalies
        result = await service.detect_anomalies(days=30, sensitivity='medium')
        print(f"Found {len(result['anomalies'])} anomalies")
        # Get trends
        trends = await service.get_usage_trends(days=30)
        print(f"Trend: {trends['trend_percent']}%")
        # Check and alert
        alert_result = await service.check_and_alert_anomalies(days=7)
    finally:
        db.close()  # release the session even if a call fails


# The service methods are coroutines, so they must run inside an event loop
asyncio.run(main())
**Frontend Usage:**
import { AnomalyDetector } from '@/components/billing/AnomalyDetector'
<AnomalyDetector days={30} sensitivity="medium" />
Future Enhancements
Potential Improvements
- **Threshold Customization**
  - Allow users to set custom sigma levels (1.5σ - 5σ)
  - Per-integration thresholds
  - Time-of-day patterns
- **Advanced Analytics**
  - Machine learning-based anomaly detection
  - Seasonal pattern recognition
  - Predictive anomaly forecasting
  - Integration-specific baselines
- **Alert System**
  - Email notifications for high-severity anomalies
  - Slack/webhook integrations
  - Scheduled anomaly reports
  - Anomaly suppression rules
- **Visualization**
  - Interactive trend charts
  - Anomaly overlay on usage graphs
  - Comparison views (month-over-month)
  - Export anomaly reports as CSV/PDF
- **Actionable Insights**
  - Root cause analysis (which agent/integration)
  - Auto-suggested budget adjustments
  - Cost optimization recommendations
  - Anomaly pattern recognition
Testing
Manual Testing Checklist
- [ ] Service initializes without errors
- [ ] Detects anomalies with sufficient data (7+ days)
- [ ] Returns "insufficient data" message with < 7 days
- [ ] Correct sensitivity levels (high/medium/low)
- [ ] Accurate statistical calculations (mean, std, threshold)
- [ ] Severity classification works correctly
- [ ] API endpoint returns proper responses
- [ ] Frontend component renders correctly
- [ ] Loading states display properly
- [ ] Error handling works gracefully
- [ ] SWR auto-refreshes every 5 minutes
- [ ] Widget only shows at 50%+ budget utilization
Automated Testing
Create test file: backend-saas/tests/test_anomaly_detection_service.py
import pytest
from datetime import datetime, timedelta
from core.anomaly_detection_service import AnomalyDetectionService
from core.database import SessionLocal
from core.models import TokenUsage


# The service methods are coroutines, so the tests must be async as well.
# The asyncio marker assumes the pytest-asyncio plugin is installed.
@pytest.mark.asyncio
async def test_anomaly_detection_with_sufficient_data(db_session):
    """Test anomaly detection with 30 days of data"""
    # Create test data with known anomaly
    # ...
    service = AnomalyDetectionService(tenant_id='test', db=db_session)
    result = await service.detect_anomalies(days=30)
    assert len(result['anomalies']) > 0
    assert result['avg_daily_usage'] > 0


@pytest.mark.asyncio
async def test_insufficient_data(db_session):
    """Test with less than 7 days of data"""
    service = AnomalyDetectionService(tenant_id='test', db=db_session)
    result = await service.detect_anomalies(days=5)
    assert result['error'] == 'insufficient_data'
Files Modified/Created
Created
- /Users/rushiparikh/projects/atom-saas/backend-saas/core/anomaly_detection_service.py - Main service
- /Users/rushiparikh/projects/atom-saas/src/app/api/billing/anomalies/route.ts - Frontend API
- /Users/rushiparikh/projects/atom-saas/src/components/billing/AnomalyDetector.tsx - React component
Modified
- /Users/rushiparikh/projects/atom-saas/backend-saas/api/routes/billing_routes.py - Added endpoints
- /Users/rushiparikh/projects/atom-saas/src/app/settings/billing/page.tsx - Added widget
Dependencies
Python
- No new dependencies required (uses existing SQLAlchemy, datetime, statistics)
TypeScript/React
- No new dependencies required (uses existing SWR, lucide-react, Radix UI)
Deployment Notes
- **Backend:** No migration needed (uses existing TokenUsage table)
- **Frontend:** Component ships with the regular Next.js build; no separate deployment step
- **Environment:** No new environment variables required
- **Testing:** Test in staging with real usage data before production
Performance Considerations
- **Query Optimization:** Daily aggregation reduces query complexity
- **Caching:** SWR serves cached responses and revalidates every 5 minutes
- **Database Load:** Queries filter on tenant_id and created_at, both indexed columns
- **Scalability:** O(n) complexity where n = days analyzed
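The daily aggregation and the tenant_id/created_at filter mentioned above can be tried out with the standard-library `sqlite3` module. This is an illustration only: SQLite stands in for the production database, and the table name `token_usage`, the index name, and the sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE token_usage (
        id INTEGER PRIMARY KEY,
        tenant_id TEXT NOT NULL,
        created_at TEXT NOT NULL,
        cost_usd REAL NOT NULL,
        markup_usd REAL NOT NULL,
        total_tokens INTEGER NOT NULL
    )
    """
)
# Composite index covering the tenant filter and the date range
conn.execute(
    "CREATE INDEX ix_token_usage_tenant_created "
    "ON token_usage (tenant_id, created_at)"
)
conn.executemany(
    "INSERT INTO token_usage "
    "(tenant_id, created_at, cost_usd, markup_usd, total_tokens) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("t1", "2026-04-01 09:00:00", 1.00, 0.25, 1000),
        ("t1", "2026-04-01 17:30:00", 2.00, 0.50, 2000),
        ("t1", "2026-04-02 08:15:00", 0.40, 0.10, 500),
        ("t2", "2026-04-01 12:00:00", 9.00, 1.00, 9000),  # other tenant
    ],
)

# One row per day: total cost, total tokens, number of LLM calls
daily = conn.execute(
    """
    SELECT date(created_at) AS day,
           SUM(cost_usd + markup_usd) AS total_cost,
           SUM(total_tokens) AS tokens,
           COUNT(id) AS calls
    FROM token_usage
    WHERE tenant_id = ? AND created_at >= ? AND created_at <= ?
    GROUP BY date(created_at)
    ORDER BY date(created_at)
    """,
    ("t1", "2026-04-01 00:00:00", "2026-04-30 23:59:59"),
).fetchall()

for row in daily:
    print(row)
```

Because the grouping happens in the database, the statistics step only ever sees one row per day, regardless of how many individual usage records exist.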
Security
- **Tenant Isolation:** All queries filtered by tenant_id
- **Authentication:** Frontend route requires valid session
- **Authorization:** Backend validates user context
- **Rate Limiting:** Consider adding rate limits for API endpoint
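If rate limiting is added, one simple option is a per-tenant sliding window. The sketch below is an in-memory illustration under that assumption; `SlidingWindowLimiter` is a hypothetical helper, not part of the existing codebase, and a multi-process deployment would need a shared store instead of a local dict.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per key within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)

    def allow(self, key):
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have fallen out of the window
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Example: 2 requests per 60 seconds for each tenant
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
print([limiter.allow("tenant-a") for _ in range(3)])
```

Keying the limiter by tenant rather than by user matches the tenant isolation described above, though either choice is a judgment call.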
Support
For issues or questions:
- Check browser console for frontend errors
- Check backend logs for service errors
- Verify TokenUsage data exists for tenant
- Ensure sufficient historical data (7+ days)
- Review sensitivity level settings
---
**Implementation Date:** 2026-04-05
**Status:** Complete ✅
**Version:** 1.0.0