ATOM Documentation

Plan: Automated Entity Linking Agent + Attachment Ingestion

**Created:** May 7, 2026

**Context:** The Raw Discoveries tab shows extracted entities, but linking them is manual (through a prompt() dialog). Attachments on Outlook emails are not ingested; only the body text is extracted.

---

Plan 1: LLM Judge Agent for Automated Entity Linking

Overview

Create a tenant-scoped agent (entity_linking_agent) that evaluates DiscoveredEntity rows and decides: link, reject, or flag for human review. Uses the existing agent framework (AgentRegistry, governance, graduation, feedback loops).

Architecture

Trigger (after sync chunk / scheduled)
  ↓
EntityLinkingAgent.execute()
  ↓
For each pending DiscoveredEntity:
  1. Fetch entity + source record + existing graph context
  2. LLM judge evaluates: is this a real entity? does it match an existing node?
  3. Decision: link / reject / flag
  4. Record decision with confidence + reasoning
  ↓
AgentLearning records feedback for continuous improvement
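The flow above can be sketched as a small batch loop. This is a minimal illustration, not the agent framework's actual API: PendingEntity, the status strings, and the judge callable are hypothetical stand-ins for DiscoveredEntity and EntityLinkingJudge, and the only detail taken from this plan is that the judgment is stored under extraction_metadata["linking_judgment"].

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class PendingEntity:
    """Hypothetical stand-in for a DiscoveredEntity row."""
    id: str
    tenant_id: str
    name: str
    status: str = "pending"
    extraction_metadata: dict = field(default_factory=dict)


# A judge takes an entity and returns a dict with at least a "decision" key.
Judge = Callable[[PendingEntity], dict]

# Assumed mapping from the judge's decision to the row's new status.
STATUS_BY_DECISION = {"link": "linked", "reject": "rejected", "flag": "needs_review"}


def execute_batch(entities: Iterable[PendingEntity], tenant_id: str, judge: Judge) -> dict:
    """Judge every pending entity of one tenant and record the outcome."""
    counts = {"link": 0, "reject": 0, "flag": 0}
    for entity in entities:
        # Tenant scoping: skip other tenants' rows and anything already decided.
        if entity.tenant_id != tenant_id or entity.status != "pending":
            continue
        judgment = dict(judge(entity))
        decision = judgment.get("decision")
        if decision not in STATUS_BY_DECISION:
            # Unknown decisions go to human review rather than being guessed.
            decision = "flag"
            judgment["decision"] = decision
        entity.extraction_metadata["linking_judgment"] = judgment
        entity.status = STATUS_BY_DECISION[decision]
        counts[decision] += 1
    return counts
```

Passing the judge in as a callable keeps the loop testable without an LLM call, which suits the TDD approach used in this plan.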

Files to Create/Modify

New Files

**backend-saas/core/agents/entity_linking_agent.py**

  • EntityLinkingAgent class extending agent framework
  • evaluate_entity(entity) → { decision, confidence, reasoning, target_node_id }
  • execute_batch(tenant_id, workspace_id) → processes all pending entities
  • Uses LLM via LLMService with task_type="linking"
  • Records decisions in DiscoveredEntity.extraction_metadata.linking_judgment

**backend-saas/core/agents/entity_linking_judge.py**

  • EntityLinkingJudge — the LLM evaluation logic
  • Prompt template with:
    • Entity name, type, properties
    • Source record preview (email body, etc.)
    • Existing graph nodes of similar name/type
    • Instructions: decide whether the entity is real, whether it matches an existing node, and how confident the judgment is
  • Returns structured JSON: { decision, target_node_id, confidence, reasoning }
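LLM output is not guaranteed to be valid JSON, so the judge's reply likely needs validation before it is trusted. A minimal sketch, assuming the reply arrives as a raw string with the four keys listed above; LinkingJudgment and parse_judgment are hypothetical names, and falling back to "flag" on malformed output is an assumption rather than a decided behavior.

```python
import json
from dataclasses import dataclass
from typing import Optional

VALID_DECISIONS = {"link", "reject", "flag"}


@dataclass(frozen=True)
class LinkingJudgment:
    decision: str
    target_node_id: Optional[str]
    confidence: float
    reasoning: str


def parse_judgment(raw: str) -> LinkingJudgment:
    """Parse the judge's JSON reply; anything malformed becomes a 'flag'."""
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            raise ValueError("expected a JSON object")
        decision = data["decision"]
        confidence = float(data["confidence"])
        if decision not in VALID_DECISIONS:
            raise ValueError(f"unknown decision: {decision!r}")
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
    except (KeyError, TypeError, ValueError) as exc:
        # json.JSONDecodeError is a ValueError subclass, so it lands here too.
        return LinkingJudgment("flag", None, 0.0, f"unparseable judge output: {exc}")
    target = data.get("target_node_id")
    return LinkingJudgment(
        decision=decision,
        target_node_id=str(target) if target is not None else None,
        confidence=confidence,
        reasoning=str(data.get("reasoning", "")),
    )
```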

**backend-saas/core/agents/__init__.py**

  • Package init for agents directory

**backend-saas/tests/test_entity_linking_agent_tdd.py**

  • TDD tests covering:
    • Agent correctly identifies duplicate entities
    • Agent links to existing graph node when name matches
    • Agent creates new graph node for novel entities
    • Agent rejects low-confidence / spam entities
    • Agent respects tenant isolation
    • Undo functionality: agent can revert a bad link

Modified Files

**backend-saas/core/models.py**

  • Add entity_linking_agent as default agent in AgentRegistry (like other system agents)
  • Add linking_judgment JSON field to DiscoveredEntity (or use extraction_metadata)

**backend-saas/core/ingestion_pipeline.py**

  • After _run_schema_discovery(), call entity_linking_agent.execute_batch()
  • Only if auto-linking is enabled (feature flag / tenant setting)

**backend-saas/core/historical_sync_service.py**

  • Same: after schema discovery in _process_sync_job, trigger linking agent

**backend-saas/api/routes/multi_entity_extraction_routes.py**

  • Add endpoint: POST /api/v1/entities/discovered/auto-link
  • Add endpoint: POST /api/v1/entities/discovered/{entity_id}/undo-link
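The undo endpoint's core state change might look like the sketch below. It assumes an entity is represented as a plain dict with status, linked_node_id, and extraction_metadata keys; linked_node_id and linking_history are hypothetical names. The only behavior taken from this plan is that undo restores the pending status.

```python
def undo_link(entity: dict) -> dict:
    """Return a copy of the entity reverted to 'pending', keeping an audit trail."""
    if entity.get("status") != "linked":
        raise ValueError("only linked entities can be undone")
    meta = dict(entity.get("extraction_metadata") or {})
    judgment = meta.pop("linking_judgment", None)
    if judgment is not None:
        # Keep the reverted judgment so feedback loops can learn from it.
        history = list(meta.get("linking_history", []))
        history.append({**judgment, "undone": True})
        meta["linking_history"] = history
    return {
        **entity,
        "status": "pending",
        "linked_node_id": None,
        "extraction_metadata": meta,
    }
```

Returning a new dict instead of mutating the input keeps the function easy to test; a real implementation would also need to remove or detach the graph edge.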

**src/components/knowledge-graph/DiscoveredEntitiesPanel.tsx**

  • Add "Auto-Link All" button (triggers agent batch)
  • Add "Undo" button on linked entities
  • Show linking judgment (reasoning) in expanded details

Agent Training & Learning

The agent uses the existing agent framework:

  • **AgentRegistry**: registered as entity_linking_agent, category="System", maturity="supervised"
  • **AgentLearning**: tracks feedback counts, adjusts LLM temperature over time
  • **AgentFeedback**: users can 👍/👎 linking decisions, feeding RLHF
  • **Graduation**: starts at "supervised" (flags uncertain cases), can graduate to "autonomous" after N successful links with no undos
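One way to read the graduation rule is as a streak: every successful link extends it and every undo resets it. The sketch below assumes that reading; GraduationTracker and required_links are hypothetical, and the actual threshold is left to the caller because this plan does not fix a number.

```python
from dataclasses import dataclass


@dataclass
class GraduationTracker:
    """Tracks consecutive successful links for one agent (hypothetical helper)."""
    required_links: int           # threshold is not specified in this plan
    maturity: str = "supervised"
    streak: int = 0

    def record_link(self) -> None:
        """Count a successful link and graduate once the streak is long enough."""
        self.streak += 1
        if self.maturity == "supervised" and self.streak >= self.required_links:
            self.maturity = "autonomous"

    def record_undo(self) -> None:
        """An undo breaks the streak. Demotion is out of scope for this sketch."""
        self.streak = 0
```

Whether an undo should also demote an already autonomous agent is an open question the existing graduation framework may already answer.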

Feature Flags

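| Flag | Default | Description |
| --- | --- | --- |
| AUTO_LINK_ENABLED | false | Global kill switch |
| AUTO_LINK_MIN_CONFIDENCE | 0.7 | Minimum confidence to auto-link |
| AUTO_LINK_FLAG_THRESHOLD | 0.5 | Between this and AUTO_LINK_MIN_CONFIDENCE, flag for human review; below it, reject |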
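The two thresholds split confidence into three bands. The sketch below uses the reading that matches the test cases in this plan (0.9 links, 0.6 is flagged, 0.2 is rejected). route_by_confidence is a hypothetical helper, the inclusive boundaries are an assumption, and the AUTO_LINK_ENABLED kill switch is assumed to be checked by the caller before any routing happens.

```python
# Defaults copied from the feature flag table.
AUTO_LINK_MIN_CONFIDENCE = 0.7
AUTO_LINK_FLAG_THRESHOLD = 0.5


def route_by_confidence(
    confidence: float,
    min_confidence: float = AUTO_LINK_MIN_CONFIDENCE,
    flag_threshold: float = AUTO_LINK_FLAG_THRESHOLD,
) -> str:
    """Map a judge confidence to 'link', 'flag', or 'reject'."""
    if flag_threshold > min_confidence:
        raise ValueError("flag_threshold must not exceed min_confidence")
    if confidence >= min_confidence:
        return "link"      # confident enough to auto-link
    if confidence >= flag_threshold:
        return "flag"      # uncertain: send to human review
    return "reject"        # too weak to keep
```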

TDD Verification

TestAutoLinking:
  test_agent_links_high_confidence_entity           # confidence 0.9 → linked
  test_agent_flags_medium_confidence_entity         # confidence 0.6 → flagged
  test_agent_rejects_low_confidence_entity          # confidence 0.2 → rejected
  test_agent_matches_existing_graph_node            # "Acme Corp" matches existing node
  test_agent_creates_new_node_for_novel_entity      # No match → creates node
  test_agent_undo_reverts_link                      # Undo restores pending status
  test_agent_tenant_isolation                       # Tenant A can't see Tenant B entities
  test_agent_learning_from_feedback                 # 👍 increases future auto-link rate
  test_agent_graduation_path                        # supervised → autonomous after N successes

---

Plan 2: Outlook Email Attachment Ingestion

Overview

Download and parse attachments from Outlook emails during historical sync. Reuses the existing Docling document parsing pipeline used by OneDrive/WorkDrive.

Architecture

Outlook sync fetches email
  ↓
Check hasAttachments flag
  ↓
If true: GET /me/messages/{id}/attachments (metadata)
  ↓
For each attachment:
  GET /me/messages/{id}/attachments/{attachmentId}/$value (binary)
  ↓
Pass through DoclingProcessor (same as OneDrive/WorkDrive)
  ↓
Append parsed text to email body for LLM extraction
  ↓
Create DiscoveredEntity with source_record_type="outlook_attachment"

Files to Create/Modify

New Files

**backend-saas/tests/test_outlook_attachment_ingestion_tdd.py**

  • TDD tests covering:
    • Attachment metadata fetched correctly
    • Binary content downloaded
    • PDF parsed via Docling
    • DOCX parsed via Docling
    • Text appended to email body for extraction
    • Attachment entities created with proper type
    • Large attachments skipped (respect MAX size)
    • Rate limiting for attachment API calls

Modified Files

**backend-saas/integrations/outlook_service.py**

  • Add download_attachment(message_id, attachment_id) method
  • Calls GET /me/messages/{message_id}/attachments/{attachment_id}/$value
  • Returns bytes
  • Reuses existing self.client (httpx) and access token
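A minimal sketch of download_attachment, written as a free function so it can be tested with a fake client. It assumes the client behaves like httpx.AsyncClient configured with the MS Graph base URL and auth headers, so a relative path is enough; whether the existing service sets base_url this way is an assumption.

```python
from typing import Any, Protocol


class AsyncHTTPClient(Protocol):
    """The slice of httpx.AsyncClient this sketch relies on."""

    async def get(self, url: str) -> Any: ...


async def download_attachment(
    client: AsyncHTTPClient, message_id: str, attachment_id: str
) -> bytes:
    """Fetch the raw bytes of one attachment via the $value endpoint."""
    url = f"/me/messages/{message_id}/attachments/{attachment_id}/$value"
    response = await client.get(url)
    # Surface 404s and throttling responses as exceptions for the caller.
    response.raise_for_status()
    return response.content
```

In the service this would become a method that passes self.client; the caller decides how to handle a 404 for a missing attachment.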

**backend-saas/core/integrations/adapters/outlook_integration_v2.py**

  • Add download_file(file_id) method (standardized contract for _prepare_record_text_async)
  • file_id format: "{message_id}:{attachment_id}"
  • Delegates to Outlook service's download_attachment
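The composite file_id needs a small encode/decode pair. A sketch, assuming neither ID contains a colon; make_file_id and split_file_id are hypothetical helper names.

```python
def make_file_id(message_id: str, attachment_id: str) -> str:
    """Build the '{message_id}:{attachment_id}' identifier."""
    if ":" in message_id or ":" in attachment_id:
        raise ValueError("IDs must not contain ':'")
    return f"{message_id}:{attachment_id}"


def split_file_id(file_id: str) -> tuple[str, str]:
    """Split a composite file_id back into (message_id, attachment_id)."""
    message_id, sep, attachment_id = file_id.partition(":")
    if not sep or not message_id or not attachment_id:
        raise ValueError(f"malformed file_id: {file_id!r}")
    return message_id, attachment_id
```

download_file(file_id) would then call split_file_id and pass both parts to download_attachment.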

**backend-saas/core/ingestion_pipeline.py**

  • In _prepare_record_text_async, add handling for integration_id == "outlook"
  • Check if record has hasAttachments == true
  • Fetch and parse each attachment
  • Append parsed text: record["body"] += "\n\n[Attachment: {name}]\n{parsed_text}"
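The append step is simple enough to isolate in a helper. This sketch assumes record["body"] is already a plain string at this point in the pipeline and that empty parse results should be skipped; append_attachment_text is a hypothetical name.

```python
def append_attachment_text(record: dict, name: str, parsed_text: str) -> dict:
    """Append one attachment's parsed text to the record body, in place."""
    if not parsed_text.strip():
        # Nothing useful was extracted; leave the body untouched.
        return record
    body = record.get("body") or ""
    record["body"] = f"{body}\n\n[Attachment: {name}]\n{parsed_text}"
    return record
```

The [Attachment: name] marker lets the extraction LLM tell attachment text apart from the email body.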

**backend-saas/core/historical_sync_service.py**

  • No changes needed — _prepare_record_text_async is already called per record

Reuse from OneDrive

The OneDrive adapter (core/integrations/adapters/onedrive.py) already implements:

  • download_file(file_id) — downloads binary from MS Graph
  • Docling integration via _prepare_record_text_async

The Outlook service (integrations/outlook_service.py) already has:

  • self.client (httpx.AsyncClient) with auth headers
  • MS Graph API base URL
  • Access token management

The attachment flow is identical: MS Graph API → binary → Docling → text.

Only the endpoint differs: /drive/items/{id}/content vs /messages/{id}/attachments/{aid}/$value.

Rate Limiting & Limits

| Setting | Default | Description |
| --- | --- | --- |
| OUTLOOK_MAX_ATTACHMENT_SIZE_MB | 10 | Skip attachments larger than this |
| OUTLOOK_ATTACHMENT_RATE_LIMIT | 10/min | Cap on attachment requests, to avoid MS Graph throttling |
| OUTLOOK_ATTACHMENT_TYPES | pdf,docx,xlsx,pptx,txt,csv | Supported formats (via Docling) |
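These settings can be enforced before any binary download, using only the attachment metadata. The sketch below assumes each metadata dict carries the name and size (in bytes) fields that MS Graph returns for file attachments, and treats 1 MB as 1024 × 1024 bytes. should_ingest and SlidingWindowLimiter are hypothetical helpers, not existing code.

```python
import time
from collections import deque
from typing import Callable

# Defaults copied from the settings table.
MAX_ATTACHMENT_SIZE_MB = 10
ATTACHMENT_TYPES = frozenset({"pdf", "docx", "xlsx", "pptx", "txt", "csv"})


def should_ingest(attachment: dict) -> bool:
    """Accept an attachment only if its extension and size are allowed."""
    name = attachment.get("name") or ""
    _, dot, ext = name.rpartition(".")
    size = attachment.get("size") or 0
    within_limit = size <= MAX_ATTACHMENT_SIZE_MB * 1024 * 1024
    return bool(dot) and ext.lower() in ATTACHMENT_TYPES and within_limit


class SlidingWindowLimiter:
    """Allow at most max_calls per period_s seconds (e.g. 10 per 60)."""

    def __init__(self, max_calls: int, period_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_calls = max_calls
        self._period_s = period_s
        self._clock = clock
        self._calls = deque()  # timestamps of calls inside the window

    def try_acquire(self) -> bool:
        """Return True and record the call if the window has room."""
        now = self._clock()
        # Drop timestamps that have aged out of the window.
        while self._calls and now - self._calls[0] >= self._period_s:
            self._calls.popleft()
        if len(self._calls) < self._max_calls:
            self._calls.append(now)
            return True
        return False
```

The injectable clock makes the throttle test deterministic; a caller that gets False can sleep and retry or defer the attachment to a later sync chunk.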

TDD Verification

TestOutlookAttachmentIngestion:
  test_fetches_attachment_metadata           # GET /attachments returns list
  test_downloads_attachment_content          # GET /$value returns bytes
  test_parses_pdf_attachment                 # Docling extracts PDF text
  test_parses_docx_attachment                # Docling extracts DOCX text
  test_appends_text_to_email_body            # Body + attachment text combined
  test_skips_large_attachments               # >10MB skipped with warning
  test_handles_missing_attachment            # 404 from Graph API
  test_creates_attachment_entities           # DiscoveredEntity with type="Document"
  test_rate_limits_attachment_requests       # Respects 10/min throttle
  test_tenant_isolation                      # Cross-tenant attachment access blocked

---

Implementation Order

Phase 1: Attachment Ingestion (1-2 sessions)

  1. Add download_attachment to outlook_service.py
  2. Add download_file to outlook_integration_v2.py
  3. Modify _prepare_record_text_async for outlook integration
  4. TDD verification
  5. Deploy + test with Brennan's Outlook backfill

Phase 2: Automated Linking Agent (2-3 sessions)

  1. Create EntityLinkingJudge with LLM prompt
  2. Create EntityLinkingAgent extending agent framework
  3. Wire into ingestion pipeline (after schema discovery)
  4. Add API endpoints (auto-link, undo)
  5. Add UI components (Auto-Link All button, Undo, judgment display)
  6. TDD verification
  7. Register as default agent for all tenants
  8. Deploy + monitor linking decisions

Phase 3: Integration & Polish (1 session)

  1. End-to-end TDD: sync → extract → link → view in Knowledge Graph
  2. Agent graduation: supervised → autonomous based on feedback
  3. Documentation update

---

Dependencies

| Dependency | Status | Notes |
| --- | --- | --- |
| Docling processor | ✅ Deployed | get_docling_processor() singleton |
| Agent framework | ✅ Deployed | AgentRegistry, governance, graduation |
| EntityLinkingService | ✅ Exists | link_entities_to_graph already written |
| SchemaDiscoveryService | ✅ Exists | Creates EntityTypeDefinition drafts |
| MS Graph API access | ✅ Working | Token refresh handled by outlook_service |
| Outlook attachment endpoints | ⚠️ Needs scope | Mail.Read should cover reading attachments; confirm the app registration grants it. Files.Read applies to OneDrive files, not mail attachments |