
Add Omi QoS tier system for model cost optimization #6831

@beastoin

Description


Backend LLM costs span multiple providers (OpenAI, Anthropic, Google/Gemini, Perplexity) across 60+ callsites. There is no unified way to control model selection per-feature — models are hardcoded throughout the codebase, making cost optimization and A/B testing impossible without code changes.

Current Behavior

  • 60+ LLM callsites across 4 providers with hardcoded model instances
  • OpenAI: 15+ features using llm_mini, llm_medium, llm_medium_experiment directly
  • Anthropic: chat agent hardcoded to claude-sonnet-4-6
  • OpenRouter: persona chat and wrapped analysis hardcoded to specific Gemini/Claude models
  • Perplexity: web search hardcoded to sonar-pro
  • No mechanism to downgrade/upgrade models per-feature without code changes
  • No way to switch cost profiles (e.g., "run everything on cheapest acceptable models")

Expected Behavior

A provider-agnostic QoS profile system where each profile (mini/medium/high) maps every feature to a specific model — potentially different model tiers within the same profile, since some features need more quality than others even in a cost-optimized profile.

Solution

QoS Profiles — each profile is a complete feature→model mapping across all providers:

MODEL_QOS_MINI:
  conv_action_items:  gpt-4.1-nano       (cheapest, structured extraction)
  conv_structure:     gpt-4.1-mini       (needs more quality)
  chat_agent:         claude-haiku-3.5   (cost-optimized chat)
  persona_chat:       gemini-flash-1.5-8b
  ...

MODEL_QOS_MEDIUM:
  conv_action_items:  gpt-4.1-mini
  conv_structure:     gpt-5.1
  chat_agent:         claude-sonnet-4-6
  persona_chat:       claude-3.5-sonnet
  ...

MODEL_QOS_HIGH:
  conv_action_items:  gpt-5.1
  conv_structure:     o4-mini
  chat_agent:         claude-sonnet-4-6
  persona_chat:       gemini-3-flash-preview
  ...

  • Global switch: MODEL_QOS=mini selects the entire profile
  • Per-feature override: MODEL_QOS_CONV_STRUCTURE=gpt-5.1 overrides a single feature
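A minimal sketch of the resolution order (pin > per-feature override > profile). The profile dicts below are abbreviated from the tables above; `get_model` is the function named in the Affected Areas table, and the fair_use pin value is a hypothetical placeholder since the issue does not name it:

```python
import os

# Abbreviated, illustrative profile tables; full profiles live in utils/llm/clients.py
MODEL_QOS_PROFILES = {
    "mini": {
        "conv_action_items": "gpt-4.1-nano",
        "conv_structure": "gpt-4.1-mini",
        "chat_agent": "claude-haiku-3.5",
    },
    "medium": {
        "conv_action_items": "gpt-4.1-mini",
        "conv_structure": "gpt-5.1",
        "chat_agent": "claude-sonnet-4-6",
    },
}

# Accuracy-critical features stay on a fixed model regardless of the profile
PINNED_MODELS = {
    "fair_use": "gpt-4.1-mini",  # hypothetical pin; actual model not specified in this issue
}

def get_model(feature: str) -> str:
    """Resolve a feature name to a model: pin > env override > active profile."""
    if feature in PINNED_MODELS:
        return PINNED_MODELS[feature]
    # Per-feature override, e.g. MODEL_QOS_CONV_STRUCTURE=gpt-5.1
    override = os.getenv(f"MODEL_QOS_{feature.upper()}")
    if override:
        return override
    profile = os.getenv("MODEL_QOS", "medium")
    return MODEL_QOS_PROFILES[profile][feature]
```

With this precedence, setting MODEL_QOS=mini downgrades everything at once, while a single per-feature env var can still hold one feature on a stronger model for an A/B test.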

21 features across 4 providers:

  • OpenAI (16): conv_action_items, conv_structure, conv_apps, daily_summary, memories, memory_conflict, memory_category, knowledge_graph, chat_responses, chat_extraction, session_titles, goals, notifications, followup, smart_glasses, onboarding
  • Anthropic (1): chat_agent
  • OpenRouter (3): persona_chat, persona_clone, wrapped_analysis
  • Perplexity (1): web_search

Pinned features: the fair_use classifier is pinned to a specific model regardless of the active profile (accuracy-critical).

Affected Areas

| Area | Files | Callsites |
| --- | --- | --- |
| QoS core | utils/llm/clients.py | Profile definitions, get_model(), client factories |
| Conversation processing | utils/llm/conversation_processing.py | 5 callsites |
| Memories | utils/llm/memories.py | 4 callsites |
| Knowledge graph | utils/llm/knowledge_graph.py | 2 callsites |
| Chat | utils/llm/chat.py | 10+ callsites |
| Persona | utils/llm/persona.py | 5 callsites |
| Goals | utils/llm/goals.py | 3 callsites |
| Notifications | utils/llm/notifications.py | 2 callsites |
| Agentic chat | utils/retrieval/agentic.py | 1 callsite (Anthropic) |
| Wrapped | utils/wrapped/generate_2025.py | 9 callsites (Gemini) |
| Other | Various routers/utils | 10+ callsites |
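Migrating a callsite away from a hardcoded instance (e.g. llm_mini) could look like the sketch below. `llm_for` and `StubClient` are hypothetical names for this illustration; real client factories would construct provider SDK clients in utils/llm/clients.py:

```python
from functools import lru_cache

class StubClient:
    """Stand-in for a provider SDK client (hypothetical; real code would
    build an OpenAI/Anthropic/OpenRouter/Perplexity client here)."""
    def __init__(self, model: str):
        self.model = model

    def invoke(self, prompt: str) -> str:
        # Echo the model name so callers can see which tier served the request
        return f"[{self.model}] {prompt}"

@lru_cache(maxsize=None)
def llm_for(model: str) -> StubClient:
    # One cached client per resolved model name, so the 60+ callsites
    # that resolve to the same model share a single instance
    return StubClient(model)

# Before: result = llm_mini.invoke(prompt)          # model hardcoded at the callsite
# After:  result = llm_for(get_model("conv_action_items")).invoke(prompt)
```

Caching per resolved model name (rather than per feature) keeps the client count bounded by the number of distinct models in the active profile, not the number of features.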

Impact

Unified cost control across all LLM providers. One env var (MODEL_QOS=mini) to switch the entire backend to cost-optimized models. Per-feature overrides for A/B testing. No user-facing changes.


by AI for @beastoin
