Context
Currently, the system lacks visibility into individual user token usage. A flat subscription model risks significant cost overruns due to usage disparities between user groups (e.g., standard undergraduate use vs. heavy usage by PhD students/developers).
We urgently need to implement per-user cost tracking to distinguish heavy users from light users, paving the way for future tiered pricing or quota limits.
Goals
- Short-term: Accurately track and record input/output token consumption per user.
- Long-term: Establish a unified AI Gateway (LiteLLM) to handle automatic pricing, cost calculation, and user-defined API keys.
- Legacy Debt: Address the billing visibility and stability issues within the current Python-based MCP (Model Context Protocol) service.
Proposed Solutions
Option A: MVP (Quick Implementation)
- Direct Logging: Parse the `usage` field from the LLM API response in the backend and write token counts directly to the database (a minimal parsing sketch follows this list).
- Retroactive Calculation: Implement a Python cronjob to read historical chat logs and calculate past token consumption using an offline tokenizer.
- Note: Embedding costs are negligible and can be ignored or calculated separately.
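For the direct-logging path, a minimal sketch of what the backend extraction could look like, assuming an OpenAI-compatible response shape (the `usage` field names follow the OpenAI spec; `TokenRecord` and `ExtractUsage` are illustrative names, not existing code):

```go
package billing

import (
	"encoding/json"
	"time"
)

// Usage mirrors the OpenAI-compatible `usage` object most providers return.
type Usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

// completionResponse keeps only the fields we need from the provider payload.
type completionResponse struct {
	Model string `json:"model"`
	Usage Usage  `json:"usage"`
}

// TokenRecord is the row persisted per request (schema sketched in Phase 1).
type TokenRecord struct {
	UserID       string
	Model        string
	InputTokens  int
	OutputTokens int
	CreatedAt    time.Time
}

// ExtractUsage parses a non-streaming response body into a record ready
// for the database write.
func ExtractUsage(userID string, body []byte) (*TokenRecord, error) {
	var resp completionResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	return &TokenRecord{
		UserID:       userID,
		Model:        resp.Model,
		InputTokens:  resp.Usage.PromptTokens,
		OutputTokens: resp.Usage.CompletionTokens,
		CreatedAt:    time.Now().UTC(),
	}, nil
}
```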
Option B: AI Gateway (LiteLLM) - Recommended
Deploy LiteLLM as a unified gateway to manage requests to OpenRouter and other providers.
- Billing Integration: Utilize the cost/usage information returned in LiteLLM response headers.
- Challenge: Verify whether the streaming API accurately returns cost headers; if not, the cost must be calculated at the end of the stream.
- User Attribution: Inject the user ID into request headers to leverage LiteLLM's metadata/tagging for user-level statistics (interceptor sketch follows this list).
- Custom Model Support: Implement a Go interceptor to handle logic for custom API keys. If the model is not in the whitelist, dynamically construct the config for the user-provided key.
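A rough sketch of the interceptor as an `http.RoundTripper` wrapper. The header name `X-User-ID` and the `whitelisted` lookup are placeholders; the actual header/metadata key that LiteLLM maps to its tagging feature must be confirmed against the deployed version:

```go
package gateway

import "net/http"

// userIDHeader is a placeholder; confirm the header/metadata key that the
// deployed LiteLLM version maps to its tagging feature.
const userIDHeader = "X-User-ID"

// whitelisted reports whether a model is served with our own provider keys.
// The lookup source (config, DB) is left open here.
var whitelisted = map[string]bool{"gpt-4o": true}

// AttributionTransport injects the caller's user ID into every outbound
// gateway request, and swaps in a user-provided key for models outside
// the whitelist.
type AttributionTransport struct {
	Base       http.RoundTripper
	UserID     string
	UserAPIKey string // set only when the user brings their own key
	Model      string
}

func (t *AttributionTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	clone := req.Clone(req.Context())
	clone.Header.Set(userIDHeader, t.UserID)
	if !whitelisted[t.Model] && t.UserAPIKey != "" {
		// Non-whitelisted model: forward the user's own key instead of
		// the platform key so the cost lands on their account.
		clone.Header.Set("Authorization", "Bearer "+t.UserAPIKey)
	}
	base := t.Base
	if base == nil {
		base = http.DefaultTransport
	}
	return base.RoundTrip(clone)
}
```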
Option C: MCP Service Refactor
The current Python-based MCP service poses significant risks:
- Billing Black Box: It runs independently, making it difficult to track internal token consumption (especially for "Deep Research" tasks).
- Instability: High failure rate for function calls and poor handling of edge cases.
- Decision:
  - Plan A: Temporarily disable MCP, as it is not production-ready.
  - Plan B: Migrate core logic to the main Go service to leverage Go's concurrency control and unified billing logic.
Action Plan
Phase 1: Data Infrastructure
- Database Schema: Design `user_token_usage` or `billing_records` tables (a schema sketch follows this list).
- Data Collection:
  - Update the Go backend to extract and store `usage` data from API responses.
  - Create a script/cronjob to backfill usage data for historical messages using an offline tokenizer.
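One possible shape for the `user_token_usage` table, sketched as a GORM model (the cost column and index choices are assumptions to settle during schema review; a plain SQL migration works equally well):

```go
package billing

import "time"

// UserTokenUsage is one candidate shape for the per-request usage table.
// Apply via db.AutoMigrate(&UserTokenUsage{}) or an equivalent migration.
type UserTokenUsage struct {
	ID           uint   `gorm:"primaryKey"`
	UserID       string `gorm:"index;size:64"`
	RequestID    string `gorm:"uniqueIndex;size:64"` // idempotency for retries and backfill
	Model        string `gorm:"size:128"`
	InputTokens  int
	OutputTokens int
	CostUSD      float64   // filled from the gateway in Option B, else computed offline
	CreatedAt    time.Time `gorm:"index"` // enables per-user, time-window aggregation
}
```

The unique index on `RequestID` keeps the write idempotent, so retries and the backfill cronjob cannot double-count a message.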
Phase 2: AI Gateway Integration (LiteLLM)
- Infra: Deploy the LiteLLM Pod and configure ConfigMaps.
- Middleware Development:
  - Implement the Go interceptor to inject the user ID into headers.
  - Implement logic for user-defined keys (dynamic config generation for non-whitelisted models).
- Billing Sync: Parse the LiteLLM response headers (cost/usage) and sync them to the billing database (a parsing sketch follows this list).
- Testing: Verify billing accuracy under `stream: true` mode.
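For the billing-sync step, a sketch of pulling the gateway-computed cost out of a proxied response. Recent LiteLLM releases document a `x-litellm-response-cost` response header, but the exact name should be verified against the deployed version:

```go
package gateway

import (
	"net/http"
	"strconv"
)

// costHeader is the response header LiteLLM uses for the computed request
// cost; verify the exact name against the deployed LiteLLM release.
const costHeader = "x-litellm-response-cost"

// ResponseCost pulls the gateway-computed cost (in USD) out of a proxied
// response. ok is false when the header is absent, e.g. on streamed
// responses where the cost may only be computable after the final chunk.
func ResponseCost(resp *http.Response) (usd float64, ok bool) {
	raw := resp.Header.Get(costHeader)
	if raw == "" {
		return 0, false
	}
	usd, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, false
	}
	return usd, true
}
```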
Phase 3: MCP Governance & Migration
- Audit: Review existing Python MCP code to identify essential features vs. spaghetti code.
- Refactor: Rewrite core MCP logic in Go and integrate it into the main service pipeline.
- Deprecate: Decommission the unstable Python MCP service.
Discussion
- Streaming Costs: If the gateway cannot return real-time costs during streaming, should we implement a token counter in the worker layer as a fallback? (A possible fallback sketch follows this list.)
- Resourcing: The MCP migration is labor-intensive. Should we consider assigning this to interns (undergrads) to assist with the Python-to-Go migration or unit testing?
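If we do take the fallback route for streaming, one way the worker-layer counter could look, using the community `tiktoken-go` port; counts are exact only for models tiktoken covers, and an approximation otherwise:

```go
package worker

import (
	"strings"

	"github.com/pkoukk/tiktoken-go"
)

// CountStreamedTokens approximates output token usage after a stream ends
// by re-encoding the accumulated completion text offline. This is a
// fallback for when the gateway returns no cost header on stream: true.
func CountStreamedTokens(model string, chunks []string) (int, error) {
	enc, err := tiktoken.EncodingForModel(model)
	if err != nil {
		// Unknown model: fall back to a generic encoding.
		enc, err = tiktoken.GetEncoding("cl100k_base")
		if err != nil {
			return 0, err
		}
	}
	full := strings.Join(chunks, "")
	return len(enc.Encode(full, nil, nil)), nil
}
```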