diff --git a/docs/technical-design.md b/docs/technical-design.md
index 6439f1a..835f955 100644
--- a/docs/technical-design.md
+++ b/docs/technical-design.md
@@ -1,6 +1,168 @@
 # Technical Design & Architecture
 
-This document outlines key architectural decisions and data flows within the Rewards Eligibility Oracle.
+This document visually represents the Rewards Eligibility Oracle codebase, offering a more approachable alternative to reading through the code directly.
+
+## End-to-End Oracle Flow
+
+The Rewards Eligibility Oracle operates as a daily scheduled service that evaluates indexer performance and updates on-chain rewards eligibility via function calls to the RewardsEligibilityOracle contract. The diagram below shows the complete execution flow: from the scheduler trigger, through data processing, to blockchain submission and error handling.
+
+The Oracle is designed to be resilient to transient network issues and RPC provider failures. It layers three mechanisms: internal retries, RPC provider rotation, and a circuit breaker that prevents costly infinite restart loops, which would otherwise needlessly burn through BigQuery requests.
+
+```mermaid
+---
+title: Rewards Eligibility Oracle - End-to-End Flow
+---
+%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#fee2e2', 'primaryTextColor':'#7f1d1d', 'primaryBorderColor':'#ef4444', 'lineColor':'#6b7280'}}}%%
+
+graph TB
+    %% Docker Container - Contains all oracle logic
+    subgraph DOCKER["Docker Container"]
+        Scheduler["Python Scheduler"]
+        Oracle["Rewards Eligibility Oracle"]
+
+        subgraph CIRCUIT_BREAKER["Circuit Breaker Logic"]
+            CB["Circuit Breaker"]
+            CBCheck{"Have there been more<br/>than 3 failures in the<br/>last 60 minutes?"}
+        end
+
+        Scheduler -.->|"Phase 1: Schedule daily run"| Oracle
+
+        %% Data Pipeline
+        subgraph PIPELINE["Data Pipeline"]
+            CacheCheck{"Do we have recent cached<br/>BigQuery results available?<br/>(< 30 min old)"}
+
+            subgraph BIGQUERY["BigQuery Analysis"]
+                FetchData["Fetch Indexer Performance Data<br/>over last 28 days<br/>(from BigQuery)"]
+                SQLQuery["- Daily query metrics<br/>- Days online calculation<br/>- Subgraph coverage"]
+            end
+
+            subgraph PROCESSING["Eligibility Processing"]
+                ApplyCriteria["Apply Criteria, e.g.:<br/>5+ days online<br/>Latency < 5000ms<br/>Blocks behind < 50000<br/>1+ subgraph served"]
+                FilterData["Filter Eligible<br/>vs Ineligible"]
+                GenArtifacts["Generate CSV Artifacts:<br/>- eligible_indexers.csv<br/>- ineligible_indexers.csv<br/>- full_metrics.csv"]
+            end
+        end
+
+        %% Blockchain Layer
+        subgraph BLOCKCHAIN["Blockchain Submission"]
+            Batch["Read eligible indexers<br/>from CSV.<br/>Batch indexer addresses<br/>into groups of 125."]
+
+            subgraph RPC["RPC Failover System"]
+                TryRPC["Try to establish connection<br/>with RPC provider"]
+                RPCError["Rotate to next RPC provider"]
+            end
+
+            BuildTx["Build Transaction:<br/>- Estimate gas<br/>- Get nonce<br/>- Sign with key"]
+            SubmitTx["Submit Batch to Contract<br/>via function call:<br/>renewIndexerEligibility()"]
+            WaitReceipt["Wait for Receipt<br/>30s timeout"]
+            MoreBatches{"More<br/>Batches?"}
+        end
+
+        %% Monitoring
+        subgraph MONITOR["Monitoring & Notifications"]
+            SlackSuccess["Slack Success:<br/>- Eligible count<br/>- Execution time<br/>- Transaction links"]
+            SlackFailCircuitBreaker["Stop container: sys.exit(0)<br/>Container will not restart<br/>Manual intervention needed<br/>Send notification to team<br/>Slack channel for debugging"]
+            SlackFailRPC["Stop container: sys.exit(1)<br/>Container will restart<br/>Send notification to Slack"]
+            SlackRotate["Send Slack notification"]
+        end
+    end
+
+    %% External Systems - Define after Docker subgraph
+    RPCProviders["Pool of 4 RPC providers<br/>(External Infrastructure)"]
+    BQ["Google BigQuery<br/>Indexer Performance Data"]
+
+    subgraph FailureLogStorage["Data Storage<br/>(mounted volume)"]
+        CBLog["Failure log"]
+    end
+
+    subgraph HistoricalDataStorage["Data Storage<br/>(mounted volume)"]
+        HistoricalData["Historical archive of<br/>eligible and ineligible<br/>indexers by date<br/>YYYY-MM-DD"]
+    end
+
+    END_NO_RESTART["FAILURE<br/>Container Stopped<br/>No Restart<br/>Manual Intervention Required"]
+    END_WITH_RESTART["FAILURE<br/>Container Stopped<br/>Restart Container<br/>Will retry entire loop again"]
+    SUCCESS["SUCCESS<br/>Wait for next<br/>scheduled trigger"]
+
+    %% Main Flow - Start with Docker container to anchor it left
+    Oracle -->|"Phase 1.1: Check if oracle<br/>should run"| CB
+    CB -->|"Phase 1.2: Read log"| CBLog
+    CBLog -->|"Phase 1.3: Return log"| CB
+    CB -->|"Phase 1.4: Provide failure<br/>timestamps (if they exist)"| CBCheck
+    CBCheck -->|"Phase 2:<br/>(Regular Path)<br/>No"| CacheCheck
+    CacheCheck -->|"Phase 2.1: Check for<br/>recent cached data"| HistoricalData
+    HistoricalData -->|"Phase 2.2: Return recent eligible indexers<br/>from eligible_indexers.csv<br/>(if they exist)"| CacheCheck
+    CBCheck -.->|"Phase 2:<br/>(Alternative Path)<br/>Yes"| SlackFailCircuitBreaker
+    SlackFailCircuitBreaker -.-> END_NO_RESTART
+
+    CacheCheck -->|"Phase 3:<br/>(Alternative Path)<br/>Yes"| Batch
+    CacheCheck -->|"Phase 3:<br/>(Regular Path)<br/>No"| FetchData
+
+    FetchData -->|"Phase 3.1: Query data<br/>from BigQuery"| BQ
+    BQ -->|"Phase 3.2: Return metrics"| SQLQuery
+    SQLQuery -->|"Phase 3.3: Process results"| ApplyCriteria
+    ApplyCriteria --> FilterData
+    FilterData -->|"Phase 3.4: Generate CSVs"| GenArtifacts
+    GenArtifacts -->|"Phase 3.5: Save data"| HistoricalData
+    GenArtifacts --> Batch
+
+    Batch -->|"Phase 4.1: For each batch"| TryRPC
+    TryRPC -->|"Phase 4.2: Connect"| RPCProviders
+    RPCProviders -->|"Phase 4.3:<br/>(Regular Path)<br/>RPC connection established"| BuildTx
+    RPCProviders -.->|"Phase 4.3:<br/>(Alternative Path)<br/>RPC connection failed<br/>after multiple attempts"| RPCError
+    RPCError -.->|"Notify"| SlackRotate
+    RPCError -->|"All providers exhausted"| SlackFailRPC
+    SlackFailRPC --> END_WITH_RESTART
+    RPCError -->|"Connection successful"| BuildTx
+
+    BuildTx --> SubmitTx
+    SubmitTx --> WaitReceipt
+
+    WaitReceipt -->|"Phase 4.4: Batch confirmed"| MoreBatches
+
+    MoreBatches -->|"Yes<br/>Back to Phase 4 loop<br/>Process next batch"| Batch
+    MoreBatches -->|"Phase 5: No<br/>All complete"| SlackSuccess
+    SlackSuccess --> SUCCESS
+
+    %% Styling
+    classDef schedulerStyle fill:#fee2e2,stroke:#ef4444,stroke-width:3px,color:#7f1d1d
+    classDef oracleStyle fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#92400e
+    classDef dataStyle fill:#f0fdf4,stroke:#16a34a,stroke-width:2px,color:#14532d
+    classDef processingStyle fill:#e0e7ff,stroke:#6366f1,stroke-width:2px,color:#312e81
+    classDef blockchainStyle fill:#fee2e2,stroke:#ef4444,stroke-width:2px,color:#7f1d1d
+    classDef monitorStyle fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#92400e
+    classDef infraStyle fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#374151
+    classDef contractStyle fill:#dbeafe,stroke:#2563eb,stroke-width:3px,color:#1e3a8a
+    classDef decisionStyle fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#92400e
+    classDef endStyle fill:#7f1d1d,stroke:#991b1b,stroke-width:3px,color:#fee2e2
+    classDef endStyleOrange fill:#ea580c,stroke:#c2410c,stroke-width:3px,color:#ffedd5
+    classDef successStyle fill:#14532d,stroke:#166534,stroke-width:3px,color:#f0fdf4
+
+    class Scheduler schedulerStyle
+    class Oracle,CB oracleStyle
+    class FetchData,SQLQuery,BQ dataStyle
+    class ApplyCriteria,FilterData,GenArtifacts processingStyle
+    class Batch,TryRPC,BuildTx,SubmitTx,WaitReceipt,RPCError blockchainStyle
+    class SlackSuccess,SlackFailCircuitBreaker,SlackFailRPC,SlackRotate monitorStyle
+    class RPCProviders,HistoricalData,CBLog infraStyle
+    class Contract contractStyle
+    class CacheCheck,MoreBatches,CBCheck decisionStyle
+    class END_NO_RESTART endStyle
+    class END_WITH_RESTART endStyleOrange
+    class SUCCESS successStyle
+
+    style DOCKER fill:#dbeafe,stroke:#2563eb,stroke-width:3px,color:#1e3a8a
+    style CIRCUIT_BREAKER fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#92400e
+    style PIPELINE fill:#f0fdf4,stroke:#16a34a,stroke-width:2px,color:#14532d
+    style BIGQUERY fill:#dcfce7,stroke:#16a34a,stroke-width:2px,color:#14532d
+    style PROCESSING fill:#e0e7ff,stroke:#6366f1,stroke-width:2px,color:#312e81
+    style BLOCKCHAIN fill:#fee2e2,stroke:#ef4444,stroke-width:2px,color:#7f1d1d
+    style RPC fill:#fecaca,stroke:#ef4444,stroke-width:2px,color:#7f1d1d
+    style MONITOR fill:#fef3c7,stroke:#f59e0b,stroke-width:2px,color:#92400e
+    style FailureLogStorage fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#374151
+    style HistoricalDataStorage fill:#f3f4f6,stroke:#6b7280,stroke-width:2px,color:#374151
+```
+
+---
 
 ## RPC Provider Failover and Circuit Breaker Logic
 
@@ -21,24 +183,22 @@ sequenceDiagram
     # Describe failure loop inside the blockchain_client module
     activate blockchain_client
 
-    alt RPC Loop (for each provider)
+    loop For each provider in pool
 
-        # Attempt RPC call
-        blockchain_client->>blockchain_client: _execute_rpc_call() with provider A
-        note right of blockchain_client: Fails after 5 retries
+        # Attempt RPC call
+        blockchain_client->>blockchain_client: _execute_rpc_call() with next provider
+        note right of blockchain_client: Fails after 3 attempts
 
-        # Log failure
+        # Log failure and rotate
         blockchain_client-->>blockchain_client: raises ConnectionError
-        note right of blockchain_client: Catches error, logs rotation
+        note right of blockchain_client: Catches error, rotates to next provider
 
-        # Retry RPC call
-        blockchain_client->>blockchain_client: _execute_rpc_call() with provider B
-        note right of blockchain_client: Fails after 5 retries
+        # Send rotation notification
+        blockchain_client->>slack_notifier: send_info_notification()
+        note right of slack_notifier: RPC provider rotation alert
 
-        # Log final failure
-        blockchain_client-->>blockchain_client: raises ConnectionError
-        note right of blockchain_client: All providers tried and failed
     end
+    note right of blockchain_client: All providers exhausted
 
     # Raise error back to main_oracle oracle and exit blockchain_client module
     blockchain_client-->>main_oracle: raises Final ConnectionError
@@ -51,6 +211,6 @@ sequenceDiagram
     main_oracle->>slack_notifier: send_failure_notification()
 
     # Document restart process
-    note right of main_oracle: sys.exit(1)
-    note right of main_oracle: Docker will restart. CircuitBreaker can halt via sys.exit(0)
+    note right of main_oracle: sys.exit(1) triggers Docker restart
+    note right of main_oracle: Circuit breaker uses sys.exit(0) to prevent restart
 ```
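The circuit breaker in Phase 1 of the end-to-end flow halts the oracle when there have been more than 3 failures in the last 60 minutes. Below is a minimal Python sketch of that check, not the oracle's actual code. It assumes the failure log has already been parsed into timezone-aware `datetime` values. The names `should_halt`, `check_circuit_breaker`, `MAX_FAILURES` and `WINDOW` are hypothetical. Exiting with code 0 only prevents a restart if the container uses Docker's `on-failure` restart policy; an `always` policy would restart it regardless. That deployment detail is also an assumption here.

```python
import sys
from datetime import datetime, timedelta, timezone

# Hypothetical thresholds mirroring the diagram: more than 3 failures in 60 minutes.
MAX_FAILURES = 3
WINDOW = timedelta(minutes=60)


def should_halt(failure_times: list[datetime], now: datetime) -> bool:
    """Return True when more than MAX_FAILURES failures fall inside WINDOW."""
    recent = [t for t in failure_times if now - t <= WINDOW]
    return len(recent) > MAX_FAILURES


def check_circuit_breaker(failure_times: list[datetime]) -> None:
    """Exit with code 0 when tripped (no restart under an on-failure policy)."""
    if should_halt(failure_times, datetime.now(timezone.utc)):
        # A real implementation would send the Slack alert before exiting.
        sys.exit(0)
```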
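In Phase 2 the oracle reuses cached results that are less than 30 minutes old, which lets a restarted container skip the BigQuery step and go straight to batching. The sketch below is one possible implementation under stated assumptions:

- Freshness is judged by the modification time of the cached `eligible_indexers.csv`.
- The path layout is hypothetical.
- The helper names `is_fresh` and `cached_csv_if_fresh` are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional

# Maximum cache age from the diagram ("< 30 min old").
MAX_CACHE_AGE_SECONDS = 30 * 60


def is_fresh(mtime: float, now: float, max_age: float = MAX_CACHE_AGE_SECONDS) -> bool:
    """Pure check: True if a file modified at `mtime` is younger than `max_age` seconds."""
    return 0 <= now - mtime < max_age


def cached_csv_if_fresh(csv_path: Path) -> Optional[Path]:
    """Return csv_path when it exists and is recent enough to skip BigQuery, else None."""
    if csv_path.is_file() and is_fresh(csv_path.stat().st_mtime, time.time()):
        return csv_path
    return None
```

Keeping `is_fresh` free of file-system access makes the 30-minute rule easy to test in isolation.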
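Phase 3 applies example thresholds and splits indexers into the eligible and ineligible sets written to CSV. The thresholds are 5+ days online, latency under 5000 ms, fewer than 50000 blocks behind, and at least 1 subgraph served. The sketch below treats each threshold as an independent check on pre-aggregated per-indexer metrics. The `IndexerMetrics` fields are hypothetical. The diagram does not show how the real BigQuery SQL combines these conditions, for example whether they are evaluated per day before days online are counted.

```python
from dataclasses import dataclass

# Example thresholds from the diagram ("Apply Criteria, e.g."); illustrative only.
MIN_DAYS_ONLINE = 5
MAX_LATENCY_MS = 5000
MAX_BLOCKS_BEHIND = 50_000
MIN_SUBGRAPHS = 1


@dataclass(frozen=True)
class IndexerMetrics:
    """Hypothetical per-indexer aggregate over the 28-day window."""
    address: str
    days_online: int
    avg_latency_ms: float
    max_blocks_behind: int
    subgraphs_served: int


def is_eligible(m: IndexerMetrics) -> bool:
    """Apply every example threshold; all must pass."""
    return (
        m.days_online >= MIN_DAYS_ONLINE
        and m.avg_latency_ms < MAX_LATENCY_MS
        and m.max_blocks_behind < MAX_BLOCKS_BEHIND
        and m.subgraphs_served >= MIN_SUBGRAPHS
    )


def split_eligible(rows: list[IndexerMetrics]) -> tuple[list[str], list[str]]:
    """Return (eligible, ineligible) addresses, mirroring the two CSV outputs."""
    eligible = [m.address for m in rows if is_eligible(m)]
    ineligible = [m.address for m in rows if not is_eligible(m)]
    return eligible, ineligible
```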
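Phase 4 submits eligible addresses to the contract in batches of 125. A hypothetical `batched` helper that preserves order and yields a shorter final batch could look like this. The contract call itself (`renewIndexerEligibility()`) is left out because its exact signature is not shown in this document.

```python
from typing import Iterator, Sequence

# Batch size shown in the diagram.
BATCH_SIZE = 125


def batched(addresses: Sequence[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` addresses, preserving order."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(addresses), size):
        yield list(addresses[start:start + size])
```

On Python 3.12 and later, `itertools.batched(addresses, 125)` produces the same grouping as tuples.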
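The RPC failover loop in the sequence diagram works in three steps:

- It tries each provider up to 3 times.
- It sends a rotation notification before moving to the next provider.
- It raises a final `ConnectionError` once the pool is exhausted.

The sketch below models that control flow with injected callables. Several details are assumptions for illustration: `call_with_failover`, its `rpc_call` and `notify` parameters, the optional backoff, and retrying only on `ConnectionError`. They do not describe the behavior of `_execute_rpc_call()` or `send_info_notification()`.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")

# Attempts per provider, per the sequence diagram ("Fails after 3 attempts").
ATTEMPTS_PER_PROVIDER = 3


def call_with_failover(
    providers: Sequence[str],
    rpc_call: Callable[[str], T],
    notify: Callable[[str], None],
    attempts: int = ATTEMPTS_PER_PROVIDER,
    backoff_seconds: float = 0.0,
) -> T:
    """Try each provider in order; raise ConnectionError when all are exhausted."""
    last_error: Optional[Exception] = None
    for index, provider in enumerate(providers):
        for attempt in range(1, attempts + 1):
            try:
                return rpc_call(provider)
            except ConnectionError as exc:
                last_error = exc
                # Optional linear backoff between attempts on the same provider.
                if attempt < attempts and backoff_seconds > 0:
                    time.sleep(backoff_seconds * attempt)
        if index + 1 < len(providers):
            # Mirrors "Rotate to next RPC provider" plus the Slack rotation alert.
            notify(f"RPC provider {provider} failed; rotating to {providers[index + 1]}")
    raise ConnectionError("All RPC providers exhausted") from last_error
```

In `main_oracle`, the final `ConnectionError` would then be caught, reported with `send_failure_notification()`, and turned into `sys.exit(1)` so that Docker restarts the container.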