Evaluation: Refactor #503
base: main

Conversation
Walkthrough

Refactors evaluation endpoints into service-driven modules: adds evaluation services (dataset, evaluation, validators), moves CSV validation and dataset upload logic into services, replaces in-endpoint orchestration with service calls, and updates route names and exports accordingly.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant API as Evaluation API
    participant Svc as Dataset Service
    participant Storage as ObjectStore
    participant Langfuse
    participant DB

    Client->>API: POST /evaluation/datasets (file, name, dup_factor)
    API->>Svc: validate_csv_file(file) & sanitize_dataset_name(name)
    Svc->>Svc: parse_csv_items(csv_content)
    alt object store available
        Svc->>Storage: upload_csv -> object_store_url
    end
    Svc->>Langfuse: upload_dataset_to_langfuse(csv_content) -> langfuse_id
    Svc->>DB: create EvaluationDataset(metadata, object_store_url?, langfuse_id)
    Svc-->>API: DatasetUploadResponse
    API-->>Client: 201 Created
```

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant API as Evaluation API
    participant EvalSvc as Evaluation Service
    participant OpenAI
    participant Langfuse
    participant DB

    Client->>API: POST /evaluation/runs (dataset_id, config, assistant_id)
    API->>EvalSvc: start_evaluation(dataset_id, config, assistant_id)
    EvalSvc->>DB: fetch dataset -> verify langfuse_id
    EvalSvc->>EvalSvc: build_evaluation_config (merge assistant config)
    EvalSvc->>DB: create EvaluationRun (status: pending)
    EvalSvc->>Langfuse: start_evaluation_batch(langfuse_dataset_id, config)
    Langfuse-->>EvalSvc: batch started / error
    EvalSvc->>DB: update EvaluationRun (status/external ids)
    EvalSvc-->>API: return EvaluationRun
    API-->>Client: 202 Accepted
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Suggested reviewers
Poem
Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)
52-82: Fix event loop blocking: async endpoint calling sync I/O.

The endpoint is `async def` (required for `await validate_csv_file`), but line 72 calls the synchronous `upload_dataset` service, which performs blocking I/O operations (database queries, Langfuse API calls, object store uploads). This blocks FastAPI's event loop and degrades concurrency.

🔎 Recommended fixes (choose one):
Option 1 (simplest): Run the sync service in a thread pool
```diff
+from fastapi.concurrency import run_in_threadpool
+
 async def upload_dataset_endpoint(
     ...
 ) -> APIResponse[DatasetUploadResponse]:
     """Upload an evaluation dataset."""
     # Validate and read CSV file
     csv_content = await validate_csv_file(file)

     # Upload dataset using service
-    dataset = upload_dataset(
+    dataset = await run_in_threadpool(
+        upload_dataset,
         session=_session,
         csv_content=csv_content,
         dataset_name=dataset_name,
         description=description,
         duplication_factor=duplication_factor,
         organization_id=auth_context.organization.id,
         project_id=auth_context.project.id,
     )

     return APIResponse.success_response(data=_dataset_to_response(dataset))
```

Option 2: Make the service async (requires refactoring service layer)
This would require making `upload_dataset` and its dependencies async, which is a larger refactor but provides better async integration.

Note: The other endpoints (`evaluate`, `get_evaluation_run_status`) are correctly defined as sync (`def`), so FastAPI automatically runs them in a thread pool.
🧹 Nitpick comments (5)
backend/app/services/evaluation/dataset.py (1)
131-138: Consider sanitizing exception details in client-facing error message.

Exposing raw exception messages (`{e}`) could leak internal implementation details (stack traces, connection strings, etc.) to API clients.

🔎 Apply this diff to provide a generic error message:

```diff
     except Exception as e:
         logger.error(
             f"[upload_dataset] Failed to upload dataset to Langfuse | {e}",
             exc_info=True,
         )
         raise HTTPException(
-            status_code=500, detail=f"Failed to upload dataset to Langfuse: {e}"
+            status_code=500, detail="Failed to upload dataset to Langfuse. Please check your Langfuse credentials and try again."
         )
```

backend/app/services/evaluation/evaluation.py (3)
23-28: Consider more specific type hints for `config` parameter.

Using `dict` without type parameters reduces type safety. As per coding guidelines requiring type hints, consider using `dict[str, Any]` for clarity.

🔎 Apply this diff:

```diff
+from typing import Any
+
 def build_evaluation_config(
     session: Session,
-    config: dict,
+    config: dict[str, Any],
     assistant_id: str | None,
     project_id: int,
-) -> dict:
+) -> dict[str, Any]:
```
101-109: Consider more specific type hints for `config` parameter here as well.

Same suggestion as `build_evaluation_config`, for consistency.
313-319: Consider sanitizing exception details in error message.

Similar to the dataset service, exposing raw exception messages could leak internal details.

🔎 Apply this diff:

```diff
     except Exception as e:
         logger.error(
             f"[get_evaluation_with_scores] Failed to fetch trace info | "
             f"evaluation_id={evaluation_id} | error={e}",
             exc_info=True,
         )
-        return eval_run, f"Failed to fetch trace info from Langfuse: {str(e)}"
+        return eval_run, "Failed to fetch trace info from Langfuse. Please try again later."
```

backend/app/api/routes/evaluation.py (1)
67-67: Consider enriching docstrings with parameter and behavior details.

The added docstrings are present but minimal, mostly restating the function names. For route handlers that delegate to services, consider documenting:
- Key parameter constraints or validation rules
- Important response scenarios (e.g., 404 vs 422)
- Side effects or async behavior notes
Example for `upload_dataset_endpoint`:

```python
"""
Upload an evaluation dataset from a CSV file.

Validates CSV format and content, uploads to Langfuse and optional
object store, and persists metadata. Runs synchronously in a thread
pool to avoid blocking.
"""
```

This is optional since the Swagger descriptions are loaded from external markdown files.
Also applies to: 96-96, 124-124, 156-156, 200-200, 225-225, 269-269
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- backend/app/api/routes/evaluation.py (10 hunks)
- backend/app/crud/evaluations/__init__.py (4 hunks)
- backend/app/services/evaluation/__init__.py (1 hunks)
- backend/app/services/evaluation/dataset.py (1 hunks)
- backend/app/services/evaluation/evaluation.py (1 hunks)
- backend/app/services/evaluation/validators.py (1 hunks)
- backend/app/tests/api/routes/test_evaluation.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (4)
backend/app/services/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement business logic in services located in `backend/app/services/`
Files:
- backend/app/services/evaluation/validators.py
- backend/app/services/evaluation/dataset.py
- backend/app/services/evaluation/__init__.py
- backend/app/services/evaluation/evaluation.py
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py:
- Always add type hints to all function parameters and return values in Python code
- Prefix all log messages with the function name in square brackets: `logger.info(f"[function_name] Message {mask_string(sensitive_value)}")`
- Use Python 3.11+ with type hints throughout the codebase
Files:
- backend/app/services/evaluation/validators.py
- backend/app/crud/evaluations/__init__.py
- backend/app/services/evaluation/dataset.py
- backend/app/services/evaluation/__init__.py
- backend/app/services/evaluation/evaluation.py
- backend/app/tests/api/routes/test_evaluation.py
- backend/app/api/routes/evaluation.py
backend/app/tests/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use factory pattern for test fixtures in `backend/app/tests/`
Files:
backend/app/tests/api/routes/test_evaluation.py
backend/app/api/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
backend/app/api/**/*.py:
- Define FastAPI REST endpoints in `backend/app/api/` organized by domain
- Load Swagger endpoint descriptions from external markdown files instead of inline strings using `load_description("domain/action.md")`
Files:
backend/app/api/routes/evaluation.py
🧬 Code graph analysis (4)
backend/app/crud/evaluations/__init__.py (2)
- backend/app/crud/evaluations/core.py (1)
  - save_score (259-295)
- backend/app/crud/evaluations/langfuse.py (1)
  - fetch_trace_scores_from_langfuse (304-543)
backend/app/services/evaluation/__init__.py (3)
- backend/app/services/evaluation/dataset.py (1)
  - upload_dataset (24-163)
- backend/app/services/evaluation/evaluation.py (2)
  - build_evaluation_config (23-98)
  - get_evaluation_with_scores (229-329)
- backend/app/services/evaluation/validators.py (2)
  - parse_csv_items (118-174)
  - sanitize_dataset_name (23-68)
backend/app/services/evaluation/evaluation.py (7)
- backend/app/crud/assistants.py (1)
  - get_assistant_by_id (19-30)
- backend/app/crud/evaluations/core.py (3)
  - create_evaluation_run (14-61)
  - get_evaluation_run_by_id (103-141)
  - save_score (259-295)
- backend/app/crud/evaluations/langfuse.py (1)
  - fetch_trace_scores_from_langfuse (304-543)
- backend/app/crud/evaluations/dataset.py (1)
  - get_dataset_by_id (107-140)
- backend/app/crud/evaluations/batch.py (1)
  - start_evaluation_batch (118-212)
- backend/app/models/evaluation.py (1)
  - EvaluationRun (170-319)
- backend/app/utils.py (2)
  - get_langfuse_client (212-248)
  - get_openai_client (179-209)
backend/app/api/routes/evaluation.py (3)
- backend/app/services/evaluation/evaluation.py (1)
  - get_evaluation_with_scores (229-329)
- backend/app/services/evaluation/dataset.py (1)
  - upload_dataset (24-163)
- backend/app/services/evaluation/validators.py (1)
  - validate_csv_file (71-115)
🔇 Additional comments (29)
backend/app/tests/api/routes/test_evaluation.py (7)
56-64: LGTM! Mock paths correctly updated to reflect service layer refactoring.

The patch targets are properly updated from `app.api.routes.evaluation.*` to `app.services.evaluation.dataset.*`, aligning with the new service-oriented architecture.
142-150: Mock paths consistently updated for empty rows test.
188-196: Mock paths consistently updated for default duplication test.
229-237: Mock paths consistently updated for custom duplication test.
270-278: Mock paths consistently updated for description test.
362-370: Mock paths consistently updated for boundary minimum test.
407-410: Mock paths consistently updated for Langfuse configuration failure test.

backend/app/crud/evaluations/__init__.py (3)
4-9: LGTM! Properly exports `save_score` from the core module.

The new export aligns with the service layer's need to persist evaluation scores.

26-31: LGTM! Properly exports `fetch_trace_scores_from_langfuse` from the langfuse module.

This enables the evaluation service to retrieve trace scores via the CRUD layer's public API.

39-70: LGTM! `__all__` correctly updated with new exports.

Both `save_score` and `fetch_trace_scores_from_langfuse` are properly listed in the appropriate sections.

backend/app/services/evaluation/validators.py (4)
13-20: LGTM! Well-defined security constants.

Clear file upload limits with appropriate MIME type allowances, including `text/plain` for compatibility.
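The constants themselves are collapsed in this view; as a rough sketch of the kind of limits described (the values and the exact content-type list are assumptions, not the reviewed code):

```python
# Hypothetical upload limits mirroring the validator constants described above.
MAX_FILE_SIZE: int = 10 * 1024 * 1024  # assumed 10 MiB cap on uploaded CSVs
ALLOWED_EXTENSIONS: set[str] = {".csv"}
# text/plain is allowed because many clients label CSV uploads that way.
ALLOWED_CONTENT_TYPES: set[str] = {"text/csv", "application/csv", "text/plain"}
```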
23-68: LGTM! Well-documented sanitization with comprehensive rules.

The function handles edge cases properly, includes clear examples in the docstring, and follows coding guidelines with type hints.

100-115: LGTM! Thorough file size validation.

Properly seeks to the end for the size check and resets the position before reading.
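A minimal sketch of the seek-to-end size check described above, assuming a FastAPI `UploadFile` and the hypothetical `MAX_FILE_SIZE` from the previous sketch:

```python
from fastapi import HTTPException, UploadFile

MAX_FILE_SIZE = 10 * 1024 * 1024  # assumed limit


def check_upload_size(file: UploadFile) -> None:
    """Measure the upload by seeking to the end, then rewind for the caller."""
    file.file.seek(0, 2)   # jump to the end of the spooled temporary file
    size = file.file.tell()
    file.file.seek(0)      # reset position before the content is read
    if size > MAX_FILE_SIZE:
        raise HTTPException(status_code=422, detail="File exceeds maximum allowed size")
```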
118-174: LGTM! Robust CSV parsing with case-insensitive headers.

Good approach to normalize headers while preserving original column names for row access. Properly filters empty rows and re-raises `HTTPException` without wrapping.
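For illustration only, a sketch of the header-normalization idea (the real `parse_csv_items` is not shown here; the required column name is an assumption):

```python
import csv
import io

from fastapi import HTTPException


def parse_rows_case_insensitive(csv_content: str, required: str = "question") -> list[dict[str, str]]:
    """Map lowercased header names to the original columns and skip empty rows."""
    reader = csv.DictReader(io.StringIO(csv_content))
    headers = {name.lower().strip(): name for name in (reader.fieldnames or [])}
    if required not in headers:
        raise HTTPException(status_code=422, detail=f"CSV must contain a '{required}' column")
    rows: list[dict[str, str]] = []
    for row in reader:
        # Drop rows where every cell is empty or whitespace
        if not any((value or "").strip() for value in row.values()):
            continue
        rows.append({required: row[headers[required]]})
    return rows
```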
backend/app/services/evaluation/dataset.py (1)

24-32: LGTM! Well-typed function signature.

All parameters have proper type hints, including the union type for the optional `description`. As per coding guidelines, type hints are correctly applied.
58-74: LGTM! Proper sanitization with informative logging.

Logging follows the `[function_name]` prefix convention as per coding guidelines.
86-108: LGTM! Graceful degradation for object store failures.

Non-fatal handling allows the workflow to continue when object storage is unavailable, which is appropriate for optional storage backends.

140-163: LGTM! Clean metadata construction and database persistence.

The metadata structure is well-organized, and the final logging confirms successful creation.
backend/app/services/evaluation/__init__.py (1)
1-32: LGTM! Clean public API surface definition.

Well-organized exports with clear categorization. The `__all__` list provides a clear contract for the evaluation services package.

backend/app/services/evaluation/evaluation.py (8)
47-86: LGTM! Robust assistant config merging logic.

Properly handles merge precedence, where provided config values override assistant values, and conditionally adds tools when vector stores are available.
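A hedged sketch of the merge-precedence pattern the comment describes (function, field, and tool names here are assumptions, not the reviewed implementation):

```python
from typing import Any


def merge_evaluation_config(
    provided: dict[str, Any],
    assistant_config: dict[str, Any],
    vector_store_ids: list[str] | None = None,
) -> dict[str, Any]:
    """Assistant values act as defaults; explicitly provided values win."""
    merged = {**assistant_config, **{k: v for k, v in provided.items() if v is not None}}
    if vector_store_ids:
        # Only attach a retrieval tool when vector stores are actually available.
        merged["tools"] = [{"type": "file_search", "vector_store_ids": vector_store_ids}]
    return merged
```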
88-98: LGTM! Proper validation for direct config usage.

Clear error message when the model is missing and no assistant_id is provided.

142-170: LGTM! Thorough dataset validation with clear error messages.

Properly validates both dataset existence and Langfuse ID presence before proceeding.

179-200: LGTM! Clean client initialization and evaluation run creation.

Follows the documented workflow steps clearly.
229-236: LGTM! Well-typed return signature.

The `tuple[EvaluationRun | None, str | None]` return type clearly communicates the possible outcomes.
276-286: LGTM! Proper caching logic for trace scores.

Correctly checks both completion status and existing cached scores before fetching from Langfuse.
321-329: LGTM! Clean score persistence flow.

Uses the separate-session `save_score` function correctly to persist scores after the primary fetch operation.
219-226: The error handling behavior is correct and intentional.

`start_evaluation_batch` properly sets `status = "failed"` with an `error_message` before re-raising the exception. The exception is then caught in `start_evaluation`, and `session.refresh(eval_run)` retrieves the updated failure state from the database. The approach is sound: error details are persisted before exception propagation, ensuring the evaluation run maintains failure context when returned to the caller.
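Sketching the persist-before-raise pattern described above (simplified; the `session` and `eval_run` types are stand-ins, not the project's actual models):

```python
from typing import Any, Callable


def run_batch_recording_failure(session: Any, eval_run: Any, start_batch: Callable[[], None]) -> None:
    """Persist failure details before propagating, so callers can refresh() and inspect them."""
    try:
        start_batch()
    except Exception as exc:
        eval_run.status = "failed"
        eval_run.error_message = str(exc)
        session.add(eval_run)
        session.commit()
        raise  # caller catches this, then session.refresh(eval_run) sees the failed state
```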
backend/app/api/routes/evaluation.py (2)

1-23: LGTM: Clean service-oriented refactoring.

The refactoring to delegate business logic to service modules improves separation of concerns and maintainability. The route handlers are now thin wrappers focused on HTTP concerns.
294-295: Graceful degradation pattern is intentional and well-designed.

Lines 294-295 correctly use `APIResponse.failure_response()`, which is explicitly designed to support returning data alongside errors. The `failure_response()` method signature includes an optional `data: Optional[T] = None` parameter, and the response is constructed with `success=False`, `error=error_message`, and `data=data`. This pattern is appropriate for cases where the evaluation run exists but additional enrichment (like trace info) fails, allowing clients to receive partial data along with error context.
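For context, a generic response wrapper with the kind of `failure_response()` signature the comment describes (a sketch assuming a Pydantic generic model, not the project's actual `APIResponse`):

```python
from typing import Generic, Optional, TypeVar

from pydantic import BaseModel

T = TypeVar("T")


class APIResponse(BaseModel, Generic[T]):
    success: bool
    data: Optional[T] = None
    error: Optional[str] = None

    @classmethod
    def failure_response(cls, error: str, data: Optional[T] = None) -> "APIResponse[T]":
        # Partial data can ride along with the error, e.g. a run whose trace enrichment failed.
        return cls(success=False, error=error, data=data)
```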
```python
# Security validation: Check file extension
file_ext = Path(file.filename).suffix.lower()
if file_ext not in ALLOWED_EXTENSIONS:
    raise HTTPException(
        status_code=422,
        detail=f"Invalid file type. Only CSV files are allowed. Got: {file_ext}",
    )
```
Handle potential None filename.
file.filename can be None for certain upload scenarios. Path(None) will raise TypeError.
🔎 Apply this diff to handle None filename:
```diff
     # Security validation: Check file extension
+    if not file.filename:
+        raise HTTPException(
+            status_code=422,
+            detail="File must have a filename",
+        )
     file_ext = Path(file.filename).suffix.lower()
```

🤖 Prompt for AI Agents
In backend/app/services/evaluation/validators.py around lines 84 to 90,
file.filename can be None which makes Path(file.filename) raise TypeError; guard
against a missing filename first and return the same 422 HTTPException with a
clear message if it's None (e.g., "Missing filename or invalid upload"),
otherwise safely compute the suffix (lowercased) from the filename and continue
the existing allowed-extension check and error path.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/api/routes/evaluation.py (1)
51-81: Add logging to endpoint and address blocking I/O pattern.

Two issues found:

1. Missing logging: This endpoint lacks logging despite other endpoints in the file logging operations with a `[function_name]` prefix. Per coding guidelines, add a log statement prefixed with `[upload_dataset_endpoint]`.
2. Blocking I/O in async context: `upload_dataset` is a synchronous function that performs blocking I/O operations (database writes, object store uploads, Langfuse API calls). Calling it directly in an async endpoint blocks the event loop. Consider wrapping it with `asyncio.to_thread()` or refactoring the service function to be async.

Suggested fix for logging:

```diff
 async def upload_dataset_endpoint(
     ...
 ) -> APIResponse[DatasetUploadResponse]:
     """Upload an evaluation dataset."""
+    logger.info(
+        f"[upload_dataset_endpoint] Uploading dataset | "
+        f"name={dataset_name} | org_id={auth_context.organization.id} | "
+        f"project_id={auth_context.project.id}"
+    )
     # Validate and read CSV file
     csv_content = await validate_csv_file(file)

     # Upload dataset using service
     dataset = upload_dataset(
```
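To illustrate the `asyncio.to_thread()` option mentioned above, a self-contained sketch (the stand-in service function is hypothetical; the real `upload_dataset` takes the keyword arguments shown in the earlier diff):

```python
import asyncio
from typing import Any


def upload_dataset_sync(**kwargs: Any) -> dict[str, Any]:
    """Stand-in for the blocking service call (the real one hits the DB, Langfuse, and object storage)."""
    return {"received": sorted(kwargs)}


async def upload_dataset_endpoint_sketch(csv_content: str, dataset_name: str) -> dict[str, Any]:
    # asyncio.to_thread forwards positional and keyword arguments and runs the
    # synchronous function in a worker thread, keeping the event loop free.
    return await asyncio.to_thread(
        upload_dataset_sync,
        csv_content=csv_content,
        dataset_name=dataset_name,
    )
```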
🧹 Nitpick comments (4)
backend/app/api/routes/evaluation.py (4)
33-43: Add type hint for `dataset` parameter.

The `dataset` parameter is missing a type hint. As per coding guidelines, all function parameters should have type hints.

🔎 Suggested fix:

```diff
-def _dataset_to_response(dataset) -> DatasetUploadResponse:
+def _dataset_to_response(dataset: "Dataset") -> DatasetUploadResponse:
```

You'll need to import the `Dataset` model or use the appropriate type from your models. If you want to avoid a circular import, you can use a string annotation or a `TYPE_CHECKING` block.
89-110: Add logging for list operation.

This endpoint lacks logging. Per coding guidelines, consider adding a log statement prefixed with `[list_datasets_endpoint]` to track dataset listing operations.

🔎 Suggested addition:

```diff
 def list_datasets_endpoint(
     ...
 ) -> APIResponse[list[DatasetUploadResponse]]:
     """List evaluation datasets."""
+    logger.info(
+        f"[list_datasets_endpoint] Listing datasets | "
+        f"org_id={auth_context.organization.id} | "
+        f"project_id={auth_context.project.id} | limit={limit} | offset={offset}"
+    )
     # Enforce maximum limit
     if limit > 100:
         limit = 100
```
186-210: Add logging for evaluation start.

This endpoint lacks logging for the evaluation start operation. Starting an evaluation is a significant action that should be logged for observability and debugging purposes.

🔎 Suggested addition:

```diff
 def evaluate(
     ...
 ) -> APIResponse[EvaluationRunPublic]:
     """Start an evaluation run."""
+    logger.info(
+        f"[evaluate] Starting evaluation | dataset_id={dataset_id} | "
+        f"experiment_name={experiment_name} | assistant_id={assistant_id} | "
+        f"org_id={auth_context.organization.id} | project_id={auth_context.project.id}"
+    )
     eval_run = start_evaluation(
```
247-296: Add logging for evaluation status fetch.

This endpoint lacks logging. Consider adding a log statement prefixed with `[get_evaluation_run_status]` for consistency with other endpoints like `get_dataset`.

🔎 Suggested addition:

```diff
 def get_evaluation_run_status(
     ...
 ) -> APIResponse[EvaluationRunPublic]:
     """Get evaluation run status with optional trace info."""
+    logger.info(
+        f"[get_evaluation_run_status] Fetching evaluation | id={evaluation_id} | "
+        f"get_trace_info={get_trace_info} | resync_score={resync_score} | "
+        f"org_id={auth_context.organization.id}"
+    )
     if resync_score and not get_trace_info:
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- backend/app/api/routes/evaluation.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
backend/app/api/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
backend/app/api/**/*.py:
- Define FastAPI REST endpoints in `backend/app/api/` organized by domain
- Load Swagger endpoint descriptions from external markdown files instead of inline strings using `load_description("domain/action.md")`
Files:
backend/app/api/routes/evaluation.py
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py:
- Always add type hints to all function parameters and return values in Python code
- Prefix all log messages with the function name in square brackets: `logger.info(f"[function_name] Message {mask_string(sensitive_value)}")`
- Use Python 3.11+ with type hints throughout the codebase
Files:
backend/app/api/routes/evaluation.py
🔇 Additional comments (2)
backend/app/api/routes/evaluation.py (2)
113-142: LGTM!

Proper logging with the `[get_dataset]` prefix, complete type hints, and appropriate error handling for the 404 case.

145-178: LGTM!

Proper logging with the `[delete_dataset]` prefix at entry and exit. Type hints are complete.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/crud/evaluations/__init__.py (1)
1-37: Add an explicit `__all__` list to declare the public API.

Per PEP 8, modules should explicitly declare the names in their public API using the `__all__` attribute. This module currently imports 21+ functions from six submodules without defining `__all__`, which means all imported names become part of the public API by default. Using `__all__` allows you to explicitly specify the public interface of your packages and modules and prevents accidental usage of objects that shouldn't be used from outside the module. Define an `__all__` list after the module docstring containing the functions intended as part of the public API.
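A sketch of what the explicit declaration could look like, using function names that appear elsewhere in this review (the complete list is an assumption):

```python
# backend/app/crud/evaluations/__init__.py (illustrative excerpt, not the full list)
__all__ = [
    # core
    "create_evaluation_run",
    "get_evaluation_run_by_id",
    "save_score",
    # dataset
    "get_dataset_by_id",
    # batch
    "start_evaluation_batch",
    # langfuse
    "fetch_trace_scores_from_langfuse",
]
```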
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- backend/app/crud/evaluations/__init__.py (2 hunks)
- backend/app/services/evaluation/__init__.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/services/evaluation/__init__.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py:
- Always add type hints to all function parameters and return values in Python code
- Prefix all log messages with the function name in square brackets: `logger.info(f"[function_name] Message {mask_string(sensitive_value)}")`
- Use Python 3.11+ with type hints throughout the codebase
Files:
backend/app/crud/evaluations/__init__.py
🧬 Code graph analysis (1)
backend/app/crud/evaluations/__init__.py (2)
- backend/app/crud/evaluations/core.py (1)
  - save_score (259-295)
- backend/app/crud/evaluations/langfuse.py (1)
  - fetch_trace_scores_from_langfuse (304-543)
🔇 Additional comments (2)
backend/app/crud/evaluations/__init__.py (2)
8-8: LGTM!

The import of `save_score` is correctly added and maintains alphabetical ordering with other imports from the core module.

28-28: LGTM!

The import of `fetch_trace_scores_from_langfuse` is correctly added and maintains alphabetical ordering with other imports from the langfuse module.
Codecov Report

❌ Patch coverage is

📢 Thoughts on this report? Let us know!
Summary
Target issue is #492
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
- `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.

Notes
Summary by CodeRabbit
Refactor
New Features
Documentation
Tests