
Conversation

Contributor

@nina-kollman nina-kollman commented Nov 23, 2025

  • I have added tests that cover my changes.
  • If adding a new instrumentation or changing an existing one, I've added screenshots from some observability platform showing the change.
  • PR name follows conventional commits format: feat(instrumentation): ... or fix(instrumentation): ....
  • (If applicable) I have updated the documentation accordingly.

Important

Add GitHub Actions support for running experiments with local task execution and result submission in the Experiment class.

  • New Features:
    • Add run_in_github() method in Experiment class to execute tasks locally and submit results for GitHub CI/CD.
    • Introduce RunInGithubRequest, RunInGithubResponse, TaskResult, and GithubContext models in model.py to support GitHub-specific experiment execution.
    • Add run_research_experiment.py script to demonstrate running experiments in GitHub Actions with local task execution and result submission.
  • Behavior:
    • run() method in experiment.py now checks for GitHub Actions environment and delegates to _run_in_github() if detected.
    • Local task execution with concurrency control using asyncio in _execute_tasks() in experiment.py.
    • GitHub context extraction and validation in _run_in_github() to ensure execution in pull request workflows.
  • Misc:
    • Update type annotations in run() and _run_locally() to reflect async task handling.

This description was created by Ellipsis for 7ad43b6.


Summary by CodeRabbit

  • New Features
    • Run experiments from GitHub Actions with PR/commit/actor context and a GitHub-specific bulk run flow.
    • Local concurrent execution of experiment tasks with aggregated per-task results, optional experiment/run metadata, and stop/wait controls.
    • New API support for submitting bulk task results and receiving run identifiers.
    • Added a sample experiment script that generates responses for dataset rows and submits aggregated results for evaluation.



coderabbitai bot commented Nov 23, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a sample-app script to run GPT‑4 research tasks and extends the Experiment SDK with GitHub-backed bulk run flows, new Pydantic models for aggregated task results and GitHub context, and a concurrency helper to execute tasks locally before submitting results to the backend.

Changes

  • Sample app: experiment script — packages/sample-app/sample_app/experiment/run_research_experiment.py
    New async script: module-level client initialized from env; generate_research_response(question: str) creating an AsyncOpenAI client and calling GPT-4; research_task(row) that calls the generator and returns completion/result; and main() that runs an experiment via the SDK and prints run details. The entrypoint runs asyncio.run(main()).
  • SDK: Experiment implementation — packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
    Experiment.run() signature extended (adds experiment_metadata, related_ref, aux, stop_on_error, wait_for_results); task typing changed to accept/return awaitable task results; added a GitHub execution path (_run_in_github, run_in_github) and a local path (_run_locally); added concurrency helper _execute_tasks (uses a semaphore, returns TaskResult objects); evaluator orchestration updated for sync/async modes and error propagation.
  • SDK: Experiment models — packages/traceloop-sdk/traceloop/sdk/experiment/model.py
    Added Pydantic models TaskResult, GithubContext, RunInGithubRequest, and RunInGithubResponse for per-task result aggregation, GitHub execution context, and bulk run request/response payloads.

Sequence Diagram(s)

sequenceDiagram
    participant App as Sample App\n(run_research_experiment.py)
    participant SDK as Experiment SDK\n(experiment.py)
    participant Worker as Local Executor
    participant OpenAI as OpenAI API
    participant Backend as Traceloop Backend

    App->>SDK: init Traceloop client
    App->>App: main() coroutine
    App->>SDK: client.experiment.run(... or run_in_github ...)
    activate SDK

    rect rgb(248,250,255)
    note over SDK: choose flow based on env (GITHUB_ACTIONS / GITHUB_REF)
    end

    SDK->>SDK: parse dataset -> rows
    SDK->>Worker: _execute_tasks(rows, task)
    activate Worker

    loop per row (concurrent)
        Worker->>App: call research_task(row)
        App->>OpenAI: generate_research_response(question) (async)
        OpenAI-->>App: assistant response
        App-->>Worker: return TaskResult (input, output) or error
    end
    deactivate Worker

    rect rgb(235,255,235)
    note over SDK: build RunInGithubRequest with GithubContext (if GitHub) or prepare local results
    end

    SDK->>Backend: submit run (task_results + context/metadata)
    Backend-->>SDK: RunInGithubResponse (experiment_id, experiment_slug, run_id)
    deactivate SDK

    SDK-->>App: return run details
    App->>App: print experiment slug & run ID

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Review concurrency/semaphore logic and TaskResult aggregation in _execute_tasks.
  • Verify GitHub context extraction (PR URL, commit, actor) and validation for GitHub path.
  • Confirm Pydantic models and RunInGithubRequest/Response fields align with backend API contracts.
  • Check async typing changes for tasks and evaluator result handling (wait_for_results vs async triggers).

Possibly related PRs

Poem

🐰
I hopped through lines with lively cheer,
Sent questions out for answers clear,
Collected rows and bundled tight,
Pushed them up into the night,
A little hop — experiment near! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 69.23%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: the title 'fix(exp): Add run in github experiment' accurately describes the main change: introducing GitHub-aware experiment execution. It directly aligns with the core objective of adding a run_in_github() method and GitHub Actions integration to the Experiment class.

@traceloop traceloop deleted a comment from traceloop bot Nov 23, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (4)
packages/sample-app/sample_app/experiment/run_research_experiment.py (2)

55-78: Running this script locally will raise due to run_in_github guard

Because main() always calls client.experiment.run_in_github, running this module directly outside GitHub Actions will raise RuntimeError. That’s acceptable for a CI-only sample, but if you expect local runs, consider:

  • Catching the RuntimeError in main() and printing a friendly message, or
  • Adding a short note in the module docstring that it must be run in GitHub Actions.
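The first option above could look roughly like this. A minimal sketch: `run_experiment` is a hypothetical stand-in for `client.experiment.run_in_github(...)`, which the guard described above causes to raise outside GitHub Actions:

```python
import asyncio


async def run_experiment():
    # Hypothetical stand-in for client.experiment.run_in_github(...),
    # which raises when not running inside GitHub Actions
    raise RuntimeError("run_in_github() must be called from a GitHub Actions workflow")


async def main() -> str:
    try:
        await run_experiment()
        return "submitted"
    except RuntimeError as e:
        # Friendly message for local runs instead of an unhandled traceback
        print(f"Skipping GitHub submission: {e}")
        return "skipped"


outcome = asyncio.run(main())
```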

23-40: Reuse module-level AsyncOpenAI client and add error handling for API calls

Creating a new AsyncOpenAI client on every call adds connection overhead. Create a single module-level client instance and reuse it across function calls. Additionally, wrap chat.completions.create calls in error handling to catch transient failures.

Recommended improvements:

  • Use async with AsyncOpenAI() as client: to ensure automatic client cleanup, or create a module-level instance with explicit shutdown.
  • Configure an explicit timeout suited to LLM calls (e.g. timeout=60 for typical requests), since short client timeouts will abort long generations.
  • Catch specific exceptions: APIConnectionError, APITimeoutError, RateLimitError, APIStatusError to handle network issues, rate limits, and API errors appropriately.
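A minimal sketch of the suggested pattern — one module-level client plus targeted error handling. `FakeAsyncClient` and `APIConnectionError` are self-contained stand-ins for `openai.AsyncOpenAI` and the exception classes listed above, so no network access or API key is needed:

```python
import asyncio


class APIConnectionError(Exception):
    """Stand-in for openai.APIConnectionError."""


class FakeAsyncClient:
    """Stand-in for openai.AsyncOpenAI; created once and reused across calls."""

    def __init__(self, timeout: float = 60.0):
        # A generous timeout: LLM calls can take tens of seconds
        self.timeout = timeout

    async def complete(self, prompt: str) -> str:
        if not prompt:
            raise APIConnectionError("simulated network failure")
        return f"answer to: {prompt}"


# One shared client instead of a new one per call (reuses the connection pool)
client = FakeAsyncClient(timeout=60.0)


async def generate_research_response(question: str) -> str:
    try:
        return await client.complete(question)
    except APIConnectionError as e:
        # Catch transient failures rather than letting them abort the whole run
        return f"error: {e}"


good = asyncio.run(generate_research_response("why is the sky blue?"))
bad = asyncio.run(generate_research_response(""))
```

With the real library, the same shape applies: construct `AsyncOpenAI(timeout=60)` once at module level and catch the specific `openai` exception types around `chat.completions.create`.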
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)

37-94: experiment_metadata plumbing in run() and _init_experiment() looks correct

The new experiment_metadata parameter is threaded from run() into _init_experiment() and then into InitExperimentRequest, matching the existing pattern used for experiment_run_metadata. Only minor nits:

  • Docstring line “(an experiment holds all the experinent runs)” has a small typo (“experinent”).

Otherwise this change is consistent and safe.


277-299: HTTP error handling is reasonable; consider a library-specific exception type

Posting to "/experiments/run-in-github" and raising a generic Exception when response is None is functionally correct given the current HTTPClient.post contract. For clearer error handling and easier catching by SDK consumers, you might eventually:

  • Introduce an SDK-specific exception type (e.g., ExperimentRequestError) and raise that instead of bare Exception.
  • Optionally include the HTTP status/error details in the exception where available.

This is a polish item and not blocking.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 589e104 and c0138f5.

📒 Files selected for processing (3)
  • packages/sample-app/sample_app/experiment/run_research_experiment.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (7 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/experiment/model.py
  • packages/sample-app/sample_app/experiment/run_research_experiment.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (3)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (4)
  • RunInGithubRequest (88-98)
  • RunInGithubResponse (101-106)
  • TaskResult (71-76)
  • GithubContext (79-85)
packages/traceloop-sdk/traceloop/sdk/datasets/datasets.py (1)
  • get_version_jsonl (165-170)
packages/traceloop-sdk/traceloop/sdk/client/http.py (1)
  • post (23-37)
🪛 Ruff (0.14.5)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

223-226: Avoid specifying long messages outside the exception class

(TRY003)


247-250: Avoid specifying long messages outside the exception class

(TRY003)


294-296: Create your own exception

(TRY002)


294-296: Avoid specifying long messages outside the exception class

(TRY003)


387-387: Do not catch blind exception: Exception

(BLE001)


406-406: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.10)
🔇 Additional comments (1)
packages/sample-app/sample_app/experiment/run_research_experiment.py (1)

15-20: Good use of env vars for Traceloop client configuration

TRACELOOP_API_KEY and TRACELOOP_BASE_URL are correctly sourced from environment variables, so no secrets are hardcoded and this aligns with the Python guidelines.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 4048c6c in 1 minute and 59 seconds.
  • Reviewed 36 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 2 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:241
  • Draft comment:
    The function now checks that GITHUB_EVENT_NAME equals 'pull_request'. Ensure this restriction is intentional; consider documenting or supporting additional event types (e.g., 'pull_request_target') if needed.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable (usefulness confidence = 10% vs. threshold = 50%). The comment violates multiple rules: (1) it asks the author to "ensure" something is intentional, which the rules explicitly flag as not useful ("Do NOT ask the PR author to confirm their intention"); (2) it is speculative, suggesting support for additional event types without clear evidence that this is needed; (3) it does not point to a definite bug, reading instead as a "have you thought about X?" question. The function is named run_in_github and its error message states it is for pull_request workflows, so the restriction appears deliberate. Supporting pull_request_target could be valid given a clear use case, but without evidence of an actual problem this remains speculation. The comment should be deleted because it asks the author to confirm intent and makes speculative suggestions without demonstrating a problem.
2. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:400
  • Draft comment:
    The try/except block for individual task execution was removed. Without this, any unexpected exception from a task may abort the entire _execute_tasks execution. Confirm if this change is intentional.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_uJCerDA2pP8b4vVP

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (4)
packages/sample-app/sample_app/experiment/run_research_experiment.py (1)

43-52: Previous typo has been corrected.

The misspelled key "sentenece" reported in earlier reviews has been fixed to "sentence" on line 51.

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (3)

244-250: Previous validation concerns have been addressed.

The explicit check for github_event_name == "pull_request" (line 245) ensures this method only runs in the correct GitHub workflow context, addressing the previous review's concern about insufficient event verification.


252-264: Previous type safety concerns have been addressed.

The validation sequence now ensures pr_number is extracted and verified (lines 255-258) before constructing pr_url (line 264), eliminating the type mismatch risk flagged in earlier reviews.


411-413: Previous bug in exception handling has been fixed.

The broken completed_task.task_input access and redundant outer try/except block reported in earlier reviews have been removed. The code now correctly collects TaskResult objects returned by run_single_row, which already handles exceptions internally.

🧹 Nitpick comments (2)
packages/sample-app/sample_app/experiment/run_research_experiment.py (1)

23-40: Consider reusing the AsyncOpenAI client instance.

Creating a new AsyncOpenAI client on every call to generate_research_response adds unnecessary overhead. The OpenAI client is designed to be reused and maintains its own connection pool.

Refactor to create the client once at module level:

 from traceloop.sdk import Traceloop
+from openai import AsyncOpenAI
 
 # Initialize Traceloop client
 client = Traceloop.init(
     app_name="research-experiment-ci-cd",
     api_key=os.getenv("TRACELOOP_API_KEY"),
     api_endpoint=os.getenv("TRACELOOP_BASE_URL"),
 )
 
+# Initialize OpenAI client once
+openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+
 
 async def generate_research_response(question: str) -> str:
     """Generate a research response using OpenAI"""
-    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
-
     response = await openai_client.chat.completions.create(
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

302-306: Consider adding more specific error details.

When the API request fails, self._http_client.post returns None and prints the actual error to stderr. The exception message here doesn't preserve the underlying cause.

Consider capturing and including the underlying error details if the HTTP client can be enhanced to return/raise them, or at minimum advise users to check stderr for the actual failure reason:

         if response is None:
             raise Exception(
-                f"Failed to submit experiment '{experiment_slug}' for GitHub execution. "
+                f"Failed to submit experiment '{experiment_slug}' for GitHub execution. "
+                "Check stderr output above for the detailed error from the API request."
             )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between c0138f5 and 4048c6c.

📒 Files selected for processing (2)
  • packages/sample-app/sample_app/experiment/run_research_experiment.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/sample-app/sample_app/experiment/run_research_experiment.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧬 Code graph analysis (2)
packages/sample-app/sample_app/experiment/run_research_experiment.py (2)
packages/traceloop-sdk/traceloop/sdk/__init__.py (3)
  • Traceloop (37-275)
  • init (49-206)
  • get (259-275)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)
  • run_in_github (188-307)
  • run (37-186)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)
packages/traceloop-sdk/traceloop/sdk/datasets/datasets.py (1)
  • get_version_jsonl (165-170)
packages/traceloop-sdk/traceloop/sdk/client/http.py (1)
  • post (23-37)
🪛 Ruff (0.14.5)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

223-226: Avoid specifying long messages outside the exception class

(TRY003)


246-250: Avoid specifying long messages outside the exception class

(TRY003)


259-262: Avoid specifying long messages outside the exception class

(TRY003)


303-305: Create your own exception

(TRY002)


303-305: Avoid specifying long messages outside the exception class

(TRY003)


396-396: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (4)
packages/sample-app/sample_app/experiment/run_research_experiment.py (2)

16-20: LGTM: Proper secret handling.

The client initialization correctly retrieves API credentials from environment variables, adhering to the security guideline against hardcoding secrets.


55-74: LGTM: Clean CI/CD orchestration.

The main function correctly orchestrates the GitHub experiment flow. Letting exceptions propagate is appropriate for CI/CD scripts, as it will fail the workflow with a clear error message.

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)

44-44: LGTM: Clean API extension for experiment metadata.

The experiment_metadata parameter is properly added to the run() signature, documented, and correctly passed through to _init_experiment.

Also applies to: 57-57, 92-92


389-400: LGTM: Appropriate exception handling pattern.

The broad Exception catch on line 396 is the correct pattern here. It ensures that any failure in user-provided task functions is captured and converted to a TaskResult with the error field set, allowing the experiment to continue processing remaining rows.

@nirga nirga changed the title feat(exp): Add run in github experiment fix(exp): Add run in github experiment Nov 24, 2025

return results, errors

async def run_in_github(
Member


this should be a private method and then in the general run you decide whether to call run_in_github or run_local based on the env vars
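The dispatch this comment suggests could look roughly like the sketch below. The method names follow the PR; the bodies are stubs standing in for the real flows:

```python
import asyncio
import os


class Experiment:
    """Sketch of the suggested dispatch; method bodies are stubs."""

    async def run(self):
        # Single public entry point: pick the flow from env vars,
        # keeping the GitHub path private as the comment suggests
        if os.getenv("GITHUB_ACTIONS") == "true":
            return await self._run_in_github()
        return await self._run_locally()

    async def _run_in_github(self):
        return "github"  # stub for the GitHub bulk-run flow

    async def _run_locally(self):
        return "local"  # stub for the local execution flow


os.environ.pop("GITHUB_ACTIONS", None)
local_flow = asyncio.run(Experiment().run())
os.environ["GITHUB_ACTIONS"] = "true"
github_flow = asyncio.run(Experiment().run())
```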

Contributor

@ellipsis-dev ellipsis-dev bot left a comment


Important

Looks good to me! 👍

Reviewed 6764781 in 1 minute and 9 seconds.
  • Reviewed 85 lines of code in 2 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
1. packages/sample-app/sample_app/experiment/run_research_experiment.py:1
  • Draft comment:
    The file docstring still mentions 'run_in_github' even though the code now calls 'run()'. Please update the documentation to reflect the new usage.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
2. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:37
  • Draft comment:
    The 'run' method’s return type depends on the environment (tuple for local runs vs. RunInGithubResponse in GitHub Actions). Update the type annotation and docstring to clarify this behavior.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:37
  • Draft comment:
    Document the additional parameters ('aux', 'stop_on_error', 'wait_for_results') in the 'run' method's docstring. Also, note that these parameters are only applicable for local execution.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.
4. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:65
  • Draft comment:
    Clarify that when running in GitHub Actions (i.e. when GITHUB_ACTIONS is true), parameters such as 'experiment_metadata', 'aux', 'stop_on_error', and 'wait_for_results' are not forwarded and are thus unused.
  • Reason this comment was not posted:
    Comment did not seem useful. Confidence is useful = 0% <= threshold 50% This comment is purely informative, explaining the behavior of the code when running in GitHub Actions. It does not suggest any changes, ask for confirmation, or point out potential issues. It violates the rule against making purely informative comments.

Workflow ID: wflow_vHaVuqKywUYrVmUm



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

37-86: Return type annotation is incorrect - method returns different types.

The run() method is annotated to return Tuple[List[TaskResponse], List[str]] (line 49), but:

  • _run_in_github() returns RunInGithubResponse (line 248)
  • _run_locally() returns Tuple[List[TaskResponse], List[str]] (line 101)

This breaks type safety and will cause issues for callers expecting a consistent return type.

Consider using a Union type or introducing a wrapper response type:

+from typing import Union
+
 async def run(
     self,
     task: Callable[[Optional[Dict[str, Any]]], Dict[str, Any]],
     dataset_slug: Optional[str] = None,
     dataset_version: Optional[str] = None,
     evaluators: Optional[List[EvaluatorDetails]] = None,
     experiment_slug: Optional[str] = None,
     experiment_metadata: Optional[Dict[str, Any]] = None,
     related_ref: Optional[Dict[str, str]] = None,
     aux: Optional[Dict[str, str]] = None,
     stop_on_error: bool = False,
     wait_for_results: bool = True,
-) -> Tuple[List[TaskResponse], List[str]]:
+) -> Union[Tuple[List[TaskResponse], List[str]], RunInGithubResponse]:

Alternatively, consider a unified response model that both paths return.

🧹 Nitpick comments (2)
packages/sample-app/sample_app/experiment/run_research_experiment.py (1)

23-40: Consider handling potential None content from OpenAI response.

response.choices[0].message.content can be None if the model returns no content (e.g., when using function calls or tool use). Since this is a sample app demonstrating the workflow, this may be acceptable, but be aware it could cause downstream issues if None propagates to evaluators.
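A guard for that case could look like the sketch below. `SimpleNamespace` objects mimic the shape of the OpenAI chat-completions response so the snippet is self-contained:

```python
from types import SimpleNamespace


def extract_content(response) -> str:
    # message.content may be None (e.g. for tool/function calls);
    # fall back to an empty string so None never reaches evaluators
    content = response.choices[0].message.content
    return "" if content is None else content


with_text = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="hi"))]
)
no_text = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content=None))]
)
```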

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

424-438: Docstring mentions non-existent parameter stop_on_error.

The docstring at line 434 documents a stop_on_error parameter that doesn't exist in the method signature.

     async def _execute_tasks(
         self,
         rows: List[Dict[str, Any]],
         task: Callable[[Optional[Dict[str, Any]]], Dict[str, Any]],
     ) -> List[TaskResult]:
         """Execute tasks locally with concurrency control
 
         Args:
             rows: List of dataset rows to process
             task: Function to run on each row
-            stop_on_error: Whether to stop on first error
 
         Returns:
             List of TaskResult objects with inputs, outputs, and errors
         """
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 4048c6c and 6764781.

📒 Files selected for processing (2)
  • packages/sample-app/sample_app/experiment/run_research_experiment.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (7 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
  • packages/sample-app/sample_app/experiment/run_research_experiment.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (3)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (5)
  • RunInGithubRequest (88-98)
  • RunInGithubResponse (101-106)
  • TaskResult (71-76)
  • GithubContext (79-85)
  • TaskResponse (10-15)
packages/traceloop-sdk/traceloop/sdk/datasets/datasets.py (1)
  • get_version_jsonl (165-170)
packages/traceloop-sdk/traceloop/sdk/client/http.py (1)
  • post (23-37)
🪛 Ruff (0.14.5)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

275-278: Avoid specifying long messages outside the exception class

(TRY003)


298-302: Avoid specifying long messages outside the exception class

(TRY003)


311-314: Avoid specifying long messages outside the exception class

(TRY003)


355-357: Create your own exception

(TRY002)


355-357: Avoid specifying long messages outside the exception class

(TRY003)


448-448: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Lint
🔇 Additional comments (6)
packages/sample-app/sample_app/experiment/run_research_experiment.py (2)

16-20: LGTM - API keys properly sourced from environment variables.

The client initialization correctly retrieves sensitive values from environment variables, following security best practices. Based on coding guidelines, this is the correct approach for handling API keys.


43-52: LGTM - Typo has been fixed.

The previously flagged typo "sentenece" has been corrected to "sentence".

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (4)

4-20: LGTM - Imports are appropriate for the new GitHub integration.

The added imports for os and the new model types (RunInGithubRequest, RunInGithubResponse, TaskResult, GithubContext) align with the new functionality.


89-238: LGTM - Local execution logic is well-structured.

The _run_locally() method properly handles:

  • Experiment initialization with metadata
  • Concurrent task execution with semaphore (50 concurrent tasks)
  • Error aggregation with optional stop_on_error behavior
  • Both synchronous and async evaluator execution modes
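The semaphore-bounded concurrency pattern described above can be sketched as follows. `run_bounded` and its signature are illustrative stand-ins, not the SDK's actual `_execute_tasks`:

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


async def run_bounded(
    rows: List[Dict[str, Any]],
    task: Callable[[Dict[str, Any]], Awaitable[Any]],
    max_concurrency: int = 50,
) -> List[Any]:
    """Run `task` over all rows, never more than `max_concurrency` at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(row: Dict[str, Any]) -> Any:
        async with semaphore:
            return await task(row)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(r) for r in rows))
```

Results come back in dataset order, which keeps per-row outputs aligned with their inputs even when tasks finish out of order.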

291-323: GitHub context handling is now robust.

The implementation properly:

  • Validates GITHUB_EVENT_NAME == "pull_request" before proceeding (line 297)
  • Extracts PR number from GITHUB_REF only after validation
  • Ensures all required values exist before constructing GithubContext

This addresses the previous review concerns about pr_url being None and insufficient event verification.
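A minimal sketch of this validate-then-extract flow, assuming the standard GitHub Actions environment variables (`GITHUB_EVENT_NAME`, `GITHUB_REF`, `GITHUB_REPOSITORY`, `GITHUB_SHA`, `GITHUB_ACTOR`); the returned dict is a hypothetical stand-in for the SDK's `GithubContext` model:

```python
import os
import re


def extract_github_context() -> dict:
    """Validate the workflow event, then pull PR details from env vars."""
    if os.getenv("GITHUB_EVENT_NAME") != "pull_request":
        raise RuntimeError("Experiments in CI must run from a pull_request workflow")

    # In pull_request workflows, GITHUB_REF looks like "refs/pull/<number>/merge"
    match = re.match(r"refs/pull/(\d+)/", os.getenv("GITHUB_REF", ""))
    if not match:
        raise RuntimeError("Could not extract PR number from GITHUB_REF")

    repo = os.environ["GITHUB_REPOSITORY"]
    pr_number = int(match.group(1))
    return {
        "repository": repo,
        "pr_number": pr_number,
        "pr_url": f"https://github.com/{repo}/pull/{pr_number}",
        "commit_sha": os.environ["GITHUB_SHA"],
        "actor": os.environ["GITHUB_ACTOR"],
    }
```

Because the event check and the regex match both raise before the context is built, `pr_url` can never be None by the time it is constructed.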


461-467: Previous task_input bug has been fixed.

The exception handling has been simplified - errors are now captured within run_single_row() and returned as TaskResult objects with the error field set. This removes the need for the outer try/except that previously referenced the non-existent completed_task.task_input attribute.
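The capture-errors-as-results pattern can be sketched like this; `RowResult` is a hypothetical stand-in for the SDK's `TaskResult`:

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, Optional


@dataclass
class RowResult:  # stand-in for the SDK's TaskResult model
    input: Dict[str, Any]
    output: Optional[Any] = None
    error: Optional[str] = None


async def run_single_row(
    row: Dict[str, Any],
    task: Callable[[Dict[str, Any]], Awaitable[Any]],
) -> RowResult:
    """Always return a RowResult; failures are recorded, never raised."""
    try:
        output = await task(row)
        return RowResult(input=row, output=output)
    except Exception as exc:  # defensive boundary: one bad row must not kill the run
        return RowResult(input=row, error=str(exc))
```

Since the row is captured inside the coroutine itself, the caller never needs to recover the input from a completed task object, which is what made the old `completed_task.task_input` access unnecessary.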

Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed d36d712 in 1 minute and 15 seconds. Click for details.
  • Reviewed 136 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 6 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:39
  • Draft comment:
    Task parameter now expects an async function (Awaitable). Ensure documentation clearly states that 'task' must be async.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:170
  • Draft comment:
    Good explicit type annotation for 'eval_results' (Dict[str, Union[Dict[str, Any], str]]); this removes the need for previous type ignores.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:396
  • Draft comment:
    Allowing 'task_input' to be Optional is beneficial. Ensure that all callers handle None values appropriately.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
4. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:313
  • Draft comment:
    Robust GitHub environment validation is implemented; consider logging details when required env vars (e.g. GITHUB_REF, GITHUB_REPOSITORY) are missing to aid debugging.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
5. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:461
  • Draft comment:
    There's an inconsistency: 'run_single_row' accepts an Optional row while 'run_with_semaphore' types row as Dict. Consider aligning these for type consistency.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None
6. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:350
  • Draft comment:
    The use of 'related_ref' passed as 'experiment_run_metadata' may be confusing. Consider aligning the naming for clarity.
  • Reason this comment was not posted:
    Confidence changes required: 33% <= threshold 50% None

Workflow ID: wflow_ALse0DALmd5g2igu

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
packages/sample-app/sample_app/experiment/run_research_experiment.py (1)

24-53: Consider reusing the AsyncOpenAI client instead of creating it per call

generate_research_response instantiates AsyncOpenAI on every invocation, which is fine for a sample but adds overhead if the dataset is large. You could create a single module-level async client (similarly to the Traceloop client) and reuse it inside generate_research_response for better efficiency.

Example shape:

-from openai import AsyncOpenAI
+from openai import AsyncOpenAI
@@
-async def generate_research_response(question: str) -> str:
-    """Generate a research response using OpenAI"""
-    openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+openai_client = AsyncOpenAI(api_key=os.getenv("OPENAI_API_KEY"))
+
+
+async def generate_research_response(question: str) -> str:
+    """Generate a research response using OpenAI"""
@@
-    response = await openai_client.chat.completions.create(
+    response = await openai_client.chat.completions.create(

Please confirm against the current AsyncOpenAI client docs that reusing a single client instance matches recommended usage for your version.

packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

428-471: Fix _execute_tasks docstring and consider tightening exception handling

Two small issues here:

  1. The docstring mentions a stop_on_error argument, but _execute_tasks only takes rows and task. That’s misleading and looks like a copy/paste from _run_locally. Please update or remove that line so the docstring matches the actual signature.

  2. The inner run_single_row currently uses a bare except Exception and wraps the error into TaskResult.error. That’s acceptable for a defensive boundary, but if you know which exceptions are expected from task, narrowing the exception type (or at least logging unexpected ones separately) would improve debuggability and align better with linting guidance.

-        Args:
-            rows: List of dataset rows to process
-            task: Function to run on each row
-            stop_on_error: Whether to stop on first error
+        Args:
+            rows: List of dataset rows to process
+            task: Function to run on each row

Please confirm against your linting configuration (e.g., Ruff/Flake8 rules like BLE001) how strict you want to be about broad Exception catches in this module.
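One way to narrow the boundary while keeping it defensive, sketched with a hypothetical "expected" exception set (the actual set depends on what `task` can raise):

```python
import asyncio
import logging
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple

logger = logging.getLogger(__name__)


async def run_task_guarded(
    task: Callable[[Dict[str, Any]], Awaitable[Any]],
    row: Dict[str, Any],
) -> Tuple[Optional[Any], Optional[str]]:
    """Catch anticipated failures narrowly; log unexpected ones before wrapping."""
    try:
        return await task(row), None
    except (ValueError, TimeoutError) as exc:  # hypothetical "expected" failures
        return None, str(exc)
    except Exception as exc:
        # Unexpected errors still become results, but leave a traceback behind
        logger.exception("Unexpected error running task")
        return None, f"unexpected: {exc!r}"
```

This keeps the same never-raise contract while making genuinely surprising failures visible in logs, which also sidesteps the BLE001 complaint about a single blind `except Exception`.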

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6764781 and d36d712.

📒 Files selected for processing (2)
  • packages/sample-app/sample_app/experiment/run_research_experiment.py (1 hunks)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/sample-app/sample_app/experiment/run_research_experiment.py
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧬 Code graph analysis (1)
packages/sample-app/sample_app/experiment/run_research_experiment.py (2)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (1)
  • RunInGithubResponse (101-106)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)
  • run (37-89)
🪛 Ruff (0.14.5)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

200-200: Do not catch blind exception: Exception

(BLE001)


201-201: Use explicit conversion flag

Replace with conversion flag

(RUF010)


278-281: Avoid specifying long messages outside the exception class

(TRY003)


301-305: Avoid specifying long messages outside the exception class

(TRY003)


314-317: Avoid specifying long messages outside the exception class

(TRY003)


359-361: Create your own exception

(TRY002)


359-361: Avoid specifying long messages outside the exception class

(TRY003)


452-452: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.12)
🔇 Additional comments (1)
packages/sample-app/sample_app/experiment/run_research_experiment.py (1)

56-77: Main experiment flow and GitHub/local handling look solid

The main() coroutine wires the sample together cleanly, and the isinstance(response, RunInGithubResponse) guard correctly handles the union return type from client.experiment.run in GitHub vs local environments.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed 49e92f2 in 1 minute and 4 seconds. Click for details.
  • Reviewed 50 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 4 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:73
  • Draft comment:
    Good: The run() method now correctly passes the 'aux' parameter to _run_in_github. Ensure tests validate that auxiliary metadata is handled as expected.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
2. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:247
  • Draft comment:
    The signature of _run_in_github now includes the 'aux' parameter with the proper type. This maintains consistency across calls.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
3. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:330
  • Draft comment:
    The new experiment_run_metadata dictionary correctly aggregates both 'related_ref' and 'aux'. Consider including documentation on the expected structure for 'aux' if not already documented elsewhere.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None
4. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:347
  • Draft comment:
    The RunInGithubRequest payload now correctly passes experiment_run_metadata (which includes both 'related_ref' and 'aux'). This fixes the prior issue of only passing related_ref.
  • Reason this comment was not posted:
    Confidence changes required: 0% <= threshold 50% None

Workflow ID: wflow_xH5s1SiOFnsvFUFa

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

68-77: experiment_metadata is still not passed to _run_in_github.

The run() method accepts experiment_metadata (line 44) and correctly passes it to _run_locally (line 85), but it's omitted when calling _run_in_github. Callers who supply experiment_metadata will have it silently ignored in GitHub runs.

         if os.getenv("GITHUB_ACTIONS"):
             return await self._run_in_github(
                 task=task,
                 dataset_slug=dataset_slug,
                 dataset_version=dataset_version,
                 evaluators=evaluators,
                 experiment_slug=experiment_slug,
+                experiment_metadata=experiment_metadata,
                 related_ref=related_ref,
                 aux=aux,
             )
🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

454-465: Consider aligning type hints for consistency.

run_single_row accepts Optional[Dict[str, Any]] (line 454), but TaskResult.input is typed as Dict[str, Any] (non-optional) per the model. If row were None, Pydantic validation would fail at line 458/463.

Since rows is List[Dict[str, Any]] and never contains None, this works in practice, but the type hints are inconsistent.

-        async def run_single_row(row: Optional[Dict[str, Any]]) -> TaskResult:
+        async def run_single_row(row: Dict[str, Any]) -> TaskResult:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between d36d712 and 49e92f2.

📒 Files selected for processing (1)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (4)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (5)
  • RunInGithubRequest (88-98)
  • RunInGithubResponse (101-106)
  • TaskResult (71-76)
  • GithubContext (79-85)
  • TaskResponse (10-15)
packages/traceloop-sdk/traceloop/sdk/datasets/datasets.py (1)
  • get_version_jsonl (170-175)
packages/traceloop-sdk/traceloop/sdk/client/http.py (1)
  • post (23-37)
packages/traceloop-sdk/traceloop/sdk/fetcher.py (1)
  • post (64-65)
🪛 Ruff (0.14.5)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

201-201: Do not catch blind exception: Exception

(BLE001)


202-202: Use explicit conversion flag

Replace with conversion flag

(RUF010)


281-284: Avoid specifying long messages outside the exception class

(TRY003)


304-308: Avoid specifying long messages outside the exception class

(TRY003)


317-320: Avoid specifying long messages outside the exception class

(TRY003)


368-370: Create your own exception

(TRY002)


368-370: Avoid specifying long messages outside the exception class

(TRY003)


461-461: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Lint
🔇 Additional comments (4)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (4)

1-21: LGTM!

Imports are appropriate for the new GitHub execution feature. The new model imports and type additions align with the async patterns and new API contracts.


92-240: LGTM!

The local execution path is well-structured with proper concurrency control via semaphore and comprehensive error handling. The evaluation flow correctly handles both sync and async modes.


242-372: LGTM! Past review concerns have been addressed.

The implementation now correctly:

  • Validates GITHUB_EVENT_NAME == "pull_request" before proceeding (line 303)
  • Ensures pr_url is always set by requiring pr_number extraction success (line 316)
  • Merges user-provided experiment_metadata with created_from: "github" (lines 331-334)
  • Uses the same experiment_run_metadata construction pattern as _run_locally (lines 336-340)

374-418: LGTM!

The _init_experiment and _create_task updates correctly support the new metadata propagation. The Optional[Dict[str, Any]] type for task_input appropriately handles cases where experiments may run without input data.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed 724f0d5 in 1 minute and 18 seconds. Click for details.
  • Reviewed 12 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:444
  • Draft comment:
    Removed 'stop_on_error' from the docstring since it is not an accepted parameter in _execute_tasks. Ensure that error handling is managed in the caller if needed.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.

Workflow ID: wflow_s5phOBDydc0SjYXR

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)

65-67: Docstring Returns section is incomplete.

The return type is Tuple[List[TaskResponse], List[str]] | RunInGithubResponse, but the docstring only documents the tuple format. Consider documenting both return shapes so SDK users know what to expect in CI vs local execution.

         Returns:
-            Tuple of (results, errors). Returns ([], []) if wait_for_results is False
+            - Local execution: Tuple of (results, errors). Returns ([], []) if wait_for_results is False
+            - GitHub execution: RunInGithubResponse with experiment_id, run_id, and status

469-471: Minor type hint inconsistency.

run_with_semaphore parameter is typed as Dict[str, Any] but run_single_row accepts Optional[Dict[str, Any]]. For consistency:

-        async def run_with_semaphore(row: Dict[str, Any]) -> TaskResult:
+        async def run_with_semaphore(row: Optional[Dict[str, Any]]) -> TaskResult:
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 49e92f2 and 724f0d5.

📒 Files selected for processing (1)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧠 Learnings (1)
📚 Learning: 2025-08-04T15:35:30.188Z
Learnt from: nina-kollman
Repo: traceloop/openllmetry PR: 3219
File: packages/traceloop-sdk/traceloop/sdk/datasets/dataset.py:357-360
Timestamp: 2025-08-04T15:35:30.188Z
Learning: In the traceloop SDK Dataset class, the established error handling pattern is that HTTP client methods return None on failure (after catching and logging RequestException), and the Dataset API methods check for None return values and raise Exception with descriptive messages. The update_row_api method is inconsistent with this pattern.

Applied to files:

  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🪛 Ruff (0.14.5)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

201-201: Do not catch blind exception: Exception

(BLE001)


202-202: Use explicit conversion flag

Replace with conversion flag

(RUF010)


281-284: Avoid specifying long messages outside the exception class

(TRY003)


304-308: Avoid specifying long messages outside the exception class

(TRY003)


317-320: Avoid specifying long messages outside the exception class

(TRY003)


368-370: Create your own exception

(TRY002)


368-370: Avoid specifying long messages outside the exception class

(TRY003)


460-460: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Lint
  • GitHub Check: Test Packages (3.10)
  • GitHub Check: Build Packages (3.11)
🔇 Additional comments (7)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (7)

1-21: LGTM!

Imports are well-organized and appropriate for the new GitHub execution functionality.


92-240: LGTM!

The local execution path properly propagates all parameters including experiment_metadata, handles concurrency with semaphore, and correctly implements stop_on_error and wait_for_results behavior.


242-372: LGTM! Previous review feedback addressed.

The method now:

  • Validates pull_request event explicitly (line 303)
  • Propagates experiment_metadata and aux parameters
  • Builds experiment_run_metadata consistently with _run_locally
  • Merges user-provided experiment_metadata with created_from: "github"

Based on learnings, the Exception raising pattern at line 368 is consistent with the established SDK error handling conventions.


374-399: LGTM!

The experiment_metadata parameter is correctly added and propagated to the request body.


401-418: LGTM!

The task_input parameter type correctly updated to Optional[Dict[str, Any]].


420-435: No concerns.

Existing JSONL parsing logic is unchanged.


437-479: Previous critical bug fixed.

The _execute_tasks method now correctly awaits and appends results without the broken completed_task.task_input access. Error handling is properly encapsulated in run_single_row.

Contributor

@ellipsis-dev ellipsis-dev bot left a comment

Important

Looks good to me! 👍

Reviewed 7ad43b6 in 1 minute and 35 seconds. Click for details.
  • Reviewed 13 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 1 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py:50
  • Draft comment:
    The removed docstring lines provided clarity on GitHub vs. local execution. Consider retaining a brief note that the method routes to a GitHub-specific flow when GITHUB_ACTIONS is set, to aid users in understanding the behavior.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 15% vs. threshold = 50%

    The comment is suggesting that documentation be added back or retained. This is a documentation suggestion, not a code change. The rules state "Do NOT comment unless there is clearly a code change required" and "Do NOT make comments that are obvious or unimportant." The behavior is already evident from the return type annotation and the method implementation itself. The comment uses "Consider retaining" which is a soft suggestion rather than identifying a clear issue. Additionally, the removed documentation was somewhat redundant given that the code itself is self-documenting with the environment variable check and the two separate method calls. The comment is about documentation style/preference rather than a functional issue.

    The removed documentation might actually be helpful for users who are reading the docstring to understand the method's behavior without diving into the implementation. The dual behavior based on environment variables might not be immediately obvious from just the return type, and explicit documentation could improve the developer experience.

    While documentation can be helpful, the rules explicitly state not to comment unless there's a clear code change required. This is a documentation preference, not a functional issue. The behavior is already clear from the code structure and return types. If the author removed this documentation, they likely had a reason (perhaps to keep the docstring more concise or because the behavior is self-evident from the implementation).

    This comment should be deleted. It's suggesting documentation changes rather than identifying a code issue. The comment doesn't point to a bug or required code change, just a preference for more detailed documentation. This violates the rule that comments should only be made when there's clearly a code change required.

Workflow ID: wflow_ff9K6bZnVxzSqsEW

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (2)

63-64: Update docstring to document both return types.

The Returns section only describes the tuple shape (results, errors), but when running in GitHub Actions, the method returns RunInGithubResponse with different fields (experiment_id, experiment_slug, run_id). Document both shapes so callers know what to expect in each context.


442-444: Remove stale docstring parameter.

An earlier revision of this docstring listed a stop_on_error argument that _execute_tasks does not accept. Verify the Args section now matches the actual method signature.

🧹 Nitpick comments (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (1)

66-88: Verify environment variable detection for consistency.

The branch condition uses os.getenv("GITHUB_ACTIONS") to detect GitHub CI, which returns None (falsy) when unset. Inside _run_in_github, however, lines 278-282 verify the same variable but could theoretically be skipped if called directly. While the current code paths ensure _run_in_github is only reached when GITHUB_ACTIONS is set, consider whether the redundant validation in _run_in_github (lines 278-282) is necessary or can be removed for clarity.
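A stricter version of the check, relying on GitHub's documented behavior of always setting `GITHUB_ACTIONS` to the literal string `"true"` inside workflow runs (the helper name is illustrative):

```python
import os


def in_github_actions() -> bool:
    # GitHub Actions sets GITHUB_ACTIONS="true" in every workflow run;
    # comparing against the literal avoids treating stray values as CI
    return os.getenv("GITHUB_ACTIONS") == "true"
```

Centralizing the check in one helper would also make the redundant re-validation inside `_run_in_github` easier to keep consistent or drop.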

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 724f0d5 and 7ad43b6.

📒 Files selected for processing (1)
  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (10 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Store API keys only in environment variables/secure vaults; never hardcode secrets in code
Use Flake8 for code linting and adhere to its rules

Files:

  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧠 Learnings (1)
📚 Learning: 2025-08-04T15:35:30.188Z
Learnt from: nina-kollman
Repo: traceloop/openllmetry PR: 3219
File: packages/traceloop-sdk/traceloop/sdk/datasets/dataset.py:357-360
Timestamp: 2025-08-04T15:35:30.188Z
Learning: In the traceloop SDK Dataset class, the established error handling pattern is that HTTP client methods return None on failure (after catching and logging RequestException), and the Dataset API methods check for None return values and raise Exception with descriptive messages. The update_row_api method is inconsistent with this pattern.

Applied to files:

  • packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py
🧬 Code graph analysis (1)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (3)
packages/traceloop-sdk/traceloop/sdk/experiment/model.py (5)
  • RunInGithubRequest (88-98)
  • RunInGithubResponse (101-106)
  • TaskResult (71-76)
  • GithubContext (79-85)
  • TaskResponse (10-15)
packages/traceloop-sdk/traceloop/sdk/datasets/datasets.py (1)
  • get_version_jsonl (218-223)
packages/traceloop-sdk/traceloop/sdk/client/http.py (1)
  • post (23-37)
🪛 Ruff (0.14.6)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py

199-199: Do not catch blind exception: Exception

(BLE001)


200-200: Use explicit conversion flag

Replace with conversion flag

(RUF010)


279-282: Avoid specifying long messages outside the exception class

(TRY003)


302-306: Avoid specifying long messages outside the exception class

(TRY003)


315-318: Avoid specifying long messages outside the exception class

(TRY003)


366-368: Create your own exception

(TRY002)


366-368: Avoid specifying long messages outside the exception class

(TRY003)


458-458: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Test Packages (3.12)
  • GitHub Check: Lint
  • GitHub Check: Build Packages (3.11)
  • GitHub Check: Test Packages (3.11)
  • GitHub Check: Test Packages (3.10)
🔇 Additional comments (4)
packages/traceloop-sdk/traceloop/sdk/experiment/experiment.py (4)

329-332: LGTM: Experiment metadata merge pattern.

The merge pattern correctly combines user-provided metadata with the GitHub marker, addressing the past review concern about hardcoded metadata.
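The merge described here can be sketched as follows; the key names come from the review, while the helper name is hypothetical:

```python
from typing import Any, Dict, Optional


def merge_experiment_metadata(user_metadata: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    # User-provided keys survive; the CI marker is applied last,
    # so "created_from" always reflects the GitHub execution path
    return {**(user_metadata or {}), "created_from": "github"}
```

Putting the marker last in the literal means it wins on key conflicts, so CI provenance cannot be overwritten by caller-supplied metadata.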


334-338: LGTM: Metadata structure normalized.

The experiment_run_metadata construction now matches the pattern used in _run_locally (lines 123-127), ensuring consistent metadata shape across both execution paths and addressing the past concern about asymmetry.


301-318: LGTM: GitHub context validation is robust.

The addition of explicit GITHUB_EVENT_NAME validation (line 301) and the RuntimeError when pr_number cannot be extracted (lines 314-318) address the past review concerns about type mismatches and insufficient event verification. The code now ensures that pr_url is always non-null when constructing GithubContext, making it compatible with the model's str type annotation.


473-475: LGTM: asyncio.Task bug fixed.

The critical bug flagged in past reviews—accessing completed_task.task_input—has been resolved. The current implementation correctly awaits each task and appends the returned TaskResult without attempting to access non-existent attributes.

@traceloop

traceloop bot commented Nov 27, 2025

🧪 Traceloop Experiment Results

Experiment: research-exp
Tasks Evaluated: 20
Related Commit: 1234567
Last Updated Comment: 2025-11-27 10:56:21 UTC

Evaluators Results

Evaluator               Success Rate     Avg Score
Research Relevancy      20/20 (100.0%)   -
Research Word Counter   -                239.45

View evaluation results →
View dataset →

@nina-kollman nina-kollman merged commit 1297e6e into main Nov 27, 2025
12 checks passed