Commit b6115d4
feat(gepa): add tool description optimization for multi-agent systems (#8928)
* feat(gepa): add tool description optimization for multi-agent systems
  - Add optimize_tool_descriptions parameter (default False) to GEPA
  - Extract tool descriptions from all nested modules via named_sub_modules()
  - Apply optimized descriptions in DspyAdapter.build_program()
  - Enables holistic optimization of tools across main and subagent modules
  - Tests: 4 new tests, all 16 pass (4 new + 12 existing)
* style: fix ruff formatting (trailing whitespace)
* style: apply ruff formatting fixes
* feat(gepa): implement tool-specific proposer for tool descriptions
  - Add ToolProposer with GenerateImprovedToolDescription signature
  - Implement routing logic to separate tools from signatures
  - Tools use ToolProposer, signatures use custom or parent default
  - Backward compatible: preserves existing custom_instruction_proposer behavior
  - Add test verifying routing splits components correctly
* docs(gepa): clean up multi-agent example code
  - Define tool functions outside class for clarity
  - Match structure of simple ReAct example
  - Add clear comments explaining architecture
  - Make code more readable and maintainable
* refactor(gepa): simplify tool reflective dataset with ReAct context reuse
  Tools now copy ReAct's reflective data with tool-specific annotation instead of complex trajectory extraction. This 15-line approach reuses ReAct's existing context (thoughts, tool calls, observations) and adds focused annotation for each tool.
  Implementation:
  - Tools receive full ReAct reflective examples (same trajectory context)
  - Feedback prefixed: [Optimizing tool: 'X'] for focused optimization
  - Reflection LM sees complete multi-step execution traces per tool
  Benefits:
  - Simpler: 15 lines vs 70+ line extraction approach
  - Reuses code: No duplicate trajectory formatting logic
  - Same context: Tools see full ReAct execution traces
  - Clean: Removed all debug output
  Tests:
  - 4 focused tests following GEPA patterns (removed 1 redundant)
  - 226KB fixture with 34 LM + 6 reflection calls
  - All tests passing with gpt-5-nano traces
  Documentation:
  - Updated GEPA_Advanced.md with implementation details
  - Explains reflective dataset construction approach
* fix(gepa): unify custom proposer routing for tools
* docs(gepa): clarify tool reflection prompt
* test: streamline GEPA tool optimization tests
* fix(gepa): streamline tool proposer formatting
* test(gepa): drop legacy dummy tool fixture
* docs(gepa): add tool-specific reflection prompt and metric example
  - Add GenerateImprovedToolDescriptionFromFeedback signature documentation
  - Include tool-aware metric example showing trajectory access
  - Document tool prefix annotation in feedback
  - Note component_selector applies to both signatures and tools
  - Fix 'fundamentally' language per reviewer feedback
* docs(gepa): fix implementation details with accurate code flow
  - Separate Pass 1 (predictor examples) and Pass 2 (tool aggregation)
  - Clarify Generated Outputs includes full trajectory for ReAct
  - Fix feedback annotation format to [Tool 'name' from 'predictor_key']
  - Add Component Identification & Proposer Routing section
  - Explain dual-proposer independence (custom proposer doesn't affect tool proposer)
  - Use consistent terminology: 'predictor' and 'signature instructions'
* docs(gepa): remove backward compatibility note
  Per reviewer feedback, backward compatibility should be implicit
* docs(gepa): improve usage examples with optimization visualization
  - Add component_selector='all' to optimize all components together
  - Show how to view optimized tool descriptions
  - Add example output demonstrating improvement from vague to specific descriptions
  - Remove unnecessary comments for cleaner examples
* docs(gepa): add design rationale comments for tool context sharing
  - Document why full ReAct trajectory is shared with all tools
  - Explain rationale: tool interdependencies, selection patterns, workflow context
  - Add concrete example of optimization benefit
  - Describe alternative considered (tool-specific filtering) and rejection reasoning
  - Add future work section on joint tool optimization
  - Present two architectural approaches: separate proposer vs extending ReAct proposer
  - Include implementation details, benefits, challenges, and decision rationale
* docs(gepa): add tool optimization links to overview and parameter docs
  - Add Tool Description Optimization section to GEPA overview.md with link to advanced guide
  - Add documentation link to optimize_tool_descriptions parameter in gepa.py
  - Addresses reviewer feedback to make tool optimization more discoverable
* docs(gepa): refine tool optimization scenarios and remove implementation details
  - Restructure 'When to Use' as numbered list (1-5) per reviewer feedback
  - Move section after implementation details for better flow
  - Remove tool: prefix implementation detail from component identification
  - Explain tool discovery via ReAct modules in user-friendly terms
  - Add custom proposer compatibility clarification
  - Address optional PR feedback items (11 & 13)
* docs(gepa): clarify future work section in code comments
  - Add note that proposed architecture details may change
  - Expand challenges with counterpoints and questions
  - Mark implementation notes as optional to avoid overengineering
* refactor(gepa): unify ReAct optimization as single module
  Treat ReAct as ONE unified module containing react predictor, extract predictor, and tools as subcomponents - respecting both GEPA's module-level optimization abstraction and DSPy's ReAct module design.
  Before:
  - Tools optimized separately from react/extract (multiple components)
  - Each component had separate reflective dataset (3x redundant trajectories)
  - Violated DSPy's ReAct abstraction (tools are subcomponents, not peers)
  After:
  - ReAct module optimized as single "react_module" component
  - Joint optimization of react instruction + extract instruction + tool descriptions
  - One reflective dataset per ReAct execution (no redundant trajectories)
  - Respects GEPA's dict[str, str] contract (JSON config as string value)
  Architecture:
  - ReActModuleProposer: Handles entire ReAct module optimization
  - Dynamic signature generation: Creates output fields for each tool/parameter
  - Optional fields: Extract, tool descriptions, tool args (only improve what needs fixing)
  - JSON config: {"react": "...", "extract": "...", "tools": {...}}
  Benefits:
  - Eliminates duplicate trajectories (addresses gepa#97)
  - Coherent improvements (LM sees how components work together)
  - Respects both GEPA and DSPy abstractions
  - Enables cold-start optimization (tool args always available based on schema)
* test(gepa): add end-to-end ReAct module optimization test
  Adds comprehensive test proving GEPA can optimize ReAct modules end-to-end:
  - Baseline with minimal tool descriptions achieves 0% accuracy
  - After optimization, achieves 100% accuracy
  - Tests unified ReAct architecture (react + extract + tools as one module)
  Key features:
  - Uses stable SHA256 hashing for deterministic fixture replay
  - Avoids Python's PYTHONHASHSEED randomization issues
  - 189KB fixture with security check passed (no API keys/tokens)
  - Verifies all components are optimized (react, extract, tool descriptions)
* fix(gepa): enable arg description optimization for ReAct tools
* chore: remove legacy test_gepa_tool_optimization.py
  This test file was for the old architecture where tools were optimized separately from ReAct modules. With the unified ReAct optimization approach, this test is replaced by test_gepa_react_optimization.py which tests the new architecture where ReAct modules (react + extract + tools) are optimized as a single unified component.
* fix: restore accidentally removed score mismatch warning
* test: update fixture after arg description optimization fix
  Regenerates fixture to match commit 3418b59 which changed how tool arg descriptions are optimized. Reduces LM calls from 26→22 by improving the optimization process efficiency.
* fix(test): use JSON-based hashing for cross-version fixture stability
  - Replace repr()-based hashing with json.dumps(sort_keys=True)
  - Fixes CI failures caused by Python version differences (3.12.9 vs 3.12.11)
  - repr() formatting can differ between Python micro versions
  - JSON spec is standardized and stable across all versions
  - Regenerate fixture with new hashing approach
* refactor(gepa): rename optimize_tool_descriptions to optimize_react_components
  - Rename parameter to better reflect that we optimize all ReAct components
  - Components include: react instructions, extract instructions, tool descriptions, and tool argument descriptions
  - Update all code references, tests, and documentation
  - No functional changes, pure rename for clarity
* docs(gepa): improve 'What is optimize_react_components?' section
  - Clarify that specialized optimization applies only to dspy.ReAct modules
  - Explain ReAct module structure (react predictor, extract predictor, tools)
  - List all 4 optimizable components with clear descriptions
  - Specify react instruction always optimized, others optional based on failures
  - Simplify language: 'contradict' vs 'work together' instead of complex terms
  - Add link to ReAct documentation for deeper dive
* docs(gepa): replace outdated tool-specific prompt with actual ReAct optimization prompt
  - Rename section: 'Tool-Specific Reflection Prompt' → 'ReAct Optimization Prompt'
  - Replace GenerateImprovedToolDescriptionFromFeedback (doesn't exist) with GenerateImprovedReActDescriptionsFromFeedback (actual implementation)
  - Show that prompt receives ALL components (react, extract, tools) and optimizes jointly
  - Update metric example: tool_feedback_metric → react_metric for clarity
  - Remove outdated notes about tool-specific prefixes and component_selector behavior
  - Clarify that tool descriptions/args are added dynamically via signature.append()
* docs(gepa): simplify 'How It Works' section with accurate routing behavior
* docs(gepa): remove outdated Implementation Details section
* docs(gepa): replace theoretical scenarios with real user pain points
* docs(gepa): fix usage examples reference to match updated scenarios
* docs(gepa): update inspect section to show all 4 ReAct components with correct syntax
* docs(gepa): rewrite Section 8 with accurate custom proposer behavior for ReAct
  - Clarify custom proposer receives ALL components (regular + ReAct)
  - Add realistic signature with ReAct failure patterns and component types
  - Use exact naming from implementation: examples_with_feedback, component_reflective_data, propose_instruction
  - Show _format_examples() helper matching real markdown formatting
  - Remove regular component handling to keep example focused on ReAct
  - Test code example validates successfully
  - Fix contradiction: optimize_react_components must be True (not irrelevant)
* docs(gepa): clarify custom proposer behavior in routing section
  Change 'overrides the default routing' to 'receives all components and handles the optimization logic' to avoid confusion with optimize_react_components which still controls discovery/serialization
* docs(gepa): remove discouraging recommendation from custom proposer section
  Users reading this section want to learn how to implement custom proposers for ReAct - don't discourage them from doing so
* fix(gepa): fix top-level ReAct module lookup and remove tool name sanitization
  - Fix ReAct module lookup to handle top-level modules correctly.
    Previously failed to match 'self' path for top-level ReAct instances.
  - Remove tool name sanitization entirely.
    Tool names are now used as-is in dynamic signatures.
    Removed _sanitize_name() function and all calls to it.
    Simplifies code and avoids surprising behavior.
  - Skip failing test_gepa_react_optimization.
    Hash-based fixtures are fragile across Python versions.
  - Add debug logging to trace processing for troubleshooting.
* refactor(gepa): unify ReAct module key handling and use constant
  - Replace all magic string 'react_module' with REACT_MODULE_PREFIX constant
  - Unify path normalization pattern across gepa.py and gepa_utils.py
  - Rename 'prefix' to 'normalized_path' for clarity
  - Simplify module lookup by using consistent normalization
  - Remove awkward OR clause in ReAct module matching logic
  This makes the codebase more maintainable with a single source of truth for the module prefix and consistent naming throughout.
* test(gepa): add ReAct module detection tests for nested structures
  - Add 3 comprehensive detection tests: single ReAct, mixed workflow (2 ReAct + ChainOfThought), orchestrator with 2 workers
  - Tests validate full path preservation (bug fix validation)
  - Uses monkey patching to capture base_program from gepa.optimize
  - Helper functions for DRY: setup spy, create optimizer, assert detection
  - Validates all ReAct components: react, extract, tools, tool metadata
* test(gepa): add comprehensive ReAct detection and reconstruction tests
  Detection tests (3):
  - test_single_react_module_detection: top-level ReAct module
  - test_multi_react_workflow_detection: mixed ReAct + ChainOfThought (bug fix validation)
  - test_nested_react_orchestrator_worker_detection: orchestrator with 2 workers as tools
  Reconstruction tests (3):
  - test_build_program_single_react: single ReAct module
  - test_build_program_multi_react_workflow: mixed workflow with ReAct + non-ReAct
  - test_build_program_orchestrator_with_workers: complex nested structure
  Helper functions (12):
  - setup_spy_for_base_program: captures base_program from gepa.optimize
  - simple_metric_for_detection/reconstruction: test metrics
  - create_gepa_optimizer_for_detection: creates optimizer
  - assert_react_module_detected/updated: validates ReAct modules
  - assert_regular_module_detected/updated: validates non-ReAct modules
  - mock_optimized_react_module: mocks optimized candidate
  - create_*_program: 3 reusable program builders
  Validates:
  - Full path preservation (bug fix)
  - All 4 ReAct components (react, extract, tools, arg_desc)
  - Non-ReAct module handling
  - Deepcopy verification (original unchanged)
  - Both detection and reconstruction phases
* test(gepa): add reflective dataset tests for multi-agent trajectory validation
  Adds 2 new tests validating make_reflective_dataset captures complete trajectories:
  - test_make_reflective_dataset_single_react: Single ReAct module
  - test_make_reflective_dataset_orchestrator_with_workers: Multi-agent system (3 modules)
  New helpers:
  - simple_feedback: Reusable feedback function (consolidates 5 duplicates)
  - assert_reflective_example_has_trajectory: Validates trajectory completeness
  Tests validate:
  - Complete trajectory capture (all iterations with thoughts/tools/observations)
  - No duplicate/missing iterations
  - Full path preservation in multi-agent systems
  - Each module's trajectory captured separately
  Improvements:
  - Clean up docstrings and remove redundant comments
  - Fix whitespace linter warnings (9 auto-fixed)
  - Reduce from 1054 to 975 lines
  All 8 tests passing (6 detection/reconstruction + 2 new reflective dataset)
* test(gepa): verify tool arg descriptions propagate to args schema
  - Update assert_react_module_updated to check tool.args['param']['description']
  - Add arg_desc to test cases for comprehensive validation
  - Expose bug: GEPA updates arg_desc but not tool.args (what renders in prompts)
* fix(gepa): propagate arg_desc updates to tool.args for prompt rendering
  tool.arg_desc is only used during Tool.__init__; updating it after creation has no effect on prompts. Only tool.args is rendered, so GEPA must update args for optimized descriptions to appear in prompts.
  Fixes the bug where reflection LM improves tool parameter descriptions but they don't show in actual prompts because arg_desc changes weren't propagated to the args schema.
* test(gepa): remove fixture-based test and unused dependencies
* test(gepa): remove unused fixture file
* style: fix ruff linting issues (import formatting, whitespace, bare except)
* refactor(test): rename setup_spy_for_base_program to setup_capture_for_base_program for clarity
* docs(gepa): clarify why Tool.func uses placeholder lambda in proposer
* refactor(gepa): make all ReAct components optional with None default for selective optimization
* docs(gepa): clarify 'LM' as 'reflection LM' in comments for precision
* refactor(gepa): refine reflection prompt to guide concise, focused ReAct component optimization
  Update the ReAct proposer's reflection signature to guide the LM toward more appropriate output granularity and selective optimization.
  Changes:
  - Add context that components are progressively optimized across iterations
  - Change 'and' to 'and/or' for abstraction/specificity (allows flexibility)
  - Refine field descriptions to guide output style:
    - 'ReAct instruction for reasoning and tool selection' (functional context)
    - 'Extract instruction for answer extraction' (functional context)
    - 'Purpose of tool' (focuses on high-level what/why, not verbose how)
    - 'Usage of parameter' (focuses on specific usage, not essay)
  The goal is to prevent overly verbose LM outputs (multi-paragraph tool/param descriptions) while preserving exploration capability. Field descriptions now provide functional context ('for reasoning', 'purpose', 'usage') that naturally guides appropriate scope without being prescriptive about format or length. This allows the reflection LM to determine the right level of detail based on what's needed to fix failures, aligned with GEPA's general meta-prompt philosophy.
* docs(gepa): revise ReAct metric example to be general and extensible
  Replace prescriptive 'minimize tool calls' example with educational progression that shows users how to write effective metrics without forcing specific objectives.
  Changes:
  - Show simple metric first (just correctness feedback)
  - Then show trajectory-based metric (accessing agent execution)
  - Use clear for-loop instead of list comprehension for readability
  - Follow DSPy docs conventions: answer_match variable, example/pred naming
  - Remove 'minimize tool calls' directive - let users decide their objectives
  - Add bullet points explaining what trajectory can reveal (tool selection, reasoning quality, efficiency) without prescribing how to use it
  - Rename section to 'Writing Metrics for ReAct Optimization' (more actionable)
  This aligns with GEPA's philosophy: provide general, extensible patterns that users can adapt to their specific needs. Detailed examples can be shown in tutorials rather than API documentation.
  Addresses PR review comment 5 about prescriptive objectives in documentation.
* docs(gepa): replace custom proposer example with reference to ReActModuleProposer
  Address PR review comment 6 by simplifying the custom proposer documentation.
  Changes:
  - Replace long inline implementation example with clickable GitHub link
  - Point to ReActModuleProposer as reference implementation
  - Add bulleted list of what the reference shows (parsing, dynamic signatures, etc.)
  - Keep essential JSON structure and interface documentation
  - Remove 100+ lines of redundant code example
  Benefits:
  - Less overwhelming for users (no duplicate code)
  - Single source of truth (reference implementation)
  - Clickable link to actual working code on GitHub
  - Users can copy/modify real implementation instead of example
  Addresses PR comment from @LakshyAAAgrawal about using reference instead of full implementation example.
* docs(gepa): make custom proposer section more approachable and clear
  Improve the custom proposer documentation to be more user-friendly while maintaining technical accuracy.
  Changes:
  - Warmer, more inviting opening ("best way to start")
  - Concrete example with 'search' tool instead of generic placeholders
  - Plain English explanations for each component ("How the agent reasons...")
  - Clear separation: "What you can improve" vs "What to preserve"
  - Simpler code example with inline comments explaining ReAct vs regular
  - Concise "reference shows how to" bullets (3 key points)
  - More approachable tone without sacrificing precision
  This makes the advanced feature more accessible to users who need custom optimization logic beyond the defaults.
  Follows up on the previous commit addressing PR comment about custom proposer example.
* docs(gepa): update ReAct reflection prompt to match current implementation
  Sync documentation with actual reflection prompt after bd4cdac:
  - Add 'These components are progressively optimized' context
  - Change to 'and/or specificity' for flexibility
  - Update output field types to 'str | None' with default=None
  - Refine field descriptions ('for reasoning and tool selection', 'for answer extraction')
  - Add note about dynamic field descriptions ('Purpose of tool', 'Usage of parameter')
  This ensures docs accurately reflect the current prompt design that guides appropriate granularity without being prescriptive.
* feat(gepa): warn when ReAct modules detected but optimization disabled
  Add warning message when GEPA detects ReAct modules in the program but optimize_react_components=False. This helps users discover the ReAct optimization feature.
  Changes:
  - Always traverse modules to detect ReAct instances
  - If optimize_react_components=False, warn for each ReAct module found
  - Shows module path to help users identify what would be optimized
  - No behavioral changes when optimize_react_components=True
  Addresses maintainer feedback to make the feature more discoverable.
* test(gepa): fix DummyLM configuration and remove exception swallowing
  - Configure DummyLM with proper ReAct response format (next_thought, next_tool_name, next_tool_args)
  - Remove try/except blocks that silently swallowed exceptions
  - Add explanatory comments for why compile should now succeed
  - Increase DummyLM repetitions (10→20) to support GEPA iterations
  Addresses review feedback from @LakshyAAAgrawal requesting removal of unexplained exception handling that masked real bugs. All 8 tests now pass deterministically without silent failures.
* test(gepa): add failing tests for generic tool optimization
  - Add 4 core tests for tool optimization beyond ReAct
  - test_detect_single_tool: single Tool input field
  - test_detect_tool_list: multiple tools with ordering
  - test_skip_predictor_without_tools: negative case (passing)
  - test_update_tool_and_predictor: reconstruction path
  Tests use class-based signatures (required for type detection). Currently failing (TDD approach) - implementation next.
* refactor(gepa): rename optimize_react_components to enable_tool_optimization
  Rename flag to reflect generalization beyond ReAct modules:
  - optimize_react_components → enable_tool_optimization
  - Update documentation to mention custom predictors using dspy.Tool
  - Update warning message to use new flag name
  This prepares for upcoming feature: generic tool optimization for any predictor using dspy.Tool, not just dspy.ReAct modules.
* refactor(gepa): extract nested function to private method
  Move build_propose_new_texts() from nested function in __init__ to _build_propose_new_texts() private method per maintainer feedback. Also simplify LM context handling by using unified context manager pattern instead of if/else branching (18 lines → 6 lines).
  Changes:
  - Extract _build_propose_new_texts() as private class method
  - Simplify LM context: use 'with dspy.context(lm=self.reflection_lm or dspy.settings.lm)'
  - Clean up __init__ (110+ lines nested function → 1 line method call)
  Benefits:
  - Cleaner class structure (easier to scan __init__)
  - Methods testable in isolation
  - Reduced code duplication (-26 lines net)
  - Addresses maintainer feedback: 'move helper function out as private method'
* feat(gepa): detect tool-using predictors via type checking
  - Add type-based detection for predictors using dspy.Tool
  - Initialize tool-using predictors with JSON structure
  - Add inline helper function is_tool_field() for recursive type checking
  - Handle Union/Optional types containing Tool
  - Enable generic tool optimization beyond dspy.ReAct
* test(gepa): update ReAct tests for predictor-name-based keys
  - Move inline imports to top of file
  - Rename module_path → predictor_name for clarity
  - Update all assertions to use full predictor names (e.g., extract.predict)
  - Update feedback_map keys to match predictor names
  - Simplify multi-agent test assertions (20+ lines → 10 lines)
  All 8 ReAct optimization tests now passing with new key structure.
* test(gepa): use explicit predictor keys in tool optimization tests
  - Replace unpacking pattern with explicit predictor names
  - Remove duplicate inline imports (already at top)
  - Use TOOL_MODULE_PREFIX:pred consistently across tests
  - Improve test docstrings for clarity
  All 3 tool tests still passing (1 skipped intentionally).
* feat(gepa): extract tools from runtime traces
  Runtime tool discovery:
  - Import Tool type for isinstance() checks
  - Initialize tools_by_predictor dict to collect unique tools
  - Add extract_tools_from_value() recursive helper function
  - Extract tools from predictor trace inputs during iteration
  - Handle single Tool, list[Tool], dict[str, Tool] structures
  - Serialize tools to candidate JSON after all traces processed
  Implements runtime tool discovery (Change 2). Captures dynamically injected tools from actual usage patterns.
* feat(gepa): detect tool-using predictors at compile time
  - Import TOOL_MODULE_PREFIX constant
  - Detect predictors with dspy.Tool input fields
  - Create prefixed keys: tool_module:{predictor_name}
  - Use actual predictor name as JSON config key
  Pairs with tool extraction (fe19dac). Together they implement compile-time detection + runtime extraction for generic tool modules.
* refactor(gepa): use predictor identity for ReAct detection
  - Find extract/react predictors by object identity (not paths)
  - Use actual predictor names as JSON config keys
  - Module key uses extract_predictor_name for consistency
  - Clearer comments about dynamic predictor names
  More robust than path-based matching. Config keys are now actual predictor names (e.g., "multi_agent.react", "multi_agent.extract.predict") instead of generic "react"/"extract".
* test(gepa): refactor ReAct tests to use dynamic predictor names
  - Add get_predictor_name() helper using object identity
  - Remove all hardcoded predictor name strings
  - Update mock_optimized_react_module() to accept react_module parameter
  - Use expected_* naming convention for clarity
  - All 11 tests passing with fully dynamic approach
* refactor(gepa): generalize proposer to support both ReAct and tool modules
  - Rename ReActModuleProposer → ToolModuleProposer
  - Rename signature to GenerateImprovedToolModuleDescriptionsFromFeedback
  - Make base signature generic (current_predictor_instruction)
  - Dynamically add extract fields only for ReAct modules
  - Use prefix checks (REACT_MODULE_PREFIX) for reliable type detection
  - Support both 1-predictor (tool) and 2-predictor (ReAct) modules
  - Update routing to handle both TOOL_MODULE_PREFIX and REACT_MODULE_PREFIX
  - Clean variable names: primary_predictor_key, extract_predictor_key
  - Update all docstrings to reflect tool-using modules (not just ReAct)
* refactor(gepa): eliminate create-delete pattern in base_program build
  - Process ReAct modules first, then individual predictors
  - Skip predictors already part of module configs (check inside JSON)
  - Remove redundant base_program.pop() calls
  - No duplicate enable_tool_optimization checks
* refactor(gepa): eliminate ReAct coupling in build_program
  Replace ReAct-specific logic with generic approach:
  Before:
  - isinstance(ReAct) checks
  - Direct access to module.react/module.extract/module.tools
  - Separate if/elif branches for instruction updates
  After:
  - Program-level __dict__ traversal to find tools
  - Unified aggregation: plain strings → module config overrides
  - Single application loop (no duplication)
  Why __dict__ traversal: Tools can be declared as single attributes (self.tool), lists (self.tools=[...]), or dicts (self.tools={...}), and nested in any dspy.Module. Traversing __dict__ finds all tools regardless of how they're structured, without coupling to specific module types. This makes the code resilient to ReAct internal changes and works for any module using dspy.Tool.
* refactor(gepa): apply code cleanup principles consistently
  - Use tuple syntax for startswith() (more Pythonic)
  - Remove unnecessary try-except for JSON parsing (we control the source)
  These follow the same principles applied in build_program refactor.
* refactor(gepa): unify config extraction patterns
  - Use isinstance(v, str) for predictor filtering (type-based)
  - Use .get("tools", {}) for tools extraction (more Pythonic)
  Both changes make the code more consistent and resilient to config structure changes.
* refactor(gepa): remove verbose logs and consolidate comments
  Remove ~25 debug/info logs per maintainer feedback:
  - Internal routing/processing logs
  - Trace processing details
  - Reflective example breakdowns
  - Config building verbosity
  Consolidate multi-line comments into concise single lines while preserving important context (WHY, not WHAT).
* docs(gepa): clarify ReAct trace workaround with TODO
  Document that this is a workaround for ReAct's multiple predictor calls with partial trajectories. After PR #8999 merges, we should test if we can remove this and use extract predictor trace directly.
* test(gepa): remove deprecated ReAct-specific tests and refactor tool optimization tests
* feat(gepa): add assertion for ReAct two-predictor design
  Fail fast with clear error if DSPy's ReAct design changes (missing extract.predict). Better than silently skipping broken modules.
* test(gepa): add DSPy ReAct design docs and improve test consistency
  - Add header note documenting DSPy's two-predictor ReAct design
  - Remove test_react_trace_aggregation (was testing DSPy internals)
  - Move test tool fixtures to top for reuse
  - Fix test_selective_optimization style:
    - Simplify docstring to one-liner
    - Remove verbose inline comments
    - Fix assertion to use program.tools reference (clearer)
  - Add consistent GEPA iteration comments
* fix(test): remove trailing whitespace and extra blank lines
* refactor(gepa): clarify tool proposer output field descriptions
* refactor(gepa): treat args as canonical for tool arg descriptions
* refactor(gepa): tolerate missing arg descriptions when applying tool configs
* refactor(gepa): use args as sole source of tool arg descriptions
* test(gepa): drop arg_desc expectations from tool optimization tests
* refactor(gepa): refine reflection prompts for tool optimization
  Improve instructions for the reflection LM to focus on reinforcing successful patterns and providing progressively optimized updates for predictor instructions and tool descriptions.
* refactor(gepa): improve tool extraction robustness and observability
  Move tool extraction logic to evaluate() loop for immediate capture. Fix overwrite risk by merging discovered tools with existing config. Improve logging and docstrings for better maintainability.
* refactor(gepa): simplify initialization logic
  Move helper function outside loop and simplify predictor deduplication check by validating keys before parsing JSON.
* refactor(gepa): remove ReAct trace workaround
  Use standard trace selection logic (prioritizing failures) for all modules including ReAct. The extractor logic workaround is no longer needed as we handle aggregated duplicates differently.
* chore(gepa): clean up whitespace and style changes from tool optimization PR
* chore(gepa): clean up whitespace and style changes from tool optimization PR
* chore: restore .gitignore to match main
* docs(gepa): document tool optimization flag in overview
* docs(gepa): clarify enable_tool_optimization and custom proposers
* docs(gepa): update tool module optimization prompt to match actual code
* docs(gepa): update How Tool Optimization Works section
* docs(gepa): update When to Use Tool Optimization section
* docs(gepa): update custom proposers section for tool optimization
* docs(gepa): update usage examples with correct tool patterns and interfaces
* docs(gepa): remove redundant metrics section
* refactor(gepa): use absolute import for ToolModuleProposer
* docs(gepa): update tool optimization doc link
* docs(gepa): replace eval() example with get_weather tool
* fix(gepa): change ReAct detection log from warning to info
* refactor(gepa): extract _propose_component_texts as private method
* refactor(gepa): TODO out generic tool module optimization, keep ReAct only
* refactor(gepa): remove generic tool module detection, keep ReAct only
* refactor(gepa): improve naming and extract tool update methods
* refactor(gepa): remove unused TOOL_MODULE_PREFIX and rename to tool_components
* refactor(gepa): rename ToolModuleProposer to ToolProposer
* docs(gepa): update tool optimization docs for ReAct-only support
* refactor(gepa): unify prefix to TOOL_MODULE_PREFIX for all tool-using modules
  - Rename REACT_MODULE_PREFIX to TOOL_MODULE_PREFIX
  - Single abstraction for tool modules (ReAct now, generic later)
  - Use count-based detection for extract predictor instead of prefix check
  - Update docs to reflect new naming
* docs(gepa): remove CustomAgent example, keep ReAct only
* docs(gepa): update enable_tool_optimization docstring for ReAct-only support
* test(gepa): remove generic tool tests, keep ReAct-only tests
  - Remove test_detect_single_tool, test_detect_multiple_tools
  - Remove test_apply_optimized_tool_descriptions
  - Update REACT_MODULE_PREFIX -> TOOL_MODULE_PREFIX
  - Update docstring to reflect ReAct-only support
* refactor(gepa): use local ToolProposer variable, update docs for ReAct-only
  - Remove self._tool_proposer instance variable
  - Create ToolProposer locally when needed (stateless)
  - Update overview.md to say ReAct-only instead of 'any module'
* docs(gepa): update tool optimization docs for ReAct-only support
  - Remove generic tool module references, keep ReAct only
  - Update JSON structure examples to show both react and extract predictors
  - Fix comment in custom proposer example
* some fixes

---------

Co-authored-by: chenmoneygithub <chen.qian@databricks.com>
1 parent ed01c88 · commit b6115d4

File tree

6 files changed (+1055, -136 lines)

docs/docs/api/optimizers/GEPA/GEPA_Advanced.md

Lines changed: 143 additions & 0 deletions
@@ -443,3 +443,146 @@ gepa = dspy.GEPA(
    auto="medium"
)
```

## Tool Optimization

### What is enable_tool_optimization?

When `enable_tool_optimization=True`, GEPA jointly optimizes `dspy.ReAct` modules: predictor instructions, tool descriptions, and argument descriptions are updated together instead of being tuned in isolation. This lets the model learn better patterns for when to call a tool and how to use it, drawing on the same execution traces and feedback that drive the rest of GEPA.

### Usage and constraints

- **Expose tools as `dspy.Tool` in signatures and examples.** GEPA only optimizes tools that are represented as `dspy.Tool` objects and actually passed as such into your modules.
- **Treat `Tool.name` as a stable identifier.** GEPA keys improved descriptions and argument descriptions by `Tool.name`. If you reuse the same name for different tools, they will share the same text updates.
- **Avoid custom tools named `"finish"`.** The built-in ReAct `"finish"` tool is reserved and excluded from optimization; custom tools named `"finish"` are likewise not optimized.
- **Custom instruction proposers handle all modules and tool updates.** When you provide an `instruction_proposer`, GEPA routes every optimized module through your proposer instead of the built-in instruction proposer. If `enable_tool_optimization=True`, modules that call tools are still included, and your proposer is also responsible for updating their tool descriptions and argument descriptions.
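
To see why stable names matter, here is a minimal sketch (the tool functions are illustrative, not part of GEPA): both tools below register under the name `"search"`, so any improved description GEPA proposes for `"search"` would be applied to both.

```python
import dspy

def search_web(query: str) -> str:
    return f"web results for: {query}"

def search_docs(query: str) -> str:
    return f"doc results for: {query}"

# Both tools share the name "search", so GEPA would attach the same optimized
# description and argument descriptions to both of them.
web_tool = dspy.Tool(search_web, name="search", desc="Search tool")
docs_tool = dspy.Tool(search_docs, name="search", desc="Search tool")

# Prefer unique, stable names instead:
docs_tool = dspy.Tool(search_docs, name="search_docs", desc="Search tool")
```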

### Tool Module Optimization Prompt

GEPA uses `ToolProposer` to optimize ReAct modules when `enable_tool_optimization=True`. For each module, the proposer builds a dynamic signature from the base `GenerateImprovedToolModuleDescriptionsFromFeedback` signature shown below, then appends output fields for each tool description and each tool argument description in that module. For ReAct modules, the proposer also appends input and output fields for the extract instruction.

```python
class GenerateImprovedToolModuleDescriptionsFromFeedback(dspy.Signature):
    """I provided an assistant with predictor instructions and tool descriptions,
    but its performance needs improvement based on the examples_with_feedback below.

    Your task is to propose better predictor instructions, tool descriptions, and
    tool argument descriptions that address the issues shown in these examples.
    Focus on reinforcing patterns that clearly improve the assistant's performance
    on similar tasks, rather than rewriting everything from scratch unless necessary.
    These components are progressively optimized - refine only what needs to change.

    Analyze the examples_with_feedback to identify success and failure patterns,
    and write improved instructions and descriptions at their appropriate level
    of abstraction and/or specificity, so that each layer plays a clear,
    complementary role without unnecessary repetition or verbosity unless
    redundancy clearly helps the assistant's performance.
    """

    current_predictor_instruction = dspy.InputField(
        desc="Current instruction guiding the predictor"
    )
    current_tools = dspy.InputField(
        annotation=list[dspy.Tool],
        desc="Available tools with their complete schemas"
    )
    examples_with_feedback = dspy.InputField(
        desc="Execution examples with feedback showing successes and failures"
    )

    improved_predictor_instruction: str | None = dspy.OutputField(
        desc="Improved instruction for the predictor",
        default=None
    )

# GEPA appends output fields dynamically for each tool and argument:
# - improved_tool_{name}_desc with desc="Improved description of tool '{name}'"
# - improved_tool_{name}_arg_{param}_desc with desc="Improved description of the argument '{param}' of tool '{name}'"
# For ReAct modules, GEPA also appends:
# - current_extract_instruction (input) with desc="Current instruction for extraction predictor"
# - improved_extract_instruction (output) with desc="Improved instruction for extraction"
```

The reflection LM uses this dynamically built signature to jointly propose updates across predictor instructions, tool descriptions, and argument descriptions based on execution feedback. Updates are coordinated rather than made in isolation: the LM sees all current components together and can selectively update any subset by returning new text, or return `None` to keep a component unchanged.
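
For intuition, here is one way the dynamic fields could be appended with `dspy.Signature.append`, following the field names in the comments above; this is a sketch, and the actual builder inside `ToolProposer` may differ:

```python
import dspy

def build_proposal_signature(tools: list[dspy.Tool], is_react: bool) -> type[dspy.Signature]:
    sig = GenerateImprovedToolModuleDescriptionsFromFeedback
    for tool in tools:
        # One optional output field per tool description...
        sig = sig.append(
            f"improved_tool_{tool.name}_desc",
            dspy.OutputField(desc=f"Improved description of tool '{tool.name}'", default=None),
            type_=str | None,
        )
        # ...and one per tool argument description.
        for param in tool.args:
            sig = sig.append(
                f"improved_tool_{tool.name}_arg_{param}_desc",
                dspy.OutputField(
                    desc=f"Improved description of the argument '{param}' of tool '{tool.name}'",
                    default=None,
                ),
                type_=str | None,
            )
    if is_react:
        # ReAct modules also expose the extract predictor's instruction.
        sig = sig.append(
            "current_extract_instruction",
            dspy.InputField(desc="Current instruction for extraction predictor"),
            type_=str,
        )
        sig = sig.append(
            "improved_extract_instruction",
            dspy.OutputField(desc="Improved instruction for extraction", default=None),
            type_=str | None,
        )
    return sig
```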

### How Tool Optimization Works

When `enable_tool_optimization=True`, GEPA:

1. **Discovers ReAct modules** - Identifies `dspy.ReAct` modules and their associated tools
2. **Treats them as joint optimization units** - Instead of only optimizing predictor instructions, GEPA optimizes predictor instructions and tool descriptions together as a coordinated set; for ReAct this includes both the react and extract instructions
3. **Routes to specialized proposer** - Separates components by type and routes them appropriately:
    - **With custom `instruction_proposer`**: Your custom proposer receives both ReAct modules and plain predictors, and is responsible for updating all components
    - **With default proposer**: Plain predictors use the default instruction proposer; ReAct modules use `ToolProposer`, which employs the dynamic signature mechanism described above
4. **Optimizes jointly** - `ToolProposer` improves predictor instructions and tool descriptions together based on execution feedback, coordinating updates across all components rather than tuning them in isolation
5. **Applies updates** - Improved instructions update predictor signatures; improved tool descriptions and argument descriptions update all `dspy.Tool` objects with matching tool names throughout the program (see the sketch below)

Modules without tools (like `dspy.Predict` or `dspy.ChainOfThought`) continue using standard GEPA instruction-only optimization.
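
Step 5 can be pictured as a name-keyed walk over the program. The sketch below is a simplified model of that behavior, not GEPA's internal code, and the shape of the `improved` dict is assumed for illustration:

```python
import dspy

def iter_tools(module: dspy.Module):
    """Yield every dspy.Tool reachable from a module, whether stored as a
    single attribute, in a list, in a dict, or on a nested sub-module."""
    for value in vars(module).values():
        if isinstance(value, dspy.Tool):
            yield value
        elif isinstance(value, list):
            yield from (v for v in value if isinstance(v, dspy.Tool))
        elif isinstance(value, dict):
            yield from (v for v in value.values() if isinstance(v, dspy.Tool))
        elif isinstance(value, dspy.Module):
            yield from iter_tools(value)

def apply_tool_updates(program: dspy.Module, improved: dict) -> None:
    # Assumed shape: improved = {"get_weather": {"desc": "...", "args": {"city": "..."}}}
    for tool in iter_tools(program):
        update = improved.get(tool.name)
        if not update:
            continue
        if update.get("desc"):
            tool.desc = update["desc"]
        # Only tool.args is rendered into prompts, so argument descriptions
        # must be written into the args schema, not just tool.arg_desc.
        for param, desc in update.get("args", {}).items():
            if desc and param in tool.args:
                tool.args[param]["description"] = desc
```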

### When to Use Tool Optimization

Set `enable_tool_optimization=True` when tools are central to your program's behavior and you want GEPA to optimize predictor instructions and tool descriptions as one coordinated set. Common scenarios:

1. **Wrong tool selection** - A predictor with `search` and `weather` tools keeps searching when it should check the weather, or vice versa. GEPA refines predictor instructions and tool descriptions to clarify when to use each tool.
2. **Underused tools** - The predictor responds "I don't know" without using available tools that could answer the question. GEPA improves predictor instructions to be more proactive about tool usage.
3. **Tool call loops** - The agent keeps calling `web_search` with similar queries instead of synthesizing information. GEPA improves instructions to encourage synthesis and tool descriptions to clarify when searches are sufficient.
4. **Extraction failures (ReAct)** - The agent executes tools correctly but fails to extract the final answer from the trajectory. GEPA improves the extract instruction to better identify and format answers from tool outputs.
5. **Multi-agent delegation** - A parent agent has delegation tools to specialized sub-agents but doesn't understand when to use each. GEPA optimizes instructions and tool descriptions across both parent and sub-agent modules for coherent delegation.

See the usage example below for tool-using programs.

### Usage Example

```python
import dspy

def search_web(query: str) -> str:
    return f"Search results for: {query}"

def get_weather(city: str) -> str:
    """Get the current weather for a city."""
    return f"The weather in {city} is sunny and 75°F"

# Create tools with basic descriptions
search_tool = dspy.Tool(search_web, name="search_web", desc="Search tool")
weather_tool = dspy.Tool(get_weather, name="get_weather", desc="Weather tool")

program = dspy.ReAct("question -> answer", tools=[search_tool, weather_tool])

# Enable tool optimization
gepa = dspy.GEPA(
    metric=my_metric,
    reflection_lm=dspy.LM(model="gpt-5-mini"),
    enable_tool_optimization=True,
    auto="medium"
)

optimized_program = gepa.compile(program, trainset=train_examples, valset=val_examples)
```
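
The example assumes a `my_metric` you supply. GEPA metrics may return a plain score or a `dspy.Prediction` carrying both a score and natural-language feedback, which gives the reflection LM more signal. A minimal sketch following GEPA's documented metric interface (the feedback text is illustrative):

```python
def my_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    correct = gold.answer.lower() in pred.answer.lower()
    if correct:
        feedback = "Correct answer."
    else:
        feedback = (
            f"Expected '{gold.answer}' but got '{pred.answer}'. "
            "Check whether the right tool was selected and its output was used."
        )
    return dspy.Prediction(score=float(correct), feedback=feedback)
```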

### Inspecting Optimized Programs

View optimization results and metadata (requires `track_stats=True`):

```python
# High-level optimization metadata
optimized_program.detailed_results
```

Access optimized instructions and tool descriptions directly:

```python
# Predictor instructions
for name, predictor in optimized_program.named_predictors():
    print(f"{name}: {predictor.signature.instructions}")

# Tool descriptions and argument descriptions
for tool_name, tool in optimized_program.tools.items():
    print(f"{tool_name}: {tool.desc}")
    for arg_name, arg_schema in tool.args.items():
        print(f"  {arg_name}: {arg_schema.get('description', 'N/A')}")
```

docs/docs/api/optimizers/GEPA/overview.md

Lines changed: 6 additions & 0 deletions
@@ -117,6 +117,12 @@ Practical Recipe for GEPA-Friendly Feedback:
- **Multi-Objective Tasks** (e.g., PUPA): Decompose aggregate scores to reveal contributions from each objective, highlighting tradeoffs (e.g., quality vs. privacy).
- **Stacked Pipelines** (e.g., code generation: parse → compile → run → profile → evaluate): Expose stage-specific failures; natural-language traces often suffice for LLM self-correction.

## Tool Optimization with GEPA

When `enable_tool_optimization=True`, GEPA jointly optimizes `dspy.ReAct` modules together with their tools: predictor instructions and tool descriptions/argument descriptions are updated as a set, based on execution traces and feedback, instead of keeping tool text fixed.

For details, examples, and the underlying design (tool discovery, naming requirements, and interaction with custom instruction proposers), see [Tool Optimization](GEPA_Advanced.md#tool-optimization).

## Custom Instruction Proposal

For advanced customization of GEPA's instruction proposal mechanism, including custom instruction proposers and component selectors, see [Advanced Features](GEPA_Advanced.md).
