Description
Problem Statement
We need to implement a system to evaluate the effectiveness, consistency, and correctness of the LLM's output when using the core workflow prompts. As these prompts drive the entire development lifecycle (from spec to validation), ensuring they reliably produce high-quality, compliant outputs is critical.
Desired Outcome
Define evaluation criteria and create test cases for the four primary prompts in the prompts/ directory.
Scope:
The evaluation framework must cover the following prompts:
- prompts/generate-spec.md
- prompts/generate-task-list-from-spec.md
- prompts/manage-tasks.md
- prompts/validate-spec-implementation.md
Evaluation Criteria:
For each prompt, we need to verify the model adheres to "Critical Constraints," "Output Requirements," and specific workflow behaviors.
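To make scoring concrete, each criterion could be recorded as structured data. The sketch below shows one possible shape in Python; the field names and example criteria are illustrative assumptions, not a settled schema.

```python
from dataclasses import dataclass

@dataclass
class RubricResult:
    """One scored criterion for one prompt run (illustrative schema, not settled)."""
    prompt: str         # e.g. "prompts/generate-spec.md"
    criterion: str      # e.g. "Scope Assessment performed"
    passed: bool
    evidence: str = ""  # pointer to the output that justifies the score

# Hypothetical usage: score one run of generate-spec against two criteria.
results = [
    RubricResult("prompts/generate-spec.md", "Scope Assessment performed", True,
                 "output explicitly labels the request 'Just Right'"),
    RubricResult("prompts/generate-spec.md", "Clarifying questions asked first", False,
                 "spec was generated without a clarifying-questions file"),
]
print(f"{sum(r.passed for r in results) / len(results):.0%} of criteria passed")
```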
1. generate-spec Evaluation
- Pre-Computation Checks:
- Scope Assessment: Did the model explicitly assess if the request is "Too Large," "Too Small," or "Just Right"?
- Clarifying Questions: Did the model stop to ask clarifying questions (using the specific file format) before generating the full spec?
- Structural Integrity:
- Does the output strictly follow the [NN]-spec-[feature-name].md template?
- Are all required sections present (Goals, User Stories, Demoable Units, etc.)?
- Content Quality:
- Demoable Units: Do units have specific "Functional Requirements" and "Proof Artifacts"?
- Junior-Friendly: Is the language clear and free of assumed technical knowledge?
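Some of the structural-integrity checks above lend themselves to automation. A minimal sketch, assuming the filename pattern is two digits plus a kebab-case feature name and that required sections appear as Markdown headings; both assumptions would need to be confirmed against the actual template.

```python
import re
from pathlib import Path

# Assumed filename pattern: [NN]-spec-[feature-name].md, e.g. 03-spec-user-login.md
SPEC_NAME = re.compile(r"^\d{2}-spec-[a-z0-9-]+\.md$")

# Assumed required headings; adjust to whatever the real template defines.
REQUIRED_SECTIONS = ["Goals", "User Stories", "Demoable Units"]

def check_spec(path: Path) -> list[str]:
    """Return a list of structural-integrity failures for a generated spec file."""
    failures = []
    if not SPEC_NAME.match(path.name):
        failures.append(f"filename {path.name!r} does not match [NN]-spec-[feature-name].md")
    text = path.read_text(encoding="utf-8")
    for section in REQUIRED_SECTIONS:
        # Accept any heading level, e.g. "## Goals" or "### Goals".
        if not re.search(rf"^#+\s*{re.escape(section)}\b", text, flags=re.MULTILINE):
            failures.append(f"missing required section: {section}")
    return failures
```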
2. generate-task-list-from-spec Evaluation
- Process Adherence:
- Two-Phase Generation: Did the model strictly stop after generating "Parent Tasks" to await user confirmation before generating "Sub-tasks"?
- Task Quality:
- Demoable Logic: Does each parent task represent a demoable unit of work?
- Proof Artifacts: Does every parent task include specific, verifiable Proof Artifacts (screenshots, CLI outputs, etc.)?
- Format Compliance:
- Does the file output match the defined Markdown structure for tasks and relevant files?
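The Proof Artifacts requirement could also be spot-checked by script. A minimal sketch, assuming parent tasks appear as top-level checkbox items like `- [ ] 1.0 Title` and cite their proof artifacts on an indented `Proof Artifacts:` line; both layout details are guesses about the defined Markdown structure.

```python
import re

def parent_tasks_missing_proofs(task_list_md: str) -> list[str]:
    """Return parent task titles that have no 'Proof Artifacts:' line beneath them."""
    missing = []
    # Assumed layout: parent tasks look like "- [ ] 1.0 Title" with sub-lines indented below.
    blocks = re.split(r"(?=^- \[[ x~]\] \d+\.0 )", task_list_md, flags=re.MULTILINE)
    for block in blocks:
        header = re.match(r"^- \[[ x~]\] (\d+\.0 .+)", block)
        if header and "Proof Artifacts:" not in block:
            missing.append(header.group(1).strip())
    return missing

example = """\
- [ ] 1.0 Add login form
  - Proof Artifacts: screenshot of the form, CLI output of the test run
- [ ] 2.0 Wire up session handling
"""
print(parent_tasks_missing_proofs(example))  # ['2.0 Wire up session handling']
```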
3. manage-tasks Evaluation
- Workflow Discipline:
- Checkpoints: Did the model offer the 3 checkpoint options (Continuous, Task, Batch) at the start?
- State Management: Does the model correctly update task states ([ ] -> [~] -> [x]) in the task file?
- Evidence Collection:
- Proof Generation: Did the model enforce the creation of a single [NN]-task-[TT]-proofs.md file per parent task?
- Security: Did the model check for/warn about sensitive data in proof artifacts?
- Git Hygiene:
- Are commit messages formatted correctly with references to Spec [NN] and Task T[xx]?
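If the state markers and commit conventions are stable, both could be checked mechanically. A rough sketch; the commit-reference regex in particular (`Spec [NN]`, `T[xx]`) is an assumed format that would need to match whatever manage-tasks actually prescribes.

```python
import re

# Assumed state markers from the criteria above: [ ] pending, [~] in progress, [x] done.
VALID_STATES = {" ", "~", "x"}

def invalid_task_states(task_file_md: str) -> list[str]:
    """Return checkbox lines whose state marker is not one of [ ], [~], [x]."""
    bad = []
    for line in task_file_md.splitlines():
        m = re.match(r"\s*- \[(.)\]", line)
        if m and m.group(1) not in VALID_STATES:
            bad.append(line.strip())
    return bad

# Assumed commit reference format, e.g. "feat: add login form (Spec 03, T2.1)".
COMMIT_REF = re.compile(r"\(Spec \d{2}, T\d+(\.\d+)?\)")

def commit_references_task(message: str) -> bool:
    """True if the commit message carries a Spec [NN] and Task T[xx] reference."""
    return bool(COMMIT_REF.search(message))
```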
4. validate-spec-implementation Evaluation
- Gate Enforcement:
- Did the model explicitly evaluate all Validation Gates (A-F)?
- Did it identify "Red Flags" (e.g., missing relevant files, hardcoded secrets)?
- Reporting:
- Coverage Matrix: Is the "Functional Requirements" vs. "Proof Artifacts" matrix complete?
- Evidence citation: Does the report cite specific evidence (file existence, command outputs) rather than generic statements?
- Executive Summary: Is there a clear PASS/FAIL conclusion?
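The Coverage Matrix completeness check is straightforward to automate once the matrix exists in machine-readable form. A sketch, assuming requirements are keyed by IDs like FR-1 and map to lists of cited proof artifacts; that shape is an assumption, not the prompt's defined report format.

```python
def coverage_gaps(matrix: dict[str, list[str]]) -> list[str]:
    """Return functional requirement IDs that cite no proof artifact."""
    return [req for req, proofs in matrix.items() if not proofs]

# Hypothetical matrix: FR-3 has no evidence, so coverage should not get a PASS.
matrix = {
    "FR-1": ["proofs/01-task-01-proofs.md#screenshot"],
    "FR-2": ["proofs/01-task-02-proofs.md#cli-output"],
    "FR-3": [],
}
gaps = coverage_gaps(matrix)
print("FAIL" if gaps else "PASS", gaps)  # FAIL ['FR-3']
```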
Acceptance Criteria
- A defined rubric or checklist for manually or automatically scoring outputs from each of the 4 prompts.
- A set of "Golden Datasets" (input prompts and ideal responses) for regression testing.
- Documentation added to the repo explaining how to evaluate prompt changes against these criteria.
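For the Golden Datasets item, a regression run could replay each stored input and compare freshly scored results against the recorded expectations. The sketch below assumes a `golden/<prompt>/<case>/` layout containing `input.md` and `expected_checks.json`; `run_prompt` and `score_output` stand in for whatever harness actually invokes the model and applies the rubric.

```python
import json
from pathlib import Path

def load_golden_cases(root: Path):
    """Yield (case dir, input text, expected check results) from an assumed golden/ layout."""
    for case_dir in sorted(p for p in root.glob("*/*") if p.is_dir()):
        input_md = (case_dir / "input.md").read_text(encoding="utf-8")
        expected = json.loads((case_dir / "expected_checks.json").read_text(encoding="utf-8"))
        yield case_dir, input_md, expected

def regression(root: Path, run_prompt, score_output) -> bool:
    """Re-run every golden case and report any drift from the recorded expectations."""
    ok = True
    for case_dir, input_md, expected in load_golden_cases(root):
        actual = score_output(run_prompt(input_md))
        if actual != expected:
            print(f"REGRESSION in {case_dir}: {actual} != {expected}")
            ok = False
    return ok
```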
Affected Prompts/Workflows
All prompts will be evaluated.
Additional Context
No response
Pre-Submission Checks
- I searched existing issues for duplicates