[Feature]: Add Prompt Evaluation for Spec-Driven Workflow Prompts #34

@RobertKelly

Problem Statement

We need to implement a system to evaluate the effectiveness, consistency, and correctness of the LLM's output when using the core workflow prompts. As these prompts drive the entire development lifecycle (from spec to validation), ensuring they reliably produce high-quality, compliant outputs is critical.

Desired Outcome

Define evaluation criteria and create test cases for the four primary prompts in the prompts/ directory.

Scope:
The evaluation framework must cover the following prompts:

  1. prompts/generate-spec.md
  2. prompts/generate-task-list-from-spec.md
  3. prompts/manage-tasks.md
  4. prompts/validate-spec-implementation.md

Evaluation Criteria:

For each prompt, we need to verify the model adheres to "Critical Constraints," "Output Requirements," and specific workflow behaviors.

1. generate-spec Evaluation

  • Pre-Computation Checks:
    • Scope Assessment: Did the model explicitly assess if the request is "Too Large," "Too Small," or "Just Right"?
    • Clarifying Questions: Did the model stop to ask clarifying questions (using the specific file format) before generating the full spec?
  • Structural Integrity:
    • Does the output strictly follow the [NN]-spec-[feature-name].md template?
    • Are all required sections present (Goals, User Stories, Demoable Units, etc.)?
  • Content Quality:
    • Demoable Units: Do units have specific "Functional Requirements" and "Proof Artifacts"?
    • Junior-Friendly: Is the language clear and free of assumed technical knowledge?
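
The structural checks above lend themselves to automation. A minimal sketch follows, assuming that NN is a two-digit number, that the feature name is lowercase kebab-case, and that sections appear as Markdown headings. The function name `check_spec` is hypothetical, and the section list holds only the three sections named in this issue (the real template has more).

```python
import re

# Assumption: NN is two digits and the feature name is lowercase kebab-case.
SPEC_FILENAME = re.compile(r"^\d{2}-spec-[a-z0-9]+(?:-[a-z0-9]+)*\.md$")

# Only the sections named in this issue; the actual template defines more.
REQUIRED_SECTIONS = ("Goals", "User Stories", "Demoable Units")


def check_spec(filename: str, body: str) -> dict[str, bool]:
    """Return one pass/fail entry per structural check (hypothetical helper)."""
    # Collect the text of every Markdown heading, at any level.
    headings = [
        m.group(1).lower()
        for m in re.finditer(r"^#{1,6}[ \t]+(.+?)[ \t]*$", body, flags=re.MULTILINE)
    ]
    results = {"filename": bool(SPEC_FILENAME.match(filename))}
    for section in REQUIRED_SECTIONS:
        # Substring match, so a heading like "Demoable Units of Work" still counts.
        results[f"section:{section}"] = any(section.lower() in h for h in headings)
    return results
```

Content-quality items such as "Junior-Friendly" are harder to check mechanically and would likely stay with a human or model-based grader.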

2. generate-task-list-from-spec Evaluation

  • Process Adherence:
    • Two-Phase Generation: Did the model strictly stop after generating "Parent Tasks" to await user confirmation before generating "Sub-tasks"?
  • Task Quality:
    • Demoable Logic: Does each parent task represent a demoable unit of work?
    • Proof Artifacts: Does every parent task include specific, verifiable Proof Artifacts (screenshots, CLI outputs, etc.)?
  • Format Compliance:
    • Does the file output match the defined Markdown structure for tasks and relevant files?
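
The two-phase rule can be approximated with a check on the model's first response. The sketch below assumes a numbering scheme that this issue does not specify: parent tasks written as checkbox lines numbered `1.0`, `2.0`, and so on, and sub-tasks numbered `1.1`, `1.2`. Adjust the pattern to whatever the prompt actually emits; `stopped_after_parents` is a hypothetical name.

```python
import re

# Assumed format: "- [ ] 1.0 Title" for parents, "- [ ] 1.1 Title" for sub-tasks.
TASK_LINE = re.compile(r"^[ \t]*- \[[ ~x]\] (\d+)\.(\d+)\b", re.MULTILINE)


def stopped_after_parents(first_response: str) -> bool:
    """True if the response lists at least one parent task and no sub-tasks."""
    minors = [int(m.group(2)) for m in TASK_LINE.finditer(first_response)]
    # A minor number of 0 marks a parent task under the assumed numbering.
    return bool(minors) and all(minor == 0 for minor in minors)
```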

3. manage-tasks Evaluation

  • Workflow Discipline:
    • Checkpoints: Did the model offer the 3 checkpoint options (Continuous, Task, Batch) at the start?
    • State Management: Does the model correctly update task states ([ ] -> [~] -> [x]) in the task file?
  • Evidence Collection:
    • Proof Generation: Did the model enforce the creation of a single [NN]-task-[TT]-proofs.md file per parent task?
    • Security: Did the model check for/warn about sensitive data in proof artifacts?
  • Git Hygiene:
    • Are commit messages formatted correctly with references to Spec [NN] and Task T[xx]?
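
Two of these checks are mechanical enough to sketch. The example below assumes that a task may only move one step at a time along `[ ] -> [~] -> [x]` (or stay put between snapshots), and that NN and TT are zero-padded strings. Both helper names are hypothetical.

```python
# Allowed single-step moves, taken from the "[ ] -> [~] -> [x]" sequence above.
ALLOWED = {("[ ]", "[~]"), ("[~]", "[x]")}


def valid_history(states: list[str]) -> bool:
    """Check that successive snapshots of one task only move forward one step."""
    for before, after in zip(states, states[1:]):
        if before != after and (before, after) not in ALLOWED:
            return False
    return True


def missing_proofs(spec: str, tasks: list[str], filenames: set[str]) -> list[str]:
    """Return the parent task ids that have no [NN]-task-[TT]-proofs.md file."""
    return [t for t in tasks if f"{spec}-task-{t}-proofs.md" not in filenames]
```

For example, `valid_history(["[ ]", "[x]"])` is rejected because the in-progress state was skipped. Whether that should count as a failure is a rubric decision this issue leaves open.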

4. validate-spec-implementation Evaluation

  • Gate Enforcement:
    • Did the model explicitly evaluate all Validation Gates (A-F)?
    • Did it identify "Red Flags" (e.g., missing relevant files, hardcoded secrets)?
  • Reporting:
    • Coverage Matrix: Is the "Functional Requirements" vs. "Proof Artifacts" matrix complete?
    • Evidence citation: Does the report cite specific evidence (file existence, command outputs) rather than generic statements?
    • Executive Summary: Is there a clear PASS/FAIL conclusion?
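
Gate coverage and the presence of a verdict can be spot-checked with simple pattern matching. This sketch assumes the report refers to gates literally as "Gate A" through "Gate F" and writes the conclusion as an uppercase PASS or FAIL. Neither wording is confirmed by this issue, and `check_report` is a hypothetical name.

```python
import re

GATES = "ABCDEF"


def check_report(report: str) -> dict[str, object]:
    """Flag gates the report never mentions and whether any verdict appears."""
    # Assumed wording: the report names each gate as "Gate <letter>".
    mentioned = set(re.findall(r"\bGate ([A-F])\b", report))
    return {
        "missing_gates": [g for g in GATES if g not in mentioned],
        "has_verdict": bool(re.search(r"\b(PASS|FAIL)\b", report)),
    }
```

A check like this only shows that each gate was mentioned. Whether the cited evidence is specific still needs a reviewer.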

Acceptance Criteria

  • A rubric or checklist for manually or automatically scoring outputs from each of the 4 prompts.
  • A set of "Golden Datasets" (input prompts and ideal responses) for regression testing.
  • Documentation added to the repo explaining how to evaluate prompt changes against these criteria.
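
One possible shape for the golden dataset and rubric is sketched below. All names (`GoldenCase`, `score`, `run_suite`) are hypothetical. The `generate` callable stands in for whatever runs a prompt against the model, so the sketch makes no assumption about a specific LLM client.

```python
from dataclasses import dataclass
from typing import Callable

Check = Callable[[str], bool]


@dataclass
class GoldenCase:
    prompt_file: str          # e.g. "prompts/generate-spec.md"
    user_input: str           # the request fed to the model
    ideal_response: str       # reference output, kept for manual comparison
    checks: dict[str, Check]  # rubric item name -> predicate over the output


def score(case: GoldenCase, output: str) -> dict[str, bool]:
    """Apply every rubric item of one case to a model output."""
    return {name: check(output) for name, check in case.checks.items()}


def run_suite(cases: list[GoldenCase], generate: Callable[[str, str], str]) -> float:
    """Return the fraction of rubric items passed across all cases."""
    results = [
        passed
        for case in cases
        for passed in score(case, generate(case.prompt_file, case.user_input)).values()
    ]
    return sum(results) / len(results) if results else 0.0
```

Running the suite before and after a prompt change, and comparing the two fractions, gives a rough regression signal. Per-item results from `score` show which rubric items changed.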

Affected Prompts/Workflows

All four prompts listed under Scope will be evaluated.

Additional Context

No response

Pre-Submission Checks

  • I searched existing issues for duplicates

Metadata

Labels: enhancement (New feature or request)