Description
Problem Statement
We need to implement a system to evaluate the effectiveness, consistency, and correctness of the LLM's output when using the core workflow prompts. As these prompts drive the entire development lifecycle (from spec to validation), ensuring they reliably produce high-quality, compliant outputs is critical.
Desired Outcome
Define evaluation criteria and create test cases for the four primary prompts in the prompts/ directory.
Scope:
The evaluation framework must cover the following prompts:
- prompts/generate-spec.md
- prompts/generate-task-list-from-spec.md
- prompts/manage-tasks.md
- prompts/validate-spec-implementation.md
Evaluation Criteria:
For each prompt, we need to verify the model adheres to "Critical Constraints," "Output Requirements," and specific workflow behaviors.
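To make scoring concrete, each criterion could be recorded as structured data. The sketch below shows one possible shape in Python; the field names and example criteria are illustrative assumptions, not a settled schema.

```python
from dataclasses import dataclass

@dataclass
class RubricResult:
    """One scored criterion for one prompt run (illustrative schema, not settled)."""
    prompt: str         # e.g. "prompts/generate-spec.md"
    criterion: str      # e.g. "Scope Assessment performed"
    passed: bool
    evidence: str = ""  # pointer to the output that justifies the score

# Hypothetical usage: score one run of generate-spec against two criteria.
results = [
    RubricResult("prompts/generate-spec.md", "Scope Assessment performed", True,
                 "output explicitly labels the request 'Just Right'"),
    RubricResult("prompts/generate-spec.md", "Clarifying questions asked first", False,
                 "spec was generated without a clarifying-questions file"),
]
print(f"{sum(r.passed for r in results) / len(results):.0%} of criteria passed")
```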
1. generate-spec Evaluation
- Pre-Computation Checks:
- Scope Assessment: Did the model explicitly assess if the request is "Too Large," "Too Small," or "Just Right"?
- Clarifying Questions: Did the model stop to ask clarifying questions (using the specific file format) before generating the full spec?
- Structural Integrity:
- Does the output strictly follow the [NN]-spec-[feature-name].md template?
- Are all required sections present (Goals, User Stories, Demoable Units, etc.)?
- Content Quality:
- Demoable Units: Do units have specific "Functional Requirements" and "Proof Artifacts"?
- Junior-Friendly: Is the language clear and free of assumed technical knowledge?
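Some of the structural-integrity checks above lend themselves to automation. A minimal sketch, assuming the filename pattern is two digits plus a kebab-case feature name and that required sections appear as Markdown headings; both assumptions would need to be confirmed against the actual template.

```python
import re
from pathlib import Path

# Assumed filename pattern: [NN]-spec-[feature-name].md, e.g. 03-spec-user-login.md
SPEC_NAME = re.compile(r"^\d{2}-spec-[a-z0-9-]+\.md$")

# Assumed required headings; adjust to whatever the real template defines.
REQUIRED_SECTIONS = ["Goals", "User Stories", "Demoable Units"]

def check_spec(path: Path) -> list[str]:
    """Return a list of structural-integrity failures for a generated spec file."""
    failures = []
    if not SPEC_NAME.match(path.name):
        failures.append(f"filename {path.name!r} does not match [NN]-spec-[feature-name].md")
    text = path.read_text(encoding="utf-8")
    for section in REQUIRED_SECTIONS:
        # Accept any heading level, e.g. "## Goals" or "### Goals".
        if not re.search(rf"^#+\s*{re.escape(section)}\b", text, flags=re.MULTILINE):
            failures.append(f"missing required section: {section}")
    return failures
```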
2. generate-task-list-from-spec Evaluation
- Process Adherence:
- Two-Phase Generation: Did the model strictly stop after generating "Parent Tasks" to await user confirmation before generating "Sub-tasks"?
- Task Quality:
- Demoable Logic: Does each parent task represent a demoable unit of work?
- Proof Artifacts: Does every parent task include specific, verifiable Proof Artifacts (screenshots, CLI outputs, etc.)?
- Format Compliance:
- Does the file output match the defined Markdown structure for tasks and relevant files?
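The Proof Artifacts requirement could also be spot-checked by script. A minimal sketch, assuming parent tasks appear as top-level checkbox items like `- [ ] 1.0 Title` and cite their proof artifacts on an indented `Proof Artifacts:` line; both layout details are guesses about the defined Markdown structure.

```python
import re

def parent_tasks_missing_proofs(task_list_md: str) -> list[str]:
    """Return parent task titles that have no 'Proof Artifacts:' line beneath them."""
    missing = []
    # Assumed layout: parent tasks look like "- [ ] 1.0 Title" with sub-lines indented below.
    blocks = re.split(r"(?=^- \[[ x~]\] \d+\.0 )", task_list_md, flags=re.MULTILINE)
    for block in blocks:
        header = re.match(r"^- \[[ x~]\] (\d+\.0 .+)", block)
        if header and "Proof Artifacts:" not in block:
            missing.append(header.group(1).strip())
    return missing

example = """\
- [ ] 1.0 Add login form
  - Proof Artifacts: screenshot of the form, CLI output of the test run
- [ ] 2.0 Wire up session handling
"""
print(parent_tasks_missing_proofs(example))  # ['2.0 Wire up session handling']
```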
3. manage-tasks Evaluation
- Workflow Discipline:
- Checkpoints: Did the model offer the 3 checkpoint options (Continuous, Task, Batch) at the start?
- State Management: Does the model correctly update task states ([ ] -> [~] -> [x]) in the task file?
- Evidence Collection:
- Proof Generation: Did the model enforce the creation of a single [NN]-task-[TT]-proofs.md file per parent task?
- Security: Did the model check for/warn about sensitive data in proof artifacts?
- Git Hygiene:
- Are commit messages formatted correctly with references to Spec [NN] and Task T[xx]?
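If the state markers and commit conventions are stable, both could be checked mechanically. A rough sketch; the commit-reference regex in particular (`Spec [NN]`, `T[xx]`) is an assumed format that would need to match whatever manage-tasks actually prescribes.

```python
import re

# Assumed state markers from the criteria above: [ ] pending, [~] in progress, [x] done.
VALID_STATES = {" ", "~", "x"}

def invalid_task_states(task_file_md: str) -> list[str]:
    """Return checkbox lines whose state marker is not one of [ ], [~], [x]."""
    bad = []
    for line in task_file_md.splitlines():
        m = re.match(r"\s*- \[(.)\]", line)
        if m and m.group(1) not in VALID_STATES:
            bad.append(line.strip())
    return bad

# Assumed commit reference format, e.g. "feat: add login form (Spec 03, T2.1)".
COMMIT_REF = re.compile(r"\(Spec \d{2}, T\d+(\.\d+)?\)")

def commit_references_task(message: str) -> bool:
    """True if the commit message carries a Spec [NN] and Task T[xx] reference."""
    return bool(COMMIT_REF.search(message))
```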
4. validate-spec-implementation Evaluation
- Gate Enforcement:
- Did the model explicitly evaluate all Validation Gates (A-F)?
- Did it identify "Red Flags" (e.g., missing relevant files, hardcoded secrets)?
- Reporting:
- Coverage Matrix: Is the "Functional Requirements" vs. "Proof Artifacts" matrix complete?
- Evidence citation: Does the report cite specific evidence (file existence, command outputs) rather than generic statements?
- Executive Summary: Is there a clear PASS/FAIL conclusion?
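The Coverage Matrix completeness check is straightforward to automate once the matrix exists in machine-readable form. A sketch, assuming requirements are keyed by IDs like FR-1 and map to lists of cited proof artifacts; that shape is an assumption, not the prompt's defined report format.

```python
def coverage_gaps(matrix: dict[str, list[str]]) -> list[str]:
    """Return functional requirement IDs that cite no proof artifact."""
    return [req for req, proofs in matrix.items() if not proofs]

# Hypothetical matrix: FR-3 has no evidence, so coverage should not get a PASS.
matrix = {
    "FR-1": ["proofs/01-task-01-proofs.md#screenshot"],
    "FR-2": ["proofs/01-task-02-proofs.md#cli-output"],
    "FR-3": [],
}
gaps = coverage_gaps(matrix)
print("FAIL" if gaps else "PASS", gaps)  # FAIL ['FR-3']
```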
Acceptance Criteria
- A defined rubric or checklist for manually or automatically scoring outputs from each of the 4 prompts.
- A set of "Golden Datasets" (input prompts and ideal responses) for regression testing.
- Documentation added to the repo explaining how to evaluate prompt changes against these criteria.
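For the Golden Datasets item, a regression run could replay each stored input and compare freshly scored results against the recorded expectations. The sketch below assumes a `golden/<prompt>/<case>/` layout containing `input.md` and `expected_checks.json`; `run_prompt` and `score_output` stand in for whatever harness actually invokes the model and applies the rubric.

```python
import json
from pathlib import Path

def load_golden_cases(root: Path):
    """Yield (case dir, input text, expected check results) from an assumed golden/ layout."""
    for case_dir in sorted(p for p in root.glob("*/*") if p.is_dir()):
        input_md = (case_dir / "input.md").read_text(encoding="utf-8")
        expected = json.loads((case_dir / "expected_checks.json").read_text(encoding="utf-8"))
        yield case_dir, input_md, expected

def regression(root: Path, run_prompt, score_output) -> bool:
    """Re-run every golden case and report any drift from the recorded expectations."""
    ok = True
    for case_dir, input_md, expected in load_golden_cases(root):
        actual = score_output(run_prompt(input_md))
        if actual != expected:
            print(f"REGRESSION in {case_dir}: {actual} != {expected}")
            ok = False
    return ok
```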
Affected Prompts/Workflows
All prompts will be evaluated.
Additional Context
No response
Pre-Submission Checks
- I searched existing issues for duplicates