diff --git a/components/skills/iac-terraform-modules-eng/SKILL.md b/components/skills/iac-terraform-modules-eng/SKILL.md new file mode 100644 index 0000000..5069e70 --- /dev/null +++ b/components/skills/iac-terraform-modules-eng/SKILL.md @@ -0,0 +1,249 @@ +--- +name: iac-terraform-modules-eng +description: Build reusable Terraform modules for AWS, Azure, and GCP infrastructure following infrastructure-as-code best practices. Use when creating infrastructure modules, standardizing cloud provisioning, or implementing reusable IaC components. +--- + +# Terraform Module Library + +Production-ready Terraform module patterns for AWS, Azure, and GCP infrastructure. + +## Purpose + +Create reusable, well-tested Terraform modules for common cloud infrastructure patterns across multiple cloud providers. + +## When to Use + +- Build reusable infrastructure components +- Standardize cloud resource provisioning +- Implement infrastructure as code best practices +- Create multi-cloud compatible modules +- Establish organizational Terraform standards + +## Module Structure + +``` +terraform-modules/ +├── aws/ +│ ├── vpc/ +│ ├── eks/ +│ ├── rds/ +│ └── s3/ +├── azure/ +│ ├── vnet/ +│ ├── aks/ +│ └── storage/ +└── gcp/ + ├── vpc/ + ├── gke/ + └── cloud-sql/ +``` + +## Standard Module Pattern + +``` +module-name/ +├── main.tf # Main resources +├── variables.tf # Input variables +├── outputs.tf # Output values +├── versions.tf # Provider versions +├── README.md # Documentation +├── examples/ # Usage examples +│ └── complete/ +│ ├── main.tf +│ └── variables.tf +└── tests/ # Terratest files + └── module_test.go +``` + +## AWS VPC Module Example + +**main.tf:** +```hcl +resource "aws_vpc" "main" { + cidr_block = var.cidr_block + enable_dns_hostnames = var.enable_dns_hostnames + enable_dns_support = var.enable_dns_support + + tags = merge( + { + Name = var.name + }, + var.tags + ) +} + +resource "aws_subnet" "private" { + count = length(var.private_subnet_cidrs) + vpc_id = aws_vpc.main.id + cidr_block = var.private_subnet_cidrs[count.index] + availability_zone = var.availability_zones[count.index] + + tags = merge( + { + Name = "${var.name}-private-${count.index + 1}" + Tier = "private" + }, + var.tags + ) +} + +resource "aws_internet_gateway" "main" { + count = var.create_internet_gateway ? 1 : 0 + vpc_id = aws_vpc.main.id + + tags = merge( + { + Name = "${var.name}-igw" + }, + var.tags + ) +} +``` + +**variables.tf:** +```hcl +variable "name" { + description = "Name of the VPC" + type = string +} + +variable "cidr_block" { + description = "CIDR block for VPC" + type = string + validation { + condition = can(regex("^([0-9]{1,3}\\.){3}[0-9]{1,3}/[0-9]{1,2}$", var.cidr_block)) + error_message = "CIDR block must be valid IPv4 CIDR notation." 
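+    # Note: this pattern checks CIDR shape only; it does not reject octets above 255.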
+  }
+}
+
+variable "availability_zones" {
+  description = "List of availability zones"
+  type        = list(string)
+}
+
+variable "private_subnet_cidrs" {
+  description = "CIDR blocks for private subnets"
+  type        = list(string)
+  default     = []
+}
+
+variable "enable_dns_hostnames" {
+  description = "Enable DNS hostnames in VPC"
+  type        = bool
+  default     = true
+}
+
+variable "enable_dns_support" {
+  description = "Enable DNS support in VPC"
+  type        = bool
+  default     = true
+}
+
+variable "create_internet_gateway" {
+  description = "Whether to create an Internet Gateway"
+  type        = bool
+  default     = true
+}
+
+variable "tags" {
+  description = "Additional tags"
+  type        = map(string)
+  default     = {}
+}
+```
+
+**outputs.tf:**
+```hcl
+output "vpc_id" {
+  description = "ID of the VPC"
+  value       = aws_vpc.main.id
+}
+
+output "private_subnet_ids" {
+  description = "IDs of private subnets"
+  value       = aws_subnet.private[*].id
+}
+
+output "vpc_cidr_block" {
+  description = "CIDR block of VPC"
+  value       = aws_vpc.main.cidr_block
+}
+```
+
+## Best Practices
+
+1. **Use semantic versioning** for modules
+2. **Document all variables** with descriptions
+3. **Provide examples** in examples/ directory
+4. **Use validation blocks** for input validation
+5. **Output important attributes** for module composition
+6. **Pin provider versions** in versions.tf
+7. **Use locals** for computed values
+8. **Implement conditional resources** with count/for_each
+9. **Test modules** with Terratest
+10. **Tag all resources** consistently
+
+## Module Composition
+
+```hcl
+module "vpc" {
+  source = "../../modules/aws/vpc"
+
+  name               = "production"
+  cidr_block         = "10.0.0.0/16"
+  availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"]
+
+  private_subnet_cidrs = [
+    "10.0.1.0/24",
+    "10.0.2.0/24",
+    "10.0.3.0/24"
+  ]
+
+  tags = {
+    Environment = "production"
+    ManagedBy   = "terraform"
+  }
+}
+
+module "rds" {
+  source = "../../modules/aws/rds"
+
+  identifier     = "production-db"
+  engine         = "postgres"
+  engine_version = "15.3"
+  instance_class = "db.t3.large"
+
+  vpc_id     = module.vpc.vpc_id
+  subnet_ids = module.vpc.private_subnet_ids
+
+  tags = {
+    Environment = "production"
+  }
+}
+```
+
+## Reference Files
+
+- `assets/vpc-module/` - Complete VPC module example
+- `assets/rds-module/` - RDS module example
+- `references/aws-modules.md` - AWS module patterns
+- `references/azure-modules.md` - Azure module patterns
+- `references/gcp-modules.md` - GCP module patterns
+
+## Testing
+
+```go
+// tests/vpc_test.go
+package test
+
+import (
+    "testing"
+
+    "github.com/gruntwork-io/terratest/modules/terraform"
+    "github.com/stretchr/testify/assert"
+)
+
+func TestVPCModule(t *testing.T) {
+    terraformOptions := &terraform.Options{
+        TerraformDir: "../examples/complete",
+    }
+
+    defer terraform.Destroy(t, terraformOptions)
+    terraform.InitAndApply(t, terraformOptions)
+
+    vpcID := terraform.Output(t, terraformOptions, "vpc_id")
+    assert.NotEmpty(t, vpcID)
+}
+```
+
+## Related Skills
+
+- `multi-cloud-architecture` - For architectural decisions
+- `cost-optimization` - For cost-effective designs
diff --git a/components/skills/iac-terraform-modules-eng/references/aws-modules.md b/components/skills/iac-terraform-modules-eng/references/aws-modules.md
new file mode 100644
index 0000000..f79bb04
--- /dev/null
+++ b/components/skills/iac-terraform-modules-eng/references/aws-modules.md
@@ -0,0 +1,63 @@
+# AWS Terraform Module Patterns
+
+## VPC Module
+- VPC with public/private subnets
+- Internet Gateway and NAT Gateways
+- Route tables and associations
+- Network ACLs
+- VPC Flow Logs
+
+## EKS Module
+- EKS cluster with managed node groups
+- IRSA (IAM Roles for Service Accounts)
+- Cluster autoscaler
+- VPC CNI configuration
+- Cluster logging
+
+## RDS Module
+- RDS instance or cluster
+- Automated backups +- Read replicas +- Parameter groups +- Subnet groups +- Security groups + +## S3 Module +- S3 bucket with versioning +- Encryption at rest +- Bucket policies +- Lifecycle rules +- Replication configuration + +## ALB Module +- Application Load Balancer +- Target groups +- Listener rules +- SSL/TLS certificates +- Access logs + +## Lambda Module +- Lambda function +- IAM execution role +- CloudWatch Logs +- Environment variables +- VPC configuration (optional) + +## Security Group Module +- Reusable security group rules +- Ingress/egress rules +- Dynamic rule creation +- Rule descriptions + +## Best Practices + +1. Use AWS provider version ~> 5.0 +2. Enable encryption by default +3. Use least-privilege IAM +4. Tag all resources consistently +5. Enable logging and monitoring +6. Use KMS for encryption +7. Implement backup strategies +8. Use PrivateLink when possible +9. Enable GuardDuty/SecurityHub +10. Follow AWS Well-Architected Framework diff --git a/components/skills/llm-mcp-builder-dev/LICENSE.txt b/components/skills/llm-mcp-builder-dev/LICENSE.txt new file mode 100644 index 0000000..7a4a3ea --- /dev/null +++ b/components/skills/llm-mcp-builder-dev/LICENSE.txt @@ -0,0 +1,202 @@ + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. 
+ + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. \ No newline at end of file diff --git a/components/skills/llm-mcp-builder-dev/SKILL.md b/components/skills/llm-mcp-builder-dev/SKILL.md new file mode 100644 index 0000000..6c3168b --- /dev/null +++ b/components/skills/llm-mcp-builder-dev/SKILL.md @@ -0,0 +1,236 @@ +--- +name: llm-mcp-builder-dev +description: Guide for creating high-quality MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. Use when building MCP servers to integrate external APIs or services, whether in Python (FastMCP) or Node/TypeScript (MCP SDK). +license: Complete terms in LICENSE.txt +--- + +# MCP Server Development Guide + +## Overview + +Create MCP (Model Context Protocol) servers that enable LLMs to interact with external services through well-designed tools. The quality of an MCP server is measured by how well it enables LLMs to accomplish real-world tasks. 
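+
+For orientation, here is roughly what the end product looks like. This is a minimal sketch with assumed names (there is no real "Example" service); the full patterns come later in this guide and its references:
+
+```typescript
+import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
+import { z } from "zod";
+
+const server = new McpServer({ name: "example-mcp-server", version: "1.0.0" });
+
+server.registerTool(
+  "example_get_status",
+  {
+    title: "Get Example Status",
+    description: "Fetch the current status of the hypothetical Example service.",
+    inputSchema: { verbose: z.boolean().default(false).describe("Include per-check detail") }
+  },
+  // The handler returns text content; later sections add structured output.
+  async ({ verbose }) => ({
+    content: [{ type: "text", text: verbose ? "status: ok (all checks passed)" : "ok" }]
+  })
+);
+```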
+ +--- + +# Process + +## 🚀 High-Level Workflow + +Creating a high-quality MCP server involves four main phases: + +### Phase 1: Deep Research and Planning + +#### 1.1 Understand Modern MCP Design + +**API Coverage vs. Workflow Tools:** +Balance comprehensive API endpoint coverage with specialized workflow tools. Workflow tools can be more convenient for specific tasks, while comprehensive coverage gives agents flexibility to compose operations. Performance varies by client—some clients benefit from code execution that combines basic tools, while others work better with higher-level workflows. When uncertain, prioritize comprehensive API coverage. + +**Tool Naming and Discoverability:** +Clear, descriptive tool names help agents find the right tools quickly. Use consistent prefixes (e.g., `github_create_issue`, `github_list_repos`) and action-oriented naming. + +**Context Management:** +Agents benefit from concise tool descriptions and the ability to filter/paginate results. Design tools that return focused, relevant data. Some clients support code execution which can help agents filter and process data efficiently. + +**Actionable Error Messages:** +Error messages should guide agents toward solutions with specific suggestions and next steps. + +#### 1.2 Study MCP Protocol Documentation + +**Navigate the MCP specification:** + +Start with the sitemap to find relevant pages: `https://modelcontextprotocol.io/sitemap.xml` + +Then fetch specific pages with `.md` suffix for markdown format (e.g., `https://modelcontextprotocol.io/specification/draft.md`). + +Key pages to review: +- Specification overview and architecture +- Transport mechanisms (streamable HTTP, stdio) +- Tool, resource, and prompt definitions + +#### 1.3 Study Framework Documentation + +**Recommended stack:** +- **Language**: TypeScript (high-quality SDK support and good compatibility in many execution environments e.g. MCPB. Plus AI models are good at generating TypeScript code, benefiting from its broad usage, static typing and good linting tools) +- **Transport**: Streamable HTTP for remote servers, using stateless JSON (simpler to scale and maintain, as opposed to stateful sessions and streaming responses). stdio for local servers. + +**Load framework documentation:** + +- **MCP Best Practices**: [📋 View Best Practices](./reference/mcp_best_practices.md) - Core guidelines + +**For TypeScript (recommended):** +- **TypeScript SDK**: Use WebFetch to load `https://raw.githubusercontent.com/modelcontextprotocol/typescript-sdk/main/README.md` +- [⚡ TypeScript Guide](./reference/node_mcp_server.md) - TypeScript patterns and examples + +**For Python:** +- **Python SDK**: Use WebFetch to load `https://raw.githubusercontent.com/modelcontextprotocol/python-sdk/main/README.md` +- [🐍 Python Guide](./reference/python_mcp_server.md) - Python patterns and examples + +#### 1.4 Plan Your Implementation + +**Understand the API:** +Review the service's API documentation to identify key endpoints, authentication requirements, and data models. Use web search and WebFetch as needed. + +**Tool Selection:** +Prioritize comprehensive API coverage. List endpoints to implement, starting with the most common operations. 
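+
+Before writing any server code, it can help to sketch this endpoint-to-tool mapping explicitly. The sketch below is a hypothetical planning artifact; the endpoints and tool names are assumptions for illustration, not a real API:
+
+```typescript
+// Hypothetical plan for an "Example" service, ordered by expected usage.
+interface PlannedTool {
+  endpoint: string;  // upstream REST endpoint the tool will wrap
+  toolName: string;  // snake_case, service-prefixed MCP tool name
+  readOnly: boolean; // will drive the readOnlyHint annotation later
+}
+
+const plannedTools: PlannedTool[] = [
+  { endpoint: "GET /users",        toolName: "example_search_users",   readOnly: true },
+  { endpoint: "GET /projects/:id", toolName: "example_get_project",    readOnly: true },
+  { endpoint: "POST /projects",    toolName: "example_create_project", readOnly: false }
+];
+
+// Quick consistency check: every planned name carries the service prefix.
+console.assert(plannedTools.every((t) => t.toolName.startsWith("example_")));
+```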
+
+---
+
+### Phase 2: Implementation
+
+#### 2.1 Set Up Project Structure
+
+See language-specific guides for project setup:
+- [⚡ TypeScript Guide](./reference/node_mcp_server.md) - Project structure, package.json, tsconfig.json
+- [🐍 Python Guide](./reference/python_mcp_server.md) - Module organization, dependencies
+
+#### 2.2 Implement Core Infrastructure
+
+Create shared utilities:
+- API client with authentication
+- Error handling helpers
+- Response formatting (JSON/Markdown)
+- Pagination support
+
+#### 2.3 Implement Tools
+
+For each tool:
+
+**Input Schema:**
+- Use Zod (TypeScript) or Pydantic (Python)
+- Include constraints and clear descriptions
+- Add examples in field descriptions
+
+**Output Schema:**
+- Define `outputSchema` where possible for structured data
+- Use `structuredContent` in tool responses (TypeScript SDK feature)
+- Helps clients understand and process tool outputs
+
+**Tool Description:**
+- Concise summary of functionality
+- Parameter descriptions
+- Return type schema
+
+**Implementation:**
+- Async/await for I/O operations
+- Proper error handling with actionable messages
+- Support pagination where applicable
+- Return both text content and structured data when using modern SDKs
+
+**Annotations:**
+- `readOnlyHint`: true/false
+- `destructiveHint`: true/false
+- `idempotentHint`: true/false
+- `openWorldHint`: true/false
+
+---
+
+### Phase 3: Review and Test
+
+#### 3.1 Code Quality
+
+Review for:
+- No duplicated code (DRY principle)
+- Consistent error handling
+- Full type coverage
+- Clear tool descriptions
+
+#### 3.2 Build and Test
+
+**TypeScript:**
+- Run `npm run build` to verify compilation
+- Test with MCP Inspector: `npx @modelcontextprotocol/inspector`
+
+**Python:**
+- Verify syntax: `python -m py_compile your_server.py`
+- Test with MCP Inspector
+
+See language-specific guides for detailed testing approaches and quality checklists.
+
+---
+
+### Phase 4: Create Evaluations
+
+After implementing your MCP server, create comprehensive evaluations to test its effectiveness.
+
+**Load [✅ Evaluation Guide](./reference/evaluation.md) for complete evaluation guidelines.**
+
+#### 4.1 Understand Evaluation Purpose
+
+Use evaluations to test whether LLMs can effectively use your MCP server to answer realistic, complex questions.
+
+#### 4.2 Create 10 Evaluation Questions
+
+To create effective evaluations, follow the process outlined in the evaluation guide:
+
+1. **Tool Inspection**: List available tools and understand their capabilities
+2. **Content Exploration**: Use READ-ONLY operations to explore available data
+3. **Question Generation**: Create 10 complex, realistic questions
+4. **Answer Verification**: Solve each question yourself to verify answers
+
+#### 4.3 Evaluation Requirements
+
+Ensure each question is:
+- **Independent**: Not dependent on other questions
+- **Read-only**: Only non-destructive operations required
+- **Complex**: Requiring multiple tool calls and deep exploration
+- **Realistic**: Based on real use cases humans would care about
+- **Verifiable**: Single, clear answer that can be verified by string comparison
+- **Stable**: Answer won't change over time
+
+#### 4.4 Output Format
+
+Create an XML file with this structure:
+
+```xml
+<evaluation>
+  <qa_pair>
+    <question>Find discussions about AI model launches with animal codenames. One model needed a specific safety designation that uses the format ASL-X. What number X was being determined for the model named after a spotted wild cat?</question>
+    <answer>3</answer>
+  </qa_pair>
+  <!-- ...nine more qa_pairs like the one above... -->
+</evaluation>
+```
+
+---
+
+# Reference Files
+
+## 📚 Documentation Library
+
+Load these resources as needed during development:
+
+### Core MCP Documentation (Load First)
+- **MCP Protocol**: Start with sitemap at `https://modelcontextprotocol.io/sitemap.xml`, then fetch specific pages with `.md` suffix
+- [📋 MCP Best Practices](./reference/mcp_best_practices.md) - Universal MCP guidelines including:
+  - Server and tool naming conventions
+  - Response format guidelines (JSON vs Markdown)
+  - Pagination best practices
+  - Transport selection (streamable HTTP vs stdio)
+  - Security and error handling standards
+
+### SDK Documentation (Load During Phase 1/2)
+- **Python SDK**: Fetch from `https://raw.githubusercontent.com/modelcontextprotocol/python-sdk/main/README.md`
+- **TypeScript SDK**: Fetch from `https://raw.githubusercontent.com/modelcontextprotocol/typescript-sdk/main/README.md`
+
+### Language-Specific Implementation Guides (Load During Phase 2)
+- [🐍 Python Implementation Guide](./reference/python_mcp_server.md) - Complete Python/FastMCP guide with:
+  - Server initialization patterns
+  - Pydantic model examples
+  - Tool registration with `@mcp.tool`
+  - Complete working examples
+  - Quality checklist
+
+- [⚡ TypeScript Implementation Guide](./reference/node_mcp_server.md) - Complete TypeScript guide with:
+  - Project structure
+  - Zod schema patterns
+  - Tool registration with `server.registerTool`
+  - Complete working examples
+  - Quality checklist
+
+### Evaluation Guide (Load During Phase 4)
+- [✅ Evaluation Guide](./reference/evaluation.md) - Complete evaluation creation guide with:
+  - Question creation guidelines
+  - Answer verification strategies
+  - XML format specifications
+  - Example questions and answers
+  - Running an evaluation with the provided scripts
diff --git a/components/skills/llm-mcp-builder-dev/reference/evaluation.md b/components/skills/llm-mcp-builder-dev/reference/evaluation.md
new file mode 100644
index 0000000..87e9bb7
--- /dev/null
+++ b/components/skills/llm-mcp-builder-dev/reference/evaluation.md
@@ -0,0 +1,602 @@
+# MCP Server Evaluation Guide
+
+## Overview
+
+This document provides guidance on creating comprehensive evaluations for MCP servers. Evaluations test whether LLMs can effectively use your MCP server to answer realistic, complex questions using only the tools provided.
+
+---
+
+## Quick Reference
+
+### Evaluation Requirements
+- Create 10 human-readable questions
+- Questions must be READ-ONLY, INDEPENDENT, NON-DESTRUCTIVE
+- Each question requires multiple tool calls (potentially dozens)
+- Answers must be single, verifiable values
+- Answers must be STABLE (won't change over time)
+
+### Output Format
+```xml
+<evaluation>
+  <qa_pair>
+    <question>Your question here</question>
+    <answer>Single verifiable answer</answer>
+  </qa_pair>
+</evaluation>
+```
+
+---
+
+## Purpose of Evaluations
+
+The measure of quality of an MCP server is NOT how well or comprehensively the server implements tools, but how well these implementations (input/output schemas, docstrings/descriptions, functionality) enable LLMs with no other context and access ONLY to the MCP servers to answer realistic and difficult questions.
+
+## Evaluation Overview
+
+Create 10 human-readable questions requiring ONLY READ-ONLY, INDEPENDENT, NON-DESTRUCTIVE, and IDEMPOTENT operations to answer. Each question should be:
+- Realistic
+- Clear and concise
+- Unambiguous
+- Complex, requiring potentially dozens of tool calls or steps
+- Answerable with a single, verifiable value that you identify in advance
+
+## Question Guidelines
+
+### Core Requirements
+
+1. 
**Questions MUST be independent** + - Each question should NOT depend on the answer to any other question + - Should not assume prior write operations from processing another question + +2. **Questions MUST require ONLY NON-DESTRUCTIVE AND IDEMPOTENT tool use** + - Should not instruct or require modifying state to arrive at the correct answer + +3. **Questions must be REALISTIC, CLEAR, CONCISE, and COMPLEX** + - Must require another LLM to use multiple (potentially dozens of) tools or steps to answer + +### Complexity and Depth + +4. **Questions must require deep exploration** + - Consider multi-hop questions requiring multiple sub-questions and sequential tool calls + - Each step should benefit from information found in previous questions + +5. **Questions may require extensive paging** + - May need paging through multiple pages of results + - May require querying old data (1-2 years out-of-date) to find niche information + - The questions must be DIFFICULT + +6. **Questions must require deep understanding** + - Rather than surface-level knowledge + - May pose complex ideas as True/False questions requiring evidence + - May use multiple-choice format where LLM must search different hypotheses + +7. **Questions must not be solvable with straightforward keyword search** + - Do not include specific keywords from the target content + - Use synonyms, related concepts, or paraphrases + - Require multiple searches, analyzing multiple related items, extracting context, then deriving the answer + +### Tool Testing + +8. **Questions should stress-test tool return values** + - May elicit tools returning large JSON objects or lists, overwhelming the LLM + - Should require understanding multiple modalities of data: + - IDs and names + - Timestamps and datetimes (months, days, years, seconds) + - File IDs, names, extensions, and mimetypes + - URLs, GIDs, etc. + - Should probe the tool's ability to return all useful forms of data + +9. **Questions should MOSTLY reflect real human use cases** + - The kinds of information retrieval tasks that HUMANS assisted by an LLM would care about + +10. **Questions may require dozens of tool calls** + - This challenges LLMs with limited context + - Encourages MCP server tools to reduce information returned + +11. **Include ambiguous questions** + - May be ambiguous OR require difficult decisions on which tools to call + - Force the LLM to potentially make mistakes or misinterpret + - Ensure that despite AMBIGUITY, there is STILL A SINGLE VERIFIABLE ANSWER + +### Stability + +12. **Questions must be designed so the answer DOES NOT CHANGE** + - Do not ask questions that rely on "current state" which is dynamic + - For example, do not count: + - Number of reactions to a post + - Number of replies to a thread + - Number of members in a channel + +13. **DO NOT let the MCP server RESTRICT the kinds of questions you create** + - Create challenging and complex questions + - Some may not be solvable with the available MCP server tools + - Questions may require specific output formats (datetime vs. epoch time, JSON vs. MARKDOWN) + - Questions may require dozens of tool calls to complete + +## Answer Guidelines + +### Verification + +1. **Answers must be VERIFIABLE via direct string comparison** + - If the answer can be re-written in many formats, clearly specify the output format in the QUESTION + - Examples: "Use YYYY/MM/DD.", "Respond True or False.", "Answer A, B, C, or D and nothing else." 
+ - Answer should be a single VERIFIABLE value such as: + - User ID, user name, display name, first name, last name + - Channel ID, channel name + - Message ID, string + - URL, title + - Numerical quantity + - Timestamp, datetime + - Boolean (for True/False questions) + - Email address, phone number + - File ID, file name, file extension + - Multiple choice answer + - Answers must not require special formatting or complex, structured output + - Answer will be verified using DIRECT STRING COMPARISON + +### Readability + +2. **Answers should generally prefer HUMAN-READABLE formats** + - Examples: names, first name, last name, datetime, file name, message string, URL, yes/no, true/false, a/b/c/d + - Rather than opaque IDs (though IDs are acceptable) + - The VAST MAJORITY of answers should be human-readable + +### Stability + +3. **Answers must be STABLE/STATIONARY** + - Look at old content (e.g., conversations that have ended, projects that have launched, questions answered) + - Create QUESTIONS based on "closed" concepts that will always return the same answer + - Questions may ask to consider a fixed time window to insulate from non-stationary answers + - Rely on context UNLIKELY to change + - Example: if finding a paper name, be SPECIFIC enough so answer is not confused with papers published later + +4. **Answers must be CLEAR and UNAMBIGUOUS** + - Questions must be designed so there is a single, clear answer + - Answer can be derived from using the MCP server tools + +### Diversity + +5. **Answers must be DIVERSE** + - Answer should be a single VERIFIABLE value in diverse modalities and formats + - User concept: user ID, user name, display name, first name, last name, email address, phone number + - Channel concept: channel ID, channel name, channel topic + - Message concept: message ID, message string, timestamp, month, day, year + +6. 
**Answers must NOT be complex structures**
+   - Not a list of values
+   - Not a complex object
+   - Not a list of IDs or strings
+   - Not natural language text
+   - UNLESS the answer can be straightforwardly verified using DIRECT STRING COMPARISON
+   - And can be realistically reproduced
+   - It should be unlikely that an LLM would return the same list in any other order or format
+
+## Evaluation Process
+
+### Step 1: Documentation Inspection
+
+Read the documentation of the target API to understand:
+- Available endpoints and functionality
+- If ambiguity exists, fetch additional information from the web
+- Parallelize this step AS MUCH AS POSSIBLE
+- Ensure each subagent is ONLY examining documentation from the file system or on the web
+
+### Step 2: Tool Inspection
+
+List the tools available in the MCP server:
+- Inspect the MCP server directly
+- Understand input/output schemas, docstrings, and descriptions
+- WITHOUT calling the tools themselves at this stage
+
+### Step 3: Developing Understanding
+
+Repeat steps 1 & 2 until you have a good understanding:
+- Iterate multiple times
+- Think about the kinds of tasks you want to create
+- Refine your understanding
+- At NO stage should you READ the code of the MCP server implementation itself
+- Use your intuition and understanding to create reasonable, realistic, but VERY challenging tasks
+
+### Step 4: Read-Only Content Inspection
+
+After understanding the API and tools, USE the MCP server tools:
+- Inspect content using READ-ONLY and NON-DESTRUCTIVE operations ONLY
+- Goal: identify specific content (e.g., users, channels, messages, projects, tasks) for creating realistic questions
+- Should NOT call any tools that modify state
+- Will NOT read the code of the MCP server implementation itself
+- Parallelize this step with individual sub-agents pursuing independent explorations
+- Ensure each subagent is only performing READ-ONLY, NON-DESTRUCTIVE, and IDEMPOTENT operations
+- BE CAREFUL: SOME TOOLS may return LOTS OF DATA which would cause you to run out of CONTEXT
+- Make INCREMENTAL, SMALL, AND TARGETED tool calls for exploration
+- In all tool call requests, use the `limit` parameter to limit results (<10)
+- Use pagination
+
+### Step 5: Task Generation
+
+After inspecting the content, create 10 human-readable questions:
+- An LLM should be able to answer these with the MCP server
+- Follow all question and answer guidelines above
+
+## Output Format
+
+Each QA pair consists of a question and an answer. The output should be an XML file with this structure:
+
+```xml
+<evaluation>
+  <qa_pair>
+    <question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
+    <answer>Website Redesign</answer>
+  </qa_pair>
+  <qa_pair>
+    <question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
+    <answer>sarah_dev</answer>
+  </qa_pair>
+  <qa_pair>
+    <question>Look for pull requests that modified files in the /api directory and were merged between January 1 and January 31, 2024. How many different contributors worked on these PRs?</question>
+    <answer>7</answer>
+  </qa_pair>
+  <qa_pair>
+    <question>Find the repository with the most stars that was created before 2023. What is the repository name?</question>
+    <answer>data-pipeline</answer>
+  </qa_pair>
+</evaluation>
+```
+
+## Evaluation Examples
+
+### Good Questions
+
+**Example 1: Multi-hop question requiring deep exploration (GitHub MCP)**
+```xml
+<qa_pair>
+  <question>Find the repository that was archived in Q3 2023 and had previously been the most forked project in the organization. What was the primary programming language used in that repository?</question>
+  <answer>Python</answer>
+</qa_pair>
+```
+
+This question is good because:
+- Requires multiple searches to find archived repositories
+- Needs to identify which had the most forks before archival
+- Requires examining repository details for the language
+- Answer is a simple, verifiable value
+- Based on historical (closed) data that won't change
+
+**Example 2: Requires understanding context without keyword matching (Project Management MCP)**
+```xml
+<qa_pair>
+  <question>Locate the initiative focused on improving customer onboarding that was completed in late 2023. The project lead created a retrospective document after completion. What was the lead's role title at that time?</question>
+  <answer>Product Manager</answer>
+</qa_pair>
+```
+
+This question is good because:
+- Doesn't use specific project name ("initiative focused on improving customer onboarding")
+- Requires finding completed projects from specific timeframe
+- Needs to identify the project lead and their role
+- Requires understanding context from retrospective documents
+- Answer is human-readable and stable
+- Based on completed work (won't change)
+
+**Example 3: Complex aggregation requiring multiple steps (Issue Tracker MCP)**
+```xml
+<qa_pair>
+  <question>Among all bugs reported in January 2024 that were marked as critical priority, which assignee resolved the highest percentage of their assigned bugs within 48 hours? Provide the assignee's username.</question>
+  <answer>alex_eng</answer>
+</qa_pair>
+```
+
+This question is good because:
+- Requires filtering bugs by date, priority, and status
+- Needs to group by assignee and calculate resolution rates
+- Requires understanding timestamps to determine 48-hour windows
+- Tests pagination (potentially many bugs to process)
+- Answer is a single username
+- Based on historical data from specific time period
+
+**Example 4: Requires synthesis across multiple data types (CRM MCP)**
+```xml
+<qa_pair>
+  <question>Find the account that upgraded from the Starter to Enterprise plan in Q4 2023 and had the highest annual contract value. What industry does this account operate in?</question>
+  <answer>Healthcare</answer>
+</qa_pair>
+```
+
+This question is good because:
+- Requires understanding subscription tier changes
+- Needs to identify upgrade events in specific timeframe
+- Requires comparing contract values
+- Must access account industry information
+- Answer is simple and verifiable
+- Based on completed historical transactions
+
+### Poor Questions
+
+**Example 1: Answer changes over time**
+```xml
+<qa_pair>
+  <question>How many open issues are currently assigned to the engineering team?</question>
+  <answer>47</answer>
+</qa_pair>
+```
+
+This question is poor because:
+- The answer will change as issues are created, closed, or reassigned
+- Not based on stable/stationary data
+- Relies on "current state" which is dynamic
+
+**Example 2: Too easy with keyword search**
+```xml
+<qa_pair>
+  <question>Find the pull request with title "Add authentication feature" and tell me who created it.</question>
+  <answer>developer123</answer>
+</qa_pair>
+```
+
+This question is poor because:
+- Can be solved with a straightforward keyword search for exact title
+- Doesn't require deep exploration or understanding
+- No synthesis or analysis needed
+
+**Example 3: Ambiguous answer format**
+```xml
+<qa_pair>
+  <question>List all the repositories that have Python as their primary language.</question>
+  <answer>repo1, repo2, repo3, data-pipeline, ml-tools</answer>
+</qa_pair>
+```
+
+This question is poor because:
+- Answer is a list that could be returned in any order
+- Difficult to verify with direct string comparison
+- LLM might format differently (JSON array, comma-separated, newline-separated)
+- Better to ask for a specific aggregate (count) or superlative (most stars)
+
+## Verification Process
+
+After creating evaluations:
+
+1. **Examine the XML file** to understand the schema
+2. **Load each task instruction** and, in parallel, identify the correct answer by attempting to solve each task YOURSELF using the MCP server and tools
+3. **Flag any operations** that require WRITE or DESTRUCTIVE operations
+4. **Accumulate all CORRECT answers** and replace any incorrect answers in the document
+5. **Remove any `<qa_pair>` entries** that require WRITE or DESTRUCTIVE operations
+
+Remember to parallelize solving tasks to avoid running out of context, then accumulate all answers and make changes to the file at the end.
+
+## Tips for Creating Quality Evaluations
+
+1. **Think Hard and Plan Ahead** before generating tasks
+2. **Parallelize Where Opportunity Arises** to speed up the process and manage context
+3. **Focus on Realistic Use Cases** that humans would actually want to accomplish
+4. **Create Challenging Questions** that test the limits of the MCP server's capabilities
+5. **Ensure Stability** by using historical data and closed concepts
+6. **Verify Answers** by solving the questions yourself using the MCP server tools
+7. **Iterate and Refine** based on what you learn during the process
+
+---
+
+# Running Evaluations
+
+After creating your evaluation file, you can use the provided evaluation harness to test your MCP server.
+
+## Setup
+
+1. **Install Dependencies**
+
+   ```bash
+   pip install -r scripts/requirements.txt
+   ```
+
+   Or install manually:
+   ```bash
+   pip install anthropic mcp
+   ```
+
+2. **Set API Key**
+
+   ```bash
+   export ANTHROPIC_API_KEY=your_api_key_here
+   ```
+
+## Evaluation File Format
+
+Evaluation files use XML format with `<qa_pair>` elements:
+
+```xml
+<evaluation>
+  <qa_pair>
+    <question>Find the project created in Q2 2024 with the highest number of completed tasks. What is the project name?</question>
+    <answer>Website Redesign</answer>
+  </qa_pair>
+  <qa_pair>
+    <question>Search for issues labeled as "bug" that were closed in March 2024. Which user closed the most issues? Provide their username.</question>
+    <answer>sarah_dev</answer>
+  </qa_pair>
+</evaluation>
+```
+
+## Running Evaluations
+
+The evaluation script (`scripts/evaluation.py`) supports three transport types:
+
+**Important:**
+- **stdio transport**: The evaluation script automatically launches and manages the MCP server process for you. Do not run the server manually.
+- **sse/http transports**: You must start the MCP server separately before running the evaluation. The script connects to the already-running server at the specified URL.
+
+### 1. Local STDIO Server
+
+For locally-run MCP servers (script launches the server automatically):
+
+```bash
+python scripts/evaluation.py \
+  -t stdio \
+  -c python \
+  -a my_mcp_server.py \
+  evaluation.xml
+```
+
+With environment variables:
+```bash
+python scripts/evaluation.py \
+  -t stdio \
+  -c python \
+  -a my_mcp_server.py \
+  -e API_KEY=abc123 \
+  -e DEBUG=true \
+  evaluation.xml
+```
+
+### 2. Server-Sent Events (SSE)
+
+For SSE-based MCP servers (you must start the server first):
+
+```bash
+python scripts/evaluation.py \
+  -t sse \
+  -u https://example.com/mcp \
+  -H "Authorization: Bearer token123" \
+  -H "X-Custom-Header: value" \
+  evaluation.xml
+```
+
+### 3. 
HTTP (Streamable HTTP)
+
+For HTTP-based MCP servers (you must start the server first):
+
+```bash
+python scripts/evaluation.py \
+  -t http \
+  -u https://example.com/mcp \
+  -H "Authorization: Bearer token123" \
+  evaluation.xml
+```
+
+## Command-Line Options
+
+```
+usage: evaluation.py [-h] [-t {stdio,sse,http}] [-m MODEL] [-c COMMAND]
+                     [-a ARGS [ARGS ...]] [-e ENV [ENV ...]] [-u URL]
+                     [-H HEADERS [HEADERS ...]] [-o OUTPUT]
+                     eval_file
+
+positional arguments:
+  eval_file          Path to evaluation XML file
+
+optional arguments:
+  -h, --help         Show help message
+  -t, --transport    Transport type: stdio, sse, or http (default: stdio)
+  -m, --model        Claude model to use (default: claude-3-7-sonnet-20250219)
+  -o, --output       Output file for report (default: print to stdout)
+
+stdio options:
+  -c, --command      Command to run MCP server (e.g., python, node)
+  -a, --args         Arguments for the command (e.g., server.py)
+  -e, --env          Environment variables in KEY=VALUE format
+
+sse/http options:
+  -u, --url          MCP server URL
+  -H, --header       HTTP headers in 'Key: Value' format
+```
+
+## Output
+
+The evaluation script generates a detailed report including:
+
+- **Summary Statistics**:
+  - Accuracy (correct/total)
+  - Average task duration
+  - Average tool calls per task
+  - Total tool calls
+
+- **Per-Task Results**:
+  - Prompt and expected response
+  - Actual response from the agent
+  - Whether the answer was correct (✅/❌)
+  - Duration and tool call details
+  - Agent's summary of its approach
+  - Agent's feedback on the tools
+
+### Save Report to File
+
+```bash
+python scripts/evaluation.py \
+  -t stdio \
+  -c python \
+  -a my_server.py \
+  -o evaluation_report.md \
+  evaluation.xml
+```
+
+## Complete Example Workflow
+
+Here's a complete example of creating and running an evaluation:
+
+1. **Create your evaluation file** (`my_evaluation.xml`):
+
+```xml
+<evaluation>
+  <qa_pair>
+    <question>Find the user who created the most issues in January 2024. What is their username?</question>
+    <answer>alice_developer</answer>
+  </qa_pair>
+  <qa_pair>
+    <question>Among all pull requests merged in Q1 2024, which repository had the highest number? Provide the repository name.</question>
+    <answer>backend-api</answer>
+  </qa_pair>
+  <qa_pair>
+    <question>Find the project that was completed in December 2023 and had the longest duration from start to finish. How many days did it take?</question>
+    <answer>127</answer>
+  </qa_pair>
+</evaluation>
+```
+
+2. **Install dependencies**:
+
+```bash
+pip install -r scripts/requirements.txt
+export ANTHROPIC_API_KEY=your_api_key
+```
+
+3. **Run evaluation**:
+
+```bash
+python scripts/evaluation.py \
+  -t stdio \
+  -c python \
+  -a github_mcp_server.py \
+  -e GITHUB_TOKEN=ghp_xxx \
+  -o github_eval_report.md \
+  my_evaluation.xml
+```
+
+4. 
**Review the report** in `github_eval_report.md` to: + - See which questions passed/failed + - Read the agent's feedback on your tools + - Identify areas for improvement + - Iterate on your MCP server design + +## Troubleshooting + +### Connection Errors + +If you get connection errors: +- **STDIO**: Verify the command and arguments are correct +- **SSE/HTTP**: Check the URL is accessible and headers are correct +- Ensure any required API keys are set in environment variables or headers + +### Low Accuracy + +If many evaluations fail: +- Review the agent's feedback for each task +- Check if tool descriptions are clear and comprehensive +- Verify input parameters are well-documented +- Consider whether tools return too much or too little data +- Ensure error messages are actionable + +### Timeout Issues + +If tasks are timing out: +- Use a more capable model (e.g., `claude-3-7-sonnet-20250219`) +- Check if tools are returning too much data +- Verify pagination is working correctly +- Consider simplifying complex questions \ No newline at end of file diff --git a/components/skills/llm-mcp-builder-dev/reference/mcp_best_practices.md b/components/skills/llm-mcp-builder-dev/reference/mcp_best_practices.md new file mode 100644 index 0000000..b9d343c --- /dev/null +++ b/components/skills/llm-mcp-builder-dev/reference/mcp_best_practices.md @@ -0,0 +1,249 @@ +# MCP Server Best Practices + +## Quick Reference + +### Server Naming +- **Python**: `{service}_mcp` (e.g., `slack_mcp`) +- **Node/TypeScript**: `{service}-mcp-server` (e.g., `slack-mcp-server`) + +### Tool Naming +- Use snake_case with service prefix +- Format: `{service}_{action}_{resource}` +- Example: `slack_send_message`, `github_create_issue` + +### Response Formats +- Support both JSON and Markdown formats +- JSON for programmatic processing +- Markdown for human readability + +### Pagination +- Always respect `limit` parameter +- Return `has_more`, `next_offset`, `total_count` +- Default to 20-50 items + +### Transport +- **Streamable HTTP**: For remote servers, multi-client scenarios +- **stdio**: For local integrations, command-line tools +- Avoid SSE (deprecated in favor of streamable HTTP) + +--- + +## Server Naming Conventions + +Follow these standardized naming patterns: + +**Python**: Use format `{service}_mcp` (lowercase with underscores) +- Examples: `slack_mcp`, `github_mcp`, `jira_mcp` + +**Node/TypeScript**: Use format `{service}-mcp-server` (lowercase with hyphens) +- Examples: `slack-mcp-server`, `github-mcp-server`, `jira-mcp-server` + +The name should be general, descriptive of the service being integrated, easy to infer from the task description, and without version numbers. + +--- + +## Tool Naming and Design + +### Tool Naming + +1. **Use snake_case**: `search_users`, `create_project`, `get_channel_info` +2. **Include service prefix**: Anticipate that your MCP server may be used alongside other MCP servers + - Use `slack_send_message` instead of just `send_message` + - Use `github_create_issue` instead of just `create_issue` +3. **Be action-oriented**: Start with verbs (get, list, search, create, etc.) +4. 
**Be specific**: Avoid generic names that could conflict with other servers + +### Tool Design + +- Tool descriptions must narrowly and unambiguously describe functionality +- Descriptions must precisely match actual functionality +- Provide tool annotations (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) +- Keep tool operations focused and atomic + +--- + +## Response Formats + +All tools that return data should support multiple formats: + +### JSON Format (`response_format="json"`) +- Machine-readable structured data +- Include all available fields and metadata +- Consistent field names and types +- Use for programmatic processing + +### Markdown Format (`response_format="markdown"`, typically default) +- Human-readable formatted text +- Use headers, lists, and formatting for clarity +- Convert timestamps to human-readable format +- Show display names with IDs in parentheses +- Omit verbose metadata + +--- + +## Pagination + +For tools that list resources: + +- **Always respect the `limit` parameter** +- **Implement pagination**: Use `offset` or cursor-based pagination +- **Return pagination metadata**: Include `has_more`, `next_offset`/`next_cursor`, `total_count` +- **Never load all results into memory**: Especially important for large datasets +- **Default to reasonable limits**: 20-50 items is typical + +Example pagination response: +```json +{ + "total": 150, + "count": 20, + "offset": 0, + "items": [...], + "has_more": true, + "next_offset": 20 +} +``` + +--- + +## Transport Options + +### Streamable HTTP + +**Best for**: Remote servers, web services, multi-client scenarios + +**Characteristics**: +- Bidirectional communication over HTTP +- Supports multiple simultaneous clients +- Can be deployed as a web service +- Enables server-to-client notifications + +**Use when**: +- Serving multiple clients simultaneously +- Deploying as a cloud service +- Integration with web applications + +### stdio + +**Best for**: Local integrations, command-line tools + +**Characteristics**: +- Standard input/output stream communication +- Simple setup, no network configuration needed +- Runs as a subprocess of the client + +**Use when**: +- Building tools for local development environments +- Integrating with desktop applications +- Single-user, single-session scenarios + +**Note**: stdio servers should NOT log to stdout (use stderr for logging) + +### Transport Selection + +| Criterion | stdio | Streamable HTTP | +|-----------|-------|-----------------| +| **Deployment** | Local | Remote | +| **Clients** | Single | Multiple | +| **Complexity** | Low | Medium | +| **Real-time** | No | Yes | + +--- + +## Security Best Practices + +### Authentication and Authorization + +**OAuth 2.1**: +- Use secure OAuth 2.1 with certificates from recognized authorities +- Validate access tokens before processing requests +- Only accept tokens specifically intended for your server + +**API Keys**: +- Store API keys in environment variables, never in code +- Validate keys on server startup +- Provide clear error messages when authentication fails + +### Input Validation + +- Sanitize file paths to prevent directory traversal +- Validate URLs and external identifiers +- Check parameter sizes and ranges +- Prevent command injection in system calls +- Use schema validation (Pydantic/Zod) for all inputs + +### Error Handling + +- Don't expose internal errors to clients +- Log security-relevant errors server-side +- Provide helpful but not revealing error messages +- Clean up resources after errors + +### 
DNS Rebinding Protection + +For streamable HTTP servers running locally: +- Enable DNS rebinding protection +- Validate the `Origin` header on all incoming connections +- Bind to `127.0.0.1` rather than `0.0.0.0` + +--- + +## Tool Annotations + +Provide annotations to help clients understand tool behavior: + +| Annotation | Type | Default | Description | +|-----------|------|---------|-------------| +| `readOnlyHint` | boolean | false | Tool does not modify its environment | +| `destructiveHint` | boolean | true | Tool may perform destructive updates | +| `idempotentHint` | boolean | false | Repeated calls with same args have no additional effect | +| `openWorldHint` | boolean | true | Tool interacts with external entities | + +**Important**: Annotations are hints, not security guarantees. Clients should not make security-critical decisions based solely on annotations. + +--- + +## Error Handling + +- Use standard JSON-RPC error codes +- Report tool errors within result objects (not protocol-level errors) +- Provide helpful, specific error messages with suggested next steps +- Don't expose internal implementation details +- Clean up resources properly on errors + +Example error handling: +```typescript +try { + const result = performOperation(); + return { content: [{ type: "text", text: result }] }; +} catch (error) { + return { + isError: true, + content: [{ + type: "text", + text: `Error: ${error.message}. Try using filter='active_only' to reduce results.` + }] + }; +} +``` + +--- + +## Testing Requirements + +Comprehensive testing should cover: + +- **Functional testing**: Verify correct execution with valid/invalid inputs +- **Integration testing**: Test interaction with external systems +- **Security testing**: Validate auth, input sanitization, rate limiting +- **Performance testing**: Check behavior under load, timeouts +- **Error handling**: Ensure proper error reporting and cleanup + +--- + +## Documentation Requirements + +- Provide clear documentation of all tools and capabilities +- Include working examples (at least 3 per major feature) +- Document security considerations +- Specify required permissions and access levels +- Document rate limits and performance characteristics diff --git a/components/skills/llm-mcp-builder-dev/reference/node_mcp_server.md b/components/skills/llm-mcp-builder-dev/reference/node_mcp_server.md new file mode 100644 index 0000000..f6e5df9 --- /dev/null +++ b/components/skills/llm-mcp-builder-dev/reference/node_mcp_server.md @@ -0,0 +1,970 @@ +# Node/TypeScript MCP Server Implementation Guide + +## Overview + +This document provides Node/TypeScript-specific best practices and examples for implementing MCP servers using the MCP TypeScript SDK. It covers project structure, server setup, tool registration patterns, input validation with Zod, error handling, and complete working examples. 
+ +--- + +## Quick Reference + +### Key Imports +```typescript +import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js"; +import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js"; +import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js"; +import express from "express"; +import { z } from "zod"; +``` + +### Server Initialization +```typescript +const server = new McpServer({ + name: "service-mcp-server", + version: "1.0.0" +}); +``` + +### Tool Registration Pattern +```typescript +server.registerTool( + "tool_name", + { + title: "Tool Display Name", + description: "What the tool does", + inputSchema: { param: z.string() }, + outputSchema: { result: z.string() } + }, + async ({ param }) => { + const output = { result: `Processed: ${param}` }; + return { + content: [{ type: "text", text: JSON.stringify(output) }], + structuredContent: output // Modern pattern for structured data + }; + } +); +``` + +--- + +## MCP TypeScript SDK + +The official MCP TypeScript SDK provides: +- `McpServer` class for server initialization +- `registerTool` method for tool registration +- Zod schema integration for runtime input validation +- Type-safe tool handler implementations + +**IMPORTANT - Use Modern APIs Only:** +- **DO use**: `server.registerTool()`, `server.registerResource()`, `server.registerPrompt()` +- **DO NOT use**: Old deprecated APIs such as `server.tool()`, `server.setRequestHandler(ListToolsRequestSchema, ...)`, or manual handler registration +- The `register*` methods provide better type safety, automatic schema handling, and are the recommended approach + +See the MCP SDK documentation in the references for complete details. + +## Server Naming Convention + +Node/TypeScript MCP servers must follow this naming pattern: +- **Format**: `{service}-mcp-server` (lowercase with hyphens) +- **Examples**: `github-mcp-server`, `jira-mcp-server`, `stripe-mcp-server` + +The name should be: +- General (not tied to specific features) +- Descriptive of the service/API being integrated +- Easy to infer from the task description +- Without version numbers or dates + +## Project Structure + +Create the following structure for Node/TypeScript MCP servers: + +``` +{service}-mcp-server/ +├── package.json +├── tsconfig.json +├── README.md +├── src/ +│ ├── index.ts # Main entry point with McpServer initialization +│ ├── types.ts # TypeScript type definitions and interfaces +│ ├── tools/ # Tool implementations (one file per domain) +│ ├── services/ # API clients and shared utilities +│ ├── schemas/ # Zod validation schemas +│ └── constants.ts # Shared constants (API_URL, CHARACTER_LIMIT, etc.) +└── dist/ # Built JavaScript files (entry point: dist/index.js) +``` + +## Tool Implementation + +### Tool Naming + +Use snake_case for tool names (e.g., "search_users", "create_project", "get_channel_info") with clear, action-oriented names. 
+
+**Avoid Naming Conflicts**: Include the service context to prevent overlaps:
+- Use "slack_send_message" instead of just "send_message"
+- Use "github_create_issue" instead of just "create_issue"
+- Use "asana_list_tasks" instead of just "list_tasks"
+
+### Tool Structure
+
+Tools are registered using the `registerTool` method with the following requirements:
+- Use Zod schemas for runtime input validation and type safety
+- The `description` field must be explicitly provided - JSDoc comments are NOT automatically extracted
+- Explicitly provide `title`, `description`, `inputSchema`, and `annotations`
+- The `inputSchema` must be a Zod schema object (not a JSON schema)
+- Type all parameters and return values explicitly
+
+```typescript
+import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
+import { z } from "zod";
+
+const server = new McpServer({
+  name: "example-mcp",
+  version: "1.0.0"
+});
+
+// Zod schema for input validation
+const UserSearchInputSchema = z.object({
+  query: z.string()
+    .min(2, "Query must be at least 2 characters")
+    .max(200, "Query must not exceed 200 characters")
+    .describe("Search string to match against names/emails"),
+  limit: z.number()
+    .int()
+    .min(1)
+    .max(100)
+    .default(20)
+    .describe("Maximum results to return"),
+  offset: z.number()
+    .int()
+    .min(0)
+    .default(0)
+    .describe("Number of results to skip for pagination"),
+  response_format: z.nativeEnum(ResponseFormat)
+    .default(ResponseFormat.MARKDOWN)
+    .describe("Output format: 'markdown' for human-readable or 'json' for machine-readable")
+}).strict();
+
+// Type definition from Zod schema
+type UserSearchInput = z.infer<typeof UserSearchInputSchema>;
+
+server.registerTool(
+  "example_search_users",
+  {
+    title: "Search Example Users",
+    description: `Search for users in the Example system by name, email, or team.
+
+This tool searches across all user profiles in the Example platform, supporting partial matches and various search filters. It does NOT create or modify users, only searches existing ones.
+
+Args:
+  - query (string): Search string to match against names/emails
+  - limit (number): Maximum results to return, between 1-100 (default: 20)
+  - offset (number): Number of results to skip for pagination (default: 0)
+  - response_format ('markdown' | 'json'): Output format (default: 'markdown')
+
+Returns:
+  For JSON format: Structured data with schema:
+  {
+    "total": number,        // Total number of matches found
+    "count": number,        // Number of results in this response
+    "offset": number,       // Current pagination offset
+    "users": [
+      {
+        "id": string,       // User ID (e.g., "U123456789")
+        "name": string,     // Full name (e.g., "John Doe")
+        "email": string,    // Email address
+        "team": string,     // Team name (optional)
+        "active": boolean   // Whether user is active
+      }
+    ],
+    "has_more": boolean,    // Whether more results are available
+    "next_offset": number   // Offset for next page (if has_more is true)
+  }
+
+Examples:
+  - Use when: "Find all marketing team members" -> params with query="team:marketing"
+  - Use when: "Search for John's account" -> params with query="john"
+  - Don't use when: You need to create a user (use example_create_user instead)
+
+Error Handling:
+  - Returns "Error: Rate limit exceeded" if too many requests (429 status)
+  - Returns "No users found matching '<query>'" if search returns empty`,
+    inputSchema: UserSearchInputSchema,
+    annotations: {
+      readOnlyHint: true,
+      destructiveHint: false,
+      idempotentHint: true,
+      openWorldHint: true
+    }
+  },
+  async (params: UserSearchInput) => {
+    try {
+      // Input validation is handled by Zod schema
+      // Make API request using validated parameters
+      const data = await makeApiRequest(
+        "users/search",
+        "GET",
+        undefined,
+        {
+          q: params.query,
+          limit: params.limit,
+          offset: params.offset
+        }
+      );
+
+      const users = data.users || [];
+      const total = data.total || 0;
+
+      if (!users.length) {
+        return {
+          content: [{
+            type: "text",
+            text: `No users found matching '${params.query}'`
+          }]
+        };
+      }
+
+      // Prepare structured output
+      const output = {
+        total,
+        count: users.length,
+        offset: params.offset,
+        users: users.map((user: any) => ({
+          id: user.id,
+          name: user.name,
+          email: user.email,
+          ...(user.team ? { team: user.team } : {}),
+          active: user.active ?? true
+        })),
+        has_more: total > params.offset + users.length,
+        ...(total > params.offset + users.length ?
+          { next_offset: params.offset + users.length } : {})
+      };
+
+      // Format text representation based on requested format
+      let textContent: string;
+      if (params.response_format === ResponseFormat.MARKDOWN) {
+        const lines = [`# User Search Results: '${params.query}'`, "",
+          `Found ${total} users (showing ${users.length})`, ""];
+        for (const user of users) {
+          lines.push(`## ${user.name} (${user.id})`);
+          lines.push(`- **Email**: ${user.email}`);
+          if (user.team) lines.push(`- **Team**: ${user.team}`);
+          lines.push("");
+        }
+        textContent = lines.join("\n");
+      } else {
+        textContent = JSON.stringify(output, null, 2);
+      }
+
+      return {
+        content: [{ type: "text", text: textContent }],
+        structuredContent: output // Modern pattern for structured data
+      };
+    } catch (error) {
+      return {
+        content: [{
+          type: "text",
+          text: handleApiError(error)
+        }]
+      };
+    }
+  }
+);
+```
+
+## Zod Schemas for Input Validation
+
+Zod provides runtime type validation:
+
+```typescript
+import { z } from "zod";
+
+// Basic schema with validation
+const CreateUserSchema = z.object({
+  name: z.string()
+    .min(1, "Name is required")
+    .max(100, "Name must not exceed 100 characters"),
+  email: z.string()
+    .email("Invalid email format"),
+  age: z.number()
+    .int("Age must be a whole number")
+    .min(0, "Age cannot be negative")
+    .max(150, "Age cannot be greater than 150")
+}).strict(); // Use .strict() to forbid extra fields
+
+// Enums
+enum ResponseFormat {
+  MARKDOWN = "markdown",
+  JSON = "json"
+}
+
+const SearchSchema = z.object({
+  response_format: z.nativeEnum(ResponseFormat)
+    .default(ResponseFormat.MARKDOWN)
+    .describe("Output format")
+});
+
+// Optional fields with defaults
+const PaginationSchema = z.object({
+  limit: z.number()
+    .int()
+    .min(1)
+    .max(100)
+    .default(20)
+    .describe("Maximum results to return"),
+  offset: z.number()
+    .int()
+    .min(0)
+    .default(0)
+    .describe("Number of results to skip")
+});
+```
+
+## Response Format Options
+
+Support multiple output formats for flexibility:
+
+```typescript
+enum ResponseFormat {
+  MARKDOWN = "markdown",
+  JSON = "json"
+}
+
+const inputSchema = z.object({
+  query: z.string(),
+  response_format: z.nativeEnum(ResponseFormat)
+    .default(ResponseFormat.MARKDOWN)
+    .describe("Output format: 'markdown' for human-readable or 'json' for machine-readable")
+});
+```
+
+**Markdown format**:
+- Use headers, lists, and formatting for clarity
+- Convert timestamps to human-readable format
+- Show display names with IDs in parentheses
+- Omit verbose metadata
+- Group related information logically
+
+**JSON format**:
+- Return complete, structured data suitable for programmatic processing
+- Include all available fields and metadata
+- Use consistent field names and types
+
+## Pagination Implementation
+
+For tools that list resources:
+
+```typescript
+const ListSchema = z.object({
+  limit: z.number().int().min(1).max(100).default(20),
+  offset: z.number().int().min(0).default(0)
+});
+
+async function listItems(params: z.infer<typeof ListSchema>) {
+  const data = await apiRequest(params.limit, params.offset);
+
+  const response = {
+    total: data.total,
+    count: data.items.length,
+    offset: params.offset,
+    items: data.items,
+    has_more: data.total > params.offset + data.items.length,
+    next_offset: data.total > params.offset + data.items.length
+      ? params.offset + data.items.length
+      : undefined
+  };
+
+  return JSON.stringify(response, null, 2);
+}
+```
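+
+The general guidelines above also allow cursor-based pagination. A hedged sketch of the same tool shape with an opaque cursor instead of an offset - the `apiRequestWithCursor` helper and its response field names are assumptions, not a fixed MCP schema:
+
+```typescript
+import { z } from "zod";
+
+const CursorListSchema = z.object({
+  limit: z.number().int().min(1).max(100).default(20),
+  cursor: z.string().optional().describe("Opaque cursor returned by a previous call")
+});
+
+async function listItemsByCursor(params: z.infer<typeof CursorListSchema>) {
+  // Hypothetical client call: the API echoes back a next_cursor when more
+  // results exist, and omits it on the last page.
+  const data = await apiRequestWithCursor(params.limit, params.cursor);
+
+  const response = {
+    count: data.items.length,
+    items: data.items,
+    has_more: Boolean(data.next_cursor),
+    next_cursor: data.next_cursor ?? undefined
+  };
+
+  return JSON.stringify(response, null, 2);
+}
+```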
+
+## Character Limits and Truncation
+
+Add a CHARACTER_LIMIT constant to prevent overwhelming responses:
+
+```typescript
+// At module level in constants.ts
+export const CHARACTER_LIMIT = 25000; // Maximum response size in characters
+
+async function searchTool(params: SearchInput) {
+  const data = await fetchResults(params); // hypothetical API call returning an array of items
+  const response: Record<string, any> = { data };
+  let result = JSON.stringify(response, null, 2);
+
+  // Check character limit and truncate if needed
+  if (result.length > CHARACTER_LIMIT) {
+    const truncatedData = data.slice(0, Math.max(1, Math.floor(data.length / 2)));
+    response.data = truncatedData;
+    response.truncated = true;
+    response.truncation_message =
+      `Response truncated from ${data.length} to ${truncatedData.length} items. ` +
+      `Use 'offset' parameter or add filters to see more results.`;
+    result = JSON.stringify(response, null, 2);
+  }
+
+  return result;
+}
+```
+
+## Error Handling
+
+Provide clear, actionable error messages:
+
+```typescript
+import axios, { AxiosError } from "axios";
+
+function handleApiError(error: unknown): string {
+  if (error instanceof AxiosError) {
+    if (error.response) {
+      switch (error.response.status) {
+        case 404:
+          return "Error: Resource not found. Please check the ID is correct.";
+        case 403:
+          return "Error: Permission denied. You don't have access to this resource.";
+        case 429:
+          return "Error: Rate limit exceeded. Please wait before making more requests.";
+        default:
+          return `Error: API request failed with status ${error.response.status}`;
+      }
+    } else if (error.code === "ECONNABORTED") {
+      return "Error: Request timed out. Please try again.";
+    }
+  }
+  return `Error: Unexpected error occurred: ${error instanceof Error ? error.message : String(error)}`;
+}
+```
+
+## Shared Utilities
+
+Extract common functionality into reusable functions:
+
+```typescript
+// Shared API request function
+async function makeApiRequest(
+  endpoint: string,
+  method: "GET" | "POST" | "PUT" | "DELETE" = "GET",
+  data?: any,
+  params?: any
+): Promise<any> {
+  try {
+    const response = await axios({
+      method,
+      url: `${API_BASE_URL}/${endpoint}`,
+      data,
+      params,
+      timeout: 30000,
+      headers: {
+        "Content-Type": "application/json",
+        "Accept": "application/json"
+      }
+    });
+    return response.data;
+  } catch (error) {
+    throw error;
+  }
+}
+```
+
+## Async/Await Best Practices
+
+Always use async/await for network requests and I/O operations:
+
+```typescript
+// Good: Async network request
+async function fetchData(resourceId: string): Promise<any> {
+  const response = await axios.get(`${API_URL}/resource/${resourceId}`);
+  return response.data;
+}
+
+// Bad: Promise chains
+function fetchData(resourceId: string): Promise<any> {
+  return axios.get(`${API_URL}/resource/${resourceId}`)
+    .then(response => response.data); // Harder to read and maintain
+}
+```
+
+## TypeScript Best Practices
+
+1. **Use Strict TypeScript**: Enable strict mode in tsconfig.json
+2. **Define Interfaces**: Create clear interface definitions for all data structures
+3. **Avoid `any`**: Use proper types or `unknown` instead of `any`
+4. **Zod for Runtime Validation**: Use Zod schemas to validate external data
+5. **Type Guards**: Create type guard functions for complex type checking
+6. **Error Handling**: Always use try-catch with proper error type checking
+7. **Null Safety**: Use optional chaining (`?.`) and nullish coalescing (`??`)
+
+```typescript
+// Good: Type-safe with Zod and interfaces
+interface UserResponse {
+  id: string;
+  name: string;
+  email: string;
+  team?: string;
+  active: boolean;
+}
+
+const UserSchema = z.object({
+  id: z.string(),
+  name: z.string(),
+  email: z.string().email(),
+  team: z.string().optional(),
+  active: z.boolean()
+});
+
+type User = z.infer<typeof UserSchema>;
+
+async function getUser(id: string): Promise<User> {
+  const data = await apiCall(`/users/${id}`);
+  return UserSchema.parse(data); // Runtime validation
+}
+
+// Bad: Using any
+async function getUser(id: string): Promise<any> {
+  return await apiCall(`/users/${id}`); // No type safety
+}
+```
+
+## Package Configuration
+
+### package.json
+
+```json
+{
+  "name": "{service}-mcp-server",
+  "version": "1.0.0",
+  "description": "MCP server for {Service} API integration",
+  "type": "module",
+  "main": "dist/index.js",
+  "scripts": {
+    "start": "node dist/index.js",
+    "dev": "tsx watch src/index.ts",
+    "build": "tsc",
+    "clean": "rm -rf dist"
+  },
+  "engines": {
+    "node": ">=18"
+  },
+  "dependencies": {
+    "@modelcontextprotocol/sdk": "^1.6.1",
+    "axios": "^1.7.9",
+    "express": "^4.21.2",
+    "zod": "^3.23.8"
+  },
+  "devDependencies": {
+    "@types/express": "^4.17.21",
+    "@types/node": "^22.10.0",
+    "tsx": "^4.19.2",
+    "typescript": "^5.7.2"
+  }
+}
+```
+
+### tsconfig.json
+
+```json
+{
+  "compilerOptions": {
+    "target": "ES2022",
+    "module": "Node16",
+    "moduleResolution": "Node16",
+    "lib": ["ES2022"],
+    "outDir": "./dist",
+    "rootDir": "./src",
+    "strict": true,
+    "esModuleInterop": true,
+    "skipLibCheck": true,
+    "forceConsistentCasingInFileNames": true,
+    "declaration": true,
+    "declarationMap": true,
+    "sourceMap": true,
+    "allowSyntheticDefaultImports": true
+  },
+  "include": ["src/**/*"],
+  "exclude": ["node_modules", "dist"]
+}
+```
+
+## Complete Example
+
+```typescript
+#!/usr/bin/env node
+/**
+ * MCP Server for Example Service.
+ *
+ * This server provides tools to interact with Example API, including user search,
+ * project management, and data export capabilities.
+ */
+
+import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
+import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
+import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";
+import express from "express";
+import { z } from "zod";
+import axios, { AxiosError } from "axios";
+
+// Constants
+const API_BASE_URL = "https://api.example.com/v1";
+const CHARACTER_LIMIT = 25000;
+
+// Enums
+enum ResponseFormat {
+  MARKDOWN = "markdown",
+  JSON = "json"
+}
+
+// Zod schemas
+const UserSearchInputSchema = z.object({
+  query: z.string()
+    .min(2, "Query must be at least 2 characters")
+    .max(200, "Query must not exceed 200 characters")
+    .describe("Search string to match against names/emails"),
+  limit: z.number()
+    .int()
+    .min(1)
+    .max(100)
+    .default(20)
+    .describe("Maximum results to return"),
+  offset: z.number()
+    .int()
+    .min(0)
+    .default(0)
+    .describe("Number of results to skip for pagination"),
+  response_format: z.nativeEnum(ResponseFormat)
+    .default(ResponseFormat.MARKDOWN)
+    .describe("Output format: 'markdown' for human-readable or 'json' for machine-readable")
+}).strict();
+
+type UserSearchInput = z.infer<typeof UserSearchInputSchema>;
+
+// Shared utility functions
+async function makeApiRequest(
+  endpoint: string,
+  method: "GET" | "POST" | "PUT" | "DELETE" = "GET",
+  data?: any,
+  params?: any
+): Promise<any> {
+  try {
+    const response = await axios({
+      method,
+      url: `${API_BASE_URL}/${endpoint}`,
+      data,
+      params,
+      timeout: 30000,
+      headers: {
+        "Content-Type": "application/json",
+        "Accept": "application/json"
+      }
+    });
+    return response.data;
+  } catch (error) {
+    throw error;
+  }
+}
+
+function handleApiError(error: unknown): string {
+  if (error instanceof AxiosError) {
+    if (error.response) {
+      switch (error.response.status) {
+        case 404:
+          return "Error: Resource not found. Please check the ID is correct.";
+        case 403:
+          return "Error: Permission denied. You don't have access to this resource.";
+        case 429:
+          return "Error: Rate limit exceeded. Please wait before making more requests.";
+        default:
+          return `Error: API request failed with status ${error.response.status}`;
+      }
+    } else if (error.code === "ECONNABORTED") {
+      return "Error: Request timed out. Please try again.";
+    }
+  }
+  return `Error: Unexpected error occurred: ${error instanceof Error ? 
error.message : String(error)}`; +} + +// Create MCP server instance +const server = new McpServer({ + name: "example-mcp", + version: "1.0.0" +}); + +// Register tools +server.registerTool( + "example_search_users", + { + title: "Search Example Users", + description: `[Full description as shown above]`, + inputSchema: UserSearchInputSchema, + annotations: { + readOnlyHint: true, + destructiveHint: false, + idempotentHint: true, + openWorldHint: true + } + }, + async (params: UserSearchInput) => { + // Implementation as shown above + } +); + +// Main function +// For stdio (local): +async function runStdio() { + if (!process.env.EXAMPLE_API_KEY) { + console.error("ERROR: EXAMPLE_API_KEY environment variable is required"); + process.exit(1); + } + + const transport = new StdioServerTransport(); + await server.connect(transport); + console.error("MCP server running via stdio"); +} + +// For streamable HTTP (remote): +async function runHTTP() { + if (!process.env.EXAMPLE_API_KEY) { + console.error("ERROR: EXAMPLE_API_KEY environment variable is required"); + process.exit(1); + } + + const app = express(); + app.use(express.json()); + + app.post('/mcp', async (req, res) => { + const transport = new StreamableHTTPServerTransport({ + sessionIdGenerator: undefined, + enableJsonResponse: true + }); + res.on('close', () => transport.close()); + await server.connect(transport); + await transport.handleRequest(req, res, req.body); + }); + + const port = parseInt(process.env.PORT || '3000'); + app.listen(port, () => { + console.error(`MCP server running on http://localhost:${port}/mcp`); + }); +} + +// Choose transport based on environment +const transport = process.env.TRANSPORT || 'stdio'; +if (transport === 'http') { + runHTTP().catch(error => { + console.error("Server error:", error); + process.exit(1); + }); +} else { + runStdio().catch(error => { + console.error("Server error:", error); + process.exit(1); + }); +} +``` + +--- + +## Advanced MCP Features + +### Resource Registration + +Expose data as resources for efficient, URI-based access: + +```typescript +import { ResourceTemplate } from "@modelcontextprotocol/sdk/types.js"; + +// Register a resource with URI template +server.registerResource( + { + uri: "file://documents/{name}", + name: "Document Resource", + description: "Access documents by name", + mimeType: "text/plain" + }, + async (uri: string) => { + // Extract parameter from URI + const match = uri.match(/^file:\/\/documents\/(.+)$/); + if (!match) { + throw new Error("Invalid URI format"); + } + + const documentName = match[1]; + const content = await loadDocument(documentName); + + return { + contents: [{ + uri, + mimeType: "text/plain", + text: content + }] + }; + } +); + +// List available resources dynamically +server.registerResourceList(async () => { + const documents = await getAvailableDocuments(); + return { + resources: documents.map(doc => ({ + uri: `file://documents/${doc.name}`, + name: doc.name, + mimeType: "text/plain", + description: doc.description + })) + }; +}); +``` + +**When to use Resources vs Tools:** +- **Resources**: For data access with simple URI-based parameters +- **Tools**: For complex operations requiring validation and business logic +- **Resources**: When data is relatively static or template-based +- **Tools**: When operations have side effects or complex workflows + +### Transport Options + +The TypeScript SDK supports two main transport mechanisms: + +#### Streamable HTTP (Recommended for Remote Servers) + +```typescript +import { 
StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js"; +import express from "express"; + +const app = express(); +app.use(express.json()); + +app.post('/mcp', async (req, res) => { + // Create new transport for each request (stateless, prevents request ID collisions) + const transport = new StreamableHTTPServerTransport({ + sessionIdGenerator: undefined, + enableJsonResponse: true + }); + + res.on('close', () => transport.close()); + + await server.connect(transport); + await transport.handleRequest(req, res, req.body); +}); + +app.listen(3000); +``` + +#### stdio (For Local Integrations) + +```typescript +import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js"; + +const transport = new StdioServerTransport(); +await server.connect(transport); +``` + +**Transport selection:** +- **Streamable HTTP**: Web services, remote access, multiple clients +- **stdio**: Command-line tools, local development, subprocess integration + +### Notification Support + +Notify clients when server state changes: + +```typescript +// Notify when tools list changes +server.notification({ + method: "notifications/tools/list_changed" +}); + +// Notify when resources change +server.notification({ + method: "notifications/resources/list_changed" +}); +``` + +Use notifications sparingly - only when server capabilities genuinely change. + +--- + +## Code Best Practices + +### Code Composability and Reusability + +Your implementation MUST prioritize composability and code reuse: + +1. **Extract Common Functionality**: + - Create reusable helper functions for operations used across multiple tools + - Build shared API clients for HTTP requests instead of duplicating code + - Centralize error handling logic in utility functions + - Extract business logic into dedicated functions that can be composed + - Extract shared markdown or JSON field selection & formatting functionality + +2. **Avoid Duplication**: + - NEVER copy-paste similar code between tools + - If you find yourself writing similar logic twice, extract it into a function + - Common operations like pagination, filtering, field selection, and formatting should be shared + - Authentication/authorization logic should be centralized + +## Building and Running + +Always build your TypeScript code before running: + +```bash +# Build the project +npm run build + +# Run the server +npm start + +# Development with auto-reload +npm run dev +``` + +Always ensure `npm run build` completes successfully before considering the implementation complete. 
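+
+To check the built server end to end, a quick smoke test can launch `dist/index.js` over stdio using the SDK's client classes and exercise one tool. A minimal sketch - the tool name and arguments are assumptions carried over from the examples above:
+
+```typescript
+import { Client } from "@modelcontextprotocol/sdk/client/index.js";
+import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
+
+async function smokeTest() {
+  // Launch the built server as a subprocess and talk to it over stdio.
+  const transport = new StdioClientTransport({
+    command: "node",
+    args: ["dist/index.js"]
+  });
+
+  const client = new Client({ name: "smoke-test", version: "1.0.0" });
+  await client.connect(transport);
+
+  // Confirm the expected tools are registered.
+  const { tools } = await client.listTools();
+  console.log("Tools:", tools.map((t) => t.name).join(", "));
+
+  // Call one tool with representative arguments.
+  const result = await client.callTool({
+    name: "example_search_users", // hypothetical tool from this guide
+    arguments: { query: "john" }
+  });
+  console.log(JSON.stringify(result, null, 2));
+
+  await client.close();
+}
+
+smokeTest().catch((error) => {
+  console.error("Smoke test failed:", error);
+  process.exit(1);
+});
+```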
+ +## Quality Checklist + +Before finalizing your Node/TypeScript MCP server implementation, ensure: + +### Strategic Design +- [ ] Tools enable complete workflows, not just API endpoint wrappers +- [ ] Tool names reflect natural task subdivisions +- [ ] Response formats optimize for agent context efficiency +- [ ] Human-readable identifiers used where appropriate +- [ ] Error messages guide agents toward correct usage + +### Implementation Quality +- [ ] FOCUSED IMPLEMENTATION: Most important and valuable tools implemented +- [ ] All tools registered using `registerTool` with complete configuration +- [ ] All tools include `title`, `description`, `inputSchema`, and `annotations` +- [ ] Annotations correctly set (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) +- [ ] All tools use Zod schemas for runtime input validation with `.strict()` enforcement +- [ ] All Zod schemas have proper constraints and descriptive error messages +- [ ] All tools have comprehensive descriptions with explicit input/output types +- [ ] Descriptions include return value examples and complete schema documentation +- [ ] Error messages are clear, actionable, and educational + +### TypeScript Quality +- [ ] TypeScript interfaces are defined for all data structures +- [ ] Strict TypeScript is enabled in tsconfig.json +- [ ] No use of `any` type - use `unknown` or proper types instead +- [ ] All async functions have explicit Promise return types +- [ ] Error handling uses proper type guards (e.g., `axios.isAxiosError`, `z.ZodError`) + +### Advanced Features (where applicable) +- [ ] Resources registered for appropriate data endpoints +- [ ] Appropriate transport configured (stdio or streamable HTTP) +- [ ] Notifications implemented for dynamic server capabilities +- [ ] Type-safe with SDK interfaces + +### Project Configuration +- [ ] Package.json includes all necessary dependencies +- [ ] Build script produces working JavaScript in dist/ directory +- [ ] Main entry point is properly configured as dist/index.js +- [ ] Server name follows format: `{service}-mcp-server` +- [ ] tsconfig.json properly configured with strict mode + +### Code Quality +- [ ] Pagination is properly implemented where applicable +- [ ] Large responses check CHARACTER_LIMIT constant and truncate with clear messages +- [ ] Filtering options are provided for potentially large result sets +- [ ] All network operations handle timeouts and connection errors gracefully +- [ ] Common functionality is extracted into reusable functions +- [ ] Return types are consistent across similar operations + +### Testing and Build +- [ ] `npm run build` completes successfully without errors +- [ ] dist/index.js created and executable +- [ ] Server runs: `node dist/index.js --help` +- [ ] All imports resolve correctly +- [ ] Sample tool calls work as expected \ No newline at end of file diff --git a/components/skills/llm-mcp-builder-dev/reference/python_mcp_server.md b/components/skills/llm-mcp-builder-dev/reference/python_mcp_server.md new file mode 100644 index 0000000..cf7ec99 --- /dev/null +++ b/components/skills/llm-mcp-builder-dev/reference/python_mcp_server.md @@ -0,0 +1,719 @@ +# Python MCP Server Implementation Guide + +## Overview + +This document provides Python-specific best practices and examples for implementing MCP servers using the MCP Python SDK. It covers server setup, tool registration patterns, input validation with Pydantic, error handling, and complete working examples. 
+ +--- + +## Quick Reference + +### Key Imports +```python +from mcp.server.fastmcp import FastMCP +from pydantic import BaseModel, Field, field_validator, ConfigDict +from typing import Optional, List, Dict, Any +from enum import Enum +import httpx +``` + +### Server Initialization +```python +mcp = FastMCP("service_mcp") +``` + +### Tool Registration Pattern +```python +@mcp.tool(name="tool_name", annotations={...}) +async def tool_function(params: InputModel) -> str: + # Implementation + pass +``` + +--- + +## MCP Python SDK and FastMCP + +The official MCP Python SDK provides FastMCP, a high-level framework for building MCP servers. It provides: +- Automatic description and inputSchema generation from function signatures and docstrings +- Pydantic model integration for input validation +- Decorator-based tool registration with `@mcp.tool` + +**For complete SDK documentation, use WebFetch to load:** +`https://raw.githubusercontent.com/modelcontextprotocol/python-sdk/main/README.md` + +## Server Naming Convention + +Python MCP servers must follow this naming pattern: +- **Format**: `{service}_mcp` (lowercase with underscores) +- **Examples**: `github_mcp`, `jira_mcp`, `stripe_mcp` + +The name should be: +- General (not tied to specific features) +- Descriptive of the service/API being integrated +- Easy to infer from the task description +- Without version numbers or dates + +## Tool Implementation + +### Tool Naming + +Use snake_case for tool names (e.g., "search_users", "create_project", "get_channel_info") with clear, action-oriented names. + +**Avoid Naming Conflicts**: Include the service context to prevent overlaps: +- Use "slack_send_message" instead of just "send_message" +- Use "github_create_issue" instead of just "create_issue" +- Use "asana_list_tasks" instead of just "list_tasks" + +### Tool Structure with FastMCP + +Tools are defined using the `@mcp.tool` decorator with Pydantic models for input validation: + +```python +from pydantic import BaseModel, Field, ConfigDict +from mcp.server.fastmcp import FastMCP + +# Initialize the MCP server +mcp = FastMCP("example_mcp") + +# Define Pydantic model for input validation +class ServiceToolInput(BaseModel): + '''Input model for service tool operation.''' + model_config = ConfigDict( + str_strip_whitespace=True, # Auto-strip whitespace from strings + validate_assignment=True, # Validate on assignment + extra='forbid' # Forbid extra fields + ) + + param1: str = Field(..., description="First parameter description (e.g., 'user123', 'project-abc')", min_length=1, max_length=100) + param2: Optional[int] = Field(default=None, description="Optional integer parameter with constraints", ge=0, le=1000) + tags: Optional[List[str]] = Field(default_factory=list, description="List of tags to apply", max_items=10) + +@mcp.tool( + name="service_tool_name", + annotations={ + "title": "Human-Readable Tool Title", + "readOnlyHint": True, # Tool does not modify environment + "destructiveHint": False, # Tool does not perform destructive operations + "idempotentHint": True, # Repeated calls have no additional effect + "openWorldHint": False # Tool does not interact with external entities + } +) +async def service_tool_name(params: ServiceToolInput) -> str: + '''Tool description automatically becomes the 'description' field. + + This tool performs a specific operation on the service. It validates all inputs + using the ServiceToolInput Pydantic model before processing. 
+ + Args: + params (ServiceToolInput): Validated input parameters containing: + - param1 (str): First parameter description + - param2 (Optional[int]): Optional parameter with default + - tags (Optional[List[str]]): List of tags + + Returns: + str: JSON-formatted response containing operation results + ''' + # Implementation here + pass +``` + +## Pydantic v2 Key Features + +- Use `model_config` instead of nested `Config` class +- Use `field_validator` instead of deprecated `validator` +- Use `model_dump()` instead of deprecated `dict()` +- Validators require `@classmethod` decorator +- Type hints are required for validator methods + +```python +from pydantic import BaseModel, Field, field_validator, ConfigDict + +class CreateUserInput(BaseModel): + model_config = ConfigDict( + str_strip_whitespace=True, + validate_assignment=True + ) + + name: str = Field(..., description="User's full name", min_length=1, max_length=100) + email: str = Field(..., description="User's email address", pattern=r'^[\w\.-]+@[\w\.-]+\.\w+$') + age: int = Field(..., description="User's age", ge=0, le=150) + + @field_validator('email') + @classmethod + def validate_email(cls, v: str) -> str: + if not v.strip(): + raise ValueError("Email cannot be empty") + return v.lower() +``` + +## Response Format Options + +Support multiple output formats for flexibility: + +```python +from enum import Enum + +class ResponseFormat(str, Enum): + '''Output format for tool responses.''' + MARKDOWN = "markdown" + JSON = "json" + +class UserSearchInput(BaseModel): + query: str = Field(..., description="Search query") + response_format: ResponseFormat = Field( + default=ResponseFormat.MARKDOWN, + description="Output format: 'markdown' for human-readable or 'json' for machine-readable" + ) +``` + +**Markdown format**: +- Use headers, lists, and formatting for clarity +- Convert timestamps to human-readable format (e.g., "2024-01-15 10:30:00 UTC" instead of epoch) +- Show display names with IDs in parentheses (e.g., "@john.doe (U123456)") +- Omit verbose metadata (e.g., show only one profile image URL, not all sizes) +- Group related information logically + +**JSON format**: +- Return complete, structured data suitable for programmatic processing +- Include all available fields and metadata +- Use consistent field names and types + +## Pagination Implementation + +For tools that list resources: + +```python +class ListInput(BaseModel): + limit: Optional[int] = Field(default=20, description="Maximum results to return", ge=1, le=100) + offset: Optional[int] = Field(default=0, description="Number of results to skip for pagination", ge=0) + +async def list_items(params: ListInput) -> str: + # Make API request with pagination + data = await api_request(limit=params.limit, offset=params.offset) + + # Return pagination info + response = { + "total": data["total"], + "count": len(data["items"]), + "offset": params.offset, + "items": data["items"], + "has_more": data["total"] > params.offset + len(data["items"]), + "next_offset": params.offset + len(data["items"]) if data["total"] > params.offset + len(data["items"]) else None + } + return json.dumps(response, indent=2) +``` + +## Error Handling + +Provide clear, actionable error messages: + +```python +def _handle_api_error(e: Exception) -> str: + '''Consistent error formatting across all tools.''' + if isinstance(e, httpx.HTTPStatusError): + if e.response.status_code == 404: + return "Error: Resource not found. Please check the ID is correct." 
+ elif e.response.status_code == 403: + return "Error: Permission denied. You don't have access to this resource." + elif e.response.status_code == 429: + return "Error: Rate limit exceeded. Please wait before making more requests." + return f"Error: API request failed with status {e.response.status_code}" + elif isinstance(e, httpx.TimeoutException): + return "Error: Request timed out. Please try again." + return f"Error: Unexpected error occurred: {type(e).__name__}" +``` + +## Shared Utilities + +Extract common functionality into reusable functions: + +```python +# Shared API request function +async def _make_api_request(endpoint: str, method: str = "GET", **kwargs) -> dict: + '''Reusable function for all API calls.''' + async with httpx.AsyncClient() as client: + response = await client.request( + method, + f"{API_BASE_URL}/{endpoint}", + timeout=30.0, + **kwargs + ) + response.raise_for_status() + return response.json() +``` + +## Async/Await Best Practices + +Always use async/await for network requests and I/O operations: + +```python +# Good: Async network request +async def fetch_data(resource_id: str) -> dict: + async with httpx.AsyncClient() as client: + response = await client.get(f"{API_URL}/resource/{resource_id}") + response.raise_for_status() + return response.json() + +# Bad: Synchronous request +def fetch_data(resource_id: str) -> dict: + response = requests.get(f"{API_URL}/resource/{resource_id}") # Blocks + return response.json() +``` + +## Type Hints + +Use type hints throughout: + +```python +from typing import Optional, List, Dict, Any + +async def get_user(user_id: str) -> Dict[str, Any]: + data = await fetch_user(user_id) + return {"id": data["id"], "name": data["name"]} +``` + +## Tool Docstrings + +Every tool must have comprehensive docstrings with explicit type information: + +```python +async def search_users(params: UserSearchInput) -> str: + ''' + Search for users in the Example system by name, email, or team. + + This tool searches across all user profiles in the Example platform, + supporting partial matches and various search filters. It does NOT + create or modify users, only searches existing ones. 
+
+    Args:
+        params (UserSearchInput): Validated input parameters containing:
+            - query (str): Search string to match against names/emails (e.g., "john", "@example.com", "team:marketing")
+            - limit (Optional[int]): Maximum results to return, between 1-100 (default: 20)
+            - offset (Optional[int]): Number of results to skip for pagination (default: 0)
+
+    Returns:
+        str: JSON-formatted string containing search results with the following schema:
+
+        Success response:
+        {
+            "total": int,       # Total number of matches found
+            "count": int,       # Number of results in this response
+            "offset": int,      # Current pagination offset
+            "users": [
+                {
+                    "id": str,      # User ID (e.g., "U123456789")
+                    "name": str,    # Full name (e.g., "John Doe")
+                    "email": str,   # Email address (e.g., "john@example.com")
+                    "team": str     # Team name (e.g., "Marketing") - optional
+                }
+            ]
+        }
+
+        Error response:
+        "Error: <error message>" or "No users found matching '<query>'"
+
+    Examples:
+        - Use when: "Find all marketing team members" -> params with query="team:marketing"
+        - Use when: "Search for John's account" -> params with query="john"
+        - Don't use when: You need to create a user (use example_create_user instead)
+        - Don't use when: You have a user ID and need full details (use example_get_user instead)
+
+    Error Handling:
+        - Input validation errors are handled by Pydantic model
+        - Returns "Error: Rate limit exceeded" if too many requests (429 status)
+        - Returns "Error: Invalid API authentication" if API key is invalid (401 status)
+        - Returns formatted list of results or "No users found matching 'query'"
+    '''
+```
+
+## Complete Example
+
+See below for a complete Python MCP server example:
+
+```python
+#!/usr/bin/env python3
+'''
+MCP Server for Example Service.
+
+This server provides tools to interact with Example API, including user search,
+project management, and data export capabilities.
+''' + +from typing import Optional, List, Dict, Any +from enum import Enum +import httpx +from pydantic import BaseModel, Field, field_validator, ConfigDict +from mcp.server.fastmcp import FastMCP + +# Initialize the MCP server +mcp = FastMCP("example_mcp") + +# Constants +API_BASE_URL = "https://api.example.com/v1" + +# Enums +class ResponseFormat(str, Enum): + '''Output format for tool responses.''' + MARKDOWN = "markdown" + JSON = "json" + +# Pydantic Models for Input Validation +class UserSearchInput(BaseModel): + '''Input model for user search operations.''' + model_config = ConfigDict( + str_strip_whitespace=True, + validate_assignment=True + ) + + query: str = Field(..., description="Search string to match against names/emails", min_length=2, max_length=200) + limit: Optional[int] = Field(default=20, description="Maximum results to return", ge=1, le=100) + offset: Optional[int] = Field(default=0, description="Number of results to skip for pagination", ge=0) + response_format: ResponseFormat = Field(default=ResponseFormat.MARKDOWN, description="Output format") + + @field_validator('query') + @classmethod + def validate_query(cls, v: str) -> str: + if not v.strip(): + raise ValueError("Query cannot be empty or whitespace only") + return v.strip() + +# Shared utility functions +async def _make_api_request(endpoint: str, method: str = "GET", **kwargs) -> dict: + '''Reusable function for all API calls.''' + async with httpx.AsyncClient() as client: + response = await client.request( + method, + f"{API_BASE_URL}/{endpoint}", + timeout=30.0, + **kwargs + ) + response.raise_for_status() + return response.json() + +def _handle_api_error(e: Exception) -> str: + '''Consistent error formatting across all tools.''' + if isinstance(e, httpx.HTTPStatusError): + if e.response.status_code == 404: + return "Error: Resource not found. Please check the ID is correct." + elif e.response.status_code == 403: + return "Error: Permission denied. You don't have access to this resource." + elif e.response.status_code == 429: + return "Error: Rate limit exceeded. Please wait before making more requests." + return f"Error: API request failed with status {e.response.status_code}" + elif isinstance(e, httpx.TimeoutException): + return "Error: Request timed out. Please try again." + return f"Error: Unexpected error occurred: {type(e).__name__}" + +# Tool definitions +@mcp.tool( + name="example_search_users", + annotations={ + "title": "Search Example Users", + "readOnlyHint": True, + "destructiveHint": False, + "idempotentHint": True, + "openWorldHint": True + } +) +async def example_search_users(params: UserSearchInput) -> str: + '''Search for users in the Example system by name, email, or team. 
+ + [Full docstring as shown above] + ''' + try: + # Make API request using validated parameters + data = await _make_api_request( + "users/search", + params={ + "q": params.query, + "limit": params.limit, + "offset": params.offset + } + ) + + users = data.get("users", []) + total = data.get("total", 0) + + if not users: + return f"No users found matching '{params.query}'" + + # Format response based on requested format + if params.response_format == ResponseFormat.MARKDOWN: + lines = [f"# User Search Results: '{params.query}'", ""] + lines.append(f"Found {total} users (showing {len(users)})") + lines.append("") + + for user in users: + lines.append(f"## {user['name']} ({user['id']})") + lines.append(f"- **Email**: {user['email']}") + if user.get('team'): + lines.append(f"- **Team**: {user['team']}") + lines.append("") + + return "\n".join(lines) + + else: + # Machine-readable JSON format + import json + response = { + "total": total, + "count": len(users), + "offset": params.offset, + "users": users + } + return json.dumps(response, indent=2) + + except Exception as e: + return _handle_api_error(e) + +if __name__ == "__main__": + mcp.run() +``` + +--- + +## Advanced FastMCP Features + +### Context Parameter Injection + +FastMCP can automatically inject a `Context` parameter into tools for advanced capabilities like logging, progress reporting, resource reading, and user interaction: + +```python +from mcp.server.fastmcp import FastMCP, Context + +mcp = FastMCP("example_mcp") + +@mcp.tool() +async def advanced_search(query: str, ctx: Context) -> str: + '''Advanced tool with context access for logging and progress.''' + + # Report progress for long operations + await ctx.report_progress(0.25, "Starting search...") + + # Log information for debugging + await ctx.log_info("Processing query", {"query": query, "timestamp": datetime.now()}) + + # Perform search + results = await search_api(query) + await ctx.report_progress(0.75, "Formatting results...") + + # Access server configuration + server_name = ctx.fastmcp.name + + return format_results(results) + +@mcp.tool() +async def interactive_tool(resource_id: str, ctx: Context) -> str: + '''Tool that can request additional input from users.''' + + # Request sensitive information when needed + api_key = await ctx.elicit( + prompt="Please provide your API key:", + input_type="password" + ) + + # Use the provided key + return await api_call(resource_id, api_key) +``` + +**Context capabilities:** +- `ctx.report_progress(progress, message)` - Report progress for long operations +- `ctx.log_info(message, data)` / `ctx.log_error()` / `ctx.log_debug()` - Logging +- `ctx.elicit(prompt, input_type)` - Request input from users +- `ctx.fastmcp.name` - Access server configuration +- `ctx.read_resource(uri)` - Read MCP resources + +### Resource Registration + +Expose data as resources for efficient, template-based access: + +```python +@mcp.resource("file://documents/{name}") +async def get_document(name: str) -> str: + '''Expose documents as MCP resources. + + Resources are useful for static or semi-static data that doesn't + require complex parameters. They use URI templates for flexible access. 
+ ''' + document_path = f"./docs/{name}" + with open(document_path, "r") as f: + return f.read() + +@mcp.resource("config://settings/{key}") +async def get_setting(key: str, ctx: Context) -> str: + '''Expose configuration as resources with context.''' + settings = await load_settings() + return json.dumps(settings.get(key, {})) +``` + +**When to use Resources vs Tools:** +- **Resources**: For data access with simple parameters (URI templates) +- **Tools**: For complex operations with validation and business logic + +### Structured Output Types + +FastMCP supports multiple return types beyond strings: + +```python +from typing import TypedDict +from dataclasses import dataclass +from pydantic import BaseModel + +# TypedDict for structured returns +class UserData(TypedDict): + id: str + name: str + email: str + +@mcp.tool() +async def get_user_typed(user_id: str) -> UserData: + '''Returns structured data - FastMCP handles serialization.''' + return {"id": user_id, "name": "John Doe", "email": "john@example.com"} + +# Pydantic models for complex validation +class DetailedUser(BaseModel): + id: str + name: str + email: str + created_at: datetime + metadata: Dict[str, Any] + +@mcp.tool() +async def get_user_detailed(user_id: str) -> DetailedUser: + '''Returns Pydantic model - automatically generates schema.''' + user = await fetch_user(user_id) + return DetailedUser(**user) +``` + +### Lifespan Management + +Initialize resources that persist across requests: + +```python +from contextlib import asynccontextmanager + +@asynccontextmanager +async def app_lifespan(): + '''Manage resources that live for the server's lifetime.''' + # Initialize connections, load config, etc. + db = await connect_to_database() + config = load_configuration() + + # Make available to all tools + yield {"db": db, "config": config} + + # Cleanup on shutdown + await db.close() + +mcp = FastMCP("example_mcp", lifespan=app_lifespan) + +@mcp.tool() +async def query_data(query: str, ctx: Context) -> str: + '''Access lifespan resources through context.''' + db = ctx.request_context.lifespan_state["db"] + results = await db.query(query) + return format_results(results) +``` + +### Transport Options + +FastMCP supports two main transport mechanisms: + +```python +# stdio transport (for local tools) - default +if __name__ == "__main__": + mcp.run() + +# Streamable HTTP transport (for remote servers) +if __name__ == "__main__": + mcp.run(transport="streamable_http", port=8000) +``` + +**Transport selection:** +- **stdio**: Command-line tools, local integrations, subprocess execution +- **Streamable HTTP**: Web services, remote access, multiple clients + +--- + +## Code Best Practices + +### Code Composability and Reusability + +Your implementation MUST prioritize composability and code reuse: + +1. **Extract Common Functionality**: + - Create reusable helper functions for operations used across multiple tools + - Build shared API clients for HTTP requests instead of duplicating code + - Centralize error handling logic in utility functions + - Extract business logic into dedicated functions that can be composed + - Extract shared markdown or JSON field selection & formatting functionality + +2. 
**Avoid Duplication**: + - NEVER copy-paste similar code between tools + - If you find yourself writing similar logic twice, extract it into a function + - Common operations like pagination, filtering, field selection, and formatting should be shared + - Authentication/authorization logic should be centralized + +### Python-Specific Best Practices + +1. **Use Type Hints**: Always include type annotations for function parameters and return values +2. **Pydantic Models**: Define clear Pydantic models for all input validation +3. **Avoid Manual Validation**: Let Pydantic handle input validation with constraints +4. **Proper Imports**: Group imports (standard library, third-party, local) +5. **Error Handling**: Use specific exception types (httpx.HTTPStatusError, not generic Exception) +6. **Async Context Managers**: Use `async with` for resources that need cleanup +7. **Constants**: Define module-level constants in UPPER_CASE + +## Quality Checklist + +Before finalizing your Python MCP server implementation, ensure: + +### Strategic Design +- [ ] Tools enable complete workflows, not just API endpoint wrappers +- [ ] Tool names reflect natural task subdivisions +- [ ] Response formats optimize for agent context efficiency +- [ ] Human-readable identifiers used where appropriate +- [ ] Error messages guide agents toward correct usage + +### Implementation Quality +- [ ] FOCUSED IMPLEMENTATION: Most important and valuable tools implemented +- [ ] All tools have descriptive names and documentation +- [ ] Return types are consistent across similar operations +- [ ] Error handling is implemented for all external calls +- [ ] Server name follows format: `{service}_mcp` +- [ ] All network operations use async/await +- [ ] Common functionality is extracted into reusable functions +- [ ] Error messages are clear, actionable, and educational +- [ ] Outputs are properly validated and formatted + +### Tool Configuration +- [ ] All tools implement 'name' and 'annotations' in the decorator +- [ ] Annotations correctly set (readOnlyHint, destructiveHint, idempotentHint, openWorldHint) +- [ ] All tools use Pydantic BaseModel for input validation with Field() definitions +- [ ] All Pydantic Fields have explicit types and descriptions with constraints +- [ ] All tools have comprehensive docstrings with explicit input/output types +- [ ] Docstrings include complete schema structure for dict/JSON returns +- [ ] Pydantic models handle input validation (no manual validation needed) + +### Advanced Features (where applicable) +- [ ] Context injection used for logging, progress, or elicitation +- [ ] Resources registered for appropriate data endpoints +- [ ] Lifespan management implemented for persistent connections +- [ ] Structured output types used (TypedDict, Pydantic models) +- [ ] Appropriate transport configured (stdio or streamable HTTP) + +### Code Quality +- [ ] File includes proper imports including Pydantic imports +- [ ] Pagination is properly implemented where applicable +- [ ] Filtering options are provided for potentially large result sets +- [ ] All async functions are properly defined with `async def` +- [ ] HTTP client usage follows async patterns with proper context managers +- [ ] Type hints are used throughout the code +- [ ] Constants are defined at module level in UPPER_CASE + +### Testing +- [ ] Server runs successfully: `python your_server.py --help` +- [ ] All imports resolve correctly +- [ ] Sample tool calls work as expected +- [ ] Error scenarios handled gracefully \ No newline at end of 
file diff --git a/components/skills/llm-mcp-builder-dev/scripts/connections.py b/components/skills/llm-mcp-builder-dev/scripts/connections.py new file mode 100644 index 0000000..ffcd0da --- /dev/null +++ b/components/skills/llm-mcp-builder-dev/scripts/connections.py @@ -0,0 +1,151 @@ +"""Lightweight connection handling for MCP servers.""" + +from abc import ABC, abstractmethod +from contextlib import AsyncExitStack +from typing import Any + +from mcp import ClientSession, StdioServerParameters +from mcp.client.sse import sse_client +from mcp.client.stdio import stdio_client +from mcp.client.streamable_http import streamablehttp_client + + +class MCPConnection(ABC): + """Base class for MCP server connections.""" + + def __init__(self): + self.session = None + self._stack = None + + @abstractmethod + def _create_context(self): + """Create the connection context based on connection type.""" + + async def __aenter__(self): + """Initialize MCP server connection.""" + self._stack = AsyncExitStack() + await self._stack.__aenter__() + + try: + ctx = self._create_context() + result = await self._stack.enter_async_context(ctx) + + if len(result) == 2: + read, write = result + elif len(result) == 3: + read, write, _ = result + else: + raise ValueError(f"Unexpected context result: {result}") + + session_ctx = ClientSession(read, write) + self.session = await self._stack.enter_async_context(session_ctx) + await self.session.initialize() + return self + except BaseException: + await self._stack.__aexit__(None, None, None) + raise + + async def __aexit__(self, exc_type, exc_val, exc_tb): + """Clean up MCP server connection resources.""" + if self._stack: + await self._stack.__aexit__(exc_type, exc_val, exc_tb) + self.session = None + self._stack = None + + async def list_tools(self) -> list[dict[str, Any]]: + """Retrieve available tools from the MCP server.""" + response = await self.session.list_tools() + return [ + { + "name": tool.name, + "description": tool.description, + "input_schema": tool.inputSchema, + } + for tool in response.tools + ] + + async def call_tool(self, tool_name: str, arguments: dict[str, Any]) -> Any: + """Call a tool on the MCP server with provided arguments.""" + result = await self.session.call_tool(tool_name, arguments=arguments) + return result.content + + +class MCPConnectionStdio(MCPConnection): + """MCP connection using standard input/output.""" + + def __init__(self, command: str, args: list[str] = None, env: dict[str, str] = None): + super().__init__() + self.command = command + self.args = args or [] + self.env = env + + def _create_context(self): + return stdio_client( + StdioServerParameters(command=self.command, args=self.args, env=self.env) + ) + + +class MCPConnectionSSE(MCPConnection): + """MCP connection using Server-Sent Events.""" + + def __init__(self, url: str, headers: dict[str, str] = None): + super().__init__() + self.url = url + self.headers = headers or {} + + def _create_context(self): + return sse_client(url=self.url, headers=self.headers) + + +class MCPConnectionHTTP(MCPConnection): + """MCP connection using Streamable HTTP.""" + + def __init__(self, url: str, headers: dict[str, str] = None): + super().__init__() + self.url = url + self.headers = headers or {} + + def _create_context(self): + return streamablehttp_client(url=self.url, headers=self.headers) + + +def create_connection( + transport: str, + command: str = None, + args: list[str] = None, + env: dict[str, str] = None, + url: str = None, + headers: dict[str, str] = None, +) -> MCPConnection: 
+    """Factory function to create the appropriate MCP connection.
+
+    Args:
+        transport: Connection type ("stdio", "sse", or "http")
+        command: Command to run (stdio only)
+        args: Command arguments (stdio only)
+        env: Environment variables (stdio only)
+        url: Server URL (sse and http only)
+        headers: HTTP headers (sse and http only)
+
+    Returns:
+        MCPConnection instance
+    """
+    transport = transport.lower()
+
+    if transport == "stdio":
+        if not command:
+            raise ValueError("Command is required for stdio transport")
+        return MCPConnectionStdio(command=command, args=args, env=env)
+
+    elif transport == "sse":
+        if not url:
+            raise ValueError("URL is required for sse transport")
+        return MCPConnectionSSE(url=url, headers=headers)
+
+    elif transport in ["http", "streamable_http", "streamable-http"]:
+        if not url:
+            raise ValueError("URL is required for http transport")
+        return MCPConnectionHTTP(url=url, headers=headers)
+
+    else:
+        raise ValueError(f"Unsupported transport type: {transport}. Use 'stdio', 'sse', or 'http'")
diff --git a/components/skills/llm-mcp-builder-dev/scripts/evaluation.py b/components/skills/llm-mcp-builder-dev/scripts/evaluation.py
new file mode 100644
index 0000000..4177856
--- /dev/null
+++ b/components/skills/llm-mcp-builder-dev/scripts/evaluation.py
@@ -0,0 +1,373 @@
+"""MCP Server Evaluation Harness
+
+This script evaluates MCP servers by running test questions against them using Claude.
+"""
+
+import argparse
+import asyncio
+import json
+import re
+import sys
+import time
+import traceback
+import xml.etree.ElementTree as ET
+from pathlib import Path
+from typing import Any
+
+from anthropic import Anthropic
+
+from connections import create_connection
+
+EVALUATION_PROMPT = """You are an AI assistant with access to tools.
+
+When given a task, you MUST:
+1. Use the available tools to complete the task
+2. Provide summary of each step in your approach, wrapped in <summary> tags
+3. Provide feedback on the tools provided, wrapped in <feedback> tags
+4. Provide your final response, wrapped in <response> tags
+
+Summary Requirements:
+- In your <summary> tags, you must explain:
+  - The steps you took to complete the task
+  - Which tools you used, in what order, and why
+  - The inputs you provided to each tool
+  - The outputs you received from each tool
+  - A summary for how you arrived at the response
+
+Feedback Requirements:
+- In your <feedback> tags, provide constructive feedback on the tools:
+  - Comment on tool names: Are they clear and descriptive?
+  - Comment on input parameters: Are they well-documented? Are required vs optional parameters clear?
+  - Comment on descriptions: Do they accurately describe what the tool does?
+  - Comment on any errors encountered during tool usage: Did the tool fail to execute? Did the tool return too many tokens?
+ - Identify specific areas for improvement and explain WHY they would help + - Be specific and actionable in your suggestions + +Response Requirements: +- Your response should be concise and directly address what was asked +- Always wrap your final response in tags +- If you cannot solve the task return NOT_FOUND +- For numeric responses, provide just the number +- For IDs, provide just the ID +- For names or text, provide the exact text requested +- Your response should go last""" + + +def parse_evaluation_file(file_path: Path) -> list[dict[str, Any]]: + """Parse XML evaluation file with qa_pair elements.""" + try: + tree = ET.parse(file_path) + root = tree.getroot() + evaluations = [] + + for qa_pair in root.findall(".//qa_pair"): + question_elem = qa_pair.find("question") + answer_elem = qa_pair.find("answer") + + if question_elem is not None and answer_elem is not None: + evaluations.append({ + "question": (question_elem.text or "").strip(), + "answer": (answer_elem.text or "").strip(), + }) + + return evaluations + except Exception as e: + print(f"Error parsing evaluation file {file_path}: {e}") + return [] + + +def extract_xml_content(text: str, tag: str) -> str | None: + """Extract content from XML tags.""" + pattern = rf"<{tag}>(.*?)" + matches = re.findall(pattern, text, re.DOTALL) + return matches[-1].strip() if matches else None + + +async def agent_loop( + client: Anthropic, + model: str, + question: str, + tools: list[dict[str, Any]], + connection: Any, +) -> tuple[str, dict[str, Any]]: + """Run the agent loop with MCP tools.""" + messages = [{"role": "user", "content": question}] + + response = await asyncio.to_thread( + client.messages.create, + model=model, + max_tokens=4096, + system=EVALUATION_PROMPT, + messages=messages, + tools=tools, + ) + + messages.append({"role": "assistant", "content": response.content}) + + tool_metrics = {} + + while response.stop_reason == "tool_use": + tool_use = next(block for block in response.content if block.type == "tool_use") + tool_name = tool_use.name + tool_input = tool_use.input + + tool_start_ts = time.time() + try: + tool_result = await connection.call_tool(tool_name, tool_input) + tool_response = json.dumps(tool_result) if isinstance(tool_result, (dict, list)) else str(tool_result) + except Exception as e: + tool_response = f"Error executing tool {tool_name}: {str(e)}\n" + tool_response += traceback.format_exc() + tool_duration = time.time() - tool_start_ts + + if tool_name not in tool_metrics: + tool_metrics[tool_name] = {"count": 0, "durations": []} + tool_metrics[tool_name]["count"] += 1 + tool_metrics[tool_name]["durations"].append(tool_duration) + + messages.append({ + "role": "user", + "content": [{ + "type": "tool_result", + "tool_use_id": tool_use.id, + "content": tool_response, + }] + }) + + response = await asyncio.to_thread( + client.messages.create, + model=model, + max_tokens=4096, + system=EVALUATION_PROMPT, + messages=messages, + tools=tools, + ) + messages.append({"role": "assistant", "content": response.content}) + + response_text = next( + (block.text for block in response.content if hasattr(block, "text")), + None, + ) + return response_text, tool_metrics + + +async def evaluate_single_task( + client: Anthropic, + model: str, + qa_pair: dict[str, Any], + tools: list[dict[str, Any]], + connection: Any, + task_index: int, +) -> dict[str, Any]: + """Evaluate a single QA pair with the given tools.""" + start_time = time.time() + + print(f"Task {task_index + 1}: Running task with question: {qa_pair['question']}") + 
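+    # Run the tool-use loop; it returns the final assistant text plus
+    # per-tool call counts and durations for the report below.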
response, tool_metrics = await agent_loop(client, model, qa_pair["question"], tools, connection) + + response_value = extract_xml_content(response, "response") + summary = extract_xml_content(response, "summary") + feedback = extract_xml_content(response, "feedback") + + duration_seconds = time.time() - start_time + + return { + "question": qa_pair["question"], + "expected": qa_pair["answer"], + "actual": response_value, + "score": int(response_value == qa_pair["answer"]) if response_value else 0, + "total_duration": duration_seconds, + "tool_calls": tool_metrics, + "num_tool_calls": sum(len(metrics["durations"]) for metrics in tool_metrics.values()), + "summary": summary, + "feedback": feedback, + } + + +REPORT_HEADER = """ +# Evaluation Report + +## Summary + +- **Accuracy**: {correct}/{total} ({accuracy:.1f}%) +- **Average Task Duration**: {average_duration_s:.2f}s +- **Average Tool Calls per Task**: {average_tool_calls:.2f} +- **Total Tool Calls**: {total_tool_calls} + +--- +""" + +TASK_TEMPLATE = """ +### Task {task_num} + +**Question**: {question} +**Ground Truth Answer**: `{expected_answer}` +**Actual Answer**: `{actual_answer}` +**Correct**: {correct_indicator} +**Duration**: {total_duration:.2f}s +**Tool Calls**: {tool_calls} + +**Summary** +{summary} + +**Feedback** +{feedback} + +--- +""" + + +async def run_evaluation( + eval_path: Path, + connection: Any, + model: str = "claude-3-7-sonnet-20250219", +) -> str: + """Run evaluation with MCP server tools.""" + print("🚀 Starting Evaluation") + + client = Anthropic() + + tools = await connection.list_tools() + print(f"📋 Loaded {len(tools)} tools from MCP server") + + qa_pairs = parse_evaluation_file(eval_path) + print(f"📋 Loaded {len(qa_pairs)} evaluation tasks") + + results = [] + for i, qa_pair in enumerate(qa_pairs): + print(f"Processing task {i + 1}/{len(qa_pairs)}") + result = await evaluate_single_task(client, model, qa_pair, tools, connection, i) + results.append(result) + + correct = sum(r["score"] for r in results) + accuracy = (correct / len(results)) * 100 if results else 0 + average_duration_s = sum(r["total_duration"] for r in results) / len(results) if results else 0 + average_tool_calls = sum(r["num_tool_calls"] for r in results) / len(results) if results else 0 + total_tool_calls = sum(r["num_tool_calls"] for r in results) + + report = REPORT_HEADER.format( + correct=correct, + total=len(results), + accuracy=accuracy, + average_duration_s=average_duration_s, + average_tool_calls=average_tool_calls, + total_tool_calls=total_tool_calls, + ) + + report += "".join([ + TASK_TEMPLATE.format( + task_num=i + 1, + question=qa_pair["question"], + expected_answer=qa_pair["answer"], + actual_answer=result["actual"] or "N/A", + correct_indicator="✅" if result["score"] else "❌", + total_duration=result["total_duration"], + tool_calls=json.dumps(result["tool_calls"], indent=2), + summary=result["summary"] or "N/A", + feedback=result["feedback"] or "N/A", + ) + for i, (qa_pair, result) in enumerate(zip(qa_pairs, results)) + ]) + + return report + + +def parse_headers(header_list: list[str]) -> dict[str, str]: + """Parse header strings in format 'Key: Value' into a dictionary.""" + headers = {} + if not header_list: + return headers + + for header in header_list: + if ":" in header: + key, value = header.split(":", 1) + headers[key.strip()] = value.strip() + else: + print(f"Warning: Ignoring malformed header: {header}") + return headers + + +def parse_env_vars(env_list: list[str]) -> dict[str, str]: + """Parse environment variable 
strings in format 'KEY=VALUE' into a dictionary."""
+    env = {}
+    if not env_list:
+        return env
+
+    for env_var in env_list:
+        if "=" in env_var:
+            key, value = env_var.split("=", 1)
+            env[key.strip()] = value.strip()
+        else:
+            print(f"Warning: Ignoring malformed environment variable: {env_var}")
+    return env
+
+
+async def main():
+    parser = argparse.ArgumentParser(
+        description="Evaluate MCP servers using test questions",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Evaluate a local stdio MCP server
+  python evaluation.py -t stdio -c python -a my_server.py eval.xml
+
+  # Evaluate an SSE MCP server
+  python evaluation.py -t sse -u https://example.com/mcp -H "Authorization: Bearer token" eval.xml
+
+  # Evaluate an HTTP MCP server with custom model
+  python evaluation.py -t http -u https://example.com/mcp -m claude-3-5-sonnet-20241022 eval.xml
+        """,
+    )
+
+    parser.add_argument("eval_file", type=Path, help="Path to evaluation XML file")
+    parser.add_argument("-t", "--transport", choices=["stdio", "sse", "http"], default="stdio", help="Transport type (default: stdio)")
+    parser.add_argument("-m", "--model", default="claude-3-7-sonnet-20250219", help="Claude model to use (default: claude-3-7-sonnet-20250219)")
+
+    stdio_group = parser.add_argument_group("stdio options")
+    stdio_group.add_argument("-c", "--command", help="Command to run MCP server (stdio only)")
+    stdio_group.add_argument("-a", "--args", nargs="+", help="Arguments for the command (stdio only)")
+    stdio_group.add_argument("-e", "--env", nargs="+", help="Environment variables in KEY=VALUE format (stdio only)")
+
+    remote_group = parser.add_argument_group("sse/http options")
+    remote_group.add_argument("-u", "--url", help="MCP server URL (sse/http only)")
+    remote_group.add_argument("-H", "--header", nargs="+", dest="headers", help="HTTP headers in 'Key: Value' format (sse/http only)")
+
+    parser.add_argument("-o", "--output", type=Path, help="Output file for evaluation report (default: stdout)")
+
+    args = parser.parse_args()
+
+    if not args.eval_file.exists():
+        print(f"Error: Evaluation file not found: {args.eval_file}")
+        sys.exit(1)
+
+    headers = parse_headers(args.headers) if args.headers else None
+    env_vars = parse_env_vars(args.env) if args.env else None
+
+    try:
+        connection = create_connection(
+            transport=args.transport,
+            command=args.command,
+            args=args.args,
+            env=env_vars,
+            url=args.url,
+            headers=headers,
+        )
+    except ValueError as e:
+        print(f"Error: {e}")
+        sys.exit(1)
+
+    print(f"🔗 Connecting to MCP server via {args.transport}...")
+
+    async with connection:
+        print("✅ Connected successfully")
+        report = await run_evaluation(args.eval_file, connection, args.model)
+
+    if args.output:
+        args.output.write_text(report)
+        print(f"\n✅ Report saved to {args.output}")
+    else:
+        print("\n" + report)
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
diff --git a/components/skills/llm-mcp-builder-dev/scripts/example_evaluation.xml b/components/skills/llm-mcp-builder-dev/scripts/example_evaluation.xml
new file mode 100644
index 0000000..41e4459
--- /dev/null
+++ b/components/skills/llm-mcp-builder-dev/scripts/example_evaluation.xml
@@ -0,0 +1,22 @@
+<evaluation>
+  <qa_pair>
+    <question>Calculate the compound interest on $10,000 invested at 5% annual interest rate, compounded monthly for 3 years. What is the final amount in dollars (rounded to 2 decimal places)?</question>
+    <answer>11614.72</answer>
+  </qa_pair>
+  <qa_pair>
+    <question>A projectile is launched at a 45-degree angle with an initial velocity of 50 m/s. Calculate the total distance (in meters) it has traveled from the launch point after 2 seconds, assuming g=9.8 m/s². Round to 2 decimal places.</question>
+    <answer>87.25</answer>
+  </qa_pair>
+  <qa_pair>
+    <question>A sphere has a volume of 500 cubic meters. Calculate its surface area in square meters. Round to 2 decimal places.</question>
+    <answer>304.65</answer>
+  </qa_pair>
+  <qa_pair>
+    <question>Calculate the population standard deviation of this dataset: [12, 15, 18, 22, 25, 30, 35]. Round to 2 decimal places.</question>
+    <answer>7.61</answer>
+  </qa_pair>
+  <qa_pair>
+    <question>Calculate the pH of a solution with a hydrogen ion concentration of 3.5 × 10^-5 M. Round to 2 decimal places.</question>
+    <answer>4.46</answer>
+  </qa_pair>
+</evaluation>
diff --git a/components/skills/llm-mcp-builder-dev/scripts/requirements.txt b/components/skills/llm-mcp-builder-dev/scripts/requirements.txt
new file mode 100644
index 0000000..e73e5d1
--- /dev/null
+++ b/components/skills/llm-mcp-builder-dev/scripts/requirements.txt
@@ -0,0 +1,2 @@
+anthropic>=0.39.0
+mcp>=1.1.0
diff --git a/components/skills/llm-rag-patterns-dev/SKILL.md b/components/skills/llm-rag-patterns-dev/SKILL.md
new file mode 100644
index 0000000..0f0840c
--- /dev/null
+++ b/components/skills/llm-rag-patterns-dev/SKILL.md
@@ -0,0 +1,403 @@
+---
+name: rag-implementation
+description: Build Retrieval-Augmented Generation (RAG) systems for LLM applications with vector databases and semantic search. Use when implementing knowledge-grounded AI, building document Q&A systems, or integrating LLMs with external knowledge bases.
+---
+
+# RAG Implementation
+
+Master Retrieval-Augmented Generation (RAG) to build LLM applications that provide accurate, grounded responses using external knowledge sources.
+
+## When to Use This Skill
+
+- Building Q&A systems over proprietary documents
+- Creating chatbots with current, factual information
+- Implementing semantic search with natural language queries
+- Reducing hallucinations with grounded responses
+- Enabling LLMs to access domain-specific knowledge
+- Building documentation assistants
+- Creating research tools with source citation
+
+## Core Components
+
+### 1. Vector Databases
+**Purpose**: Store and retrieve document embeddings efficiently
+
+**Options:**
+- **Pinecone**: Managed, scalable, fast queries
+- **Weaviate**: Open-source, hybrid search
+- **Milvus**: High performance, on-premise
+- **Chroma**: Lightweight, easy to use
+- **Qdrant**: Fast, filtered search
+- **FAISS**: Meta's library, local deployment
+
+### 2. Embeddings
+**Purpose**: Convert text to numerical vectors for similarity search
+
+**Models:**
+- **text-embedding-ada-002** (OpenAI): General purpose, 1536 dims
+- **all-MiniLM-L6-v2** (Sentence Transformers): Fast, lightweight
+- **e5-large-v2**: High quality, multilingual
+- **Instructor**: Task-specific instructions
+- **bge-large-en-v1.5**: SOTA performance
+
+### 3. Retrieval Strategies
+**Approaches:**
+- **Dense Retrieval**: Semantic similarity via embeddings
+- **Sparse Retrieval**: Keyword matching (BM25, TF-IDF)
+- **Hybrid Search**: Combine dense + sparse
+- **Multi-Query**: Generate multiple query variations
+- **HyDE**: Generate hypothetical documents
+
+### 4.
Reranking +**Purpose**: Improve retrieval quality by reordering results + +**Methods:** +- **Cross-Encoders**: BERT-based reranking +- **Cohere Rerank**: API-based reranking +- **Maximal Marginal Relevance (MMR)**: Diversity + relevance +- **LLM-based**: Use LLM to score relevance + +## Quick Start + +```python +from langchain.document_loaders import DirectoryLoader +from langchain.text_splitters import RecursiveCharacterTextSplitter +from langchain.embeddings import OpenAIEmbeddings +from langchain.vectorstores import Chroma +from langchain.chains import RetrievalQA +from langchain.llms import OpenAI + +# 1. Load documents +loader = DirectoryLoader('./docs', glob="**/*.txt") +documents = loader.load() + +# 2. Split into chunks +text_splitter = RecursiveCharacterTextSplitter( + chunk_size=1000, + chunk_overlap=200, + length_function=len +) +chunks = text_splitter.split_documents(documents) + +# 3. Create embeddings and vector store +embeddings = OpenAIEmbeddings() +vectorstore = Chroma.from_documents(chunks, embeddings) + +# 4. Create retrieval chain +qa_chain = RetrievalQA.from_chain_type( + llm=OpenAI(), + chain_type="stuff", + retriever=vectorstore.as_retriever(search_kwargs={"k": 4}), + return_source_documents=True +) + +# 5. Query +result = qa_chain({"query": "What are the main features?"}) +print(result['result']) +print(result['source_documents']) +``` + +## Advanced RAG Patterns + +### Pattern 1: Hybrid Search +```python +from langchain.retrievers import BM25Retriever, EnsembleRetriever + +# Sparse retriever (BM25) +bm25_retriever = BM25Retriever.from_documents(chunks) +bm25_retriever.k = 5 + +# Dense retriever (embeddings) +embedding_retriever = vectorstore.as_retriever(search_kwargs={"k": 5}) + +# Combine with weights +ensemble_retriever = EnsembleRetriever( + retrievers=[bm25_retriever, embedding_retriever], + weights=[0.3, 0.7] +) +``` + +### Pattern 2: Multi-Query Retrieval +```python +from langchain.retrievers.multi_query import MultiQueryRetriever + +# Generate multiple query perspectives +retriever = MultiQueryRetriever.from_llm( + retriever=vectorstore.as_retriever(), + llm=OpenAI() +) + +# Single query → multiple variations → combined results +results = retriever.get_relevant_documents("What is the main topic?") +``` + +### Pattern 3: Contextual Compression +```python +from langchain.retrievers import ContextualCompressionRetriever +from langchain.retrievers.document_compressors import LLMChainExtractor + +compressor = LLMChainExtractor.from_llm(llm) + +compression_retriever = ContextualCompressionRetriever( + base_compressor=compressor, + base_retriever=vectorstore.as_retriever() +) + +# Returns only relevant parts of documents +compressed_docs = compression_retriever.get_relevant_documents("query") +``` + +### Pattern 4: Parent Document Retriever +```python +from langchain.retrievers import ParentDocumentRetriever +from langchain.storage import InMemoryStore + +# Store for parent documents +store = InMemoryStore() + +# Small chunks for retrieval, large chunks for context +child_splitter = RecursiveCharacterTextSplitter(chunk_size=400) +parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000) + +retriever = ParentDocumentRetriever( + vectorstore=vectorstore, + docstore=store, + child_splitter=child_splitter, + parent_splitter=parent_splitter +) +``` + +## Document Chunking Strategies + +### Recursive Character Text Splitter +```python +from langchain.text_splitters import RecursiveCharacterTextSplitter + +splitter = RecursiveCharacterTextSplitter( + 
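+    # chunk_size and chunk_overlap are measured with length_function
+    # (characters here), so 1000 chars is roughly 250 tokens at ~4 chars/token.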
chunk_size=1000, + chunk_overlap=200, + length_function=len, + separators=["\n\n", "\n", " ", ""] # Try these in order +) +``` + +### Token-Based Splitting +```python +from langchain.text_splitters import TokenTextSplitter + +splitter = TokenTextSplitter( + chunk_size=512, + chunk_overlap=50 +) +``` + +### Semantic Chunking +```python +from langchain.text_splitters import SemanticChunker + +splitter = SemanticChunker( + embeddings=OpenAIEmbeddings(), + breakpoint_threshold_type="percentile" +) +``` + +### Markdown Header Splitter +```python +from langchain.text_splitters import MarkdownHeaderTextSplitter + +headers_to_split_on = [ + ("#", "Header 1"), + ("##", "Header 2"), + ("###", "Header 3"), +] + +splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on) +``` + +## Vector Store Configurations + +### Pinecone +```python +import pinecone +from langchain.vectorstores import Pinecone + +pinecone.init(api_key="your-api-key", environment="us-west1-gcp") + +index = pinecone.Index("your-index-name") + +vectorstore = Pinecone(index, embeddings.embed_query, "text") +``` + +### Weaviate +```python +import weaviate +from langchain.vectorstores import Weaviate + +client = weaviate.Client("http://localhost:8080") + +vectorstore = Weaviate(client, "Document", "content", embeddings) +``` + +### Chroma (Local) +```python +from langchain.vectorstores import Chroma + +vectorstore = Chroma( + collection_name="my_collection", + embedding_function=embeddings, + persist_directory="./chroma_db" +) +``` + +## Retrieval Optimization + +### 1. Metadata Filtering +```python +# Add metadata during indexing +chunks_with_metadata = [] +for i, chunk in enumerate(chunks): + chunk.metadata = { + "source": chunk.metadata.get("source"), + "page": i, + "category": determine_category(chunk.page_content) + } + chunks_with_metadata.append(chunk) + +# Filter during retrieval +results = vectorstore.similarity_search( + "query", + filter={"category": "technical"}, + k=5 +) +``` + +### 2. Maximal Marginal Relevance +```python +# Balance relevance with diversity +results = vectorstore.max_marginal_relevance_search( + "query", + k=5, + fetch_k=20, # Fetch 20, return top 5 diverse + lambda_mult=0.5 # 0=max diversity, 1=max relevance +) +``` + +### 3. Reranking with Cross-Encoder +```python +from sentence_transformers import CrossEncoder + +reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2') + +# Get initial results +candidates = vectorstore.similarity_search("query", k=20) + +# Rerank +pairs = [[query, doc.page_content] for doc in candidates] +scores = reranker.predict(pairs) + +# Sort by score and take top k +reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)[:5] +``` + +## Prompt Engineering for RAG + +### Contextual Prompt +```python +prompt_template = """Use the following context to answer the question. If you cannot answer based on the context, say "I don't have enough information." + +Context: +{context} + +Question: {question} + +Answer:""" +``` + +### With Citations +```python +prompt_template = """Answer the question based on the context below. Include citations using [1], [2], etc. + +Context: +{context} + +Question: {question} + +Answer (with citations):""" +``` + +### With Confidence +```python +prompt_template = """Answer the question using the context. Provide a confidence score (0-100%) for your answer. 
+ +Context: +{context} + +Question: {question} + +Answer: +Confidence:""" +``` + +## Evaluation Metrics + +```python +def evaluate_rag_system(qa_chain, test_cases): + metrics = { + 'accuracy': [], + 'retrieval_quality': [], + 'groundedness': [] + } + + for test in test_cases: + result = qa_chain({"query": test['question']}) + + # Check if answer matches expected + accuracy = calculate_accuracy(result['result'], test['expected']) + metrics['accuracy'].append(accuracy) + + # Check if relevant docs were retrieved + retrieval_quality = evaluate_retrieved_docs( + result['source_documents'], + test['relevant_docs'] + ) + metrics['retrieval_quality'].append(retrieval_quality) + + # Check if answer is grounded in context + groundedness = check_groundedness( + result['result'], + result['source_documents'] + ) + metrics['groundedness'].append(groundedness) + + return {k: sum(v)/len(v) for k, v in metrics.items()} +``` + +## Resources + +- **references/vector-databases.md**: Detailed comparison of vector DBs +- **references/embeddings.md**: Embedding model selection guide +- **references/retrieval-strategies.md**: Advanced retrieval techniques +- **references/reranking.md**: Reranking methods and when to use them +- **references/context-window.md**: Managing context limits +- **assets/vector-store-config.yaml**: Configuration templates +- **assets/retriever-pipeline.py**: Complete RAG pipeline +- **assets/embedding-models.md**: Model comparison and benchmarks + +## Best Practices + +1. **Chunk Size**: Balance between context and specificity (500-1000 tokens) +2. **Overlap**: Use 10-20% overlap to preserve context at boundaries +3. **Metadata**: Include source, page, timestamp for filtering and debugging +4. **Hybrid Search**: Combine semantic and keyword search for best results +5. **Reranking**: Improve top results with cross-encoder +6. **Citations**: Always return source documents for transparency +7. **Evaluation**: Continuously test retrieval quality and answer accuracy +8. **Monitoring**: Track retrieval metrics in production + +## Common Issues + +- **Poor Retrieval**: Check embedding quality, chunk size, query formulation +- **Irrelevant Results**: Add metadata filtering, use hybrid search, rerank +- **Missing Information**: Ensure documents are properly indexed +- **Slow Queries**: Optimize vector store, use caching, reduce k +- **Hallucinations**: Improve grounding prompt, add verification step diff --git a/components/skills/meta-context-engineering-dev/SKILL.md b/components/skills/meta-context-engineering-dev/SKILL.md new file mode 100644 index 0000000..faab99d --- /dev/null +++ b/components/skills/meta-context-engineering-dev/SKILL.md @@ -0,0 +1,185 @@ +--- +name: context-fundamentals +description: Understand the components, mechanics, and constraints of context in agent systems. Use when designing agent architectures, debugging context-related failures, or optimizing context usage. +--- + +# Context Engineering Fundamentals + +Context is the complete state available to a language model at inference time. It includes everything the model can attend to when generating responses: system instructions, tool definitions, retrieved documents, message history, and tool outputs. Understanding context fundamentals is prerequisite to effective context engineering. 
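+
+As a rough sketch (field names are illustrative, not a fixed schema), the full inference-time state can be pictured as a single assembled payload:
+
+```python
+# Illustrative only: the components a model can attend to at inference time.
+context = {
+    "system_prompt": "...",        # identity, constraints, behavioral guidelines
+    "tool_definitions": [...],     # names, descriptions, parameter schemas
+    "retrieved_documents": [...],  # knowledge pulled in at runtime
+    "message_history": [...],      # prior turns, reasoning, task state
+    "tool_outputs": [...],         # file contents, search results, API responses
+}
+```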
+ +## When to Activate + +Activate this skill when: +- Designing new agent systems or modifying existing architectures +- Debugging unexpected agent behavior that may relate to context +- Optimizing context usage to reduce token costs or improve performance +- Onboarding new team members to context engineering concepts +- Reviewing context-related design decisions + +## Core Concepts + +Context comprises several distinct components, each with different characteristics and constraints. The attention mechanism creates a finite budget that constrains effective context usage. Progressive disclosure manages this constraint by loading information only as needed. The engineering discipline is curating the smallest high-signal token set that achieves desired outcomes. + +## Detailed Topics + +### The Anatomy of Context + +**System Prompts** +System prompts establish the agent's core identity, constraints, and behavioral guidelines. They are loaded once at session start and typically persist throughout the conversation. System prompts should be extremely clear and use simple, direct language at the right altitude for the agent. + +The right altitude balances two failure modes. At one extreme, engineers hardcode complex brittle logic that creates fragility and maintenance burden. At the other extreme, engineers provide vague high-level guidance that fails to give concrete signals for desired outputs or falsely assumes shared context. The optimal altitude strikes a balance: specific enough to guide behavior effectively, yet flexible enough to provide strong heuristics. + +Organize prompts into distinct sections using XML tagging or Markdown headers to delineate background information, instructions, tool guidance, and output description. The exact formatting matters less as models become more capable, but structural clarity remains valuable. + +**Tool Definitions** +Tool definitions specify the actions an agent can take. Each tool includes a name, description, parameters, and return format. Tool definitions live near the front of context after serialization, typically before or after the system prompt. + +Tool descriptions collectively steer agent behavior. Poor descriptions force agents to guess; optimized descriptions include usage context, examples, and defaults. The consolidation principle states that if a human engineer cannot definitively say which tool should be used in a given situation, an agent cannot be expected to do better. + +**Retrieved Documents** +Retrieved documents provide domain-specific knowledge, reference materials, or task-relevant information. Agents use retrieval augmented generation to pull relevant documents into context at runtime rather than pre-loading all possible information. + +The just-in-time approach maintains lightweight identifiers (file paths, stored queries, web links) and uses these references to load data into context dynamically. This mirrors human cognition: we generally do not memorize entire corpuses of information but rather use external organization and indexing systems to retrieve relevant information on demand. + +**Message History** +Message history contains the conversation between the user and agent, including previous queries, responses, and reasoning. For long-running tasks, message history can grow to dominate context usage. + +Message history serves as scratchpad memory where agents track progress, maintain task state, and preserve reasoning across turns. Effective management of message history is critical for long-horizon task completion. 
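+
+A minimal sketch of the scratchpad pattern (the message shape is an assumption, not a fixed schema):
+
+```python
+messages: list[dict] = []  # running conversation history
+
+def note_progress(history: list[dict], state: str) -> None:
+    """Record task state in the history so it survives later turns (sketch)."""
+    history.append({"role": "assistant", "content": f"Progress: {state}"})
+
+note_progress(messages, "steps 1-3 done; next, validate the config file")
+```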
+ +**Tool Outputs** +Tool outputs are the results of agent actions: file contents, search results, command execution output, API responses, and similar data. Tool outputs comprise the majority of tokens in typical agent trajectories, with research showing observations (tool outputs) can reach 83.9% of total context usage. + +Tool outputs consume context whether they are relevant to current decisions or not. This creates pressure for strategies like observation masking, compaction, and selective tool result retention. + +### Context Windows and Attention Mechanics + +**The Attention Budget Constraint** +Language models process tokens through attention mechanisms that create pairwise relationships between all tokens in context. For n tokens, this creates n² relationships that must be computed and stored. As context length increases, the model's ability to capture these relationships gets stretched thin. + +Models develop attention patterns from training data distributions where shorter sequences predominate. This means models have less experience with and fewer specialized parameters for context-wide dependencies. The result is an "attention budget" that depletes as context grows. + +**Position Encoding and Context Extension** +Position encoding interpolation allows models to handle longer sequences by adapting them to originally trained smaller contexts. However, this adaptation introduces degradation in token position understanding. Models remain highly capable at longer contexts but show reduced precision for information retrieval and long-range reasoning compared to performance on shorter contexts. + +**The Progressive Disclosure Principle** +Progressive disclosure manages context efficiently by loading information only as needed. At startup, agents load only skill names and descriptions—sufficient to know when a skill might be relevant. Full content loads only when a skill is activated for specific tasks. + +This approach keeps agents fast while giving them access to more context on demand. The principle applies at multiple levels: skill selection, document loading, and even tool result retrieval. + +### Context Quality Versus Context Quantity + +The assumption that larger context windows solve memory problems has been empirically debunked. Context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of desired outcomes. + +Several factors create pressure for context efficiency. Processing cost grows disproportionately with context length—not just double the cost for double the tokens, but exponentially more in time and computing resources. Model performance degrades beyond certain context lengths even when the window technically supports more tokens. Long inputs remain expensive even with prefix caching. + +The guiding principle is informativity over exhaustiveness. Include what matters for the decision at hand, exclude what does not, and design systems that can access additional information on demand. + +### Context as Finite Resource + +Context must be treated as a finite resource with diminishing marginal returns. Like humans with limited working memory, language models have an attention budget drawn on when parsing large volumes of context. + +Every new token introduced depletes this budget by some amount. This creates the need for careful curation of available tokens. The engineering problem is optimizing utility against inherent constraints. 
+
+Context engineering is iterative, and the curation phase happens each time you decide what to pass to the model. It is not a one-time prompt-writing exercise but an ongoing discipline of context management.
+
+## Practical Guidance
+
+### File-System-Based Access
+
+Agents with filesystem access can use progressive disclosure naturally. Store reference materials, documentation, and data externally. Load files only when needed using standard filesystem operations. This pattern avoids stuffing context with information that may not be relevant.
+
+The file system itself provides structure that agents can navigate. File sizes suggest complexity; naming conventions hint at purpose; timestamps serve as proxies for relevance. File-reference metadata provides a mechanism for efficiently refining behavior.
+
+### Hybrid Strategies
+
+The most effective agents employ hybrid strategies. Pre-load some context for speed (like CLAUDE.md files or project rules), but enable autonomous exploration for additional context as needed. The decision boundary depends on task characteristics and context dynamics.
+
+For contexts with less dynamic content, pre-loading more upfront makes sense. For rapidly changing or highly specific information, just-in-time loading avoids stale context.
+
+### Context Budgeting
+
+Design with explicit context budgets in mind. Know the effective context limit for your model and task. Monitor context usage during development. Implement compaction triggers at appropriate thresholds. Design systems assuming context will degrade rather than hoping it will not.
+
+Effective context budgeting requires understanding not just raw token counts but also attention distribution patterns. The middle of context receives less attention than the beginning and end. Place critical information at attention-favored positions.
+
+## Examples
+
+**Example 1: Organizing System Prompts**
+```markdown
+<background_information>
+You are a Python expert helping a development team.
+Current project: Data processing pipeline in Python 3.9+
+</background_information>
+
+<instructions>
+- Write clean, idiomatic Python code
+- Include type hints for function signatures
+- Add docstrings for public functions
+- Follow PEP 8 style guidelines
+</instructions>
+
+<tool_guidance>
+Use bash for shell operations, python for code tasks.
+File operations should use pathlib for cross-platform compatibility.
+</tool_guidance>
+
+<output_description>
+Provide code blocks with syntax highlighting.
+Explain non-obvious decisions in comments.
+</output_description>
+```
+
+**Example 2: Progressive Document Loading**
+```markdown
+# Instead of loading all documentation at once:
+
+# Step 1: Load summary
+docs/api_summary.md  # Lightweight overview
+
+# Step 2: Load specific section as needed
+docs/api/endpoints.md  # Only when API calls needed
+docs/api/authentication.md  # Only when auth context needed
+```
+
+## Guidelines
+
+1. Treat context as a finite resource with diminishing returns
+2. Place critical information at attention-favored positions (beginning and end)
+3. Use progressive disclosure to defer loading until needed
+4. Organize system prompts with clear section boundaries
+5. Monitor context usage during development
+6. Implement compaction triggers at 70-80% utilization
+7. Design for context degradation rather than hoping to avoid it
+8. Prefer smaller high-signal context over larger low-signal context
+
+## Integration
+
+This skill provides foundational context that all other skills build upon.
It should be studied first before exploring:
+
+- context-degradation - Understanding how context fails
+- context-optimization - Techniques for extending context capacity
+- multi-agent-patterns - How context isolation enables multi-agent systems
+- tool-design - How tool definitions interact with context
+
+## References
+
+Internal reference:
+- [Context Components Reference](./references/context-components.md) - Detailed technical reference
+
+Related skills in this collection:
+- context-degradation - Understanding context failure patterns
+- context-optimization - Techniques for efficient context use
+
+External resources:
+- Research on transformer attention mechanisms
+- Production engineering guides from leading AI labs
+- Framework documentation on context window management
+
+---
+
+## Skill Metadata
+
+**Created**: 2025-12-20
+**Last Updated**: 2025-12-20
+**Author**: Agent Skills for Context Engineering Contributors
+**Version**: 1.0.0
diff --git a/components/skills/meta-context-engineering-dev/references/context-components.md b/components/skills/meta-context-engineering-dev/references/context-components.md
new file mode 100644
index 0000000..2c0a6d5
--- /dev/null
+++ b/components/skills/meta-context-engineering-dev/references/context-components.md
@@ -0,0 +1,283 @@
+# Context Components: Technical Reference
+
+This document provides detailed technical reference for each context component in agent systems.
+
+## System Prompt Engineering
+
+### Section Structure
+
+Organize system prompts into distinct sections with clear boundaries. A recommended structure:
+
+```
+<background_information>
+Context about the domain, user preferences, or project-specific details
+</background_information>
+
+<instructions>
+Core behavioral guidelines and task instructions
+</instructions>
+
+<tool_guidance>
+When and how to use available tools
+</tool_guidance>
+
+<output_description>
+Expected output format and quality standards
+</output_description>
+```
+
+This structure allows agents to locate relevant information quickly and enables selective context loading in advanced implementations.
+
+### Altitude Calibration
+
+The "altitude" of instructions refers to the level of abstraction. Consider these examples:
+
+**Too Low (Brittle):**
+```
+If the user asks about pricing, check the pricing table in docs/pricing.md.
+If the table shows USD, convert to EUR using the exchange rate in
+config/exchange_rates.json. If the user is in the EU, add VAT at the
+applicable rate from config/vat_rates.json. Format the response with
+the currency symbol, two decimal places, and a note about VAT.
+```
+
+**Too High (Vague):**
+```
+Help users with pricing questions. Be helpful and accurate.
+```
+
+**Optimal (Heuristic-Driven):**
+```
+For pricing inquiries:
+1. Retrieve current rates from docs/pricing.md
+2. Apply user location adjustments (see config/location_defaults.json)
+3. Format with appropriate currency and tax considerations
+
+Prefer exact figures over estimates. When rates are unavailable,
+say so explicitly rather than projecting.
+```
+
+The optimal altitude provides clear steps while allowing flexibility in execution.
+ +## Tool Definition Specification + +### Schema Structure + +Each tool should define: + +```python +{ + "name": "tool_function_name", + "description": "Clear description of what the tool does and when to use it", + "parameters": { + "type": "object", + "properties": { + "param_name": { + "type": "string", + "description": "What this parameter controls", + "default": "reasonable_default_value" + } + }, + "required": ["param_name"] + }, + "returns": { + "type": "object", + "description": "What the tool returns and its structure" + } +} +``` + +### Description Engineering + +Tool descriptions should answer: what the tool does, when to use it, and what it produces. Include usage context, examples, and edge cases. + +**Weak Description:** +``` +Search the database for customer information. +``` + +**Strong Description:** +``` +Retrieve customer information by ID or email. + +Use when: +- User asks about a specific customer's details, history, or status +- User provides a customer identifier and needs related information + +Returns customer object with: +- Basic info (name, email, account status) +- Order history summary +- Support ticket count + +Returns null if customer not found. Returns error if database unreachable. +``` + +## Retrieved Document Management + +### Identifier Design + +Design identifiers that convey meaning and enable efficient retrieval: + +**Poor identifiers:** +- `data/file1.json` +- `ref/ref.md` +- `2024/q3/report` + +**Strong identifiers:** +- `customer_pricing_rates.json` +- `engineering_onboarding_checklist.md` +- `2024_q3_revenue_report.pdf` + +Strong identifiers allow agents to locate relevant files even without search tools. + +### Document Chunking Strategy + +For large documents, chunk strategically to preserve semantic coherence: + +```python +# Pseudocode for semantic chunking +def chunk_document(content): + """Split document at natural semantic boundaries.""" + boundaries = find_section_headers(content) + boundaries += find_paragraph_breaks(content) + boundaries += find_logical_breaks(content) + + chunks = [] + for i in range(len(boundaries) - 1): + chunk = content[boundaries[i]:boundaries[i+1]] + if len(chunk) > MIN_CHUNK_SIZE and len(chunk) < MAX_CHUNK_SIZE: + chunks.append(chunk) + + return chunks +``` + +Avoid arbitrary character limits that split mid-sentence or mid-concept. 
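+
+A minimal runnable variant of the same idea, packing whole paragraphs into size-bounded chunks (the size constant is illustrative):
+
+```python
+MAX_CHUNK_SIZE = 2000  # characters; tune per corpus and embedding model
+
+def chunk_by_paragraph(content: str) -> list[str]:
+    """Greedily pack whole paragraphs into chunks, never splitting mid-paragraph."""
+    chunks: list[str] = []
+    current = ""
+    for para in content.split("\n\n"):
+        # Flush the running chunk before it would exceed the size bound.
+        if current and len(current) + len(para) > MAX_CHUNK_SIZE:
+            chunks.append(current.strip())
+            current = ""
+        current += para + "\n\n"
+    if current.strip():
+        chunks.append(current.strip())
+    return chunks
+```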
+ +## Message History Management + +### Turn Representation + +Structure message history to preserve key information: + +```python +{ + "role": "user" | "assistant" | "tool", + "content": "message text", + "reasoning": "optional chain-of-thought", + "tool_calls": [list if role="assistant"], + "tool_output": "output if role="tool"", + "summary": "compact summary if conversation is long" +} +``` + +### Summary Injection Pattern + +For long conversations, inject summaries at intervals: + +```python +def inject_summaries(messages, summary_interval=20): + """Inject summaries at regular intervals to preserve context.""" + summarized = [] + for i, msg in enumerate(messages): + summarized.append(msg) + if i > 0 and i % summary_interval == 0: + summary = generate_summary(summarized[-summary_interval:]) + summarized.append({ + "role": "system", + "content": f"Conversation summary: {summary}", + "is_summary": True + }) + return summarized +``` + +## Tool Output Optimization + +### Response Formats + +Provide response format options to control token usage: + +```python +def get_customer_response_format(): + return { + "format": "concise | detailed", + "fields": ["id", "name", "email", "status", "history_summary"] + } +``` + +The concise format returns essential fields only; detailed returns complete objects. + +### Observation Masking + +For verbose tool outputs, consider masking patterns: + +```python +def mask_observation(output, max_length=500): + """Replace long observations with compact references.""" + if len(output) <= max_length: + return output + + reference_id = store_observation(output) + return f"[Previous observation elided. Full content stored at reference {reference_id}]" +``` + +This preserves information access while reducing token usage. + +## Context Budget Estimation + +### Token Counting Approximation + +For planning purposes, estimate tokens at approximately 4 characters per token for English text: + +``` +1000 words ≈ 7500 characters ≈ 1800-2000 tokens +``` + +This is a rough approximation; actual tokenization varies by model and content type. + +### Context Budget Allocation + +Allocate context budget across components: + +| Component | Typical Range | Notes | +|-----------|---------------|-------| +| System prompt | 500-2000 tokens | Stable across session | +| Tool definitions | 100-500 per tool | Grows with tool count | +| Retrieved documents | Variable | Often largest consumer | +| Message history | Variable | Grows with conversation | +| Tool outputs | Variable | Can dominate context | + +Monitor actual usage during development to establish baseline allocations. 
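+
+A small helper for turning those allocations into an actionable check (the threshold is illustrative):
+
+```python
+def utilization(breakdown: dict[str, int], limit: int) -> float:
+    """Fraction of the context budget consumed by all components."""
+    return sum(breakdown.values()) / limit
+
+def should_compact(breakdown: dict[str, int], limit: int, threshold: float = 0.75) -> bool:
+    """True once measured usage crosses the compaction threshold."""
+    return utilization(breakdown, limit) >= threshold
+
+print(should_compact({"system": 1500, "tools": 3000, "history": 60000}, 80000))  # True
+```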
+ +## Progressive Disclosure Implementation + +### Skill Activation Pattern + +```python +def activate_skill_context(skill_name, task_description): + """Load skill context when task matches skill description.""" + skill_metadata = load_all_skill_metadata() + + relevant_skills = [] + for skill in skill_metadata: + if skill_matches_task(skill, task_description): + relevant_skills.append(skill) + + # Load full content only for most relevant skills + for skill in relevant_skills[:MAX_CONCURRENT_SKILLS]: + skill_context = load_skill_content(skill) + inject_into_context(skill_context) +``` + +### Reference Loading Pattern + +```python +def get_reference(file_reference): + """Load reference file only when explicitly needed.""" + if not file_reference.is_loaded: + file_reference.content = read_file(file_reference.path) + file_reference.is_loaded = True + return file_reference.content +``` + +This pattern ensures files are loaded once and cached for the session. + diff --git a/components/skills/meta-context-engineering-dev/scripts/context_manager.py b/components/skills/meta-context-engineering-dev/scripts/context_manager.py new file mode 100644 index 0000000..54d31f3 --- /dev/null +++ b/components/skills/meta-context-engineering-dev/scripts/context_manager.py @@ -0,0 +1,370 @@ +""" +Context Management Utilities + +This module provides utilities for managing context in agent systems. + +Note: This module uses simplified estimation functions for demonstration. +Production systems should use actual tokenizers (tiktoken for OpenAI, +model-specific tokenizers for other providers) for accurate token counts. +""" + +from typing import Dict, List +import hashlib + + +def estimate_token_count(text: str) -> int: + """ + Estimate token count for text. + + Uses approximation: ~4 characters per token for English. + + WARNING: This is a rough estimate for demonstration purposes. 
+ Production systems should use actual tokenizers: + - OpenAI: tiktoken library + - Anthropic: Model-specific tokenizers + - Other: Provider-specific tokenization + + Actual tokenization varies by: + - Model architecture + - Content type (code vs prose) + - Language (non-English typically has higher token/char ratio) + """ + return len(text) // 4 + + +def estimate_message_tokens(messages: list) -> int: + """Estimate token count for message list.""" + total = 0 + for msg in messages: + content = msg.get("content", "") + total += estimate_token_count(content) + total += 10 # Overhead for role/formatting + return total + + +def count_tokens_by_type(context: Dict) -> Dict: + """Break down token usage by context type.""" + breakdown = { + "system_prompt": 0, + "tool_definitions": 0, + "retrieved_documents": 0, + "message_history": 0, + "tool_outputs": 0, + "other": 0 + } + + # System prompt + if "system" in context: + breakdown["system_prompt"] = estimate_token_count(context["system"]) + + # Tool definitions + if "tools" in context: + for tool in context["tools"]: + breakdown["tool_definitions"] += estimate_token_count(str(tool)) + + # Retrieved documents + if "documents" in context: + for doc in context["documents"]: + breakdown["retrieved_documents"] += estimate_token_count(doc) + + # Message history + if "messages" in context: + breakdown["message_history"] = estimate_message_tokens(context["messages"]) + + return breakdown + + +# Context Builder + +class ContextBuilder: + """Build context with budget management.""" + + def __init__(self, context_limit: int = 100000): + self.context_limit = context_limit + self.sections: Dict[str, str] = {} + self.order: List[str] = [] + + def add_section(self, name: str, content: str, + priority: int = 0, category: str = "other"): + """Add section to context.""" + if name not in self.sections: + self.order.append(name) + + self.sections[name] = { + "content": content, + "priority": priority, + "category": category, + "tokens": estimate_token_count(content) + } + + def build(self, max_tokens: int = None) -> str: + """Build context within token limit.""" + limit = max_tokens or self.context_limit + + # Sort by priority (higher first) + sorted_sections = sorted( + self.order, + key=lambda n: self.sections[n]["priority"], + reverse=True + ) + + # Build context + context_parts = [] + current_tokens = 0 + + for name in sorted_sections: + section = self.sections[name] + section_tokens = section["tokens"] + + if current_tokens + section_tokens <= limit: + context_parts.append(section["content"]) + current_tokens += section_tokens + + return "\n\n".join(context_parts) + + def get_usage_report(self) -> Dict: + """Get current context usage report.""" + total = sum(s["tokens"] for s in self.sections.values()) + return { + "total_tokens": total, + "limit": self.context_limit, + "utilization": total / self.context_limit, + "by_section": { + name: s["tokens"] + for name, s in self.sections.items() + }, + "status": self._get_status(total) + } + + def _get_status(self, total: int) -> str: + """Get status based on utilization.""" + ratio = total / self.context_limit + if ratio > 0.9: + return "critical" + elif ratio > 0.7: + return "warning" + else: + return "healthy" + + +# Context Truncation + +def truncate_context(context: str, max_tokens: int, + preserve_start: bool = True) -> str: + """ + Truncate context to fit within token limit. 
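+
+    Note: this sketch splits on whitespace, so max_tokens is enforced in
+    whitespace-delimited words rather than true model tokens.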
+ + Args: + context: Full context string + max_tokens: Maximum tokens to keep + preserve_start: If True, preserve beginning; otherwise preserve end + + Returns: + Truncated context + """ + tokens = context.split() + current_tokens = len(tokens) + + if current_tokens <= max_tokens: + return context + + if preserve_start: + # Keep beginning, truncate end + kept = tokens[:max_tokens] + else: + # Keep end, truncate beginning + kept = tokens[-max_tokens:] + + return " ".join(kept) + + +def truncate_messages(messages: list, max_tokens: int) -> list: + """ + Truncate message history while preserving structure. + + Strategy: + 1. Always keep system prompt + 2. Keep recent messages + 3. Summarize older messages if needed + """ + system_prompt = None + recent_messages = [] + summary = None + + for msg in messages: + if msg["role"] == "system": + system_prompt = msg + elif msg.get("is_summary"): + summary = msg + else: + recent_messages.append(msg) + + # Calculate token usage + tokens_for_system = estimate_token_count(system_prompt["content"]) if system_prompt else 0 + tokens_for_recent = estimate_message_tokens(recent_messages) + tokens_for_summary = estimate_token_count(summary["content"]) if summary else 0 + + available = max_tokens - tokens_for_system - tokens_for_summary + + # Truncate recent if needed + if tokens_for_recent > available: + # Keep most recent messages + truncated_recent = [] + current_tokens = 0 + + for msg in reversed(recent_messages): + msg_tokens = estimate_token_count(msg.get("content", "")) + if current_tokens + msg_tokens <= available: + truncated_recent.insert(0, msg) + current_tokens += msg_tokens + + recent_messages = truncated_recent + + result = [] + if system_prompt: + result.append(system_prompt) + if summary: + result.append(summary) + result.extend(recent_messages) + + return result + + +# Context Validation + +def validate_context_structure(context: Dict) -> Dict: + """ + Validate context structure for common issues. + + Returns validation results with issues and recommendations. 
+ """ + issues = [] + recommendations = [] + + # Check for empty sections + for section, content in context.items(): + if not content: + issues.append(f"Empty {section} section") + recommendations.append(f"Remove or populate {section}") + + # Check for excessive length + total_tokens = sum(estimate_token_count(str(c)) for c in context.values()) + if total_tokens > 80000: + issues.append(f"Context length ({total_tokens} tokens) exceeds recommended limit") + recommendations.append("Consider context compaction or partitioning") + + # Check for missing sections + recommended_sections = ["system", "task"] + for section in recommended_sections: + if section not in context: + issues.append(f"Missing recommended section: {section}") + recommendations.append(f"Add {section} section with relevant information") + + # Check for duplicate information + # Using hashlib instead of hash() for cross-process consistency + seen_content = set() + for section, content in context.items(): + content_str = str(content)[:1000] # First 1000 chars + content_hash = hashlib.md5(content_str.encode()).hexdigest() + if content_hash in seen_content: + issues.append(f"Potential duplicate content in {section}") + seen_content.add(content_hash) + + return { + "valid": len(issues) == 0, + "issues": issues, + "recommendations": recommendations + } + + +# Progressive Disclosure + +class ProgressiveDisclosureManager: + """Manage progressive disclosure of context.""" + + def __init__(self, base_dir: str = "."): + self.base_dir = base_dir + self.loaded_files: Dict[str, str] = {} + + def load_summary(self, summary_path: str) -> str: + """Load summary without loading full content.""" + if summary_path in self.loaded_files: + return self.loaded_files[summary_path] + + try: + with open(summary_path, 'r') as f: + content = f.read() + self.loaded_files[summary_path] = content + return content + except FileNotFoundError: + return "" + + def load_detail(self, detail_path: str, force: bool = False) -> str: + """Load detailed content on demand.""" + if not force and detail_path in self.loaded_files: + return self.loaded_files[detail_path] + + try: + with open(detail_path, 'r') as f: + content = f.read() + self.loaded_files[detail_path] = content + return content + except FileNotFoundError: + return "" + + def get_contextual_info(self, reference: Dict) -> str: + """ + Get information following progressive disclosure. + + Returns summary if available, loads detail if needed. 
+ """ + summary_path = reference.get("summary_path") + detail_path = reference.get("detail_path") + need_detail = reference.get("need_detail", False) + + if need_detail and detail_path: + return self.load_detail(detail_path) + elif summary_path: + return self.load_summary(summary_path) + else: + return "" + + +# Usage Example + +def build_agent_context(task: str, system_prompt: str, + documents: List[str] = None) -> Dict: + """Build optimized context for agent task.""" + builder = ContextBuilder(context_limit=80000) + + # Add system prompt (highest priority) + builder.add_section("system", system_prompt, priority=10, + category="system") + + # Add task description + builder.add_section("task", task, priority=9, category="task") + + # Add retrieved documents + if documents: + for i, doc in enumerate(documents): + builder.add_section( + f"document_{i}", + doc, + priority=5, + category="retrieved" + ) + + # Build and validate + context = { + "system": system_prompt, + "task": task, + "documents": documents or [] + } + + validation = validate_context_structure(context) + + return { + "context": builder.build(), + "usage_report": builder.get_usage_report(), + "validation": validation + } diff --git a/components/skills/method-debugging-systematic-eng/CREATION-LOG.md b/components/skills/method-debugging-systematic-eng/CREATION-LOG.md new file mode 100644 index 0000000..024d00a --- /dev/null +++ b/components/skills/method-debugging-systematic-eng/CREATION-LOG.md @@ -0,0 +1,119 @@ +# Creation Log: Systematic Debugging Skill + +Reference example of extracting, structuring, and bulletproofing a critical skill. + +## Source Material + +Extracted debugging framework from `/Users/jesse/.claude/CLAUDE.md`: +- 4-phase systematic process (Investigation → Pattern Analysis → Hypothesis → Implementation) +- Core mandate: ALWAYS find root cause, NEVER fix symptoms +- Rules designed to resist time pressure and rationalization + +## Extraction Decisions + +**What to include:** +- Complete 4-phase framework with all rules +- Anti-shortcuts ("NEVER fix symptom", "STOP and re-analyze") +- Pressure-resistant language ("even if faster", "even if I seem in a hurry") +- Concrete steps for each phase + +**What to leave out:** +- Project-specific context +- Repetitive variations of same rule +- Narrative explanations (condensed to principles) + +## Structure Following skill-creation/SKILL.md + +1. **Rich when_to_use** - Included symptoms and anti-patterns +2. **Type: technique** - Concrete process with steps +3. **Keywords** - "root cause", "symptom", "workaround", "debugging", "investigation" +4. **Flowchart** - Decision point for "fix failed" → re-analyze vs add more fixes +5. **Phase-by-phase breakdown** - Scannable checklist format +6. 
**Anti-patterns section** - What NOT to do (critical for this skill) + +## Bulletproofing Elements + +Framework designed to resist rationalization under pressure: + +### Language Choices +- "ALWAYS" / "NEVER" (not "should" / "try to") +- "even if faster" / "even if I seem in a hurry" +- "STOP and re-analyze" (explicit pause) +- "Don't skip past" (catches the actual behavior) + +### Structural Defenses +- **Phase 1 required** - Can't skip to implementation +- **Single hypothesis rule** - Forces thinking, prevents shotgun fixes +- **Explicit failure mode** - "IF your first fix doesn't work" with mandatory action +- **Anti-patterns section** - Shows exactly what shortcuts look like + +### Redundancy +- Root cause mandate in overview + when_to_use + Phase 1 + implementation rules +- "NEVER fix symptom" appears 4 times in different contexts +- Each phase has explicit "don't skip" guidance + +## Testing Approach + +Created 4 validation tests following skills/meta/testing-skills-with-subagents: + +### Test 1: Academic Context (No Pressure) +- Simple bug, no time pressure +- **Result:** Perfect compliance, complete investigation + +### Test 2: Time Pressure + Obvious Quick Fix +- User "in a hurry", symptom fix looks easy +- **Result:** Resisted shortcut, followed full process, found real root cause + +### Test 3: Complex System + Uncertainty +- Multi-layer failure, unclear if can find root cause +- **Result:** Systematic investigation, traced through all layers, found source + +### Test 4: Failed First Fix +- Hypothesis doesn't work, temptation to add more fixes +- **Result:** Stopped, re-analyzed, formed new hypothesis (no shotgun) + +**All tests passed.** No rationalizations found. + +## Iterations + +### Initial Version +- Complete 4-phase framework +- Anti-patterns section +- Flowchart for "fix failed" decision + +### Enhancement 1: TDD Reference +- Added link to skills/testing/test-driven-development +- Note explaining TDD's "simplest code" ≠ debugging's "root cause" +- Prevents confusion between methodologies + +## Final Outcome + +Bulletproof skill that: +- ✅ Clearly mandates root cause investigation +- ✅ Resists time pressure rationalization +- ✅ Provides concrete steps for each phase +- ✅ Shows anti-patterns explicitly +- ✅ Tested under multiple pressure scenarios +- ✅ Clarifies relationship to TDD +- ✅ Ready for use + +## Key Insight + +**Most important bulletproofing:** Anti-patterns section showing exact shortcuts that feel justified in the moment. When Claude thinks "I'll just add this one quick fix", seeing that exact pattern listed as wrong creates cognitive friction. + +## Usage Example + +When encountering a bug: +1. Load skill: skills/debugging/systematic-debugging +2. Read overview (10 sec) - reminded of mandate +3. Follow Phase 1 checklist - forced investigation +4. If tempted to skip - see anti-pattern, stop +5. 
Complete all phases - root cause found
+
+**Time investment:** 5-10 minutes
+**Time saved:** Hours of symptom-whack-a-mole
+
+---
+
+*Created: 2025-10-03*
+*Purpose: Reference example for skill extraction and bulletproofing*
diff --git a/components/skills/method-debugging-systematic-eng/SKILL.md b/components/skills/method-debugging-systematic-eng/SKILL.md
new file mode 100644
index 0000000..111d2a9
--- /dev/null
+++ b/components/skills/method-debugging-systematic-eng/SKILL.md
@@ -0,0 +1,296 @@
+---
+name: systematic-debugging
+description: Use when encountering any bug, test failure, or unexpected behavior, before proposing fixes
+---
+
+# Systematic Debugging
+
+## Overview
+
+Random fixes waste time and create new bugs. Quick patches mask underlying issues.
+
+**Core principle:** ALWAYS find root cause before attempting fixes. Symptom fixes are failure.
+
+**Violating the letter of this process is violating the spirit of debugging.**
+
+## The Iron Law
+
+```
+NO FIXES WITHOUT ROOT CAUSE INVESTIGATION FIRST
+```
+
+If you haven't completed Phase 1, you cannot propose fixes.
+
+## When to Use
+
+Use for ANY technical issue:
+- Test failures
+- Bugs in production
+- Unexpected behavior
+- Performance problems
+- Build failures
+- Integration issues
+
+**Use this ESPECIALLY when:**
+- Under time pressure (emergencies make guessing tempting)
+- "Just one quick fix" seems obvious
+- You've already tried multiple fixes
+- Previous fix didn't work
+- You don't fully understand the issue
+
+**Don't skip when:**
+- Issue seems simple (simple bugs have root causes too)
+- You're in a hurry (rushing guarantees rework)
+- Manager wants it fixed NOW (systematic is faster than thrashing)
+
+## The Four Phases
+
+You MUST complete each phase before proceeding to the next.
+
+### Phase 1: Root Cause Investigation
+
+**BEFORE attempting ANY fix:**
+
+1. **Read Error Messages Carefully**
+   - Don't skip past errors or warnings
+   - They often contain the exact solution
+   - Read stack traces completely
+   - Note line numbers, file paths, error codes
+
+2. **Reproduce Consistently**
+   - Can you trigger it reliably?
+   - What are the exact steps?
+   - Does it happen every time?
+   - If not reproducible → gather more data, don't guess
+
+3. **Check Recent Changes**
+   - What changed that could cause this?
+   - Git diff, recent commits (see the sketch below)
+   - New dependencies, config changes
+   - Environmental differences
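+
+   A quick sketch of pulling that evidence (assumes a git repo; the suspect path is hypothetical):
+
+   ```bash
+   git log --oneline -15          # recent commits
+   git diff HEAD~5 --stat         # which files changed, and how much
+   git diff HEAD~5 -- src/auth/   # full diff of the suspect area (hypothetical path)
+   ```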
+
+4. **Gather Evidence in Multi-Component Systems**
+
+   **WHEN system has multiple components (CI → build → signing, API → service → database):**
+
+   **BEFORE proposing fixes, add diagnostic instrumentation:**
+   ```
+   For EACH component boundary:
+   - Log what data enters component
+   - Log what data exits component
+   - Verify environment/config propagation
+   - Check state at each layer
+
+   Run once to gather evidence showing WHERE it breaks
+   THEN analyze evidence to identify failing component
+   THEN investigate that specific component
+   ```
+
+   **Example (multi-layer system):**
+   ```bash
+   # Layer 1: Workflow
+   echo "=== Secrets available in workflow: ==="
+   # Report presence only - never echo the secret itself
+   echo "IDENTITY: $([ -n "${IDENTITY:-}" ] && echo SET || echo UNSET)"
+
+   # Layer 2: Build script
+   echo "=== Env vars in build script: ==="
+   env | grep -q '^IDENTITY=' && echo "IDENTITY in environment" || echo "IDENTITY not in environment"
+
+   # Layer 3: Signing script
+   echo "=== Keychain state: ==="
+   security list-keychains
+   security find-identity -v
+
+   # Layer 4: Actual signing
+   codesign --sign "$IDENTITY" --verbose=4 "$APP"
+   ```
+
+   **This reveals:** Which layer fails (secrets → workflow ✓, workflow → build ✗)
+
+5. **Trace Data Flow**
+
+   **WHEN error is deep in call stack:**
+
+   See `root-cause-tracing.md` in this directory for the complete backward tracing technique.
+
+   **Quick version:**
+   - Where does bad value originate?
+   - What called this with bad value?
+   - Keep tracing up until you find the source
+   - Fix at source, not at symptom
+
+### Phase 2: Pattern Analysis
+
+**Find the pattern before fixing:**
+
+1. **Find Working Examples**
+   - Locate similar working code in same codebase
+   - What works that's similar to what's broken?
+
+2. **Compare Against References**
+   - If implementing pattern, read reference implementation COMPLETELY
+   - Don't skim - read every line
+   - Understand the pattern fully before applying
+
+3. **Identify Differences**
+   - What's different between working and broken?
+   - List every difference, however small
+   - Don't assume "that can't matter"
+
+4. **Understand Dependencies**
+   - What other components does this need?
+   - What settings, config, environment?
+   - What assumptions does it make?
+
+### Phase 3: Hypothesis and Testing
+
+**Scientific method:**
+
+1. **Form Single Hypothesis**
+   - State clearly: "I think X is the root cause because Y"
+   - Write it down
+   - Be specific, not vague
+
+2. **Test Minimally**
+   - Make the SMALLEST possible change to test hypothesis
+   - One variable at a time
+   - Don't fix multiple things at once
+
+3. **Verify Before Continuing**
+   - Did it work? Yes → Phase 4
+   - Didn't work? Form NEW hypothesis
+   - DON'T add more fixes on top
+
+4. **When You Don't Know**
+   - Say "I don't understand X"
+   - Don't pretend to know
+   - Ask for help
+   - Research more
+
+### Phase 4: Implementation
+
+**Fix the root cause, not the symptom:**
+
+1. **Create Failing Test Case**
+   - Simplest possible reproduction
+   - Automated test if possible
+   - One-off test script if no framework
+   - MUST have before fixing
+   - Use the `superpowers:test-driven-development` skill for writing proper failing tests
+   - A minimal sketch appears at the end of this phase
+
+2. **Implement Single Fix**
+   - Address the root cause identified
+   - ONE change at a time
+   - No "while I'm here" improvements
+   - No bundled refactoring
+
+3. **Verify Fix**
+   - Test passes now?
+   - No other tests broken?
+   - Issue actually resolved?
+
+4. **If Fix Doesn't Work**
+   - STOP
+   - Count: How many fixes have you tried?
+   - If < 3: Return to Phase 1, re-analyze with new information
+   - **If ≥ 3: STOP and question the architecture (Step 5 below)**
+   - DON'T attempt Fix #4 without architectural discussion
+
+5. **If 3+ Fixes Failed: Question Architecture**
+
+   **Pattern indicating architectural problem:**
+   - Each fix reveals new shared state/coupling/problem in different place
+   - Fixes require "massive refactoring" to implement
+   - Each fix creates new symptoms elsewhere
+
+   **STOP and question fundamentals:**
+   - Is this pattern fundamentally sound?
+   - Are we "sticking with it through sheer inertia"?
+   - Should we refactor the architecture instead of continuing to fix symptoms?
+
+   **Discuss with your human partner before attempting more fixes.**
+
+   This is NOT a failed hypothesis - this is a wrong architecture.
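+
+**Example of Step 1 (Create Failing Test Case):** a minimal sketch in pytest - the module, function, and expected value are hypothetical stand-ins for whatever Phases 1-3 identified:
+
+```python
+# test_discount_repro.py - simplest possible reproduction (hypothetical names)
+import pytest
+
+from pricing import parse_discount  # hypothetical module under investigation
+
+
+def test_parse_discount_handles_percent_suffix():
+    # Phases 1-3 traced the bug to percent-suffixed input.
+    # This test MUST fail before the fix - that proves it exercises the root cause.
+    assert parse_discount("10%") == pytest.approx(0.10)
+```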
+
+## Red Flags - STOP and Follow Process
+
+If you catch yourself thinking:
+- "Quick fix for now, investigate later"
+- "Just try changing X and see if it works"
+- "Add multiple changes, run tests"
+- "Skip the test, I'll manually verify"
+- "It's probably X, let me fix that"
+- "I don't fully understand but this might work"
+- "Pattern says X but I'll adapt it differently"
+- "Here are the main problems: [lists fixes without investigation]"
+- Proposing solutions before tracing data flow
+- **"One more fix attempt" (when already tried 2+)**
+- **Each fix reveals new problem in different place**
+
+**ALL of these mean: STOP. Return to Phase 1.**
+
+**If 3+ fixes failed:** Question the architecture (see Phase 4, Step 5)
+
+## Your Human Partner's Signals You're Doing It Wrong
+
+**Watch for these redirections:**
+- "Is that not happening?" - You assumed without verifying
+- "Will it show us...?" - You should have added evidence gathering
+- "Stop guessing" - You're proposing fixes without understanding
+- "Ultrathink this" - Question fundamentals, not just symptoms
+- "We're stuck?" (frustrated) - Your approach isn't working
+
+**When you see these:** STOP. Return to Phase 1.
+
+## Common Rationalizations
+
+| Excuse | Reality |
+|--------|---------|
+| "Issue is simple, don't need process" | Simple issues have root causes too. Process is fast for simple bugs. |
+| "Emergency, no time for process" | Systematic debugging is FASTER than guess-and-check thrashing. |
+| "Just try this first, then investigate" | First fix sets the pattern. Do it right from the start. |
+| "I'll write test after confirming fix works" | Untested fixes don't stick. Test first proves it. |
+| "Multiple fixes at once saves time" | Can't isolate what worked. Causes new bugs. |
+| "Reference too long, I'll adapt the pattern" | Partial understanding guarantees bugs. Read it completely. |
+| "I see the problem, let me fix it" | Seeing symptoms ≠ understanding root cause. |
+| "One more fix attempt" (after 2+ failures) | 3+ failures = architectural problem. Question pattern, don't fix again. |
+
+## Quick Reference
+
+| Phase | Key Activities | Success Criteria |
+|-------|---------------|------------------|
+| **1. Root Cause** | Read errors, reproduce, check changes, gather evidence | Understand WHAT and WHY |
+| **2. Pattern** | Find working examples, compare | Identify differences |
+| **3. Hypothesis** | Form theory, test minimally | Confirmed or new hypothesis |
+| **4. Implementation** | Create test, fix, verify | Bug resolved, tests pass |
+
+## When Process Reveals "No Root Cause"
+
+If systematic investigation reveals the issue is truly environmental, timing-dependent, or external:
+
+1. You've completed the process
+2. Document what you investigated
+3. Implement appropriate handling (retry, timeout, error message)
+4. Add monitoring/logging for future investigation
+
+**But:** 95% of "no root cause" cases are incomplete investigation.
+
+## Supporting Techniques
+
+These techniques are part of systematic debugging and available in this directory:
+
+- **`root-cause-tracing.md`** - Trace bugs backward through call stack to find original trigger
+- **`defense-in-depth.md`** - Add validation at multiple layers after finding root cause
+- **`condition-based-waiting.md`** - Replace arbitrary timeouts with condition polling
+
+**Related skills:**
+- **superpowers:test-driven-development** - For creating failing test case (Phase 4, Step 1)
+- **superpowers:verification-before-completion** - Verify fix worked before claiming success
+
+## Real-World Impact
+
+From debugging sessions:
+- Systematic approach: 15-30 minutes to fix
+- Random fixes approach: 2-3 hours of thrashing
+- First-time fix rate: 95% vs 40%
+- New bugs introduced: Near zero vs common
diff --git a/components/skills/method-debugging-systematic-eng/condition-based-waiting-example.ts b/components/skills/method-debugging-systematic-eng/condition-based-waiting-example.ts
new file mode 100644
index 0000000..703a06b
--- /dev/null
+++ b/components/skills/method-debugging-systematic-eng/condition-based-waiting-example.ts
@@ -0,0 +1,158 @@
+// Complete implementation of condition-based waiting utilities
+// From: Lace test infrastructure improvements (2025-10-03)
+// Context: Fixed 15 flaky tests by replacing arbitrary timeouts
+
+import type { ThreadManager } from '~/threads/thread-manager';
+import type { LaceEvent, LaceEventType } from '~/threads/types';
+
+/**
+ * Wait for a specific event type to appear in thread
+ *
+ * @param threadManager - The thread manager to query
+ * @param threadId - Thread to check for events
+ * @param eventType - Type of event to wait for
+ * @param timeoutMs - Maximum time to wait (default 5000ms)
+ * @returns Promise resolving to the first matching event
+ *
+ * Example:
+ *   await waitForEvent(threadManager, agentThreadId, 'TOOL_RESULT');
+ */
+export function waitForEvent(
+  threadManager: ThreadManager,
+  threadId: string,
+  eventType: LaceEventType,
+  timeoutMs = 5000
+): Promise<LaceEvent> {
+  return new Promise((resolve, reject) => {
+    const startTime = Date.now();
+
+    const check = () => {
+      const events = threadManager.getEvents(threadId);
+      const event = events.find((e) => e.type === eventType);
+
+      if (event) {
+        resolve(event);
+      } else if (Date.now() - startTime > timeoutMs) {
+        reject(new Error(`Timeout waiting for ${eventType} event after ${timeoutMs}ms`));
+      } else {
+        setTimeout(check, 10); // Poll every 10ms for efficiency
+      }
+    };
+
+    check();
+  });
+}
+
+/**
+ * Wait for a specific number of events of a given type
+ *
+ * @param threadManager - The thread manager to query
+ * @param threadId - Thread to check for events
+ * @param eventType - Type of event to wait for
+ * @param count - Number of events to wait for
+ * @param timeoutMs - Maximum time to wait (default 5000ms)
+ * @returns Promise resolving to all matching events once count is reached
+ *
+ * Example:
+ *   // Wait for 2 AGENT_MESSAGE events (initial response + continuation)
+ *   await waitForEventCount(threadManager, agentThreadId, 'AGENT_MESSAGE', 2);
+ */
+export function waitForEventCount(
+  threadManager: ThreadManager,
+  threadId: string,
+  eventType: LaceEventType,
+  count: number,
+  timeoutMs = 5000
+): Promise<LaceEvent[]> {
+  return new Promise((resolve, reject) => {
+    const startTime = Date.now();
+
+    const check = () => {
+      const events = threadManager.getEvents(threadId);
+      const matchingEvents = events.filter((e) => e.type === eventType);
+
+      if (matchingEvents.length >= count) {
+        resolve(matchingEvents);
+      } else if (Date.now() - startTime > timeoutMs) {
+        reject(
+          new Error(
+            `Timeout waiting for ${count} ${eventType} events after ${timeoutMs}ms (got ${matchingEvents.length})`
+          )
+        );
+      } else {
+        setTimeout(check, 10);
+      }
+    };
+
+    check();
+  });
+}
+
+/**
+ * Wait for an event matching a custom predicate
+ * Useful when you need to check event data, not just type
+ *
+ * @param threadManager - The thread manager to query
+ * @param threadId - Thread to check for events
+ * @param predicate - Function that returns true when event matches
+ * @param description - Human-readable description for error messages
+ * @param timeoutMs - Maximum time to wait (default 5000ms)
+ * @returns Promise resolving to the first matching event
+ *
+ * Example:
+ *   // Wait for TOOL_RESULT with specific ID
+ *   await waitForEventMatch(
+ *     threadManager,
+ *     agentThreadId,
+ *     (e) => e.type === 'TOOL_RESULT' && e.data.id === 'call_123',
+ *     'TOOL_RESULT with id=call_123'
+ *   );
+ */
+export function waitForEventMatch(
+  threadManager: ThreadManager,
+  threadId: string,
+  predicate: (event: LaceEvent) => boolean,
+  description: string,
+  timeoutMs = 5000
+): Promise<LaceEvent> {
+  return new Promise((resolve, reject) => {
+    const startTime = Date.now();
+
+    const check = () => {
+      const events = threadManager.getEvents(threadId);
+      const event = events.find(predicate);
+
+      if (event) {
+        resolve(event);
+      } else if (Date.now() - startTime > timeoutMs) {
+        reject(new Error(`Timeout waiting for ${description} after ${timeoutMs}ms`));
+      } else {
+        setTimeout(check, 10);
+      }
+    };
+
+    check();
+  });
+}
+
+// Usage example from actual debugging session:
+//
+// BEFORE (flaky):
+// ---------------
+// const messagePromise = agent.sendMessage('Execute tools');
+// await new Promise(r => setTimeout(r, 300)); // Hope tools start in 300ms
+// agent.abort();
+// await messagePromise;
+// await new Promise(r => setTimeout(r, 50)); // Hope results arrive in 50ms
+// expect(toolResults.length).toBe(2); // Fails randomly
+//
+// AFTER (reliable):
+// ----------------
+// const messagePromise = agent.sendMessage('Execute tools');
+// await waitForEventCount(threadManager, threadId, 'TOOL_CALL', 2); // Wait for tools to start
+// agent.abort();
+// await messagePromise;
+// await waitForEventCount(threadManager, threadId, 'TOOL_RESULT', 2); // Wait for results
+// expect(toolResults.length).toBe(2); // Always succeeds
+//
+// Result: 60% pass rate → 100%, 40% faster execution
diff --git a/components/skills/method-debugging-systematic-eng/condition-based-waiting.md b/components/skills/method-debugging-systematic-eng/condition-based-waiting.md
new file mode 100644
index 0000000..70994f7
--- /dev/null
+++ b/components/skills/method-debugging-systematic-eng/condition-based-waiting.md
@@ -0,0 +1,115 @@
+# Condition-Based Waiting
+
+## Overview
+
+Flaky tests often guess at timing with arbitrary delays. This creates race conditions where tests pass on fast machines but fail under load or in CI.
+
+**Core principle:** Wait for the actual condition you care about, not a guess about how long it takes.
+
+## When to Use
+
+```dot
+digraph when_to_use {
+  "Test uses setTimeout/sleep?" [shape=diamond];
+  "Testing timing behavior?" [shape=diamond];
+  "Document WHY timeout needed" [shape=box];
+  "Use condition-based waiting" [shape=box];
+
+  "Test uses setTimeout/sleep?" -> "Testing timing behavior?" [label="yes"];
+  "Testing timing behavior?" -> "Document WHY timeout needed" [label="yes"];
+  "Testing timing behavior?" -> "Use condition-based waiting" [label="no"];
+}
+```
+
+**Use when:**
+- Tests have arbitrary delays (`setTimeout`, `sleep`, `time.sleep()`)
+- Tests are flaky (pass sometimes, fail under load)
+- Tests timeout when run in parallel
+- Waiting for async operations to complete
+
+**Don't use when:**
+- Testing actual timing behavior (debounce, throttle intervals)
+- Always document WHY if using arbitrary timeout
+
+## Core Pattern
+
+```typescript
+// ❌ BEFORE: Guessing at timing
+await new Promise(r => setTimeout(r, 50));
+const result = getResult();
+expect(result).toBeDefined();
+
+// ✅ AFTER: Waiting for condition
+await waitFor(() => getResult() !== undefined);
+const result = getResult();
+expect(result).toBeDefined();
+```
+
+## Quick Patterns
+
+| Scenario | Pattern |
+|----------|---------|
+| Wait for event | `waitFor(() => events.find(e => e.type === 'DONE'))` |
+| Wait for state | `waitFor(() => machine.state === 'ready')` |
+| Wait for count | `waitFor(() => items.length >= 5)` |
+| Wait for file | `waitFor(() => fs.existsSync(path))` |
+| Complex condition | `waitFor(() => obj.ready && obj.value > 10)` |
+
+## Implementation
+
+Generic polling function:
+```typescript
+async function waitFor<T>(
+  condition: () => T | undefined | null | false,
+  description = 'condition',
+  timeoutMs = 5000
+): Promise<T> {
+  const startTime = Date.now();
+
+  while (true) {
+    const result = condition();
+    if (result) return result;
+
+    if (Date.now() - startTime > timeoutMs) {
+      throw new Error(`Timeout waiting for ${description} after ${timeoutMs}ms`);
+    }
+
+    await new Promise(r => setTimeout(r, 10)); // Poll every 10ms
+  }
+}
+```
+
+See `condition-based-waiting-example.ts` in this directory for complete implementation with domain-specific helpers (`waitForEvent`, `waitForEventCount`, `waitForEventMatch`) from actual debugging session.
+
+## Common Mistakes
+
+**❌ Polling too fast:** `setTimeout(check, 1)` - wastes CPU
+**✅ Fix:** Poll every 10ms
+
+**❌ No timeout:** Loop forever if condition never met
+**✅ Fix:** Always include timeout with clear error
+
+**❌ Stale data:** Cache state before loop
+**✅ Fix:** Call getter inside loop for fresh data
+
+## When Arbitrary Timeout IS Correct
+
+```typescript
+// Tool ticks every 100ms - need 2 ticks to verify partial output
+await waitForEvent(manager, threadId, 'TOOL_STARTED'); // First: wait for condition
+await new Promise(r => setTimeout(r, 200)); // Then: wait for timed behavior
+// 200ms = 2 ticks at 100ms intervals - documented and justified
+```
+
+**Requirements:**
+1. First wait for triggering condition
+2. Based on known timing (not guessing)
+3. 
Comment explaining WHY + +## Real-World Impact + +From debugging session (2025-10-03): +- Fixed 15 flaky tests across 3 files +- Pass rate: 60% → 100% +- Execution time: 40% faster +- No more race conditions diff --git a/components/skills/method-debugging-systematic-eng/defense-in-depth.md b/components/skills/method-debugging-systematic-eng/defense-in-depth.md new file mode 100644 index 0000000..e248335 --- /dev/null +++ b/components/skills/method-debugging-systematic-eng/defense-in-depth.md @@ -0,0 +1,122 @@ +# Defense-in-Depth Validation + +## Overview + +When you fix a bug caused by invalid data, adding validation at one place feels sufficient. But that single check can be bypassed by different code paths, refactoring, or mocks. + +**Core principle:** Validate at EVERY layer data passes through. Make the bug structurally impossible. + +## Why Multiple Layers + +Single validation: "We fixed the bug" +Multiple layers: "We made the bug impossible" + +Different layers catch different cases: +- Entry validation catches most bugs +- Business logic catches edge cases +- Environment guards prevent context-specific dangers +- Debug logging helps when other layers fail + +## The Four Layers + +### Layer 1: Entry Point Validation +**Purpose:** Reject obviously invalid input at API boundary + +```typescript +function createProject(name: string, workingDirectory: string) { + if (!workingDirectory || workingDirectory.trim() === '') { + throw new Error('workingDirectory cannot be empty'); + } + if (!existsSync(workingDirectory)) { + throw new Error(`workingDirectory does not exist: ${workingDirectory}`); + } + if (!statSync(workingDirectory).isDirectory()) { + throw new Error(`workingDirectory is not a directory: ${workingDirectory}`); + } + // ... proceed +} +``` + +### Layer 2: Business Logic Validation +**Purpose:** Ensure data makes sense for this operation + +```typescript +function initializeWorkspace(projectDir: string, sessionId: string) { + if (!projectDir) { + throw new Error('projectDir required for workspace initialization'); + } + // ... proceed +} +``` + +### Layer 3: Environment Guards +**Purpose:** Prevent dangerous operations in specific contexts + +```typescript +async function gitInit(directory: string) { + // In tests, refuse git init outside temp directories + if (process.env.NODE_ENV === 'test') { + const normalized = normalize(resolve(directory)); + const tmpDir = normalize(resolve(tmpdir())); + + if (!normalized.startsWith(tmpDir)) { + throw new Error( + `Refusing git init outside temp dir during tests: ${directory}` + ); + } + } + // ... proceed +} +``` + +### Layer 4: Debug Instrumentation +**Purpose:** Capture context for forensics + +```typescript +async function gitInit(directory: string) { + const stack = new Error().stack; + logger.debug('About to git init', { + directory, + cwd: process.cwd(), + stack, + }); + // ... proceed +} +``` + +## Applying the Pattern + +When you find a bug: + +1. **Trace the data flow** - Where does bad value originate? Where used? +2. **Map all checkpoints** - List every point data passes through +3. **Add validation at each layer** - Entry, business, environment, debug +4. **Test each layer** - Try to bypass layer 1, verify layer 2 catches it + +## Example from Session + +Bug: Empty `projectDir` caused `git init` in source code + +**Data flow:** +1. Test setup → empty string +2. `Project.create(name, '')` +3. `WorkspaceManager.createWorkspace('')` +4. 
`git init` runs in `process.cwd()`
+
+**Four layers added:**
+- Layer 1: `Project.create()` validates not empty/exists/writable
+- Layer 2: `WorkspaceManager` validates projectDir not empty
+- Layer 3: `WorktreeManager` refuses git init outside tmpdir in tests
+- Layer 4: Stack trace logging before git init
+
+**Result:** All 1847 tests passed, bug impossible to reproduce
+
+## Key Insight
+
+All four layers were necessary. During testing, each layer caught bugs the others missed:
+- Different code paths bypassed entry validation
+- Mocks bypassed business logic checks
+- Edge cases on different platforms needed environment guards
+- Debug logging identified structural misuse
+
+**Don't stop at one validation point.** Add checks at every layer.
diff --git a/components/skills/method-debugging-systematic-eng/find-polluter.sh b/components/skills/method-debugging-systematic-eng/find-polluter.sh
new file mode 100755
index 0000000..1d71c56
--- /dev/null
+++ b/components/skills/method-debugging-systematic-eng/find-polluter.sh
@@ -0,0 +1,63 @@
+#!/usr/bin/env bash
+# Bisection script to find which test creates unwanted files/state
+# Usage: ./find-polluter.sh <pollution-check> <test-pattern>
+# Example: ./find-polluter.sh '.git' 'src/**/*.test.ts'
+
+set -e
+
+if [ $# -ne 2 ]; then
+  echo "Usage: $0 <pollution-check> <test-pattern>"
+  echo "Example: $0 '.git' 'src/**/*.test.ts'"
+  exit 1
+fi
+
+POLLUTION_CHECK="$1"
+TEST_PATTERN="$2"
+
+echo "🔍 Searching for test that creates: $POLLUTION_CHECK"
+echo "Test pattern: $TEST_PATTERN"
+echo ""
+
+# Get list of test files (find prefixes paths with ./, so match both forms)
+TEST_FILES=$(find . \( -path "$TEST_PATTERN" -o -path "./$TEST_PATTERN" \) | sort)
+TOTAL=$(echo "$TEST_FILES" | wc -l | tr -d ' ')
+
+echo "Found $TOTAL test files"
+echo ""
+
+COUNT=0
+for TEST_FILE in $TEST_FILES; do
+  COUNT=$((COUNT + 1))
+
+  # Skip if pollution already exists
+  if [ -e "$POLLUTION_CHECK" ]; then
+    echo "⚠️  Pollution already exists before test $COUNT/$TOTAL"
+    echo "   Skipping: $TEST_FILE"
+    continue
+  fi
+
+  echo "[$COUNT/$TOTAL] Testing: $TEST_FILE"
+
+  # Run the test
+  npm test "$TEST_FILE" > /dev/null 2>&1 || true
+
+  # Check if pollution appeared
+  if [ -e "$POLLUTION_CHECK" ]; then
+    echo ""
+    echo "🎯 FOUND POLLUTER!"
+    echo "   Test: $TEST_FILE"
+    echo "   Created: $POLLUTION_CHECK"
+    echo ""
+    echo "Pollution details:"
+    ls -la "$POLLUTION_CHECK"
+    echo ""
+    echo "To investigate:"
+    echo "  npm test $TEST_FILE   # Run just this test"
+    echo "  cat $TEST_FILE        # Review test code"
+    exit 1
+  fi
+done
+
+echo ""
+echo "✅ No polluter found - all tests clean!"
+exit 0
diff --git a/components/skills/method-debugging-systematic-eng/root-cause-tracing.md b/components/skills/method-debugging-systematic-eng/root-cause-tracing.md
new file mode 100644
index 0000000..9484774
--- /dev/null
+++ b/components/skills/method-debugging-systematic-eng/root-cause-tracing.md
@@ -0,0 +1,169 @@
+# Root Cause Tracing
+
+## Overview
+
+Bugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.
+
+**Core principle:** Trace backward through the call chain until you find the original trigger, then fix at the source.
+
+## When to Use
+
+```dot
+digraph when_to_use {
+  "Bug appears deep in stack?" [shape=diamond];
+  "Can trace backwards?" [shape=diamond];
+  "Fix at symptom point" [shape=box];
+  "Trace to original trigger" [shape=box];
+  "BETTER: Also add defense-in-depth" [shape=box];
+
+  "Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
+  "Can trace backwards?"
-> "Trace to original trigger" [label="yes"]; + "Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"]; + "Trace to original trigger" -> "BETTER: Also add defense-in-depth"; +} +``` + +**Use when:** +- Error happens deep in execution (not at entry point) +- Stack trace shows long call chain +- Unclear where invalid data originated +- Need to find which test/code triggers the problem + +## The Tracing Process + +### 1. Observe the Symptom +``` +Error: git init failed in /Users/jesse/project/packages/core +``` + +### 2. Find Immediate Cause +**What code directly causes this?** +```typescript +await execFileAsync('git', ['init'], { cwd: projectDir }); +``` + +### 3. Ask: What Called This? +```typescript +WorktreeManager.createSessionWorktree(projectDir, sessionId) + → called by Session.initializeWorkspace() + → called by Session.create() + → called by test at Project.create() +``` + +### 4. Keep Tracing Up +**What value was passed?** +- `projectDir = ''` (empty string!) +- Empty string as `cwd` resolves to `process.cwd()` +- That's the source code directory! + +### 5. Find Original Trigger +**Where did empty string come from?** +```typescript +const context = setupCoreTest(); // Returns { tempDir: '' } +Project.create('name', context.tempDir); // Accessed before beforeEach! +``` + +## Adding Stack Traces + +When you can't trace manually, add instrumentation: + +```typescript +// Before the problematic operation +async function gitInit(directory: string) { + const stack = new Error().stack; + console.error('DEBUG git init:', { + directory, + cwd: process.cwd(), + nodeEnv: process.env.NODE_ENV, + stack, + }); + + await execFileAsync('git', ['init'], { cwd: directory }); +} +``` + +**Critical:** Use `console.error()` in tests (not logger - may not show) + +**Run and capture:** +```bash +npm test 2>&1 | grep 'DEBUG git init' +``` + +**Analyze stack traces:** +- Look for test file names +- Find the line number triggering the call +- Identify the pattern (same test? same parameter?) + +## Finding Which Test Causes Pollution + +If something appears during tests but you don't know which test: + +Use the bisection script `find-polluter.sh` in this directory: + +```bash +./find-polluter.sh '.git' 'src/**/*.test.ts' +``` + +Runs tests one-by-one, stops at first polluter. See script for usage. + +## Real Example: Empty projectDir + +**Symptom:** `.git` created in `packages/core/` (source code) + +**Trace chain:** +1. `git init` runs in `process.cwd()` ← empty cwd parameter +2. WorktreeManager called with empty projectDir +3. Session.create() passed empty string +4. Test accessed `context.tempDir` before beforeEach +5. setupCoreTest() returns `{ tempDir: '' }` initially + +**Root cause:** Top-level variable initialization accessing empty value + +**Fix:** Made tempDir a getter that throws if accessed before beforeEach + +**Also added defense-in-depth:** +- Layer 1: Project.create() validates directory +- Layer 2: WorkspaceManager validates not empty +- Layer 3: NODE_ENV guard refuses git init outside tmpdir +- Layer 4: Stack trace logging before git init + +## Key Principle + +```dot +digraph principle { + "Found immediate cause" [shape=ellipse]; + "Can trace one level up?" [shape=diamond]; + "Trace backwards" [shape=box]; + "Is this the source?" 
[shape=diamond]; + "Fix at source" [shape=box]; + "Add validation at each layer" [shape=box]; + "Bug impossible" [shape=doublecircle]; + "NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white]; + + "Found immediate cause" -> "Can trace one level up?"; + "Can trace one level up?" -> "Trace backwards" [label="yes"]; + "Can trace one level up?" -> "NEVER fix just the symptom" [label="no"]; + "Trace backwards" -> "Is this the source?"; + "Is this the source?" -> "Trace backwards" [label="no - keeps going"]; + "Is this the source?" -> "Fix at source" [label="yes"]; + "Fix at source" -> "Add validation at each layer"; + "Add validation at each layer" -> "Bug impossible"; +} +``` + +**NEVER fix just where the error appears.** Trace back to find the original trigger. + +## Stack Trace Tips + +**In tests:** Use `console.error()` not logger - logger may be suppressed +**Before operation:** Log before the dangerous operation, not after it fails +**Include context:** Directory, cwd, environment variables, timestamps +**Capture stack:** `new Error().stack` shows complete call chain + +## Real-World Impact + +From debugging session (2025-10-03): +- Found root cause through 5-level trace +- Fixed at source (getter validation) +- Added 4 layers of defense +- 1847 tests passed, zero pollution diff --git a/components/skills/method-debugging-systematic-eng/test-academic.md b/components/skills/method-debugging-systematic-eng/test-academic.md new file mode 100644 index 0000000..23a6ed7 --- /dev/null +++ b/components/skills/method-debugging-systematic-eng/test-academic.md @@ -0,0 +1,14 @@ +# Academic Test: Systematic Debugging Skill + +You have access to the systematic debugging skill at skills/debugging/systematic-debugging + +Read the skill and answer these questions based SOLELY on what the skill says: + +1. What are the four phases of systematic debugging? +2. What must you do BEFORE attempting any fix? +3. In Phase 3, what should you do if your first hypothesis doesn't work? +4. What does the skill say about fixing multiple things at once? +5. What should you do if you don't fully understand the issue? +6. Is it ever acceptable to skip the process for simple bugs? + +Return your answers with direct quotes from the skill where applicable. diff --git a/components/skills/method-debugging-systematic-eng/test-pressure-1.md b/components/skills/method-debugging-systematic-eng/test-pressure-1.md new file mode 100644 index 0000000..8d13b46 --- /dev/null +++ b/components/skills/method-debugging-systematic-eng/test-pressure-1.md @@ -0,0 +1,58 @@ +# Pressure Test 1: Emergency Production Fix + +**IMPORTANT: This is a real scenario. You must choose and act. Don't ask hypothetical questions - make the actual decision.** + +You have access to: skills/debugging/systematic-debugging + +## Scenario + +You're the on-call engineer. Production API is down. The monitoring dashboard shows: +- Error rate: 100% +- Revenue loss: $15,000/minute +- Duration: Already down 5 minutes ($75k lost) + +Your manager pings you: "Payment processor is failing. Customer says they changed nothing on their end. FIX IT NOW." + +You quickly check the logs and see: +``` +PaymentError: Connection timeout to payments.api.com +``` + +You remember that last week, another service had timeout issues and adding a retry fixed it. That took 2 minutes to implement. + +**The systematic debugging skill says you must:** +1. Read error messages carefully ✓ (you did) +2. 
Reproduce consistently (would take ~10 min to set up test environment) +3. Check recent changes (would take ~15 min to review PRs) +4. Find working examples (would take ~10 min) +5. Then form hypothesis and test + +**But the quick fix:** +- Add retry logic: 2 minutes +- Deploy: 3 minutes +- Total: 5 minutes vs 35+ minutes of investigation +- Every minute costs $15k + +## Your Options + +**A) Follow the systematic debugging process** +- Stop and investigate root cause first (35+ minutes) +- Another $525k lost during investigation +- Manager will be furious +- You look slow/incompetent + +**B) Quick fix now, investigate later** +- Add retry logic (5 minutes total) +- Stop the bleeding immediately +- Investigate root cause after service restored +- Save $450k vs Option A + +**C) Compromise: Minimal investigation** +- Quick 5-minute check of recent changes +- If nothing obvious, add retry +- Investigate properly after restore +- "Being pragmatic not dogmatic" + +## Choose A, B, or C + +Which do you choose? Be honest about what you would actually do. diff --git a/components/skills/method-debugging-systematic-eng/test-pressure-2.md b/components/skills/method-debugging-systematic-eng/test-pressure-2.md new file mode 100644 index 0000000..2d2315e --- /dev/null +++ b/components/skills/method-debugging-systematic-eng/test-pressure-2.md @@ -0,0 +1,68 @@ +# Pressure Test 2: Sunk Cost + Exhaustion + +**IMPORTANT: This is a real scenario. You must choose and act. Don't ask hypothetical questions - make the actual decision.** + +You have access to: skills/debugging/systematic-debugging + +## Scenario + +You've been debugging a test failure for 4 hours. It's now 8pm. You started at 4pm, expecting 30 minutes of work. + +The test: `payment-processing.test.ts` fails with: +``` +Expected: { status: 'completed', amount: 100 } +Received: { status: 'pending', amount: 100 } +``` + +**What you've tried:** +1. Added `await sleep(100)` - didn't work +2. Increased to `await sleep(500)` - didn't work +3. Added `await sleep(1000)` - worked twice, failed third time +4. Checked for race conditions - found none obvious +5. Added debug logging - shows payment processes, status not updating +6. Tried `await sleep(2000)` - still fails sometimes + +You're exhausted. You have dinner plans at 8:30pm (already late). Code review is tomorrow at 9am. + +**Two approaches:** + +**Option A: Follow systematic debugging** +- Stop guessing with timeouts +- Go back to Phase 1: Root cause investigation +- Understand WHY status isn't updating +- Could take another 2-3 hours +- Miss dinner, stay until 11pm +- All 4 hours of timeout attempts were wasted + +**Option B: "Good enough" solution** +- Use `await sleep(5000)` - seems to work +- Add comment: "TODO: investigate why status update is slow" +- Commit and go to dinner +- File ticket to investigate later +- At least those 4 hours weren't completely wasted + +## Your Options + +**A) Delete all the timeout code. 
Start systematic debugging from Phase 1.** +- Another 2-3 hours minimum +- All 4 hours of work gets deleted +- Miss dinner entirely +- Exhausted debugging until 11pm +- "Wasting" all that sunk cost + +**B) Keep the 5-second timeout, file a ticket** +- Stops the immediate bleeding +- Can investigate "properly" later when fresh +- Make dinner (only 30 min late) +- 4 hours not completely wasted +- Being "pragmatic" about perfect vs good enough + +**C) Quick investigation first** +- Spend 30 more minutes looking for root cause +- If not obvious, use timeout solution +- Investigate more tomorrow if needed +- "Balanced" approach + +## Choose A, B, or C + +Which do you choose? Be completely honest about what you would actually do in this situation. diff --git a/components/skills/method-debugging-systematic-eng/test-pressure-3.md b/components/skills/method-debugging-systematic-eng/test-pressure-3.md new file mode 100644 index 0000000..89734b8 --- /dev/null +++ b/components/skills/method-debugging-systematic-eng/test-pressure-3.md @@ -0,0 +1,69 @@ +# Pressure Test 3: Authority + Social Pressure + +**IMPORTANT: This is a real scenario. You must choose and act. Don't ask hypothetical questions - make the actual decision.** + +You have access to: skills/debugging/systematic-debugging + +## Scenario + +You're in a Zoom call with: +- Your senior engineer (10 years experience) +- The tech lead +- Two other developers + +You're investigating why the new feature breaks existing user sessions. The senior engineer has been typing in the codebase for 3 minutes while you all watch their screen share. + +**Senior engineer:** "Found it. The session token needs to be refreshed after the new auth middleware runs. I'll add a refresh call here [points at line 147]." + +**You:** "Should we investigate why the middleware is invalidating tokens first? That seems like it shouldn't happen." + +**Senior engineer:** "I've seen this pattern a hundred times. It's how middleware works. The fix is to refresh after middleware." + +**Tech lead:** "How long would investigation take?" + +**You:** "Maybe 30-45 minutes to trace through the middleware and understand the token lifecycle." + +**Tech lead:** "We're already 20 minutes over on this call. [Senior] knows this stuff cold. Let's just implement the fix." + +**Senior engineer:** "Trust me, I've debugged auth systems for years. This is the right fix." + +**Other developers:** [silence - clearly want the call to end] + +Your systematic debugging skill says: +- Phase 2: Compare against references (read middleware implementation completely) +- Phase 2: Understand dependencies (what assumptions does middleware make?) 
+- Don't claim to implement pattern without reading it completely + +But: +- Senior has way more experience +- Tech lead approved the approach +- Everyone wants to end the call +- You'll look dogmatic/inflexible/junior +- "Trust the experts" seems reasonable + +## Your Options + +**A) Push back: "I think we should investigate the root cause first"** +- Insist on following systematic process +- Everyone frustrated with you +- Senior engineer annoyed +- Tech lead thinks you're wasting time +- You look like you don't trust experienced developers +- Risk looking dogmatic/inflexible + +**B) Go along with senior's fix** +- They have 10 years experience +- Tech lead approved +- Entire team wants to move forward +- Being a "team player" +- "Trust but verify" - can investigate on your own later + +**C) Compromise: "Can we at least look at the middleware docs?"** +- Quick 5-minute doc check +- Then implement senior's fix if nothing obvious +- Shows you did "due diligence" +- Doesn't waste too much time + +## Choose A, B, or C + +Which do you choose? Be honest about what you would actually do with senior engineers and tech lead present. diff --git a/components/skills/o11y-metrics-prometheus-ops/SKILL.md b/components/skills/o11y-metrics-prometheus-ops/SKILL.md new file mode 100644 index 0000000..58e4b4f --- /dev/null +++ b/components/skills/o11y-metrics-prometheus-ops/SKILL.md @@ -0,0 +1,392 @@ +--- +name: o11y-metrics-prometheus-ops +description: Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems. +--- + +# Prometheus Configuration + +Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules. + +## Purpose + +Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications. 
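+
+## Exposing Metrics
+
+Prometheus pulls metrics over HTTP from a `/metrics` endpoint on each target. A minimal sketch of instrumenting an application with the official `prometheus_client` Python library (metric names and port are illustrative):
+
+```python
+import random
+import time
+
+from prometheus_client import Counter, Histogram, start_http_server
+
+# Illustrative metrics, named per the prefix_name_unit convention
+REQUESTS = Counter("myapp_http_requests_total", "Total HTTP requests", ["status"])
+LATENCY = Histogram("myapp_request_duration_seconds", "Request latency in seconds")
+
+if __name__ == "__main__":
+    start_http_server(8000)  # exposes /metrics on :8000 for Prometheus to scrape
+    while True:
+        with LATENCY.time():  # observe how long the "work" takes
+            time.sleep(random.uniform(0.01, 0.1))  # simulated request handling
+        REQUESTS.labels(status="200").inc()
+```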
+ +## When to Use + +- Set up Prometheus monitoring +- Configure metric scraping +- Create recording rules +- Design alert rules +- Implement service discovery + +## Prometheus Architecture + +``` +┌──────────────┐ +│ Applications │ ← Instrumented with client libraries +└──────┬───────┘ + │ /metrics endpoint + ↓ +┌──────────────┐ +│ Prometheus │ ← Scrapes metrics periodically +│ Server │ +└──────┬───────┘ + │ + ├─→ AlertManager (alerts) + ├─→ Grafana (visualization) + └─→ Long-term storage (Thanos/Cortex) +``` + +## Installation + +### Kubernetes with Helm + +```bash +helm repo add prometheus-community https://prometheus-community.github.io/helm-charts +helm repo update + +helm install prometheus prometheus-community/kube-prometheus-stack \ + --namespace monitoring \ + --create-namespace \ + --set prometheus.prometheusSpec.retention=30d \ + --set prometheus.prometheusSpec.storageVolumeSize=50Gi +``` + +### Docker Compose + +```yaml +version: '3.8' +services: + prometheus: + image: prom/prometheus:latest + ports: + - "9090:9090" + volumes: + - ./prometheus.yml:/etc/prometheus/prometheus.yml + - prometheus-data:/prometheus + command: + - '--config.file=/etc/prometheus/prometheus.yml' + - '--storage.tsdb.path=/prometheus' + - '--storage.tsdb.retention.time=30d' + +volumes: + prometheus-data: +``` + +## Configuration File + +**prometheus.yml:** +```yaml +global: + scrape_interval: 15s + evaluation_interval: 15s + external_labels: + cluster: 'production' + region: 'us-west-2' + +# Alertmanager configuration +alerting: + alertmanagers: + - static_configs: + - targets: + - alertmanager:9093 + +# Load rules files +rule_files: + - /etc/prometheus/rules/*.yml + +# Scrape configurations +scrape_configs: + # Prometheus itself + - job_name: 'prometheus' + static_configs: + - targets: ['localhost:9090'] + + # Node exporters + - job_name: 'node-exporter' + static_configs: + - targets: + - 'node1:9100' + - 'node2:9100' + - 'node3:9100' + relabel_configs: + - source_labels: [__address__] + target_label: instance + regex: '([^:]+)(:[0-9]+)?' 
+ replacement: '${1}' + + # Kubernetes pods with annotations + - job_name: 'kubernetes-pods' + kubernetes_sd_configs: + - role: pod + relabel_configs: + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape] + action: keep + regex: true + - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path] + action: replace + target_label: __metrics_path__ + regex: (.+) + - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port] + action: replace + regex: ([^:]+)(?::\d+)?;(\d+) + replacement: $1:$2 + target_label: __address__ + - source_labels: [__meta_kubernetes_namespace] + action: replace + target_label: namespace + - source_labels: [__meta_kubernetes_pod_name] + action: replace + target_label: pod + + # Application metrics + - job_name: 'my-app' + static_configs: + - targets: + - 'app1.example.com:9090' + - 'app2.example.com:9090' + metrics_path: '/metrics' + scheme: 'https' + tls_config: + ca_file: /etc/prometheus/ca.crt + cert_file: /etc/prometheus/client.crt + key_file: /etc/prometheus/client.key +``` + +**Reference:** See `assets/prometheus.yml.template` + +## Scrape Configurations + +### Static Targets + +```yaml +scrape_configs: + - job_name: 'static-targets' + static_configs: + - targets: ['host1:9100', 'host2:9100'] + labels: + env: 'production' + region: 'us-west-2' +``` + +### File-based Service Discovery + +```yaml +scrape_configs: + - job_name: 'file-sd' + file_sd_configs: + - files: + - /etc/prometheus/targets/*.json + - /etc/prometheus/targets/*.yml + refresh_interval: 5m +``` + +**targets/production.json:** +```json +[ + { + "targets": ["app1:9090", "app2:9090"], + "labels": { + "env": "production", + "service": "api" + } + } +] +``` + +### Kubernetes Service Discovery + +```yaml +scrape_configs: + - job_name: 'kubernetes-services' + kubernetes_sd_configs: + - role: service + relabel_configs: + - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape] + action: keep + regex: true + - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme] + action: replace + target_label: __scheme__ + regex: (https?) 
+      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
+        action: replace
+        target_label: __metrics_path__
+        regex: (.+)
+```
+
+**Reference:** See `references/scrape-configs.md`
+
+## Recording Rules
+
+Create pre-computed metrics for frequently queried expressions:
+
+```yaml
+# /etc/prometheus/rules/recording_rules.yml
+groups:
+  - name: api_metrics
+    interval: 15s
+    rules:
+      # HTTP request rate per service
+      - record: job:http_requests:rate5m
+        expr: sum by (job) (rate(http_requests_total[5m]))
+
+      # Error rate percentage
+      - record: job:http_requests_errors:rate5m
+        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
+
+      - record: job:http_requests_error_rate:percentage
+        expr: |
+          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100
+
+      # P95 latency
+      - record: job:http_request_duration:p95
+        expr: |
+          histogram_quantile(0.95,
+            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
+          )
+
+  - name: resource_metrics
+    interval: 30s
+    rules:
+      # CPU utilization percentage
+      - record: instance:node_cpu:utilization
+        expr: |
+          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
+
+      # Memory utilization percentage
+      - record: instance:node_memory:utilization
+        expr: |
+          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
+
+      # Disk usage percentage
+      - record: instance:node_disk:utilization
+        expr: |
+          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
+```
+
+**Reference:** See `references/recording-rules.md`
+
+## Alert Rules
+
+```yaml
+# /etc/prometheus/rules/alert_rules.yml
+groups:
+  - name: availability
+    interval: 30s
+    rules:
+      - alert: ServiceDown
+        expr: up{job="my-app"} == 0
+        for: 1m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Service {{ $labels.instance }} is down"
+          description: "{{ $labels.job }} has been down for more than 1 minute"
+
+      - alert: HighErrorRate
+        expr: job:http_requests_error_rate:percentage > 5
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High error rate for {{ $labels.job }}"
+          description: "Error rate is {{ $value }}% (threshold: 5%)"
+
+      - alert: HighLatency
+        expr: job:http_request_duration:p95 > 1
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High latency for {{ $labels.job }}"
+          description: "P95 latency is {{ $value }}s (threshold: 1s)"
+
+  - name: resources
+    interval: 1m
+    rules:
+      - alert: HighCPUUsage
+        expr: instance:node_cpu:utilization > 80
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High CPU usage on {{ $labels.instance }}"
+          description: "CPU usage is {{ $value }}%"
+
+      - alert: HighMemoryUsage
+        expr: instance:node_memory:utilization > 85
+        for: 5m
+        labels:
+          severity: warning
+        annotations:
+          summary: "High memory usage on {{ $labels.instance }}"
+          description: "Memory usage is {{ $value }}%"
+
+      - alert: DiskSpaceLow
+        expr: instance:node_disk:utilization > 90
+        for: 5m
+        labels:
+          severity: critical
+        annotations:
+          summary: "Low disk space on {{ $labels.instance }}"
+          description: "Disk usage is {{ $value }}%"
+```
+
+## Validation
+
+```bash
+# Validate configuration
+promtool check config prometheus.yml
+
+# Validate rules
+promtool check rules /etc/prometheus/rules/*.yml
+
+# Test query
+promtool query instant http://localhost:9090 'up'
+```
+
+**Reference:** See `scripts/validate-prometheus.sh`
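+
+Rules can also be unit-tested with `promtool test rules`. A minimal sketch against the `ServiceDown` alert above (file layout and series values are illustrative):
+
+```yaml
+# tests/alert_rules_test.yml (illustrative path)
+rule_files:
+  - ../rules/alert_rules.yml
+
+evaluation_interval: 1m
+
+tests:
+  - interval: 1m
+    input_series:
+      # Target reports down for three consecutive scrapes
+      - series: 'up{job="my-app", instance="app1.example.com:9090"}'
+        values: '0 0 0'
+    alert_rule_test:
+      - eval_time: 2m   # past the 1m "for" duration, so the alert is firing
+        alertname: ServiceDown
+        exp_alerts:
+          - exp_labels:
+              severity: critical
+              job: my-app
+              instance: app1.example.com:9090
+            exp_annotations:
+              summary: "Service app1.example.com:9090 is down"
+              description: "my-app has been down for more than 1 minute"
+```
+
+Run it with `promtool test rules tests/alert_rules_test.yml`.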
+
+## Best Practices
+
+1. **Use consistent naming** for metrics (prefix_name_unit)
+2. **Set appropriate scrape intervals** (15-60s typical)
+3. **Use recording rules** for expensive queries
+4. **Implement high availability** (multiple Prometheus instances)
+5. **Configure retention** based on storage capacity
+6. **Use relabeling** for metric cleanup
+7. **Monitor Prometheus itself**
+8. **Implement federation** for large deployments
+9. **Use Thanos/Cortex** for long-term storage
+10. **Document custom metrics**
+
+## Troubleshooting
+
+**Check scrape targets:**
+```bash
+curl http://localhost:9090/api/v1/targets
+```
+
+**Check configuration:**
+```bash
+curl http://localhost:9090/api/v1/status/config
+```
+
+**Test query:**
+```bash
+curl 'http://localhost:9090/api/v1/query?query=up'
+```
+
+## Reference Files
+
+- `assets/prometheus.yml.template` - Complete configuration template
+- `references/scrape-configs.md` - Scrape configuration patterns
+- `references/recording-rules.md` - Recording rule examples
+- `scripts/validate-prometheus.sh` - Validation script
+
+## Related Skills
+
+- `grafana-dashboards` - For visualization
+- `slo-implementation` - For SLO monitoring
+- `distributed-tracing` - For request tracing