
Commit 3d80c84

Adds GitHub issue templates and PR template
Establishes standardized templates for bug reports, feature requests, and performance issues to improve issue quality and streamline contributor workflow. Includes comprehensive sections for environment information, reproduction steps, and benchmarking details specific to Flash-DMA's CUDA-accelerated attention implementation.
1 parent 4edb7a8 commit 3d80c84

4 files changed: +229 −0 lines changed
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
---
name: Bug report
about: Create a report to help us improve Flash-DMA
title: '[BUG] '
labels: 'bug'
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Import flash_dmattn
2. Run the following code:
```python
# Paste your code here
```
3. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Environment Information**
Please run the following and paste the output:
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name() if torch.cuda.is_available() else \"None\"}')"
```

**System Information**
- OS: [e.g. Ubuntu 20.04, Windows 10, macOS 12]
- Python version: [e.g. 3.9.7]
- Flash-DMA version: [e.g. 0.1.0]
- CUDA Compute Capability: [e.g. 8.6]

**Error traceback**
If applicable, add the full error traceback:
```
Paste the full traceback here
```

**Additional context**
Add any other context about the problem here, including:
- Sequence lengths and batch sizes you're using
- Whether this works with standard PyTorch SDPA (see the parity sketch below)
- Any custom modifications to the code
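
For the SDPA comparison point above, a minimal parity sketch like the following can help. The Flash-DMA call is left as a commented placeholder, since the exact function name and signature depend on your installed version; only the PyTorch reference path is exercised as written:

```python
# Hypothetical parity check against the PyTorch SDPA baseline.
# The Flash-DMA call below is a placeholder -- substitute the actual
# function you are using before running the comparison.
import torch
import torch.nn.functional as F

q = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

reference = F.scaled_dot_product_attention(q, k, v)  # standard PyTorch path
# output = flash_dmattn_func(q, k, v)                # placeholder: your Flash-DMA call
# print(torch.max(torch.abs(output - reference)))    # max absolute deviation
```
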
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
---
name: Feature request
about: Suggest an idea for Flash-DMA
title: '[FEATURE] '
labels: 'enhancement'
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Implementation details**
If you have thoughts on implementation:
- Would this require CUDA kernel changes?
- Does this affect the Python API?
- Are there performance implications?
- Any compatibility concerns with different GPU architectures?

**Use case**
Describe your specific use case:
- What sequence lengths are you working with?
- What is your target application (e.g., long document processing, code generation)?
- How would this feature improve your workflow?

**Additional context**
Add any other context or screenshots about the feature request here.

**Related work**
If this feature is inspired by a paper or existing implementation, please provide:
- Link to paper/implementation
- Brief explanation of the technique
- Why it would be valuable for Flash-DMA users
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
name: Performance issue
about: Report performance problems or optimization opportunities
title: '[PERFORMANCE] '
labels: 'performance'
assignees: ''

---

**Performance Issue Description**
Describe the performance problem you're experiencing.

**Current Performance**
Please provide benchmark results:
- Sequence length: [e.g., 4096, 8192, 16384]
- Batch size: [e.g., 1, 2, 4]
- Number of heads: [e.g., 16, 32]
- Head dimension: [e.g., 64, 128]
- Current speed: [e.g., 15.2 ms/iteration]
- Memory usage: [e.g., 8.5 GB]

**Expected Performance**
What performance would you expect, and why?
- Expected speed: [e.g., <10 ms/iteration]
- Comparison baseline: [e.g., PyTorch SDPA, Flash Attention]

**Environment Information**
Please run the following and paste the output:
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name() if torch.cuda.is_available() else \"None\"}')"
```

**Benchmark Code**
Provide the code you used for benchmarking:
```python
# Paste your benchmark code here
```
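
If you don't already have a harness, a minimal timing sketch along these lines works; the shapes are illustrative, and it times the PyTorch SDPA baseline, so swap in your Flash-DMA call to measure the same configuration:

```python
# Minimal CUDA-event timing sketch; times the PyTorch SDPA baseline.
# Swap in your Flash-DMA call to benchmark the same shapes.
import torch
import torch.nn.functional as F

batch, heads, seqlen, dim = 2, 16, 4096, 64  # match the numbers you report above
q = torch.randn(batch, heads, seqlen, dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

for _ in range(10):  # warm-up so one-time initialization doesn't skew the timing
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    F.scaled_dot_product_attention(q, k, v)
end.record()
torch.cuda.synchronize()

print(f"{start.elapsed_time(end) / iters:.2f} ms/iteration")
print(f"{torch.cuda.max_memory_allocated() / 2**30:.2f} GB peak memory")
```
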
**Profiling Information**
If you have profiling data (from nsys, nvprof, or the PyTorch profiler), please include relevant excerpts.
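
If you haven't profiled before, a short sketch with the PyTorch profiler looks roughly like this; it again wraps the SDPA baseline, so place your Flash-DMA call inside the `profile` block instead:

```python
# Sketch: collect a kernel-level profile with the PyTorch profiler.
# Replace the SDPA baseline with your Flash-DMA call to profile it.
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

q = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    F.scaled_dot_product_attention(q, k, v)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
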
**System Information**
- GPU model and memory: [e.g., RTX 4090 24GB]
- CUDA Compute Capability: [e.g., 8.9]
- CPU: [e.g., Intel i9-12900K]
- RAM: [e.g., 32GB DDR4]

**Additional Context**
- Is this a regression from a previous version?
- Have you tried different batch sizes or sequence lengths?
- Any specific attention patterns (causal, full, custom masks)?

.github/pull_request_template.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Pull Request Template

## Description
Please provide a clear and concise description of your changes.

## Type of Change
Please check the relevant option(s):

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Performance optimization
- [ ] CUDA kernel improvement
- [ ] Code refactoring

## Related Issues
Please link any related issues:
- Fixes #(issue number)
- Related to #(issue number)

## Changes Made
Please describe the changes you made:

### Code Changes
- [ ] Modified Python API
- [ ] Updated CUDA kernels
- [ ] Changed build system
- [ ] Updated dependencies

### Documentation
- [ ] Updated README
- [ ] Updated API documentation
- [ ] Added examples
- [ ] Updated benchmarks

## Testing
Please describe the tests you ran to verify your changes:

- [ ] Existing tests pass: `python -m pytest tests/ -v`
- [ ] Added new tests for new functionality
- [ ] Benchmarks show no performance regression
- [ ] Tested on multiple GPU architectures (if applicable)

### Test Configuration
- OS: [e.g., Ubuntu 20.04]
- Python: [e.g., 3.9.7]
- PyTorch: [e.g., 2.1.0]
- CUDA: [e.g., 11.8]
- GPU: [e.g., RTX 4090]

## Performance Impact
If this change affects performance, please provide benchmarks:

### Before
```
# Benchmark results before your changes
```

### After
```
# Benchmark results after your changes
```

## Breaking Changes
If this PR introduces breaking changes, please describe:
- What breaks
- How users can migrate their code
- Why the breaking change is necessary

## Checklist
Please check all that apply:

- [ ] My code follows the project's style guidelines
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] Any dependent changes have been merged and published

### CUDA-specific (if applicable)
- [ ] CUDA kernels compile without warnings
- [ ] Tested on SM 8.0+ architectures
- [ ] Memory usage has been profiled
- [ ] No memory leaks detected

## Additional Notes
Any additional information that reviewers should know:

## Screenshots (if applicable)
If your changes include visual elements or performance improvements, please add screenshots or graphs.