
Commit 3d80c84

Adds GitHub issue templates and PR template
Establishes standardized templates for bug reports, feature requests, and performance issues to improve issue quality and streamline contributor workflow. Includes comprehensive sections for environment information, reproduction steps, and benchmarking details specific to Flash-DMA's CUDA-accelerated attention implementation.
1 parent 4edb7a8 commit 3d80c84

4 files changed: +229 −0 lines changed
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
---
name: Bug report
about: Create a report to help us improve Flash-DMA
title: '[BUG] '
labels: 'bug'
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Import flash_dmattn
2. Run the following code:
```python
# Paste your code here
```
3. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Environment Information**
Please run the following and paste the output:
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name() if torch.cuda.is_available() else \"None\"}')"
```

**System Information**
- OS: [e.g. Ubuntu 20.04, Windows 10, macOS 12]
- Python version: [e.g. 3.9.7]
- Flash-DMA version: [e.g. 0.1.0]
- CUDA Compute Capability: [e.g. 8.6]

**Error traceback**
If applicable, add the full error traceback:
```
Paste the full traceback here
```

**Additional context**
Add any other context about the problem here, including:
- Sequence lengths and batch sizes you're using
- Whether this works with standard PyTorch SDPA (see the parity sketch below)
- Any custom modifications to the code
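
For the SDPA comparison point above, a minimal parity sketch like the following can help. The Flash-DMA call is left as a commented placeholder, since the exact function name and signature depend on your installed version; only the PyTorch reference path is exercised as written:

```python
# Hypothetical parity check against the PyTorch SDPA baseline.
# The Flash-DMA call below is a placeholder -- substitute the actual
# function you are using before running the comparison.
import torch
import torch.nn.functional as F

q = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

reference = F.scaled_dot_product_attention(q, k, v)  # standard PyTorch path
# output = flash_dmattn_func(q, k, v)                # placeholder: your Flash-DMA call
# print(torch.max(torch.abs(output - reference)))    # max absolute deviation
```
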
Lines changed: 39 additions & 0 deletions
@@ -0,0 +1,39 @@
---
name: Feature request
about: Suggest an idea for Flash-DMA
title: '[FEATURE] '
labels: 'enhancement'
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Implementation details**
If you have thoughts on implementation:
- Would this require CUDA kernel changes?
- Does this affect the Python API?
- Are there performance implications?
- Any compatibility concerns with different GPU architectures?

**Use case**
Describe your specific use case:
- What sequence lengths are you working with?
- What is your target application (e.g., long document processing, code generation)?
- How would this feature improve your workflow?

**Additional context**
Add any other context or screenshots about the feature request here.

**Related work**
If this feature is inspired by a paper or existing implementation, please provide:
- Link to paper/implementation
- Brief explanation of the technique
- Why it would be valuable for Flash-DMA users
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
---
name: Performance issue
about: Report performance problems or optimization opportunities
title: '[PERFORMANCE] '
labels: 'performance'
assignees: ''

---

**Performance Issue Description**
Describe the performance problem you're experiencing.

**Current Performance**
Please provide benchmark results:
- Sequence length: [e.g., 4096, 8192, 16384]
- Batch size: [e.g., 1, 2, 4]
- Number of heads: [e.g., 16, 32]
- Head dimension: [e.g., 64, 128]
- Current speed: [e.g., 15.2 ms/iteration]
- Memory usage: [e.g., 8.5 GB]

**Expected Performance**
What performance would you expect, and why?
- Expected speed: [e.g., <10 ms/iteration]
- Comparison baseline: [e.g., PyTorch SDPA, Flash Attention]

**Environment Information**
Please run the following and paste the output:
```bash
python -c "import torch; print(f'PyTorch: {torch.__version__}'); print(f'CUDA: {torch.version.cuda}'); print(f'GPU: {torch.cuda.get_device_name() if torch.cuda.is_available() else \"None\"}')"
```

**Benchmark Code**
Provide the code you used for benchmarking:
```python
# Paste your benchmark code here
```
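
If you don't already have a harness, a minimal timing sketch along these lines works; the shapes are illustrative, and it times the PyTorch SDPA baseline, so swap in your Flash-DMA call to measure the same configuration:

```python
# Minimal CUDA-event timing sketch; times the PyTorch SDPA baseline.
# Swap in your Flash-DMA call to benchmark the same shapes.
import torch
import torch.nn.functional as F

batch, heads, seqlen, dim = 2, 16, 4096, 64  # match the numbers you report above
q = torch.randn(batch, heads, seqlen, dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

for _ in range(10):  # warm-up so one-time initialization doesn't skew the timing
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 100
start.record()
for _ in range(iters):
    F.scaled_dot_product_attention(q, k, v)
end.record()
torch.cuda.synchronize()

print(f"{start.elapsed_time(end) / iters:.2f} ms/iteration")
print(f"{torch.cuda.max_memory_allocated() / 2**30:.2f} GB peak memory")
```
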
**Profiling Information**
If you have profiling data (from nsys, nvprof, or the PyTorch profiler), please include relevant excerpts.
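
If you haven't profiled before, a short sketch with the PyTorch profiler looks roughly like this; it again wraps the SDPA baseline, so place your Flash-DMA call inside the `profile` block instead:

```python
# Sketch: collect a kernel-level profile with the PyTorch profiler.
# Replace the SDPA baseline with your Flash-DMA call to profile it.
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity

q = torch.randn(1, 16, 4096, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    F.scaled_dot_product_attention(q, k, v)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```
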
**System Information**
- GPU model and memory: [e.g., RTX 4090 24GB]
- CUDA Compute Capability: [e.g., 8.9]
- CPU: [e.g., Intel i9-12900K]
- RAM: [e.g., 32GB DDR4]

**Additional Context**
- Is this a regression from a previous version?
- Have you tried different batch sizes or sequence lengths?
- Any specific attention patterns (causal, full, custom masks)?

.github/pull_request_template.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
# Pull Request Template

## Description
Please provide a clear and concise description of your changes.

## Type of Change
Please check the relevant option(s):

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [ ] Performance optimization
- [ ] CUDA kernel improvement
- [ ] Code refactoring

## Related Issues
Please link any related issues:
- Fixes #(issue number)
- Related to #(issue number)

## Changes Made
Please describe the changes you made:

### Code Changes
- [ ] Modified Python API
- [ ] Updated CUDA kernels
- [ ] Changed build system
- [ ] Updated dependencies

### Documentation
- [ ] Updated README
- [ ] Updated API documentation
- [ ] Added examples
- [ ] Updated benchmarks

## Testing
Please describe the tests you ran to verify your changes:

- [ ] Existing tests pass: `python -m pytest tests/ -v`
- [ ] Added new tests for new functionality
- [ ] Benchmarks show no performance regression
- [ ] Tested on multiple GPU architectures (if applicable)

### Test Configuration
- OS: [e.g., Ubuntu 20.04]
- Python: [e.g., 3.9.7]
- PyTorch: [e.g., 2.1.0]
- CUDA: [e.g., 11.8]
- GPU: [e.g., RTX 4090]

## Performance Impact
If this change affects performance, please provide benchmarks:

### Before
```
# Benchmark results before your changes
```

### After
```
# Benchmark results after your changes
```

## Breaking Changes
If this PR introduces breaking changes, please describe:
- What breaks
- How users can migrate their code
- Why the breaking change is necessary

## Checklist
Please check all that apply:

- [ ] My code follows the project's style guidelines
- [ ] I have performed a self-review of my own code
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] My changes generate no new warnings
- [ ] I have added tests that prove my fix is effective or that my feature works
- [ ] New and existing unit tests pass locally with my changes
- [ ] Any dependent changes have been merged and published

### CUDA-specific (if applicable)
- [ ] CUDA kernels compile without warnings
- [ ] Tested on SM 8.0+ architectures
- [ ] Memory usage has been profiled
- [ ] No memory leaks detected

## Additional Notes
Any additional information that reviewers should know:

## Screenshots (if applicable)
If your changes include visual elements or performance improvements, please add screenshots or graphs.