-
Notifications
You must be signed in to change notification settings - Fork 125
Update GPU condition checks in frontier.mako #1076
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
CodeAnt AI is reviewing your PR. Thanks for using CodeAnt! 🎉We're free for open-source projects. if you're enjoying it, help us grow by sharing. Share on X · |
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughThese changes improve GPU handling in Frontier cluster setup scripts by replacing hardcoded GPU mode selection with dynamic conditionals and explicit "no-gpu" flags. GPU-related logic across submit scripts, test workflows, and job templates now consistently checks Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| #SBATCH --cpus-per-task=7 | ||
| #SBATCH -C nvme | ||
| % if gpu: | ||
| % if gpu != 'no': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Refine the GPU check to if gpu and gpu != 'no': to prevent falsy values from incorrectly enabling GPU settings. [possible issue, importance: 7]
| % if gpu != 'no': | |
| % if gpu and gpu != 'no': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No issues found across 3 files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a critical bug in the Frontier template where CPU mode was broken due to incorrect GPU condition checks. The gpu variable can have values 'no', 'acc', or 'mp', but since 'no' is a non-empty string, the previous if gpu: conditionals incorrectly evaluated to true even when GPU support was disabled. The fix changes all GPU checks to explicitly compare against 'no'.
Key changes:
- Updated frontier.mako template to use
gpu != 'no'instead ofif gpu:for all GPU-related conditionals - Added
--no-gpuflag to CPU test invocations in test.sh - Modified submit.sh to dynamically select GPU/CPU mode based on device parameter
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| toolchain/templates/frontier.mako | Fixed GPU conditionals to properly check for 'no' value instead of relying on truthiness |
| .github/workflows/frontier/test.sh | Added --no-gpu flag for CPU test runs to ensure proper configuration |
| .github/workflows/frontier/submit.sh | Changed from hardcoded GPU mode to dynamic GPU/CPU selection based on device argument |
| job_interface="$3" | ||
| . ./mfc.sh load -c f -m g | ||
| . ./mfc.sh load -c f -m $([ "$2" = "gpu" ] && echo "g" || echo "c") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: The command substitution in the added line is unescaped inside an unquoted here-document, so $([ "$2" = "gpu" ] ...) is evaluated on the submission host when building the heredoc instead of being evaluated on the compute node at job runtime; escape the substitution (or quote the heredoc) so evaluation happens in the intended context. [logic error]
Severity Level: Minor
| . ./mfc.sh load -c f -m $([ "$2" = "gpu" ] && echo "g" || echo "c") | |
| . ./mfc.sh load -c f -m \$([ "$2" = "gpu" ] && echo "g" || echo "c") |
Why it matters? ⭐
Correct. The heredoc is unquoted, so the outer shell will perform command and variable expansion before sending the script to sbatch.
Escaping the $ (or quoting the heredoc delimiter) is the right fix if you want that substitution to occur on the compute node at job runtime.
The proposed improved_code (escape the $) achieves that with minimal risk.
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** .github/workflows/frontier/submit.sh
**Line:** 51:51
**Comment:**
*Logic Error: The command substitution in the added line is unescaped inside an unquoted here-document, so $([ "$2" = "gpu" ] ...) is evaluated on the submission host when building the heredoc instead of being evaluated on the compute node at job runtime; escape the substitution (or quote the heredoc) so evaluation happens in the intended context.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.| ./mfc.sh test -a --rdma-mpi --max-attempts 3 -j $ngpus $device_opts -- -c frontier | ||
| else | ||
| ./mfc.sh test -a --max-attempts 3 -j 32 -- -c frontier | ||
| ./mfc.sh test -a --max-attempts 3 -j 32 --no-gpu -- -c frontier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Hard-coded parallelism: using a fixed -j 32 ignores the actual number of available CPU cores and can lead to oversubscription or underutilization; replace the literal with a dynamic value such as $(nproc) to adapt to the runner's CPU count. [performance]
Severity Level: Minor
| ./mfc.sh test -a --max-attempts 3 -j 32 --no-gpu -- -c frontier | |
| ./mfc.sh test -a --max-attempts 3 -j "$(nproc)" --no-gpu -- -c frontier |
Why it matters? ⭐
Hard-coding -j 32 is brittle across runner types. Using $(nproc) adapts to available CPU cores and is a safe, practical improvement for parallelism observable in this hunk.
Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** .github/workflows/frontier/test.sh
**Line:** 19:19
**Comment:**
*Performance: Hard-coded parallelism: using a fixed `-j 32` ignores the actual number of available CPU cores and can lead to oversubscription or underutilization; replace the literal with a dynamic value such as `$(nproc)` to adapt to the runner's CPU count.
Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.
Nitpicks 🔍
|
|
CodeAnt AI finished reviewing your PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
.github/workflows/frontier/submit.sh (1)
51-51: Dynamic GPU/CPU mode selection looks correct; only minor readability nits.Using
$2to choose-m gvs-m cmatches the intended CPU/GPU split and stays consistent with the Frontier template. As a tiny readability improvement, you could reusejob_devicehere and/or quote the substitution, but functionally this is solid.toolchain/templates/frontier.mako (1)
13-16: Consistentgpu != 'no'checks correctly gate all Frontier GPU behavior.Gating SBATCH GPU directives,
mfc.sh load -m g|c,MPICH_GPU_SUPPORT_ENABLED, and the interactivesrunGPU flags ongpu != 'no'fixes the earlier truthiness issue forgpu='no'and keeps CPU jobs on a pure-CPU path. If this grows more complex later, you might consider normalizinggputo a boolean or small enum upstream instead of relying on the'no'sentinel, but as-is the changes are coherent and internally consistent.Also applies to: 37-37, 42-46, 69-71
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
.github/workflows/frontier/submit.sh(1 hunks).github/workflows/frontier/test.sh(1 hunks)toolchain/templates/frontier.mako(3 hunks)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Ensure GPU code compiles with Cray ftn, NVIDIA nvfortran, GNU gfortran, and Intel ifx/ifort compilers
📚 Learning: 2025-11-24T21:50:16.713Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Ensure GPU code compiles with Cray ftn, NVIDIA nvfortran, GNU gfortran, and Intel ifx/ifort compilers
Applied to files:
.github/workflows/frontier/test.shtoolchain/templates/frontier.mako.github/workflows/frontier/submit.sh
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: After each successful build step, run focused tests using `./mfc.sh test -j $(nproc) -f EA8FA07E -t 9E2CA336` instead of running all tests
Applied to files:
.github/workflows/frontier/test.sh
📚 Learning: 2025-11-24T21:50:16.713Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Add and update focused tests for changes; ensure no regressions in golden test outputs without clear justification
Applied to files:
.github/workflows/frontier/test.sh
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: Compile with Cray `ftn` or NVIDIA `nvfortran` for GPU offloading; also build CPU-only with GNU `gfortran` and Intel `ifx`/`ifort` for portability
Applied to files:
.github/workflows/frontier/test.sh
📚 Learning: 2025-11-24T21:50:16.713Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Avoid stop/error stop inside GPU device code
Applied to files:
.github/workflows/frontier/test.shtoolchain/templates/frontier.mako
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Mark GPU-callable helpers with `$:GPU_ROUTINE(function_name='...', parallelism='[seq]')` immediately after declaration
Applied to files:
toolchain/templates/frontier.mako
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Wrap tight loops with `$:GPU_PARALLEL_FOR(private='[...]', copy='[...]')` macro; add `collapse=n` for safe nested loop merging
Applied to files:
toolchain/templates/frontier.mako
📚 Learning: 2025-11-24T21:50:16.713Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Wrap tight GPU loops with !$acc parallel loop gang vector default(present) reduction(...); add collapse(n) when safe; declare loop-local variables with private(...)
Applied to files:
toolchain/templates/frontier.mako
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
- GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu)
- GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu)
- GitHub Check: Oak Ridge | Frontier (CCE) (gpu)
- GitHub Check: Self Hosted (gpu, acc, frontier)
- GitHub Check: Github (ubuntu, mpi, no-debug, true)
- GitHub Check: Github (macos, mpi, debug, false)
- GitHub Check: Self Hosted (cpu, none, gt)
- GitHub Check: Github (macos, mpi, no-debug, false)
- GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
- GitHub Check: Github (ubuntu, mpi, no-debug, false)
- GitHub Check: Github (ubuntu, mpi, debug, true)
- GitHub Check: Self Hosted (gpu, omp, gt)
- GitHub Check: Github (ubuntu, mpi, debug, false)
- GitHub Check: Self Hosted (cpu, none, frontier)
- GitHub Check: Self Hosted (gpu, omp, frontier)
- GitHub Check: Self Hosted (gpu, acc, gt)
- GitHub Check: Coverage Test on CodeCov
- GitHub Check: Code Cleanliness Check
- GitHub Check: cubic · AI code reviewer
- GitHub Check: Agent
- GitHub Check: Build & Publish
🔇 Additional comments (1)
.github/workflows/frontier/test.sh (1)
19-19: Explicit--no-gpuflag in CPU branch aligns with the new GPU gating.Passing
--no-gpuhere clearly keeps the non-GPU tests on a CPU-only path and matches the Frontier template’s semantics; change looks good.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1076 +/- ##
=======================================
Coverage 44.16% 44.16%
=======================================
Files 71 71
Lines 20197 20197
Branches 1970 1970
=======================================
Hits 8921 8921
Misses 10148 10148
Partials 1128 1128 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
User description
User description
Frontier CPU was broken because the templates are broken. Partial PR was here that i think recognized part of this #1055 but no activity there so i fixed it maybe
PR Type
Bug fix, Enhancement
Description
Fix GPU condition checks in frontier template from boolean to string comparison
Update device type handling in submit.sh to conditionally load GPU or CPU modules
Add --no-gpu flag for non-GPU device tests in test.sh
Diagram Walkthrough
File Walkthrough
frontier.mako
Fix GPU condition checks to use string comparisontoolchain/templates/frontier.mako
if gpu:toif gpu != 'no':forproper string comparison
'g' if gpu != 'no' else 'c'instead of boolean check
environment variables, and srun commands
fail
submit.sh
Conditionally load GPU or CPU modules.github/workflows/frontier/submit.sh
(c) modules based on device parameter
[ "$2" = "gpu" ] && echo "g" || echo "c"todetermine correct module
test.sh
Add no-gpu flag for CPU tests.github/workflows/frontier/test.sh
--no-gpuflag to CPU device test commandCodeAnt-AI Description
Prevent CPU runs from requesting GPUs; load correct module and add no-GPU test flag
What Changed
Impact
✅ No accidental GPU allocations for CPU jobs✅ Correct module loaded for GPU vs CPU runs✅ CPU tests run without reserving GPUs💡 Usage Guide
Checking Your Pull Request
Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.
Talking to CodeAnt AI
Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:
This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.
Example
Preserve Org Learnings with CodeAnt
You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:
This helps CodeAnt AI learn and adapt to your team's coding style and standards.
Example
Retrigger review
Ask CodeAnt AI to review the PR again, by typing:
Check Your Repository Health
To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.
Summary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.