
Conversation


@sbryngelson sbryngelson commented Dec 4, 2025

User description


Frontier CPU was broken because the templates were broken. A partial PR (#1055) recognized part of this, but there was no activity there, so I fixed it here.


PR Type

Bug fix, Enhancement


Description

  • Fix GPU condition checks in frontier template from boolean to string comparison

  • Update device type handling in submit.sh to conditionally load GPU or CPU modules

  • Add --no-gpu flag for non-GPU device tests in test.sh


Diagram Walkthrough

flowchart LR
  A["frontier.mako template"] -->|"Change gpu boolean to gpu != 'no' string check"| B["Fixed GPU conditions"]
  C["submit.sh script"] -->|"Conditionally load g or c module based on device"| D["Dynamic module loading"]
  E["test.sh script"] -->|"Add --no-gpu flag for CPU tests"| F["CPU test support"]
  B --> G["Proper GPU/CPU handling"]
  D --> G
  F --> G

File Walkthrough

Relevant files
Bug fix
frontier.mako
Fix GPU condition checks to use string comparison               

toolchain/templates/frontier.mako

  • Changed GPU condition checks from if gpu: to if gpu != 'no': for
    proper string comparison
  • Updated module loading logic to use 'g' if gpu != 'no' else 'c'
    instead of boolean check
  • Applied consistent GPU condition checks across SBATCH directives,
    environment variables, and srun commands
  • Fixes broken template logic that was causing Frontier CPU tests to
    fail; a sketch of the corrected pattern follows this entry
+4/-4     
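
A minimal sketch of the corrected pattern, pieced together from the snippets quoted elsewhere in this thread; it is an excerpt-style illustration, not the full frontier.mako, and surrounding directives are omitted:

```mako
## Sketch: GPU-only SBATCH directives and settings are gated on the string
## sentinel, and the module variant is chosen with an inline expression.
% if gpu != 'no':
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
% endif

. ./mfc.sh load -c f -m ${'g' if gpu != 'no' else 'c'}

% if gpu != 'no':
export MPICH_GPU_SUPPORT_ENABLED=1
% endif
```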
Enhancement
submit.sh
Conditionally load GPU or CPU modules                                       

.github/workflows/frontier/submit.sh

  • Modified module loading command to conditionally load GPU (g) or CPU
    (c) modules based on device parameter
  • Uses bash conditional [ "$2" = "gpu" ] && echo "g" || echo "c" to
    determine correct module
  • Enables proper CPU module loading when the device is not GPU; see the sketch below
+1/-1     
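
A short sketch of the selection logic in context. Apart from the quoted mfc.sh line, the surrounding details are assumptions: the job_device reuse and the quoting follow readability suggestions made further down in this thread, and are not the exact contents of submit.sh:

```bash
#!/bin/bash
# Sketch only: pick the Frontier module variant from the requested device.
job_device="$2"   # "gpu" for GPU jobs; anything else is treated as CPU

# Load the GPU ("g") module stack for GPU jobs and the CPU ("c") stack otherwise.
. ./mfc.sh load -c f -m "$([ "$job_device" = "gpu" ] && echo "g" || echo "c")"
```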
test.sh
Add no-gpu flag for CPU tests                                                       

.github/workflows/frontier/test.sh

  • Added --no-gpu flag to CPU device test command
  • Ensures non-GPU tests run without GPU-specific configurations
  • Maintains GPU-specific flags only for GPU device tests; a sketch of the branch follows
+1/-1     
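
A sketch of how the two test paths fit together. The two ./mfc.sh test invocations are quoted elsewhere in this thread; the branching variable name device is an assumption for illustration:

```bash
#!/bin/bash
# Sketch only: GPU runs keep the GPU-specific flags, CPU runs add --no-gpu.
if [ "$device" = "gpu" ]; then
    ./mfc.sh test -a --rdma-mpi --max-attempts 3 -j $ngpus $device_opts -- -c frontier
else
    ./mfc.sh test -a --max-attempts 3 -j 32 --no-gpu -- -c frontier
fi
```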


CodeAnt-AI Description

Prevent CPU runs from requesting GPUs; load correct module and add no-GPU test flag

What Changed

  • Job scripts and templates no longer request GPUs or enable GPU support when a run is marked CPU-only (gpu = 'no'); GPU SBATCH and srun options are only added for GPU runs
  • The submit step now loads the GPU module for GPU jobs and the CPU module for non-GPU jobs instead of always loading the GPU module
  • The test runner adds a --no-gpu flag for CPU test runs so tests do not reserve or try to use GPU resources

Impact

✅ No accidental GPU allocations for CPU jobs
✅ Correct module loaded for GPU vs CPU runs
✅ CPU tests run without reserving GPUs

💡 Usage Guide

Checking Your Pull Request

Every time you make a pull request, our system automatically looks through it. We check for security issues, mistakes in how you're setting up your infrastructure, and common code problems. We do this to make sure your changes are solid and won't cause any trouble later.

Talking to CodeAnt AI

Got a question or need a hand with something in your pull request? You can easily get in touch with CodeAnt AI right here. Just type the following in a comment on your pull request, and replace "Your question here" with whatever you want to ask:

@codeant-ai ask: Your question here

This lets you have a chat with CodeAnt AI about your pull request, making it easier to understand and improve your code.

Example

@codeant-ai ask: Can you suggest a safer alternative to storing this secret?

Preserve Org Learnings with CodeAnt

You can record team preferences so CodeAnt AI applies them in future reviews. Reply directly to the specific CodeAnt AI suggestion (in the same thread) and replace "Your feedback here" with your input:

@codeant-ai: Your feedback here

This helps CodeAnt AI learn and adapt to your team's coding style and standards.

Example

@codeant-ai: Do not flag unused imports.

Retrigger review

Ask CodeAnt AI to review the PR again, by typing:

@codeant-ai: review

Check Your Repository Health

To analyze the health of your code repository, visit our dashboard at https://app.codeant.ai. This tool helps you identify potential issues and areas for improvement in your codebase, ensuring your repository maintains high standards of code health.

Summary by CodeRabbit

Release Notes

  • Chores
    • Enhanced GPU configuration handling in Frontier workflows to explicitly distinguish between GPU and non-GPU execution modes during build and test operations.


Copilot AI review requested due to automatic review settings December 4, 2025 16:50

codeant-ai bot commented Dec 4, 2025

CodeAnt AI is reviewing your PR.



@qodo-code-review
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Issue

The conditional module selector uses $2 directly, while a variable job_device is set to $2 just above. Mixing sources may cause mismatch if call sites change; consider using job_device consistently to avoid subtle bugs.

. ./mfc.sh load -c f -m $([ "$2" = "gpu" ] && echo "g" || echo "c")
Logic Consistency

The template now treats any gpu value other than 'no' as GPU-enabled. Verify that all callers pass a recognized value ('no' for CPU-only, a GPU backend otherwise) and never empty/None; otherwise CPU jobs might incorrectly request GPUs. Consider normalizing gpu or using an explicit allowed set; a normalization sketch follows these focus areas.

% if gpu != 'no':
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
% endif
Flag Propagation

New --no-gpu flag is added only for CPU path; ensure mfc.sh recognizes this flag and it doesn’t conflict with other options (e.g., rdma/mpi defaults). Validate that CPU tests still select correct runner and resources.

    ./mfc.sh test -a --max-attempts 3 -j 32 --no-gpu -- -c frontier
fi
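
Following up on the Logic Consistency note above, a minimal sketch of the suggested normalization, assuming Mako syntax as used in frontier.mako; the gpu_enabled name is hypothetical and this refactor is not part of the PR:

```mako
## Hypothetical refactor: normalize the 'no' sentinel once, then reuse the flag.
<%
    gpu_enabled = gpu not in (None, '', 'no')
%>
% if gpu_enabled:
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=closest
% endif

. ./mfc.sh load -c f -m ${'g' if gpu_enabled else 'c'}

% if gpu_enabled:
export MPICH_GPU_SUPPORT_ENABLED=1
% endif
```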


coderabbitai bot commented Dec 4, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

These changes improve GPU handling in Frontier cluster setup scripts by replacing hardcoded GPU mode selection with dynamic conditionals and explicit "no-gpu" flags. GPU-related logic across submit scripts, test workflows, and job templates now consistently checks gpu != 'no' instead of relying on truthiness, ensuring uniform behavior.

Changes

Cohort / File(s) Change Summary
GPU Mode Selection Logic
.github/workflows/frontier/submit.sh
Replaced hardcoded "g" mode with dynamic conditional that selects "g" when second argument is "gpu", otherwise "c" in mfc.sh invocation
Test Workflow GPU Flag
.github/workflows/frontier/test.sh
Added --no-gpu flag to non-GPU execution path of mfc.sh test command
Template GPU Conditionals
toolchain/templates/frontier.mako
Changed four GPU-related conditional blocks from truthiness checks to explicit gpu != 'no' comparisons affecting SBATCH directives, mfc.sh load command, MPICH_GPU_SUPPORT_ENABLED flag, and interactive/srun gpu binding

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify the conditional logic in submit.sh correctly maps second argument values to "g"/"c" modes
  • Confirm --no-gpu flag behavior in test workflow aligns with intended non-GPU execution path
  • Check that explicit gpu != 'no' checks across template are semantically equivalent to previous truthiness logic and consistently applied

Poem

🐰 GPU flags now dance with grace,
No more hardcoded in their place,
Dynamic modes and checks so clear,
Frontier's logic draws more near,
When "no" means no—oh what a feat!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Title check: ✅ Passed. The title accurately describes the main change: updating GPU condition checks in frontier.mako from boolean checks to string comparisons.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Description check: ✅ Passed. The PR description includes a clear summary of changes, type of change (bug fix/enhancement), detailed file walkthrough with explanations, and a helpful mermaid diagram.


Comment @coderabbitai help to get the list of available commands and usage tips.

@codeant-ai codeant-ai bot added the size:XS This PR changes 0-9 lines, ignoring generated files label Dec 4, 2025
  #SBATCH --cpus-per-task=7
  #SBATCH -C nvme
- % if gpu:
+ % if gpu != 'no':

Suggestion: Refine the GPU check to if gpu and gpu != 'no': to prevent falsy values from incorrectly enabling GPU settings. [possible issue, importance: 7]

Suggested change
- % if gpu != 'no':
+ % if gpu and gpu != 'no':

Copilot finished reviewing on behalf of sbryngelson December 4, 2025 16:52

@cubic-dev-ai cubic-dev-ai bot left a comment


No issues found across 3 files


Copilot AI left a comment


Pull request overview

This PR fixes a critical bug in the Frontier template where CPU mode was broken due to incorrect GPU condition checks. The gpu variable can have values 'no', 'acc', or 'mp', but since 'no' is a non-empty string, the previous if gpu: conditionals incorrectly evaluated to true even when GPU support was disabled. The fix changes all GPU checks to explicitly compare against 'no'.
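
The truthiness pitfall is easy to reproduce in plain Python, which is what Mako evaluates inside % if expressions (a standalone illustration, not project code):

```python
gpu = 'no'  # CPU-only runs still carry the string sentinel 'no'

# Old-style check: any non-empty string is truthy, so the GPU branch
# was taken even for CPU jobs.
print(bool(gpu))      # True

# Fixed check: compare against the sentinel explicitly.
print(gpu != 'no')    # False, so GPU directives are skipped
```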

Key changes:

  • Updated frontier.mako template to use gpu != 'no' instead of if gpu: for all GPU-related conditionals
  • Added --no-gpu flag to CPU test invocations in test.sh
  • Modified submit.sh to dynamically select GPU/CPU mode based on device parameter

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
  • toolchain/templates/frontier.mako: Fixed GPU conditionals to properly check for the 'no' value instead of relying on truthiness
  • .github/workflows/frontier/test.sh: Added --no-gpu flag for CPU test runs to ensure proper configuration
  • .github/workflows/frontier/submit.sh: Changed from hardcoded GPU mode to dynamic GPU/CPU selection based on the device argument

  job_interface="$3"
- . ./mfc.sh load -c f -m g
+ . ./mfc.sh load -c f -m $([ "$2" = "gpu" ] && echo "g" || echo "c")

Suggestion: The command substitution in the added line is unescaped inside an unquoted here-document, so $([ "$2" = "gpu" ] ...) is evaluated on the submission host when building the heredoc instead of being evaluated on the compute node at job runtime; escape the substitution (or quote the heredoc) so evaluation happens in the intended context. [logic error]

Severity Level: Minor ⚠️

Suggested change
- . ./mfc.sh load -c f -m $([ "$2" = "gpu" ] && echo "g" || echo "c")
+ . ./mfc.sh load -c f -m \$([ "$2" = "gpu" ] && echo "g" || echo "c")
Why it matters? ⭐

Correct. The heredoc is unquoted, so the outer shell will perform command and variable expansion before sending the script to sbatch.
Escaping the $ (or quoting the heredoc delimiter) is the right fix if you want that substitution to occur on the compute node at job runtime.
The proposed improved_code (escape the $) achieves that with minimal risk.
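
A self-contained illustration of the expansion-timing difference described above; the file names are hypothetical and this is not the actual submit.sh heredoc:

```bash
#!/bin/bash
# With an unquoted delimiter the submitting shell expands $(...) and $vars
# while building the script; quoting the delimiter defers expansion until
# the generated script runs.
arg="gpu"

cat <<EOF > expands_at_submit_time.sh
echo "mode: $([ "$arg" = "gpu" ] && echo "g" || echo "c")"
EOF

cat <<'EOF' > expands_at_run_time.sh
echo "mode: $([ "$arg" = "gpu" ] && echo "g" || echo "c")"
EOF

# The first file already contains the literal text 'echo "mode: g"';
# the second still contains the unexpanded command substitution.
cat expands_at_submit_time.sh expands_at_run_time.sh
```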

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** .github/workflows/frontier/submit.sh
**Line:** 51:51
**Comment:**
	*Logic Error: The command substitution in the added line is unescaped inside an unquoted here-document, so $([ "$2" = "gpu" ] ...) is evaluated on the submission host when building the heredoc instead of being evaluated on the compute node at job runtime; escape the substitution (or quote the heredoc) so evaluation happens in the intended context.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.

  ./mfc.sh test -a --rdma-mpi --max-attempts 3 -j $ngpus $device_opts -- -c frontier
  else
- ./mfc.sh test -a --max-attempts 3 -j 32 -- -c frontier
+ ./mfc.sh test -a --max-attempts 3 -j 32 --no-gpu -- -c frontier

Suggestion: Hard-coded parallelism: using a fixed -j 32 ignores the actual number of available CPU cores and can lead to oversubscription or underutilization; replace the literal with a dynamic value such as $(nproc) to adapt to the runner's CPU count. [performance]

Severity Level: Minor ⚠️

Suggested change
- ./mfc.sh test -a --max-attempts 3 -j 32 --no-gpu -- -c frontier
+ ./mfc.sh test -a --max-attempts 3 -j "$(nproc)" --no-gpu -- -c frontier
Why it matters? ⭐

Hard-coding -j 32 is brittle across runner types. Using $(nproc) adapts to available CPU cores and is a safe, practical improvement for parallelism observable in this hunk.

Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** .github/workflows/frontier/test.sh
**Line:** 19:19
**Comment:**
	*Performance: Hard-coded parallelism: using a fixed `-j 32` ignores the actual number of available CPU cores and can lead to oversubscription or underutilization; replace the literal with a dynamic value such as `$(nproc)` to adapt to the runner's CPU count.

Validate the correctness of the flagged issue. If correct, How can I resolve this? If you propose a fix, implement it and please make it concise.


codeant-ai bot commented Dec 4, 2025

Nitpicks 🔍

🔒 No security issues identified
⚡ Recommended areas for review

  • Behavior Change
    The template changed checks from truthy if gpu: to string comparison if gpu != 'no'. If gpu is undefined, None, or an unexpected value, this now evaluates to True (enabling GPU behavior), whereas previously it would be False. This can silently enable GPUs by default; validate caller behavior and any defaults for gpu.

  • Here-doc expansion timing
    The command substitution in the added line is evaluated when the submitting script builds the here-doc (on the submit host), not on the compute node. Confirm this is the intended behaviour and that the computed value is safe and predictable at submit-time. If the intention was to evaluate on the compute node, the current approach is incorrect.

  • Repeated Logic
    The same gpu != 'no' check is repeated in multiple places. Repeating the expression increases risk of inconsistent updates and potential bugs. Consider computing a single normalized gpu_enabled value once and reuse it.

  • Hardcoded parallelism
    The CPU branch uses a hardcoded '-j 32' which may be suboptimal across different runners. Consider deriving the job parallelism from the number of available CPU cores (nproc) or making it configurable so tests don't overload small machines or underutilize large ones.

  • Flag compatibility
    The new CPU command passes '--no-gpu' to ./mfc.sh. Verify that the called script actually supports that flag and that its semantics match the intended behavior. If mfc.sh changes, this may silently do nothing or error.

  • Inconsistent flags between branches
    The GPU branch includes '--rdma-mpi' while the CPU branch does not. Confirm whether the absence is intentional. Inconsistencies in passed options may lead to different test paths and coverage.


codeant-ai bot commented Dec 4, 2025

CodeAnt AI finished reviewing your PR.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
.github/workflows/frontier/submit.sh (1)

51-51: Dynamic GPU/CPU mode selection looks correct; only minor readability nits.

Using $2 to choose -m g vs -m c matches the intended CPU/GPU split and stays consistent with the Frontier template. As a tiny readability improvement, you could reuse job_device here and/or quote the substitution, but functionally this is solid.

toolchain/templates/frontier.mako (1)

13-16: Consistent gpu != 'no' checks correctly gate all Frontier GPU behavior.

Gating SBATCH GPU directives, mfc.sh load -m g|c, MPICH_GPU_SUPPORT_ENABLED, and the interactive srun GPU flags on gpu != 'no' fixes the earlier truthiness issue for gpu='no' and keeps CPU jobs on a pure-CPU path. If this grows more complex later, you might consider normalizing gpu to a boolean or small enum upstream instead of relying on the 'no' sentinel, but as-is the changes are coherent and internally consistent.

Also applies to: 37-37, 42-46, 69-71

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ab9a1fd and 3d4ff7d.

📒 Files selected for processing (3)
  • .github/workflows/frontier/submit.sh (1 hunks)
  • .github/workflows/frontier/test.sh (1 hunks)
  • toolchain/templates/frontier.mako (3 hunks)
🧰 Additional context used
🧠 Learnings (9)
📓 Common learnings
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Ensure GPU code compiles with Cray ftn, NVIDIA nvfortran, GNU gfortran, and Intel ifx/ifort compilers
📚 Learning: 2025-11-24T21:50:16.713Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Ensure GPU code compiles with Cray ftn, NVIDIA nvfortran, GNU gfortran, and Intel ifx/ifort compilers

Applied to files:

  • .github/workflows/frontier/test.sh
  • toolchain/templates/frontier.mako
  • .github/workflows/frontier/submit.sh
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: After each successful build step, run focused tests using `./mfc.sh test -j $(nproc) -f EA8FA07E -t 9E2CA336` instead of running all tests

Applied to files:

  • .github/workflows/frontier/test.sh
📚 Learning: 2025-11-24T21:50:16.713Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Add and update focused tests for changes; ensure no regressions in golden test outputs without clear justification

Applied to files:

  • .github/workflows/frontier/test.sh
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: Compile with Cray `ftn` or NVIDIA `nvfortran` for GPU offloading; also build CPU-only with GNU `gfortran` and Intel `ifx`/`ifort` for portability

Applied to files:

  • .github/workflows/frontier/test.sh
📚 Learning: 2025-11-24T21:50:16.713Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Avoid stop/error stop inside GPU device code

Applied to files:

  • .github/workflows/frontier/test.sh
  • toolchain/templates/frontier.mako
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Mark GPU-callable helpers with `$:GPU_ROUTINE(function_name='...', parallelism='[seq]')` immediately after declaration

Applied to files:

  • toolchain/templates/frontier.mako
📚 Learning: 2025-11-24T21:50:46.909Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .cursor/rules/mfc-agent-rules.mdc:0-0
Timestamp: 2025-11-24T21:50:46.909Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Wrap tight loops with `$:GPU_PARALLEL_FOR(private='[...]', copy='[...]')` macro; add `collapse=n` for safe nested loop merging

Applied to files:

  • toolchain/templates/frontier.mako
📚 Learning: 2025-11-24T21:50:16.713Z
Learnt from: CR
Repo: MFlowCode/MFC PR: 0
File: .github/copilot-instructions.md:0-0
Timestamp: 2025-11-24T21:50:16.713Z
Learning: Applies to src/simulation/**/*.{fpp,f90} : Wrap tight GPU loops with !$acc parallel loop gang vector default(present) reduction(...); add collapse(n) when safe; declare loop-local variables with private(...)

Applied to files:

  • toolchain/templates/frontier.mako
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (21)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu)
  • GitHub Check: Georgia Tech | Phoenix (NVHPC) (gpu)
  • GitHub Check: Oak Ridge | Frontier (CCE) (gpu)
  • GitHub Check: Self Hosted (gpu, acc, frontier)
  • GitHub Check: Github (ubuntu, mpi, no-debug, true)
  • GitHub Check: Github (macos, mpi, debug, false)
  • GitHub Check: Self Hosted (cpu, none, gt)
  • GitHub Check: Github (macos, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, no-mpi, single, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, no-debug, false)
  • GitHub Check: Github (ubuntu, mpi, debug, true)
  • GitHub Check: Self Hosted (gpu, omp, gt)
  • GitHub Check: Github (ubuntu, mpi, debug, false)
  • GitHub Check: Self Hosted (cpu, none, frontier)
  • GitHub Check: Self Hosted (gpu, omp, frontier)
  • GitHub Check: Self Hosted (gpu, acc, gt)
  • GitHub Check: Coverage Test on CodeCov
  • GitHub Check: Code Cleanliness Check
  • GitHub Check: cubic · AI code reviewer
  • GitHub Check: Agent
  • GitHub Check: Build & Publish
🔇 Additional comments (1)
.github/workflows/frontier/test.sh (1)

19-19: Explicit --no-gpu flag in CPU branch aligns with the new GPU gating.

Passing --no-gpu here clearly keeps the non-GPU tests on a CPU-only path and matches the Frontier template’s semantics; change looks good.


codecov bot commented Dec 4, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.16%. Comparing base (ab9a1fd) to head (3d4ff7d).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1076   +/-   ##
=======================================
  Coverage   44.16%   44.16%           
=======================================
  Files          71       71           
  Lines       20197    20197           
  Branches     1970     1970           
=======================================
  Hits         8921     8921           
  Misses      10148    10148           
  Partials     1128     1128           

☔ View full report in Codecov by Sentry.

@sbryngelson sbryngelson merged commit 6c380cd into MFlowCode:master Dec 5, 2025
64 of 85 checks passed
@sbryngelson sbryngelson deleted the fix-templates branch December 5, 2025 14:21

Labels

Review effort 2/5 · size:XS (This PR changes 0-9 lines, ignoring generated files)
