
Conversation

@cademirch (Contributor) commented Jun 6, 2025

Based on some of the work in https://github.com/harvardinformatics/snakemake-executor-plugin-cannon, I've put together this PR to add dynamic partition selection based on job resource requirements. Relevant discussion in #106

This works by having the user provide a YAML file, via the option --slurm-partition-config, that specifies their cluster's partitions and the resource limits of each. The expected structure of the file is simple:

partitions:
  some_partition:
    max_runtime: 100
  another_partition:
    ...

I realize adding another config file isn't ideal, given that workflow-specific configs, workflow profiles, and global profiles already exist. But this approach avoids the complexity of having to determine cluster configurations through SLURM commands, which can vary wildly (as discussed in #106). It also sidesteps the need for complex expressions in set-resources to determine partitions, which can get unwieldy for big workflows with lots of rules. Ideally, users can craft a partitions.yml once and set it in their global profile.

I'm not super familiar with SLURM or partition resource limits, so I came up with a list based on Snakemake's standard resources and SLURM's resources:

The following limits can be defined for each partition:

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| max_runtime | int | Maximum walltime in minutes | unlimited |
| max_mem_mb | int | Maximum total memory in MB | unlimited |
| max_mem_mb_per_cpu | int | Maximum memory per CPU in MB | unlimited |
| max_cpus_per_task | int | Maximum CPUs per task | unlimited |
| max_nodes | int | Maximum number of nodes | unlimited |
| max_tasks | int | Maximum number of tasks | unlimited |
| max_tasks_per_node | int | Maximum tasks per node | unlimited |
| max_gpu | int | Maximum number of GPUs | 0 |
| available_gpu_models | list[str] | List of available GPU models | none |
| max_cpus_per_gpu | int | Maximum CPUs per GPU | unlimited |
| supports_mpi | bool | Whether MPI jobs are supported | true |
| max_mpi_tasks | int | Maximum MPI tasks | unlimited |
| available_constraints | list[str] | List of available node constraints | none |

It would also be possible to support arbitrary resources by pattern matching on "max_{resource}", though I haven't implemented that in this PR.
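
To make the schema concrete, here is a sketch of what a fuller partitions.yml could look like and how it might be read with PyYAML. The partition names mirror the standard/highmem/gpu examples in the docs; all limit values and GPU models below are made up for illustration:

```python
import yaml  # PyYAML

# Illustrative only: the keys follow the limits table above,
# but the partition names, numbers, and GPU models are invented.
PARTITIONS_YAML = """
partitions:
  standard:
    max_runtime: 4320            # minutes
    max_mem_mb: 256000
    max_cpus_per_task: 64
  highmem:
    max_runtime: 1440
    max_mem_mb: 2000000
    max_cpus_per_task: 32
  gpu:
    max_runtime: 2880
    max_gpu: 4
    available_gpu_models: [a100, v100]
    max_cpus_per_gpu: 16
"""

config = yaml.safe_load(PARTITIONS_YAML)
for name, limits in config["partitions"].items():
    print(name, limits)
```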

To pick the "best" partition for a job, I went with a naive scoring approach that sums the ratios of requested resources to partition limits. Higher scores indicate better resource utilization; for example, a job requesting 8 CPUs would prefer a 16-CPU partition (score 0.5) over a 64-CPU partition (score 0.125). Partitions that cannot satisfy a job's requirements are not considered.
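
The idea in a minimal, self-contained sketch (not the PR's actual implementation; the resource names and dict-based interface are just for illustration):

```python
import math

def score_job_fit(requested: dict, limits: dict) -> float | None:
    """Naive fit score: sum of requested/limit ratios.

    Returns None if any requested resource exceeds its limit (the partition
    cannot run the job); otherwise a higher score means a tighter fit.
    """
    score = 0.0
    for resource, amount in requested.items():
        limit = limits.get(resource, math.inf)
        if amount > limit:
            return None  # partition cannot satisfy this job
        if not math.isinf(limit) and limit > 0:
            score += amount / limit
    return score

# The example from above: an 8-CPU job prefers the 16-CPU partition.
job = {"cpus": 8}
print(score_job_fit(job, {"cpus": 16}))  # 0.5
print(score_job_fit(job, {"cpus": 64}))  # 0.125
```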

This feature is opt-in and respects the slurm_partition job resource, as well as the existing fallback partition logic.
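
Roughly, the selection order looks like this (a simplified sketch, CPUs only; the real plugin works on Snakemake job objects and the full set of limits):

```python
import math

def choose_partition(job_resources: dict, partitions: dict | None,
                     default_partition: str | None) -> str | None:
    """Sketch of the selection precedence: explicit > auto-selected > default."""
    # 1. An explicitly set slurm_partition always wins.
    explicit = job_resources.get("slurm_partition")
    if explicit:
        return explicit
    # 2. Otherwise, if a partition config was provided, pick the best-scoring fit.
    if partitions:
        cpus = job_resources.get("cpus_per_task", 1)
        scores = {}
        for name, limits in partitions.items():
            limit = limits.get("max_cpus_per_task", math.inf)
            if cpus <= limit:
                scores[name] = 0.0 if math.isinf(limit) else cpus / limit
        if scores:
            return max(scores, key=scores.get)
    # 3. Fall back to the existing default-partition behaviour.
    return default_partition

print(choose_partition(
    {"cpus_per_task": 8},
    {"small": {"max_cpus_per_task": 16}, "big": {"max_cpus_per_task": 64}},
    "default",
))  # -> "small"
```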

@cmeesters, @johanneskoester, let me know what you think of this approach! I'm not particularly experienced with SLURM, so the decisions here (limits, partition specs, etc.) are based on my limited experience and the available docs; feedback is much appreciated.

Summary by CodeRabbit

  • New Features

    • Automatic per-job SLURM partition selection that respects explicit job partitions, supports YAML-driven multi-partition configs (via env var or CLI/file), scores compatible partitions, and falls back to the default when none match.
  • Documentation

    • Added docs detailing the partition YAML schema, example configs (standard/highmem/gpu), scoring rules, selection precedence, and fallback behavior.
  • Tests

    • Expanded test coverage for parsing, scoring/selection across CPU/GPU/MPI, constraints, multi-cluster scenarios, and error cases.
  • CI

    • Test run expanded to execute the full test suite.


@cademirch requested a review from johanneskoester on June 6, 2025 22:23.
@coderabbitai bot (Contributor) commented Jun 6, 2025

Walkthrough

Adds YAML-driven automatic SLURM partition selection: new partition parsing/scoring module, Executor wiring to load/select partitions at init/submission (env var or YAML), docs, and tests. Jobs with explicit slurm_partition bypass auto-selection.

Changes

| Cohort / File(s) | Change Summary |
| --- | --- |
| Documentation: docs/further.md | Adds "Automatic Partition Selection" docs: YAML schema, examples (standard/highmem/gpu), scoring rules, selection and fallback behavior. |
| Partition Management Module: snakemake_executor_plugin_slurm/partitions.py | New module: PartitionLimits and Partition dataclasses, YAML reader (read_partition_file), GPU/CPU parsers (parse_gpu_requirements, get_job_cpu_requirement, get_effective_threads), partition scoring (Partition.score_job_fit) and selector (get_best_partition). Validation and WorkflowError handling added. |
| Executor Integration: snakemake_executor_plugin_slurm/__init__.py | Adds public partition_config: Optional[Path]; initializes _partitions from env var or YAML in Executor.__post_init__; registers atexit cleanup; defers partition injection from run params and updates get_partition_arg to prefer explicit slurm_partition, then query get_best_partition; preserves cached fallback behavior. |
| Tests & CI: tests/test_partition_selection.py, tests/tests.py, .github/workflows/ci.yml | Adds comprehensive partition parsing/selection tests and fixtures; imports new validation in tests; CI updated to run the full tests/ directory. |

Sequence Diagram(s)

sequenceDiagram
    participant Job as Job Submission
    participant Exec as Executor
    participant Parts as Partitions Module
    participant SLURM as SLURM

    Job->>Exec: run_job(job)
    Exec->>Exec: assemble job params/resources
    alt explicit `slurm_partition` in job
        Exec->>Exec: use explicit partition (skip auto-select)
    else partitions configured (env or YAML)
        Exec->>Parts: get_best_partition(candidate_partitions, job, logger)
        Parts->>Parts: score_job_fit(job) per partition
        Parts-->>Exec: best_partition_name or None
        alt match found
            Exec->>Exec: set partition arg = best_partition_name
        else no match
            Exec->>Exec: fallback to default/cached partition logic
        end
    else no partitions configured
        Exec->>Exec: use cached/default partition behavior
    end
    Exec->>SLURM: submit job with selected partition

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Focus review on:
    • snakemake_executor_plugin_slurm/partitions.py — scoring math, GPU/GRES parsing and error paths, CPU requirement derivation.
    • snakemake_executor_plugin_slurm/__init__.py — init ordering, env vs YAML precedence, atexit hook, and interaction with existing fallback logic.
    • tests/test_partition_selection.py — ensure test assumptions match scoring/fallback semantics.


Suggested reviewers

  • johanneskoester
  • dlaehnemann

Poem

🐇 I nibble YAML leaves at dawn,

I count GPUs, threads, and memory drawn,
I score each partition, pick the best,
I hop to SLURM and tuck jobs to rest,
Happy hops — the pipeline's on! 🥕

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title 'feat!: proposal for dynamic partition selection' directly and specifically describes the main change: introducing a new opt-in feature for automatic SLURM partition selection based on job resource requirements. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 88.68%, which is sufficient. The required threshold is 80.00%. |


@cademirch requested a review from cmeesters on June 6, 2025 22:23.
@cademirch changed the title from "feat proposal for automatic partition selection" to "feat: proposal for automatic partition selection" on Jun 6, 2025.
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

🧹 Nitpick comments (3)
snakemake_executor_plugin_slurm/partitions.py (3)

44-53: Use appropriate logging levels.

The function uses logger.warning for informational messages about partition selection. These are normal operational messages, not warnings.

-        logger.warning(
+        logger.info(
             f"Auto-selected partition '{partition}' for job {job.name} "
             f"with score {best_score:.3f}"
         )
         return partition
     else:
         logger.warning(
             f"No suitable partition found for job {job.name} based on "
             f"resource requirements. Falling back to default behavior."
         )

248-250: Simplify nested if statements.

As suggested by static analysis, these nested if statements can be combined for better readability.

-            if gpu_model and self.limits.available_gpu_models:
-                if gpu_model not in self.limits.available_gpu_models:
-                    return None
+            if gpu_model and self.limits.available_gpu_models and gpu_model not in self.limits.available_gpu_models:
+                return None
🧰 Tools
🪛 Ruff (0.11.9)

248-249: Use a single if statement instead of nested if statements

(SIM102)


184-223: Consider documenting and refining the scoring algorithm.

The current scoring algorithm simply sums the ratios of requested resources to limits, giving equal weight to all resources. This may not reflect real-world partition selection priorities.

Consider:

  1. Adding weights to different resources based on their importance
  2. Documenting the scoring rationale in the method docstring
  3. Making the scoring configurable through the partition configuration

For example, memory and GPU availability might be more critical than other factors in partition selection.
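
For instance, a weighted variant could look like the following; the weights and resource names are purely illustrative and not part of this PR:

```python
import math

# Hypothetical weights: memory and GPUs count more heavily towards the fit score.
WEIGHTS = {"mem_mb": 2.0, "gpu": 2.0, "cpus_per_task": 1.0, "runtime": 0.5}

def weighted_score(requested: dict, limits: dict) -> float | None:
    """Weighted ratio-sum score; None means the partition cannot fit the job."""
    score = 0.0
    for resource, amount in requested.items():
        limit = limits.get(f"max_{resource}", math.inf)
        if amount > limit:
            return None
        if not math.isinf(limit) and limit > 0:
            score += WEIGHTS.get(resource, 1.0) * (amount / limit)
    return score

print(weighted_score({"cpus_per_task": 8, "mem_mb": 64000},
                     {"max_cpus_per_task": 64, "max_mem_mb": 256000}))  # 0.625
```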

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 184-184: Too many return statements (8/6)

(R0911)


[refactor] 184-184: Too many branches (21/12)

(R0912)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec82a70 and f5511ff.

📒 Files selected for processing (5)
  • docs/further.md (1 hunks)
  • snakemake_executor_plugin_slurm/__init__.py (6 hunks)
  • snakemake_executor_plugin_slurm/partitions.py (1 hunks)
  • snakemake_executor_plugin_slurm/submit_string.py (0 hunks)
  • tests/tests.py (2 hunks)
💤 Files with no reviewable changes (1)
  • snakemake_executor_plugin_slurm/submit_string.py
🧰 Additional context used
🧬 Code Graph Analysis (2)
snakemake_executor_plugin_slurm/__init__.py (1)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-27)
  • get_best_partition (30-54)
tests/tests.py (4)
snakemake_executor_plugin_slurm/__init__.py (1)
  • ExecutorSettings (36-119)
snakemake_executor_plugin_slurm/utils.py (1)
  • set_gres_string (51-123)
snakemake_executor_plugin_slurm/submit_string.py (1)
  • get_submit_command (5-69)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-27)
  • get_best_partition (30-54)
🪛 Pylint (3.3.7)
tests/tests.py

[refactor] 485-485: Too many public methods (28/20)

(R0904)

snakemake_executor_plugin_slurm/partitions.py

[refactor] 41-54: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 78-88: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 85-88: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 115-144: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 93-93: Too many branches (14/12)

(R0912)


[refactor] 150-150: Too many instance attributes (13/7)

(R0902)


[refactor] 184-184: Too many return statements (8/6)

(R0911)


[refactor] 184-184: Too many branches (21/12)

(R0912)

🪛 Ruff (0.11.9)
snakemake_executor_plugin_slurm/partitions.py

248-249: Use a single if statement instead of nested if statements

(SIM102)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (4)
docs/further.md (1)

67-139: Well-documented feature with clear examples!

The documentation for the automatic partition selection feature is comprehensive and well-structured. The table of configurable limits, example YAML configuration, and explanation of the scoring algorithm provide users with all the necessary information to effectively use this feature.

tests/tests.py (1)

485-974: Excellent test coverage for partition selection feature!

The test suite comprehensively covers all aspects of the partition selection feature:

  • Various partition configuration scenarios (minimal, comprehensive, invalid)
  • Edge cases (empty partitions, missing keys, invalid YAML)
  • Different job resource requirements (CPU, GPU, MPI, constraints)
  • Proper error handling and logging verification

The use of pytest fixtures and parametrization makes the tests maintainable and easy to extend.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 485-485: Too many public methods (28/20)

(R0904)

snakemake_executor_plugin_slurm/__init__.py (1)

639-652: Well-implemented partition selection priority logic!

The enhanced get_partition_arg method correctly implements the partition selection priority:

  1. Explicit partition specification (via slurm_partition resource)
  2. Automatic selection based on loaded partition config
  3. Fallback to default partition behavior

This maintains backward compatibility while adding the new automatic selection feature.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 649-652: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

snakemake_executor_plugin_slurm/partitions.py (1)

149-175: Well-designed dataclass for partition limits.

The PartitionLimits dataclass appropriately models SLURM partition resource constraints with sensible defaults (infinity for unlimited, 0 for GPUs).
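
For orientation, a dataclass along these lines might look roughly as follows; the field names mirror the limits table in the PR description and the defaults follow the "infinity for unlimited, 0 for GPUs" convention, but this is a sketch, not the PR's exact definition:

```python
from dataclasses import dataclass
from math import inf
from typing import List, Optional

@dataclass
class PartitionLimits:
    """Sketch of per-partition resource limits (illustrative, not the PR code)."""
    # Standard numerical limits: inf means "unlimited".
    max_runtime: float = inf          # minutes
    max_mem_mb: float = inf
    max_mem_mb_per_cpu: float = inf
    max_cpus_per_task: float = inf
    max_nodes: float = inf
    max_tasks: float = inf
    max_tasks_per_node: float = inf
    # GPU limits: partitions without GPUs default to 0 GPUs.
    max_gpu: int = 0
    available_gpu_models: Optional[List[str]] = None
    max_cpus_per_gpu: float = inf
    # MPI and node constraints.
    supports_mpi: bool = True
    max_mpi_tasks: float = inf
    available_constraints: Optional[List[str]] = None
```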

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 150-150: Too many instance attributes (13/7)

(R0902)

@cademirch changed the title from "feat: proposal for automatic partition selection" to "feat: proposal for dynamic partition selection" on Jun 7, 2025.
@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
snakemake_executor_plugin_slurm/partitions.py (3)

13-27: Add error handling for file operations and YAML parsing.

The function lacks error handling for several potential failures as previously noted.


81-89: Add error handling for gres GPU parsing.

The gres parsing logic needs error handling for integer conversion failures and format validation as previously noted.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 85-88: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


125-126: Consider raising an error for negative CPU values.

When cpus_per_task is negative, raising an error would better signal the configuration issue.

🧹 Nitpick comments (2)
snakemake_executor_plugin_slurm/partitions.py (2)

248-250: Simplify nested if statements.

Combine the nested conditions for better readability.

-            if gpu_model and self.limits.available_gpu_models:
-                if gpu_model not in self.limits.available_gpu_models:
-                    return None
+            if gpu_model and self.limits.available_gpu_models and gpu_model not in self.limits.available_gpu_models:
+                return None
🧰 Tools
🪛 Ruff (0.11.9)

248-249: Use a single if statement instead of nested if statements

(SIM102)


190-223: The resource ratio scoring approach promotes efficient partition utilization.

The scoring algorithm effectively:

  • Favors partitions that closely match job requirements
  • Prevents small jobs from monopolizing large partitions
  • Handles unlimited resources gracefully with isinf checks

This should lead to better overall cluster utilization.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f5511ff and 01a94c6.

📒 Files selected for processing (3)
  • snakemake_executor_plugin_slurm/__init__.py (7 hunks)
  • snakemake_executor_plugin_slurm/partitions.py (1 hunks)
  • tests/tests.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • snakemake_executor_plugin_slurm/__init__.py
🧰 Additional context used
🪛 Pylint (3.3.7)
tests/tests.py

[refactor] 486-486: Too many public methods (28/20)

(R0904)

snakemake_executor_plugin_slurm/partitions.py

[refactor] 41-54: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 78-88: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 85-88: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 115-144: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 93-93: Too many branches (14/12)

(R0912)


[refactor] 150-150: Too many instance attributes (13/7)

(R0902)


[refactor] 184-184: Too many return statements (8/6)

(R0911)


[refactor] 184-184: Too many branches (21/12)

(R0912)

🪛 Ruff (0.11.9)
snakemake_executor_plugin_slurm/partitions.py

248-249: Use a single if statement instead of nested if statements

(SIM102)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (4)
tests/tests.py (2)

6-9: LGTM! Appropriate imports for partition testing functionality.

The new imports support the comprehensive test suite for the partition selection feature.

Also applies to: 13-16


486-976: Excellent test coverage for the partition selection feature!

The test suite comprehensively covers:

  • Various partition configurations (basic, minimal, comprehensive)
  • Different job types (CPU-only, GPU, MPI)
  • Edge cases and error conditions
  • Partition scoring and selection logic

The tests ensure robust validation of the new automatic partition selection functionality.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 486-486: Too many public methods (28/20)

(R0904)

snakemake_executor_plugin_slurm/partitions.py (2)

30-55: Well-implemented partition selection logic!

The function correctly:

  • Filters out incompatible partitions
  • Selects the best-scoring partition
  • Provides informative logging for both success and failure cases
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 41-54: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


149-175: Well-structured partition limits dataclass!

The dataclass appropriately:

  • Uses inf for unlimited numerical resources
  • Defaults max_gpu to 0 for non-GPU partitions
  • Organizes resources by type (standard, SLURM-specific, GPU, MPI, constraints)
🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 150-150: Too many instance attributes (13/7)

(R0902)

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
tests/tests.py (1)

547-550: Improve test case for missing required fields.

The current test uses an empty partition name "" but doesn't truly test missing required fields. Consider testing with a partition that completely lacks the name field.

 @pytest.fixture
 def missing_name_config(self):
     """Configuration with missing name field."""
-    return {"partitions": {"": {}}}  # Empty partition name
+    return {"partitions": {None: {}}}  # None as key to test missing name handling
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 01a94c6 and bb8c471.

📒 Files selected for processing (1)
  • tests/tests.py (2 hunks)
🧰 Additional context used
🪛 Pylint (3.3.7)
tests/tests.py

[refactor] 486-486: Too many public methods (28/20)

(R0904)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: testing
🔇 Additional comments (6)
tests/tests.py (6)

6-8: LGTM! Necessary imports for partition testing.

The added imports for tempfile, yaml, and Path are appropriate for creating temporary configuration files and testing the YAML parsing functionality.


13-16: LGTM! Proper imports for partition functionality.

The imports for read_partition_file and get_best_partition are correctly importing the new partition selection functions being tested.


486-486: Comprehensive test coverage is appropriate despite static analysis warning.

The pylint warning about too many public methods (28/20) is a false positive for test classes. Comprehensive test suites naturally require many test methods to cover all scenarios, and this level of coverage is beneficial for the new partition selection feature.

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 486-486: Too many public methods (28/20)

(R0904)


557-567: LGTM! Well-designed helper fixture for temporary files.

The temp_yaml_file fixture provides a clean way to create temporary YAML files for testing. The implementation properly handles file creation and returns the path for cleanup.


569-709: Excellent coverage of file reading scenarios.

The tests comprehensively cover various file reading scenarios including:

  • Valid partition configurations with different complexity levels
  • Invalid YAML syntax
  • Missing files and required keys
  • Proper cleanup of temporary files

The error handling validation ensures the partition parsing is robust.


734-970: Comprehensive partition selection test coverage.

The partition selection tests effectively validate the core functionality including:

  • CPU and GPU job matching
  • MPI support validation
  • Constraint compatibility checking
  • Resource limit enforcement
  • Edge cases and error scenarios

The tests properly verify both successful partition selection and appropriate handling when no suitable partition exists.

@cmeesters (Member) commented:

@cademirch Hoi Cade,

this is most impressive and looks like a major step forward! Alas, I will be at a conference for the next week and this PR really needs to be thoroughly tested. So, please allow me a bit more time than usual.

On the plus side: I will add a '!' to the title, as this might be worth a major release.

@cmeesters changed the title from "feat: proposal for dynamic partition selection" to "feat!: proposal for dynamic partition selection" on Jun 8, 2025.
@cademirch (Contributor, Author) commented Jun 8, 2025

> @cademirch Hoi Cade,
>
> this is most impressive and looks like a major step forward! Alas, I will be at a conference for the next week and this PR really needs to be thoroughly tested. So, please allow me a bit more time than usual.
>
> On the plus side: I will add a '!' to the title, as this might be worth a major release.

Hi @cmeesters!

Of course, I implemented various tests (albeit with LLM assistance, though I looked them over carefully and they mostly make sense). I expect there are likely use cases and SLURM configurations I've overlooked; your feedback on those will be very helpful.

Enjoy your conference!

@cademirch (Contributor, Author) commented:

@cmeesters Gentle bump here, would be great to discuss this.

@cmeesters (Member) commented Jun 29, 2025

@cademirch I'm sorry - I forgot to notify you: I didn't have a cluster for the last week (cooling issues). Hopefully we will be open again by next week. Yet I do not know how much time I will find: holidays are approaching. I really want to test this. It would be a wonderful addition!

Update: still no working cluster for me. As I will be on holiday until mid-July, there will be little progress, I'm afraid. I want to attempt a first test once the cluster is up again, but whether this leaves me time to do something substantial, I do not know.

@johanneskoester (Contributor) commented:

Sounds like a good idea in principle. I believe the same could be achieved by providing a Python expression for slurm_partition as a default resource, but I understand that complex clusters might turn such an expression into a very complex thing. A thought: what about discussing the logic here with the dev(s) (@gwct, https://github.com/harvardinformatics/snakemake-executor-plugin-cannon) of the cannon plugin, which is a fork of the slurm plugin specific to the Harvard cannon cluster? As far as I know, it does auto-selection of partitions. Maybe the outcome could be that what is developed here replaces the cannon plugin, or at least makes it easier to maintain.

@cmeesters (Member) commented:

Regarding cannon: interesting. And sad that they forked without trying to provide something upstream.

I was imagining parsing the partition definitions of SLURM. That is a nightmare, but feasible - no quick implementation on the horizon. The hardest part would be to get an idea of preferences. That a "parallel" partition is intended to receive jobs for whole nodes or multiples of nodes might be understandable for us, albeit not deducible from the name, as "parallel" might have quite different translations. The same applies to "smp" (what is the limit? This can be part of the partition definitions or the QOS) or "a100dl" (what kind of GPU application is supposed to run there?).

Perhaps it is worth investing time to compare outputs of cluster partition definitions (scontrol show partition ...) and sacctmgr commands for QOS. One person alone will always develop in ignorance of other clusters, so this could be part of our regular meetings. I am willing to invest some time in this. What do you think (@cademirch, @johanneskoester)?

@cademirch (Contributor, Author) commented Aug 22, 2025

> A thought: what about discussing the logic here with the dev(s) (@gwct, https://github.com/harvardinformatics/snakemake-executor-plugin-cannon) of the cannon plugin

@gwct's cannon plugin is what inspired me to write this PR. He and I have collaborated on other work; I'm sure he'll be happy to chat about this.

> Perhaps it is worth investing time to compare outputs of cluster partition definitions (scontrol show partition ...) and sacctmgr commands for QOS. One person alone will always develop in ignorance of other clusters, so this could be part of our regular meetings.

If I understand correctly, you want to explore parsing partition limits from scontrol/sacctmgr? I'm open to this, but would defer to you on this idea since I have very little experience with SLURM clusters in general. In any case, I think some sort of manual way to specify cluster limits will be needed in case a user's cluster setup does not work with parsing limits from scontrol or the like.

@johanneskoester (Contributor) commented:

I agree, both the automatic approach and the guided one proposed here make sense.

@cmeesters (Member) commented:

ok, let's focus here on the guided approach. I will be travelling, but might(!) find some time in between. Or else, next week.

@gwct commented Aug 25, 2025

Hey there all,
Sorry for the silence, I've been traveling the past week.

The Cannon plugin was a quick hackathon project. While I definitely think it could be expanded upon, I didn't want to clutter up your PRs with something that has a lot of stuff hard coded for our cluster specifically.

Basically it's all just a little decision tree in the get_partition_arg() function that assigns partitions based first on memory requested, then CPUs and time. Of course, if GPUs are required, that takes precedence on our cluster, since there is only one GPU partition. The rest of the additions are just helper functions in cannon.py.
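
For illustration, that kind of decision tree boils down to something like this; the partition names and cut-offs here are hypothetical, not the actual cannon values:

```python
def pick_partition(mem_mb: int, cpus: int, runtime_min: int, gpus: int = 0) -> str:
    """Toy decision tree in the spirit described above; all thresholds are made up."""
    if gpus > 0:
        return "gpu"        # only one GPU partition, so GPUs take precedence
    if mem_mb > 500_000:
        return "bigmem"     # memory is checked first
    if cpus > 64 or runtime_min > 3 * 24 * 60:
        return "long"       # then CPUs and time
    return "shared"

print(pick_partition(mem_mb=32_000, cpus=8, runtime_min=120))  # "shared"
```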

As I think we all realize, a universal parser for something like scontrol would be ideal, but extremely difficult. Even in the cannon plugin, we ran into some inconsistencies; notably, the only working way to specify GPUs was with slurm_extra, even though both gres: and gpu: seem to be possible options. I'm not sure what about our cluster makes this so, but it's probably a good example of the quirks we might run into when developing a general partition selector.

That being said, I think @cademirch's guided approach above, with a YAML file, is probably a good place to start, though it still requires a lot of user involvement in generating the YAML file. Perhaps, once that's in place, the next step would be to write a separate parsing function that generates the YAML file automatically from scontrol; that way we can build on top of it and the user can still provide their own YAML if needed.
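
As a rough sketch of that idea: `scontrol show partition -o` prints one line of Key=Value pairs per partition, so a generator could map a few of those fields onto the YAML schema proposed here and write out a draft for the user to review. Field names, units, and time formats vary between SLURM versions and sites, so the mapping below is an assumption to be checked, not a finished parser:

```python
import subprocess
import yaml  # PyYAML


def _minutes(slurm_time: str) -> int | None:
    """Convert common SLURM time strings ('1-00:00:00', '02:30:00', '120') to minutes.

    Returns None for UNLIMITED/infinite, i.e. "no limit" in the YAML schema.
    """
    if slurm_time.upper() in ("UNLIMITED", "INFINITE"):
        return None
    days, _, rest = slurm_time.partition("-") if "-" in slurm_time else ("0", "", slurm_time)
    parts = [int(p) for p in rest.split(":")]
    if len(parts) == 1:        # bare minutes, e.g. "120"
        hours, minutes = 0, parts[0]
    elif len(parts) == 2:      # MM:SS
        hours, minutes = 0, parts[0]
    else:                      # HH:MM:SS
        hours, minutes = parts[0], parts[1]
    return int(days) * 24 * 60 + hours * 60 + minutes


def generate_partition_yaml() -> str:
    """Draft a partitions.yml from `scontrol show partition -o` for manual review."""
    out = subprocess.run(
        ["scontrol", "show", "partition", "-o"],
        capture_output=True, text=True, check=True,
    ).stdout
    partitions = {}
    for line in out.splitlines():
        fields = dict(kv.split("=", 1) for kv in line.split() if "=" in kv)
        name = fields.get("PartitionName")
        if not name:
            continue
        limits = {}
        # Assumed field mapping -- verify against your cluster's scontrol output.
        runtime = _minutes(fields.get("MaxTime", "UNLIMITED"))
        if runtime is not None:
            limits["max_runtime"] = runtime
        if fields.get("MaxMemPerNode", "UNLIMITED").isdigit():
            limits["max_mem_mb"] = int(fields["MaxMemPerNode"])
        if fields.get("MaxNodes", "UNLIMITED").isdigit():
            limits["max_nodes"] = int(fields["MaxNodes"])
        partitions[name] = limits
    return yaml.safe_dump({"partitions": partitions}, sort_keys=False)


if __name__ == "__main__":
    print(generate_partition_yaml())
```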

Happy to discuss more and open to help testing as well! I'll also try to take a better look at Cade's code in this PR, since it seems more robust than our simple decision tree.

@johanneskoester (Contributor) commented:

That sounds like a reasonable plan, @gwct. In particular, I like that the user/admin can then review the YAML that has been autogenerated from scontrol in order to refine or extend it.

@dlaehnemann (Member) commented:

After reading through this discussion quickly, I agree that a config yaml setup as a first step makes sense. And automatic parsing would be great, but sounds hard.

One thing that came to mind, because I also use the LSF executor plugin: would it make sense to make this more generic (maybe implement it in snakemake-interface-executor-plugins)? There is a similar concept in LSF clusters, although it is not called a partition but a queue. For example, see these docs:
https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=job-view-available-queues

And if this is to be a more generic functionality, one important thing to come up with would be generic terminology for it. Not sure; maybe something like `node-set-configurations:`, as these define node sets and their resource restrictions?

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
snakemake_executor_plugin_slurm/partitions.py (2)

160-175: Treat negative cpus_per_task as a configuration error instead of silently disabling CPU requirements.

For a negative cpus_per_task, the function returns (0, "none"):

if cpus_per_task < 0:
    return (0, "none")

This effectively ignores an invalid resource declaration that is almost certainly a user error. It also prevents the job’s CPU requirements from influencing partition feasibility or scoring.

Prefer raising a WorkflowError (or at least logging loudly) so misconfigured jobs fail fast and visibly:

-        if cpus_per_task < 0:
-            return (0, "none")
+        if cpus_per_task < 0:
+            raise WorkflowError(
+                f"Invalid negative value for cpus_per_task: {cpus_per_task}"
+            )

83-116: Harden GPU requirement parsing to avoid uncaught ValueErrors and partial parsing.

parse_gpu_requirements will currently:

  • Call int() on gpu_spec[1] / gpu_spec[2] without protection, so a malformed gres like gpu:foo will raise ValueError and bubble up out of partition scoring.
  • Only consider the first matching gpu:* entry in a comma‑separated gres string.

That means a single bad GPU spec can crash dynamic partition selection with an unhelpful traceback, and multi‑GPU gres strings are only partially interpreted.

Consider:

  • Wrapping the int() calls in try/except ValueError and re‑raising a WorkflowError that clearly points to the offending gres value.
  • Optionally iterating all gpu:* parts and aggregating counts (or explicitly rejecting multiple GPU specs, depending on intended semantics).

Example sketch:

-        if gpu_parts:
-            gpu_spec = gpu_parts[0].strip().split(":")
-            if len(gpu_spec) == 2:  # gpu:<number>
-                return int(gpu_spec[1]), None
-            elif len(gpu_spec) == 3:  # gpu:<model>:<number>
-                return int(gpu_spec[2]), gpu_spec[1]
+        if gpu_parts:
+            gpu_spec = gpu_parts[0].strip().split(":")
+            try:
+                if len(gpu_spec) == 2:  # gpu:<number>
+                    return int(gpu_spec[1]), None
+                elif len(gpu_spec) == 3:  # gpu:<model>:<number>
+                    return int(gpu_spec[2]), gpu_spec[1]
+            except ValueError:
+                raise WorkflowError(f"Invalid GPU count in gres: {gpu_parts[0]}")
+            # Optionally: raise on any other unexpected format
+            raise WorkflowError(f"Invalid gres GPU format: {gpu_parts[0]}")
🧹 Nitpick comments (5)
snakemake_executor_plugin_slurm/partitions.py (2)

13-34: Narrow the broad except Exception and simplify error propagation in read_partition_file.

The function already has specific handlers for FileNotFoundError and yaml.YAMLError, but then falls back to except Exception as e: and wraps everything else in a generic WorkflowError. This:

  • Can hide programmer bugs (e.g. TypeError from a bad PartitionLimits signature) behind a generic “Unexpected error” message.
  • Triggers Ruff’s BLE001/B904/TRY003 warnings.

Two small improvements:

  • Restrict the generic catch to expected I/O problems (OSError, PermissionError) if you want to keep a fallback, and let truly unexpected exceptions propagate.
  • Optionally preserve the original exception as context via raise WorkflowError(...) from e if you want better tracebacks (outside __init__.py, there is no repo‑specific constraint against chaining).

For example:

-    except Exception as e:
-        raise WorkflowError(
-            f"Unexpected error reading partition file {partition_file}: {e}"
-        )
+    except OSError as e:
+        raise WorkflowError(
+            f"Error reading partition file {partition_file}: {e}"
+        ) from e

Also, Ruff flags the long error strings as TRY003; if you care about keeping the linter clean, you might want to reduce duplication in the messages or centralize formatting.


239-243: Remove unused # noqa: E501 pragmas on comment-only lines.

The # noqa: E501 directives on these long-comment lines are reported as unused (RUF100) because E501 is not enabled. They also don’t affect runtime.

You can safely drop them to appease Ruff and keep the file cleaner.

-        # try to score how closely a job matches a partition's limits, in order to handle case where multiple partitions can run a given job # noqa: E501
-        # naive approach here is to just sum the ratio of requested resource to limit, of course this limits us to only consider numerical resources # noqa: E501
-        # here a higher score indicates a better fit
-        # TODO decide how to handle unspecified limits, for now we assume inf for numerical limits, none for others. # noqa: E501
+        # try to score how closely a job matches a partition's limits, in order
+        # to handle case where multiple partitions can run a given job.
+        # naive approach here is to sum the ratio of requested resource to limit
+        # (only numerical resources considered). Higher score = better fit.
+        # TODO: decide how to handle unspecified limits (currently: inf/None).
tests/tests.py (1)

24-25: Tighten the noqa codes and reconsider the * import.

Two points here:

  1. Ruff reports F401 as non‑enabled, so # noqa: F401 is unnecessary noise and triggers RUF100. If you keep the star import, F403 alone is sufficient:
-from .test_partition_selection import *  # noqa: F401, F403
+from .test_partition_selection import *  # noqa: F403
  2. Importing all tests from test_partition_selection into this module will typically make pytest run those tests twice (once in their own module, once via this import), assuming normal test discovery. That's harmless but redundant and can make failures and timings harder to interpret.

If CI already discovers tests/test_partition_selection.py directly, consider dropping the * import entirely and letting pytest handle that file on its own.

snakemake_executor_plugin_slurm/__init__.py (1)

220-239: Partition loading and env‑var vs CLI precedence look correct; consider logging CLI-based config as well.

The new __post_init__ logic:

  • Uses SNAKEMAKE_SLURM_PARTITIONS when set and no partition_config was provided.
  • Otherwise reads from ExecutorSettings.partition_config.
  • Stores result in self._partitions and registers clean_old_logs via atexit.

This matches the documented precedence and keeps the partition parsing in one place. One small nicety you might consider is logging when the CLI option is used as well, similar to the env‑var branch, to make it clearer in logs which partition config file was active.

For example:

        else:
-            self._partitions = (
-                read_partition_file(self.workflow.executor_settings.partition_config)
-                if self.workflow.executor_settings.partition_config
-                else None
-            )
+            if self.workflow.executor_settings.partition_config:
+                partition_file = self.workflow.executor_settings.partition_config
+                self.logger.info(
+                    f"Reading SLURM partition configuration from CLI file: {partition_file}"
+                )
+                self._partitions = read_partition_file(partition_file)
+            else:
+                self._partitions = None
tests/test_partition_selection.py (1)

220-229: Docstring for test_read_file_missing_partitions_key is outdated.

The docstring still says:

"""Test reading a file without 'partitions' key raises KeyError."""

but the test now correctly asserts on a WorkflowError:

with pytest.raises(WorkflowError, match="missing 'partitions' section"):
    read_partition_file(temp_path)

Consider updating the docstring to reflect the actual expected exception type to avoid confusion during future maintenance.

📜 Review details

Configuration used: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 982dfee and 744b8b3.

📒 Files selected for processing (5)
  • docs/further.md (1 hunks)
  • snakemake_executor_plugin_slurm/__init__.py (6 hunks)
  • snakemake_executor_plugin_slurm/partitions.py (1 hunks)
  • tests/test_partition_selection.py (1 hunks)
  • tests/tests.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/further.md
🧰 Additional context used
🧠 Learnings (18)
📓 Common learnings
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.
📚 Learning: 2025-03-10T15:20:51.829Z
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-03-31T11:52:05.373Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2024-12-11T14:17:08.749Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 178
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2024-12-11T14:17:08.749Z
Learning: In the `snakemake-executor-plugin-slurm` project, when handling exceptions in `snakemake_executor_plugin_slurm/__init__.py`, prefer concise error messages and avoid unnecessary verbosity or exception chaining when it's not necessary.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-03-10T13:34:59.270Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:34:59.270Z
Learning: In the Snakemake executor plugin for SLURM, all GPU resources (even simple numeric ones) should be translated to the `--gpus` flag (plural) to match SLURM's expectations, not `--gpu` (singular).

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-02-18T14:49:42.624Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: snakemake_executor_plugin_slurm/utils.py:75-100
Timestamp: 2025-02-18T14:49:42.624Z
Learning: In the Snakemake SLURM executor plugin, users must specify either `gres` or `gpus` for GPU resources, but not both simultaneously, as these are mutually exclusive ways to request GPU resources.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T13:34:59.270Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:34:59.270Z
Learning: In the Snakemake executor plugin for SLURM, all GPU resources (both simple numeric values and model:number specifications) should be translated to the `--gpus` flag (plural form) to match SLURM's command line interface expectations.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T13:22:54.081Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:22:54.081Z
Learning: In the Snakemake executor plugin for SLURM, when using GPU resources, the function returns `--gpu=n` (singular) when only a number is provided, but `--gpus=model:n` (plural) when a model is specified in the implementation.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-02-18T14:40:27.064Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2025-02-18T14:40:27.064Z
Learning: In the Snakemake executor plugin for SLURM, the GPU resource is specified using the "gpus" (plural) resource name, not "gpu" (singular).

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-01-13T09:54:22.950Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-01-20T09:13:26.443Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 193
File: .github/workflows/post_to_mastodon.yml:26-26
Timestamp: 2025-01-20T09:13:26.443Z
Learning: In the snakemake-executor-plugin-slurm repository, release PRs follow the naming pattern "chore(main): release X.Y.Z" where X.Y.Z is the version number.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-06-24T12:41:31.837Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: tests/tests.py:62-70
Timestamp: 2025-06-24T12:41:31.837Z
Learning: In the snakemake-executor-plugin-slurm efficiency report feature, the executor is responsible for creating the efficiency_report_path directory when generating efficiency reports. Tests should not manually check for directory existence but should rely on the executor to properly create necessary directories as part of its normal operation.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-07-22T17:31:41.629Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 333
File: .github/workflows/announce-release.yml:0-0
Timestamp: 2025-07-22T17:31:41.629Z
Learning: In the snakemake-executor-plugin-slurm repository, cmeesters prefers using `main` references for GitHub Actions to always get the latest version, with security provided through protected branches and mandatory CI checks rather than commit SHA pinning.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-06-03T17:10:51.292Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2025-06-03T17:10:51.292Z
Learning: In the snakemake-executor-plugin-slurm repository, the create_efficiency_report method in snakemake_executor_plugin_slurm/__init__.py contains temporary testing/debugging code that logs directory contents using os.listdir() and directory paths. This code was inserted for testing purposes only and should be removed from the production implementation.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-01-17T17:27:32.446Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 187
File: .github/workflows/post_to_mastodon.yml:0-0
Timestamp: 2025-01-17T17:27:32.446Z
Learning: In the mastodon publishing workflow for snakemake-executor-plugin-slurm, the PR_TITLE environment variable is required by the post_to_mastodon.sh script and should be preserved.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-09-15T12:00:31.977Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 355
File: snakemake_executor_plugin_slurm/submit_string.py:23-30
Timestamp: 2025-09-15T12:00:31.977Z
Learning: In the snakemake-executor-plugin-slurm project, the parameters params.run_uuid, params.slurm_logfile, and params.comment_str are always provided and never None, so validation for None values is not necessary for these specific parameters in the get_submit_command function.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-05-26T12:22:07.005Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/efficiency_report.py:34-49
Timestamp: 2025-05-26T12:22:07.005Z
Learning: In the parse_reqmem function in snakemake_executor_plugin_slurm/efficiency_report.py, the conditional check for per_unit == "/node" is necessary for proper conversion of memory values in multi-node jobs, not redundant as it might initially appear.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-09-16T17:25:38.450Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 355
File: snakemake_executor_plugin_slurm/submit_string.py:0-0
Timestamp: 2025-09-16T17:25:38.450Z
Learning: In the snakemake-executor-plugin-slurm project, constraint and qos parameters follow a three-state logic: None (not set, no flag added), empty string (explicitly set to empty, adds flag with empty quotes like --constraint ''), and non-empty string (adds flag with quoted value). The correct implementation uses `constraint = job.resources.get("constraint")` followed by `if constraint is not None:` to distinguish between "not provided" and "explicitly set to empty".

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
🧬 Code graph analysis (3)
tests/tests.py (1)
snakemake_executor_plugin_slurm/__init__.py (1)
  • ExecutorSettings (44-176)
tests/test_partition_selection.py (1)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-50)
  • get_best_partition (53-80)
snakemake_executor_plugin_slurm/__init__.py (1)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-50)
  • get_best_partition (53-80)
🪛 Ruff (0.14.5)
tests/tests.py

25-25: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

snakemake_executor_plugin_slurm/partitions.py

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


19-19: Avoid specifying long messages outside the exception class

(TRY003)


21-21: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


21-21: Avoid specifying long messages outside the exception class

(TRY003)


22-22: Do not catch blind exception: Exception

(BLE001)


23-25: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


23-25: Avoid specifying long messages outside the exception class

(TRY003)


27-29: Avoid specifying long messages outside the exception class

(TRY003)


32-34: Avoid specifying long messages outside the exception class

(TRY003)


38-38: Avoid specifying long messages outside the exception class

(TRY003)


100-102: Avoid specifying long messages outside the exception class

(TRY003)


101-101: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


239-239: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


240-240: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


242-242: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: testing
🔇 Additional comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)

733-755: Partition resolution order is sensible; confirm behavior when automatic selection fails.

get_partition_arg now:

  1. Honors an explicit slurm_partition resource.
  2. Otherwise, when self._partitions is set, calls get_best_partition(self._partitions, job, self.logger).
  3. Falls back to get_default_partition(job) when no partition has been selected.
  4. Returns "" when even the default partition cannot be determined.

This preserves existing fallback semantics while inserting dynamic selection in the middle, which is exactly what the PR description promises.

The only behavioral nuance to double‑check is whether you want to fall back to the default partition when get_best_partition returns None (no suitable partition found according to the limits). Today that “failure to match” is logged by get_best_partition but still results in scheduling on the cluster’s default partition.

If the intent is to treat “no matching partition” as a hard error instead, you’d need to change this to raise a WorkflowError instead of silently falling back. Otherwise, the current behavior is consistent with the “falling back to default behavior” message in get_best_partition.
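
For concreteness, the two policies differ only in what happens when auto-selection finds nothing; a self-contained sketch (the class and function names here are stand-ins, not the plugin's actual code):

```python
from typing import Optional

class WorkflowError(Exception):
    """Stand-in for Snakemake's WorkflowError, only for this sketch."""

def resolve_partition(explicit: Optional[str], auto_selected: Optional[str],
                      default: Optional[str], strict: bool = False) -> str:
    """Explicit partition > auto-selected > default, with an optional strict mode."""
    if explicit:
        return explicit
    if auto_selected:
        return auto_selected
    if strict:
        # Hard-error policy: refuse to silently schedule on the default partition.
        raise WorkflowError("no configured partition satisfies the job's resources")
    # Current behaviour: fall back to the default partition (or no partition at all).
    return default or ""

print(resolve_partition(None, None, "standard"))          # "standard"
# resolve_partition(None, None, "standard", strict=True)  # would raise WorkflowError
```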

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

♻️ Duplicate comments (2)
snakemake_executor_plugin_slurm/partitions.py (2)

81-114: GPU requirement parsing can misclassify some GPU jobs and lacks validation

parse_gpu_requirements correctly covers gpu as a numeric count and gres="gpu[:model]:n", but there are two gaps:

  • A gpu resource like gpu="tesla:2" (which set_gres_string currently accepts and turns into --gpus=tesla:2) is treated here as non-GPU: the int() cast fails, gpu_required is set to 0, and no gres is present. Such jobs will be considered CPU-only for partition scoring, so they may be scheduled on partitions with max_gpu == 0 while still submitting --gpus=tesla:2. That’s a real mis-scheduling risk.
  • For gres, only the first gpu entry in a comma-separated list is processed and integer conversion is unguarded; malformed values or unexpected formats silently fall back to (0, None) instead of surfacing a configuration error. (This was already raised in a previous review.)

I recommend:

  • Extending the gpu_required parsing to support the "model:n" form (deriving both count and model when gpu_model isn’t provided).
  • Adding try/except ValueError around all numeric conversions and raising WorkflowError with a clear message on invalid counts or unsupported formats.
  • Optionally iterating over all gpu parts in gres if you want to support multiple GPU resources, or at least validating and rejecting unexpected additional entries instead of ignoring them.

137-191: Invalid CPU resource values are silently ignored instead of failing fast

get_job_cpu_requirement currently maps:

  • cpus_per_task < 0 to (0, "none")
  • cpus_per_gpu <= 0 to (0, "none")

Returning (0, "none") effectively drops the CPU requirement and lets partition scoring and submission proceed, even though the configuration is clearly invalid. This can hide user mistakes and lead to confusing behavior (e.g. partitions appearing compatible when they should be rejected), and it was already flagged in an earlier review for cpus_per_task.

Consider instead:

  • Raising a WorkflowError with an explicit message when cpus_per_task < 0 or cpus_per_gpu <= 0, so misconfigurations fail early and clearly.
  • Optionally adding a small guard around the int() casts to raise on non-numeric input rather than coercing to 0.

Example for the negative cpus_per_task branch:

-        if cpus_per_task < 0:
-            return (0, "none")
+        if cpus_per_task < 0:
+            raise WorkflowError(
+                f"Invalid negative value for cpus_per_task: {cpus_per_task}"
+            )

Apply a similar pattern for cpus_per_gpu <= 0.

🧹 Nitpick comments (1)
snakemake_executor_plugin_slurm/partitions.py (1)

13-35: Minor lint/exception-style cleanups (optional)

A few small polish items you may want to address later:

  • read_partition_file catches bare Exception and re-raises WorkflowError; narrowing this to I/O-related exceptions (e.g. OSError) would avoid accidentally masking unrelated programmer errors.
  • Several long strings use # noqa: E501 even though E501 is disabled in Ruff, triggering RUF100 “unused noqa” warnings (e.g. at the GPU mutual-exclusion error and the long explanatory comments near score_job_fit). Those noqas can be dropped without changing behavior.

These are not functional issues but will keep linters quieter.

Also applies to: 97-101, 237-241
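
For read_partition_file specifically, one possible narrowing, assuming the open/yaml.safe_load structure shown in the diffs of this review (a sketch, not the actual function):

    try:
        with open(partition_file) as f:
            config = yaml.safe_load(f)
    except (OSError, yaml.YAMLError) as e:
        # OSError covers FileNotFoundError/PermissionError; YAMLError covers parse failures
        raise WorkflowError(
            f"Could not read partition file {partition_file}: {e}"
        ) from e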

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 982dfee and be655d1.

📒 Files selected for processing (5)
  • docs/further.md (1 hunks)
  • snakemake_executor_plugin_slurm/__init__.py (6 hunks)
  • snakemake_executor_plugin_slurm/partitions.py (1 hunks)
  • tests/test_partition_selection.py (1 hunks)
  • tests/tests.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/further.md
🧰 Additional context used
🧠 Learnings (18)
📓 Common learnings
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.
📚 Learning: 2025-03-31T11:52:05.373Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T15:20:51.829Z
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2024-12-11T14:17:08.749Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 178
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2024-12-11T14:17:08.749Z
Learning: In the `snakemake-executor-plugin-slurm` project, when handling exceptions in `snakemake_executor_plugin_slurm/__init__.py`, prefer concise error messages and avoid unnecessary verbosity or exception chaining when it's not necessary.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-01-13T09:54:22.950Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-01-20T09:13:26.443Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 193
File: .github/workflows/post_to_mastodon.yml:26-26
Timestamp: 2025-01-20T09:13:26.443Z
Learning: In the snakemake-executor-plugin-slurm repository, release PRs follow the naming pattern "chore(main): release X.Y.Z" where X.Y.Z is the version number.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-06-24T12:41:31.837Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: tests/tests.py:62-70
Timestamp: 2025-06-24T12:41:31.837Z
Learning: In the snakemake-executor-plugin-slurm efficiency report feature, the executor is responsible for creating the efficiency_report_path directory when generating efficiency reports. Tests should not manually check for directory existence but should rely on the executor to properly create necessary directories as part of its normal operation.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-07-22T17:31:41.629Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 333
File: .github/workflows/announce-release.yml:0-0
Timestamp: 2025-07-22T17:31:41.629Z
Learning: In the snakemake-executor-plugin-slurm repository, cmeesters prefers using `main` references for GitHub Actions to always get the latest version, with security provided through protected branches and mandatory CI checks rather than commit SHA pinning.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-06-03T17:10:51.292Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2025-06-03T17:10:51.292Z
Learning: In the snakemake-executor-plugin-slurm repository, the create_efficiency_report method in snakemake_executor_plugin_slurm/__init__.py contains temporary testing/debugging code that logs directory contents using os.listdir() and directory paths. This code was inserted for testing purposes only and should be removed from the production implementation.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-01-17T17:27:32.446Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 187
File: .github/workflows/post_to_mastodon.yml:0-0
Timestamp: 2025-01-17T17:27:32.446Z
Learning: In the mastodon publishing workflow for snakemake-executor-plugin-slurm, the PR_TITLE environment variable is required by the post_to_mastodon.sh script and should be preserved.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-09-15T12:00:31.977Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 355
File: snakemake_executor_plugin_slurm/submit_string.py:23-30
Timestamp: 2025-09-15T12:00:31.977Z
Learning: In the snakemake-executor-plugin-slurm project, the parameters params.run_uuid, params.slurm_logfile, and params.comment_str are always provided and never None, so validation for None values is not necessary for these specific parameters in the get_submit_command function.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-05-26T12:22:07.005Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/efficiency_report.py:34-49
Timestamp: 2025-05-26T12:22:07.005Z
Learning: In the parse_reqmem function in snakemake_executor_plugin_slurm/efficiency_report.py, the conditional check for per_unit == "/node" is necessary for proper conversion of memory values in multi-node jobs, not redundant as it might initially appear.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-09-16T17:25:38.450Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 355
File: snakemake_executor_plugin_slurm/submit_string.py:0-0
Timestamp: 2025-09-16T17:25:38.450Z
Learning: In the snakemake-executor-plugin-slurm project, constraint and qos parameters follow a three-state logic: None (not set, no flag added), empty string (explicitly set to empty, adds flag with empty quotes like --constraint ''), and non-empty string (adds flag with quoted value). The correct implementation uses `constraint = job.resources.get("constraint")` followed by `if constraint is not None:` to distinguish between "not provided" and "explicitly set to empty".

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-03-10T13:34:59.270Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:34:59.270Z
Learning: In the Snakemake executor plugin for SLURM, all GPU resources (even simple numeric ones) should be translated to the `--gpus` flag (plural) to match SLURM's expectations, not `--gpu` (singular).

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-02-18T14:49:42.624Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: snakemake_executor_plugin_slurm/utils.py:75-100
Timestamp: 2025-02-18T14:49:42.624Z
Learning: In the Snakemake SLURM executor plugin, users must specify either `gres` or `gpus` for GPU resources, but not both simultaneously, as these are mutually exclusive ways to request GPU resources.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T13:34:59.270Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:34:59.270Z
Learning: In the Snakemake executor plugin for SLURM, all GPU resources (both simple numeric values and model:number specifications) should be translated to the `--gpus` flag (plural form) to match SLURM's command line interface expectations.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T13:22:54.081Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:22:54.081Z
Learning: In the Snakemake executor plugin for SLURM, when using GPU resources, the function returns `--gpu=n` (singular) when only a number is provided, but `--gpus=model:n` (plural) when a model is specified in the implementation.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-02-18T14:40:27.064Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2025-02-18T14:40:27.064Z
Learning: In the Snakemake executor plugin for SLURM, the GPU resource is specified using the "gpus" (plural) resource name, not "gpu" (singular).

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
🧬 Code graph analysis (2)
snakemake_executor_plugin_slurm/__init__.py (1)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-50)
  • get_best_partition (53-78)
tests/test_partition_selection.py (1)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-50)
  • get_best_partition (53-78)
🪛 GitHub Actions: CI
tests/test_partition_selection.py

[error] 646-646: flake8 E501: line too long (89 > 88 characters).

tests/tests.py

[error] 25-25: ImportError: attempted relative import with no known parent package

🪛 Ruff (0.14.5)
snakemake_executor_plugin_slurm/partitions.py

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


19-19: Avoid specifying long messages outside the exception class

(TRY003)


21-21: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


21-21: Avoid specifying long messages outside the exception class

(TRY003)


22-22: Do not catch blind exception: Exception

(BLE001)


23-25: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


23-25: Avoid specifying long messages outside the exception class

(TRY003)


27-29: Avoid specifying long messages outside the exception class

(TRY003)


32-34: Avoid specifying long messages outside the exception class

(TRY003)


38-38: Avoid specifying long messages outside the exception class

(TRY003)


98-100: Avoid specifying long messages outside the exception class

(TRY003)


99-99: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


237-237: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


238-238: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


240-240: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)

tests/tests.py

25-25: Unused noqa directive (non-enabled: F401)

Remove unused noqa directive

(RUF100)

🔇 Additional comments (1)
snakemake_executor_plugin_slurm/__init__.py (1)

118-131: Partition configuration wiring and selection integration verified as correct

The new executor wiring around partition_config and _partitions is coherent:

  • ExecutorSettings.partition_config is optional and clearly documented, with explicit mention of the SNAKEMAKE_SLURM_PARTITIONS env var and precedence (flag overrides env var).
  • __post_init__ initializes _partitions once, reading from either the env var or the CLI-specified YAML, and leaves it as None when no configuration is provided.
  • get_partition_arg:
    • Honours an explicit slurm_partition resource first.
    • Falls back to get_best_partition(self._partitions, job, self.logger) when partitions are configured.
    • Uses get_default_partition(job) only if no explicit or auto-selected partition is available, preserving previous behavior.

get_submit_command still receives a fully formatted -p ... or an empty string, in line with existing account handling. Overall this is a clean integration of the new partition selection without breaking the old fallback paths.
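
A minimal sketch of that selection order (names follow the summary above, not necessarily the actual source):

def get_partition_arg(self, job):
    # 1. an explicit slurm_partition resource always wins
    partition = job.resources.get("slurm_partition")
    # 2. otherwise score the configured partitions, if any, against the job
    if not partition and self._partitions is not None:
        partition = get_best_partition(self._partitions, job, self.logger)
    # 3. finally fall back to the cluster's default partition, if one exists
    if not partition:
        partition = get_default_partition(job)
    return f"-p {partition}" if partition else ""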

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
tests/test_partition_selection.py (1)

267-307: Fix logger expectations in success-path partition selection tests (use mock_logger.info, not warning).

get_best_partition logs successful auto-selection via logger.info and only uses logger.warning when no suitable partition is found. All success-path tests here still assert on mock_logger.warning, so they will fail once executed.

Update success-path tests (those expecting an “Auto-selected partition …” message or any non-None selection) to assert on mock_logger.info instead of mock_logger.warning, while keeping warning checks only for failure paths that verify “No suitable partition found …”.

Example pattern (apply to all success cases):

-            # Should select 'default' partition as it supports CPU jobs
-            assert selected_partition == "default"
-            # Check that the final call contains the auto-selection message
-            assert mock_logger.warning.call_count >= 1
-            assert (
-                "Auto-selected partition 'default'"
-                in mock_logger.warning.call_args_list[-1][0][0]
-            )
+            # Should select 'default' partition as it supports CPU jobs
+            assert selected_partition == "default"
+            # Check that the final call contains the auto-selection message
+            assert mock_logger.info.call_count >= 1
+            assert (
+                "Auto-selected partition 'default'"
+                in mock_logger.info.call_args_list[-1][0][0]
+            )

Similarly, in tests that only check that logging occurred (e.g., cluster/max_threads success paths), switch:

-            assert mock_logger.warning.call_count >= 1
+            assert mock_logger.info.call_count >= 1

Leave the negative tests (selected_partition is None + “No suitable partition found …”) on mock_logger.warning as they match the implementation.

Also applies to: 337-361, 389-407, 456-475, 478-496, 580-603, 605-621, 624-641, 665-681, 749-752, 780-813, 816-821

tests/tests.py (1)

24-30: Remove the star-import wiring for test_partition_selection to avoid Ruff F403 and unnecessary indirection.

This block reintroduces import * from test_partition_selection, now wrapped in a try/except:

# Import partition selection tests
try:
    from .test_partition_selection import *
except ImportError:
    # When run directly, the relative import may fail
    from test_partition_selection import *

Pytest will already discover and run tests/test_partition_selection.py directly; re-exporting its contents here is unnecessary, triggers Ruff F403 on both imports, and risks confusing test collection.

Simplest fix: drop this block entirely and rely on normal test discovery:

-# Import partition selection tests
-try:
-    from .test_partition_selection import *
-except ImportError:
-    # When run directly, the relative import may fail
-    from test_partition_selection import *
🧹 Nitpick comments (2)
tests/test_partition_selection.py (1)

219-226: Align docstring with actual exception type in test_read_file_missing_partitions_key.

The docstring says this case “raises KeyError”, but read_partition_file raises WorkflowError for a missing 'partitions' section and the test already asserts WorkflowError. Consider updating the docstring to avoid confusion.

tests/tests.py (1)

646-669: Fix typo in test name test_taks.

The method name is test_taks, which looks like a typo for test_tasks. Renaming improves readability and makes intent clearer:

-    def test_taks(self, mock_job):
+    def test_tasks(self, mock_job):
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be655d1 and fd2336e.

📒 Files selected for processing (2)
  • tests/test_partition_selection.py (1 hunks)
  • tests/tests.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.
🧬 Code graph analysis (1)
tests/test_partition_selection.py (1)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-50)
  • get_best_partition (53-78)
🪛 Ruff (0.14.5)
tests/tests.py

26-26: from .test_partition_selection import * used; unable to detect undefined names

(F403)


29-29: from test_partition_selection import * used; unable to detect undefined names

(F403)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: testing

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fd2336e and adb276c.

📒 Files selected for processing (1)
  • tests/tests.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 193
File: .github/workflows/post_to_mastodon.yml:26-26
Timestamp: 2025-01-20T09:13:26.443Z
Learning: In the snakemake-executor-plugin-slurm repository, release PRs follow the naming pattern "chore(main): release X.Y.Z" where X.Y.Z is the version number.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.
🪛 Ruff (0.14.5)
tests/tests.py

24-24: Undefined name sys

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: testing

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
snakemake_executor_plugin_slurm/__init__.py (2)

118-131: Partition config setting/help text are coherent; consider using metadata-driven env var instead of manual getenv.

The field definition and help string now match the implemented behavior (CLI flag taking precedence over SNAKEMAKE_SLURM_PARTITIONS). If ExecutorSettingsBase supports specifying a custom environment variable name via metadata["env_var"] (e.g. "SNAKEMAKE_SLURM_PARTITIONS"), you could potentially drop the manual os.getenv handling in __post_init__ and let the settings layer resolve precedence for you.

Please double-check the executor interface docs to confirm whether per-field env var names are supported before refactoring.


344-345: Clarify the comment about when partition selection happens.

Given that the sbatch string is now assembled inside get_submit_command, the note about “benefit from resource checking as the call is built up” is a bit ambiguous; consider rephrasing to say explicitly that partition selection is deferred until after all job resources are known but before composing the final sbatch command.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between adb276c and 64e99b9.

📒 Files selected for processing (2)
  • snakemake_executor_plugin_slurm/__init__.py (6 hunks)
  • tests/tests.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/tests.py
🧰 Additional context used
🧠 Learnings (14)
📓 Common learnings
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.
📚 Learning: 2025-03-31T11:52:05.373Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2024-12-11T14:17:08.749Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 178
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2024-12-11T14:17:08.749Z
Learning: In the `snakemake-executor-plugin-slurm` project, when handling exceptions in `snakemake_executor_plugin_slurm/__init__.py`, prefer concise error messages and avoid unnecessary verbosity or exception chaining when it's not necessary.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-03-10T15:20:51.829Z
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-01-20T09:13:26.443Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 193
File: .github/workflows/post_to_mastodon.yml:26-26
Timestamp: 2025-01-20T09:13:26.443Z
Learning: In the snakemake-executor-plugin-slurm repository, release PRs follow the naming pattern "chore(main): release X.Y.Z" where X.Y.Z is the version number.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-06-24T12:41:31.837Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: tests/tests.py:62-70
Timestamp: 2025-06-24T12:41:31.837Z
Learning: In the snakemake-executor-plugin-slurm efficiency report feature, the executor is responsible for creating the efficiency_report_path directory when generating efficiency reports. Tests should not manually check for directory existence but should rely on the executor to properly create necessary directories as part of its normal operation.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-01-13T09:54:22.950Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-07-22T17:31:41.629Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 333
File: .github/workflows/announce-release.yml:0-0
Timestamp: 2025-07-22T17:31:41.629Z
Learning: In the snakemake-executor-plugin-slurm repository, cmeesters prefers using `main` references for GitHub Actions to always get the latest version, with security provided through protected branches and mandatory CI checks rather than commit SHA pinning.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-06-03T17:10:51.292Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2025-06-03T17:10:51.292Z
Learning: In the snakemake-executor-plugin-slurm repository, the create_efficiency_report method in snakemake_executor_plugin_slurm/__init__.py contains temporary testing/debugging code that logs directory contents using os.listdir() and directory paths. This code was inserted for testing purposes only and should be removed from the production implementation.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-01-17T17:27:32.446Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 187
File: .github/workflows/post_to_mastodon.yml:0-0
Timestamp: 2025-01-17T17:27:32.446Z
Learning: In the mastodon publishing workflow for snakemake-executor-plugin-slurm, the PR_TITLE environment variable is required by the post_to_mastodon.sh script and should be preserved.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-09-15T12:00:31.977Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 355
File: snakemake_executor_plugin_slurm/submit_string.py:23-30
Timestamp: 2025-09-15T12:00:31.977Z
Learning: In the snakemake-executor-plugin-slurm project, the parameters params.run_uuid, params.slurm_logfile, and params.comment_str are always provided and never None, so validation for None values is not necessary for these specific parameters in the get_submit_command function.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-05-26T12:22:07.005Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/efficiency_report.py:34-49
Timestamp: 2025-05-26T12:22:07.005Z
Learning: In the parse_reqmem function in snakemake_executor_plugin_slurm/efficiency_report.py, the conditional check for per_unit == "/node" is necessary for proper conversion of memory values in multi-node jobs, not redundant as it might initially appear.

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-09-16T17:25:38.450Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 355
File: snakemake_executor_plugin_slurm/submit_string.py:0-0
Timestamp: 2025-09-16T17:25:38.450Z
Learning: In the snakemake-executor-plugin-slurm project, constraint and qos parameters follow a three-state logic: None (not set, no flag added), empty string (explicitly set to empty, adds flag with empty quotes like --constraint ''), and non-empty string (adds flag with quoted value). The correct implementation uses `constraint = job.resources.get("constraint")` followed by `if constraint is not None:` to distinguish between "not provided" and "explicitly set to empty".

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
📚 Learning: 2025-03-10T13:34:59.270Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:34:59.270Z
Learning: In the Snakemake executor plugin for SLURM, all GPU resources (even simple numeric ones) should be translated to the `--gpus` flag (plural) to match SLURM's expectations, not `--gpu` (singular).

Applied to files:

  • snakemake_executor_plugin_slurm/__init__.py
🧬 Code graph analysis (1)
snakemake_executor_plugin_slurm/__init__.py (1)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-50)
  • get_best_partition (53-78)
🔇 Additional comments (4)
snakemake_executor_plugin_slurm/__init__.py (4)

6-6: Atexit import correctly fixes prior NameError risk.

Importing atexit here matches its later use in __post_init__ and prevents runtime failures when registering clean_old_logs.


39-39: Partition helpers import looks correct and keeps logic modular.

Bringing in read_partition_file and get_best_partition here cleanly separates partition parsing/scoring from the executor wiring with no obvious issues.


220-239: Partition config loading + precedence logic in __post_init__ LGTM.

The env-var vs CLI precedence matches the help text, errors are delegated to read_partition_file for consistent WorkflowError wrapping, and registering clean_old_logs with atexit gives you best-effort cleanup without breaking the explicit shutdown path.
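
Schematically, the described precedence amounts to something like this (a sketch; attribute and helper names are assumptions, not the plugin's exact code):

import os
from typing import Optional

def resolve_partition_file(cli_value: Optional[str]) -> Optional[str]:
    """CLI flag takes precedence; the env var is only a fallback."""
    return cli_value or os.getenv("SNAKEMAKE_SLURM_PARTITIONS")

__post_init__ would then call read_partition_file on the resolved path, or leave _partitions as None when nothing is configured.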


739-745: Dynamic partition selection integrates cleanly with explicit and fallback behavior.

The logic correctly prefers an explicit slurm_partition resource, otherwise consults configured partitions via get_best_partition, and only then falls back to the default partition / no -p, which preserves existing behavior while enabling the new scoring-based selection.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
snakemake_executor_plugin_slurm/partitions.py (2)

15-25: Optional: Add exception chaining for better error traceability.

The error handling is comprehensive and catches all relevant failure modes. However, adding exception chaining (raise ... from e) would preserve the original exception context, making debugging easier when configuration issues occur.

Apply this diff to add exception chaining:

     except FileNotFoundError:
-        raise WorkflowError(f"Partition file not found: {partition_file}")
+        raise WorkflowError(f"Partition file not found: {partition_file}") from None
     except yaml.YAMLError as e:
-        raise WorkflowError(f"Error parsing partition file {partition_file}: {e}")
+        raise WorkflowError(f"Error parsing partition file {partition_file}: {e}") from e
     except Exception as e:
         raise WorkflowError(
             f"Unexpected error reading partition file {partition_file}: {e}"
-        )
+        ) from e

99-99: Cleanup: Remove unused noqa: E501 directives.

Static analysis detected several unused noqa directives for the E501 (line-too-long) rule, which is not enabled in your linting configuration.

Apply this diff to clean up:

         raise WorkflowError(
-            "GPU resource specified in both 'gpu' and 'gres'. These are mutually exclusive."  # noqa: E501
+            "GPU resource specified in both 'gpu' and 'gres'. These are mutually exclusive."
         )

 def get_job_cpu_requirement(job: JobExecutorInterface) -> tuple[int, str]:
     """
-    This uses the same logic as snakemake_executor_plugin_slurm_jobstep.get_cpu_setting, but returns a tuple instead of a arg string. # noqa: E501
+    This uses the same logic as snakemake_executor_plugin_slurm_jobstep.get_cpu_setting, but returns a tuple instead of a arg string.
     """

-        # try to score how closely a job matches a partition's limits, in order to handle case where multiple partitions can run a given job # noqa: E501
-        # naive approach here is to just sum the ratio of requested resource to limit, of course this limits us to only consider numerical resources # noqa: E501
+        # try to score how closely a job matches a partition's limits, in order to handle case where multiple partitions can run a given job
+        # naive approach here is to just sum the ratio of requested resource to limit, of course this limits us to only consider numerical resources
         # here a higher score indicates a better fit
-        # TODO decide how to handle unspecified limits, for now we assume inf for numerical limits, none for others. # noqa: E501
+        # TODO decide how to handle unspecified limits, for now we assume inf for numerical limits, none for others.

Also applies to: 139-139, 237-238, 240-240

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64e99b9 and 5b53def.

📒 Files selected for processing (1)
  • snakemake_executor_plugin_slurm/partitions.py (1 hunks)
🧰 Additional context used
🧠 Learnings (13)
📓 Common learnings
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.
📚 Learning: 2025-03-10T15:20:51.829Z
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-31T11:52:05.373Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2024-12-11T14:17:08.749Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 178
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2024-12-11T14:17:08.749Z
Learning: In the `snakemake-executor-plugin-slurm` project, when handling exceptions in `snakemake_executor_plugin_slurm/__init__.py`, prefer concise error messages and avoid unnecessary verbosity or exception chaining when it's not necessary.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T13:34:59.270Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:34:59.270Z
Learning: In the Snakemake executor plugin for SLURM, all GPU resources (even simple numeric ones) should be translated to the `--gpus` flag (plural) to match SLURM's expectations, not `--gpu` (singular).

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T13:34:59.270Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:34:59.270Z
Learning: In the Snakemake executor plugin for SLURM, all GPU resources (both simple numeric values and model:number specifications) should be translated to the `--gpus` flag (plural form) to match SLURM's command line interface expectations.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-02-18T14:49:42.624Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: snakemake_executor_plugin_slurm/utils.py:75-100
Timestamp: 2025-02-18T14:49:42.624Z
Learning: In the Snakemake SLURM executor plugin, users must specify either `gres` or `gpus` for GPU resources, but not both simultaneously, as these are mutually exclusive ways to request GPU resources.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T13:22:54.081Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:22:54.081Z
Learning: In the Snakemake executor plugin for SLURM, when using GPU resources, the function returns `--gpu=n` (singular) when only a number is provided, but `--gpus=model:n` (plural) when a model is specified in the implementation.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-02-18T14:40:27.064Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2025-02-18T14:40:27.064Z
Learning: In the Snakemake executor plugin for SLURM, the GPU resource is specified using the "gpus" (plural) resource name, not "gpu" (singular).

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-01-13T09:54:22.950Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-09-16T17:25:38.450Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 355
File: snakemake_executor_plugin_slurm/submit_string.py:0-0
Timestamp: 2025-09-16T17:25:38.450Z
Learning: In the snakemake-executor-plugin-slurm project, constraint and qos parameters follow a three-state logic: None (not set, no flag added), empty string (explicitly set to empty, adds flag with empty quotes like --constraint ''), and non-empty string (adds flag with quoted value). The correct implementation uses `constraint = job.resources.get("constraint")` followed by `if constraint is not None:` to distinguish between "not provided" and "explicitly set to empty".

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-05-26T12:22:07.005Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/efficiency_report.py:34-49
Timestamp: 2025-05-26T12:22:07.005Z
Learning: In the parse_reqmem function in snakemake_executor_plugin_slurm/efficiency_report.py, the conditional check for per_unit == "/node" is necessary for proper conversion of memory values in multi-node jobs, not redundant as it might initially appear.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-06-16T08:54:07.957Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/efficiency_report.py:0-0
Timestamp: 2025-06-16T08:54:07.957Z
Learning: In the context of SLURM executor plugin development, user cmeesters (experienced with SLURM) noted they are not aware of SLURM typically reporting memory values with decimals (like 1.5G), but accepted defensive programming to handle such cases anyway.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
🪛 Ruff (0.14.5)
snakemake_executor_plugin_slurm/partitions.py

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


19-19: Avoid specifying long messages outside the exception class

(TRY003)


21-21: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


21-21: Avoid specifying long messages outside the exception class

(TRY003)


22-22: Do not catch blind exception: Exception

(BLE001)


23-25: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


23-25: Avoid specifying long messages outside the exception class

(TRY003)


27-29: Avoid specifying long messages outside the exception class

(TRY003)


32-34: Avoid specifying long messages outside the exception class

(TRY003)


38-38: Avoid specifying long messages outside the exception class

(TRY003)


98-100: Avoid specifying long messages outside the exception class

(TRY003)


99-99: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


170-170: Avoid specifying long messages outside the exception class

(TRY003)


237-237: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


238-238: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


240-240: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)

🔇 Additional comments (2)
snakemake_executor_plugin_slurm/partitions.py (2)

137-192: LGTM: CPU requirement logic is sound.

The function correctly handles all CPU requirement scenarios:

  • Explicit cpus_per_task with proper validation (now raises WorkflowError for negative values)
  • GPU-specific cpus_per_gpu requirements
  • Fallback to effective thread count

The negative value validation at lines 169-170 properly addresses the concern from previous reviews.


231-349: LGTM: Comprehensive job-to-partition scoring logic.

The score_job_fit method implements a thorough multi-criteria evaluation:

  • Cluster compatibility validation
  • Numerical resource limit checking with ratio-based scoring
  • Thread and CPU requirement validation (both task-based and GPU-based)
  • GPU count and model matching
  • MPI support and constraint satisfaction

The scoring approach (higher score = tighter fit) correctly filters incompatible partitions while ranking suitable ones by resource utilization.
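
As a toy illustration of that ratio-based idea (not the plugin's score_job_fit), assuming unspecified limits count as unlimited:

def toy_score(requested: dict, limits: dict):
    """Sum requested/limit over numeric resources; None means the partition cannot run the job."""
    score = 0.0
    for name, need in requested.items():
        limit = limits.get(name, float("inf"))
        if need > limit:
            return None  # a single exceeded limit disqualifies the partition
        if limit not in (0, float("inf")):
            score += need / limit  # closer to the limit -> tighter fit -> higher score
    return score

print(toy_score({"cpus": 4}, {"cpus": 8}))    # 0.5  (tight fit)
print(toy_score({"cpus": 4}, {"cpus": 128}))  # 0.03125  (loose fit)
print(toy_score({"cpus": 4}, {"cpus": 2}))    # None (incompatible)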

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
snakemake_executor_plugin_slurm/partitions.py (1)

117-138: Verified without importing the module: Python 3 raises TypeError when ordering comparisons are applied to incomparable types (such as None and int), so the issue identified in the earlier review comment is real.

Fix the potential TypeError at line 132.

The code has a logic flaw where if threads is None and resource_threads is a non-numeric string:

  1. Line 127: resource_threads is not None → True
  2. Line 129-130: Attempt int(resource_threads) with ValueError
  3. Line 131: Set resource_threads = threads (None)
  4. Line 132: Attempt resource_threads > 1 → TypeError raised when comparing None to int

The fallback at lines 135-137 never executes because the exception occurs at line 132.

Add type checking before the comparison:

         if resource_threads is not None:
             try:
                 resource_threads = int(resource_threads)
             except ValueError:
                 resource_threads = threads
-            threads = resource_threads if resource_threads > 1 else threads
+            if isinstance(resource_threads, int) and resource_threads > 1:
+                threads = resource_threads
🧹 Nitpick comments (1)
snakemake_executor_plugin_slurm/partitions.py (1)

234-352: LGTM with minor cleanup opportunity!

The score_job_fit method implements comprehensive partition matching logic. The scoring approach (summing resource ratios, higher = tighter fit) is sound, and the early-return pattern for exceeded limits is efficient.

Remove unused noqa directives:

-        # try to score how closely a job matches a partition's limits, in order to handle case where multiple partitions can run a given job # noqa: E501
+        # try to score how closely a job matches a partition's limits, in order to handle case where multiple partitions can run a given job
-        # naive approach here is to just sum the ratio of requested resource to limit, of course this limits us to only consider numerical resources # noqa: E501
+        # naive approach here is to just sum the ratio of requested resource to limit, of course this limits us to only consider numerical resources
         # here a higher score indicates a better fit
-        # TODO decide how to handle unspecified limits, for now we assume inf for numerical limits, none for others. # noqa: E501
+        # TODO decide how to handle unspecified limits, for now we assume inf for numerical limits, none for others.

The line length enforcement (E501) is disabled in your Ruff config, making these directives unnecessary.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5b53def and eb9933d.

📒 Files selected for processing (4)
  • .github/workflows/ci.yml (1 hunks)
  • snakemake_executor_plugin_slurm/partitions.py (1 hunks)
  • tests/test_partition_selection.py (1 hunks)
  • tests/tests.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_partition_selection.py
🧰 Additional context used
🧠 Learnings (18)
📓 Common learnings
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 193
File: .github/workflows/post_to_mastodon.yml:26-26
Timestamp: 2025-01-20T09:13:26.443Z
Learning: In the snakemake-executor-plugin-slurm repository, release PRs follow the naming pattern "chore(main): release X.Y.Z" where X.Y.Z is the version number.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.
📚 Learning: 2025-01-13T09:54:22.950Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.

Applied to files:

  • tests/tests.py
  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T15:20:51.829Z
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.

Applied to files:

  • tests/tests.py
  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-06-24T12:41:31.837Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: tests/tests.py:62-70
Timestamp: 2025-06-24T12:41:31.837Z
Learning: In the snakemake-executor-plugin-slurm efficiency report feature, the executor is responsible for creating the efficiency_report_path directory when generating efficiency reports. Tests should not manually check for directory existence but should rely on the executor to properly create necessary directories as part of its normal operation.

Applied to files:

  • tests/tests.py
📚 Learning: 2025-06-03T17:10:51.292Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2025-06-03T17:10:51.292Z
Learning: In the snakemake-executor-plugin-slurm repository, the create_efficiency_report method in snakemake_executor_plugin_slurm/__init__.py contains temporary testing/debugging code that logs directory contents using os.listdir() and directory paths. This code was inserted for testing purposes only and should be removed from the production implementation.

Applied to files:

  • tests/tests.py
📚 Learning: 2025-07-22T17:31:41.629Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 333
File: .github/workflows/announce-release.yml:0-0
Timestamp: 2025-07-22T17:31:41.629Z
Learning: In the snakemake-executor-plugin-slurm repository, cmeesters prefers using `main` references for GitHub Actions to always get the latest version, with security provided through protected branches and mandatory CI checks rather than commit SHA pinning.

Applied to files:

  • tests/tests.py
📚 Learning: 2024-12-11T14:17:08.749Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 178
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2024-12-11T14:17:08.749Z
Learning: In the `snakemake-executor-plugin-slurm` project, when handling exceptions in `snakemake_executor_plugin_slurm/__init__.py`, prefer concise error messages and avoid unnecessary verbosity or exception chaining when it's not necessary.

Applied to files:

  • tests/tests.py
  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-05-26T12:22:07.005Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/efficiency_report.py:34-49
Timestamp: 2025-05-26T12:22:07.005Z
Learning: In the parse_reqmem function in snakemake_executor_plugin_slurm/efficiency_report.py, the conditional check for per_unit == "/node" is necessary for proper conversion of memory values in multi-node jobs, not redundant as it might initially appear.

Applied to files:

  • tests/tests.py
📚 Learning: 2025-02-18T14:53:09.504Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: snakemake_executor_plugin_slurm/__init__.py:29-31
Timestamp: 2025-02-18T14:53:09.504Z
Learning: The snakemake-executor-plugin-slurm tests require system dependencies (gcc, python3-dev) to build the datrie package, which is a dependency of snakemake.

Applied to files:

  • tests/tests.py
📚 Learning: 2025-03-31T11:52:05.373Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.

Applied to files:

  • tests/tests.py
  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T13:34:59.270Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:34:59.270Z
Learning: In the Snakemake executor plugin for SLURM, all GPU resources (even simple numeric ones) should be translated to the `--gpus` flag (plural) to match SLURM's expectations, not `--gpu` (singular).

Applied to files:

  • tests/tests.py
  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T13:34:59.270Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:34:59.270Z
Learning: In the Snakemake executor plugin for SLURM, all GPU resources (both simple numeric values and model:number specifications) should be translated to the `--gpus` flag (plural form) to match SLURM's command line interface expectations.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-02-18T14:49:42.624Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: snakemake_executor_plugin_slurm/utils.py:75-100
Timestamp: 2025-02-18T14:49:42.624Z
Learning: In the Snakemake SLURM executor plugin, users must specify either `gres` or `gpus` for GPU resources, but not both simultaneously, as these are mutually exclusive ways to request GPU resources.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-03-10T13:22:54.081Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: tests/tests.py:79-86
Timestamp: 2025-03-10T13:22:54.081Z
Learning: In the Snakemake executor plugin for SLURM, when using GPU resources, the function returns `--gpu=n` (singular) when only a number is provided, but `--gpus=model:n` (plural) when a model is specified in the implementation.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-02-18T14:40:27.064Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: snakemake_executor_plugin_slurm/__init__.py:0-0
Timestamp: 2025-02-18T14:40:27.064Z
Learning: In the Snakemake executor plugin for SLURM, the GPU resource is specified using the "gpus" (plural) resource name, not "gpu" (singular).

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-06-16T08:54:07.957Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 221
File: snakemake_executor_plugin_slurm/efficiency_report.py:0-0
Timestamp: 2025-06-16T08:54:07.957Z
Learning: In the context of SLURM executor plugin development, user cmeesters (experienced with SLURM) noted they are not aware of SLURM typically reporting memory values with decimals (like 1.5G), but accepted defensive programming to handle such cases anyway.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-01-20T09:13:26.443Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 193
File: .github/workflows/post_to_mastodon.yml:26-26
Timestamp: 2025-01-20T09:13:26.443Z
Learning: In the snakemake-executor-plugin-slurm repository, release PRs follow the naming pattern "chore(main): release X.Y.Z" where X.Y.Z is the version number.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
📚 Learning: 2025-01-17T17:27:32.446Z
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 187
File: .github/workflows/post_to_mastodon.yml:0-0
Timestamp: 2025-01-17T17:27:32.446Z
Learning: In the mastodon publishing workflow for snakemake-executor-plugin-slurm, the PR_TITLE environment variable is required by the post_to_mastodon.sh script and should be preserved.

Applied to files:

  • snakemake_executor_plugin_slurm/partitions.py
🪛 Ruff (0.14.5)
snakemake_executor_plugin_slurm/partitions.py

19-19: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


19-19: Avoid specifying long messages outside the exception class

(TRY003)


21-21: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


21-21: Avoid specifying long messages outside the exception class

(TRY003)


22-22: Do not catch blind exception: Exception

(BLE001)


23-25: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


23-25: Avoid specifying long messages outside the exception class

(TRY003)


27-29: Avoid specifying long messages outside the exception class

(TRY003)


32-34: Avoid specifying long messages outside the exception class

(TRY003)


38-38: Avoid specifying long messages outside the exception class

(TRY003)


98-100: Avoid specifying long messages outside the exception class

(TRY003)


99-99: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


173-173: Avoid specifying long messages outside the exception class

(TRY003)


240-240: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


241-241: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


243-243: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)
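For reference, the repeated B904 findings above concern exception chaining inside `except` clauses; a minimal example of the style Ruff expects (the function name and messages here are made up, not the plugin's code):

```python
import yaml


def load_yaml_or_fail(path):
    """Re-raise parse failures with explicit chaining so the original traceback is kept."""
    try:
        with open(path) as fh:
            return yaml.safe_load(fh)
    except yaml.YAMLError as err:
        # B904: inside an except clause, use `raise ... from err` (or `from None`)
        raise ValueError(f"invalid YAML in {path}") from err
```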

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: testing
🔇 Additional comments (7)
tests/tests.py (1)

17-17: LGTM!

The import is necessary for the TestSlurmExtraValidation test class and is properly integrated.

.github/workflows/ci.yml (1)

101-101: LGTM!

Expanding the test run to cover the entire tests/ directory properly includes the new partition selection tests introduced in this PR.

snakemake_executor_plugin_slurm/partitions.py (5)

13-50: LGTM!

The function includes comprehensive error handling for file operations and YAML parsing, with clear validation of the configuration structure. The use of .pop("cluster", None) on line 41 to extract the optional cluster field before passing the remaining config to PartitionLimits is clean.
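
This is not the plugin's actual read_partition_file, but the same parsing idea can be sketched roughly as follows (helper name and error types are illustrative):

```python
import yaml


def read_partition_config_sketch(path):
    """Parse a partitions YAML file into {name: (cluster, limits)} pairs."""
    try:
        with open(path) as fh:
            data = yaml.safe_load(fh)
    except (OSError, yaml.YAMLError) as err:
        raise ValueError(f"cannot read partition config {path}: {err}") from err

    if not isinstance(data, dict) or "partitions" not in data:
        raise ValueError("partition config needs a top-level 'partitions' mapping")

    partitions = {}
    for name, limits in (data["partitions"] or {}).items():
        limits = dict(limits or {})
        # the optional per-partition 'cluster' tag is split off before the
        # remaining keys are treated as resource limits
        cluster = limits.pop("cluster", None)
        partitions[name] = (cluster, limits)
    return partitions
```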


53-79: LGTM!

The partition scoring and selection logic is well-structured. Logging successful auto-selection at info level and failures at warning level is appropriate.
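
The scoring idea (sum of requested-to-limit ratios, skipping partitions that cannot satisfy the job) can be sketched like this; the function names and flat dict inputs are illustrative, not the module's API:

```python
import math


def score_partition(requested, limits):
    """Return a fit score, or None if any requested resource exceeds its limit."""
    score = 0.0
    for resource, amount in requested.items():
        limit = limits.get(resource, math.inf)
        if amount > limit:
            return None  # partition cannot run this job at all
        if math.isfinite(limit) and limit > 0:
            # tighter fit -> higher ratio -> preferred partition
            score += amount / limit
    return score


def pick_best_partition(requested, partitions):
    """Pick the compatible partition with the highest summed utilisation ratio."""
    best_name, best_score = None, -1.0
    for name, limits in partitions.items():
        score = score_partition(requested, limits)
        if score is not None and score > best_score:
            best_name, best_score = name, score
    return best_name
```

For example, `pick_best_partition({"cpus": 8}, {"small": {"cpus": 16}, "big": {"cpus": 64}})` returns `"small"`, since 8/16 = 0.5 beats 8/64 = 0.125.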


81-114: LGTM!

The function correctly enforces mutual exclusion between gpu and gres specifications and handles both GPU specification formats. The ValueError propagation for invalid formats is acceptable as configuration data is validated once at setup.
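
For orientation, the two mutually exclusive ways of requesting GPUs could be normalised roughly as below; the split logic and accepted formats are assumptions for illustration, not the plugin's parser:

```python
def gpu_request_sketch(resources):
    """Return (count, model) from either 'gpus' or 'gres' (mutually exclusive)."""
    gpus = resources.get("gpus")
    gres = resources.get("gres")
    if gpus and gres:
        raise ValueError("specify either 'gpus' or 'gres', not both")
    if gpus:
        # either a plain count (2) or a 'model:count' string ("a100:2")
        if isinstance(gpus, str) and ":" in gpus:
            model, count = gpus.split(":", 1)
            return int(count), model
        return int(gpus), resources.get("gpu_model")
    if gres:
        # SLURM-style gres strings such as "gpu:2" or "gpu:a100:2"
        parts = str(gres).split(":")
        if len(parts) == 2:
            return int(parts[1]), None
        if len(parts) == 3:
            return int(parts[2]), parts[1]
        raise ValueError(f"unrecognised gres specification: {gres!r}")
    return 0, None
```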


140-195: LGTM!

The function correctly handles multiple CPU specification methods (cpus_per_task, cpus_per_gpu, and thread-based fallback). The negative value check for cpus_per_task (line 172-173) and the minimum 1 CPU guarantee (line 175) properly address configuration errors.
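
In isolation, the CPU-derivation order described above might look roughly like this (an illustrative helper, not the module's code):

```python
def effective_cpu_request_sketch(resources, threads):
    """Derive the CPU count a partition must provide for one job."""
    cpus = resources.get("cpus_per_task")
    if cpus is None and resources.get("cpus_per_gpu") and resources.get("gpus"):
        # per-GPU CPU requests scale with the number of GPUs requested
        cpus = resources["cpus_per_gpu"] * resources["gpus"]
    if cpus is None:
        cpus = threads  # fall back to the rule's thread count
    # zero or negative values are treated as misconfiguration: request at least one CPU
    return max(1, int(cpus))
```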


197-224: LGTM!

The dataclass design is sound. Using inf for unlimited numerical resources and sensible defaults (GPUs unavailable by default, MPI supported by default) aligns well with the scoring logic.
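
The shape being described is roughly the following; the field names come from the documented limits, but the class name and exact fields here are illustrative rather than the actual PartitionLimits definition:

```python
from dataclasses import dataclass, field
from math import inf


@dataclass
class PartitionLimitsSketch:
    """Illustrative limits record: unlimited resources default to inf."""
    max_runtime: float = inf              # minutes of walltime
    max_mem_mb: float = inf
    max_cpus_per_task: float = inf
    max_gpu: int = 0                      # GPUs are unavailable unless declared
    available_gpu_models: list = field(default_factory=list)
    supports_mpi: bool = True             # MPI allowed unless explicitly disabled
    available_constraints: list = field(default_factory=list)
```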

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/test_partition_selection.py (2)

14-96: Partition config and temp-file fixtures are comprehensive; optional cleanup simplification

The fixtures for basic/minimal/comprehensive/multicluster/max-threads configs plus temp_yaml_file give good coverage of the YAML surface and keep tests readable. If you ever want to simplify the try/finally + unlink() pattern, you could switch temp_yaml_file to use pytest’s tmp_path fixture (e.g. write to tmp_path / "partitions.yaml"), but the current approach is perfectly valid and explicit.
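
A minimal sketch of that tmp_path variant, assuming the config fixture yields the YAML text (the fixture name and return type here are assumptions; the real fixtures may differ):

```python
import pytest


@pytest.fixture
def temp_yaml_file(tmp_path, basic_partition_config):
    """Write the partition YAML into a pytest-managed temp dir; cleanup is automatic."""
    path = tmp_path / "partitions.yaml"  # no try/finally or unlink() needed
    path.write_text(basic_partition_config)
    return path
```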


242-266: Consider aligning mock_job fixture with existing pattern to future‑proof resource access

This fixture correctly wires job.resources.get(...), job.threads, and basic metadata, which is sufficient for the current score_job_fit implementation. In tests/tests.py (snippet in context), the helper also sets attributes on mock_resources (e.g. setattr(mock_resources, key, value)) and defines job.wildcards = {} to mirror how real jobs are accessed.

To reduce the risk of future code using attribute-style access on job.resources or relying on job.wildcards, you might optionally mirror that pattern here:

```diff
         def _create_job(threads=1, **resources):
             mock_resources = MagicMock()
             mock_resources.get.side_effect = lambda key, default=None: resources.get(
                 key, default
             )
+            for key, value in resources.items():
+                setattr(mock_resources, key, value)

             mock_job = MagicMock()
             mock_job.resources = mock_resources
             mock_job.threads = threads
             mock_job.name = "test_job"
+            mock_job.wildcards = {}
             mock_job.is_group.return_value = False
             mock_job.jobid = 1
             return mock_job
```

Not required for correctness today, but it keeps fixtures consistent and resilient to refactors in partition-scoring helpers.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between eb9933d and 7c662bb.

📒 Files selected for processing (1)
  • tests/test_partition_selection.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 0
File: :0-0
Timestamp: 2025-01-13T09:54:22.950Z
Learning: PR #173 (adding gres resource specification) depends on PR #28 in snakemake-executor-plugin-slurm-jobstep repository, as changes were required in the cpu-settings function of the jobstep-Executor module.
Learnt from: johanneskoester
Repo: snakemake/snakemake-executor-plugin-slurm PR: 173
File: docs/further.md:96-96
Timestamp: 2025-03-10T15:20:51.829Z
Learning: PR #173 in snakemake-executor-plugin-slurm implements GPU job support by adding resources: `gres` for generic resource specifications (e.g., 'gpu:1'), `gpu`/`gpus` for specifying GPU counts, and `gpu_model`/`gpu_manufacturer` for specifying GPU types, allowing users to request GPU resources directly rather than having to use slurm_extra.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 193
File: .github/workflows/post_to_mastodon.yml:26-26
Timestamp: 2025-01-20T09:13:26.443Z
Learning: In the snakemake-executor-plugin-slurm repository, release PRs follow the naming pattern "chore(main): release X.Y.Z" where X.Y.Z is the version number.
Learnt from: cmeesters
Repo: snakemake/snakemake-executor-plugin-slurm PR: 249
File: snakemake_executor_plugin_slurm/submit_string.py:21-29
Timestamp: 2025-03-31T11:52:05.373Z
Learning: In the snakemake-executor-plugin-slurm project, account and partition parameters are validated in the submit function before reaching the get_submit_command function. By design, these parameters are either empty strings or already properly formatted with SLURM parameter tags, making additional validation in get_submit_command unnecessary.
🧬 Code graph analysis (1)
tests/test_partition_selection.py (2)
snakemake_executor_plugin_slurm/partitions.py (2)
  • read_partition_file (13-50)
  • get_best_partition (53-78)
tests/tests.py (6)
  • mock_job (225-246)
  • mock_job (456-477)
  • mock_job (794-815)
  • _create_job (228-244)
  • _create_job (459-475)
  • _create_job (797-813)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: testing
🔇 Additional comments (3)
tests/test_partition_selection.py (3)

267-521: Core partition-selection tests correctly exercise CPU/GPU/MPI/constraint paths and logging

The main selection tests (test_basic_partition_selection_cpu_job, GPU/constraint/MPI/GPU-model mismatch/empty-partitions/gres/cpus_per_task/runtime) are well structured and match the behavior of get_best_partition:

  • CPU vs GPU routing and GPU-model filtering behave as expected given the configs.
  • MPI selection respects supports_mpi.
  • Constraint mismatch, oversized memory/runtime, and GPU-model mismatch all correctly expect None and a warning.
  • Success paths now assert on mock_logger.info for “Auto-selected partition …” while failure paths assert on mock_logger.warning for “No suitable partition found …”, matching the logging contract in get_best_partition.

No changes needed here.
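
The info/warning contract these assertions rely on can be expressed with a small helper over unittest.mock call records, roughly like this (a sketch, not code from the test file):

```python
def assert_logged(mock_logger, level, fragment):
    """Assert that some call at `level` (e.g. 'info', 'warning') contains `fragment`."""
    calls = getattr(mock_logger, level).call_args_list
    assert any(
        fragment in " ".join(str(arg) for arg in call.args) for call in calls
    ), f"expected a {level} log containing {fragment!r}"


# usage inside a test body:
# assert_logged(mock_logger, "info", "Auto-selected partition")
# assert_logged(mock_logger, "warning", "No suitable partition found")
```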


523-683: Multicluster and cluster-filtering behavior is well covered

The multicluster fixtures and tests (test_cluster_specification_via_slurm_cluster, ..._via_clusters, ..._via_cluster, test_cluster_mismatch_excludes_partitions, test_job_without_cluster_uses_any_partition) accurately encode the intended semantics:

  • Partitions are correctly tagged with cluster and filtered by slurm_cluster / clusters / cluster resources.
  • GPU-only partitions in a different cluster are excluded, leading to None and a warning in the mismatch test.
  • Jobs without any cluster spec can select from all partitions.

Assertions focus on selected partition names and (where relevant) that at least one info or warning log was emitted, which is sufficient and consistent with the implementation.
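
For context, a multicluster config following the documented schema looks along these lines (the partition, cluster, and GPU-model names below are invented); a job setting `slurm_cluster: clusterA` would then only be matched against `cpu_small`:

```yaml
partitions:
  cpu_small:
    cluster: clusterA
    max_cpus_per_task: 32
    max_mem_mb: 128000
  gpu_big:
    cluster: clusterB
    max_gpu: 4
    available_gpu_models: ["a100"]
```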


684-822: max_threads and combined multicluster+max_threads scenarios are modeled correctly

The max_threads_partition_config and related tests (test_max_threads_with_threads_resource, test_max_threads_with_cpus_per_task_resource, test_max_threads_excludes_too_small_partitions, test_max_threads_job_exceeds_all_partitions) validate:

  • Preference for the “tightest fit” partition given threads/memory.
  • Correct exclusion of partitions whose max_threads is below the effective CPU requirement (including cpus_per_task cases and edge values like 127 vs 128).
  • Returning None and warning when no partition can satisfy the thread requirement.

test_multicluster_with_max_threads then combines cluster filtering with max_threads to ensure only cluster-matching, sufficiently large partitions are chosen. This matches the described scoring and filtering behavior for PartitionLimits.max_threads.

Looks good as-is.
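
As a hypothetical illustration of the 127-vs-128 edge case, a config like the following would route a 127-CPU job to `large`, while a job needing more than 128 CPUs matches nothing and triggers the warning path:

```yaml
partitions:
  small:
    max_threads: 64
  large:
    max_threads: 128
```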

@cmeesters cmeesters (Member) left a comment

tested. Extended. Working.

@cmeesters (Member)

@cademirch thank you once more for this fantastic contribution! I will attempt a release tomorrow morning.

@cmeesters cmeesters merged commit 95821f9 into main Nov 20, 2025
6 checks passed
@cmeesters cmeesters deleted the feat-automatic-partition-selection branch November 20, 2025 16:02
cmeesters pushed a commit that referenced this pull request Nov 25, 2025
🤖 I have created a release *beep* *boop*
---


## [2.0.0](v1.9.2...v2.0.0) (2025-11-24)

### ⚠ BREAKING CHANGES

* proposal for dynamic partition selection ([#321](#321))

### Features

* partition time handling ([#378](#378)) ([53be508](53be508))
* proposal for dynamic partition selection ([#321](#321)) ([95821f9](95821f9))

### Bug Fixes

* naming SLURM logs, SLURM logs not just "log files" ([#372](#372)) ([ce7cc4b](ce7cc4b))

### Documentation

* Replace incorrect Markdown with a blockquote ([#377](#377)) ([b45709f](b45709f))

---
This PR was generated with [Release Please](https://github.com/googleapis/release-please). See [documentation](https://github.com/googleapis/release-please#release-please).


## Summary by CodeRabbit

* **BREAKING CHANGES**
  * Version 2.0.0 introduces breaking changes
* **New Features**
  * Dynamic partition selection functionality
* **Bug Fixes**
  * Multiple bug fixes included
* **Documentation**
  * Updated documentation for version 2.0.0
