-# ClusterManagers.jl
+# HTCondorClusterManager.jl
 
-The `ClusterManagers.jl` package implements code for different job queue systems commonly used on compute clusters.
+The `HTCondorClusterManager.jl` package implements code for HTCondor clusters.
 
-> [!WARNING]
-> This package is not currently being actively maintained or tested.
->
-> We are in the process of splitting this package up into multiple smaller packages, with a separate package for each job queue system.
->
-> We are seeking maintainers for these new packages. If you are an active user of any of the job queue systems listed below and are interested in being a maintainer, please open a GitHub issue - say that you are interested in being a maintainer, and specify which job queue system you use.
-
-## Available job queue systems
-
-### In this package
-
-The following managers are implemented in this package (the `ClusterManagers.jl` package):
-
-| Job queue system | Command to add processors |
-| ---------------- | ------------------------- |
-| Local manager with CPU affinity setting | `addprocs(LocalAffinityManager(;np=CPU_CORES, mode::AffinityMode=BALANCED, affinities=[]); kwargs...)` |
-
-### Implemented in external packages
-
-| Job queue system | External package | Command to add processors |
-| ---------------- | ---------------- | ------------------------- |
-| Slurm | [SlurmClusterManager.jl](https://github.com/JuliaParallel/SlurmClusterManager.jl) | `addprocs(SlurmManager(); kwargs...)` |
-| Load Sharing Facility (LSF) | [LSFClusterManager.jl](https://github.com/JuliaParallel/LSFClusterManager.jl) | `addprocs_lsf(np::Integer; bsub_flags=``, ssh_cmd=``)` or `addprocs(LSFManager(np, bsub_flags, ssh_cmd, retry_delays, throttle))` |
-| Kubernetes (K8s) | [K8sClusterManagers.jl](https://github.com/beacon-biosignals/K8sClusterManagers.jl) | `addprocs(K8sClusterManager(np; kwargs...))` |
-| Azure scale-sets | [AzManagers.jl](https://github.com/ChevronETC/AzManagers.jl) | `addprocs(vmtemplate, n; kwargs...)` |
-
-### Not currently being actively maintained
-
-> [!WARNING]
-> The following managers are not currently being actively maintained or tested.
->
-> We are seeking maintainers for the following managers. If you are an active user of any of the following job queue systems and are interested in being a maintainer, please open a GitHub issue - say that you are interested in being a maintainer, and specify which job queue system you use.
->
+Implemented in this package:
 
 | Job queue system | Command to add processors |
 | ---------------- | ------------------------- |
-| Sun Grid Engine (SGE) via `qsub` | `addprocs_sge(np::Integer; qsub_flags=``)` or `addprocs(SGEManager(np, qsub_flags))` |
-| Sun Grid Engine (SGE) via `qrsh` | `addprocs_qrsh(np::Integer; qsub_flags=``)` or `addprocs(QRSHManager(np, qsub_flags))` |
-| PBS (Portable Batch System) | `addprocs_pbs(np::Integer; qsub_flags=``)` or `addprocs(PBSManager(np, qsub_flags))` |
-| Scyld | `addprocs_scyld(np::Integer)` or `addprocs(ScyldManager(np))` |
 | HTCondor | `addprocs_htc(np::Integer)` or `addprocs(HTCManager(np))` |
 
-### Custom managers
-
-You can also write your own custom cluster manager; see the instructions in the [Julia manual](https://docs.julialang.org/en/v1/manual/distributed-computing/#ClusterManagers).
-
-## Notes on specific managers
-
-### Slurm: please see [SlurmClusterManager.jl](https://github.com/JuliaParallel/SlurmClusterManager.jl)
-
-For Slurm, please see the [SlurmClusterManager.jl](https://github.com/JuliaParallel/SlurmClusterManager.jl) package.
-
-### Using `LocalAffinityManager` (for pinning local workers to specific cores)
-
-- Linux-only feature.
-- Requires the Linux `taskset` command to be installed.
-- Usage: `addprocs(LocalAffinityManager(;np=CPU_CORES, mode::AffinityMode=BALANCED, affinities=[]); kwargs...)`.
-
-where
-
-- `np` is the number of workers to be started.
-- `affinities`, if specified, is a list of CPU IDs. As many workers as entries in `affinities` are launched, and each worker is pinned to the specified CPU ID.
-- `mode` (used only when `affinities` is not specified; can be either `COMPACT` or `BALANCED`) - `COMPACT` pins the requested number of workers to cores in increasing order (for example, worker1 => CPU0, worker2 => CPU1, and so on), while `BALANCED` tries to spread the workers across CPU sockets, which is useful when there are multiple CPU sockets, each with multiple cores. The default is `BALANCED`.
-
-### Using `ElasticManager` (dynamically adding workers to a cluster)
-
-The `ElasticManager` is useful in scenarios where we want to dynamically add workers to a cluster.
-It achieves this by listening on a known port on the master. The launched workers connect to this
-port and publish their own host/port information for other workers to connect to.
-
-On the master, you need to instantiate an instance of `ElasticManager`. The constructors defined are:
-
-```julia
-ElasticManager(;addr=IPv4("127.0.0.1"), port=9009, cookie=nothing, topology=:all_to_all, printing_kwargs=())
-ElasticManager(port) = ElasticManager(;port=port)
-ElasticManager(addr, port) = ElasticManager(;addr=addr, port=port)
-ElasticManager(addr, port, cookie) = ElasticManager(;addr=addr, port=port, cookie=cookie)
-```
-
-You can set `addr=:auto` to automatically use the host's private IP address on the local network, which will allow other workers on this network to connect. You can also use `port=0` to let the OS choose a random free port for you (some systems may not support this). Once created, printing the `ElasticManager` object prints the command which you can run on workers to connect them to the master, e.g.:
-
-```julia
-julia> em = ElasticManager(addr=:auto, port=0)
-ElasticManager:
-  Active workers : []
-  Number of workers to be added : 0
-  Terminated workers : []
-  Worker connect command :
-    /home/user/bin/julia --project=/home/user/myproject/Project.toml -e 'using ClusterManagers; ClusterManagers.elastic_worker("4cOSyaYpgSl6BC0C","127.0.1.1",36275)'
-```
-
-By default, the printed command uses the absolute path to the current Julia executable and activates the same project as the current session. You can change either of these defaults by passing `printing_kwargs=(absolute_exename=false, same_project=false)` to the first form of the `ElasticManager` constructor.
-
-Once workers are connected, you can print the `em` object again to see them added to the list of active workers.
-
-### Sun Grid Engine (SGE)
-
-See [`docs/sge.md`](docs/sge.md)
+The functionality in this package originally lived in [ClusterManagers.jl](https://github.com/JuliaParallel/ClusterManagers.jl).
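
For orientation on the new, smaller package, here is a minimal usage sketch of the entry points kept in the table above (`addprocs_htc(np)` / `addprocs(HTCManager(np))`). It assumes the code runs on an HTCondor submit node, that the installed module is named `HTCondorClusterManager` (matching the package name), and that it exports the same `addprocs_htc` / `HTCManager` names as before; treat it as a sketch rather than the package's documented example.

```julia
# Minimal sketch, not taken from the package docs.
# Assumes an HTCondor submit node and that `using HTCondorClusterManager`
# provides addprocs_htc / HTCManager as listed in the table above.
using Distributed
using HTCondorClusterManager

addprocs_htc(4)            # submit 4 Julia worker jobs to HTCondor
# addprocs(HTCManager(4))  # equivalent form from the table

# Have each worker report the execute node it landed on.
for w in workers()
    println(w, " => ", remotecall_fetch(gethostname, w))
end

rmprocs(workers())         # release the HTCondor slots when finished
```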