|
2 | 2 |
|
3 | 3 | ## v9.0.5 |
4 | 4 |
|
| 5 | +### qtelemetry (Developer Preview) |
| 6 | + |
| 7 | +This release introduces **qtelemetry**, a new metrics exporter for Gridware Cluster Scheduler (GCS). It allows administrators to easily collect and expose cluster metrics for monitoring and observability purposes. |
| 8 | + |
| 9 | +**Features:** |
| 10 | + |
| 11 | +- Simple integration with Prometheus and Grafana |
| 12 | + |
| 13 | +- Export cluster metrics, including: |
| 14 | + |
| 15 | +  - Host metrics (CPU load, GPU availability, memory usage, and many more) |
| 16 | + |
| 17 | +  - Job metrics (queued, running, errored, waiting time, and many more) |
| 18 | + |
| 19 | +  - qmaster statistics (CPU/memory usage of `sge_qmaster`, spooling filesystem information) |
| 20 | + |
| 21 | +- Optional per-job metric export for detailed insights (recommended only for very small workloads) |
| 22 | + |
| 23 | +- Built-in support for a pre-configured Grafana dashboard: |
| 24 | + |
| 25 | +  - [Grafana dashboard example](https://grafana.com/grafana/dashboards/23208-gridware-cluster-scheduler-org/). |
| 26 | + |
| 27 | +**Quick Start:** |
| 28 | + |
| 29 | +By default, `qtelemetry` serves metrics on port `9464` at the `/metrics` endpoint: |
| 30 | + |
| 31 | +```shell |
| 32 | + |
| 33 | +./qtelemetry start |
| 34 | + |
| 35 | +``` |
| 36 | + |
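Once the exporter is running, a quick sanity check is to count the sample lines the endpoint returns. The sketch below assumes the endpoint emits the standard Prometheus text format, in which lines starting with `#` are `HELP`/`TYPE` comments. `count_samples` is a hypothetical helper, not part of `qtelemetry`, and the metric name in the stand-in payload is made up for illustration.

```shell
# count_samples is a hypothetical helper: it reads Prometheus text-format
# output on stdin and prints the number of sample lines.
count_samples() {
    # Skip "# HELP" / "# TYPE" comment lines and blank lines.
    awk '!/^#/ && NF { n++ } END { print n + 0 }'
}

# Stand-in payload; the metric name below is invented for this example.
count_samples <<'EOF'
# HELP example_jobs_running Number of running jobs (hypothetical name).
# TYPE example_jobs_running gauge
example_jobs_running 3
EOF
```

Against a live exporter on the default port, the same helper could be fed with `curl -s http://localhost:9464/metrics | count_samples`.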
| 37 | +Enable additional metrics sources using command-line flags: |
| 38 | + |
| 39 | +```shell |
| 40 | + |
| 41 | +# Export exec host and qmaster metrics |
| 42 | + |
| 43 | +./qtelemetry start --enableExecd --enableMaster |
| 44 | + |
| 45 | +# Export individual job-level metrics (for smaller systems) |
| 46 | + |
| 47 | +./qtelemetry start --singleJobs |
| 48 | + |
| 49 | +``` |
| 50 | + |
| 51 | +#### Advanced Configuration (Environment Variables): |
| 52 | + |
| 53 | +- **Change default binding address or port:** |
| 54 | + |
| 55 | +Set `OTEL_EXPORTER_PROMETHEUS_ENDPOINT`. |
| 56 | + |
| 57 | +Example: |
| 58 | + |
| 59 | +```shell |
| 60 | + |
| 61 | +export OTEL_EXPORTER_PROMETHEUS_ENDPOINT=":9000" # port only |
| 62 | + |
| 63 | +export OTEL_EXPORTER_PROMETHEUS_ENDPOINT="1.2.3.4:9000" # IP and port |
| 64 | + |
| 65 | +``` |
| 66 | + |
| 67 | +- **Enable Basic Authentication:** |
| 68 | + |
| 69 | +Set `QTELEMETRY_USERNAME` and `QTELEMETRY_PASSWORD`. |
| 70 | + |
| 71 | +```shell |
| 72 | + |
| 73 | +export QTELEMETRY_USERNAME="your-user" |
| 74 | + |
| 75 | +export QTELEMETRY_PASSWORD="your-password" |
| 76 | + |
| 77 | +``` |
| 78 | + |
| 79 | +- **Enable TLS:** |
| 80 | + |
| 81 | +Set paths to TLS certificate and key with `QTELEMETRY_TLS_CERT` and `QTELEMETRY_TLS_KEY`. |
| 82 | + |
| 83 | +```shell |
| 84 | + |
| 85 | +export QTELEMETRY_TLS_CERT="/path/to/cert.pem" |
| 86 | + |
| 87 | +export QTELEMETRY_TLS_KEY="/path/to/key.pem" |
| 88 | + |
| 89 | +``` |
| 90 | + |
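The variables above can be combined into a single check from a client machine. The following is a minimal sketch, not part of `qtelemetry`: `scrape_cmd` is a hypothetical helper that only prints a `curl` command line. It assumes that the exporter serves HTTPS once both TLS variables are set, that the certificate file can be trusted directly (as with a self-signed certificate), and that a port-only endpoint is reached via `localhost`. IPv6 addresses are not handled.

```shell
# scrape_cmd is a hypothetical helper, not part of qtelemetry. It prints the
# curl command matching the current environment without running it.
scrape_cmd() {
    endpoint="${OTEL_EXPORTER_PROMETHEUS_ENDPOINT:-:9464}"
    host="${endpoint%:*}"    # text before the last colon (empty for ":9000")
    port="${endpoint##*:}"   # text after the last colon
    [ -n "$host" ] || host="localhost"   # port-only value: assume local access

    scheme="http"
    opts="-s"
    # Assumption: the exporter serves HTTPS once both TLS variables are set.
    if [ -n "${QTELEMETRY_TLS_CERT:-}" ] && [ -n "${QTELEMETRY_TLS_KEY:-}" ]; then
        scheme="https"
        # Trust the certificate file directly, as for a self-signed certificate.
        opts="$opts --cacert $QTELEMETRY_TLS_CERT"
    fi
    # With only a user name given, curl prompts for the password.
    if [ -n "${QTELEMETRY_USERNAME:-}" ]; then
        opts="$opts -u $QTELEMETRY_USERNAME"
    fi
    printf 'curl %s %s://%s:%s/metrics\n' "$opts" "$scheme" "$host" "$port"
}

scrape_cmd
```

Printing the command instead of running it keeps the password out of the shell history and makes it easy to review what would be sent before contacting the exporter.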
| 91 | +#### Recommended Monitoring Stack Setup: |
| 92 | + |
| 93 | +1. Start `qtelemetry` with the endpoint exposed (as above). |
| 94 | + |
| 95 | +2. Configure Prometheus to scrape metrics from `qtelemetry`. |
| 96 | + |
| 97 | +3. Set up Grafana and connect it to your Prometheus database. |
| 98 | + |
| 99 | +4. Import the provided [Grafana dashboard](https://grafana.com/grafana/dashboards/23208-gridware-cluster-scheduler-org/) for immediate insights. |
| 100 | + |
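Step 2 above could look like the following sketch, which writes a minimal Prometheus scrape configuration for an exporter on the default port. The file name, the `PROM_CONFIG` variable, the job name, and the 30-second interval are placeholders chosen for this example. The target assumes Prometheus runs on the same host as `qtelemetry`.

```shell
# Write a minimal Prometheus scrape configuration for qtelemetry.
# File name, job name, and interval are placeholders for this sketch.
config_file="${PROM_CONFIG:-./prometheus-qtelemetry.yml}"

cat > "$config_file" <<'EOF'
scrape_configs:
  - job_name: "qtelemetry"
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets: ["localhost:9464"]
EOF

echo "wrote $config_file"
```

Prometheus can then be started with `prometheus --config.file=prometheus-qtelemetry.yml`. If Basic Authentication or TLS is enabled on the exporter, the scrape job would additionally need Prometheus's `basic_auth` and `scheme: https` settings.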
| 101 | +**Supported Platforms:** |
| 102 | + |
| 103 | +- Linux (`lx-amd64` and `lx-arm64`) |
| 104 | + |
| 105 | +**Important Note:** |
| 106 | + |
| 107 | +This release of qtelemetry is currently a Developer Preview. Metric structure, naming, and availability are subject to change in future releases based on development progress and user feedback. We strongly encourage feedback and suggestions to shape this tool’s evolution. |
| 108 | + |
| 109 | + |
| 110 | + |
5 | 111 | ### Out of the Box Support of various MPI Distributions |
6 | 112 |
|
7 | 113 | The `$SGE_ROOT/mpi` directory contains templates of the PE configuration for the following MPI distributions: |
|