diff --git a/.github/workflows/linkspector.yml b/.github/workflows/linkspector.yml new file mode 100644 index 000000000..75d766990 --- /dev/null +++ b/.github/workflows/linkspector.yml @@ -0,0 +1,15 @@ +name: Linkspector +on: [pull_request] +jobs: + check-links: + name: runner / linkspector + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - name: Run linkspector + uses: umbrelladocs/action-linkspector@v1 + with: + github_token: ${{ secrets.github_token }} + reporter: github-pr-check + fail_level: any + filter_mode: added diff --git a/Makefile b/Makefile index 8a3a5deee..483758d5f 100644 --- a/Makefile +++ b/Makefile @@ -313,7 +313,7 @@ docs/coconut: docs/grpc: @echo -e "generating gRPC API documentation \033[1;33m==>\033[0m \033[1;34m./docs\033[0m" @cd apricot/protos && PATH="$(ROOT_DIR)/tools:$$PATH" protoc --doc_out="$(ROOT_DIR)/docs" --doc_opt=markdown,apidocs_apricot.md "apricot.proto" - @cd core/protos && PATH="$(ROOT_DIR)/tools:$$PATH" protoc -I=. -I=../../common --doc_out="$(ROOT_DIR)/docs" --doc_opt=markdown,apidocs_aliecs.md "o2control.proto" + @cd core/protos && PATH="$(ROOT_DIR)/tools:$$PATH" protoc -I=. -I=../../common --doc_out="$(ROOT_DIR)/docs" --experimental_allow_proto3_optional --doc_opt=markdown,apidocs_aliecs.md o2control.proto ../../common/protos/events.proto ../../common/protos/common.proto @cd occ/protos && PATH="$(ROOT_DIR)/tools:$$PATH" protoc --doc_out="$(ROOT_DIR)/docs" --doc_opt=markdown,apidocs_occ.md "occ.proto" docs/swaggo: diff --git a/README.md b/README.md index c72abca5b..79d3ca6f3 100644 --- a/README.md +++ b/README.md @@ -2,54 +2,193 @@ [![godoc](https://img.shields.io/badge/godoc-Reference-5272B4.svg)](https://godoc.org/github.com/AliceO2Group/Control) # AliECS -The ALICE Experiment Control System +The ALICE Experiment Control System (**AliECS**) is the software that drives and controls data taking activities in the experiment. 
+It is a distributed system that combines state of the art cluster resource management and experiment control functionalities into a single comprehensive solution. -## Install instructions +Please refer to the [CHEP 2023 paper](https://doi.org/10.1051/epjconf/202429502027) for the latest design overview. -What is your use case? +## How to get started -* I want to **run AliECS** and other O²/FLP software +Regardless of your particular interests, it is recommended to get acquainted with the main [AliECS concepts](docs/handbook/concepts.md). - :arrow_right: [O²/FLP Suite deployment instructions](https://alice-flp.docs.cern.ch/system-configuration/utils/o2-flp-setup/) +After that, please find your concrete use case: - These instructions apply to both single-node and multi-node deployments. +### I want to **run AliECS** and other O²/FLP software - Contact [alice-o2-flp-support](mailto:alice-o2-flp-support@cern.ch) for assistance with provisioning and deployment. - -* I want to ensure AliECS can **run and control my process** +See [O²/FLP Suite deployment instructions](https://alice-flp.docs.cern.ch/system-configuration/utils/o2-flp-setup/) - * My software is based on FairMQ and/or O² DPL - - :palm_tree: Nothing to do, AliECS natively supports FairMQ (and DPL) devices. - - * My software does not use FairMQ and/or DPL, but should be controlled through a state machine - - :telescope: See [the OCC documentation](occ/README.md) to learn how to integrate the O² Control and Configuration library with your software. [Readout](https://github.com/AliceO2Group/Readout) is currently the only example of this setup. - - * My software is a command line utility with no state machine - - :palm_tree: Nothing to do, AliECS natively supports generic commands. Make sure the task template for your command sets the control mode to `basic` ([see example](https://github.com/AliceO2Group/ControlWorkflows/blob/basic-tasks/tasks/sleep.yaml)). 
- -* I want to build and run AliECS for **development** purposes +These instructions apply to both single-node and multi-node deployments. +Contact [alice-o2-flp-support](mailto:alice-o2-flp-support@cern.ch) for assistance with provisioning and deployment. - :hammer_and_wrench: [Building instructions](https://alice-flp.docs.cern.ch/aliecs/building/) - - :arrow_right: [Running instructions](https://alice-flp.docs.cern.ch/aliecs/running/) +There are two ways of interacting with AliECS: -* I want to communicate with AliECS via one of the plugins - - * [Receive updates on running environments via Kafka](docs/kafka.md) +- The AliECS GUI (a.k.a. Control GUI, COG) - not in this repository, but included in most deployments, recommended -## Using AliECS + :arrow_right: [AliECS GUI documentation](hacking/COG.md) -There are two ways of interacting with AliECS: - -* The AliECS GUI - not in this repository, but included in most deployments, recommended +- `coconut` - the command-line control and configuration utility, included with AliECS core, typically for developers and advanced users + + :arrow_right: [Using `coconut`](https://alice-flp.docs.cern.ch/aliecs/coconut/) - :arrow_right: [AliECS GUI documentation](hacking/COG.md) + :arrow_right: [`coconut` command reference](https://alice-flp.docs.cern.ch/aliecs/coconut/doc/coconut/) -* `coconut` - the command-line control and configuration utility, included with AliECS core - :arrow_right: [Using `coconut`](https://alice-flp.docs.cern.ch/aliecs/coconut/) +### I want to ensure AliECS can **run and control my process** - :arrow_right: [`coconut` command reference](https://alice-flp.docs.cern.ch/aliecs/coconut/doc/coconut/) +* **My software is based on FairMQ and/or O² DPL (Data Processing Layer)** + + AliECS natively supports FairMQ (and DPL) devices. + Head to [ControlWorkflows](https://github.com/AliceO2Group/ControlWorkflows) for instructions on how to configure your software to be controlled by AliECS. 
+ +* **My software does not use FairMQ and/or DPL, but should be controlled through a state machine** + + See [the OCC documentation](occ/README.md) to learn how to integrate the O² Control and Configuration library with your software. [Readout](https://github.com/AliceO2Group/Readout) is an example of this setup. + + Once ready, head to [ControlWorkflows](https://github.com/AliceO2Group/ControlWorkflows) for instructions on how to configure it to be controlled by AliECS. + +* **My software is a command line utility with no state machine** + + AliECS natively supports generic commands. + Head to [ControlWorkflows](https://github.com/AliceO2Group/ControlWorkflows) for instructions to have your command run by AliECS. + Make sure the task template for your command sets the control mode to `basic` ([see example](https://github.com/AliceO2Group/ControlWorkflows/blob/master/tasks/o2-roc-cleanup.yaml)). - :arrow_right: [`coconut` command reference](https://alice-flp.docs.cern.ch/aliecs/coconut/doc/coconut/) +### I want to develop AliECS + +:hammer_and_wrench: Welcome to the team! Please head to the [contributing instructions](/docs/CONTRIBUTING.md) + +### I want to receive updates about environments or services controlled by AliECS + +:pager: Learn more about the [Kafka event service](/docs/kafka.md) + +### I want my application to send requests to AliECS + +:scroll: See the API docs of AliECS components: + +- [core gRPC server](/docs/apidocs_aliecs.md) +- [apricot gRPC server](/docs/apidocs_apricot.md) +- [apricot HTTP server](/apricot/docs/apricot_http_service.md) + +### I want my service to be sent requests by AliECS + +:electric_plug: Learn more about the [plugin system](/core/integration/README.md) + +## Table of Contents + +* Introduction + * [Basic Concepts](/docs/handbook/concepts.md#basic-concepts) + * [Tasks](/docs/handbook/concepts.md#tasks) + * [Workflows, roles and environments](/docs/handbook/concepts.md#workflows-roles-and-environments) + * [Design 
Overview](/docs/handbook/overview.md#design-overview) + * [AliECS Structure](/docs/handbook/overview.md#aliecs-structure) + * [Resource Management](/docs/handbook/overview.md#resource-management) + * [FairMQ](/docs/handbook/overview.md#fairmq) + * [State machines](/docs/handbook/overview.md#state-machines) + +* Component reference + * AliECS GUI + * [AliECS GUI overview](/hacking/COG.md) + * AliECS core + * [Workflow Configuration](/docs/handbook/configuration.md#workflow-configuration) + * [The AliECS workflow template language](/docs/handbook/configuration.md#the-aliecs-workflow-template-language) + * [Workflow template structure](/docs/handbook/configuration.md#workflow-template-structure) + * [Task roles](/docs/handbook/configuration.md#task-roles) + * [Call roles](/docs/handbook/configuration.md#call-roles) + * [Aggregator roles](/docs/handbook/configuration.md#aggregator-roles) + * [Iterator roles](/docs/handbook/configuration.md#iterator-roles) + * [Include roles](/docs/handbook/configuration.md#include-roles) + * [Template expressions](/docs/handbook/configuration.md#template-expressions) + * [Task Configuration](/docs/handbook/configuration.md#task-configuration) + * [Task template structure](/docs/handbook/configuration.md#task-template-structure) + * [Variables pushed to controlled tasks](/docs/handbook/configuration.md#variables-pushed-to-controlled-tasks) + * [Resource wants and limits](/docs/handbook/configuration.md#resource-wants-and-limits) + * [Integration plugins](/core/integration/README.md#integration-plugins) + * [Plugin system overview](/core/integration/README.md#plugin-system-overview) + * [Integrated service operations](/core/integration/README.md#integrated-service-operations) + * [Bookkeeping](/core/integration/README.md#bookkeeping) + * [CCDB](/core/integration/README.md#ccdb) + * [DCS](/core/integration/README.md#dcs) + * [DCS operations](/core/integration/README.md#dcs-operations) + * [DCS PrepareForRun 
behaviour](/core/integration/README.md#dcs-prepareforrun-behaviour) + * [DCS StartOfRun behaviour](/core/integration/README.md#dcs-startofrun-behaviour) + * [DCS EndOfRun behaviour](/core/integration/README.md#dcs-endofrun-behaviour) + * [DD Scheduler](/core/integration/README.md#dd-scheduler) + * [Kafka (legacy)](/core/integration/README.md#kafka-legacy) + * [ODC](/core/integration/README.md#odc) + * [Test plugin](/core/integration/README.md#test-plugin) + * [Trigger](/core/integration/README.md#trigger) + * [Environment operation order](/docs/handbook/operation_order.md#environment-operation-order) + * [State machine triggers](/docs/handbook/operation_order.md#state-machine-triggers) + * [START_ACTIVITY (Start Of Run)](/docs/handbook/operation_order.md#start_activity-start-of-run) + * [STOP_ACTIVITY (End Of Run)](/docs/handbook/operation_order.md#stop_activity-end-of-run) + * [Protocol documentation](/docs/apidocs_aliecs.md) + * coconut + * [The O² control and configuration utility overview](/coconut/README.md#the-o-control-and-configuration-utility-overview) + * [Configuration file](/coconut/README.md#configuration-file) + * [Using coconut](/coconut/README.md#using-coconut) + * [Creating an environment](/coconut/README.md#creating-an-environment) + * [Controlling an environment](/coconut/README.md#controlling-an-environment) + * [Command reference](/coconut/doc/coconut.md) + * apricot + * [ALICE configuration service overview](/apricot/README.md#alice-configuration-service-overview) + * [HTTP service](/apricot/docs/apricot_http_service.md#apricot-http-service) + * [Configuration](/apricot/docs/apricot_http_service.md#configuration) + * [Usage and options](/apricot/docs/apricot_http_service.md#usage-and-options) + * [Examples](/apricot/docs/apricot_http_service.md#examples) + * [Protocol documentation](/docs/apidocs_apricot.md) + * [Command reference](/apricot/docs/apricot.md) + * occ + * [O² Control and Configuration 
Components](/occ/README.md#o-control-and-configuration-components) + * [Developer quick start instructions for OCClib](/occ/README.md#developer-quick-start-instructions-for-occlib) + * [Manual build instructions](/occ/README.md#manual-build-instructions) + * [Run example](/occ/README.md#run-example) + * [The OCC state machine](/occ/README.md#the-occ-state-machine) + * [Single process control with peanut](/occ/README.md#single-process-control-with-peanut) + * [OCC API debugging with grpcc](/occ/README.md#occ-api-debugging-with-grpcc) + * [Dummy process example for OCC library](/occ/occlib/examples/dummy-process/README.md#dummy-process-example-for-occ-library) + * [Protocol documentation](/docs/apidocs_occ.md) + * peanut + * [Process control and execution utility overview](/occ/peanut/README.md) + * Event service + * [Kafka producer functionality in AliECS core](/docs/kafka.md#kafka-producer-functionality-in-aliecs-core) + * [Making sure that AliECS sends messages](/docs/kafka.md#making-sure-that-aliecs-sends-messages) + * [Currently available topics](/docs/kafka.md#currently-available-topics) + * [Decoding the messages](/docs/kafka.md#decoding-the-messages) + * [Legacy events: Kafka plugin](/docs/kafka.md#legacy-events-kafka-plugin) + * [Making sure that AliECS sends messages](/docs/kafka.md#making-sure-that-aliecs-sends-messages-1) + * [Currently available topics](/docs/kafka.md#currently-available-topics-1) + * [Decoding the messages](/docs/kafka.md#decoding-the-messages-1) + * [Getting Start of Run and End of Run notifications](/docs/kafka.md#getting-start-of-run-and-end-of-run-notifications) + * [Using Kafka debug tools](/docs/kafka.md#using-kafka-debug-tools) + +* Developer documentation + * [Contributing](/docs/CONTRIBUTING.md) + * [Package pkg.go.dev documentation](https://pkg.go.dev/github.com/AliceO2Group/Control) + * [Building AliECS](/docs/building.md#building-aliecs) + * [Overview](/docs/building.md#overview) + * [Building with 
aliBuild](/docs/building.md#building-with-alibuild) + * [Manual build](/docs/building.md#manual-build) + * [Go environment](/docs/building.md#go-environment) + * [Clone and build (Go components only)](/docs/building.md#clone-and-build-go-components-only) + * [Makefile reference](/docs/makefile_reference.md) + * [Component Configuration](/docs/handbook/appconfiguration.md#component-configuration) + * [Apache Mesos](/docs/handbook/appconfiguration.md#apache-mesos) + * [Connectivity to controlled nodes](/docs/handbook/appconfiguration.md#connectivity-to-controlled-nodes) + * [Running AliECS as a developer](/docs/running.md#running-aliecs-as-a-developer) + * [Running the AliECS core](/docs/running.md#running-the-aliecs-core) + * [Running AliECS in production](/docs/running.md#running-aliecs-in-production) + * [Health checks](/docs/running.md#health-checks) + * [Development Information](/docs/development.md#development-information) + * [Release Procedure](/docs/development.md#release-procedure) + * [Metrics in ECS](/docs/metrics.md#metrics-in-ecs) + * [Overview and simple usage](/docs/metrics.md#overview-and-simple-usage) + * [Types and aggregation of metrics](/docs/metrics.md#types-and-aggregation-of-metrics) + * [Metric types](/docs/metrics.md#metric-types) + * [Aggregation](/docs/metrics.md#aggregation) + * [Implementation details](/docs/metrics.md#implementation-details) + * [Event loop](/docs/metrics.md#event-loop) + * [Hashing to aggregate](/docs/metrics.md#hashing-to-aggregate) + * [Sampling reservoir](/docs/metrics.md#sampling-reservoir) + * [OCC API debugging with grpcc](/docs/using_grpcc_occ.md#occ-api-debugging-with-grpcc) + +* Resources + * T. Mrnjavac et. 
al, [AliECS: A New Experiment Control System for the ALICE Experiment](https://doi.org/10.1051/epjconf/202429502027), CHEP23 + diff --git a/apricot/README.md b/apricot/README.md index 243b580d1..f3a63b6e5 100644 --- a/apricot/README.md +++ b/apricot/README.md @@ -1,14 +1,10 @@ -# `APRICOT` +# ALICE configuration service overview -**A** **p**rocessor and **r**epos**i**tory for **co**nfiguration **t**emplates +**A** **p**rocessor and **r**epos**i**tory for **co**nfiguration **t**emplates, or apricot, implements the configuration service for the ALICE data taking activities. +It adds templating, load balancing and caching on top of the configuration store. -The `o2-apricot` binary implements a centralized configuration (micro)service for ALICE O². +See also: -``` -Usage of bin/o2-apricot: - --backendUri string URI of the Consul server or YAML configuration file (default "consul://127.0.0.1:8500") - --listenPort int Port of apricot server (default 32101) - --verbose Verbose logging -``` - -Protofile: [apricot.proto](apricot/protos/apricot.proto) +* [apricot HTTP service](docs/apricot_http_service.md) - make essential cluster information available via a web server +* Protofile: [apricot.proto](protos/apricot.proto) +* [Command reference](docs/apricot.md) diff --git a/apricot/docs/apricot.md b/apricot/docs/apricot.md index e0448f6d8..43fd6027b 100644 --- a/apricot/docs/apricot.md +++ b/apricot/docs/apricot.md @@ -13,8 +13,4 @@ Usage of bin/o2-apricot: --backendUri string URI of the Consul server or YAML configuration file (default "consul://127.0.0.1:8500") --listenPort int Port of apricot server (default 32101) --verbose Verbose logging -``` - -### SEE ALSO - -* [apricot HTTP service](apricot_http_service.md) - make essential cluster information available via a web server +``` \ No newline at end of file diff --git a/apricot/docs/apricot_http_service.md b/apricot/docs/apricot_http_service.md index 15dd91a97..fd4609736 100644 --- a/apricot/docs/apricot_http_service.md 
+++ b/apricot/docs/apricot_http_service.md @@ -46,4 +46,4 @@ Besides configuration retrieval, the API also includes calls for browsing the co Getting a template-processed configuration payload for a component (entry `tpc-full-qcmn` for component `qc`, with `list_of_detectors` and `run_type` passed as template variables): * In a browser: `http://localhost:32188/components/qc/ANY/any/tpc-full-qcmn?process=true&list_of_detectors=tpc,its&run_type=PHYSICS` -* With `curl`: `curl http://127.0.0.1:32188/components/qc/ANY/any/tpc-full-qcmn\?process\=true\&list_of_detectors\=tpc,its\&run_type\=PHYSICS` \ No newline at end of file +* With `curl`: `curl http://127.0.0.1:32188/components/qc/ANY/any/tpc-full-qcmn\?process\=true\&list_of_detectors\=tpc,its\&run_type\=PHYSICS` diff --git a/coconut/README.md b/coconut/README.md index f20c2bfa6..086de07f0 100644 --- a/coconut/README.md +++ b/coconut/README.md @@ -1,4 +1,4 @@ -# `coconut` - the O² control and configuration utility +# The O² control and configuration utility overview The O² **co**ntrol and **con**figuration **ut**ility is a command line program for interacting with the AliECS core. @@ -98,6 +98,7 @@ A valid workflow template (sometimes called simply "workflow" for brevity) must Workflows and tasks are managed with a git based configuration system, so the workflow template may be provided simply by name or with repository and branch/tag/hash constraints. 
Examples: + * `coconut env create -w myworkflow` - loads workflow `myworkflow` from default configuration repository at HEAD of master branch * `coconut env create -w github.com/AliceO2Group/MyConfRepo/myworkflow` - loads a workflow from a specific git repository, HEAD of master branch * `coconut env create -w myworkflow@rev` - loads a workflow from default repository, on branch, tag or revision `rev` diff --git a/coconut/doc/coconut_environment_create.md b/coconut/doc/coconut_environment_create.md index ce39b7349..62ec30471 100644 --- a/coconut/doc/coconut_environment_create.md +++ b/coconut/doc/coconut_environment_create.md @@ -13,6 +13,7 @@ A valid workflow template (sometimes called simply "workflow" for brevity) must Workflows and tasks are managed with a git based configuration system, so the workflow template may be provided simply by name or with repository and branch/tag/hash constraints. Examples: + * `coconut env create -w myworkflow` - loads workflow `myworkflow` from default configuration repository at HEAD of master branch * `coconut env create -w github.com/AliceO2Group/MyConfRepo/myworkflow` - loads a workflow from a specific git repository, HEAD of master branch * `coconut env create -w myworkflow@rev` - loads a workflow from default repository, on branch, tag or revision `rev` diff --git a/coconut/doc/coconut_repository.md b/coconut/doc/coconut_repository.md index 32156d56d..7ba96cc14 100644 --- a/coconut/doc/coconut_repository.md +++ b/coconut/doc/coconut_repository.md @@ -9,6 +9,7 @@ The repository command performs operations on the repositories used for task and A valid workflow configuration repository must contain the directories `tasks` and `workflows` in its `master` branch. When referencing a repository, the clone method should never be prepended. 
Supported repo backends and their expected format are: + - https: [hostname]/[repo_path] - ssh: [hostname]:[repo_path] - local [repo_path] (local repo entries are ephemeral and will not survive a core restart) diff --git a/coconut/doc/coconut_repository_add.md b/coconut/doc/coconut_repository_add.md index 90db22e45..7b24c4f80 100644 --- a/coconut/doc/coconut_repository_add.md +++ b/coconut/doc/coconut_repository_add.md @@ -16,6 +16,7 @@ the ensuing list is followed until a valid revision has been identified: Exhaustion of the aforementioned list results in a repo add failure. `coconut repo add` can be called with + 1) a repository identifier 2) a repository identifier coupled with the `--default-revision` flag (see examples below) diff --git a/coconut/doc/coconut_role_query.md b/coconut/doc/coconut_role_query.md index 6d5626791..ee3759272 100644 --- a/coconut/doc/coconut_role_query.md +++ b/coconut/doc/coconut_role_query.md @@ -17,6 +17,7 @@ walk through the role tree of the given environment, starting from the root role per https://github.com/gobwas/glob syntax. 
Examples: + * `coconut role query 2rE9AV3m1HL readout-dataflow` - queries the role `readout-dataflow` in environment `2rE9AV3m1HL`, prints the full tree, along with the variables defined in the root role * `coconut role query 2rE9AV3m1HL readout-dataflow.host-aido2-bld4-lab102` - queries the role `readout-dataflow.host-aido2-bld4-lab102`, prints the subtree of that role, along with the variables defined in it * `coconut role query 2rE9AV3m1HL readout-dataflow.host-aido2-bld4-lab102.data-distribution.stfs` - queries the role at the given path, it is a task role so there is no subtree, prints the variables defined in that role diff --git a/coconut/doc/coconut_template_list.md b/coconut/doc/coconut_template_list.md index 47853a119..7d2fdda29 100644 --- a/coconut/doc/coconut_template_list.md +++ b/coconut/doc/coconut_template_list.md @@ -7,7 +7,8 @@ list available workflow templates The template list command shows a list of available workflow templates. These workflow templates can then be loaded to create an environment. -`coconut templ list` can be called with +`coconut templ list` can be called with + 1) a combination of the `--repo` , `--revision` , `--all-branches` , `--all-tags` , `--all-workflows` flags, or with 2) an argument in the form of [repo-pattern]@[revision-pattern], where the patterns are globbing. diff --git a/core/integration/README.md b/core/integration/README.md new file mode 100644 index 000000000..d86622233 --- /dev/null +++ b/core/integration/README.md @@ -0,0 +1,159 @@ +# Integration plugins + +The integration plugins allow AliECS to communicate with other ALICE services. +A plugin can register a set of callback which can be invoked upon defined environment events (state transitions). + +## Plugin system overview + +All plugins should implement the [`Plugin`](https://github.com/AliceO2Group/Control/blob/master/core/integration/plugin.go) interface. +See the existing plugins for examples. 
+ +In order to have the plugin loaded by AliECS, one has to: + +- add `RegisterPlugin` to the `init()` function in the [AliECS core main source](https://github.com/AliceO2Group/Control/blob/master/cmd/o2-aliecs-core/main.go) +- add the plugin name to the `integrationPlugins` list and set the endpoint in the AliECS configuration file (typically at `/o2/components/aliecs/ANY/any/settings` in the configuration store) + +# Integrated service operations + +In this chapter we list and describe the integrated service plugins. + +## Bookkeeping + +The legacy Bookkeeping plugin sends updates to Bookkeeping about the state of data taking runs. +As of May 2025, Bookkeeping has transitioned into consuming input from the Kafka event service and the only call in use is "FillInfo", which allows ECS to retrieve LHC fill information. + +## CCDB + +The CCDB plugin calls a PDP-provided executable which creates a General Run Parameters (GRP) object at each run start and stop. + +## DCS + +The DCS plugin communicates with the ALICE Detector Control System (DCS). + +### DCS operations + +The DCS integration plugin exposes the following operations to the +workflow template (WFT) context. Their associated transitions in this table refer +to the [readout-dataflow](https://github.com/AliceO2Group/ControlWorkflows/blob/master/workflows/readout-dataflow.yaml) workflow template. 
+ +| **DCS operation** | **WFT call** | **Call timing** | **Critical** | **Contingent on detector state** | +|-----------------------|---------------------|---------------------------|--------------|----------------------------------| +| Prepare For Run (PFR) | `dcs.PrepareForRun` | during `CONFIGURE` | `false` | yes | +| Start Of Run (SOR) | `dcs.StartOfRun` | early in `START_ACTIVITY` | `true` | yes | +| End Of Run (EOR) | `dcs.EndOfRun` | late in `STOP_ACTIVITY` | `true` | no | + +The DCS integration plugin subscribes to the [DCS service](https://github.com/AliceO2Group/Control/blob/master/core/integration/dcs/protos/dcs.proto) and continually +receives information on operation-state compatibility for all +detectors. +When a given environment reaches a DCS call, the relevant DCS operation +will be called only if the DCS service reports that all detectors in that +environment are compatible with this operation, except EOR, which is +always called. + +### DCS PrepareForRun behaviour + +Unlike SOR and EOR, which are mandatory if `dcs_enabled` is set to `true`, +an impossibility to run PFR or a PFR failure will not prevent the +environment from transitioning forward. + +#### DCS PFR incompatibility + +When `dcs.PrepareForRun` is called, if at least one detector is in a +state that is incompatible with PFR as reported by the DCS service, +a grace period of 10 seconds is given for the detector(s) to become +compatible with PFR, with 1Hz polling frequency. As soon as all +detectors become compatible with PFR, the PFR operation is requested +to the DCS service. + +If the grace period ends and at least one detector +included in the environment is still incompatible with PFR, the PFR +operation will be performed for the PFR-compatible detectors. + +Despite some detectors not having performed PFR, the environment +can still transition forward towards the `RUNNING` state, and any DCS +activities that would have taken place in PFR will instead happen +during SOR. 
Only at that point, if at least one detector is not +compatible with SOR (or if it is but SOR fails), will the environment +declare a failure. + +#### DCS PFR failure + +When `dcs.PrepareForRun` is called, if all detectors are compatible +with PFR as reported by the DCS service (or become compatible during +the grace period), the PFR operation is immediately requested to the +DCS service. + +`dcs.PrepareForRun` call fails if no detectors are PFR-compatible +or PFR fails for all those which were PFR-compatible, +but since it is non-critical the environment may still reach the +`CONFIGURED` state and transition forward towards `RUNNING`. + +As in the case of an impossibility to run PFR, any DCS activities that +would have taken place in PFR will instead be done during SOR. + +### DCS StartOfRun behaviour + +The SOR operation is mandatory if `dcs_enabled` is set to `true` +(AliECS GUI "DCS" switched on). + +#### DCS SOR incompatibility + +When `dcs.StartOfRun` is called, if at least one detector is in a +state that is incompatible with SOR as reported by the DCS service, +or if after a grace period of 10 seconds at least one detector is +still incompatible with SOR, the SOR operation **will not run for any +detector**. + +The environment will then declare a **failure**, the +`START_ACTIVITY` transition will be blocked and the environment +will move to `ERROR`. + +#### DCS SOR failure + +When `dcs.StartOfRun` is called, if all detectors are compatible +with SOR as reported by the DCS service (or become compatible during +the grace period), the SOR operation is immediately requested to the +DCS service. + +If this operation fails for one or more detectors, the +`dcs.StartOfRun` call as a whole is considered to have failed. 
+ +The environment will then declare a **failure**, the +`START_ACTIVITY` transition will be blocked and the environment +will move to `ERROR`. + +### DCS EndOfRun behaviour + +The EOR operation is mandatory if `dcs_enabled` is set to `true` +(AliECS GUI "DCS" switched on). However, unlike with PFR and SOR, there +is **no check for compatibility** with the EOR operation. The EOR +request will always be sent to the DCS service during `STOP_ACTIVITY`. + +#### DCS EOR failure + +If this operation fails for one or more detectors, the +`dcs.EndOfRun` call as a whole is considered to have failed. + +The environment will then declare a **failure**, the +`STOP_ACTIVITY` transition will be blocked and the environment +will move to `ERROR`. + +## DD Scheduler + +The DD Scheduler plugin informs the Data Distribution software about the pool of FLPs taking part in data taking. + +## Kafka (legacy) + +See [Legacy events: Kafka plugin](/docs/kafka.md#legacy-events-kafka-plugin) + +## ODC + +The ODC plugin communicates with the [Online Device Control (ODC)](https://github.com/FairRootGroup/ODC) instance of the ALICE experiment, which controls the event processing farm used in data taking and offline processing. + +## Test plugin + +The Test plugin serves as an example of a plugin and is used for testing the plugin system. + +## Trigger + +The Trigger plugin communicates with the ALICE trigger system. 
diff --git a/core/protos/o2control.proto b/core/protos/o2control.proto index 17698b13c..df3069b90 100644 --- a/core/protos/o2control.proto +++ b/core/protos/o2control.proto @@ -30,11 +30,16 @@ option go_package = "github.com/AliceO2Group/Control/core/protos;pb"; import public "protos/events.proto"; +// The Control service is the main interface to AliECS service Control { rpc GetFrameworkInfo (GetFrameworkInfoRequest) returns (GetFrameworkInfoReply) {} rpc GetEnvironments (GetEnvironmentsRequest) returns (GetEnvironmentsReply) {} + // Creates a new environment which automatically follows one STANDBY->RUNNING->DONE cycle in the state machine. + // It returns only once the environment reaches the CONFIGURED state or upon any earlier failure. rpc NewAutoEnvironment (NewAutoEnvironmentRequest) returns (NewAutoEnvironmentReply) {} + // Creates a new environment. + // It returns only once the environment reaches the CONFIGURED state or upon any earlier failure. rpc NewEnvironment (NewEnvironmentRequest) returns (NewEnvironmentReply) {} rpc GetEnvironment (GetEnvironmentRequest) returns (GetEnvironmentReply) {} rpc ControlEnvironment (ControlEnvironmentRequest) returns (ControlEnvironmentReply) {} @@ -42,6 +47,9 @@ service Control { rpc GetActiveDetectors (Empty) returns (GetActiveDetectorsReply) {} rpc GetAvailableDetectors (Empty) returns (GetAvailableDetectorsReply) {} + // Creates a new environment. + // It returns once an environment ID is created and continues the creation asynchronously to the call. + // The environment will be listed in GetEnvironments() only once the workflow is loaded and deployment starts. 
rpc NewEnvironmentAsync (NewEnvironmentRequest) returns (NewEnvironmentReply) {} // rpc SetEnvironmentProperties (SetEnvironmentPropertiesRequest) returns (SetEnvironmentPropertiesReply) {} diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md new file mode 100644 index 000000000..54611fd45 --- /dev/null +++ b/docs/CONTRIBUTING.md @@ -0,0 +1,63 @@ +# Contributing + +Thank you for your interest in contributing to the project. +This document provides guidelines and information to help you contribute effectively. + +If you are not in contact with the project maintainers, please reach out to them before proposing any changes. +We use JIRA for issue tracking and project management. +This software component is part of the O²/FLP project in the ALICE experiment. + +## Getting started + +Getting acquainted with the introduction chapters is essential, and skimming through the rest of the documentation is highly advised. + +A development environment setup will be necessary for compiling binaries and running unit tests; see [Building](/docs/building.md) for details. + +## Testing + +Run unit tests in the Control project with `make test`. +To obtain test coverage reports, run `make coverage`. + +Typically, you will also want to prepare a test setup in the form of an [FLP suite deployment](https://alice-flp.docs.cern.ch/system-configuration/utils/o2-flp-setup/) on a virtual machine. +Since AliECS interacts with many other project components, the last testing step might involve replacing the modified binary on the test VM and trying out the new functionality or the fix. + +The binaries are installed at `/opt/o2/bin`. + +`o2-aliecs-core` and `o2-apricot` are run as systemd services, so you will need to restart them after replacing the binary. + +`o2-aliecs-executor` is started by `mesos-slave` if it is not running already at environment creation. +To make sure that the replaced binary is used, kill the running process (`pkill -f o2-aliecs-executor`). 
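The replace-and-restart cycle described above can be captured in a small helper script. This is only a sketch using the paths and service handling from this section: it runs in dry-run mode by default (`DRY_RUN=1`), printing the commands it would execute instead of touching a real machine.

```shell
# Sketch of replacing an AliECS binary on a test VM. DRY_RUN=1 (the
# default here) only echoes the commands; set DRY_RUN=0 on a real setup.
DRY_RUN="${DRY_RUN:-1}"

run() { if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi; }

deploy() {
  bin="$1" src="$2"
  run sudo cp "$src" "/opt/o2/bin/$bin"    # binaries live in /opt/o2/bin
  case "$bin" in
    o2-aliecs-core|o2-apricot)
      run sudo systemctl restart "$bin" ;; # run as systemd services
    o2-aliecs-executor)
      run sudo pkill -f o2-aliecs-executor ;; # respawned by mesos-slave
  esac
}

deploy o2-apricot ./bin/o2-apricot
```

In dry-run mode the call above prints the `cp` and `systemctl restart` commands for `o2-apricot`; deploying `o2-aliecs-executor` would print the `pkill` command instead.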
+ +## Pull request guidelines + +- Make sure your work has a corresponding JIRA ticket and it is assigned to yourself. +Trivial PRs are acceptable without a ticket. + +- Work on your changes in your fork on a dedicated branch with a descriptive name. + +- Make focused, logically atomic commits with clear messages and descriptions explaining the design choices. +Multiple commits per pull request are allowed. +However, please make sure that the project can be built and the tests pass at any commit. + +- The commit message or description should include the JIRA ticket number. + +- Add tests for your changes whenever possible. +Gomega/Ginkgo tests are preferred, but other styles of tests are also welcome. + +- Add documentation for new features. + +- Your contribution will be reviewed by the project maintainers once the PR is marked as ready for review. + +## Documentation guidelines + +The markdown documentation is meant to be browsed on GitHub, but it is also published on the aggregated [FLP documentation](https://alice-flp.docs.cern.ch) based on [MkDocs](https://www.mkdocs.org/). +Consequently, any changes in the documentation structure should be reflected in the Table of Contents in the main README.md, as well as in `mkdocs.yml` and `mkdocs-dev.yml`. + +The AliECS MkDocs documentation is split into the two aforementioned files to follow the split between "Products" and "Developers" tabs in the FLP documentation. +The `mkdocs-dev.yml` uses a symlink `aliecs-dev` to the `aliecs` directory to avoid complaints about duplicated site names. + +Because of the dual target of the documentation, the points below are important to keep in mind: + +- Absolute paths in links to other files do not always work, so they should be avoided. +- When referencing source files in the repository, use full URIs to GitHub. +- In MkDocs layouts, one cannot reference specific sections within markdown files. Only links to entire markdown files are possible. 
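The branching and commit-message guidelines above can be sketched as a minimal git sequence. This is an illustration only, run in a throwaway repository; the ticket number `OCTRL-1234`, the branch name and the file name are all hypothetical placeholders.

```shell
# Hypothetical walk-through of the PR guidelines in a throwaway repository.
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git -c user.name=dev -c user.email=dev@example.com commit -q --allow-empty -m "initial"
# Dedicated branch with a descriptive name, prefixed by the (made-up) JIRA ticket:
git checkout -q -b OCTRL-1234-fix-executor-restart
echo "fix" > fix.txt && git add fix.txt
# Commit message referencing the JIRA ticket number:
git -c user.name=dev -c user.email=dev@example.com commit -q -m "[OCTRL-1234] Restart executor after binary replacement"
git log --oneline -1
```

The last command prints the new commit with the ticket number visible in its message, which is what reviewers will look for.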
\ No newline at end of file diff --git a/docs/apidocs_aliecs.md b/docs/apidocs_aliecs.md index 6783e0b13..7004c9050 100644 --- a/docs/apidocs_aliecs.md +++ b/docs/apidocs_aliecs.md @@ -88,6 +88,26 @@ - [Control](#o2control-Control) +- [protos/events.proto](#protos_events-proto) + - [Ev_CallEvent](#events-Ev_CallEvent) + - [Ev_EnvironmentEvent](#events-Ev_EnvironmentEvent) + - [Ev_EnvironmentEvent.VarsEntry](#events-Ev_EnvironmentEvent-VarsEntry) + - [Ev_IntegratedServiceEvent](#events-Ev_IntegratedServiceEvent) + - [Ev_MetaEvent_CoreStart](#events-Ev_MetaEvent_CoreStart) + - [Ev_MetaEvent_FrameworkEvent](#events-Ev_MetaEvent_FrameworkEvent) + - [Ev_MetaEvent_MesosHeartbeat](#events-Ev_MetaEvent_MesosHeartbeat) + - [Ev_RoleEvent](#events-Ev_RoleEvent) + - [Ev_RunEvent](#events-Ev_RunEvent) + - [Ev_TaskEvent](#events-Ev_TaskEvent) + - [Event](#events-Event) + - [Traits](#events-Traits) + + - [OpStatus](#events-OpStatus) + +- [protos/common.proto](#protos_common-proto) + - [User](#common-User) + - [WorkflowTemplateInfo](#common-WorkflowTemplateInfo) + - [Scalar Value Types](#scalar-value-types) @@ -1453,20 +1473,20 @@ Not implemented yet ### Control - +The Control service is the main interface to AliECS | Method Name | Request Type | Response Type | Description | | ----------- | ------------ | ------------- | ------------| | GetFrameworkInfo | [GetFrameworkInfoRequest](#o2control-GetFrameworkInfoRequest) | [GetFrameworkInfoReply](#o2control-GetFrameworkInfoReply) | | | GetEnvironments | [GetEnvironmentsRequest](#o2control-GetEnvironmentsRequest) | [GetEnvironmentsReply](#o2control-GetEnvironmentsReply) | | -| NewAutoEnvironment | [NewAutoEnvironmentRequest](#o2control-NewAutoEnvironmentRequest) | [NewAutoEnvironmentReply](#o2control-NewAutoEnvironmentReply) | | -| NewEnvironment | [NewEnvironmentRequest](#o2control-NewEnvironmentRequest) | [NewEnvironmentReply](#o2control-NewEnvironmentReply) | | +| NewAutoEnvironment | 
[NewAutoEnvironmentRequest](#o2control-NewAutoEnvironmentRequest) | [NewAutoEnvironmentReply](#o2control-NewAutoEnvironmentReply) | Creates a new environment which automatically follows one STANDBY->RUNNING->DONE cycle in the state machine. It returns only once the environment reaches the CONFIGURED state or upon any earlier failure. | +| NewEnvironment | [NewEnvironmentRequest](#o2control-NewEnvironmentRequest) | [NewEnvironmentReply](#o2control-NewEnvironmentReply) | Creates a new environment. It returns only once the environment reaches the CONFIGURED state or upon any earlier failure. | | GetEnvironment | [GetEnvironmentRequest](#o2control-GetEnvironmentRequest) | [GetEnvironmentReply](#o2control-GetEnvironmentReply) | | | ControlEnvironment | [ControlEnvironmentRequest](#o2control-ControlEnvironmentRequest) | [ControlEnvironmentReply](#o2control-ControlEnvironmentReply) | | | DestroyEnvironment | [DestroyEnvironmentRequest](#o2control-DestroyEnvironmentRequest) | [DestroyEnvironmentReply](#o2control-DestroyEnvironmentReply) | | | GetActiveDetectors | [Empty](#o2control-Empty) | [GetActiveDetectorsReply](#o2control-GetActiveDetectorsReply) | | | GetAvailableDetectors | [Empty](#o2control-Empty) | [GetAvailableDetectorsReply](#o2control-GetAvailableDetectorsReply) | | -| NewEnvironmentAsync | [NewEnvironmentRequest](#o2control-NewEnvironmentRequest) | [NewEnvironmentReply](#o2control-NewEnvironmentReply) | | +| NewEnvironmentAsync | [NewEnvironmentRequest](#o2control-NewEnvironmentRequest) | [NewEnvironmentReply](#o2control-NewEnvironmentReply) | Creates a new environment. It returns once an environment ID is created and continues the creation asynchronously to the call. The environment will be listed in GetEnvironments() only once the workflow is loaded and deployment starts. 
| | GetTasks | [GetTasksRequest](#o2control-GetTasksRequest) | [GetTasksReply](#o2control-GetTasksReply) | | | GetTask | [GetTaskRequest](#o2control-GetTaskRequest) | [GetTaskReply](#o2control-GetTaskReply) | | | CleanupTasks | [CleanupTasksRequest](#o2control-CleanupTasksRequest) | [CleanupTasksReply](#o2control-CleanupTasksReply) | | @@ -1488,6 +1508,321 @@ Not implemented yet + +

Top

+ +## protos/events.proto + + + + + +### Ev_CallEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| func | [string](#string) | | name of the function being called, within the workflow template context | +| callStatus | [OpStatus](#events-OpStatus) | | progress or success/failure state of the call | +| return | [string](#string) | | return value of the function | +| traits | [Traits](#events-Traits) | | | +| output | [string](#string) | | any additional output of the function | +| error | [string](#string) | | error value, if returned | +| environmentId | [string](#string) | | | +| path | [string](#string) | | path to the parent callRole of this call within the environment | + + + + + + + + +### Ev_EnvironmentEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| environmentId | [string](#string) | | | +| state | [string](#string) | | | +| runNumber | [uint32](#uint32) | | only when the environment is in the running state | +| error | [string](#string) | | | +| message | [string](#string) | | any additional message concerning the current state or transition | +| transition | [string](#string) | | | +| transitionStep | [string](#string) | | | +| transitionStatus | [OpStatus](#events-OpStatus) | | | +| vars | [Ev_EnvironmentEvent.VarsEntry](#events-Ev_EnvironmentEvent-VarsEntry) | repeated | consolidated environment variables at the root role of the environment | +| lastRequestUser | [common.User](#common-User) | | | +| workflowTemplateInfo | [common.WorkflowTemplateInfo](#common-WorkflowTemplateInfo) | | | + + + + + + + + +### Ev_EnvironmentEvent.VarsEntry + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| key | [string](#string) | | | +| value | [string](#string) | | | + + + + + + + + +### Ev_IntegratedServiceEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| name | [string](#string) | | name of the 
context, usually the path of the callRole that calls a given integrated service function e.g. readout-dataflow.dd-scheduler.terminate | +| error | [string](#string) | | error message, if any | +| operationName | [string](#string) | | name of the operation, usually the name of the integrated service function being called e.g. ddsched.PartitionTerminate()" | +| operationStatus | [OpStatus](#events-OpStatus) | | progress or success/failure state of the operation | +| operationStep | [string](#string) | | if the operation has substeps, this is the name of the current substep, like an API call or polling phase | +| operationStepStatus | [OpStatus](#events-OpStatus) | | progress or success/failure state of the current substep | +| environmentId | [string](#string) | | | +| payload | [string](#string) | | any additional payload, depending on the integrated service; there is no schema, it can even be the raw return structure of a remote API call | + + + + + + + + +### Ev_MetaEvent_CoreStart + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| frameworkId | [string](#string) | | | + + + + + + + + +### Ev_MetaEvent_FrameworkEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| frameworkId | [string](#string) | | | +| message | [string](#string) | | | + + + + + + + + +### Ev_MetaEvent_MesosHeartbeat + + + + + + + + + +### Ev_RoleEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| name | [string](#string) | | role name | +| status | [string](#string) | | posible values: ACTIVE/INACTIVE/PARTIAL/UNDEFINED/UNDEPLOYABLE as defined in status.go. 
Derived from the state of child tasks, calls or other roles | +| state | [string](#string) | | state machine state for this role | +| rolePath | [string](#string) | | path to this role within the environment | +| environmentId | [string](#string) | | | + + + + + + + + +### Ev_RunEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| environmentId | [string](#string) | | | +| runNumber | [uint32](#uint32) | | | +| state | [string](#string) | | | +| error | [string](#string) | | | +| transition | [string](#string) | | | +| transitionStatus | [OpStatus](#events-OpStatus) | | | +| lastRequestUser | [common.User](#common-User) | | | + + + + + + + + +### Ev_TaskEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| name | [string](#string) | | task name, based on the name of the task class | +| taskid | [string](#string) | | task id, unique | +| state | [string](#string) | | state machine state for this task | +| status | [string](#string) | | posible values: ACTIVE/INACTIVE/PARTIAL/UNDEFINED/UNDEPLOYABLE as defined in status.go. 
| +| hostname | [string](#string) | | | +| className | [string](#string) | | name of the task class from which this task was spawned | +| traits | [Traits](#events-Traits) | | | +| environmentId | [string](#string) | | | +| path | [string](#string) | | path to the parent taskRole of this task within the environment | + + + + + + + + +### Event + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| timestamp | [int64](#int64) | | | +| timestampNano | [int64](#int64) | | | +| environmentEvent | [Ev_EnvironmentEvent](#events-Ev_EnvironmentEvent) | | | +| taskEvent | [Ev_TaskEvent](#events-Ev_TaskEvent) | | | +| roleEvent | [Ev_RoleEvent](#events-Ev_RoleEvent) | | | +| callEvent | [Ev_CallEvent](#events-Ev_CallEvent) | | | +| integratedServiceEvent | [Ev_IntegratedServiceEvent](#events-Ev_IntegratedServiceEvent) | | | +| runEvent | [Ev_RunEvent](#events-Ev_RunEvent) | | | +| frameworkEvent | [Ev_MetaEvent_FrameworkEvent](#events-Ev_MetaEvent_FrameworkEvent) | | | +| mesosHeartbeatEvent | [Ev_MetaEvent_MesosHeartbeat](#events-Ev_MetaEvent_MesosHeartbeat) | | | +| coreStartEvent | [Ev_MetaEvent_CoreStart](#events-Ev_MetaEvent_CoreStart) | | | + + + + + + + + +### Traits + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| trigger | [string](#string) | | | +| await | [string](#string) | | | +| timeout | [string](#string) | | | +| critical | [bool](#bool) | | | + + + + + + + + + + +### OpStatus + + +| Name | Number | Description | +| ---- | ------ | ----------- | +| NULL | 0 | | +| STARTED | 1 | | +| ONGOING | 2 | | +| DONE_OK | 3 | | +| DONE_ERROR | 4 | | +| DONE_TIMEOUT | 5 | | + + + + + + + + + + + +

Top

+ +## protos/common.proto + + + + + +### User + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| externalId | [int32](#int32) | optional | The unique CERN identifier of this user. | +| id | [int32](#int32) | optional | The unique identifier of this entity. | +| name | [string](#string) | | Name of the user. | + + + + + + + + +### WorkflowTemplateInfo + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| name | [string](#string) | | | +| description | [string](#string) | | | +| path | [string](#string) | | | +| public | [bool](#bool) | | whether the environment is public or not | + + + + + + + + + + + + + + + ## Scalar Value Types | .proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby | diff --git a/docs/building.md b/docs/building.md index ffd5fbe1b..0a4243ec9 100644 --- a/docs/building.md +++ b/docs/building.md @@ -1,6 +1,6 @@ # Building AliECS -> **WARNING**: The building instructions described in this page are **for development purposes only**. Users interested in deploying, running and controlling O²/FLP software or their own software with AliECS should refer to the [O²/FLP Suite instructions](../../installation/) instead. +> **WARNING**: The building instructions described in this page are **for development purposes only**. Users interested in deploying, running and controlling O²/FLP software or their own software with AliECS should refer to the [O²/FLP Suite instructions](https://alice-flp.docs.cern.ch/Operations/Experts/system-configuration/utils/o2-flp-setup/) instead. ## Overview @@ -84,6 +84,6 @@ You should find several executables including `o2control-core`, `o2control-execu For subsequent builds (after the first one), plain `make` (instead of `make all`) is sufficient. See the [Makefile reference](makefile_reference.md) for more information. -If you wish to also build the process control library and/or plugin, see [the OCC readme](./occ/README.md). 
+If you wish to also build the process control library and/or plugin, see [the OCC readme](../occ/README.md). This build of AliECS can be run locally and connected to an existing O²/FLP Suite cluster by passing a `--mesosUrl` parameter. If you do this, remember to `systemctl stop o2-aliecs-core` on the head node, in order to stop the core that came with the O²/FLP Suite and use your own. diff --git a/docs/development.md b/docs/development.md index bfccde00a..d0373f09f 100644 --- a/docs/development.md +++ b/docs/development.md @@ -4,7 +4,7 @@ Generated API documentation is available on [pkg.go.dev](https:///pkg.go.dev/git The release log is managed via [GitHub](https://github.com/AliceO2Group/Control/releases/). -Bugs go to [JIRA](https://alice.its.cern.ch/jira/browse/OCTRL). +Bugs go to [JIRA](https://its.cern.ch/jira/projects/OCTRL/issues). ## Release Procedure @@ -13,7 +13,6 @@ Bugs go to [JIRA](https://alice.its.cern.ch/jira/browse/OCTRL). 3. Run `hacking/release_notes.sh HEAD` to get a formatted commit message list since the last tag, copy it. 4. Paste the above into a [new GitHub release draft](https://github.com/AliceO2Group/Control/releases/new). Sort, categorize, add summary on top. 5. Pick a version number. Numbers `x.x.80`-`x.x.89` are reserved for Alpha pre-releases. Numbers `x.x.90`-`x.x.99` are reserved for Beta and RC pre-releases. If doing a pre-release, don't forget to tick `This is a pre-release`. When ready, hit `Publish release`. -6. Go to your local clone of [`alice-flp/documentation`](https://gitlab.cern.ch/alice-flp/documentation), descend into `docs/aliecs`. `git pull --rebase` to ensure the submodule points to the tag created just now. Commit and push (or merge request). -7. Go to your local clone of `alidist`, ensure that the branch is `master` and that it's up to date. Then branch out into `aliecs-bump` (`git branch aliecs-bump`). -8. Bump the version in `control.sh`, `control-core.sh`, `control-occplugin.sh` and `coconut.sh`. 
Commit and push to `origin/aliecs-bump` (`git push -u origin aliecs-bump`). -9. Submit pull request with the above to `alisw/alidist`. +6. Go to your local clone of `alidist`, ensure that the branch is `master` and that it's up to date. Then branch out into `aliecs-bump` (`git checkout -b aliecs-bump`). +7. Bump the version in `control.sh`, `control-core.sh`, `control-occplugin.sh` and `coconut.sh`. Commit and push to `origin/aliecs-bump` (`git push -u origin aliecs-bump`). +8. Submit pull request with the above to `alisw/alidist`. diff --git a/docs/faq.md b/docs/faq.md deleted file mode 100644 index d39935874..000000000 --- a/docs/faq.md +++ /dev/null @@ -1,2 +0,0 @@ -# Frequently Asked Questions - diff --git a/docs/handbook/AliECS-environment.png b/docs/handbook/AliECS-environment.png new file mode 100644 index 000000000..e94444806 Binary files /dev/null and b/docs/handbook/AliECS-environment.png differ diff --git a/docs/handbook/appconfiguration.md b/docs/handbook/appconfiguration.md index 5b5892771..ed96e462b 100644 --- a/docs/handbook/appconfiguration.md +++ b/docs/handbook/appconfiguration.md @@ -1,6 +1,10 @@ # Component Configuration -## Connectivity to controlled nodes +## Apache Mesos + +Apache Mesos is installed as a part of the FLP Suite. + +### Connectivity to controlled nodes ECS relies on Mesos to know the state of the controlled nodes. Thus, losing connection to a Mesos slave can be treated as a node being down or unresponsive. @@ -9,4 +13,4 @@ Then, the environment is transitioned to ERROR. Mesos slave health check can be configured with `MESOS_MAX_AGENT_PING_TIMEOUTS` (`--max_agent_ping_timeouts`) and `MESOS_AGENT_PING_TIMEOUT` (`--agent_ping_timeout`) parameters for Mesos. Effectively, the factor of the two parameters is the time needed to consider a slave/agent as lost. -Please refer to Mesos documentation for more details. 
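As a back-of-the-envelope illustration of the two Mesos health-check parameters multiplied together: the values below are what are believed to be the Mesos defaults (`--agent_ping_timeout=15secs`, `--max_agent_ping_timeouts=5`), so check the actual values of your deployment.

```shell
# Time before an unresponsive Mesos agent is declared lost:
#   MESOS_MAX_AGENT_PING_TIMEOUTS * MESOS_AGENT_PING_TIMEOUT
MESOS_AGENT_PING_TIMEOUT=15      # seconds (assumed Mesos default)
MESOS_MAX_AGENT_PING_TIMEOUTS=5  # assumed Mesos default
echo "$((MESOS_MAX_AGENT_PING_TIMEOUTS * MESOS_AGENT_PING_TIMEOUT)) seconds"
```

With these assumed defaults, an agent that stops answering pings is considered lost after 75 seconds, at which point the affected environment is transitioned to ERROR.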
\ No newline at end of file +Please refer to [Mesos documentation](https://mesos.apache.org/documentation/latest/) for more details. \ No newline at end of file diff --git a/docs/handbook/concepts.md b/docs/handbook/concepts.md index c5eace31e..45b34fcfd 100644 --- a/docs/handbook/concepts.md +++ b/docs/handbook/concepts.md @@ -2,10 +2,37 @@ From a logical point of view of data processing deployment and control, AliECS deals with concepts such as **environments**, **roles** and **tasks**, the understanding of which is paramount for using AliECS effectively. +
+ AliECS Environment +
+ +## Tasks + The basic unit of scheduling in AliECS is a **task**. A task generally corresponds to a process. Sometimes this is a process that can receive and respond to OCC-compatible control messages (also called a **stateful task**), and other times this is simply a shell script or command line tool invocation (also called a **stateless task** or **basic task**). +## Workflows, roles and environments + All AliECS **workflows** are collections of tasks, which together form a coherent data processing chain. -Tasks are the leaves in a tree of roles. A **role** is a runtime subdivision of the complete system, it represents a kind of operation along with its resources (but less than a complete data processing chain). Each task implements one or more roles. Roles allow binding tasks or groups of tasks to specific host attributes, detectors and configuration values. Each role represents either a single task, or a group of child roles. If tasks are leaves, roles are all the other nodes in the control tree of an environment. +Tasks are the leaves in a tree of roles. +A **role** is a runtime subdivision of the complete system, it represents a kind of operation along with its resources (but less than a complete data processing chain). +Each task implements one or more roles. +Roles allow binding tasks or groups of tasks to specific host attributes, detectors and configuration values. +Each role represents either a single task, or a group of child roles. While tasks are leaves, roles are all the other nodes in the control tree of an environment. + +These novel, more flexible and more easily deployable abstractions represent the evolution of Run 2 abstractions such as ECS partitions. +In memory, a tree of O² roles, along with their tasks and their configuration is a **workflow**. +A workflow aggregates the collective state of its constituent O2 roles. 
+A running workflow, along with associated detectors and other hardware and software resources required for experiment operation constitutes an **environment**. + +## Activities and runs + +**Activity** and (data-taking) **run** are used interchangeably in AliECS. +Run is a term present in ALICE from the beginning, while activity was introduced in the early days of the O2 project when it was not clear how the idea of a run would evolve. + +A run is a period of data taking, which is defined by a start and end time, typically lasting several hours at most. +It is identified by a run number, which is a monotonically increasing integer. +It is also associated with a set of configuration parameters, which are used to configure the data processing chain. -These novel, more flexible and more easily deployable abstractions represent the evolution of Run 2 abstractions such as ECS partitions. In memory, a tree of O² roles, along with their tasks and their configuration is a **workflow**. A workflow aggregates the collective state of its constituent O2 roles. A running workflow, along with associated detectors and other hardware and software resources required for experiment operation constitutes an **environment**. +Run also has a second meaning at CERN: a period of LHC operations, lasting a few years and separated by Long Shutdowns. +These operational runs are not to be confused with data-taking runs. diff --git a/docs/handbook/configuration.md b/docs/handbook/configuration.md index cf0024fe7..9f9376529 100644 --- a/docs/handbook/configuration.md +++ b/docs/handbook/configuration.md @@ -12,6 +12,7 @@ while still being powerful enough to express complex workflows. To instantiate a data taking activity, or environment, two kinds of files are needed: + * workflow templates * task templates @@ -167,21 +168,6 @@ roles: In the absence of an explicit `critical` trait for a given task role, the assumed default value is `critical: true`. 
-#### State machine callbacks moments - -The underlying state machine library allows us to add callbacks upon entering and leaving states as well as before and after events (transitions). -This is the order of callback execution upon a state transition: -1. `before_` - called before event named `` -2. `before_event` - called before all events -3. `leave_` - called before leaving `` -4. `leave_state` - called before leaving all states -5. `enter_`, `` - called after entering `` -6. `enter_state` - called after entering all states -7. `after_`, `` - called after event named `` -8. `after_event` - called after all events - -Callback execution is further refined with integer indexes, with the syntax `±index`, e.g. `before_CONFIGURE+2`, `enter_CONFIGURED-666`. An expression with no index is assumed to be indexed `+0`. These indexes do not correspond to timestamps, they are discrete labels that allow more granularity in callbacks, ensuring a strict ordering of callback opportunities within a given callback moment. Thus, `before_CONFIGURE+2` will complete execution strictly after `before_CONFIGURE` runs, but strictly before `enter_CONFIGURED-666` is executed. - ### Call roles Call roles represent calls to integrated services. They must contain a `call` @@ -211,8 +197,9 @@ for examples of call roles that reference a variety of integration plugins. #### Workflow hook call structure The state machine callback moments are exposed to the AliECS workflow template interface and can be used as triggers or synchronization points for integration plugin function calls. The `call` block can be used for this purpose, with similar syntax to the `task` block used for controllable tasks. Its fields are as follows. + * `func` - mandatory, it parses as an [`antonmedv/expr`](https://github.com/antonmedv/expr) expression that corresponds to a call to a function that belongs to an integration plugin object (e.g. `bookkeeping.StartOfRun()`, `dcs.EndOfRun()`, etc.). 
-* `trigger` - mandatory, the expression at `func` will be executed once the state machine reaches this moment. +* `trigger` - mandatory, the expression at `func` will be executed once the state machine reaches this moment. For possible values, see [State machine triggers](operation_order.md#state-machine-triggers). * `await` - optional, if absent it defaults to the same as `trigger`, the expression at `func` needs to finish by this moment, and the state machine will block until `func` completes. * `timeout` - optional, Go `time.Duration` expression, defaults to `30s`, the maximum time that `func` should take. The value is provided to the plugin via `varStack["__call_timeout"]` and the plugin should implement a timeout mechanism. The ECS will not abort the call upon reaching the timeout value! * `critical` - optional, it defaults to `true`, if `true` then a failure or timeout for `func` will send the environment state machine to `ERROR`. @@ -425,10 +412,12 @@ Variables whose availability to tasks is handled in some way by AliECS include * variables delivered to tasks explicitly via task templates. The latter can be + * sourced from Apricot with a query from the task template iself (e.g. `config.Get`), or * sourced from the variables available to the current AliECS environment, as defined in the workflow template (e.g. readout-dataflow.yaml) Depending on the specification in the task template (`command.env`, `command.arguments` or `properties`), the push to the given task can happen + * as system environment variables on task startup, * as command line parameters on task startup, or * as (FairMQ) key-values during `CONFIGURE`. 
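Putting the `call` block fields documented in this chapter together, a call role might look like the following sketch. This is an illustration only: the role name and the chosen trigger moment are hypothetical, not taken from an actual workflow template, while `bookkeeping.StartOfRun()` is one of the integration plugin functions mentioned earlier.

```yaml
roles:
  - name: start-of-run-hook            # hypothetical role name
    call:
      func: bookkeeping.StartOfRun()   # integration plugin function to call
      trigger: before_START_ACTIVITY   # run func when this state machine moment is reached
      await: before_START_ACTIVITY     # block the transition until func completes
      timeout: 30s                     # passed to the plugin as varStack["__call_timeout"]
      critical: true                   # failure or timeout sends the environment to ERROR
```

See [ControlWorkflows](https://github.com/AliceO2Group/ControlWorkflows) for real call roles referencing the various integration plugins.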
@@ -456,6 +445,7 @@ In addition to the above, which varies depending on the configuration of the env * `pdp_override_run_start_time` The following values are pushed by AliECS during `STOP_ACTIVITY`: + * `run_end_time_ms` FairMQ task implementors should expect that these values are written to the FairMQ properties map right before the `RUN` and `STOP` transitions via `SetProperty` calls. diff --git a/docs/handbook/index.md b/docs/handbook/introduction.md similarity index 100% rename from docs/handbook/index.md rename to docs/handbook/introduction.md diff --git a/docs/handbook/operation_order.md b/docs/handbook/operation_order.md index fdf05fa7e..50dac7147 100644 --- a/docs/handbook/operation_order.md +++ b/docs/handbook/operation_order.md @@ -1,10 +1,26 @@ # Environment operation order This chapter attempts to document the order of important operations done during environment transitions. -Since AliECS is an evolving system, the information presented here might be out-of-date, thus please refer to event handling in [core/environment/environment.go](https://github.com/AliceO2Group/Control/blob/master/core/environment/environment.go) and plugin calls in [ControlWorkflows/workflows/readout-dataflow.yaml](https://github.com/AliceO2Group/ControlWorkflows/blob/master/workflows/readout-dataflow.yaml) for the ultimate source of truth. +Since AliECS is an evolving system, the information presented here might be out-of-date, thus please refer to event handling in [environment.go](https://github.com/AliceO2Group/Control/blob/master/core/environment/environment.go) and plugin calls in [ControlWorkflows/workflows/readout-dataflow.yaml](https://github.com/AliceO2Group/ControlWorkflows/blob/master/workflows/readout-dataflow.yaml) for the ultimate source of truth. Also, please report to the ECS developers any inaccuracies. -[State Machine Callbacks](configuration.md#State-machine-callbacks) documents the order of callbacks that can be associated with state machine transitions. 
+## State machine triggers + +The underlying state machine library allows us to add callbacks upon entering and leaving states as well as before and after events (transitions). +This is the order of callback execution upon a state transition: + +1. `before_<EVENT>` - called before event named `<EVENT>` +2. `before_event` - called before all events +3. `leave_<STATE>` - called before leaving `<STATE>` +4. `leave_state` - called before leaving all states +5. `enter_<STATE>`, `<STATE>` - called after entering `<STATE>` +6. `enter_state` - called after entering all states +7. `after_<EVENT>`, `<EVENT>` - called after event named `<EVENT>` +8. `after_event` - called after all events + +Callback execution is further refined with integer indexes, with the syntax `±index`, e.g. `before_CONFIGURE+2`, `enter_CONFIGURED-666`. +An expression with no index is assumed to be indexed `+0`. These indexes do not correspond to timestamps, they are discrete labels that allow more granularity in callbacks, ensuring a strict ordering of callback opportunities within a given callback moment. +Thus, `before_CONFIGURE+2` will complete execution strictly after `before_CONFIGURE` runs, but strictly before `enter_CONFIGURED-666` is executed. ## START_ACTIVITY (Start Of Run) @@ -83,116 +99,4 @@ This is the order of actions happening at a healthy end of run. - `"run_end_completion_time_ms"` is set using current time. It is considered as the EOEOR timestamp. - `after_STOP_ACTIVITY` hooks with positive weights (incl. 0) are executed: - `ccdb.RunStop()` at `0` - - `bookkeeping.UpdateRunStop()`, `bookkeeping.UpdateEnv()` at `+100` - -# Integrated service operations - -## DCS - -### DCS operations - -The DCS integration plugin exposes to the workflow template (WFT) context the -following operations. Their associated transitions in this table refer -to the [readout-dataflow](https://github.com/AliceO2Group/ControlWorkflows/blob/master/workflows/readout-dataflow.yaml) workflow template. 
- -| **DCS operation** | **WFT call** | **Call timing** | **Critical** | **Contingent on detector state** | -|-----------------------|---------------------|---------------------------|--------------|----------------------------------| -| Prepare For Run (PFR) | `dcs.PrepareForRun` | during `CONFIGURE` | `false` | yes | -| Start Of Run (SOR) | `dcs.StartOfRun` | early in `START_ACTIVITY` | `true` | yes | -| End Of Run (EOR) | `dcs.EndOfRun` | late in `STOP_ACTIVITY` | `true` | no | - -The DCS integration plugin subscribes to the [DCS service](https://github.com/AliceO2Group/Control/blob/master/core/integration/dcs/protos/dcs.proto) and continually -receives information on operation-state compatibility for all -detectors. -When a given environment reaches a DCS call, the relevant DCS operation -will be called only if the DCS service reports that all detectors in that -environment are compatible with this operation, except EOR, which is -always called. - -### DCS PrepareForRun behaviour - -Unlike SOR and EOR, which are mandatory if `dcs_enabled` is set to `true`, -an impossibility to run PFR or a PFR failure will not prevent the -environment from transitioning forward. - -#### DCS PFR incompatibility - -When `dcs.PrepareForRun` is called, if at least one detector is in a -state that is incompatible with PFR as reported by the DCS service, -a grace period of 10 seconds is given for the detector(s) to become -compatible with PFR, with 1Hz polling frequency. As soon as all -detectors become compatible with PFR, the PFR operation is requested -to the DCS service. - -If the grace period ends and at least one detector -included in the environment is still incompatible with PFR, the PFR -operation will be performed for the PFR-compatible detectors. - -Despite some detectors not having performed PFR, the environment -can still transition forward towards the `RUNNING` state, and any DCS -activities that would have taken place in PFR will instead happen -during SOR. 
Only at that point, if at least one detector is not -compatible with SOR (or if it is but SOR fails), will the environment -declare a failure. - -#### DCS PFR failure - -When `dcs.PrepareForRun` is called, if all detectors are compatible -with PFR as reported by the DCS service (or become compatible during -the grace period), the PFR operation is immediately requested to the -DCS service. - -`dcs.PrepareForRun` call fails if no detectors are PFR-compatible -or PFR fails for all those which were PFR-compatible, -but since it is non-critical the environment may still reach the -`CONFIGURED` state and transition forward towards `RUNNING`. - -As in the case of an impossibility to run PFR, any DCS activities that -would have taken place in PFR will instead be done during SOR. - -### DCS StartOfRun behaviour - -The SOR operation is mandatory if `dcs_enabled` is set to `true` -(AliECS GUI "DCS" switched on). - -#### DCS SOR incompatibility - -When `dcs.StartOfRun` is called, if at least one detector is in a -state that is incompatible with SOR as reported by the DCS service, -or if after a grace period of 10 seconds at least one detector is -still incompatible with SOR, the SOR operation **will not run for any -detector**. - -The environment will then declare a **failure**, the -`START_ACTIVITY` transition will be blocked and the environment -will move to `ERROR`. - -#### DCS SOR failure - -When `dcs.StartOfRun` is called, if all detectors are compatible -with SOR as reported by the DCS service (or become compatible during -the grace period), the SOR operation is immediately requested to the -DCS service. - -If this operation fails for one or more detectors, the -`dcs.StartOfRun` call as a whole is considered to have failed. 
- -The environment will then declare a **failure**, the -`START_ACTIVITY` transition will be blocked and the environment -will move to `ERROR` - -### DCS EndOfRun behaviour - -The EOR operation is mandatory if `dcs_enabled` is set to `true` -(AliECS GUI "DCS" switched on). However, unlike with PFR and SOR, there -is **no check for compatibility** with the EOR operation. The EOR -request will always be sent to the DCS service during `STOP_ACTIVITY`. - -#### DCS EOR failure - -If this operation fails for one or more detectors, the -`dcs.EndOfRun` call as a whole is considered to have failed. - -The environment will then declare a **failure**, the -`STOP_ACTIVITY` transition will be blocked and the environment -will move to `ERROR`. + - `bookkeeping.UpdateRunStop()`, `bookkeeping.UpdateEnv()` at `+100` \ No newline at end of file diff --git a/docs/handbook/overview.md b/docs/handbook/overview.md index 2d3bcb481..d1cf1f9a2 100644 --- a/docs/handbook/overview.md +++ b/docs/handbook/overview.md @@ -1,6 +1,6 @@ # Design Overview -AliECS in a distributed application, using Apache Mesos as toolkit. It integrates a task scheduler component, a purpose-built distributed state machine system, a multi-source stateful process configuration mechanism, and a control plugin and library compatible with any data-driven O2 process. +AliECS is a distributed application, using Apache Mesos as a toolkit. It integrates a task scheduler component, a purpose-built distributed state machine system, a multi-source stateful process configuration mechanism, and a control plugin and library compatible with any data-driven O2 process. ## AliECS Structure @@ -18,14 +18,13 @@ AliECS in a distributed application, using Apache Mesos as toolkit. It integrate | AliECS GUI | Instance of the user-facing web interface for AliECS (`cog`), running on the head node. This is the main entry point for regular users. | | AliECS CLI | The `coconut` command, provided by the package with the same name.
This is the reference client for advanced users and developers. | - ## Resource Management Apache Mesos is a cluster resource management system. It greatly streamlines distributed application development by providing a unified distributed execution environment. Mesos facilitates the management of O²/FLP components, resources and tasks inside the O²/FLP facility, effectively enabling the developer to program against the datacenter (i.e., the O²/FLP facility at LHC Point 2) as if it was a single pool of resources. For AliECS, Mesos acts as an authoritative source of knowledge on the state of the cluster, as well as providing transport facilities for communication between the AliECS core and the executor. -You can view the state of the cluster as presented by Mesos via the Mesos web interface, served on port `5050` of your head node when deployed via the [O²/FLP Suite setup tool](../../installation/). +You can view the state of the cluster as presented by Mesos via the Mesos web interface, served on port `5050` of your head node when deployed via the O²/FLP Suite setup tool. ## FairMQ @@ -40,4 +39,4 @@ The main state machine of AliECS is the environment state machine, which represe ![](AliECS-envsm.svg) -While FairMQ devices use their own, FairMQ-specific state machine, non-FairMQ tasks based on the [OCC library](https://alice-flp-suite.docs.cern.ch/aliecs/occ/) use the same state machine as the AliECS environment state machine, the only difference being that the `START_ACTIVITY` transition is simply `START`, and the `STOP_ACTIVITY` transition is simply `STOP`. +While FairMQ devices use their own, FairMQ-specific state machine, non-FairMQ tasks based on the [OCC library](/occ/README.md) use the same state machine as the AliECS environment state machine, the only difference being that the `START_ACTIVITY` transition is simply `START`, and the `STOP_ACTIVITY` transition is simply `STOP`. 
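The environment-to-task correspondence noted at the end of the overview (only `START_ACTIVITY` and `STOP_ACTIVITY` are renamed for OCC-controlled tasks) can be captured in a short Go sketch; `taskTransition` is a hypothetical helper written for illustration, not part of the AliECS API:

```go
package main

import "fmt"

// taskTransition translates an environment state machine transition to
// the equivalent transition of an OCC-controlled task. Per the overview,
// the two state machines differ only in these two transition names.
func taskTransition(envTransition string) string {
	switch envTransition {
	case "START_ACTIVITY":
		return "START"
	case "STOP_ACTIVITY":
		return "STOP"
	default:
		// All other transitions (e.g. CONFIGURE, RESET) are shared verbatim.
		return envTransition
	}
}

func main() {
	for _, t := range []string{"CONFIGURE", "START_ACTIVITY", "STOP_ACTIVITY", "RESET"} {
		fmt.Printf("environment %s -> task %s\n", t, taskTransition(t))
	}
}
```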
diff --git a/docs/kafka.md b/docs/kafka.md index 53d3bd56d..292900813 100644 --- a/docs/kafka.md +++ b/docs/kafka.md @@ -6,7 +6,8 @@ As of 2024 the AliECS core integrates Kafka producer functionality independent o ### Making sure that AliECS sends messages -To enable the plugin, one should make sure that the following points are fullfiled. +To enable the plugin, one should make sure that the following points are fulfilled. + * The consul instance includes coordinates to the list of kafka brokers. Navigate to `o2/components/aliecs/ANY/any/settings` and make sure the following key value pairs are there: ``` @@ -24,7 +25,7 @@ Once the topics exist, no further messages can be lost and no action is necessar ### Currently available topics -See [events.proto](../common/protos/events.proto) for the protobuf definitions of the messages. +See [events.proto](/common/protos/events.proto) for the protobuf definitions of the messages. * `aliecs.core` - core events that don't concern a specific environment or task * `aliecs.environment` - events that concern an environment, e.g. environment state changes @@ -38,7 +39,7 @@ See [events.proto](../common/protos/events.proto) for the protobuf definitions o ### Decoding the messages -Messages are encoded with protobuf, with the aforementioned [events.proto](../common/protos/events.proto) file defining the schema. +Messages are encoded with protobuf, with the aforementioned [events.proto](/common/protos/events.proto) file defining the schema. Integrated service messages include a payload portion that is usually JSON-encoded, and has no predefined schema. To generate the precompiled protobuf interface, run `make fdset`. @@ -57,7 +58,8 @@ The messages are encoded with protobuf. ### Making sure that AliECS sends messages -To enable the plugin, one should make sure that the following points are fullfiled. +To enable the plugin, one should make sure that the following points are fulfilled.
+ * The consul instance includes coordinates to your kafka broker and enables the plugin. Navigate to `o2/components/aliecs/ANY/any/settings` and make sure the following key value pairs are there: ``` @@ -79,11 +81,12 @@ As for today, AliECS publishes on the following types of topics: ### Decoding the messages -Messages are encoded with protobuf. Please use [this](../core/integration/kafka/protos/kafka.proto) proto file to generate code which deserializes the messages. +Messages are encoded with protobuf. Please use [this](/core/integration/kafka/protos/kafka.proto) proto file to generate code which deserializes the messages. ### Getting Start of Run and End of Run notifications To get SOR and EOR notifications, please subscribe to the two corresponding topics: + * `aliecs.env_state.RUNNING` for Start of Run * `aliecs.env_leave_state.RUNNING` for End of Run diff --git a/docs/metrics.md b/docs/metrics.md index a925513f3..799b88a64 100644 --- a/docs/metrics.md +++ b/docs/metrics.md @@ -59,7 +59,7 @@ ECS and aliecs.run are values. 6) space - divides fields and timestamp 7) timestamp - (optional) int64 value of unix timestamp in ns -In order to provide support for this format we introduced Metric structure in [common/monitoring/metric.go](../common/monitoring/metric.go). +In order to provide support for this format we introduced Metric structure in [common/monitoring/metric.go](https://github.com/AliceO2Group/Control/blob/master/common/monitoring/metric.go). Following code shows how to create a Metric with `measurement` as measurement name, one tag `tag1=val1` and field `field1=42u`: @@ -70,7 +70,7 @@ m.SetFieldUInt64("field1", 42) ``` However we also need to be able to store metrics, so these can be scraped correctly. -This mechanism is implemented in [common/monitoring/monitoring.go](../common/monitoring/monitoring.go). 
+This mechanism is implemented in [common/monitoring/monitoring.go](https://github.com/AliceO2Group/Control/blob/master/common/monitoring/monitoring.go). Metrics endpoint is run by calling `Run(port, endpointName)`. As this method is blocking it is advised to call it from `goroutine`. After this method is called we can than send metrics via methods `Send` and `SendHistogrammable`. If you want to send simple metrics @@ -92,7 +92,7 @@ Example for this use-case is duration of some function, eg. measure sending batch of messages. If we want the best coverage of metrics possible we can combine both of these to measure amount of messages send per batch and also measurement duration of the send. For example in code you can take a look actual -actual code in [writer.go](../common/event/writer.go) where we are sending multiple +actual code in [writer.go](https://github.com/AliceO2Group/Control/blob/master/common/event/writer.go) where we are sending multiple fields per metric and demonstrate full potential of these metrics. Previous code example will result in following metrics to be reported: @@ -203,7 +203,7 @@ if different points, but creating statistical report as mentioned in previous pa ### Event loop In order to send metrics from unlimited amount of goroutines, we need to have -robust and thread-safe mechanism. It is implemented in [common/monitoring/monitoring.go](../common/monitoring/monitoring.go) +robust and thread-safe mechanism. It is implemented in [common/monitoring/monitoring.go](https://github.com/AliceO2Group/Control/blob/master/common/monitoring/monitoring.go) as event loop (`eventLoop`) that reads data from two buffered channels (`metricsChannel` and `metricsHistosChannel`) with one goroutine. Apart from reading messages from these two channels event loop also handles scraping requests from `http.Server` endpoint. As the http endpoint is called by a @@ -219,8 +219,8 @@ which are consumed by event loop. 
In order to correctly implement behaviour described in the part about Aggregation we use the same implementation in two container aggregating objects `MetricsAggregate`, `MetricsReservoirSampling` implemented in files -[common/monitoring/metricsaggregate.go](../common/monitoring/metricsaggregate.go) -and [metricsreservoirsampling.go](../common/monitoring/metricsreservoirsampling.go) +[common/monitoring/metricsaggregate.go](https://github.com/AliceO2Group/Control/blob/master/common/monitoring/metricsaggregate.go) +and [metricsreservoirsampling.go](https://github.com/AliceO2Group/Control/blob/master/common/monitoring/metricsreservoirsampling.go) in the same directory. The implementation is done as different buckets in map with distinct keys (`metricsBuckets`). These keys need to be unique according to the timestamp and tags. We use struct `key` composed diff --git a/docs/running.md b/docs/running.md index 840c89aab..53d80afab 100644 --- a/docs/running.md +++ b/docs/running.md @@ -1,14 +1,14 @@ # Running AliECS as a developer - -> **WARNING**: The running instructions described in this page are **for development purposes only**. Users interested in deploying, running and controlling O²/FLP software or their own software with AliECS should refer to the [O²/FLP Suite instructions](https://alice-flp-suite.docs.cern.ch/installation/) instead. - +> **WARNING**: The running instructions described in this page are **for development purposes only**. Users interested in deploying, running and controlling O²/FLP software or their own software with AliECS should refer to the O²/FLP Suite instructions instead. ## Running the AliECS core This part assumes you have already set up the Go environment, fetched the sources and built all AliECS Go components. -The recommended way to set up a Mesos cluster is by performing a complete deployment of the O²/FLP Suite with `o2-flp-setup`. 
The AliECS core on the head node should be stopped (`systemctl stop o2-aliecs-core`) and your own AliECS core should be made to point to the head node. +The recommended way to set up a Mesos cluster is by performing a complete deployment of the O²/FLP Suite with `o2-flp-setup`. +The AliECS core on the head node should be stopped (`systemctl stop o2-aliecs-core`) and your own AliECS core should be made to point to the head node. +Typically, it can be done by replacing the AliECS core binary on the head node with your own and restarting the `o2-aliecs-core` systemd service. The following example flags assume a remote head node `centosvmtest`, the use of the default `settings.yaml` file, very verbose output, verbose workflow dumps on every workflow deployment, and the executor having been copied (`scp`) to `/opt/o2control-executor` on all controlled nodes: @@ -26,7 +26,7 @@ http://centosvmtest:5050/api/v1/scheduler --dumpWorkflows ``` -See [Using `coconut`](./coconut/README.md) for instructions on the O² Control core command line interface. +See [Using `coconut`](/coconut/README.md) for instructions on the O² Control core command line interface. # Running AliECS in production diff --git a/docs/using_grpcc_occ.md b/docs/using_grpcc_occ.md index 68e535aee..8a26c5b11 100644 --- a/docs/using_grpcc_occ.md +++ b/docs/using_grpcc_occ.md @@ -8,7 +8,7 @@ installation with `npm` is straightforward. 
```bash $ sudo yum install http-parser nodejs npm -$ npm install -g grpcc +$ npm install -g grpcc # it can take a few minutes due to grpc build ``` In a new terminal, we go to the `occ` directory (not the `build` dir) and connect via gRPC: diff --git a/hacking/COG.md b/hacking/COG.md index 0bb5aad7b..21ca826e8 100644 --- a/hacking/COG.md +++ b/hacking/COG.md @@ -1,4 +1,4 @@ -# AliECS GUI +# AliECS GUI overview If you are using the [Single node O²/FLP software deployment instructions](https://gitlab.cern.ch/AliceO2Group/system-configuration/blob/master/ansible/docs/O2_INSTALL_FLP_STANDALONE.md), the AliECS GUI is automatically installed along with the full O²/FLP suite. @@ -44,11 +44,11 @@ In production, AliECS will manage and push all configuration to active tasks, bu Every task still has their own configuration file, with paths such as `/etc/flp.d/qc/*.json` for QualityControl and `/home/flp/readout.cfg` for Readout. These paths can be edited by the user, and any changes affect all newly launched instances of the task. -All configuration file paths used by tasks can be found in the task descriptors of the workflow configuration repository in use. For more information on workflow configuration repositories, see [the `coconut repository` reference](https://github.com/AliceO2Group/Control/blob/doc/coconut/doc/coconut_repository.md). The default workflow configuration repository which comes pre-loaded with AliECS is accessible at [AliceO2Group/ControlWorkflows](https://github.com/AliceO2Group/ControlWorkflows) (all task descriptor files are found in the `tasks` directory). +All configuration file paths used by tasks can be found in the task descriptors of the workflow configuration repository in use. For more information on workflow configuration repositories, see [the `coconut repository` reference](../coconut/doc/coconut_repository.md). 
The default workflow configuration repository which comes pre-loaded with AliECS is accessible at [AliceO2Group/ControlWorkflows](https://github.com/AliceO2Group/ControlWorkflows) (all task descriptor files are found in the `tasks` directory). * **modify an existing workflow or task?** -You are free to keep as many workflow configuration repositories as you wish in your AliECS instance. For more information on workflow configuration repositories, see [the `coconut repository` reference](https://github.com/AliceO2Group/Control/blob/doc/coconut/doc/coconut_repository.md). +You are free to keep as many workflow configuration repositories as you wish in your AliECS instance. For more information on workflow configuration repositories, see [the `coconut repository` reference](../coconut/doc/coconut_repository.md). Changes to a configuration repository are immediately available after running `coconut repo refresh`. There is no support in the AliECS GUI at this time. diff --git a/mkdocs-dev.yml b/mkdocs-dev.yml new file mode 100644 index 000000000..6f5db4aeb --- /dev/null +++ b/mkdocs-dev.yml @@ -0,0 +1,12 @@ +site_name: aliecs-dev + +nav: + - 'Contributing': '/docs/CONTRIBUTING.md' + - 'Development': '/docs/development.md' + - 'Package pkg.go.dev': 'https://pkg.go.dev/github.com/AliceO2Group/Control' + - 'Building': '/docs/building.md' + - 'Makefile reference': '/docs/makefile_reference.md' + - 'Component Configuration': '/docs/handbook/appconfiguration.md' + - 'Running': '/docs/running.md' + - 'Metrics': '/docs/metrics.md' + - 'OCC API debugging': '/docs/using_grpcc_occ.md' diff --git a/mkdocs.yml b/mkdocs.yml index edcf58f51..35ae83ae8 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -2,31 +2,29 @@ site_name: aliecs nav: - 'Handbook': - - Introduction: handbook/index.md - - Overview: handbook/overview.md - - Basic Concepts: handbook/concepts.md - - Environment Operation Order: handbook/operation_order.md - - Configuration: - - Workflow Configuration:
handbook/configuration.md - - Apricot Usage: './apricot/docs/apricot.md' - - Apricot HTTP API: './apricot/docs/apricot_http_service.md' - - Interfaces: - - AliECS gRPC API: apidocs_aliecs.md - - Apricot gRPC API: apidocs_apricot.md - - OCC gRPC API (Protobuf based): apidocs_occ.md - - Kafka: kafka.md - - Workflow Variables: '../controlworkflows/README.md' - - 'Command Reference': - - coconut: - - Overview: './coconut/README.md' - - coconut: './coconut/doc/coconut.md' - - coconut about: './coconut/doc/coconut_about.md' - - coconut configuration: './coconut/doc/coconut_configuration.md' - - coconut environment: './coconut/doc/coconut_environment.md' - - coconut info: './coconut/doc/coconut_info.md' - - coconut repository: './coconut/doc/coconut_repository.md' - - coconut role: './coconut/doc/coconut_role.md' - - coconut task: './coconut/doc/coconut_task.md' - - coconut template: './coconut/doc/coconut_template.md' - - peanut: './occ/peanut/README.md' - - 'FAQ': faq.md + - 'Introduction': 'docs/handbook/introduction.md' + - 'Basic Concepts': 'docs/handbook/concepts.md' + - 'Design Overview': 'docs/handbook/overview.md' + - 'Workflow and Task Config.': 'docs/handbook/configuration.md' + - 'Environment Operation Order': 'docs/handbook/operation_order.md' + - 'Workflow Variables': '../controlworkflows' + - 'Component reference': + - 'AliECS GUI': 'hacking/COG.md' + - 'AliECS core': + - 'Integrated Services': 'core/integration/README.md' + - 'Protocol': 'docs/apidocs_aliecs.md' + - 'coconut': + - 'Overview': 'coconut/README.md' + - 'Command Ref.': 'coconut/doc/coconut.md' + - 'apricot': + - 'Overview': 'apricot/README.md' + - 'HTTP Service': 'apricot/docs/apricot_http_service.md' + - 'Protocol': 'docs/apidocs_apricot.md' + - 'Command Ref.': 'apricot/docs/apricot.md' + - 'occ': + - 'Overview': 'occ/README.md' + - 'Example': 'occ/occlib/examples/dummy-process/README.md' + - 'Protocol': 'docs/apidocs_occ.md' + - 'peanut': + - 'Overview': 'occ/peanut/README.md' + - 'Event 
Service': 'docs/kafka.md' diff --git a/occ/README.md b/occ/README.md index 330a370ec..8e173441b 100644 --- a/occ/README.md +++ b/occ/README.md @@ -9,9 +9,9 @@ For stateful tasks that do not use FairMQ, the OCC interface is implemented by t ## Developer quick start instructions for OCClib 1. Build & install the OCC library either manually or via aliBuild (`Control-OCCPlugin`); -2. check out [the dummy process example](occlib/examples/dummy-process) and [its entry point](occlib/examples/dummy-process/main.cxx) and to see how to instantiate OCC; -3. implement interface at [`occlib/RuntimeControlledObject.h`](occlib/RuntimeControlledObject.h), -4. link your non-FairMQ O² process against the target `AliceO2::Occ` as described in [the dummy process README](occlib/examples/dummy-process/README.md#standalone-build). +2. check out [the dummy process example](https://github.com/AliceO2Group/Control/blob/master/occ/occlib/examples/dummy-process) and [its entry point](https://github.com/AliceO2Group/Control/blob/master/occ/occlib/examples/dummy-process/main.cxx) and to see how to instantiate OCC; +3. implement interface at [`occlib/RuntimeControlledObject.h`](https://github.com/AliceO2Group/Control/blob/master/occ/occlib/RuntimeControlledObject.h), +4. link your non-FairMQ O² process against the target `AliceO2::Occ` as described in [the dummy process README](https://github.com/AliceO2Group/Control/blob/master/occ/occlib/examples/dummy-process/README.md#standalone-build). ## Manual build instructions Starting from the `occ` directory. 
diff --git a/occ/occlib/examples/dummy-process/README.md b/occ/occlib/examples/dummy-process/README.md index 905a427d1..9324c7915 100644 --- a/occ/occlib/examples/dummy-process/README.md +++ b/occ/occlib/examples/dummy-process/README.md @@ -6,7 +6,7 @@ For instructions on running it, see [Run example](../../../README.md#run-example ## Standalone build -For guidelines on building the example as a standalone project, see [CMakeLists.txt.example](CMakeLists.txt.example). +For guidelines on building the example as a standalone project, see [CMakeLists.txt.example](https://github.com/AliceO2Group/Control/blob/master/occ/occlib/examples/dummy-process/CMakeLists.txt.example). Dependencies in aliBuild: diff --git a/occ/peanut/README.md b/occ/peanut/README.md index b6d488e3f..61f72815b 100644 --- a/occ/peanut/README.md +++ b/occ/peanut/README.md @@ -1,4 +1,4 @@ -# `peanut` +# Process execution and control utility overview `peanut` is the **p**rocess **e**xecution **a**nd co**n**trol **ut**ility for OCClib-based O² processes. Its purpose is to be a debugging and development aid for non-FairMQ O² devices, where FairMQ's interactive