diff --git a/.github/workflows/linkspector.yml b/.github/workflows/linkspector.yml new file mode 100644 index 000000000..75d766990 --- /dev/null +++ b/.github/workflows/linkspector.yml @@ -0,0 +1,15 @@ +name: Linkspector +on: [pull_request] +jobs: + check-links: + name: runner / linkspector + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - name: Run linkspector + uses: umbrelladocs/action-linkspector@v1 + with: + github_token: ${{ secrets.github_token }} + reporter: github-pr-check + fail_level: any + filter_mode: added diff --git a/Makefile b/Makefile index 8a3a5deee..483758d5f 100644 --- a/Makefile +++ b/Makefile @@ -313,7 +313,7 @@ docs/coconut: docs/grpc: @echo -e "generating gRPC API documentation \033[1;33m==>\033[0m \033[1;34m./docs\033[0m" @cd apricot/protos && PATH="$(ROOT_DIR)/tools:$$PATH" protoc --doc_out="$(ROOT_DIR)/docs" --doc_opt=markdown,apidocs_apricot.md "apricot.proto" - @cd core/protos && PATH="$(ROOT_DIR)/tools:$$PATH" protoc -I=. -I=../../common --doc_out="$(ROOT_DIR)/docs" --doc_opt=markdown,apidocs_aliecs.md "o2control.proto" + @cd core/protos && PATH="$(ROOT_DIR)/tools:$$PATH" protoc -I=. -I=../../common --doc_out="$(ROOT_DIR)/docs" --experimental_allow_proto3_optional --doc_opt=markdown,apidocs_aliecs.md o2control.proto ../../common/protos/events.proto ../../common/protos/common.proto @cd occ/protos && PATH="$(ROOT_DIR)/tools:$$PATH" protoc --doc_out="$(ROOT_DIR)/docs" --doc_opt=markdown,apidocs_occ.md "occ.proto" docs/swaggo: diff --git a/README.md b/README.md index c72abca5b..79d3ca6f3 100644 --- a/README.md +++ b/README.md @@ -2,54 +2,193 @@ [](https://godoc.org/github.com/AliceO2Group/Control) # AliECS -The ALICE Experiment Control System +The ALICE Experiment Control System (**AliECS**) is the piece of software to drive and control data taking activities in the experiment. +It is a distributed system that combines state of the art cluster resource management and experiment control functionalities into a single comprehensive solution. -## Install instructions +Please refer to the [CHEP 2023 paper](https://doi.org/10.1051/epjconf/202429502027) for the latest design overview. -What is your use case? +## How to get started -* I want to **run AliECS** and other O²/FLP software +Regardless of your particular interests, it is recommended to get acquainted with the main [AliECS concepts](docs/handbook/concepts.md). - :arrow_right: [O²/FLP Suite deployment instructions](https://alice-flp.docs.cern.ch/system-configuration/utils/o2-flp-setup/) +After that, please find your concrete use case: - These instructions apply to both single-node and multi-node deployments. +### I want to **run AliECS** and other O²/FLP software - Contact [alice-o2-flp-support](mailto:alice-o2-flp-support@cern.ch) for assistance with provisioning and deployment. - -* I want to ensure AliECS can **run and control my process** +See [O²/FLP Suite deployment instructions](https://alice-flp.docs.cern.ch/system-configuration/utils/o2-flp-setup/) - * My software is based on FairMQ and/or O² DPL - - :palm_tree: Nothing to do, AliECS natively supports FairMQ (and DPL) devices. - - * My software does not use FairMQ and/or DPL, but should be controlled through a state machine - - :telescope: See [the OCC documentation](occ/README.md) to learn how to integrate the O² Control and Configuration library with your software. [Readout](https://github.com/AliceO2Group/Readout) is currently the only example of this setup. 
- - * My software is a command line utility with no state machine - - :palm_tree: Nothing to do, AliECS natively supports generic commands. Make sure the task template for your command sets the control mode to `basic` ([see example](https://github.com/AliceO2Group/ControlWorkflows/blob/basic-tasks/tasks/sleep.yaml)). - -* I want to build and run AliECS for **development** purposes +These instructions apply to both single-node and multi-node deployments. +Contact [alice-o2-flp-support](mailto:alice-o2-flp-support@cern.ch) for assistance with provisioning and deployment. - :hammer_and_wrench: [Building instructions](https://alice-flp.docs.cern.ch/aliecs/building/) - - :arrow_right: [Running instructions](https://alice-flp.docs.cern.ch/aliecs/running/) +There are two ways of interacting with AliECS: -* I want to communicate with AliECS via one of the plugins - - * [Receive updates on running environments via Kafka](docs/kafka.md) +- The AliECS GUI (a.k.a. Control GUI, COG) - not in this repository, but included in most deployments, recommended -## Using AliECS + :arrow_right: [AliECS GUI documentation](hacking/COG.md) -There are two ways of interacting with AliECS: - -* The AliECS GUI - not in this repository, but included in most deployments, recommended +- `coconut` - the command-line control and configuration utility, included with AliECS core, typically for developers and advanced users + + :arrow_right: [Using `coconut`](https://alice-flp.docs.cern.ch/aliecs/coconut/) - :arrow_right: [AliECS GUI documentation](hacking/COG.md) + :arrow_right: [`coconut` command reference](https://alice-flp.docs.cern.ch/aliecs/coconut/doc/coconut/) -* `coconut` - the command-line control and configuration utility, included with AliECS core - - :arrow_right: [Using `coconut`](https://alice-flp.docs.cern.ch/aliecs/coconut/) +### I want to ensure AliECS can **run and control my process** +* **My software is based on FairMQ and/or O² DPL (Data Processing Layer)** + + AliECS natively supports FairMQ (and DPL) devices. + Head to [ControlWorkflows](https://github.com/AliceO2Group/ControlWorkflows) for instructions on how to configure your software to be controlled by AliECS. + +* **My software does not use FairMQ and/or DPL, but should be controlled through a state machine** + + See [the OCC documentation](occ/README.md) to learn how to integrate the O² Control and Configuration library with your software. [Readout](https://github.com/AliceO2Group/Readout) is an example of this setup. + + Once ready, head to [ControlWorkflows](https://github.com/AliceO2Group/ControlWorkflows) for instructions on how to configure it to be controlled by AliECS. + +* **My software is a command line utility with no state machine** + + AliECS natively supports generic commands. + Head to [ControlWorkflows](https://github.com/AliceO2Group/ControlWorkflows) for instructions on having your command run by AliECS. + Make sure the task template for your command sets the control mode to `basic` ([see example](https://github.com/AliceO2Group/ControlWorkflows/blob/master/tasks/o2-roc-cleanup.yaml)). A minimal sketch of such a task template is shown after this list.
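+
+For illustration, a minimal sketch of such a `basic`-mode task template is shown below. The field names follow the task template structure described in this repository's handbook and the linked ControlWorkflows examples, but the task name, command and resource values are hypothetical placeholders; refer to the linked example for the authoritative format.
+
+```yaml
+# Hypothetical task template (not an actual ControlWorkflows file), sketching a
+# generic command run by AliECS in basic control mode, i.e. without a state machine.
+name: my-cleanup
+control:
+  mode: basic          # AliECS simply runs the command, no state machine driving
+wants:
+  cpu: 0.01            # illustrative resource request
+  memory: 128
+command:
+  shell: true
+  value: "/usr/local/bin/my-cleanup.sh"   # hypothetical executable
+  arguments: []
+```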
- :arrow_right: [`coconut` command reference](https://alice-flp.docs.cern.ch/aliecs/coconut/doc/coconut/) +### I want to develop AliECS + +:hammer_and_wrench: Welcome to the team, please head to [contributing instructions](/docs/CONTRIBUTING.md) + +### I want to receive updates about environments or services controlled by AliECS + +:pager: Learn more about the [kafka event service](/docs/kafka.md) + +### I want my application to send requests to AliECS + +:scroll: See the API docs of AliECS components: + +- [core gRPC server](/docs/apidocs_aliecs.md) +- [apricot gRPC server](/docs/apidocs_apricot.md) +- [apricot HTTP server](/apricot/docs/apricot_http_service.md) + +### I want my service to be sent requests by AliECS + +:electric_plug: Learn more about the [plugin system](/core/integration/README.md) + +## Table of Contents + +* Introduction + * [Basic Concepts](/docs/handbook/concepts.md#basic-concepts) + * [Tasks](/docs/handbook/concepts.md#tasks) + * [Workflows, roles and environments](/docs/handbook/concepts.md#workflows-roles-and-environments) + * [Design Overview](/docs/handbook/overview.md#design-overview) + * [AliECS Structure](/docs/handbook/overview.md#aliecs-structure) + * [Resource Management](/docs/handbook/overview.md#resource-management) + * [FairMQ](/docs/handbook/overview.md#fairmq) + * [State machines](/docs/handbook/overview.md#state-machines) + +* Component reference + * AliECS GUI + * [AliECS GUI overview](/hacking/COG.md) + * AliECS core + * [Workflow Configuration](/docs/handbook/configuration.md#workflow-configuration) + * [The AliECS workflow template language](/docs/handbook/configuration.md#the-aliecs-workflow-template-language) + * [Workflow template structure](/docs/handbook/configuration.md#workflow-template-structure) + * [Task roles](/docs/handbook/configuration.md#task-roles) + * [Call roles](/docs/handbook/configuration.md#call-roles) + * [Aggregator roles](/docs/handbook/configuration.md#aggregator-roles) + * [Iterator roles](/docs/handbook/configuration.md#iterator-roles) + * [Include roles](/docs/handbook/configuration.md#include-roles) + * [Template expressions](/docs/handbook/configuration.md#template-expressions) + * [Task Configuration](/docs/handbook/configuration.md#task-configuration) + * [Task template structure](/docs/handbook/configuration.md#task-template-structure) + * [Variables pushed to controlled tasks](/docs/handbook/configuration.md#variables-pushed-to-controlled-tasks) + * [Resource wants and limits](/docs/handbook/configuration.md#resource-wants-and-limits) + * [Integration plugins](/core/integration/README.md#integration-plugins) + * [Plugin system overview](/core/integration/README.md#plugin-system-overview) + * [Integrated service operations](/core/integration/README.md#integrated-service-operations) + * [Bookkeeping](/core/integration/README.md#bookkeeping) + * [CCDB](/core/integration/README.md#ccdb) + * [DCS](/core/integration/README.md#dcs) + * [DCS operations](/core/integration/README.md#dcs-operations) + * [DCS PrepareForRun behaviour](/core/integration/README.md#dcs-prepareforrun-behaviour) + * [DCS StartOfRun behaviour](/core/integration/README.md#dcs-startofrun-behaviour) + * [DCS EndOfRun behaviour](/core/integration/README.md#dcs-endofrun-behaviour) + * [DD Scheduler](/core/integration/README.md#dd-scheduler) + * [Kafka (legacy)](/core/integration/README.md#kafka-legacy) + * [ODC](/core/integration/README.md#odc) + * [Test plugin](/core/integration/README.md#test-plugin) + * 
[Trigger](/core/integration/README.md#trigger) + * [Environment operation order](/docs/handbook/operation_order.md#environment-operation-order) + * [State machine triggers](/docs/handbook/operation_order.md#state-machine-triggers) + * [START_ACTIVITY (Start Of Run)](/docs/handbook/operation_order.md#start_activity-start-of-run) + * [STOP_ACTIVITY (End Of Run)](/docs/handbook/operation_order.md#stop_activity-end-of-run) + * [Protocol documentation](/docs/apidocs_aliecs.md) + * coconut + * [The O² control and configuration utility overview](/coconut/README.md#the-o-control-and-configuration-utility-overview) + * [Configuration file](/coconut/README.md#configuration-file) + * [Using coconut](/coconut/README.md#using-coconut) + * [Creating an environment](/coconut/README.md#creating-an-environment) + * [Controlling an environment](/coconut/README.md#controlling-an-environment) + * [Command reference](/coconut/doc/coconut.md) + * apricot + * [ALICE configuration service overview](/apricot/README.md#alice-configuration-service-overview) + * [HTTP service](/apricot/docs/apricot_http_service.md#apricot-http-service) + * [Configuration](/apricot/docs/apricot_http_service.md#configuration) + * [Usage and options](/apricot/docs/apricot_http_service.md#usage-and-options) + * [Examples](/apricot/docs/apricot_http_service.md#examples) + * [Protocol documentation](/docs/apidocs_apricot.md) + * [Command reference](/apricot/docs/apricot.md) + * occ + * [O² Control and Configuration Components](/occ/README.md#o-control-and-configuration-components) + * [Developer quick start instructions for OCClib](/occ/README.md#developer-quick-start-instructions-for-occlib) + * [Manual build instructions](/occ/README.md#manual-build-instructions) + * [Run example](/occ/README.md#run-example) + * [The OCC state machine](/occ/README.md#the-occ-state-machine) + * [Single process control with peanut](/occ/README.md#single-process-control-with-peanut) + * [OCC API debugging with grpcc](/occ/README.md#occ-api-debugging-with-grpcc) + * [Dummy process example for OCC library](/occ/occlib/examples/dummy-process/README.md#dummy-process-example-for-occ-library) + * [Protocol documentation](/docs/apidocs_occ.md) + * peanut + * [Process control and execution utility overview](/occ/peanut/README.md) + * Event service + * [Kafka producer functionality in AliECS core](/docs/kafka.md#kafka-producer-functionality-in-aliecs-core) + * [Making sure that AliECS sends messages](/docs/kafka.md#making-sure-that-aliecs-sends-messages) + * [Currently available topics](/docs/kafka.md#currently-available-topics) + * [Decoding the messages](/docs/kafka.md#decoding-the-messages) + * [Legacy events: Kafka plugin](/docs/kafka.md#legacy-events-kafka-plugin) + * [Making sure that AliECS sends messages](/docs/kafka.md#making-sure-that-aliecs-sends-messages-1) + * [Currently available topics](/docs/kafka.md#currently-available-topics-1) + * [Decoding the messages](/docs/kafka.md#decoding-the-messages-1) + * [Getting Start of Run and End of Run notifications](/docs/kafka.md#getting-start-of-run-and-end-of-run-notifications) + * [Using Kafka debug tools](/docs/kafka.md#using-kafka-debug-tools) + +* Developer documentation + * [Contributing](/docs/CONTRIBUTING.md) + * [Package pkg.go.dev documentation](https://pkg.go.dev/github.com/AliceO2Group/Control) + * [Building AliECS](/docs/building.md#building-aliecs) + * [Overview](/docs/building.md#overview) + * [Building with aliBuild](/docs/building.md#building-with-alibuild) + * [Manual 
build](/docs/building.md#manual-build) + * [Go environment](/docs/building.md#go-environment) + * [Clone and build (Go components only)](/docs/building.md#clone-and-build-go-components-only) + * [Makefile reference](/docs/makefile_reference.md) + * [Component Configuration](/docs/handbook/appconfiguration.md#component-configuration) + * [Apache Mesos](/docs/handbook/appconfiguration.md#apache-mesos) + * [Connectivity to controlled nodes](/docs/handbook/appconfiguration.md#connectivity-to-controlled-nodes) + * [Running AliECS as a developer](/docs/running.md#running-aliecs-as-a-developer) + * [Running the AliECS core](/docs/running.md#running-the-aliecs-core) + * [Running AliECS in production](/docs/running.md#running-aliecs-in-production) + * [Health checks](/docs/running.md#health-checks) + * [Development Information](/docs/development.md#development-information) + * [Release Procedure](/docs/development.md#release-procedure) + * [Metrics in ECS](/docs/metrics.md#metrics-in-ecs) + * [Overview and simple usage](/docs/metrics.md#overview-and-simple-usage) + * [Types and aggregation of metrics](/docs/metrics.md#types-and-aggregation-of-metrics) + * [Metric types](/docs/metrics.md#metric-types) + * [Aggregation](/docs/metrics.md#aggregation) + * [Implementation details](/docs/metrics.md#implementation-details) + * [Event loop](/docs/metrics.md#event-loop) + * [Hashing to aggregate](/docs/metrics.md#hashing-to-aggregate) + * [Sampling reservoir](/docs/metrics.md#sampling-reservoir) + * [OCC API debugging with grpcc](/docs/using_grpcc_occ.md#occ-api-debugging-with-grpcc) + +* Resources + * T. Mrnjavac et. al, [AliECS: A New Experiment Control System for the ALICE Experiment](https://doi.org/10.1051/epjconf/202429502027), CHEP23 + diff --git a/apricot/README.md b/apricot/README.md index 243b580d1..f3a63b6e5 100644 --- a/apricot/README.md +++ b/apricot/README.md @@ -1,14 +1,10 @@ -# `APRICOT` +# ALICE configuration service overview -**A** **p**rocessor and **r**epos**i**tory for **co**nfiguration **t**emplates +**A** **p**rocessor and **r**epos**i**tory for **co**nfiguration **t**emplates, or apricot, implements the configuration service for the ALICE data taking activities. +It adds templating, load balancing and caching on top of the configuration store. -The `o2-apricot` binary implements a centralized configuration (micro)service for ALICE O². 
+See also: -``` -Usage of bin/o2-apricot: - --backendUri string URI of the Consul server or YAML configuration file (default "consul://127.0.0.1:8500") - --listenPort int Port of apricot server (default 32101) - --verbose Verbose logging -``` - -Protofile: [apricot.proto](apricot/protos/apricot.proto) +* [apricot HTTP service](docs/apricot_http_service.md) - make essential cluster information available via a web server +* Protofile: [apricot.proto](protos/apricot.proto) +* [Command reference](docs/apricot.md) diff --git a/apricot/docs/apricot.md b/apricot/docs/apricot.md index e0448f6d8..43fd6027b 100644 --- a/apricot/docs/apricot.md +++ b/apricot/docs/apricot.md @@ -13,8 +13,4 @@ Usage of bin/o2-apricot: --backendUri string URI of the Consul server or YAML configuration file (default "consul://127.0.0.1:8500") --listenPort int Port of apricot server (default 32101) --verbose Verbose logging -``` - -### SEE ALSO - -* [apricot HTTP service](apricot_http_service.md) - make essential cluster information available via a web server +``` \ No newline at end of file diff --git a/apricot/docs/apricot_http_service.md b/apricot/docs/apricot_http_service.md index 15dd91a97..fd4609736 100644 --- a/apricot/docs/apricot_http_service.md +++ b/apricot/docs/apricot_http_service.md @@ -46,4 +46,4 @@ Besides configuration retrieval, the API also includes calls for browsing the co Getting a template-processed configuration payload for a component (entry `tpc-full-qcmn` for component `qc`, with `list_of_detectors` and `run_type` passed as template variables): * In a browser: `http://localhost:32188/components/qc/ANY/any/tpc-full-qcmn?process=true&list_of_detectors=tpc,its&run_type=PHYSICS` -* With `curl`: `curl http://127.0.0.1:32188/components/qc/ANY/any/tpc-full-qcmn\?process\=true\&list_of_detectors\=tpc,its\&run_type\=PHYSICS` \ No newline at end of file +* With `curl`: `curl http://127.0.0.1:32188/components/qc/ANY/any/tpc-full-qcmn\?process\=true\&list_of_detectors\=tpc,its\&run_type\=PHYSICS` diff --git a/coconut/README.md b/coconut/README.md index f20c2bfa6..086de07f0 100644 --- a/coconut/README.md +++ b/coconut/README.md @@ -1,4 +1,4 @@ -# `coconut` - the O² control and configuration utility +# The O² control and configuration utility overview The O² **co**ntrol and **con**figuration **ut**ility is a command line program for interacting with the AliECS core. @@ -98,6 +98,7 @@ A valid workflow template (sometimes called simply "workflow" for brevity) must Workflows and tasks are managed with a git based configuration system, so the workflow template may be provided simply by name or with repository and branch/tag/hash constraints. 
Examples: + * `coconut env create -w myworkflow` - loads workflow `myworkflow` from default configuration repository at HEAD of master branch * `coconut env create -w github.com/AliceO2Group/MyConfRepo/myworkflow` - loads a workflow from a specific git repository, HEAD of master branch * `coconut env create -w myworkflow@rev` - loads a workflow from default repository, on branch, tag or revision `rev` diff --git a/coconut/doc/coconut_environment_create.md b/coconut/doc/coconut_environment_create.md index ce39b7349..62ec30471 100644 --- a/coconut/doc/coconut_environment_create.md +++ b/coconut/doc/coconut_environment_create.md @@ -13,6 +13,7 @@ A valid workflow template (sometimes called simply "workflow" for brevity) must Workflows and tasks are managed with a git based configuration system, so the workflow template may be provided simply by name or with repository and branch/tag/hash constraints. Examples: + * `coconut env create -w myworkflow` - loads workflow `myworkflow` from default configuration repository at HEAD of master branch * `coconut env create -w github.com/AliceO2Group/MyConfRepo/myworkflow` - loads a workflow from a specific git repository, HEAD of master branch * `coconut env create -w myworkflow@rev` - loads a workflow from default repository, on branch, tag or revision `rev` diff --git a/coconut/doc/coconut_repository.md b/coconut/doc/coconut_repository.md index 32156d56d..7ba96cc14 100644 --- a/coconut/doc/coconut_repository.md +++ b/coconut/doc/coconut_repository.md @@ -9,6 +9,7 @@ The repository command performs operations on the repositories used for task and A valid workflow configuration repository must contain the directories `tasks` and `workflows` in its `master` branch. When referencing a repository, the clone method should never be prepended. Supported repo backends and their expected format are: + - https: [hostname]/[repo_path] - ssh: [hostname]:[repo_path] - local [repo_path] (local repo entries are ephemeral and will not survive a core restart) diff --git a/coconut/doc/coconut_repository_add.md b/coconut/doc/coconut_repository_add.md index 90db22e45..7b24c4f80 100644 --- a/coconut/doc/coconut_repository_add.md +++ b/coconut/doc/coconut_repository_add.md @@ -16,6 +16,7 @@ the ensuing list is followed until a valid revision has been identified: Exhaustion of the aforementioned list results in a repo add failure. `coconut repo add` can be called with + 1) a repository identifier 2) a repository identifier coupled with the `--default-revision` flag (see examples below) diff --git a/coconut/doc/coconut_role_query.md b/coconut/doc/coconut_role_query.md index 6d5626791..ee3759272 100644 --- a/coconut/doc/coconut_role_query.md +++ b/coconut/doc/coconut_role_query.md @@ -17,6 +17,7 @@ walk through the role tree of the given environment, starting from the root role per https://github.com/gobwas/glob syntax. 
Examples: + * `coconut role query 2rE9AV3m1HL readout-dataflow` - queries the role `readout-dataflow` in environment `2rE9AV3m1HL`, prints the full tree, along with the variables defined in the root role * `coconut role query 2rE9AV3m1HL readout-dataflow.host-aido2-bld4-lab102` - queries the role `readout-dataflow.host-aido2-bld4-lab102`, prints the subtree of that role, along with the variables defined in it * `coconut role query 2rE9AV3m1HL readout-dataflow.host-aido2-bld4-lab102.data-distribution.stfs` - queries the role at the given path, it is a task role so there is no subtree, prints the variables defined in that role diff --git a/coconut/doc/coconut_template_list.md b/coconut/doc/coconut_template_list.md index 47853a119..7d2fdda29 100644 --- a/coconut/doc/coconut_template_list.md +++ b/coconut/doc/coconut_template_list.md @@ -7,7 +7,8 @@ list available workflow templates The template list command shows a list of available workflow templates. These workflow templates can then be loaded to create an environment. -`coconut templ list` can be called with +`coconut templ list` can be called with + 1) a combination of the `--repo` , `--revision` , `--all-branches` , `--all-tags` , `--all-workflows` flags, or with 2) an argument in the form of [repo-pattern]@[revision-pattern], where the patterns are globbing. diff --git a/core/integration/README.md b/core/integration/README.md new file mode 100644 index 000000000..d86622233 --- /dev/null +++ b/core/integration/README.md @@ -0,0 +1,159 @@ +# Integration plugins + +The integration plugins allow AliECS to communicate with other ALICE services. +A plugin can register a set of callbacks which can be invoked upon defined environment events (state transitions). + +## Plugin system overview + +All plugins should implement the [`Plugin`](https://github.com/AliceO2Group/Control/blob/master/core/integration/plugin.go) interface. +See the existing plugins for examples. + +In order to have a plugin loaded by AliECS, one has to: + +- add `RegisterPlugin` to the `init()` function in the [AliECS core main source](https://github.com/AliceO2Group/Control/blob/master/cmd/o2-aliecs-core/main.go) +- add the plugin name to the `integrationPlugins` list and set its endpoint in the AliECS configuration file (typically at `/o2/components/aliecs/ANY/any/settings` in the configuration store) + +# Integrated service operations + +In this chapter we list and describe the integrated service plugins. + +## Bookkeeping + +The legacy Bookkeeping plugin sends updates to Bookkeeping about the state of data taking runs. +As of May 2025, Bookkeeping has transitioned to consuming input from the Kafka event service and the only call in use is "FillInfo", which allows ECS to retrieve LHC fill information. + +## CCDB + +The CCDB plugin calls a PDP-provided executable which creates a General Run Parameters (GRP) object at each run start and stop. + +## DCS + +The DCS plugin communicates with the ALICE Detector Control System (DCS). + +### DCS operations + +The DCS integration plugin exposes the following operations to the workflow template (WFT) context. +Their associated transitions in the table below refer +to the [readout-dataflow](https://github.com/AliceO2Group/ControlWorkflows/blob/master/workflows/readout-dataflow.yaml) workflow template.
+ +| **DCS operation** | **WFT call** | **Call timing** | **Critical** | **Contingent on detector state** | +|-----------------------|---------------------|---------------------------|--------------|----------------------------------| +| Prepare For Run (PFR) | `dcs.PrepareForRun` | during `CONFIGURE` | `false` | yes | +| Start Of Run (SOR) | `dcs.StartOfRun` | early in `START_ACTIVITY` | `true` | yes | +| End Of Run (EOR) | `dcs.EndOfRun` | late in `STOP_ACTIVITY` | `true` | no | + +The DCS integration plugin subscribes to the [DCS service](https://github.com/AliceO2Group/Control/blob/master/core/integration/dcs/protos/dcs.proto) and continually +receives information on operation-state compatibility for all +detectors. +When a given environment reaches a DCS call, the relevant DCS operation +will be called only if the DCS service reports that all detectors in that +environment are compatible with this operation, except EOR, which is +always called. + +### DCS PrepareForRun behaviour + +Unlike SOR and EOR, which are mandatory if `dcs_enabled` is set to `true`, +the inability to run PFR or a PFR failure will not prevent the +environment from transitioning forward. + +#### DCS PFR incompatibility + +When `dcs.PrepareForRun` is called, if at least one detector is in a +state that is incompatible with PFR as reported by the DCS service, +a grace period of 10 seconds is given for the detector(s) to become +compatible with PFR, with a 1 Hz polling frequency. As soon as all +detectors become compatible with PFR, the PFR request is sent +to the DCS service. + +If the grace period ends and at least one detector +included in the environment is still incompatible with PFR, the PFR +operation will be performed only for the PFR-compatible detectors. + +Despite some detectors not having performed PFR, the environment +can still transition forward towards the `RUNNING` state, and any DCS +activities that would have taken place in PFR will instead happen +during SOR. Only at that point, if at least one detector is not +compatible with SOR (or if it is but SOR fails), will the environment +declare a failure. + +#### DCS PFR failure + +When `dcs.PrepareForRun` is called, if all detectors are compatible +with PFR as reported by the DCS service (or become compatible during +the grace period), the PFR request is immediately sent to the +DCS service. + +The `dcs.PrepareForRun` call fails if no detectors are PFR-compatible +or if PFR fails for all those that were, +but since the call is non-critical the environment may still reach the +`CONFIGURED` state and transition forward towards `RUNNING`. + +As in the case where PFR cannot run, any DCS activities that +would have taken place in PFR will instead be done during SOR. + +### DCS StartOfRun behaviour + +The SOR operation is mandatory if `dcs_enabled` is set to `true` +(AliECS GUI "DCS" switched on). + +#### DCS SOR incompatibility + +When `dcs.StartOfRun` is called, if at least one detector is in a +state that is incompatible with SOR as reported by the DCS service, +or if after a grace period of 10 seconds at least one detector is +still incompatible with SOR, the SOR operation **will not run for any +detector**. + +The environment will then declare a **failure**: the +`START_ACTIVITY` transition will be blocked and the environment +will move to `ERROR`.
+ +#### DCS SOR failure + +When `dcs.StartOfRun` is called, if all detectors are compatible +with SOR as reported by the DCS service (or become compatible during +the grace period), the SOR request is immediately sent to the +DCS service. + +If this operation fails for one or more detectors, the +`dcs.StartOfRun` call as a whole is considered to have failed. + +The environment will then declare a **failure**: the +`START_ACTIVITY` transition will be blocked and the environment +will move to `ERROR`. + +### DCS EndOfRun behaviour + +The EOR operation is mandatory if `dcs_enabled` is set to `true` +(AliECS GUI "DCS" switched on). However, unlike with PFR and SOR, there +is **no check for compatibility** with the EOR operation. The EOR +request will always be sent to the DCS service during `STOP_ACTIVITY`. + +#### DCS EOR failure + +If this operation fails for one or more detectors, the +`dcs.EndOfRun` call as a whole is considered to have failed. + +The environment will then declare a **failure**: the +`STOP_ACTIVITY` transition will be blocked and the environment +will move to `ERROR`. + +## DD Scheduler + +The DD Scheduler plugin informs the Data Distribution software about the pool of FLPs taking part in data taking. + +## Kafka (legacy) + +See [Legacy events: Kafka plugin](/docs/kafka.md#legacy-events-kafka-plugin) + +## ODC + +The ODC plugin communicates with the [Online Device Control (ODC)](https://github.com/FairRootGroup/ODC) instance of the ALICE experiment, which controls the event processing farm used in data taking and offline processing. + +## Test plugin + +The Test plugin serves as an example plugin and is used for testing the plugin system. + +## Trigger + +The Trigger plugin communicates with the ALICE trigger system. diff --git a/core/protos/o2control.proto b/core/protos/o2control.proto index 17698b13c..df3069b90 100644 --- a/core/protos/o2control.proto +++ b/core/protos/o2control.proto @@ -30,11 +30,16 @@ option go_package = "github.com/AliceO2Group/Control/core/protos;pb"; import public "protos/events.proto"; +// The Control service is the main interface to AliECS service Control { rpc GetFrameworkInfo (GetFrameworkInfoRequest) returns (GetFrameworkInfoReply) {} rpc GetEnvironments (GetEnvironmentsRequest) returns (GetEnvironmentsReply) {} + // Creates a new environment which automatically follows one STANDBY->RUNNING->DONE cycle in the state machine. + // It returns only once the environment reaches the CONFIGURED state or upon any earlier failure. rpc NewAutoEnvironment (NewAutoEnvironmentRequest) returns (NewAutoEnvironmentReply) {} + // Creates a new environment. + // It returns only once the environment reaches the CONFIGURED state or upon any earlier failure. rpc NewEnvironment (NewEnvironmentRequest) returns (NewEnvironmentReply) {} rpc GetEnvironment (GetEnvironmentRequest) returns (GetEnvironmentReply) {} rpc ControlEnvironment (ControlEnvironmentRequest) returns (ControlEnvironmentReply) {} @@ -42,6 +47,9 @@ service Control { rpc GetActiveDetectors (Empty) returns (GetActiveDetectorsReply) {} rpc GetAvailableDetectors (Empty) returns (GetAvailableDetectorsReply) {} + // Creates a new environment. + // It returns once an environment ID is created and continues the creation asynchronously to the call. + // The environment will be listed in GetEnvironments() only once the workflow is loaded and deployment starts.
rpc NewEnvironmentAsync (NewEnvironmentRequest) returns (NewEnvironmentReply) {} // rpc SetEnvironmentProperties (SetEnvironmentPropertiesRequest) returns (SetEnvironmentPropertiesReply) {} diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md new file mode 100644 index 000000000..54611fd45 --- /dev/null +++ b/docs/CONTRIBUTING.md @@ -0,0 +1,63 @@ +# Contributing + +Thank you for your interest in contributing to the project. +This document provides guidelines and information to help you contribute effectively. + +If you are not in contact with the project maintainers, please reach out to them before proposing any changes. +We use JIRA for issue tracking and project management. +This software component is part of the O²/FLP project in the ALICE experiment. + +## Getting started + +Getting acquainted with the introduction chapters is essential, and skimming through the rest of the documentation is highly advised. + +A development environment will be necessary for compiling binaries and running unit tests; see [Building](/docs/building.md) for details. + +## Testing + +Run unit tests in the Control project with `make test`. +To obtain test coverage reports, run `make coverage`. + +Typically, you will also want to prepare a test setup in the form of an [FLP Suite deployment](https://alice-flp.docs.cern.ch/system-configuration/utils/o2-flp-setup/) on a virtual machine. +Since AliECS interacts with many other project components, the last testing step might involve replacing the modified binary on the test VM and trying out the new functionality or the fix. + +The binaries are installed at `/opt/o2/bin`. + +`o2-aliecs-core` and `o2-apricot` are run as systemd services, so you will need to restart them after replacing the binary. + +`o2-aliecs-executor` is started by `mesos-slave` if it is not already running at environment creation. +To make sure that the replaced binary is used, kill the running process (`pkill -f o2-aliecs-executor`). + +## Pull request guidelines + +- Make sure your work has a corresponding JIRA ticket and that it is assigned to you. +Trivial PRs are acceptable without a ticket. + +- Work on your changes in your fork on a dedicated branch with a descriptive name. + +- Make focused, logically atomic commits with clear messages and descriptions explaining the design choices. +Multiple commits per pull request are allowed. +However, please make sure that the project can be built and the tests pass at every commit. + +- The commit message or description should include the JIRA ticket number. + +- Add tests for your changes whenever possible. +Gomega/Ginkgo tests are preferred, but other styles of tests are also welcome. + +- Add documentation for new features. + +- Your contribution will be reviewed by the project maintainers once the PR is marked as ready for review. + +## Documentation guidelines + +The markdown documentation is meant to be browsed on GitHub, but it is also published in the aggregated [FLP documentation](https://alice-flp.docs.cern.ch) based on [MkDocs](https://www.mkdocs.org/). +Consequently, any changes in the documentation structure should be reflected in the Table of Contents in the main README.md, as well as in `mkdocs.yml` and `mkdocs-dev.yml`. + +The AliECS MkDocs documentation is split into the two aforementioned files to follow the split between "Products" and "Developers" tabs in the FLP documentation. +The `mkdocs-dev.yml` file uses a symlink `aliecs-dev` to the `aliecs` directory to avoid complaints about duplicated site names.
+ +Because of the dual target of the documentation, the points below are important to keep in mind: + +- Absolute paths in links to other files do not always work, they should be avoided. +- When referencing source files in the repository, use full URIs to GitHub. +- In MkDocs layouts, one cannot reference specific sections within markdown files. Only links to entire markdown files are possible. \ No newline at end of file diff --git a/docs/apidocs_aliecs.md b/docs/apidocs_aliecs.md index 6783e0b13..7004c9050 100644 --- a/docs/apidocs_aliecs.md +++ b/docs/apidocs_aliecs.md @@ -88,6 +88,26 @@ - [Control](#o2control-Control) +- [protos/events.proto](#protos_events-proto) + - [Ev_CallEvent](#events-Ev_CallEvent) + - [Ev_EnvironmentEvent](#events-Ev_EnvironmentEvent) + - [Ev_EnvironmentEvent.VarsEntry](#events-Ev_EnvironmentEvent-VarsEntry) + - [Ev_IntegratedServiceEvent](#events-Ev_IntegratedServiceEvent) + - [Ev_MetaEvent_CoreStart](#events-Ev_MetaEvent_CoreStart) + - [Ev_MetaEvent_FrameworkEvent](#events-Ev_MetaEvent_FrameworkEvent) + - [Ev_MetaEvent_MesosHeartbeat](#events-Ev_MetaEvent_MesosHeartbeat) + - [Ev_RoleEvent](#events-Ev_RoleEvent) + - [Ev_RunEvent](#events-Ev_RunEvent) + - [Ev_TaskEvent](#events-Ev_TaskEvent) + - [Event](#events-Event) + - [Traits](#events-Traits) + + - [OpStatus](#events-OpStatus) + +- [protos/common.proto](#protos_common-proto) + - [User](#common-User) + - [WorkflowTemplateInfo](#common-WorkflowTemplateInfo) + - [Scalar Value Types](#scalar-value-types) @@ -1453,20 +1473,20 @@ Not implemented yet ### Control - +The Control service is the main interface to AliECS | Method Name | Request Type | Response Type | Description | | ----------- | ------------ | ------------- | ------------| | GetFrameworkInfo | [GetFrameworkInfoRequest](#o2control-GetFrameworkInfoRequest) | [GetFrameworkInfoReply](#o2control-GetFrameworkInfoReply) | | | GetEnvironments | [GetEnvironmentsRequest](#o2control-GetEnvironmentsRequest) | [GetEnvironmentsReply](#o2control-GetEnvironmentsReply) | | -| NewAutoEnvironment | [NewAutoEnvironmentRequest](#o2control-NewAutoEnvironmentRequest) | [NewAutoEnvironmentReply](#o2control-NewAutoEnvironmentReply) | | -| NewEnvironment | [NewEnvironmentRequest](#o2control-NewEnvironmentRequest) | [NewEnvironmentReply](#o2control-NewEnvironmentReply) | | +| NewAutoEnvironment | [NewAutoEnvironmentRequest](#o2control-NewAutoEnvironmentRequest) | [NewAutoEnvironmentReply](#o2control-NewAutoEnvironmentReply) | Creates a new environment which automatically follows one STANDBY->RUNNING->DONE cycle in the state machine. It returns only once the environment reaches the CONFIGURED state or upon any earlier failure. | +| NewEnvironment | [NewEnvironmentRequest](#o2control-NewEnvironmentRequest) | [NewEnvironmentReply](#o2control-NewEnvironmentReply) | Creates a new environment. It returns only once the environment reaches the CONFIGURED state or upon any earlier failure. 
| | GetEnvironment | [GetEnvironmentRequest](#o2control-GetEnvironmentRequest) | [GetEnvironmentReply](#o2control-GetEnvironmentReply) | | | ControlEnvironment | [ControlEnvironmentRequest](#o2control-ControlEnvironmentRequest) | [ControlEnvironmentReply](#o2control-ControlEnvironmentReply) | | | DestroyEnvironment | [DestroyEnvironmentRequest](#o2control-DestroyEnvironmentRequest) | [DestroyEnvironmentReply](#o2control-DestroyEnvironmentReply) | | | GetActiveDetectors | [Empty](#o2control-Empty) | [GetActiveDetectorsReply](#o2control-GetActiveDetectorsReply) | | | GetAvailableDetectors | [Empty](#o2control-Empty) | [GetAvailableDetectorsReply](#o2control-GetAvailableDetectorsReply) | | -| NewEnvironmentAsync | [NewEnvironmentRequest](#o2control-NewEnvironmentRequest) | [NewEnvironmentReply](#o2control-NewEnvironmentReply) | | +| NewEnvironmentAsync | [NewEnvironmentRequest](#o2control-NewEnvironmentRequest) | [NewEnvironmentReply](#o2control-NewEnvironmentReply) | Creates a new environment. It returns once an environment ID is created and continues the creation asynchronously to the call. The environment will be listed in GetEnvironments() only once the workflow is loaded and deployment starts. | | GetTasks | [GetTasksRequest](#o2control-GetTasksRequest) | [GetTasksReply](#o2control-GetTasksReply) | | | GetTask | [GetTaskRequest](#o2control-GetTaskRequest) | [GetTaskReply](#o2control-GetTaskReply) | | | CleanupTasks | [CleanupTasksRequest](#o2control-CleanupTasksRequest) | [CleanupTasksReply](#o2control-CleanupTasksReply) | | @@ -1488,6 +1508,321 @@ Not implemented yet + +
+ +## protos/events.proto + + + + + +### Ev_CallEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| func | [string](#string) | | name of the function being called, within the workflow template context | +| callStatus | [OpStatus](#events-OpStatus) | | progress or success/failure state of the call | +| return | [string](#string) | | return value of the function | +| traits | [Traits](#events-Traits) | | | +| output | [string](#string) | | any additional output of the function | +| error | [string](#string) | | error value, if returned | +| environmentId | [string](#string) | | | +| path | [string](#string) | | path to the parent callRole of this call within the environment | + + + + + + + + +### Ev_EnvironmentEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| environmentId | [string](#string) | | | +| state | [string](#string) | | | +| runNumber | [uint32](#uint32) | | only when the environment is in the running state | +| error | [string](#string) | | | +| message | [string](#string) | | any additional message concerning the current state or transition | +| transition | [string](#string) | | | +| transitionStep | [string](#string) | | | +| transitionStatus | [OpStatus](#events-OpStatus) | | | +| vars | [Ev_EnvironmentEvent.VarsEntry](#events-Ev_EnvironmentEvent-VarsEntry) | repeated | consolidated environment variables at the root role of the environment | +| lastRequestUser | [common.User](#common-User) | | | +| workflowTemplateInfo | [common.WorkflowTemplateInfo](#common-WorkflowTemplateInfo) | | | + + + + + + + + +### Ev_EnvironmentEvent.VarsEntry + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| key | [string](#string) | | | +| value | [string](#string) | | | + + + + + + + + +### Ev_IntegratedServiceEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| name | [string](#string) | | name of the context, usually the path of the callRole that calls a given integrated service function e.g. readout-dataflow.dd-scheduler.terminate | +| error | [string](#string) | | error message, if any | +| operationName | [string](#string) | | name of the operation, usually the name of the integrated service function being called e.g. 
ddsched.PartitionTerminate()" | +| operationStatus | [OpStatus](#events-OpStatus) | | progress or success/failure state of the operation | +| operationStep | [string](#string) | | if the operation has substeps, this is the name of the current substep, like an API call or polling phase | +| operationStepStatus | [OpStatus](#events-OpStatus) | | progress or success/failure state of the current substep | +| environmentId | [string](#string) | | | +| payload | [string](#string) | | any additional payload, depending on the integrated service; there is no schema, it can even be the raw return structure of a remote API call | + + + + + + + + +### Ev_MetaEvent_CoreStart + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| frameworkId | [string](#string) | | | + + + + + + + + +### Ev_MetaEvent_FrameworkEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| frameworkId | [string](#string) | | | +| message | [string](#string) | | | + + + + + + + + +### Ev_MetaEvent_MesosHeartbeat + + + + + + + + + +### Ev_RoleEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| name | [string](#string) | | role name | +| status | [string](#string) | | posible values: ACTIVE/INACTIVE/PARTIAL/UNDEFINED/UNDEPLOYABLE as defined in status.go. Derived from the state of child tasks, calls or other roles | +| state | [string](#string) | | state machine state for this role | +| rolePath | [string](#string) | | path to this role within the environment | +| environmentId | [string](#string) | | | + + + + + + + + +### Ev_RunEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| environmentId | [string](#string) | | | +| runNumber | [uint32](#uint32) | | | +| state | [string](#string) | | | +| error | [string](#string) | | | +| transition | [string](#string) | | | +| transitionStatus | [OpStatus](#events-OpStatus) | | | +| lastRequestUser | [common.User](#common-User) | | | + + + + + + + + +### Ev_TaskEvent + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| name | [string](#string) | | task name, based on the name of the task class | +| taskid | [string](#string) | | task id, unique | +| state | [string](#string) | | state machine state for this task | +| status | [string](#string) | | posible values: ACTIVE/INACTIVE/PARTIAL/UNDEFINED/UNDEPLOYABLE as defined in status.go. 
| +| hostname | [string](#string) | | | +| className | [string](#string) | | name of the task class from which this task was spawned | +| traits | [Traits](#events-Traits) | | | +| environmentId | [string](#string) | | | +| path | [string](#string) | | path to the parent taskRole of this task within the environment | + + + + + + + + +### Event + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| timestamp | [int64](#int64) | | | +| timestampNano | [int64](#int64) | | | +| environmentEvent | [Ev_EnvironmentEvent](#events-Ev_EnvironmentEvent) | | | +| taskEvent | [Ev_TaskEvent](#events-Ev_TaskEvent) | | | +| roleEvent | [Ev_RoleEvent](#events-Ev_RoleEvent) | | | +| callEvent | [Ev_CallEvent](#events-Ev_CallEvent) | | | +| integratedServiceEvent | [Ev_IntegratedServiceEvent](#events-Ev_IntegratedServiceEvent) | | | +| runEvent | [Ev_RunEvent](#events-Ev_RunEvent) | | | +| frameworkEvent | [Ev_MetaEvent_FrameworkEvent](#events-Ev_MetaEvent_FrameworkEvent) | | | +| mesosHeartbeatEvent | [Ev_MetaEvent_MesosHeartbeat](#events-Ev_MetaEvent_MesosHeartbeat) | | | +| coreStartEvent | [Ev_MetaEvent_CoreStart](#events-Ev_MetaEvent_CoreStart) | | | + + + + + + + + +### Traits + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| trigger | [string](#string) | | | +| await | [string](#string) | | | +| timeout | [string](#string) | | | +| critical | [bool](#bool) | | | + + + + + + + + + + +### OpStatus + + +| Name | Number | Description | +| ---- | ------ | ----------- | +| NULL | 0 | | +| STARTED | 1 | | +| ONGOING | 2 | | +| DONE_OK | 3 | | +| DONE_ERROR | 4 | | +| DONE_TIMEOUT | 5 | | + + + + + + + + + + + + + +## protos/common.proto + + + + + +### User + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| externalId | [int32](#int32) | optional | The unique CERN identifier of this user. | +| id | [int32](#int32) | optional | The unique identifier of this entity. | +| name | [string](#string) | | Name of the user. | + + + + + + + + +### WorkflowTemplateInfo + + + +| Field | Type | Label | Description | +| ----- | ---- | ----- | ----------- | +| name | [string](#string) | | | +| description | [string](#string) | | | +| path | [string](#string) | | | +| public | [bool](#bool) | | whether the environment is public or not | + + + + + + + + + + + + + + + ## Scalar Value Types | .proto Type | Notes | C++ | Java | Python | Go | C# | PHP | Ruby | diff --git a/docs/building.md b/docs/building.md index ffd5fbe1b..0a4243ec9 100644 --- a/docs/building.md +++ b/docs/building.md @@ -1,6 +1,6 @@ # Building AliECS -> **WARNING**: The building instructions described in this page are **for development purposes only**. Users interested in deploying, running and controlling O²/FLP software or their own software with AliECS should refer to the [O²/FLP Suite instructions](../../installation/) instead. +> **WARNING**: The building instructions described in this page are **for development purposes only**. Users interested in deploying, running and controlling O²/FLP software or their own software with AliECS should refer to the [O²/FLP Suite instructions](https://alice-flp.docs.cern.ch/Operations/Experts/system-configuration/utils/o2-flp-setup/) instead. ## Overview @@ -84,6 +84,6 @@ You should find several executables including `o2control-core`, `o2control-execu For subsequent builds (after the first one), plain `make` (instead of `make all`) is sufficient. 
See the [Makefile reference](makefile_reference.md) for more information. -If you wish to also build the process control library and/or plugin, see [the OCC readme](./occ/README.md). +If you wish to also build the process control library and/or plugin, see [the OCC readme](../occ/README.md). This build of AliECS can be run locally and connected to an existing O²/FLP Suite cluster by passing a `--mesosUrl` parameter. If you do this, remember to `systemctl stop o2-aliecs-core` on the head node, in order to stop the core that came with the O²/FLP Suite and use your own. diff --git a/docs/development.md b/docs/development.md index bfccde00a..d0373f09f 100644 --- a/docs/development.md +++ b/docs/development.md @@ -4,7 +4,7 @@ Generated API documentation is available on [pkg.go.dev](https:///pkg.go.dev/git The release log is managed via [GitHub](https://github.com/AliceO2Group/Control/releases/). -Bugs go to [JIRA](https://alice.its.cern.ch/jira/browse/OCTRL). +Bugs go to [JIRA](https://its.cern.ch/jira/projects/OCTRL/issues). ## Release Procedure @@ -13,7 +13,6 @@ Bugs go to [JIRA](https://alice.its.cern.ch/jira/browse/OCTRL). 3. Run `hacking/release_notes.sh HEAD` to get a formatted commit message list since the last tag, copy it. 4. Paste the above into a [new GitHub release draft](https://github.com/AliceO2Group/Control/releases/new). Sort, categorize, add summary on top. 5. Pick a version number. Numbers `x.x.80`-`x.x.89` are reserved for Alpha pre-releases. Numbers `x.x.90`-`x.x.99` are reserved for Beta and RC pre-releases. If doing a pre-release, don't forget to tick `This is a pre-release`. When ready, hit `Publish release`. -6. Go to your local clone of [`alice-flp/documentation`](https://gitlab.cern.ch/alice-flp/documentation), descend into `docs/aliecs`. `git pull --rebase` to ensure the submodule points to the tag created just now. Commit and push (or merge request). -7. Go to your local clone of `alidist`, ensure that the branch is `master` and that it's up to date. Then branch out into `aliecs-bump` (`git branch aliecs-bump`). -8. Bump the version in `control.sh`, `control-core.sh`, `control-occplugin.sh` and `coconut.sh`. Commit and push to `origin/aliecs-bump` (`git push -u origin aliecs-bump`). -9. Submit pull request with the above to `alisw/alidist`. +6. Go to your local clone of `alidist`, ensure that the branch is `master` and that it's up to date. Then branch out into `aliecs-bump` (`git branch aliecs-bump`). +7. Bump the version in `control.sh`, `control-core.sh`, `control-occplugin.sh` and `coconut.sh`. Commit and push to `origin/aliecs-bump` (`git push -u origin aliecs-bump`). +8. Submit pull request with the above to `alisw/alidist`. diff --git a/docs/faq.md b/docs/faq.md deleted file mode 100644 index d39935874..000000000 --- a/docs/faq.md +++ /dev/null @@ -1,2 +0,0 @@ -# Frequently Asked Questions - diff --git a/docs/handbook/AliECS-environment.png b/docs/handbook/AliECS-environment.png new file mode 100644 index 000000000..e94444806 Binary files /dev/null and b/docs/handbook/AliECS-environment.png differ diff --git a/docs/handbook/appconfiguration.md b/docs/handbook/appconfiguration.md index 5b5892771..ed96e462b 100644 --- a/docs/handbook/appconfiguration.md +++ b/docs/handbook/appconfiguration.md @@ -1,6 +1,10 @@ # Component Configuration -## Connectivity to controlled nodes +## Apache Mesos + +Apache Mesos is installed as a part of the FLP Suite. + +### Connectivity to controlled nodes ECS relies on Mesos to know the state of the controlled nodes. 
Thus, losing connection to a Mesos slave can be treated as a node being down or unresponsive. @@ -9,4 +13,4 @@ Then, the environment is transitioned to ERROR. Mesos slave health check can be configured with `MESOS_MAX_AGENT_PING_TIMEOUTS` (`--max_agent_ping_timeouts`) and `MESOS_AGENT_PING_TIMEOUT` (`--agent_ping_timeout`) parameters for Mesos. Effectively, the factor of the two parameters is the time needed to consider a slave/agent as lost. -Please refer to Mesos documentation for more details. \ No newline at end of file +Please refer to [Mesos documentation](https://mesos.apache.org/documentation/latest/) for more details. \ No newline at end of file diff --git a/docs/handbook/concepts.md b/docs/handbook/concepts.md index c5eace31e..45b34fcfd 100644 --- a/docs/handbook/concepts.md +++ b/docs/handbook/concepts.md @@ -2,10 +2,37 @@ From a logical point of view of data processing deployment and control, AliECS deals with concepts such as **environments**, **roles** and **tasks**, the understanding of which is paramount for using AliECS effectively. +
+