# InfoSec Overview

_Last updated: 19 May 2025_

### Purpose
This document provides the information needed to evaluate OpenFL for real-world deployment in highly sensitive environments. The target audience is InfoSec reviewers who need detailed information about code contents, communication traffic, and potential exploit vectors.

### Overview: Network Connectivity
OpenFL federations use a hub-and-spoke topology between `collaborator` clients that generate model parameter updates from their data and the `aggregator` server that combines their training updates into new models[^1]. Key details about this functionality are:

* Connections are made using request/response gRPC[^2] calls.
* The `aggregator` listens for connections on a single port (usually chosen by the experiment admin and explicitly defined in the FL plan, e.g. `50051`), so all `collaborator` nodes must be able to send outgoing traffic to this port.
* All connections are initiated by the `collaborator`, i.e., a [`pull`](https://karlchris.github.io/data-engineering/data-ingestion/push-pull/#pull) architecture.
* The `collaborator` does not open any listening sockets.
* Connections are secured using mTLS[^3] (a client-side sketch of such a connection follows this list).
* Each request/response pair is handled over a new TLS connection.
* The PKI for federations is created using the [`fx aggregator/collaborator certify`](https://openfl.readthedocs.io/en/latest/fx.html) CLI commands. OpenFL internally leverages Python's `cryptography` module. The organization hosting the `aggregator` usually acts as the Certificate Authority (CA) and verifies each identity before signing.
* Currently, the `collaborator` polls the `aggregator` at a fixed interval. We have had a request to enable client-side configuration of this interval and hope to support that feature soon.
* Connection timeouts are set to gRPC defaults.
* If the `aggregator` is not available, the `collaborator` will retry connections indefinitely. This is currently useful so that the aggregator can be taken down for bugfixes without `collaborator` processes exiting.

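
For reference, the snippet below shows what a mutually authenticated gRPC client channel with these properties looks like in plain Python. It is a minimal illustrative sketch, not OpenFL's actual transport code; the aggregator address and certificate file names are placeholders for whatever your FL plan and PKI setup produce.

```python
# Illustrative sketch only (pip install grpcio): a generic gRPC client channel
# with mutual TLS, mirroring the properties listed above (client-initiated,
# mTLS, single aggregator port). This is NOT OpenFL's actual transport code;
# the address and file paths below are placeholders.
import grpc

AGGREGATOR_ADDRESS = "aggregator.example.org:50051"  # host/port from the FL plan (placeholder)

# Certificates produced during federation PKI setup (paths are hypothetical).
with open("cert_chain.crt", "rb") as f:
    root_certificates = f.read()      # CA certificate that signed both endpoints
with open("collaborator.key", "rb") as f:
    private_key = f.read()            # collaborator's private key
with open("collaborator.crt", "rb") as f:
    certificate_chain = f.read()      # collaborator's signed certificate

# Presenting a client certificate alongside the CA bundle is what makes the
# TLS session *mutually* authenticated.
credentials = grpc.ssl_channel_credentials(
    root_certificates=root_certificates,
    private_key=private_key,
    certificate_chain=certificate_chain,
)

# The collaborator dials out; it never listens. The aggregator's certificate is
# verified against the CA, and the aggregator verifies the client certificate.
channel = grpc.secure_channel(AGGREGATOR_ADDRESS, credentials)
grpc.channel_ready_future(channel).result(timeout=10)  # simple reachability check
print("mTLS channel to the aggregator established")
channel.close()
```
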

### Contents of Network Messages
Network messages are well-defined protobufs which can be found in the following files:
- [`aggregator.proto`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/aggregator.proto)
- [`base.proto`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/base.proto)

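
Reviewers who want to inspect the exact message and service schema can compile these files locally with `grpcio-tools`. The sketch below assumes both `.proto` files have been saved into a local `protos/` directory and that `aggregator.proto` imports `base.proto` by its bare file name; adjust `--proto_path` if the import uses a longer path (e.g. one mirroring the repository layout under `openfl/protocols/`).

```python
# Sketch: generate Python stubs from OpenFL's protobuf definitions so the full
# message/RPC schema can be reviewed as code. Assumes aggregator.proto and
# base.proto are saved in ./protos and that aggregator.proto imports base.proto
# by its bare file name; adjust --proto_path if the import path differs.
import os
from grpc_tools import protoc  # pip install grpcio-tools

os.makedirs("generated", exist_ok=True)

exit_code = protoc.main([
    "protoc",                       # argv[0], ignored by protoc.main
    "--proto_path=protos",          # directory containing the .proto files
    "--python_out=generated",       # protobuf message classes
    "--grpc_python_out=generated",  # gRPC service stubs
    "protos/aggregator.proto",
    "protos/base.proto",
])
if exit_code != 0:
    raise RuntimeError("protoc failed; check the --proto_path and import paths above")
```
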

Key points about the network messages/protocol:
* No executable code is ever sent to the collaborator. All code to be executed is contained within the OpenFL package and the custom FL workspace. The code, along with the FL plan file that specifies the classes and initial parameters to be used, is available for review prior to the FL plan's execution. This ensures that all potential operations are understood before they take place.
* The `collaborator` typically requests the list of FL tasks to execute from the `aggregator` via a [`GetTasksRequest`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/aggregator.proto#L34) message.
* The `aggregator`, based on the FL plan, returns a [`GetTasksResponse`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/aggregator.proto#L45) which includes [`Tasks`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/aggregator.proto#L38): metadata about the Python functions to be invoked by the collaborator. All code is available locally to each collaborator as part of a pre-distributed workspace bundle.
* The `collaborator` then uses its TaskRunner framework to execute the FL tasks on the locally available data, producing output tensors such as model weights or metrics.
* During task execution, the `collaborator` may require tensors that are not available locally; for example, a federated training task requires the globally averaged model weights from the `aggregator`. Collaborators gather a list of tensor keys that need to be fetched from the aggregator and download them via the [`GetAggregatedTensor`](https://openfl.readthedocs.io/en/latest/reference/_autosummary/openfl.transport.grpc.aggregator_server.AggregatorGRPCServer.html#openfl.transport.grpc.aggregator_server.AggregatorGRPCServer.GetAggregatedTensor) RPC method.
* Upon task completion, the `collaborator` transmits the results by calling the [`SendLocalTaskResults`](https://openfl.readthedocs.io/en/latest/reference/_autosummary/openfl.transport.grpc.aggregator_server.AggregatorGRPCServer.html#openfl.transport.grpc.aggregator_server.AggregatorGRPCServer.SendLocalTaskResults) RPC method, which carries [`NamedTensor`](https://github.com/securefederatedai/openfl/blob/develop/openfl/protocols/base.proto#L11) objects that encode results (such as model weight updates or metrics like loss and accuracy). This round trip is summarized in the sketch after this list.

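
The toy simulation below summarizes this round trip in plain Python. Every class and function in it (`Task`, `get_tasks`, `fetch_aggregated_tensors`, `run_task`, `send_results`) is a hypothetical placeholder standing in for the corresponding RPC or TaskRunner step described above; it is not OpenFL's API.

```python
# Runnable toy simulation of one collaborator round, summarizing the message
# flow described above. All names here are hypothetical placeholders, not
# OpenFL's real API.
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    # Tensor keys this task needs from the aggregator (e.g. global model weights).
    required_keys: list = field(default_factory=list)


def get_tasks(collaborator_name):
    """Stand-in for GetTasksRequest/GetTasksResponse: only task metadata, no code."""
    return [Task(name="train", required_keys=["global_model_weights"])]


def fetch_aggregated_tensors(keys):
    """Stand-in for GetAggregatedTensor: download tensors not available locally."""
    return {key: f"<tensor:{key}>" for key in keys}


def run_task(task, global_tensors):
    """Stand-in for local TaskRunner execution on locally available data only."""
    return {"weight_update": "<NamedTensor>", "loss": 0.42}


def send_results(collaborator_name, task, results):
    """Stand-in for SendLocalTaskResults: report NamedTensor results and metrics."""
    print(f"{collaborator_name} -> aggregator: results for '{task.name}': {results}")


def collaborator_round(collaborator_name):
    # 1. Ask which tasks to run this round (metadata only, never code).
    for task in get_tasks(collaborator_name):
        # 2. Pull tensors the task needs but does not have locally.
        global_tensors = fetch_aggregated_tensors(task.required_keys)
        # 3. Execute the task locally with the pre-distributed workspace code.
        results = run_task(task, global_tensors)
        # 4. Report the results back to the aggregator.
        send_results(collaborator_name, task, results)


if __name__ == "__main__":
    collaborator_round("collaborator-1")
```
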

### Testing a Collaborator
There is a "no-op" workspace template in OpenFL (available in versions `>=1.9`) which can be used to test the network connection between the `aggregator` and each `collaborator` without performing any computational task. More details can be found [here](https://github.com/securefederatedai/openfl/tree/develop/openfl-workspace/no-op#overview).
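
Independent of the no-op workspace, basic reachability of the aggregator port and the mTLS handshake can be checked from a collaborator host using only the Python standard library. This is a generic pre-check rather than an OpenFL tool; the host, port, and certificate paths below are placeholders, and the host name must match the CN/SAN in the aggregator's certificate.

```python
# Generic pre-check (not an OpenFL tool): confirm the aggregator port is
# reachable from a collaborator host and that a mutually authenticated TLS
# session can be established. Host, port, and certificate paths are placeholders.
import socket
import ssl

AGGREGATOR_HOST = "aggregator.example.org"  # must match the aggregator cert's CN/SAN
AGGREGATOR_PORT = 50051                     # port declared in the FL plan

context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.load_verify_locations("cert_chain.crt")                   # federation CA (placeholder path)
context.load_cert_chain("collaborator.crt", "collaborator.key")   # client identity for mTLS
context.set_alpn_protocols(["h2"])  # gRPC runs over HTTP/2; advertise it during the handshake

with socket.create_connection((AGGREGATOR_HOST, AGGREGATOR_PORT), timeout=10) as sock:
    with context.wrap_socket(sock, server_hostname=AGGREGATOR_HOST) as tls:
        print("TLS version:", tls.version())
        print("Aggregator certificate subject:", tls.getpeercert().get("subject"))
```
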

[^1]: [OpenFL TaskRunner Overview](https://openfl.readthedocs.io/en/latest/about/features_index/taskrunner.html)
[^2]: [gRPC Overview](https://grpc.io/docs/what-is-grpc/core-concepts/)
[^3]: [mTLS Overview](https://www.cloudflare.com/learning/access-management/what-is-mutual-tls/)