Batched fetching of tensors to reduce RPC calls [1/2] #1575
base: develop
Conversation
teoparvanov left a comment:
I have a couple of comments from my first read-through:
kminhta left a comment:
Thanks for taking this up @MasterSkepticista - this is a great fundamental adjustment to how we transport the model across the network. I don't see any major concerns, so I'm going to go ahead and approve this after my review and our offline discussion, with the understanding that:
- delta updates get reintroduced in the follow-up PR
- streaming gets reintroduced in a more generic manner (as you called out in #1575 (comment)) in some subsequent PR as well, or otherwise some way of handling large models on both the aggregator and collaborator sides
I know you called out point 1 in the PR, but do you have a mental model for how point 2 can be achieved, or is the general implementation still an open question? I ask mainly because it would be good to close the gap for first-class LLM support (thinking in terms of pretraining), but it is also true that LLM fine-tuning has a large library of parameter-efficient methods that may still allow for expanding OpenFL-LLM capabilities in the interim.
FYI, a couple of the tests are failing - secure aggregation in particular: https://github.com/securefederatedai/openfl/actions/runs/15063524782/job/42343165534?pr=1575
Also, going to tag @ishaileshpant to keep him in the loop, as there are modifications to RPC calls that may affect #1500
porteratzo left a comment:
LGTM
teoparvanov left a comment:
LGTM, thanks @MasterSkepticista !
theakshaypant left a comment:
Nice!
I do think we should wait for the 1.9 cut-off before merging this, given the temporary removal of `use_delta_updates`.
…efederatedai#1575
- rename the function to get_aggregated_tensors
- adjust both the gRPC and REST clients for the name and signature change in the API
- for the protobuf changes to TensorSpec and the batched response, adjust both client and server in line with gRPC
- fix the test cases
Signed-off-by: Shailesh Pant <shailesh.pant@intel.com>
Motivation
Collaborators today fetch each model tensor from the aggregator via a dedicated RPC call. The aggregator has limited resources to serve requests, and it is not uncommon to have hundreds (if not thousands) of model tensors waiting to be served to each collaborator. A federation itself may have hundreds (if not thousands) of participants making these requests.
Example
Consider the `torch/histology` experiment: `(get_tasks + 3 x send_local_task_results) x 8 collaborators = 32` RPC calls are made for other purposes in each round. The problem gets worse when models have hundreds of tensors and experiments span more collaborators.
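As a rough back-of-the-envelope illustration of how per-round RPC counts scale, here is a small sketch; the tensor count is a hypothetical placeholder, not a measurement from the `torch/histology` experiment:

```python
# Back-of-the-envelope per-round RPC counts (illustrative numbers only).
num_collaborators = 8
num_tensors = 300            # hypothetical number of model tensors
overhead_per_collab = 1 + 3  # get_tasks + 3 x send_local_task_results

# Before this PR: one get_aggregated_tensor RPC per tensor, per collaborator.
per_tensor_calls = num_tensors * num_collaborators

# With batched fetching: a single get_aggregated_tensors RPC per collaborator.
batched_calls = 1 * num_collaborators

overhead_calls = overhead_per_collab * num_collaborators

print(per_tensor_calls, batched_calls, overhead_calls)  # 2400 8 32
```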
Goal of this PR
This is Part 1 of 2 PRs adding support for batched fetching of tensors over gRPC. This PR makes the following major changes:
- Removes `send_model_deltas` and `use_delta_updates` support.
- `nparray_to_named_tensor` (henceforth called `serialize_tensor`) and `named_tensor_to_nparray` (henceforth called `deserialize_tensor`) now work without any brittle conditional checks.

In Part 2 of this PR, delta update support will be added back. The reason for its removal here is to straighten out the logic for actions that concern the communication pipeline (de/compression, de/serialization) and those that concern the aggregator/collaborator components.
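For intuition, a minimal sketch of what the straightened serialization path might look like, assuming the helpers simply convert between numpy arrays and a wire format; this is not the actual OpenFL implementation:

```python
import numpy as np

def serialize_tensor(name: str, nparray: np.ndarray, round_number: int, tags: tuple) -> dict:
    """Sketch: convert a numpy array into a wire-friendly named tensor.

    The real helper builds a protobuf NamedTensor and runs the compression
    pipeline; the point is that there is no delta/full-tensor branching.
    """
    return {
        "name": name,
        "round_number": round_number,
        "tags": list(tags),
        "dtype": str(nparray.dtype),
        "shape": list(nparray.shape),
        "data": nparray.tobytes(),
    }

def deserialize_tensor(named_tensor: dict) -> np.ndarray:
    """Sketch: inverse of serialize_tensor."""
    arr = np.frombuffer(named_tensor["data"], dtype=named_tensor["dtype"])
    return arr.reshape(named_tensor["shape"])
```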
Implementation
A new RPC call, `get_aggregated_tensors` (plural 's'), replaces `get_aggregated_tensor`. Collaborators request a batch of tensors in a new `TensorSpec` format. You may observe the resemblance with `TensorKey`, except that the origin field is missing. The RPC request and response formats are shown below:
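As an illustration of those shapes, here is a Python rendering of the request/response; the actual definitions are protobuf messages in the PR, and the field names below are assumptions modeled on `TensorKey` without the origin field:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TensorSpec:
    # Mirrors TensorKey minus the origin field (field names assumed).
    tensor_name: str
    round_number: int
    report: bool
    tags: List[str]

@dataclass
class GetAggregatedTensorsRequest:
    collaborator_name: str          # assumed: identifies the requester
    tensor_specs: List[TensorSpec]  # the batch of tensors being requested

@dataclass
class GetAggregatedTensorsResponse:
    round_number: int
    tensors: List[dict]             # serialized named-tensor payloads (assumed)
```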
On the collaborator side, tensors are fetched via `self.fetch_tensors_from_aggregator(tensor_keys)`, where `tensor_keys` are of type `List[TensorKey]`.
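A hypothetical usage sketch of that call; `fetch_tensors_from_aggregator` comes from this PR, while the key construction, `aggregator_uuid` attribute, and import path are assumptions for illustration:

```python
from typing import List
from openfl.utilities import TensorKey  # import path assumed

def fetch_model_tensors(collaborator, tensor_names: List[str], round_number: int):
    """Sketch: request all model tensors for a round in one batched RPC."""
    tensor_keys = [
        # TensorKey fields: (tensor_name, origin, round_number, report, tags)
        TensorKey(name, collaborator.aggregator_uuid, round_number, False, ("model",))
        for name in tensor_names
    ]
    # One batched call replaces len(tensor_keys) individual get_aggregated_tensor RPCs.
    return collaborator.fetch_tensors_from_aggregator(tensor_keys)
```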
Cost/Benefit Analysis
One downside of this PR is that the potential bandwidth savings achieved by delta updates are lost.
The upside is a significantly simplified mental model of how tensors are processed on both ends, higher confidence in correctness, and better maintainability.
The second part will address this drawback by bringing back delta updates while keeping the mental model simple.
Reviewer note(s): This PR contains a lot of cleanup. It is best understood by looking at the proposed changes directly rather than the diffs.