metal : add residency sets keep-alive heartbeat #17766
Merged
+320
−43

cont #11427
ref #10119
Something changed in macOS recently: the fix from #11427 no longer works, and the memory wiring/unwiring (a.k.a. throttling) after 1 second of being idle is back. This may have happened with the update to macOS Tahoe, but I'm not sure.
Here are the results on master:

    make -j && ./bin/llama-idle -m ../models/llama-3.1-70b/ggml-model-f16.gguf

And here are the results with this PR:
Attaching the residency sets to the Metal queue seems to mostly eliminate the unwiring of the memory. Every now and then it still seems to occur, though; I'm not sure whether this was already the case on macOS Sequoia.

Edit: Just attaching the residency sets to the Metal queue is not enough. I ended up implementing a background thread that periodically calls MTLResidencySet::requestResidency() to keep the memory buffers wired. The thread runs as a heartbeat in the background and requests residency for all sets approximately every 500 ms.
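For readers who want a feel for the heartbeat pattern, here is a minimal, platform-independent sketch in C++. It is not the code from this PR: the class `KeepAliveHeartbeat`, its `touch()` method, and the injected callback are hypothetical names, and the callback merely stands in for the real `requestResidency()` calls on each residency set. The sketch assumes a period and an idle timeout passed in by the caller.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical helper (not from the PR): invokes `request` roughly every
// `period` while there has been activity within the last `keep_alive`.
class KeepAliveHeartbeat {
public:
    using Clock = std::chrono::steady_clock;

    KeepAliveHeartbeat(std::function<void()> request,
                       std::chrono::milliseconds period,
                       std::chrono::milliseconds keep_alive)
        : request_(std::move(request)),
          period_(period),
          keep_alive_(keep_alive),
          last_activity_(Clock::now()) {
        assert(period_.count() > 0);
        worker_ = std::thread([this] { run(); });
    }

    ~KeepAliveHeartbeat() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        worker_.join();
    }

    // Record activity (e.g. a graph computation) and wake an idle heartbeat.
    void touch() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            last_activity_ = Clock::now();
        }
        cv_.notify_all();
    }

private:
    bool active() const { return Clock::now() - last_activity_ < keep_alive_; }

    void run() {
        std::unique_lock<std::mutex> lock(mu_);
        while (!stop_) {
            if (active()) {
                lock.unlock();
                request_();  // stand-in for requesting residency on all sets
                lock.lock();
                // Sleep for one period, or return early on shutdown.
                cv_.wait_for(lock, period_, [this] { return stop_; });
            } else {
                // Idle: block until touch() or shutdown notifies us.
                cv_.wait(lock, [this] { return stop_ || active(); });
            }
        }
    }

    std::function<void()> request_;
    std::chrono::milliseconds period_;
    std::chrono::milliseconds keep_alive_;
    Clock::time_point last_activity_;
    std::mutex mu_;
    std::condition_variable cv_;
    bool stop_ = false;
    std::thread worker_;
};
```

With the values described here, a caller would construct it with a period of `std::chrono::milliseconds(500)` and call `touch()` whenever work is submitted. Blocking on the condition variable while idle means the thread costs nothing until the next activity.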
By default, this heartbeat stops after 3 minutes of inactivity. It can be controlled with the environment variable:
    # keep the memory wired for 30 seconds after last activity (e.g. graph computation)
    GGML_METAL_RESIDENCY_KEEP_ALIVE_S=30 ...
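As an illustration of how such a setting could be read, here is a small C++ sketch. The function names `parse_keep_alive_s` and `keep_alive_from_env` are hypothetical, and the fallback behavior for malformed or negative values is an assumption; only the variable name and the 3-minute default come from the description above.

```cpp
#include <chrono>
#include <cstdlib>

// Hypothetical helper: parse a keep-alive value in seconds.
// Assumption: missing, malformed, or negative input falls back to 180 s.
inline std::chrono::seconds parse_keep_alive_s(const char * s) {
    const std::chrono::seconds fallback(180);  // 3 minutes, per the description
    if (s == nullptr) {
        return fallback;
    }
    char * end = nullptr;
    const long v = std::strtol(s, &end, 10);
    if (end == s || v < 0) {
        return fallback;  // no digits parsed, or a negative value
    }
    return std::chrono::seconds(v);
}

// Hypothetical helper: read the setting from the environment.
inline std::chrono::seconds keep_alive_from_env() {
    return parse_keep_alive_s(std::getenv("GGML_METAL_RESIDENCY_KEEP_ALIVE_S"));
}
```

Separating the parsing from the `std::getenv` call keeps the parsing logic easy to test without touching the process environment.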