Skip to content

Conversation

@agarwal-navin
Copy link
Contributor

@agarwal-navin agarwal-navin commented Nov 14, 2025

Description

The summaries written by Shared Tree and its summarizables are not versioned. Any changes to the summary format can lead to failures in clients that do not understand this new format. For example, the incremental summarization feature changes the forest's summary format by adding more nodes to its summary tree which older clients won't be able to read. Making this change behind a new version makes it safer as we can fail fast on seeing unrecognizable versions.
This change adds versioning by adding a metadata blob to the summary tree of the shared tree and its summarizables. This metadata blob will contain the format version of the summary. Every time the format of the summary changes, a new version should be added so that clients that don't understand this format will fail.

Here's how versioning will be added:

  1. The current summaries generation (before this PR) are considered format version 1.
  2. WIth this PR, a metadata blob is written into the summary and thus, changing the summary format. This summary's format version is 2. This is fully backwards compatible with version 1 and will not break older clients.
  3. In a future change, the summariables will write format version 3 based on the minVersionForCollab. This version will change the format such that older clients that read / write version 1 and 2 will fail on reading version 3. This will be done to ensure that summary format changes in the future don't silently fail at random places.
  4. Any changes to the summary format by a summarizable after this point should do so with a new format version.

AB#53723

@agarwal-navin agarwal-navin requested a review from a team as a code owner November 14, 2025 19:05
@github-actions github-actions bot added area: dds Issues related to distributed data structures area: dds: tree base: main PRs targeted against main branch labels Nov 14, 2025
@yann-achard-MS
Copy link
Contributor

The PR description currently says:

The metadata blob will be written when minVersionForCollab is greater than the next release version 2.73.0.
but the code seems to write the metadata when minVersionForCollab is greater or equal to version 2.73.0.

Since 2.73.0 has not yet been released, it seems fine to write the metadata when when minVersionForCollab is greater or equal to version 2.73.0, so I suggest updating the PR description.

@yann-achard-MS
Copy link
Contributor

If I understand correctly, this PR ensures that future changes to the summary tree structures (so long as they update the version numbers written to the relevant metadata summary blobs) will make FF clients >= v2_73 fail fast when they do not know that version. This doesn't solve the issue of FF clients < v2_73, which is why the PR description suggests that future changes be made in a way that breaks them too.

My question is: why not proactively make such a change (either in this PR or as an immediate follow-up)?

/**
* The storage key for the blob containing metadata for the detached field index's summary.
*/
export const detachedFieldIndexMetadataKey = ".metadata";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This starting with a "." is interesting.

If this is indented to mean something, maybe establish a pattern that "." keys are part of the top level summary not the summary items or something, it should be documented somewhere central, and this should link to it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I used "." here is because that's the convention the runtime follows for blobs that are not actual data. For example, the metadata, alias, blobs added by the container runtime. The .attributes blob added by channel context's to a DDS's summary, etc.

However, there is no documentation for this and no real reason to do it. It's just a conventional that is followed. Not sure what a good place to add this would be. Maybe to ISummaryTree?

* The type for the metadata in the detached field index's summary.
* Using type definition instead of interface to make this compatible with JsonCompatible.
*/
// eslint-disable-next-line @typescript-eslint/consistent-type-definitions
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not follow our linter's recommendations? Did you read the documentation for that rule and decide why its suggestion should apply to our codebase, but not to this specific instance? If so, please include that reason as a comment here. If not, go read it now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I'm pretty sure I know what it recommends and I agree with it here for a better developer experience, but I should have to know that from memory to review this: the linter is supposed to help tell people things when relevant.

Note 2: its way easier to get to that rule documentation from in the IDE when looking at the warning (click the link in the warning IIRC) than for a reviewer here. That's another reason to explain why this suppression is ok despite our choice to enable this lint inline here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I read the documentation and agree that having consistent types is better for readability. I did add the reason for using a type here in the doc comments above. Do you think that's not sufficient info or maybe you missed it?

services,
parse,
);
assert(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure an "assert" is the right error type for this. When encountering a document too new for the current code, that indicates a usage error by the application team (either bad rollout schedule, bad min version for collab) or a decision to simply not support this client.

In none of those cases is this error a bug in the fluid framework code, which is what asserts indicate.

If there was some upstream checking which was supposed to ensure such a case would never make it to this point, then an assert here would be valid. If that is the case, can you add a comment explaining what check would catch this and give a usage effort before we get here? If not, replace this with a usage error with useful customer facing details (maybe we should have a shared utility for throwing such forwards compat errors?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to a UsageError. Although, not sure how to incorporate information about minVersionForCollab, etc. in the error.

// Load the incremental summary builder so that it can download any incremental chunks in the
// snapshot.
await this.incrementalSummaryBuilder.load(services, readAndParseBlob);
await this.incrementalSummaryBuilder.load(services, async (id: string) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are a lot of "id"s in this code. Could you use a more specific name or type? Just looking at this code I don't even know what this async function is (can't see the parameter name for it from the call site).

It seems a bit odd to be adding this async function as a parameter here in this PR: is this part of the versioning logic somehow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the id to be called chunkPath. Also, updated the incrementalSummaryBuilder.load to take an parameter object to make it easier to see the names of parameter.

I updated this code because I added a utility in this PR to read a blob from snapshot and parse it. It is not related to versioning but just removes duplicate code.

@agarwal-navin
Copy link
Contributor Author

If I understand correctly, this PR ensures that future changes to the summary tree structures (so long as they update the version numbers written to the relevant metadata summary blobs) will make FF clients >= v2_73 fail fast when they do not know that version. This doesn't solve the issue of FF clients < v2_73, which is why the PR description suggests that future changes be made in a way that breaks them too.

My question is: why not proactively make such a change (either in this PR or as an immediate follow-up)?

That is a good point. We could make a breaking change in v1. Let me create a work item to do that as a follow up.

@agarwal-navin
Copy link
Contributor Author

If I understand correctly, this PR ensures that future changes to the summary tree structures (so long as they update the version numbers written to the relevant metadata summary blobs) will make FF clients >= v2_73 fail fast when they do not know that version. This doesn't solve the issue of FF clients < v2_73, which is why the PR description suggests that future changes be made in a way that breaks them too.
My question is: why not proactively make such a change (either in this PR or as an immediate follow-up)?

That is a good point. We could make a breaking change in v1. Let me create a work item to do that as a follow up.

AB#53973

}

const supportedReadVersions = new Set<DetachedFieldIndexSummaryVersion>([
DetachedFieldIndexSummaryVersion.v1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems a bit confusing not to have an entry to represent the current version: supportedReadVersions.has(minVersionToDetachedFieldIndexSummaryVersion("2.72.0")) returns false despite the fact that such a format is supported.

In the message codec this was handled by having MessageFormatVersion.undefined = 0. I'm not sure if there are better patterns, but it seems better than nothing.

Copy link
Contributor

@yann-achard-MS yann-achard-MS Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see that VersionedSummarizer treats undefined as a special value. Having a DetachedFieldIndexSummaryVersion.vUndefined = 0 will mess that up...

Maybe the VersionedSummarizer ctor could take an optional parameter that represents a version number (DetachedFieldIndexSummaryVersion.vUndefined) for which no version information should be written/read in the summary.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a vPreVersioning entry which is 0. VersionedSummarizer will write metadata blob for versions that are greater than this. Please let me know what you think.

Copy link
Contributor

@yann-achard-MS yann-achard-MS Nov 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem better, but why is vPreVersioning not included in supportedReadVersions? My comment about supportedReadVersions.has(minVersionToDetachedFieldIndexSummaryVersion("2.72.0")) returning false still holds, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the logic a bit.
Regarding supportedReadVersions.has(minVersionToDetachedFieldIndexSummaryVersion("2.72.0")) returning false. I was thinking that vPreConditioning is not a supported write version because metadata blob won't be written for this version. It exists to represent summaries written in the time period where this code is added but the minVersionForCollab is < 2.73.0. So, we don't write the metadata blob.
We could change this logic and write the metadata blob with version 0. It won't break anything and the logic becomes a lot simpler. The only slightly weird side effect is that older client will just ignore the metadata blob. But maybe that's a fine trade off.

Comment on lines 24 to 26
* Base class from which all {@link Summarizable}s derive.
* It handles versioning of summaries - writing version metadata to summaries
* and checking version compatibility when loading.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are going to have both the Summarizable interface, and this class, then there is the option of having a an implementation of Summarizable that does not use this class. This flexibility might be useful (for example if something already extending so other class also want to implement Summarizable), but if we want to allow using it we should not say "Base class from which all {@link Summarizable}s derive." and instead something like "Utility for implementing {@link Summarizable}s classes with versioning"

If instead we do actually want to enforce that all Summarizables use versioning via these utilities, I see two approaches:

  1. Replace Summarizable interface with this class (For example just rename this class to it to Summarizable and delete the interface)
  2. Modify the interface to consist of the abstract methods and constructor parameters as properties, and move the code from here to the user of Summarizable objects (sharedTreeCore)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment to say "Utility for implementing {https://github.com/link Summarizable}s classes with versioning".

I don't think we need to force all summarizables to use this utility - only ones that need to be versioned. At least we should start with this flexibility. If we do decide that all summarizables must have versioning, we can update the pattern.

@agarwal-navin agarwal-navin changed the title (tree) Add versioning to shared tree and its summarizables' summaries (tree) Add versioning to shared tree's summary Nov 21, 2025
supportedVersions: Set<number>;
/**
* The default format version to use if the summary doesn't have metadata blob.
* This is true for summaries that were written before versioning was added for summaries.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* This is true for summaries that were written before versioning was added for summaries.
* This is used for summaries that were written before versioning was added for summaries.

}
if (!this.supportedVersions.has(version)) {
throw new UsageError(
`Cannot read version ${version} of shared tree summary. Upgrade to a supported version.`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`Cannot read version ${version} of shared tree summary. Upgrade to a supported version.`,
`Cannot read version ${version} of shared tree summary.`,

I'm not sure we should suggest upgrading as a remedy because this error could happen when newer clients who have dropped support for an old format end up encountering that format. In such a situation, the solution is not to upgrade the client, but to downgrade it.

Comment on lines +104 to +105
// Delete metadata blob from the summary
Reflect.deleteProperty(summary.summary.tree, summarizablesMetadataKey);
Copy link
Contributor

@yann-achard-MS yann-achard-MS Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this only makes sense because the v1 summary format happens to be the same as the v2 format that is generated above.
This test will break when change the default summary being written to some v3 that does not happen to be the same as v1.
This is bad because this test is checking the ability to load v1 summaries, and the test will fail despite the fact that the ability to load v1 summaries might still work fine.

A better pattern would be to keep a JSON object or JSON string representing a summary in the old format as a constant (instead of generating one at runtime). This will also guard against the risk that the generation code for v1/v2 summaries might change.

I think the overall pattern we want to promote is as follows:
For each format that we support reading:

  • keep a copy of a summary written in that format
  • write a test that ensures that reading/loading a summary from that format succeeds

For each format that we support writing:

  • write a snapshot test that ensures writing a summary to that format generates the same thing as it always has

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write a snapshot test that ensures writing a summary to that format generates the same thing as it always has

Maybe that last bit is already sufficiently covered by the existing snapshot tests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area: dds: tree area: dds Issues related to distributed data structures base: main PRs targeted against main branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants