-
Notifications
You must be signed in to change notification settings - Fork 14.1k
parser/lexer: bump to Unicode 17, use faster unicode-ident #148321
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
The list of allowed third-party dependencies may have been modified! You must ensure that any new dependencies have compatible licenses before merging. These commits modify the If this was unintentional then you should revert the changes before this PR is merged. |
|
rustbot has assigned @Mark-Simulacrum. Use |
This comment has been minimized.
This comment has been minimized.
|
If the Unicode version changes are intentional, cc @ehuss |
|
@bors try @rust-timer queue (Parsing could be affected too). |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
parser/lexer: bump to Unicode 17, use faster unicode-ident
This comment has been minimized.
This comment has been minimized.
|
Finished benchmarking commit (988451c): comparison URL. Overall result: ❌ regressions - no action neededBenchmarking this pull request means it may be perf-sensitive – we'll automatically label it not fit for rolling up. You can override this, but we strongly advise not to, due to possible changes in compiler perf. @bors rollup=never Instruction countOur most reliable metric. Used to determine the overall result above. However, even this metric can be noisy.
Max RSS (memory usage)This benchmark run did not return any relevant results for this metric. CyclesResults (primary 2.6%, secondary 3.6%)A less reliable metric. May be of interest, but not used to determine the overall result above.
Binary sizeThis benchmark run did not return any relevant results for this metric. Bootstrap: 473.971s -> 474.835s (0.18%) |
|
Is there a reason why the reference explicitly specifies the Unicode version in a way that makes it feel like updating that version is a nontrivial change? i.e., is there a reason why it does not clarify that the Unicode version in the compiler is allowed to be (and should be) bumped whenever Unicode releases a new version, and to simply say something like "it is version N as of Rust 1.M"? |
|
I think it needs some level of review, to make sure that (for instance) the new Unicode version isn't doing anything out of the ordinary, and to make sure that some person in the project experienced with Unicode has taken at least a cursory look at the changes to I don't think that review is best done in lang. I think we should delgate that to whichever team is making sure of the above. (Is that T-compiler?) So, ideally, I'd love to see a proposal to lang requesting a delegation to take responsibility for the above. That said, let's go ahead and sign off on this change to unblock it. |
|
☔ The latest upstream changes (presumably #147029) made this pull request unmergeable. Please resolve the merge conflicts. |
|
I tend to agree that FCP seems like the wrong tool to confirm implementation details (e.g., membership of XID sets). I don't know what specific review you'd like to perform so yielding to you. |
|
|
Replace unicode-xid with unicode-ident which is 6 times faster
|
This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed. Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers. |
|
cc @rust-lang/fls |
|
The final comment period, with a disposition to merge, as per the review above, is now complete. As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed. This will be merged soon. |
|
@simulacrum @Josh Triplett This PR seems to be stuck in limbo. I'm not quite following exchange. Josh, were you specifically asking someone to do an analysis now of the codepoints that changed? Or were you saying that is something that should usually be done in the future? @simulacrum I think Josh was saying they specifically don't want to do the review. Are you saying that unicode-security needs to be updated to Unicode 17 before this can continue? Are Unicode updates required to keep the confusables list in sync? If you want a diff, here it is: https://gist.github.com/ehuss/4ec7304c08bc3d179febf3fff8fdbacd |
|
Personally, I'm happy to r+ this, but several people on lang indicated they wanted some kind of review. It's not obvious to me whether their check boxes on the FCP indicate they have done said review (it's not clear to me what would go into that, so I can't really check). It's not clear to me what would block us from bumping the Unicode version we're using (maybe I don't know enough, though), so in light of that uncertainty I didn't want to just r+. But if we treat the FCP as having resolved it then I'm happy to effectively rubber stamp this. @ehuss, you had some concerns over using different crates here - happy for us to not approve this for those reasons (I don't fully understand that concern but haven't dug into the details). |
|
Since all the relevant crates seem to export some flavour of the |
|
From the lang side, as @scottmcm mentioned,
our FCP covered only the decision to change the set of Rust programs that are accepted by virtue of changing the version of Unicode that Rust is defined to use. It did not indicate that we had done any review beyond that we felt necessary to OK this in terms of the language definition. No concerns have been filed, and our FCP is complete, so if an r+ reviewer is happy with this change (on the understanding that nobody on lang necessarily did the detailed check that @joshtriplett mentioned), it's procedurally good to go as far as the lang team is concerned. |
This sounds like a good idea. Can we add a static assert or similar on this?
Given the above I think I'm OK moving forward with the asserts added. r=me with that done. |
I was poking around on the feasibility of this and will post the relevant PRs when I get around to it. Ultimately, after looking more into the relevant stuff, it really doesn't feel like it matters that much; most of the "worst case scenarios" are just things like identifiers using just-added scripts being accepted by cargo and not by the compiler, or vice/versa, which still would result in an error regardless. Unicode has a pretty strong stability guarantee and they take care to not cause too many issues. That said, I'll be sure to link the relevant PRs back here in case people are curious. |
|
Clarifying something: the request for the future, not blocking this PR, is that we have some set of folks (e.g. @Manishearth) who know Unicode well that we could delegate the "version N+1 LGTM" decision to in the future. |
|
I'd be willing to help but I think for that to work the relevant teams would have to codify what they are looking for. For reference, here's the difference between Unicode 16 and 17 for XID_Start: (and here are the new XID_Continue-only characters) Is it a question of mismatched goals between Rust and Unicode, where Unicode's bar for "valid in identifiers" is subtly different from that of Rust? Is it a question of trust in Unicode's ability to hit that bar? Or something else? The rough definition from the spec is this:
If it's about Rust having a different bar, then Rust needs to articulate what that bar is. There's a good chance that we can even write tests against Unicode data that verify some of those properties. This is what we did in the non-ascii idents RFC: Rust articulated its concerns about it, and we were able to crystallize those into lints that use other Unicode properties (ones explicitly designed for this purpose, even). If it's about Unicode's ability to hit that bar, I can certainly give such a list a once-over and check if there aren't any mistakes. But beyond that, I am literally on the body that assigns these properties, and if I needed help to verify such a thing (no one person understands all of the things in Unicode) I'd just be asking other people on that body, so I don't think that would be any sort of meaningful exercise.
Note that this is mostly because unicode_rs crates get bumped when someone wants them to be. If Rust wanted them to be bumped, they can be bumped (best way: make a PR with the bump). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since all the relevant crates seem to export some flavour of the UNICODE_VERSION constant included in the standard library, perhaps there could be a simple test somewhere that asserts that all these constants are identical? That way manual review of the versions isn't required.
At the moment there is this test in place, checking that rustc_lexer::UNICODE_IDENT_VERSION == rustc_parse::UNICODE_NORMALIZATION_VERSION
Would expanding this test to all the Unicode-version dependent crates used in /compiler/* be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, to clarify, this is exactly what I was proposing in this comment: #148321 (comment)
I was looking into it, and my particular method I was going with was just adding an anonymous constant in each of the crates to make a static assertion. So, for example, in rustc_lexer, this is what I currently have right now in my (unpublished) branch:
// ensure that unicode version is same as libstd
const _: () = {
let internal = std::char::UNICODE_VERSION;
let properties = unicode_properties::UNICODE_VERSION;
assert!(internal.0 as u64 == properties.0);
assert!(internal.1 as u64 == properties.1);
assert!(internal.2 as u64 == properties.2);
};
const _: () = {
let internal = std::char::UNICODE_VERSION;
let xid = unicode_xid::UNICODE_VERSION;
assert!(internal.0 as u64 == xid.0);
assert!(internal.1 as u64 == xid.1);
assert!(internal.2 as u64 == xid.2);
};(note: I'm using libstd's version as an anchor because it's the most convenient: if all of them match, then libstd should match. also, while I could use unstable features to make this simpler with assert_eq, I chose to go with something that works on stable atm just for ease of maintenance)
|
Gonna preface this with: my main reason for having this discussion in the first place is, as I mostly mentioned, I think that bumping the Unicode version, since it is a routine change, should have some sort of playbook so we don't need to have these discussions again. And, so, I'm having these discussions here because I think the formation of said playbook is relevant, but if you'd rather move it somewhere else, I'm fine doing that too.
Not that this is the place to put this, but the exact reason why I submitted unicode-rs/unicode-script#25 (which you merged) is because Clippy uses Right now, tidy only allows checking particular crates in There's also a potential consideration of just including the Unicode functionality most critical to the compiler directly in the compiler itself. Since all of the code used has a compatible license, it shouldn't be tremendously difficult to copy it in somewhere, and it would also mean that we could share some of the code that parses stuff like |
Hello,
Bump the unicode version used by lexer/parser to 17.0.0 by updating:
unicode-normalizationto 0.1.25unicode-propertiesto 0.1.4unicode-widthto 0.2.2and by replacing
unicode-xidwithunicode-identwhich is also 6 times faster.I think it might be worth to run the benchmarks to double check.
(
unicode-identis already insrc/tools/tidy/src/deps.rs)Thanks!