@encukou (Member) commented Oct 22, 2025

This simplifies the Lexical Analysis section on Names (but keeps it technically correct) by putting all the info about non-ASCII characters in a separate (and very technical) section.

It uses a mental model where the parser doesn't handle Unicode complexity “immediately”, but:

  1. parses any non-ASCII character (outside strings/comments) as part of a name, since these can't (yet) be e.g. operators
  2. normalizes the name
  3. validates the name, using the id_start/id_continue sets (referred to in previous sections as “letter-like” and “number-like” characters, with a link to the details)

This also means we don't need xid_start/xid_continue to define the behaviour :)
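
As a rough illustration of the normalize-then-validate steps above, here is a minimal Python sketch. It is not the tokenizer's actual code: `check_name` is a hypothetical helper, the first step (collecting the characters of a candidate name) is skipped, and `str.isidentifier()` is only a stand-in for the validation step, which is an assumption of this sketch.

```python
import unicodedata


def check_name(raw: str) -> str:
    """Normalize a candidate name, then validate it (illustrative only)."""
    # Step 2: normalize the candidate name to NFKC.
    normalized = unicodedata.normalize("NFKC", raw)
    # Step 3: validate the normalized form. str.isidentifier() is a
    # stand-in here for the id_start/id_continue check described above.
    if not normalized.isidentifier():
        raise ValueError(f"invalid name: {raw!r}")
    return normalized


print(check_name("nᵘₘᵇₑʳ"))  # number
```

The point of the ordering is that validation only ever sees the normalized text, so the check itself can stay simple.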


📚 Documentation preview 📚: https://cpython-previews--140464.org.readthedocs.build/

encukou and others added 4 commits October 8, 2025 17:58
Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Micha Albert <info@micha.zone>
Co-authored-by: KeithTheEE <kmurrayis@gmail.com>

@willingc (Contributor) left a comment

Outstanding document @encukou. I had one small suggestion to be a bit more explicit on the normalization example with number.

This means that, for example, some typographic variants of characters are
converted to their "basic" form, for example::

>>> nᵘₘᵇₑʳ = 3

Contributor:

It would be helpful to add an explicit comment that the normalized form of nᵘₘᵇₑʳ is number.
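
For readers who want to check this themselves, a small sketch along these lines should show the effect. The scratch dictionary `namespace` and the variable `fancy` are names made up for this illustration; the only assumption is that the compiler applies NFKC normalization to names, as the documentation under review describes.

```python
import unicodedata

fancy = "nᵘₘᵇₑʳ"
namespace = {}
# Compile and run the assignment from the example in a scratch namespace.
exec(f"{fancy} = 3", namespace)

print("number" in namespace)  # True
print(fancy in namespace)     # False
# Looking the name up by string requires normalizing it first.
print(namespace[unicodedata.normalize("NFKC", fancy)])  # 3
```

The raw spelling is not a key in the namespace. Only the normalized form is stored, so string lookups have to normalize the name themselves.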

Member Author:

Does this look good?

@encukou (Member, Author) commented Nov 5, 2025

There was an insightful conversation in #140269. I'll update this PR to make things even clearer.

@willingc (Contributor) left a comment

Thanks @encukou

@encukou marked this pull request as ready for review November 19, 2025 16:08

@willingc (Contributor) left a comment

Nice work @encukou!

@encukou (Member, Author) commented Nov 20, 2025

Thank you for the review!

@malemburg, do you also want to take a look?

@encukou merged commit 2ff8608 into python:main Nov 26, 2025
36 checks passed
@encukou deleted the lex-analysis-names-simpler branch November 26, 2025 15:10
@github-project-automation bot moved this from Todo to Done in Docs PRs Nov 26, 2025
@encukou added the "needs backport to 3.14" (bugs and security fixes) label Nov 26, 2025
@miss-islington-app

Thanks @encukou for the PR 🌮🎉.. I'm working now to backport this PR to: 3.14.
🐍🍒⛏🤖

@miss-islington-app

Sorry, @encukou, I could not cleanly backport this to 3.14 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 2ff8608b4da33f667960e5099a1a442197acaea4 3.14

@bedevere-app bot commented Nov 27, 2025

GH-142015 is a backport of this pull request to the 3.14 branch.

@bedevere-app bot removed the "needs backport to 3.14" (bugs and security fixes) label Nov 27, 2025
StanFromIreland added a commit to StanFromIreland/cpython that referenced this pull request Nov 27, 2025
This simplifies the Lexical Analysis section on Names (but keeps it technically correct) by putting all the info about non-ASCII characters in a separate (and very technical) section.

It uses a mental model where the parser doesn't handle Unicode complexity “immediately”, but:

- parses any non-ASCII character (outside strings/comments) as part of a name, since these can't (yet) be e.g. operators
- normalizes the name
- validates the name, using the xid_start/xid_continue sets

(cherry picked from commit 2ff8608)

Co-authored-by: Petr Viktorin <encukou@gmail.com>
Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com>
Co-authored-by: Blaise Pabon <blaise@gmail.com>
Co-authored-by: Micha Albert <info@micha.zone>
Co-authored-by: KeithTheEE <kmurrayis@gmail.com>
Labels: docs (Documentation in the Doc dir), skip news

Project status: Done

Linked issue (may be closed by merging this pull request):
Docs: note requirement to normalise unicode identifiers passed to globals() and locals()