From 4606120878c12b6d20930689a91dbf7096af66ee Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 8 Oct 2025 17:58:54 +0200 Subject: [PATCH 1/7] Simplify Names section Co-authored-by: Stan Ulbrych <89152624+StanFromIreland@users.noreply.github.com> Co-authored-by: Blaise Pabon Co-authored-by: Micha Albert Co-authored-by: KeithTheEE --- Doc/reference/lexical_analysis.rst | 140 +++++++++++++++++------------ 1 file changed, 82 insertions(+), 58 deletions(-) diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index 0b0dba1a996af0..4c62d06db6a187 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -386,73 +386,29 @@ Names (identifiers and keywords) :data:`~token.NAME` tokens represent *identifiers*, *keywords*, and *soft keywords*. -Within the ASCII range (U+0001..U+007F), the valid characters for names -include the uppercase and lowercase letters (``A-Z`` and ``a-z``), -the underscore ``_`` and, except for the first character, the digits -``0`` through ``9``. +Names are composed of the following characters: + +* Uppercase and lowercase letters (``A-Z`` and ``a-z``) +* The underscore (``_``) +* Digits (``0`` through ``9``), which cannot appear as the first character +* Non-ASCII characters. Valid names may only contain "letter-like" and + "digit-like" characters; see :ref:`lexical-names-nonascii` for details. Names must contain at least one character, but have no upper length limit. Case is significant. -Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can also use "letter-like" -and "number-like" characters from outside the ASCII range, as detailed below. - -All identifiers are converted into the `normalization form`_ NFKC while -parsing; comparison of identifiers is based on NFKC. 
- -Formally, the first character of a normalized identifier must belong to the -set ``id_start``, which is the union of: - -* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``) -* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``) -* Unicode category ``<Lt>`` - titlecase letters -* Unicode category ``<Lm>`` - modifier letters -* Unicode category ``<Lo>`` - other letters -* Unicode category ``<Nl>`` - letter numbers -* {``"_"``} - the underscore -* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_ - to support backwards compatibility - -The remaining characters must belong to the set ``id_continue``, which is the -union of: - -* all characters in ``id_start`` -* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``) -* Unicode category ``<Pc>`` - connector punctuations -* Unicode category ``<Mn>`` - nonspacing marks -* Unicode category ``<Mc>`` - spacing combining marks -* ``<Other_ID_Continue>`` - another explicit set of characters in - `PropList.txt`_ to support backwards compatibility - -Unicode categories use the version of the Unicode Character Database as -included in the :mod:`unicodedata` module. - -These sets are based on the Unicode standard annex `UAX-31`_. -See also :pep:`3131` for further details. - -Even more formally, names are described by the following lexical definitions: +Formally, names are described by the following lexical definitions: .. grammar-snippet:: :group: python-grammar - NAME: `xid_start` `xid_continue`* - id_start: <Lu> | <Ll> | <Lt> | <Lm> | <Lo> | <Nl> | "_" | <Other_ID_Start> - id_continue: `id_start` | <Nd> | <Pc> | <Mn> | <Mc> | <Other_ID_Continue> - xid_start: - xid_continue: - identifier: <`NAME`, except keywords> - -A non-normative listing of all valid identifier characters as defined by -Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode -Character Database. - + NAME: `name_start` `name_continue`* + name_start: "a".."z" | "A".."Z" | "_" | <non-ASCII character> + name_continue: name_start | "0".."9" + identifier: <`NAME`, except keywords> -.. _UAX-31: https://www.unicode.org/reports/tr31/ -.. 
_PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt -.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt -.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms +Note that not all names matched by this grammar are valid; see +:ref:`lexical-names-nonascii` for details. .. _keywords: @@ -555,6 +511,74 @@ characters: :ref:`atom-identifiers`. +.. _lexical-names-nonascii: + +Non-ASCII characters in names +----------------------------- + +Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can use "letter-like" +and "number-like" characters from outside the ASCII range, +as detailed in this sections. + +All names are converted into the `normalization form`_ NFKC while parsing. +This means that, for example, some typographic variants of characters are +converted to their "basic" form, for example:: + + >>> nᵘₘᵇₑʳ = 3 + >>> number + 3 + +.. note:: + + Normalization is done at the lexical level only. + Run-time functions that take names as *strings* generally do not normalize + their arguments. + For example, the variable defined above is accessible in the + :func:`globals` dictionary as ``globals()["number"]`` but not + ``globals()["nᵘₘᵇₑʳ"]``. + +The first character of a normalized identifier must be "letter-like". +Formally, this means it must belong to the set ``id_start``, +which is the union of: + +* Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``) +* Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``) +* Unicode category ``<Lt>`` - titlecase letters +* Unicode category ``<Lm>`` - modifier letters +* Unicode category ``<Lo>`` - other letters +* Unicode category ``<Nl>`` - letter numbers +* {``"_"``} - the underscore +* ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_ + to support backwards compatibility + +The remaining characters must be "letter-like" or "digit-like". 
+Formally, they must belong to the set ``id_continue``, which is the union of: + +* ``id_start`` (see above) +* Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``) +* Unicode category ``<Pc>`` - connector punctuations +* Unicode category ``<Mn>`` - nonspacing marks +* Unicode category ``<Mc>`` - spacing combining marks +* ``<Other_ID_Continue>`` - another explicit set of characters in + `PropList.txt`_ to support backwards compatibility + +Unicode categories use the version of the Unicode Character Database as +included in the :mod:`unicodedata` module. + +These sets are based on the Unicode standard annex `UAX-31`_. +See also :pep:`3131` for further details. + +A non-normative listing of all valid identifier characters as defined by +Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode +Character Database. + + +.. _UAX-31: https://www.unicode.org/reports/tr31/ +.. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt +.. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt +.. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms + + .. _literals: Literals From 6163c24f21edf615c4066de1a721f5a48a9d7092 Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 22 Oct 2025 17:01:14 +0200 Subject: [PATCH 2/7] Casing; 3 dots for character ranges --- Doc/reference/lexical_analysis.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index 4c62d06db6a187..033f36f3783166 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -388,10 +388,10 @@ Names (identifiers and keywords) Names are composed of the following characters: -* Uppercase and lowercase letters (``A-Z`` and ``a-z``) -* The underscore (``_``) -* Digits (``0`` through ``9``), which cannot appear as the first character -* Non-ASCII characters. 
Valid names may only contain "letter-like" and +* uppercase and lowercase letters (``A-Z`` and ``a-z``), +* the underscore (``_``), +* digits (``0`` through ``9``), which cannot appear as the first character, and +* non-ASCII characters. Valid names may only contain "letter-like" and "digit-like" characters; see :ref:`lexical-names-nonascii` for details. Names must contain at least one character, but have no upper length limit. @@ -403,8 +403,8 @@ Formally, names are described by the following lexical definitions: :group: python-grammar NAME: `name_start` `name_continue`* - name_start: "a".."z" | "A".."Z" | "_" | <non-ASCII character> - name_continue: name_start | "0".."9" + name_start: "a"..."z" | "A"..."Z" | "_" | <non-ASCII character> + name_continue: name_start | "0"..."9" identifier: <`NAME`, except keywords> Note that not all names matched by this grammar are valid; see From de6d1afee24ab668797f1c81c870bcbdee5197da Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 22 Oct 2025 17:20:31 +0200 Subject: [PATCH 3/7] Clean-ups --- Doc/reference/lexical_analysis.rst | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index 033f36f3783166..4d406fcedb78bb 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -518,7 +518,7 @@ Non-ASCII characters in names Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can use "letter-like" and "number-like" characters from outside the ASCII range, -as detailed in this sections. +as detailed in this section. All names are converted into the `normalization form`_ NFKC while parsing. This means that, for example, some typographic variants of characters are @@ -533,13 +533,13 @@ converted to their "basic" form, for example:: Normalization is done at the lexical level only. Run-time functions that take names as *strings* generally do not normalize their arguments. 
- For example, the variable defined above is accessible in the + For example, the variable defined above is accessible at run time in the :func:`globals` dictionary as ``globals()["number"]`` but not ``globals()["nᵘₘᵇₑʳ"]``. The first character of a normalized identifier must be "letter-like". Formally, this means it must belong to the set ``id_start``, -which is the union of: +which is defined as the union of: * Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``) * Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``) @@ -552,7 +552,8 @@ which is the union of: to support backwards compatibility The remaining characters must be "letter-like" or "digit-like". -Formally, they must belong to the set ``id_continue``, which is the union of: +Formally, they must belong to the set ``id_continue``, which is defined as +the union of: * ``id_start`` (see above) * Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``) @@ -565,14 +566,14 @@ Formally, they must belong to the set ``id_continue``, which is the union of: Unicode categories use the version of the Unicode Character Database as included in the :mod:`unicodedata` module. -These sets are based on the Unicode standard annex `UAX-31`_. -See also :pep:`3131` for further details. +The ``id_start`` and ``id_continue`` sets are based on the Unicode standard +annex `UAX-31`_. See also :pep:`3131` for further details. +Note that Python does not necessarily conform to `UAX-31`_. A non-normative listing of all valid identifier characters as defined by Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode Character Database. - .. _UAX-31: https://www.unicode.org/reports/tr31/ .. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt .. 
_DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt From 152e7aa1bfba6bc6e1fd5f45a0617bb92f4be2ff Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 22 Oct 2025 17:39:36 +0200 Subject: [PATCH 4/7] Mention Unicode's *ID_Start* and *ID_Continue* --- Doc/reference/lexical_analysis.rst | 3 +++ 1 file changed, 3 insertions(+) diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index 4d406fcedb78bb..1116e8c43bfd9d 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -573,6 +573,9 @@ Note that Python does not necessarily conform to `UAX-31`_. A non-normative listing of all valid identifier characters as defined by Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode Character Database. +The properties *ID_Start* and *ID_Continue* are very similar to Python's +``id_start`` and ``id_continue`` sets; the properties *XID_Start* and +*XID_Continue* play similar roles for identifiers before NFKC normalization. .. _UAX-31: https://www.unicode.org/reports/tr31/ .. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt From fce5e98384e660ad78659d00660fc2d4e96ce50a Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 29 Oct 2025 16:05:10 +0100 Subject: [PATCH 5/7] =?UTF-8?q?Make=20it=20clear=20that=20`n=E1=B5=98?= =?UTF-8?q?=E2=82=98=E1=B5=87=E2=82=91=CA=B3`=20normalizes=20to=20`number`?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- Doc/reference/lexical_analysis.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index 1116e8c43bfd9d..e3de31729bbc31 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -522,7 +522,8 @@ as detailed in this section. All names are converted into the `normalization form`_ NFKC while parsing. 
This means that, for example, some typographic variants of characters are -converted to their "basic" form, for example:: +converted to their "basic" form. For example, ``nᵘₘᵇₑʳ`` normalizes to +``number``, so Python treats them as the same name:: >>> nᵘₘᵇₑʳ = 3 >>> number From b9fdcf0c6a6aa6d58ea670461e3418bc6d6dcd3a Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 12 Nov 2025 18:06:42 +0100 Subject: [PATCH 6/7] WIP --- Doc/reference/lexical_analysis.rst | 28 +++++++++++++++++++++++++--- 1 file changed, 25 insertions(+), 3 deletions(-) diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index e3de31729bbc31..65a0553e3329d3 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -516,9 +516,28 @@ characters: Non-ASCII characters in names ----------------------------- -Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can use "letter-like" -and "number-like" characters from outside the ASCII range, -as detailed in this section. +Python identifiers may contain all sorts of characters. +For example, ``ř_1``, ``蛇``, or ``साँप`` are valid identifiers. +However, ``r〰2``, ``€``, or ``🐍`` are not. +Additionally, some variations are considered equivalent: for example, +``fi`` (2 letters) and ``fi`` (1 ligature). + + +A :ref:`name token ` that only contains ASCII characters +(``A-Z``, ``a-z``, ``_`` and ``0-9``) is always valid, and distinct from +different ASCII-only names. +The rules are somewhat more complicated when using non-ASCII characters. + +Informally, all names must be composed of letters, digits, numbers and +underscores, and cannot start with a digit. + + + + +Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can use characters +from outside the ASCII range. + +, as detailed in this section. All names are converted into the `normalization form`_ NFKC while parsing. 
This means that, for example, some typographic variants of characters are @@ -538,6 +557,9 @@ converted to their "basic" form. For example, ``nᵘₘᵇₑʳ`` normalizes to :func:`globals` dictionary as ``globals()["number"]`` but not ``globals()["nᵘₘᵇₑʳ"]``. +Similarly to how ASCII-only names must contain only letters, numbers and +the underscore, and cannot start with a digit, the normalized name must + The first character of a normalized identifier must be "letter-like". Formally, this means it must belong to the set ``id_start``, which is defined as the union of: From 43f609192c2e206a503ec3df11224156b472a785 Mon Sep 17 00:00:00 2001 From: Petr Viktorin Date: Wed, 19 Nov 2025 17:07:17 +0100 Subject: [PATCH 7/7] Reword to use XID_Start and XID_Continue --- Doc/reference/lexical_analysis.rst | 86 ++++++++++++++---------------- 1 file changed, 40 insertions(+), 46 deletions(-) diff --git a/Doc/reference/lexical_analysis.rst b/Doc/reference/lexical_analysis.rst index b726cfa66989e8..129dc10d07f7c9 100644 --- a/Doc/reference/lexical_analysis.rst +++ b/Doc/reference/lexical_analysis.rst @@ -516,36 +516,21 @@ characters: Non-ASCII characters in names ----------------------------- -Python identifiers may contain all sorts of characters. -For example, ``ř_1``, ``蛇``, or ``साँप`` are valid identifiers. -However, ``r〰2``, ``€``, or ``🐍`` are not. -Additionally, some variations are considered equivalent: for example, -``fi`` (2 letters) and ``fi`` (1 ligature). +Names that contain non-ASCII characters need additional normalization +and validation beyond the rules and grammar explained +:ref:`above `. +For example, ``ř_1``, ``蛇``, or ``साँप`` are valid names, but ``r〰2``, +``€``, or ``🐍`` are not. - -A :ref:`name token ` that only contains ASCII characters -(``A-Z``, ``a-z``, ``_`` and ``0-9``) is always valid, and distinct from -different ASCII-only names. -The rules are somewhat more complicated when using non-ASCII characters. 
- -Informally, all names must be composed of letters, digits, numbers and -underscores, and cannot start with a digit. - - - - -Besides ``A-Z``, ``a-z``, ``_`` and ``0-9``, names can use characters -from outside the ASCII range. - -, as detailed in this section. +This section explains the exact rules. All names are converted into the `normalization form`_ NFKC while parsing. This means that, for example, some typographic variants of characters are -converted to their "basic" form. For example, ``nᵘₘᵇₑʳ`` normalizes to -``number``, so Python treats them as the same name:: +converted to their "basic" form. For example, ``fiⁿₐˡᵢᶻₐᵗᵢᵒₙ`` normalizes to +``finalization``, so Python treats them as the same name:: - >>> nᵘₘᵇₑʳ = 3 - >>> number + >>> fiⁿₐˡᵢᶻₐᵗᵢᵒₙ = 3 + >>> finalization 3 .. note:: @@ -554,15 +539,26 @@ converted to their "basic" form. For example, ``nᵘₘᵇₑʳ`` normalizes to Run-time functions that take names as *strings* generally do not normalize their arguments. For example, the variable defined above is accessible at run time in the - :func:`globals` dictionary as ``globals()["number"]`` but not - ``globals()["nᵘₘᵇₑʳ"]``. + :func:`globals` dictionary as ``globals()["finalization"]`` but not + ``globals()["fiⁿₐˡᵢᶻₐᵗᵢᵒₙ"]``. + +Similarly to how ASCII-only names must contain only letters, digits and +the underscore, and cannot start with a digit, a valid name must +start with a character in the "letter-like" set ``xid_start``, +and the remaining characters must be in the "letter- and digit-like" set +``xid_continue``. + +These sets are based on the *XID_Start* and *XID_Continue* sets as defined by the +Unicode standard annex `UAX-31`_. +Python's ``xid_start`` additionally includes the underscore (``_``). +Note that Python does not necessarily conform to `UAX-31`_. 
-Similarly to how ASCII-only names must contain only letters, numbers and -the underscore, and cannot start with a digit, the normalized name must +A non-normative listing of characters in the *XID_Start* and *XID_Continue* +sets as defined by Unicode is available in the `DerivedCoreProperties.txt`_ +file in the Unicode Character Database. +For reference, the construction rules for the ``xid_*`` sets are given below. -The first character of a normalized identifier must be "letter-like". -Formally, this means it must belong to the set ``id_start``, -which is defined as the union of: +The set ``id_start`` is defined as the union of: * Unicode category ``<Lu>`` - uppercase letters (includes ``A`` to ``Z``) * Unicode category ``<Ll>`` - lowercase letters (includes ``a`` to ``z``) @@ -574,9 +570,11 @@ which is defined as the union of: * ``<Other_ID_Start>`` - an explicit set of characters in `PropList.txt`_ to support backwards compatibility -The remaining characters must be "letter-like" or "digit-like". -Formally, they must belong to the set ``id_continue``, which is defined as -the union of: +The set ``xid_start`` then closes this set under NFKC normalization, by +removing all characters whose normalization is not of the form +``id_start id_continue*``. + +The set ``id_continue`` is defined as the union of: * ``id_start`` (see above) * Unicode category ``<Nd>`` - decimal numbers (includes ``0`` to ``9``) @@ -586,25 +584,21 @@ the union of: * ``<Other_ID_Continue>`` - another explicit set of characters in `PropList.txt`_ to support backwards compatibility +Again, ``xid_continue`` closes this set under NFKC normalization. + Unicode categories use the version of the Unicode Character Database as included in the :mod:`unicodedata` module. -The ``id_start`` and ``id_continue`` sets are based on the Unicode standard -annex `UAX-31`_. See also :pep:`3131` for further details. -Note that Python does not necessarily conform to `UAX-31`_. 
- -A non-normative listing of all valid identifier characters as defined by -Unicode is available in the `DerivedCoreProperties.txt`_ file in the Unicode -Character Database. -The properties *ID_Start* and *ID_Continue* are very similar to Python's -``id_start`` and ``id_continue`` sets; the properties *XID_Start* and -*XID_Continue* play similar roles for identifiers before NFKC normalization. - .. _UAX-31: https://www.unicode.org/reports/tr31/ .. _PropList.txt: https://www.unicode.org/Public/17.0.0/ucd/PropList.txt .. _DerivedCoreProperties.txt: https://www.unicode.org/Public/17.0.0/ucd/DerivedCoreProperties.txt .. _normalization form: https://www.unicode.org/reports/tr15/#Norm_Forms +.. seealso:: + + * :pep:`3131` -- Supporting Non-ASCII Identifiers + * :pep:`672` -- Unicode-related Security Considerations for Python + .. _literals: