erifan commented Dec 8, 2025

This patch adds intrinsic support for UMIN and UMAX reduction operations in the Vector API on AArch64, enabling direct hardware instruction mapping for better performance.

Changes:

  1. C2 mid-end:
    • Added UMinReductionVNode and UMaxReductionVNode

  2. AArch64 backend:
    • Added uminp/umaxp/sve_uminv/sve_umaxv instructions
    • Updated match rules for all vector sizes and element types
    • Both NEON and SVE implementations are supported

  3. Test:
    • Added UMIN_REDUCTION_V and UMAX_REDUCTION_V to IRNode.java
    • Added assembly tests in aarch64-asmtest.py for new instructions
    • Added a JTReg test file VectorUMinMaxReductionTest.java
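
For readers less familiar with these operations, the semantics of an unsigned min/max lane reduction can be sketched in scalar Java. This is only an illustrative reference, not code from the patch: the class and method names are hypothetical, and the masked variant assumes that unselected lanes simply do not contribute. In the Vector API the corresponding call would be along the lines of `v.reduceLanes(VectorOperators.UMAX)`.

```java
// Scalar reference for unsigned lane reductions (illustrative only,
// not the C2 or AArch64 code path; all names here are hypothetical).
final class UnsignedReductionSketch {

    // Unsigned max over all lanes. The accumulator starts at 0,
    // the smallest value under unsigned ordering.
    static int umaxReduce(int[] lanes) {
        int acc = 0;
        for (int lane : lanes) {
            if (Integer.compareUnsigned(lane, acc) > 0) {
                acc = lane;
            }
        }
        return acc;
    }

    // Unsigned min over all lanes. The accumulator starts at -1
    // (0xFFFFFFFF), the largest value under unsigned ordering.
    static int uminReduce(int[] lanes) {
        int acc = -1;
        for (int lane : lanes) {
            if (Integer.compareUnsigned(lane, acc) < 0) {
                acc = lane;
            }
        }
        return acc;
    }

    // Masked unsigned max: lanes whose mask entry is false are skipped,
    // so an all-false mask yields the identity 0.
    static int umaxReduceMasked(int[] lanes, boolean[] mask) {
        int acc = 0;
        for (int i = 0; i < lanes.length; i++) {
            if (mask[i] && Integer.compareUnsigned(lanes[i], acc) > 0) {
                acc = lanes[i];
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] lanes = {3, -1, 7, Integer.MIN_VALUE};
        boolean[] mask = {true, false, true, true};
        System.out.println(Integer.toUnsignedString(umaxReduce(lanes)));             // 4294967295
        System.out.println(Integer.toUnsignedString(uminReduce(lanes)));             // 3
        System.out.println(Integer.toUnsignedString(umaxReduceMasked(lanes, mask))); // 2147483648
    }
}
```

Note that -1 wins the unsigned max because its bit pattern is the largest unsigned value, which is exactly where a signed reduction would give a different answer.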

Different configurations were tested on aarch64 and x86 machines, and all tests passed.

Test results of JMH benchmarks from the panama-vector project:

On an NVIDIA Grace machine with 128-bit SVE:

Benchmark                       Unit    Before  Error   After           Error   Uplift
Byte128Vector.UMAXLanes         ops/ms  411.60  42.18   25226.51        33.92   61.29
Byte128Vector.UMAXMaskedLanes   ops/ms  558.56  85.12   25182.90        28.74   45.09
Byte128Vector.UMINLanes         ops/ms  645.58  780.76  28396.29        103.11  43.99
Byte128Vector.UMINMaskedLanes   ops/ms  621.09  718.27  26122.62        42.68   42.06
Byte64Vector.UMAXLanes          ops/ms  296.33  34.44   14357.74        15.95   48.45
Byte64Vector.UMAXMaskedLanes    ops/ms  376.54  44.01   14269.24        21.41   37.90
Byte64Vector.UMINLanes          ops/ms  373.45  426.51  15425.36        66.20   41.31
Byte64Vector.UMINMaskedLanes    ops/ms  353.32  346.87  14201.37        13.79   40.19
Int128Vector.UMAXLanes          ops/ms  174.79  192.51  9906.07         286.93  56.67
Int128Vector.UMAXMaskedLanes    ops/ms  157.23  206.68  10246.77        11.44   65.17
Int64Vector.UMAXLanes           ops/ms  95.30   126.49  4719.30         98.57   49.52
Int64Vector.UMAXMaskedLanes     ops/ms  88.19   87.44   4693.18         19.76   53.22
Long128Vector.UMAXLanes         ops/ms  80.62   97.82   5064.01         35.52   62.82
Long128Vector.UMAXMaskedLanes   ops/ms  78.15   102.91  5028.24         8.74    64.34
Long64Vector.UMAXLanes          ops/ms  47.56   62.01   46.76           52.28   0.98
Long64Vector.UMAXMaskedLanes    ops/ms  45.44   46.76   45.79           42.91   1.01
Short128Vector.UMAXLanes        ops/ms  316.65  410.30  14814.82        23.65   46.79
Short128Vector.UMAXMaskedLanes  ops/ms  308.90  351.78  15155.26        31.03   49.06
Short64Vector.UMAXLanes         ops/ms  190.38  245.09  8022.46         14.30   42.14
Short64Vector.UMAXMaskedLanes   ops/ms  195.54  36.15   7930.28         11.88   40.56

On an NVIDIA Grace machine with 128-bit NEON:

Benchmark                       Unit    Before  Error   After           Error   Uplift
Byte128Vector.UMAXLanes         ops/ms  414.69  42.52   25257.61        25.91   60.91
Byte128Vector.UMAXMaskedLanes   ops/ms  552.00  56.61   23063.14        304.45  41.78
Byte128Vector.UMINLanes         ops/ms  634.98  849.04  28444.37        180.80  44.80
Byte128Vector.UMINMaskedLanes   ops/ms  612.88  735.18  26127.07        27.99   42.63
Byte64Vector.UMAXLanes          ops/ms  291.53  32.19   13893.62        28.09   47.66
Byte64Vector.UMAXMaskedLanes    ops/ms  363.34  48.17   13290.59        12.53   36.58
Byte64Vector.UMINLanes          ops/ms  368.70  433.60  15416.90        15.80   41.81
Byte64Vector.UMINMaskedLanes    ops/ms  350.46  371.05  14524.29        121.63  41.44
Int128Vector.UMAXLanes          ops/ms  177.67  201.38  10182.82        20.21   57.31
Int128Vector.UMAXMaskedLanes    ops/ms  155.25  187.88  9194.13         393.35  59.22
Int64Vector.UMAXLanes           ops/ms  93.93   115.02  5106.79         4.54    54.37
Int64Vector.UMAXMaskedLanes     ops/ms  87.01   88.50   4405.87         8.06    50.63
Long128Vector.UMAXLanes         ops/ms  80.32   98.50   3229.80         40.53   40.21
Long128Vector.UMAXMaskedLanes   ops/ms  77.65   103.25  3161.50         4.45    40.72
Long64Vector.UMAXLanes          ops/ms  47.72   65.38   46.41           50.38   0.97
Long64Vector.UMAXMaskedLanes    ops/ms  45.26   47.46   45.13           47.23   1.00
Short128Vector.UMAXLanes        ops/ms  316.09  429.34  14748.07        14.78   46.66
Short128Vector.UMAXMaskedLanes  ops/ms  307.70  342.54  14359.11        44.99   46.67
Short64Vector.UMAXLanes         ops/ms  187.67  253.01  8180.63         178.65  43.59
Short64Vector.UMAXMaskedLanes   ops/ms  191.10  33.51   7949.19         108.65  41.60
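
The Uplift column is not defined in the tables, but the numbers are consistent with After divided by Before (throughput ratio, higher is better). A minimal sketch of that arithmetic, assuming this reading is correct; the class and method names are hypothetical:

```java
// Recomputes the Uplift column under the assumption that it is After / Before.
final class UpliftSketch {

    // Ratio of throughput after the patch to throughput before it.
    static double uplift(double before, double after) {
        return after / before;
    }

    public static void main(String[] args) {
        // Byte128Vector.UMAXLanes (SVE): table reports 61.29
        System.out.println(uplift(411.60, 25226.51));
        // Long64Vector.UMAXLanes (SVE): table reports 0.98
        System.out.println(uplift(47.56, 46.76));
    }
}
```

Under this reading, a value close to 1.00 (as for the Long64Vector rows) means throughput is effectively unchanged.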

Progress

  • Change must be properly reviewed (1 review required, with at least 1 Reviewer)
  • Change must not contain extraneous whitespace
  • Commit message must refer to an issue

Issue

  • JDK-8372980: [VectorAPI] AArch64: Add intrinsic support for unsigned min/max reduction operations (Enhancement - P4)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28693/head:pull/28693
$ git checkout pull/28693

Update a local copy of the PR:
$ git checkout pull/28693
$ git pull https://git.openjdk.org/jdk.git pull/28693/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 28693

View PR using the GUI difftool:
$ git pr show -t 28693

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28693.diff

Using Webrev

Link to Webrev Comment

…tions

The original implementation of UMIN/UMAX reductions in JDK-8346174
used incorrect identity values in the Java implementation and test code.

Problem:
--------
UMIN was using MAX_OR_INF (signed maximum value) as the identity:
  - Byte.MAX_VALUE (127) instead of max unsigned byte (255)
  - Short.MAX_VALUE (32767) instead of max unsigned short (65535)
  - Integer.MAX_VALUE instead of max unsigned int (-1)
  - Long.MAX_VALUE instead of max unsigned long (-1)

UMAX was using MIN_OR_INF (signed minimum value) as the identity:
  - Byte.MIN_VALUE (-128) instead of 0
  - Short.MIN_VALUE (-32768) instead of 0
  - Integer.MIN_VALUE instead of 0
  - Long.MIN_VALUE instead of 0

This caused incorrect results. For example:
  UMAX([42,42,...,42]) returned 128 instead of 42

Solution:
---------
Use correct unsigned identity values:
  - UMIN: ($type$)-1 (maximum unsigned value)
  - UMAX: ($type$)0 (minimum unsigned value)
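
To illustrate why the identity matters, here is a small scalar sketch for the byte case. It is not the template code from X-Vector.java.template; the class and helper names are hypothetical, and the reduction is modeled as a plain loop that starts from the identity.

```java
// Shows how a signed identity corrupts an unsigned byte reduction
// (illustrative only; names are hypothetical).
final class UnsignedIdentityDemo {

    // Unsigned max of two bytes, comparing them as values in 0..255.
    static byte umax(byte a, byte b) {
        return Byte.toUnsignedInt(a) >= Byte.toUnsignedInt(b) ? a : b;
    }

    // Unsigned min of two bytes, comparing them as values in 0..255.
    static byte umin(byte a, byte b) {
        return Byte.toUnsignedInt(a) <= Byte.toUnsignedInt(b) ? a : b;
    }

    // Folds UMAX over the lanes, starting from the given identity.
    static byte reduceUmax(byte identity, byte[] lanes) {
        byte acc = identity;
        for (byte lane : lanes) {
            acc = umax(acc, lane);
        }
        return acc;
    }

    // Folds UMIN over the lanes, starting from the given identity.
    static byte reduceUmin(byte identity, byte[] lanes) {
        byte acc = identity;
        for (byte lane : lanes) {
            acc = umin(acc, lane);
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] all42 = {42, 42, 42, 42};
        // Byte.MIN_VALUE is 0x80, i.e. 128 when read as unsigned, so it beats 42.
        System.out.println(Byte.toUnsignedInt(reduceUmax(Byte.MIN_VALUE, all42))); // 128
        System.out.println(Byte.toUnsignedInt(reduceUmax((byte) 0, all42)));       // 42

        byte[] all200 = {(byte) 200, (byte) 200, (byte) 200, (byte) 200};
        // Byte.MAX_VALUE is 127, which is smaller than 200 under unsigned ordering.
        System.out.println(Byte.toUnsignedInt(reduceUmin(Byte.MAX_VALUE, all200))); // 127
        System.out.println(Byte.toUnsignedInt(reduceUmin((byte) -1, all200)));      // 200
    }
}
```

The first line reproduces the 128-instead-of-42 result described above; the UMIN case fails the same way whenever every lane exceeds 127 as an unsigned value.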

Changes:
--------
- X-Vector.java.template: Fixed identity values in reductionOperations
- gen-template.sh: Fixed identity values for test code generation
- templates/Unit-header.template: Updated copyright year to 2025
- Regenerated all Vector classes and test files

Testing:
--------
All types (byte/short/int/long) now return correct results in both
interpreter mode (-Xint) and compiled mode.

bridgekeeper bot commented Dec 8, 2025

👋 Welcome back erfang! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.


openjdk bot commented Dec 8, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

openjdk bot added the hotspot-compiler (hotspot-compiler-dev@openjdk.org) and core-libs (core-libs-dev@openjdk.org) labels Dec 8, 2025

openjdk bot commented Dec 8, 2025

@erifan The following labels will be automatically applied to this pull request:

  • core-libs
  • hotspot-compiler

When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

openjdk bot added the rfr label (pull request is ready for review) Dec 8, 2025

mlbridge bot commented Dec 8, 2025

Webrevs
