8371711: AArch64: SVE intrinsics for Arrays.sort methods (int, float) #28675
👋 Welcome back bkilambi! A progress list of the required criteria for merging this PR into

❗ This change is not yet ready to be integrated.

@Bhavana-Kilambi The following labels will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing lists. If you would like to change these labels, use the /label pull request command.

/contributor add Yanqin Wei yanqin.wei@arm.com

@Bhavana-Kilambi
Separated the libsimdsort implementation for aarch64 and x86 in two different folders under src/java.base/linux/native/libsimdsort which might help in better future maintenance of AArch64 and x86 implementations.

New layout -
src/java.base/linux/native/libsimdsort/aarch64/…
src/java.base/linux/native/libsimdsort/x86/…

Moved the following files into the libsimdsort/x86 folder -
src/java.base/linux/native/libsimdsort/x86/avx2-32bit-qsort.hpp
src/java.base/linux/native/libsimdsort/x86/avx2-emu-funcs.hpp
src/java.base/linux/native/libsimdsort/x86/avx2-linux-qsort.cpp
src/java.base/linux/native/libsimdsort/x86/avx512-32bit-qsort.hpp
src/java.base/linux/native/libsimdsort/x86/avx512-64bit-qsort.hpp
src/java.base/linux/native/libsimdsort/x86/avx512-linux-qsort.cpp
src/java.base/linux/native/libsimdsort/x86/simdsort-support.hpp
src/java.base/linux/native/libsimdsort/x86/xss-common-includes.h
src/java.base/linux/native/libsimdsort/x86/xss-common-qsort.h
src/java.base/linux/native/libsimdsort/x86/xss-network-qsort.hpp
src/java.base/linux/native/libsimdsort/x86/xss-optimal-networks.hpp
src/java.base/linux/native/libsimdsort/x86/xss-pivot-selection.hpp

Copied the following files from libsimdsort/x86 to libsimdsort/aarch64 folder -
x86/xss-pivot-selection.hpp -> aarch64/pivot-selection.hpp
x86/simdsort-support.hpp -> aarch64/simdsort-support.hpp
x86/xss-common-qsort.h -> aarch64/sve-common-qsort.hpp
x86/avx2-linux-qsort.cpp -> aarch64/sve-linux-qsort.cpp
x86/avx2-32bit-qsort.hpp -> aarch64/sve-qsort.hpp
This patch adds an SVE implementation of primitive array sorting (Arrays.sort()) on AArch64 systems that support SVE. On non-SVE machines, we fall back to the existing Java implementation.

For smaller arrays (length <= 64), we use insertion sort; for larger arrays we use an SVE-vectorized quicksort partitioner followed by an odd-even transposition cleanup pass.

The SVE path is enabled by default for the int type. For the float type, it is available through the experimental flag:

-XX:+UnlockExperimentalVMOptions -XX:+UseSVELibSimdSortForFP

Without this flag, the default Java implementation is executed for floats (the flag is disabled by default).

Float is gated due to observed regressions on some small/medium sizes. On larger arrays, the SVE float path shows up to 1.47x speedup on Neoverse V2 and 2.12x on Neoverse V1.

The following are the performance numbers for the ArraysSort JMH benchmark:

Case A: Ratio between the scores of the master branch and this patch with the UseSVELibSimdSortForFP flag disabled (which is the default).
Case B: Ratio between the scores of the master branch and this patch with the UseSVELibSimdSortForFP flag enabled (the int numbers will be the same, but this now enables SVE vectorized sorting for floats).

We want the ratios to be >= 1, that is, on par with or better than the default Java implementation (master branch).

On Neoverse V1:

Benchmark                      (size)   Mode  Cnt     A     B
ArraysSort.floatParallelSort       10   avgt    3  0.98  0.98
ArraysSort.floatParallelSort       25   avgt    3  1.01  0.83
ArraysSort.floatParallelSort       50   avgt    3  0.99  0.55
ArraysSort.floatParallelSort       75   avgt    3  0.99  0.66
ArraysSort.floatParallelSort      100   avgt    3  0.98  0.66
ArraysSort.floatParallelSort     1000   avgt    3  1.00  0.84
ArraysSort.floatParallelSort    10000   avgt    3  1.03  1.52
ArraysSort.floatParallelSort   100000   avgt    3  1.03  1.46
ArraysSort.floatParallelSort  1000000   avgt    3  0.98  1.81
ArraysSort.floatSort               10   avgt    3  1.00  0.98
ArraysSort.floatSort               25   avgt    3  1.00  0.81
ArraysSort.floatSort               50   avgt    3  0.99  0.56
ArraysSort.floatSort               75   avgt    3  0.99  0.65
ArraysSort.floatSort              100   avgt    3  0.98  0.70
ArraysSort.floatSort             1000   avgt    3  0.99  0.84
ArraysSort.floatSort            10000   avgt    3  0.99  1.72
ArraysSort.floatSort           100000   avgt    3  1.00  1.94
ArraysSort.floatSort          1000000   avgt    3  1.00  2.13
ArraysSort.intParallelSort         10   avgt    3  1.08  1.08
ArraysSort.intParallelSort         25   avgt    3  1.04  1.05
ArraysSort.intParallelSort         50   avgt    3  1.29  1.30
ArraysSort.intParallelSort         75   avgt    3  1.16  1.16
ArraysSort.intParallelSort        100   avgt    3  1.07  1.07
ArraysSort.intParallelSort       1000   avgt    3  1.13  1.13
ArraysSort.intParallelSort      10000   avgt    3  1.49  1.38
ArraysSort.intParallelSort     100000   avgt    3  1.64  1.62
ArraysSort.intParallelSort    1000000   avgt    3  2.26  2.27
ArraysSort.intSort                 10   avgt    3  1.08  1.08
ArraysSort.intSort                 25   avgt    3  1.02  1.02
ArraysSort.intSort                 50   avgt    3  1.25  1.25
ArraysSort.intSort                 75   avgt    3  1.16  1.20
ArraysSort.intSort                100   avgt    3  1.07  1.07
ArraysSort.intSort               1000   avgt    3  1.12  1.13
ArraysSort.intSort              10000   avgt    3  1.94  1.95
ArraysSort.intSort             100000   avgt    3  1.86  1.86
ArraysSort.intSort            1000000   avgt    3  2.09  2.09

On Neoverse V2:

Benchmark                      (size)   Mode  Cnt     A     B
ArraysSort.floatParallelSort       10   avgt    3  1.02  1.02
ArraysSort.floatParallelSort       25   avgt    3  0.97  0.71
ArraysSort.floatParallelSort       50   avgt    3  0.94  0.65
ArraysSort.floatParallelSort       75   avgt    3  0.96  0.82
ArraysSort.floatParallelSort      100   avgt    3  0.95  0.84
ArraysSort.floatParallelSort     1000   avgt    3  1.01  0.94
ArraysSort.floatParallelSort    10000   avgt    3  1.01  1.25
ArraysSort.floatParallelSort   100000   avgt    3  1.01  1.09
ArraysSort.floatParallelSort  1000000   avgt    3  1.00  1.10
ArraysSort.floatSort               10   avgt    3  1.02  1.00
ArraysSort.floatSort               25   avgt    3  0.99  0.76
ArraysSort.floatSort               50   avgt    3  0.97  0.66
ArraysSort.floatSort               75   avgt    3  1.01  0.83
ArraysSort.floatSort              100   avgt    3  1.00  0.85
ArraysSort.floatSort             1000   avgt    3  0.99  0.93
ArraysSort.floatSort            10000   avgt    3  1.00  1.28
ArraysSort.floatSort           100000   avgt    3  1.00  1.37
ArraysSort.floatSort          1000000   avgt    3  1.00  1.48
ArraysSort.intParallelSort         10   avgt    3  1.05  1.05
ArraysSort.intParallelSort         25   avgt    3  0.99  0.84
ArraysSort.intParallelSort         50   avgt    3  1.03  1.14
ArraysSort.intParallelSort         75   avgt    3  0.91  0.99
ArraysSort.intParallelSort        100   avgt    3  0.98  0.96
ArraysSort.intParallelSort       1000   avgt    3  1.32  1.30
ArraysSort.intParallelSort      10000   avgt    3  1.40  1.40
ArraysSort.intParallelSort     100000   avgt    3  1.00  1.04
ArraysSort.intParallelSort    1000000   avgt    3  1.15  1.14
ArraysSort.intSort                 10   avgt    3  1.05  1.05
ArraysSort.intSort                 25   avgt    3  1.03  1.03
ArraysSort.intSort                 50   avgt    3  1.08  1.14
ArraysSort.intSort                 75   avgt    3  0.88  0.98
ArraysSort.intSort                100   avgt    3  1.01  0.99
ArraysSort.intSort               1000   avgt    3  1.3   1.32
ArraysSort.intSort              10000   avgt    3  1.43  1.43
ArraysSort.intSort             100000   avgt    3  1.30  1.30
ArraysSort.intSort            1000000   avgt    3  1.37  1.37
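
For readers unfamiliar with the algorithm mix named in the description, the following is a minimal scalar sketch of the size-based dispatch: insertion sort at or below a threshold of 64 elements, and quicksort-style partitioning above it. It is not the code from this patch. The names `kSmallArrayThreshold`, `insertion_sort`, `median_of_three`, `hybrid_sort` and `odd_even_transposition_sort` are hypothetical, the partition step uses `std::partition` as a stand-in for the SVE-vectorized partitioner, and the pivot choice is an assumption. The description does not spell out how the odd-even transposition pass is wired into the quicksort, so it is shown only as a standalone helper. The sketch also assumes inputs without NaN and ignores the -0.0/0.0 ordering that Java applies to floats.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Threshold from the PR description (length <= 64); the name is hypothetical.
constexpr std::size_t kSmallArrayThreshold = 64;

// Plain insertion sort on the half-open range [lo, hi).
template <typename T>
void insertion_sort(T* a, std::size_t lo, std::size_t hi) {
    for (std::size_t i = lo + 1; i < hi; ++i) {
        T key = a[i];
        std::size_t j = i;
        while (j > lo && key < a[j - 1]) {
            a[j] = a[j - 1];
            --j;
        }
        a[j] = key;
    }
}

// Odd-even transposition sort on [lo, hi): alternating passes compare-and-swap
// the pairs starting at even offsets, then the pairs starting at odd offsets.
// n passes are enough to sort n elements.
template <typename T>
void odd_even_transposition_sort(T* a, std::size_t lo, std::size_t hi) {
    const std::size_t n = hi - lo;
    for (std::size_t pass = 0; pass < n; ++pass) {
        for (std::size_t i = lo + (pass & 1); i + 1 < hi; i += 2) {
            if (a[i + 1] < a[i]) std::swap(a[i], a[i + 1]);
        }
    }
}

// Median of three values (assumed pivot rule, not taken from the patch).
template <typename T>
T median_of_three(T x, T y, T z) {
    if (y < x) std::swap(x, y);
    if (z < y) std::swap(y, z);
    if (y < x) std::swap(x, y);
    return y;
}

// Size-based dispatch: partition while the range is large, then finish the
// remaining small range with insertion sort.
template <typename T>
void hybrid_sort(T* a, std::size_t lo, std::size_t hi) {
    while (hi - lo > kSmallArrayThreshold) {
        const T pivot = median_of_three(a[lo], a[lo + (hi - lo) / 2], a[hi - 1]);
        // Scalar stand-in for the vectorized partitioner: three-way split into
        // [lo, lt) < pivot, [lt, gt) == pivot, [gt, hi) > pivot.
        T* lt = std::partition(a + lo, a + hi,
                               [&](const T& x) { return x < pivot; });
        T* gt = std::partition(lt, a + hi,
                               [&](const T& x) { return !(pivot < x); });
        hybrid_sort(a, lo, static_cast<std::size_t>(lt - a));  // left part
        lo = static_cast<std::size_t>(gt - a);                 // loop on right part
    }
    insertion_sort(a, lo, hi);
}
```

Calling `hybrid_sort(v.data(), 0, v.size())` on a `std::vector<int>` or `std::vector<float>` sorts it in place. Because the pivot is always an element of the range, the middle partition is never empty, so each iteration strictly shrinks the range still to be sorted.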
Force-pushed from 0c3ab11 to 513b1f1.
ifeq ($(call isTargetOs, linux)+$(call isTargetCpu, aarch64)+$(INCLUDE_COMPILER2)+$(filter $(TOOLCHAIN_TYPE), gcc), true+true+true+gcc)
  $(eval $(call SetupJdkLibrary, BUILD_LIBSIMD_SORT, \
      NAME := simdsort, \
      TOOLCHAIN := TOOLCHAIN_LINK_CXX, \
      OPTIMIZATION := HIGH, \
      SRC := $(SIMDSORT_BASE_DIR)/aarch64, \
      CFLAGS := $(CFLAGS_JDKLIB) -march=armv8.2-a+sve, \
      CXXFLAGS := $(CXXFLAGS_JDKLIB) -march=armv8.2-a+sve -std=c++17, \
      LDFLAGS := $(LDFLAGS_JDKLIB) \
          $(call SET_SHARED_LIBRARY_ORIGIN), \
      LIBS := $(LIBCXX), \
      DISABLED_WARNINGS_gcc := unused-variable, \
      LIBS_linux := -lc -lm -ldl, \
  ))

  TARGETS += $(BUILD_LIBSIMD_SORT)
endif
This whole block should be combined with the existing block above, something like this:
ifeq ($(call isTargetOs, linux)+$(call isTargetCpu, x86_64 aarch64)+$(INCLUDE_COMPILER2)+$(filter $(TOOLCHAIN_TYPE), gcc), true+true+true+gcc)
  ##############################################################################
  ## Build libsimdsort
  ##############################################################################

  $(eval $(call SetupJdkLibrary, BUILD_LIBSIMD_SORT, \
      NAME := simdsort, \
      LINK_TYPE := C++, \
      OPTIMIZATION := HIGH, \
      INCLUDES := $(OPENJDK_TARGET_CPU_ARCH), \
      CXXFLAGS := -std=c++17, \
      CXXFLAGS_linux_aarch64 := -march=armv8.2-a+sve, \
      DISABLED_WARNINGS_gcc := unused-variable, \
      LIBS_linux := $(LIBM), \
  ))

  TARGETS += $(BUILD_LIBSIMD_SORT)
endif
Unfortunately we don't currently support CXXFLAGS_<os>_<cpu>, just CFLAGS_<os>_<cpu>, but this can be fixed, and I think it should be, since we now have a need for it.
diff --git a/make/common/native/Flags.gmk b/make/common/native/Flags.gmk
index efb4c08e74c..2f3680af7c7 100644
--- a/make/common/native/Flags.gmk
+++ b/make/common/native/Flags.gmk
@@ -106,10 +106,12 @@ define SetupCompilerFlags
     $1_EXTRA_CFLAGS += -DSTATIC_BUILD=1
   endif
 
-  # Pickup extra OPENJDK_TARGET_OS_TYPE, OPENJDK_TARGET_OS and/or TOOLCHAIN_TYPE
-  # dependent variables for CXXFLAGS.
+  # Pickup extra OPENJDK_TARGET_OS_TYPE, OPENJDK_TARGET_OS, TOOLCHAIN_TYPE and
+  # OPENJDK_TARGET_OS plus OPENJDK_TARGET_CPU pair dependent variables for
+  # CXXFLAGS.
   $1_EXTRA_CXXFLAGS := $$($1_CXXFLAGS_$(OPENJDK_TARGET_OS_TYPE)) $$($1_CXXFLAGS_$(OPENJDK_TARGET_OS)) \
-      $$($1_CXXFLAGS_$(TOOLCHAIN_TYPE))
+      $$($1_CXXFLAGS_$(TOOLCHAIN_TYPE)) \
+      $$($1_CXXFLAGS_$(OPENJDK_TARGET_OS)_$(OPENJDK_TARGET_CPU))
 
   ifneq ($(DEBUG_LEVEL), release)
     # Pickup extra debug dependent variables for CXXFLAGS
The above at least compiles for me.
Contributors
Yanqin Wei <yanqin.wei@arm.com>

Reviewing

Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/28675/head:pull/28675
$ git checkout pull/28675
Update a local copy of the PR:
$ git checkout pull/28675
$ git pull https://git.openjdk.org/jdk.git pull/28675/head

Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 28675
View PR using the GUI difftool:
$ git pr show -t 28675

Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/28675.diff