diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/_index.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/_index.md index cc6f9c4d25..3c0f1d4141 100644 --- a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/_index.md @@ -11,7 +11,7 @@ learning_objectives: prerequisites: - Basic C++ understanding. - - Access to an Arm-based Linux machine. + - Access to an Arm-based machine. author: Kieran Hejmadi @@ -25,12 +25,17 @@ tools_software_languages: - Runbook operatingsystems: - Linux + - Windows further_reading: - resource: title: G++ profile-guided optimization documentation link: https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html type: documentation + - resource: + title: MSVC profile-guided optimization documentation + link: https://learn.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=msvc-170 + type: documentation - resource: title: Google Benchmark Library link: https://github.com/google/benchmark diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-1.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-1.md index 0519661ae3..ecca44a24c 100644 --- a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-1.md +++ b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-1.md @@ -8,10 +8,10 @@ layout: learningpathall ### What is Profile-Guided Optimization (PGO) and how does it work? -Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. 
In GCC/G++, PGO involves a two-step process: +Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. PGO typically involves a two-step process: -- First, compile the program with the `-fprofile-generate` flag to produce an instrumented binary that collects profiling data during execution; -- Second, recompile the program with the `-fprofile-use` flag, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach identifies frequently executed paths — known as “hot” paths — and optimizes them more aggressively, while potentially reducing emphasis on less critical code paths. +- First, compile the program to produce an instrumented binary that collects profiling data during execution; +- Second, recompile the program with an optimization profile, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach identifies frequently executed paths — known as “hot” paths — and optimizes them more aggressively, while potentially reducing emphasis on less critical code paths. ### When should I use Profile-Guided Optimization? diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-3.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-3.md index 776da293ab..4994db8aed 100644 --- a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-3.md +++ b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-3.md @@ -12,9 +12,9 @@ In this section, you'll learn how to use Google Benchmark and Profile-Guided Opt Integer division is ideal for benchmarking because it's significantly more expensive than operations like addition, subtraction, or multiplication. 
On most CPU architectures, including Arm, division instructions have higher latency and lower throughput compared to other arithmetic operations. By applying Profile-Guided Optimization to code containing division operations, we can potentially achieve significant performance improvements. -## What tools are needed to run a Google Benchmark example? +For this example, you can use an Arm computer (Linux or Windows). -For this example, you can use any Arm Linux computer. For example, an AWS EC2 `c7g.xlarge` instance running Ubuntu 24.04 LTS can be used. +## What tools are needed to run a Google Benchmark example on Linux? Run the following commands to install the prerequisite packages: @@ -23,6 +23,20 @@ sudo apt update sudo apt install gcc g++ make libbenchmark-dev -y ``` +## What tools are needed to run a Google Benchmark example on Windows? + +Download the [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) to install the prerequisite packages. + +Next, install the static version of Google Benchmark for Arm64 via vcpkg. Run the following commands in PowerShell as Administrator: + +```console +cd C:\git +git clone https://github.com/microsoft/vcpkg.git +cd vcpkg +.\bootstrap-vcpkg.bat +.\vcpkg install benchmark:arm64-windows-static +``` + ## Division example Use an editor to copy and paste the C++ source code below into a file named `div_bench.cpp`. @@ -49,20 +63,32 @@ BENCHMARK(baseDiv)->Arg(1500)->Unit(benchmark::kMicrosecond); // value of 1500 i BENCHMARK_MAIN(); ``` -To compile and run the microbenchmark on this function, you need to link with the `pthreads` and `benchmark` libraries. 
+To compile and run the microbenchmark on this function, you need to link with the correct libraries: -Compile with the command: +**(Linux)** Compile with the command: ```bash g++ -O3 -std=c++17 div_bench.cpp -lbenchmark -lpthread -o div_bench.base ``` -Run the program: +**(Windows)** Compile with the command: + +```console +cl /D BENCHMARK_STATIC_DEFINE div_bench.cpp /link /LIBPATH:"$VCPKG\lib" benchmark.lib benchmark_main.lib shlwapi.lib +``` + +**(Linux)** Run the program: ```bash ./div_bench.base ``` +**(Windows)** Run the program: + +```console +.\div_bench.exe +``` + ### Example output ```output @@ -80,25 +106,3 @@ Benchmark Time CPU Iterations ------------------------------------------------------- baseDiv/1500 7.90 us 7.90 us 88512 ``` - -### Inspect assembly - -To inspect what assembly instructions are being executed most frequently, you can use the `perf` command. This is useful for identifying bottlenecks and understanding the performance characteristics of your code. - -Install Perf using the [install guide](https://learn.arm.com/install-guides/perf/) before proceeding. - -{{% notice Please Note %}} -You may need to set the `perf_event_paranoid` value to -1 with the `sudo sysctl kernel.perf_event_paranoid=-1` command to run the commands below. -{{% /notice %}} - -Run the following commands to record `perf` data and create a report in the terminal: - -```bash -sudo perf record -o perf-division-base ./div_bench.base -sudo perf report --input=perf-division-base -``` - -As the `perf report` graphic below shows, the program spends a significant amount of time in the short loops with no loop unrolling. There is also an expensive `sdiv` operation, and most of the execution time is spent storing the result of the operation. 
- -![before-pgo](./before-pgo.gif) - diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-4.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-4.md index 7c43b8e80e..55f33a3aea 100644 --- a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-4.md +++ b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-4.md @@ -1,5 +1,5 @@ --- -title: Using Profile Guided Optimization +title: Using Profile Guided Optimization (Linux) weight: 5 ### FIXED, DO NOT MODIFY @@ -22,14 +22,35 @@ Next, run the instrumented binary to generate the profile data: This execution creates profile data files (typically with a `.gcda` extension) in the same directory. +### Inspect assembly + +To inspect what assembly instructions are being executed most frequently, you can use the `perf` command. This is useful for identifying bottlenecks and understanding the performance characteristics of your code. + +Install Perf using the [install guide](https://learn.arm.com/install-guides/perf/) before proceeding. + +{{% notice Please Note %}} +You may need to set the `perf_event_paranoid` value to -1 with the `sudo sysctl kernel.perf_event_paranoid=-1` command to run the commands below. +{{% /notice %}} + +Run the following commands to record `perf` data and create a report in the terminal: + +```bash +sudo perf record -o perf-division-base ./div_bench.base +sudo perf report --input=perf-division-base +``` + +As the `perf report` graphic below shows, the program spends a significant amount of time in the short loops with no loop unrolling. There is also an expensive `sdiv` operation, and most of the execution time is spent storing the result of the operation. 
+ +![before-pgo](./before-pgo.gif) + +### Compile and run the optimized binary + Now recompile the program using the `-fprofile-use` flag to apply optimizations based on the collected data: ```bash g++ -O3 -std=c++17 -fprofile-use div_bench.cpp -lbenchmark -lpthread -o div_bench.opt ``` -### Run the optimized binary - Now run the optimized binary: ```bash diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-5.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-5.md index 2074a1f18b..e42c0d0dd9 100644 --- a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-5.md +++ b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-5.md @@ -1,121 +1,89 @@ --- -title: Incorporating PGO into a GitHub Actions workflow +title: Using Profile Guided Optimization (Windows) weight: 6 ### FIXED, DO NOT MODIFY layout: learningpathall --- -### Build locally with make - -PGO can be integrated into a `Makefile` and continuous integration (CI) systems using simple command-line instructions, as shown in the sample `Makefile` below. - -{{% notice Caution %}} -PGO adds additional build steps which can increase compile time - especially for large code bases. As such, PGO is not suitable for all sections of code. You should PGO only for sections of code which are heavily influenced by run-time behavior and are performance critical. Therefore, PGO might not be ideal for early-stage development or for applications with highly variable or unpredictable usage patterns. 
-{{% /notice %}} - -Use a text editor to create a file named `Makefile` containing the following content: - -```makefile -# Simple Makefile for building and benchmarking div_bench with and without PGO - -# Compiler and flags -CXX := g++ -CXXFLAGS := -O3 -std=c++17 -LDLIBS := -lbenchmark -lpthread - -# Default target: build both binaries -.PHONY: all clean clean-gcda clean-perf run -all: div_bench.base div_bench.opt - -# Build the baseline binary (no PGO) -div_bench.base: div_bench.cpp - $(CXX) $(CXXFLAGS) $< $(LDLIBS) -o $@ - -# Build the PGO-optimized binary: -# Note: This target depends on the source file and cleans previous profile data first. -# It runs the instrumented binary to generate new profile data before the final compilation. -div_bench.opt: div_bench.cpp - $(MAKE) clean-gcda # Ensure no old profile data interferes - $(CXX) $(CXXFLAGS) -fprofile-generate $< $(LDLIBS) -o $@ - @echo "Running instrumented binary to gather profile data..." - ./div_bench.opt # Generate .gcda file - $(CXX) $(CXXFLAGS) -fprofile-use $< $(LDLIBS) -o $@ # Compile using the generated profile - $(MAKE) clean-perf # Optional: Clean perf data if generated elsewhere - -# Remove profile data files -clean-gcda: - rm -f ./*.gcda - -# Remove perf data files (if applicable) -clean-perf: - rm -f perf-division-* - -# Remove all generated files including binaries and profile data -clean: clean-gcda clean-perf - rm -f div_bench.base div_bench.opt - -# Run both benchmarks with informative headers -run: all # Ensure binaries are built before running - @echo "==================== Without Profile-Guided Optimization ====================" - ./div_bench.base - @echo "==================== With Profile-Guided Optimization ====================" - ./div_bench.opt +### Build with PGO + +To generate a binary optimized using runtime profile data, first build an instrumented binary that records usage data. 
Before building, open the Arm dev shell so that the compiler is in your PATH: + +```console +& "C:\Program Files\Microsoft Visual Studio\18\Community\Common7\Tools\Launch-VsDevShell.ps1" -Arch arm64 +``` + +**Note:** You may need to change the version number in the Visual Studio path to match the Visual Studio version you have installed. + +Next, set a PowerShell variable that points to the installed packages directory: + +```console +$VCPKG="C:\git\vcpkg\installed\arm64-windows-static" ``` -You can run the following commands in your terminal: +Next, run the following command, which includes the `/GENPROFILE` flag, to build the instrumented binary: -* `make all` (or simply `make`): Compiles both `div_bench.base` (without PGO) and `div_bench.opt` (with PGO). This includes the steps of generating profile data for the optimized version. -* `make run`: Builds both binaries (if they don't exist) and then runs them, displaying the benchmark results for comparison. -* `make clean`: Removes the compiled binaries (`div_bench.base`, `div_bench.opt`) and any generated profile data files (`*.gcda`). +```console +cl /O2 /GL /D BENCHMARK_STATIC_DEFINE /I "$VCPKG\include" /Fe:div_bench.exe div_bench.cpp /link /LTCG /GENPROFILE /PGD:div_bench.pgd /LIBPATH:"$VCPKG\lib" benchmark.lib benchmark_main.lib shlwapi.lib +``` + +The compiler options used in this command are: + +* **/O2**: Creates [fast code](https://learn.microsoft.com/en-us/cpp/build/reference/o1-o2-minimize-size-maximize-speed?view=msvc-170). +* **/GL**: Enables [whole program optimization](https://learn.microsoft.com/en-us/cpp/build/reference/gl-whole-program-optimization?view=msvc-170). +* **/D**: Defines the `BENCHMARK_STATIC_DEFINE` [preprocessor symbol](https://learn.microsoft.com/en-us/cpp/build/reference/d-preprocessor-definitions?view=msvc-170), used when linking Google Benchmark as a static library. 
+* **/I**: Adds the arm64 includes to the [list of include directories](https://learn.microsoft.com/en-us/cpp/build/reference/i-additional-include-directories?view=msvc-170). +* **/Fe**: Specifies a name for the [executable file output](https://learn.microsoft.com/en-us/cpp/build/reference/fe-name-exe-file?view=msvc-170). +* **/link**: Specifies [options to pass to linker](https://learn.microsoft.com/en-us/cpp/build/reference/link-pass-options-to-linker?view=msvc-170). -### Build with GitHub Actions +The linker options used in this command are: -Alternatively, you can integrate PGO into your Continuous Integration (CI) workflow using GitHub Actions. The YAML file below provides a basic example that compiles and runs the benchmark on a GitHub-hosted Ubuntu 24.04 Arm-based runner. This setup can be extended with automated tests to check for performance regressions. +* **/LTCG**: Specifies [link time code generation](https://learn.microsoft.com/en-us/cpp/build/reference/ltcg-link-time-code-generation?view=msvc-170). +* **/GENPROFILE**: Specifies [generation of a .pgd file for PGO](https://learn.microsoft.com/en-us/cpp/build/reference/genprofile-fastgenprofile-generate-profiling-instrumented-build?view=msvc-170). +* **/PGD**: Specifies a [database for PGO](https://learn.microsoft.com/en-us/cpp/build/reference/pgd-specify-database-for-profile-guided-optimizations?view=msvc-170). +* **/LIBPATH**: Specifies the [additional library path](https://learn.microsoft.com/en-us/cpp/build/reference/libpath-additional-libpath?view=msvc-170). -```yaml -name: PGO Benchmark +Next, run the instrumented binary to generate the profile data: -on: - push: - branches: [ main ] +```console +.\div_bench.exe +``` -jobs: - build: - runs-on: ubuntu-24.04-arm +This execution creates profile data files (typically with a `.pgc` extension) in the same directory. 
- steps: - - name: Check out source - uses: actions/checkout@v3 +Now recompile the program using the `/USEPROFILE` flag to apply optimizations based on the collected data: - - name: Install dependencies - run: | - sudo apt-get update - sudo apt-get install -y libbenchmark-dev g++ +```console +cl /O2 /GL /D BENCHMARK_STATIC_DEFINE /I "$VCPKG\include" /Fe:div_bench_opt.exe div_bench.cpp /link /LTCG:PGOptimize /USEPROFILE /PGD:div_bench.pgd /LIBPATH:"$VCPKG\lib" benchmark.lib benchmark_main.lib shlwapi.lib +``` - - name: Clean previous profiling data - run: | - rm -rf ./*gcda - rm -f div_bench.base div_bench.opt +In this command, the [USEPROFILE linker option](https://learn.microsoft.com/en-us/cpp/build/reference/useprofile?view=msvc-170) instructs the linker to enable PGO with the profile generated during the previous run of the executable. - - name: Compile base and instrumented binary - run: | - g++ -O3 -std=c++17 div_bench.cpp -lbenchmark -lpthread -o div_bench.base - g++ -O3 -std=c++17 -fprofile-generate div_bench.cpp -lbenchmark -lpthread -o div_bench.opt +### Run the optimized binary - - name: Generate profile data and compile with PGO - run: | - ./div_bench.opt - g++ -O3 -std=c++17 -fprofile-use div_bench.cpp -lbenchmark -lpthread -o div_bench.opt +Now run the optimized binary: - - name: Run benchmarks - run: | - echo "==================== Without Profile-Guided Optimization ====================" - ./div_bench.base - echo "==================== With Profile-Guided Optimization ====================" - ./div_bench.opt - echo "==================== Benchmarking complete ====================" +```console +.\div_bench_opt.exe ``` -To use this workflow, save the YAML content into a file named `pgo_benchmark.yml` (or any other `.yml` name) inside the `.github/workflows/` directory of your GitHub repository. Ensure your `div_bench.cpp` file is present in the repository root. 
When you push changes to the `main` branch, GitHub Actions will automatically detect this workflow file and execute the defined steps on an Arm-based runner, compiling both versions of the benchmark and running them. +The following output shows the performance improvement: + +```output +Running ./div_bench.opt +Run on (4 X 2100 MHz CPU s) +CPU Caches: + L1 Data 64 KiB (x4) + L1 Instruction 64 KiB (x4) + L2 Unified 1024 KiB (x4) + L3 Unified 32768 KiB (x1) +Load Average: 0.10, 0.03, 0.01 +***WARNING*** Library was built as DEBUG. Timings may be affected. +------------------------------------------------------- +Benchmark Time CPU Iterations +------------------------------------------------------- +baseDiv/1500 2.86 us 2.86 us 244429 +``` +As the terminal output above shows, the average execution time is reduced from 7.90 to 2.86 microseconds. This improvement occurs because the profile data informed the compiler that the input divisor was consistently 1500 during the profiled runs, allowing it to apply specific optimizations. diff --git a/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-6.md b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-6.md new file mode 100644 index 0000000000..342741e816 --- /dev/null +++ b/content/learning-paths/servers-and-cloud-computing/cpp-profile-guided-optimisation/how-to-6.md @@ -0,0 +1,121 @@ +--- +title: Incorporating PGO into a GitHub Actions workflow +weight: 7 + +### FIXED, DO NOT MODIFY +layout: learningpathall +--- + +### Build locally with make + +PGO can be integrated into a `Makefile` and continuous integration (CI) systems using simple command-line instructions, as shown in the sample `Makefile` below. + +{{% notice Caution %}} +PGO adds additional build steps which can increase compile time - especially for large code bases. As such, PGO is not suitable for all sections of code. 
You should use PGO only for sections of code that are heavily influenced by run-time behavior and are performance critical. Therefore, PGO might not be ideal for early-stage development or for applications with highly variable or unpredictable usage patterns. +{{% /notice %}} + +Use a text editor to create a file named `Makefile` containing the following content: + +```makefile +# Simple Makefile for building and benchmarking div_bench with and without PGO + +# Compiler and flags +CXX := g++ +CXXFLAGS := -O3 -std=c++17 +LDLIBS := -lbenchmark -lpthread + +# Default target: build both binaries +.PHONY: all clean clean-gcda clean-perf run +all: div_bench.base div_bench.opt + +# Build the baseline binary (no PGO) +div_bench.base: div_bench.cpp + $(CXX) $(CXXFLAGS) $< $(LDLIBS) -o $@ + +# Build the PGO-optimized binary: +# Note: This target depends on the source file and cleans previous profile data first. +# It runs the instrumented binary to generate new profile data before the final compilation. +div_bench.opt: div_bench.cpp + $(MAKE) clean-gcda # Ensure no old profile data interferes + $(CXX) $(CXXFLAGS) -fprofile-generate $< $(LDLIBS) -o $@ + @echo "Running instrumented binary to gather profile data..." 
+ ./div_bench.opt # Generate .gcda file + $(CXX) $(CXXFLAGS) -fprofile-use $< $(LDLIBS) -o $@ # Compile using the generated profile + $(MAKE) clean-perf # Optional: Clean perf data if generated elsewhere + +# Remove profile data files +clean-gcda: + rm -f ./*.gcda + +# Remove perf data files (if applicable) +clean-perf: + rm -f perf-division-* + +# Remove all generated files including binaries and profile data +clean: clean-gcda clean-perf + rm -f div_bench.base div_bench.opt + +# Run both benchmarks with informative headers +run: all # Ensure binaries are built before running + @echo "==================== Without Profile-Guided Optimization ====================" + ./div_bench.base + @echo "==================== With Profile-Guided Optimization ====================" + ./div_bench.opt +``` + +You can run the following commands in your terminal: + +* `make all` (or simply `make`): Compiles both `div_bench.base` (without PGO) and `div_bench.opt` (with PGO). This includes the steps of generating profile data for the optimized version. +* `make run`: Builds both binaries (if they don't exist) and then runs them, displaying the benchmark results for comparison. +* `make clean`: Removes the compiled binaries (`div_bench.base`, `div_bench.opt`) and any generated profile data files (`*.gcda`). + +### Build with GitHub Actions + +Alternatively, you can integrate PGO into your Continuous Integration (CI) workflow using GitHub Actions. The YAML file below provides a basic example that compiles and runs the benchmark on a GitHub-hosted Ubuntu 24.04 Arm-based runner. This setup can be extended with automated tests to check for performance regressions. 
+ +```yaml +name: PGO Benchmark + +on: + push: + branches: [ main ] + +jobs: + build: + runs-on: ubuntu-24.04-arm + + steps: + - name: Check out source + uses: actions/checkout@v3 + + - name: Install dependencies + run: | + sudo apt-get update + sudo apt-get install -y libbenchmark-dev g++ + + - name: Clean previous profiling data + run: | + rm -rf ./*gcda + rm -f div_bench.base div_bench.opt + + - name: Compile base and instrumented binary + run: | + g++ -O3 -std=c++17 div_bench.cpp -lbenchmark -lpthread -o div_bench.base + g++ -O3 -std=c++17 -fprofile-generate div_bench.cpp -lbenchmark -lpthread -o div_bench.opt + + - name: Generate profile data and compile with PGO + run: | + ./div_bench.opt + g++ -O3 -std=c++17 -fprofile-use div_bench.cpp -lbenchmark -lpthread -o div_bench.opt + + - name: Run benchmarks + run: | + echo "==================== Without Profile-Guided Optimization ====================" + ./div_bench.base + echo "==================== With Profile-Guided Optimization ====================" + ./div_bench.opt + echo "==================== Benchmarking complete ====================" +``` + +To use this workflow, save the YAML content into a file named `pgo_benchmark.yml` (or any other `.yml` name) inside the `.github/workflows/` directory of your GitHub repository. Ensure your `div_bench.cpp` file is present in the repository root. When you push changes to the `main` branch, GitHub Actions will automatically detect this workflow file and execute the defined steps on an Arm-based runner, compiling both versions of the benchmark and running them. +