learning_objectives:

prerequisites:
- Basic C++ understanding.
- Access to an Arm-based machine.

author: Kieran Hejmadi

tools_software_languages:
- Runbook
operatingsystems:
- Linux
- Windows

further_reading:
- resource:
title: G++ profile-guided optimization documentation
link: https://gcc.gnu.org/onlinedocs/gcc-13.3.0/gcc/Instrumentation-Options.html
type: documentation
- resource:
title: MSVC profile-guided optimization documentation
link: https://learn.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=msvc-170
type: documentation
- resource:
title: Google Benchmark Library
link: https://github.com/google/benchmark
layout: learningpathall
---

### What is Profile-Guided Optimization (PGO) and how does it work?

Profile-Guided Optimization (PGO) is a compiler optimization technique that enhances program performance by utilizing real-world execution data. PGO typically involves a two-step process:

- First, compile the program to produce an instrumented binary that collects profiling data during execution;
- Second, recompile the program with an optimization profile, allowing the compiler to leverage the collected data to make informed optimization decisions. This approach identifies frequently executed paths — known as “hot” paths — and optimizes them more aggressively, while potentially reducing emphasis on less critical code paths.

### When should I use Profile-Guided Optimization?

In this section, you'll learn how to use Google Benchmark and Profile-Guided Optimization.

Integer division is ideal for benchmarking because it's significantly more expensive than operations like addition, subtraction, or multiplication. On most CPU architectures, including Arm, division instructions have higher latency and lower throughput compared to other arithmetic operations. By applying Profile-Guided Optimization to code containing division operations, you can potentially achieve significant performance improvements.

For this example, you can use an Arm computer (Linux or Windows).

## What tools are needed to run a Google Benchmark example on Linux?

Run the following commands to install the prerequisite packages:

```bash
sudo apt update
sudo apt install gcc g++ make libbenchmark-dev -y
```

## What tools are needed to run a Google Benchmark example on Windows?

Download and install the [Arm GNU Toolchain](https://developer.arm.com/Tools%20and%20Software/GNU%20Toolchain) to obtain the prerequisite build tools.

Next, install the static version of Google Benchmark for Arm64 via vcpkg. Run the following commands in PowerShell as Administrator:

```console
cd C:\git
git clone https://github.com/microsoft/vcpkg.git
cd vcpkg
.\bootstrap-vcpkg.bat
.\vcpkg install benchmark:arm64-windows-static
```

## Division example

Use an editor to copy and paste the C++ source code below into a file named `div_bench.cpp`.
```cpp
BENCHMARK(baseDiv)->Arg(1500)->Unit(benchmark::kMicrosecond); // value of 1500 is illustrative
BENCHMARK_MAIN();
```

To compile and run the microbenchmark on this function, you need to link with the correct libraries:

**(Linux)** Compile with the command:

```bash
g++ -O3 -std=c++17 div_bench.cpp -lbenchmark -lpthread -o div_bench.base
```

**(Windows)** Compile with the command:

```console
cl /D BENCHMARK_STATIC_DEFINE /I "$VCPKG\include" div_bench.cpp /link /LIBPATH:"$VCPKG\lib" benchmark.lib benchmark_main.lib shlwapi.lib
```

**(Linux)** Run the program:

```bash
./div_bench.base
```

**(Windows)** Run the program:

```console
.\div_bench.exe
```

### Example output

```output
Benchmark Time CPU Iterations
-------------------------------------------------------
baseDiv/1500 7.90 us 7.90 us 88512
```

---
title: Using Profile Guided Optimization (Linux)
weight: 5

### FIXED, DO NOT MODIFY
layout: learningpathall
---

Next, run the instrumented binary to generate the profile data:

This execution creates profile data files (typically with a `.gcda` extension) in the same directory.

### Inspect assembly

To inspect which assembly instructions are executed most frequently, you can use the `perf` command. This is useful for identifying bottlenecks and understanding the performance characteristics of your code.

Install Perf using the [install guide](https://learn.arm.com/install-guides/perf/) before proceeding.

{{% notice Please Note %}}
You may need to set the `perf_event_paranoid` value to -1 with the `sudo sysctl kernel.perf_event_paranoid=-1` command to run the commands below.
{{% /notice %}}

Run the following commands to record `perf` data and create a report in the terminal:

```bash
sudo perf record -o perf-division-base ./div_bench.base
sudo perf report --input=perf-division-base
```

As the `perf report` graphic below shows, the program spends a significant amount of time in the short loops with no loop unrolling. There is also an expensive `sdiv` operation, and most of the execution time is spent storing the result of the operation.

![before-pgo](./before-pgo.gif)

### Compile and run the optimized binary

Now recompile the program using the `-fprofile-use` flag to apply optimizations based on the collected data:

```bash
g++ -O3 -std=c++17 -fprofile-use div_bench.cpp -lbenchmark -lpthread -o div_bench.opt
```

### Run the optimized binary

Now run the optimized binary:

```bash
./div_bench.opt
```

---
title: Using Profile Guided Optimization (Windows)
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

### Build with PGO

To generate a binary optimized using runtime profile data, first build an instrumented binary that records usage data. Before building, open the Arm dev shell so that the compiler is in your PATH:

```console
& "C:\Program Files\Microsoft Visual Studio\18\Community\Common7\Tools\Launch-VsDevShell.ps1" -Arch arm64
```

{{% notice Please Note %}}
You may need to change the version number in your Visual Studio path, depending on which Visual Studio version you've installed.
{{% /notice %}}

Next, set an environment variable to refer to the installed packages directory:

```console
$VCPKG="C:\git\vcpkg\installed\arm64-windows-static"
```

Next, run the following command, which includes the `/GENPROFILE` flag, to build the instrumented binary:

```console
cl /O2 /GL /D BENCHMARK_STATIC_DEFINE /I "$VCPKG\include" /Fe:div_bench.exe div_bench.cpp /link /LTCG /GENPROFILE /PGD:div_bench.pgd /LIBPATH:"$VCPKG\lib" benchmark.lib benchmark_main.lib shlwapi.lib
```

The compiler options used in this command are:

* **/O2**: Creates [fast code](https://learn.microsoft.com/en-us/cpp/build/reference/o1-o2-minimize-size-maximize-speed?view=msvc-170)
* **/GL**: Enables [whole program optimization](https://learn.microsoft.com/en-us/cpp/build/reference/gl-whole-program-optimization?view=msvc-170).
* **/D**: Enables the Benchmark [static preprocessor definition](https://learn.microsoft.com/en-us/cpp/build/reference/d-preprocessor-definitions?view=msvc-170).
* **/I**: Adds the arm64 includes to the [list of include directories](https://learn.microsoft.com/en-us/cpp/build/reference/i-additional-include-directories?view=msvc-170).
* **/Fe**: Specifies a name for the [executable file output](https://learn.microsoft.com/en-us/cpp/build/reference/fe-name-exe-file?view=msvc-170).
* **/link**: Specifies [options to pass to linker](https://learn.microsoft.com/en-us/cpp/build/reference/link-pass-options-to-linker?view=msvc-170).

The linker options used in this command are:

* **/LTCG**: Specifies [link time code generation](https://learn.microsoft.com/en-us/cpp/build/reference/ltcg-link-time-code-generation?view=msvc-170).
* **/GENPROFILE**: Specifies [generation of a .pgd file for PGO](https://learn.microsoft.com/en-us/cpp/build/reference/genprofile-fastgenprofile-generate-profiling-instrumented-build?view=msvc-170).
* **/PGD**: Specifies a [database for PGO](https://learn.microsoft.com/en-us/cpp/build/reference/pgd-specify-database-for-profile-guided-optimizations?view=msvc-170).
* **/LIBPATH**: Specifies the [additional library path](https://learn.microsoft.com/en-us/cpp/build/reference/libpath-additional-libpath?view=msvc-170).

Next, run the instrumented binary to generate the profile data:

```console
.\div_bench.exe
```

This execution creates profile data files (typically with a `.pgc` extension) in the same directory.

Now recompile the program using the `/USEPROFILE` flag to apply optimizations based on the collected data:

```console
cl /O2 /GL /D BENCHMARK_STATIC_DEFINE /I "$VCPKG\include" /Fe:div_bench_opt.exe div_bench.cpp /link /LTCG:PGOptimize /USEPROFILE /PGD:div_bench.pgd /LIBPATH:"$VCPKG\lib" benchmark.lib benchmark_main.lib shlwapi.lib
```

In this command, the [USEPROFILE linker option](https://learn.microsoft.com/en-us/cpp/build/reference/useprofile?view=msvc-170) instructs the linker to enable PGO with the profile generated during the previous run of the executable.

### Run the optimized binary

Now run the optimized binary:

```console
.\div_bench_opt.exe
```

The following output shows the performance improvement:

```output
Running ./div_bench.opt
Run on (4 X 2100 MHz CPU s)
CPU Caches:
L1 Data 64 KiB (x4)
L1 Instruction 64 KiB (x4)
L2 Unified 1024 KiB (x4)
L3 Unified 32768 KiB (x1)
Load Average: 0.10, 0.03, 0.01
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------
baseDiv/1500 2.86 us 2.86 us 244429
```

As the terminal output above shows, the average execution time is reduced from 7.90 to 2.86 microseconds. This improvement occurs because the profile data informed the compiler that the input divisor was consistently 1500 during the profiled runs, allowing it to apply specific optimizations.