Skip to content

Performance overview

Hannupekka Sormunen edited this page Dec 30, 2018 · 3 revisions

Most of the this program was written on the C++ Tutorial for Complete Beginners course on section called Developing a Program: the Particle Fire Simulation. However, like the lecturer also noted, there are places where the code could be improved. For this there are great information floating around the web, such as this Wikibook. This page is provide overview of how the program was improved and why the title has the word revision in it. For calculations of program’s performance there are tools available in variety of integrated development environments. On this performance overview report Visual Studio 2017’s performance profiler was used. It gives great views about where Central processing unit’s (processor) performance is going and also how much computer’s memory the program is reserving during execution and even displays what lines of code are the most resource intensive. All the results on this page are based on Visual Studios profiler’s output. Finally all the results are compared between the original version found from this repo and the revisioned version.

Table of Contents

  1. Inclusion of frame rate limiter
  2. Improving the box blur algorithm
  3. Tweaking the SDL-library initialization and pixel buffer
  4. Other improvements
  5. Final performance tests

Inclusion of frame rate limiter

First things first, since this is a graphical program there is always one way to make the program more efficient, and not use every bit of computing power available for every second, is by implementing a frame rate limiter. To recap what the point of limiting frame rate is:

... to limit speed of game loop and save precious computer resources to another operations that user may want to do while using the program.

Adding a simple frame rate limiter gave some efficiency to program, here are is two diagnostic tools results running the same program with and without limit: First the original program from the course, without any frame rate limitation. Program is running around 84-88 frames per second: No framerate limit

Original version of the program with with frame rate limit of 60 frames per second: Framerate limit

In both cases program is running at resolution of 800 in width and 600 height (stated 800x600, widthXheight, from now on). With frame rate limiter, processor usage is not constant but rather cyclic and little bit lower than with no frame rate limitation. This also reflects to total amount of processor usage displayed in Total CPU-column, which is about 20000 processor cycles less. This is a little improvement but not a deal breaker yet, still it is a start to right direction. Next up some tweaking under the hood, as the source code still needs improvement. Profiler provides information where there could be more room for improvement.

Improving the box blur algorithm

As stated, diagnostic tool can be used to find lines of code that uses a lot of processor performance and could potentially be faster, if improved somehow or even removed completely if possible. First thing, that is also shown in the screenshots above, is the box blur data method that does a lot of for-loops. For better understanding how box blurring algorithm is implemented, it is recommend to read a wiki-entry about it. On the original program there is even an if-statement with multiple &&(AND)-operators. Loops combined with AND-operators, especially ones with several conditions, are not very effective. Since on each loop, it is necessary to analyze all of the condition provided one by one, and because there are only about 0.5% of cases (explained later) that do not pass the if-condition it means that over 99% of the time every variable has to be checked. This really has a impact on the execution of the program. Here is the original algorithm with if-statement included:

// Swap the buffers, so pixel is in m_buffer2 and we are drawing to m_buffer1.
Uint32 *temp = m_buffer1;
m_buffer1 = m_buffer2;
m_buffer2 = temp;

for(int y=0; y<SCREEN_HEIGHT; y++) {
	for(int x=0; x<SCREEN_WIDTH; x++) {
		int redTotal=0;
		int greenTotal=0;
		int blueTotal=0;

		for(int row=-1; row<=1; row++) {
			for(int col=-1; col<=1; col++) {
				int currentX = x + col;
				int currentY = y + row;

				//If statement with multiple conditions
				if(currentX >= 0 && currentX < SCREEN_WIDTH && currentY >= 0 && currentY < SCREEN_HEIGHT) {
					Uint32 color = m_buffer2[currentY*SCREEN_WIDTH + currentX];

					Uint8 red = color >> 24;
					Uint8 green = color >> 16;
					Uint8 blue = color >> 8;

					redTotal += red;
					greenTotal += green;
					blueTotal += blue;
				}
			}
		}

		Uint8 red = redTotal/9;
		Uint8 green = greenTotal/9;
		Uint8 blue = blueTotal/9;

		setPixel(x, y, red, green, blue);
	}
}

If statement is used for blocking loop from trying to access m_buffer2 array’s elements, that are not available thus preventing a segmentation fault. On which cases algorithm has possibility of such occurrence happening? Since box blur algorithm gathers total values of eight neighboring pixel’s color channels and calculates an average on each color channel. Segmentation fault can happen when accessing negative numbers such as -1, which are not possible, since value 0 is the first element in an array and the next elements are always positive values such as 1,2 and 3000. Conversely value can also be too high, since there is a fixed number of elements allocated to the array, which is the total amount of pixels on the screen. For example resolution of 800x600 there is total of 480000 pixels (800 multiplied by 600) and algorithm tries to access elements that are exactly that value or over. Segmentation fault happens in both of these cases because memory location they are trying to access is not allocated to the variable. For example if the left topmost pixel’s (X:0 and Y:0) neighboring pixels were tried to access with current algorithm but without the if-statement, here are the locations that would cause segmentation faults:

rows 🡆 cols 🡇 -1 0 1
-1 Segmentation fault Segmentation fault Segmentation fault
0 Segmentation fault 🢄 🢁 🢅
🢀Current pixel position🢂
🢇 🢃 🢆
OK
1 Segmentation fault OK OK

Since loops with multiple statements are not effective, the solution to this is to change how blurring algorithm works. As described at the box blur wiki page, instead of traveling through all the pixels, the effect still works when first and last pixels of both width and height axis are skipped. So the solution is to modify first two loops at the beginning of the algorithm:

for(int y=1; y<SCREEN_HEIGHT - 1; y++) {
	for(int x=1; x<SCREEN_WIDTH - 1; x++) {

Here is an example how the new algorithm works at resolution of 800x600:

🡇 Height ------ ------ ------ ------ ------ ------ ------ ------ ------
Width 🡆 X:0 X:1 X: 2 X:.. X:400 X:.. X:798 X:799 X:800
Y:0 No Blur No Blur No Blur No Blur No Blur No Blur No Blur No Blur No Blur
Y:1 No Blur Blur Blur Blur Blur Blur Blur Blur No Blur
Y:2 No Blur Blur Blur Blur Blur Blur Blur Blur No Blur
Y:.. No Blur Blur Blur Blur Blur Blur Blur Blur No Blur
Y:300 No Blur Blur Blur Blur Blur Blur Blur Blur No Blur
Y:.. No Blur Blur Blur Blur Blur Blur Blur Blur No Blur
Y:588 No Blur Blur Blur Blur Blur Blur Blur Blur No Blur
Y:599 No Blur Blur Blur Blur Blur Blur Blur Blur No Blur
Y:600 No Blur No Blur No Blur No Blur No Blur No Blur No Blur No Blur No Blur

Now the if-statement that prevents access to undesired locations can be completely removed. With this tweak whole box blur algorithm can be made about 10% faster without actually sacrificing box blur effects outcome, since elements that are marked with No Blur are still accessed by algorithm when neighboring pixels are checked. What this tweak does is prevent the loops from accessing elements that are neighboring the No Blur elements, such as X:-1 Y:-1 location, which is not allocated element to the pixel buffer.

Tweaking the SDL-library initialization and pixel buffer

Other major improvement to box blur algorithm, and the whole program, happens with tweaking how SDL-library and pixel buffer is initialized at the beginning. Here is what the original looks like:

bool Screen::init() {
	if (SDL_Init(SDL_INIT_VIDEO) < 0) {
		return false;
	}

	m_window = SDL_CreateWindow("Particle Fire Explosion",
	SDL_WINDOWPOS_UNDEFINED, SDL_WINDOWPOS_UNDEFINED, SCREEN_WIDTH, SCREEN_HEIGHT, SDL_WINDOW_SHOWN);

	if (m_window == NULL) {
		SDL_Quit();
		return false;
	}

	m_renderer = SDL_CreateRenderer(m_window, -1, SDL_RENDERER_PRESENTVSYNC);
	
m_texture = SDL_CreateTexture(m_renderer, SDL_PIXELFORMAT_RGBA8888,
			SDL_TEXTUREACCESS_STATIC, SCREEN_WIDTH, SCREEN_HEIGHT);

	if (m_renderer == NULL) {
		SDL_DestroyWindow(m_window);
		SDL_Quit();
		return false;
	}

	if (m_texture == NULL) {
		SDL_DestroyRenderer(m_renderer);
		SDL_DestroyWindow(m_window);
		SDL_Quit();
		return false;
	}

	m_buffer1 = new Uint32[SCREEN_WIDTH * SCREEN_HEIGHT];
	m_buffer2 = new Uint32[SCREEN_WIDTH * SCREEN_HEIGHT];

	memset(m_buffer1, 0, SCREEN_WIDTH * SCREEN_HEIGHT * sizeof(Uint32));
	memset(m_buffer2, 0, SCREEN_WIDTH * SCREEN_HEIGHT * sizeof(Uint32));

	return true;
}

When renderer is initialized on the line “m_renderer = SDL_CreateRenderer(...” instead of using SDL_RENDERER_PRESENTVSYNC-flag, some increase in performance can be have with SDL_RENDERER_ACCELERATED-flag instead. As SDL-wiki states the with this flag, renderer uses hardware acceleration, meaning that computer’s graphics processing unit is used for graphics calculation when available. While SDL uses hardware most of the the time, even if hardware acceleration flag would not be “raised”, this change to code makes sure that all the workload is not pushed to the processor only and other hardware plays their part when displaying the program. Since SDL 2.0 uses hardware rendering most of the time when it is available, this tweak does not provide a lot of difference, but is there to make sure that everything is initialized correctly from the beginning. The next one however has a much bigger effect. Line “m_texture = SDL_CreateTexture(...” initializes a texture to window. For better explanation to what exactly is happening please refer to last section of swarm object explanation.

What CreateTexture function does is places a texture for window and renderer created that were initialized earlier. There are also several flags that can be set to modify how the texture behaves. First one is selecting pixelformat. For understanding how these pixelformats works please refer to this wiki-page. What is used on the original program is SDL_PIXELFORMAT_RGBA8888-flag while there many many pixel formats where to choose from, the bottom line is, the simpler the pixelformat is, the faster the program will perform. However it is still better to use 24-bit color for making sure that there are nice variety of color available for the purpose of this program. So instead of RGBA8888-pixelformat something that behaves closely the same as the the original program but still faster would be SDL_PIXELFORMAT_RGB888. This format drops unnecessary alpha channel, for setting up color transparency. Transferring from 32-bit color to 24-bit color helps the box blur perform much faster, since there is no need to bitwise shift so many pixels. The process of bitwise shifting is documented on color-wiki. Here is what has to be done with 32-bit color:

Uint8 red = color >> 24;
Uint8 green = color >> 16;
Uint8 blue = color >> 8;

redTotal += red;
greenTotal += green;
blueTotal += blue;

Bitwise shifting must be done three times when 32-bit color is used, when it can be reduced to two when using 24-bit color and also tweak how the bitwise shifting is done and how values are stored in to the variables:

Uint8 blue(color);
blue_total += blue;

color >>=8;
Uint8 green(color);
green_total += green;

color >>=8;
red_total += color;

This one less bitwise shifting and one less variable initialization enhances box blur algorithm performance significantly, which can be seen in final screencaps at the last header of this wiki.

Final SDL tweak is done by replacing renderer’s SDL_TEXTUREACCESS_STATIC-flag to SDL_TEXTUREACCESS_STREAMING since according to this tutorial STREAMING-flag allows editing the texture's pixels and it is better suited to programs where pixels are constantly changing, such as this one has. Finally there is two pixel buffers, which is not actually needed when using SDL_TEXTUREACCESS_STREAMING and because of that, this code block can be completely removed from the program:

Uint32 *temp = m_buffer1;
m_buffer1 = m_buffer2;
m_buffer2 = temp;

Since this is no longer necessary to be executed on every loop, it makes algorithm perform faster.

Other improvements

Wikibook on optimization provided couple of good points how to make programs bit faster and here are some of them used in this project also:

  1. The most efficient types states that variables should be defined according to their use. This is why Uint8, Uint16 and Uint32 variables provided by SDL are used as much as possible since they are guaranteed to be certain size and do not depended on compiler or platform, thus being much more efficient. Likewise where there is better to use standard int and long they are used.
  2. Static member functions static data methods and static data members are used as much as possible, which were already used in the original program as well.
  3. Initializations, so that variables are initialized with (), rather than assigned with = upon their definition.
  4. And of course Wikibook’s page about performance worsening features to make sure, that program doesn’t have something that makes performance worse.

Final performance tests

Here are results of each program running both with same resolution 800x600, without frame rate limitation and 5000 particles on screen. Final difference in processor usage between original program and a program with these improvements is around 70% as can be seen on these performance profiler reports:

Particle fire simulation box blur

Particle fire simulation revision box blur

Without optimization the code uses around 60000 processor cycles in a minute, of which 60% are used on the box blur algorithm only. With optimization presented on this wiki amount of cycles can be reduced to some 17500 cycles per minute, of which 80% are used on the box blurring. As can be seen these performance tweaks effect is major and it makes the program run with much faster frame rates:

Particle fire simulation fps

Particle fire simulation revision fps

Running the same programs with tweaks is around 50 frames per second faster. While this animation is very smooth on both programs, biggest difference can be witnessed with standard HD-resolutions, such as 1280x720 and especially on Full HD resolution 1920x1080. Where the non-revisioned program struggles to get past 24 frames per second, which might look bit jerky on certain screen, while revisioned version still runs at solid 60 frames per second.

Clone this wiki locally