Profiling and Timing performance

UPDATE: The corrected code snippet can be found at the end of this topic: https://community.arduboy.com/t/profiling-and-timing-performance/9325/9?u=gamecre8ion

Hi All,

This week I received my Arduboy and really enjoying experimenting with it. Nice piece of hardware! I cannot believe I missed out on this the last couple of years…

While playing around I was thinking about measuring some performance of the Arduboy. I did not found any specific information on the forum (could be my searching skills…).

When I run the snippet below it gives a performance of 4 micro-seconds, which would be wonderful but in my opinion not a realistic value for doing 1024 reciprocal square root calculations.
I am wondering what I am doing wrong; is there something wrong in my minimalistic approach or is it maybe the compiler optimizing the loop? (is the AVR compiler running the reciprocal square root only once for the value 1024)?

Any feedback is appreciated!

#include <Arduboy2.h>
#include <stdint.h>

#define NR_TESTS 1024

Arduboy2 arduboy;

uint32_t startTime = 0;
uint32_t endTime = 0;
uint32_t perfTime = 0;
float answer = 0.0f;

void runTests(uint16_t nrTests)
{
    // test expensive operation
    startTime = micros();
    for(uint16_t i = 1; i <= nrTests; i++)
    {
      answer = 1.0f / sqrt((float)i);
    }
    endTime = micros();
    perfTime = endTime - startTime;    
}

void setup() 
{
  arduboy.begin();
  runTests(NR_TESTS);
  arduboy.setFrameRate(15);  
}

void loop() 
{  
  if((!arduboy.nextFrame()))
  {
    return;
  }
  
  arduboy.clear();

  arduboy.print("Inverse Sqrt\n");
  arduboy.print("Op time = ");
  arduboy.print(perfTime);  
  arduboy.print(" us\n");
    
  arduboy.display();
}

Worse. The entire loop is eradicated:

void runTests(uint16_t nrTests)
{
    // test expensive operation
    startTime = micros();
    16b2:	0e 94 07 05 	call	0xa0e	; 0xa0e <micros>
    16b6:	6b 01       	movw	r12, r22
    16b8:	7c 01       	movw	r14, r24
    16ba:	c0 92 50 01 	sts	0x0150, r12	; 0x800150 <__data_end>
    16be:	d0 92 51 01 	sts	0x0151, r13	; 0x800151 <__data_end+0x1>
    16c2:	e0 92 52 01 	sts	0x0152, r14	; 0x800152 <__data_end+0x2>
    16c6:	f0 92 53 01 	sts	0x0153, r15	; 0x800153 <__data_end+0x3>
    for(uint16_t i = 1; i <= nrTests; i++)
    {
      answer = 1.0f / sqrt((float)i);
    }
    endTime = micros();
    16ca:	0e 94 07 05 	call	0xa0e	; 0xa0e <micros>
    perfTime = endTime - startTime;    
    16ce:	6c 19       	sub	r22, r12
    16d0:	7d 09       	sbc	r23, r13
    16d2:	8e 09       	sbc	r24, r14
    16d4:	9f 09       	sbc	r25, r15
    16d6:	60 93 5c 01 	sts	0x015C, r22	; 0x80015c <perfTime>
    16da:	70 93 5d 01 	sts	0x015D, r23	; 0x80015d <perfTime+0x1>
    16de:	80 93 5e 01 	sts	0x015E, r24	; 0x80015e <perfTime+0x2>
    16e2:	90 93 5f 01 	sts	0x015F, r25	; 0x80015f <perfTime+0x3>

If a value is only written to and never actually read from then there’s no need to store it,
hence the compiler just gets rid of the variable.
With no variable to be written, any attempt to write to said variable is futile,
hence the compiler gets rid of the attempts to write to it.
Without the attempt to write to answer the body of the loop is empty,
and thus the loop itself is no longer needed either so the compiler gets rid of that too.

Flawless Vulcan logic.

1 Like

Ohhh… I can’t see why I overlooked this… almost embarrassing… almost… :smiley:

Thank you @Pharap; btw could I ask how you disassemble this code? using avr-objdump? How did you find the exact function location in the disassembly? I am still learning all the tools available.

“Logic is the beginning of wisdom… not the end.” - Spock

Btw in order to be complete a simple “fix” could thus be changing the loop and “printing” the last result.

void loop() 
{  
  if((!arduboy.nextFrame()))
  {
    return;
  }
  
  arduboy.clear();

  arduboy.print("Inverse Sqrt\n");
  arduboy.print("Last result = ");
  arduboy.print(answer);
  arduboy.print("\n");
  arduboy.print("Op time = ");
  arduboy.print(perfTime);  
  arduboy.print(" us\n");
    
  arduboy.display();
}

You could also make answer volatile, thus forcing it to be written.
volatile float answer = 0.0f;

There’s a slight chance that you’re solution would only run the for loop once for the final answer (if the compiler is smart enough to recognise that’s all it needs).

You can use Compiler Explorer to look at the code generated by GCC with various options and optimization levels. There is a setting to use the AVR compiler for Arduino in the tool drop-down. See https://gcc.godbolt.org

Yes, I used objdump.

It should be located at <Arduino's installation path>/hardware/tools/avr/avr/bin and named objdump rather than avr-objdump.
(Specifically objdump.exe on Windows.)

I ran it with the -d and -s flags.
I believe the former produces the assembly code and the latter is what causes the source to be intermingled with the assembly, but I’m not entirely sure, I haven’t played around with it much.

It’s not something I use often but it comes in handy from time to time when you really want to know what the compiler is actually doing.

Ctrl+f, “runTests”.

As I mentioned before, -d -s will spit out the assembly code intermingled with the source code so all you have to do is redirect it into a text file, open that in a text editor and do a simple find operation.

As @MLXXXp mentioned you should probably also mark answer as volatile to try to prevent the compiler from performing certain optimisations that would eliminate other code that you definitely want to run.

It’s funny how a few decades ago the problem was that the compilers weren’t smart enough to produce code that was small and fast enough for general use, whereas now the problem is that the compilers are too smart and it makes benchmarking harder. (The newer problem is probably the nicer one to have though.)


If you’re only benchmarking you could probably just stick everything in setup and do away with the nextFrame check. (Not that it particularly matters.)

1 Like

I fully agree, much cleaner and more explicit in its intention.

Only one way to find out :slight_smile: I will try it as an exercise just for the fun of it!

Thank you! I just recently learned about godbolt and did not realize there would be a AVR GCC compiler. Which again proofs that I should not assume, but check first.

This is really helpful. The combination with Notepad++ works like a charm indeed for a quick scan.

Completely agree, I must confess that I already have a small set of boiler plate code which I just copy and paste for building quick tests.

Thanks all for the extensive clear explanations and feedback!

A community forum question I was wondering about; I would like to adapt the proposed minor code changes. What is the accepted way of handling the change:

  1. repost the revised code in a new reply
  2. edit the code in the original post with the notification that it changed based on the discussion
  3. leave it as it is (not my desired option)
1 Like

1 or 2 is fine. If you put it in a new post, you could edit the first post to refer to the post with the new code.

1 Like

Thanks all, with some delay hereby the corrected code snippet:

#include <Arduboy2.h>
#include <stdint.h>

#define NR_TESTS 1024

Arduboy2 arduboy;

uint32_t startTime = 0;
uint32_t endTime = 0;
uint32_t perfTime = 0;
volatile float answer = 0.0f;

void runTests(uint16_t nrTests)
{
    // test expensive operation
    startTime = micros();
    for(uint16_t i = 1; i <= nrTests; i++)
    {
      answer = 1.0f / sqrt((float)i);
    }
    endTime = micros();
    perfTime = endTime - startTime;    
}

void setup() 
{
  arduboy.begin();
  runTests(NR_TESTS);   
}

void loop() 
{   
  arduboy.clear();

  arduboy.print("Inverse Sqrt\n");
  arduboy.print("Op time = ");
  arduboy.print(perfTime);  
  arduboy.print(" us\n");
    
  arduboy.display();
}
2 Likes