Arduboy2 library paintScreen() optimisation

Continuing the discussion from Arduboy2 - The now recommended alternative to the Arduboy Library:

Just about every game will use interrupts. The Arduino millisecond timer is always running unless a sketch specifically disables it.

@Mr.Blinky’s code is open loop. If an interrupt occurs, it will just delay an SPI clock/bit transmission. There’s not really anything you can do to improve overall sketch performance when this happens. It’s best to just not worry about it.

1 Like

The performance of the cycle accurate function will be the same. But you can say that the delay in the cycle accurate function wouldn’t be necessary in case of an interrupt. However since an interrupt during paintScreen doesn’t happen often (max 2 times: 18 cycles /16MHz x 1024 bytes = 1.152 msec) it’s not really relevant.

HaHa nice! you kicked out 4 bytes!!

IIRC I tried CPSE in an earlier version but I couldn’t get the timings right. Also In your version the loop takes 18 cycles without clear and 19 cycles with clear.

Since this naggs me and it’s my turn again :slight_smile:

here’s my counter move version: 18 cycles and even two more bytes less (tested).

void Arduboy2Core::paintScreen(uint8_t image[], bool clear)
{
  //data only transfer with clear support at 18 cycles per transfer
  asm volatile (
    "     ldi   r24,%[len_lsb]                     \n\t"
    "     ldi   r25,%[len_msb]                     \n\t"
    ".l1: ld    r20, Z                  ;2         \n\t"
    "     out   %[spdr], r20            ;1         \n\t" 
    "     cpse  %[clear], __zero_reg__  ;1/2       \n\t" //if (clear) *(image++) = 0 
    "     mov   r20, __zero_reg__       ;1         \n\t" 
    "     st    Z+, r20                 ;2         \n\t" 
    "     rcall .l2                     ;3+4       \n\t" //call 'ret' of this function
    "     sbiw  r24, 1                  ;2         \n\t"
    "     brne  .l1                     ;1/2 : 18  \n\t"
    ".l2:                                          \n\t" 
    :
    : [ptr]     "z" (image),
      [spdr]    "I" (_SFR_IO_ADDR(SPDR)),
      [len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8),
      [len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
      [clear]   "a" (clear)
    : "r20", "r24", "r25"
  );
  //note: do not add C code and declare with __attribute__ ((noinline)) in header file due to the assembly calling the functions own ret instruction

Make sure you add the no inline attribute in Arduboy2core.h:

    void static paintScreen(uint8_t image[], bool clear = false) __attribute__ ((noinline));

@veritazz any moves left? :sunglasses:

1 Like

Why do you need this? If you force the compiler to create a standalone function, it will probably take more code outside the function to call it.

The function must remain a subroutine. otherwise it won’t have a return instruction.

1 Like

Man you are a genius. :star_struck: Awesome, this is so much fun! :joy: The trick with the ret instruction is unbelievable and a smart move. I need to think about your move. As a reflex I can only write it down with less lines but same amount of flash and cycles:

void Arduboy2Core::paintScreen(uint8_t image[], bool clear)
{
  //data only transfer with clear support at 18 cycles per transfer
  asm volatile (
    "     ldi   r24,%[len_lsb]                     \n\t"
    "     ldi   r25,%[len_msb]                     \n\t"
    ".l1: ld    r20, Z                  ;2         \n\t"
    "     out   %[spdr], r20            ;1         \n\t" 
    "     cpse  %[clear], __zero_reg__  ;1/2       \n\t" //if (clear) *(image++) = 0 
    "     st    Z+, __zero_reg_             ;2         \n\t" 
    "     call .l2                     ;4+4       \n\t" //call 'ret' of this function
    "     sbiw  r24, 1                  ;2         \n\t"
    "     brne  .l1                     ;1/2 : 18  \n\t"
    ".l2:                                          \n\t" 
    :
    : [ptr]     "z" (image),
      [spdr]    "I" (_SFR_IO_ADDR(SPDR)),
      [len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8),
      [len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
      [clear]   "a" (clear)
    : "r20", "r24", "r25"
  );
  //note: do not add C code and declare with __attribute__ ((noinline)) in header file due to the assembly calling the functions own ret instruction

I know it is not a valid move, I need to think about it :sunglasses:
Maybe tomorrow morning I need to give up…

So the compiler will have to add a call and return. So is there an actual net savings compared to if the compiler could inline?

No sorry, my move was incorrect… :frowning: Forget what I wrote. I need to rethink… :smiley:

Yes it is! and rewarding :slight_smile:

If the function is put inline it may save 2 or 4 bytes for the call. but you may also loose a few more depending on how many local variables the incapsulating function uses. When the compiler decides it needs another pointer or use lower registers it will stack more registers or use the stack for local variables.

paintScreen seems mostly used twice though. Once in begin/boot and once in main loop.

BTW anyone know how to call a label outside a function directly with inline assembly? (using extern “C” doesn’t work as avr-gcc is the compiler being used instead of avr-g++)

Ok here you go. Not sure if the move is legal but I could shove off another 2 bytes :sunglasses:
Therefore I had to change the meaning of the paramter. It is now called noClear. So if set to 1 nothing changes, if it is 0 the buffer gets cleared. Any other value might create nice effects in the buffer (NEW feature, tada!). And the function now takes 18 cycles with or without clear. I think that was it already in your original version but all the optimization might screwed it up.

void Arduboy2Core::paintScreen(uint8_t image[], uint8_t noClear)
{
  //data only transfer with clear support at 18 cycles per transfer
  asm volatile (
"     ldi   r24,%[len_lsb]                      \n\t"
"     ldi   r25,%[len_msb]                      \n\t"
".l1: ld    r20, Z                   ;2         \n\t"
"     out   %[spdr], r20             ;1         \n\t" 
"     mul   r20, %[noClear]          ;2         \n\t" 
"     st    Z+, __tmp_reg__          ;2         \n\t" 
"     rcall .l2                      ;3+4       \n\t" //call 'ret' of this function
"     sbiw  r24, 1                   ;2         \n\t"
"     brne  .l1                      ;1/2 : 18  \n\t"
".l2:                                           \n\t" 
:   
: [ptr]     "z" (image),
  [spdr]    "I" (_SFR_IO_ADDR(SPDR)),
  [len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8), 
  [len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
  [noClear]   "a" (noClear)
: "r20", "r24", "r25"
  );
  //note: do not add C code and declare with __attribute__ ((noinline)) in header file due to the assembly calling the functions own ret instruction

We can write this version also for 17 cycles (use AND instead of MUL) but I remember you said 18 is the minimum otherwise we will loose data, right?

Not sure if there is a move left, but who knows. :thinking:

Btw, this is so enlightening. I cannot express how I enjoy this!

HaHa awesome idea to use MUL:D I wouldn’t have thought of it :+1:

Though the reversed Clear state would make it incompatible with existing sketches. ie you’de have to edit them. You won’t save two more bytes though as you will need to add an EOR to restore the zero reg.

for using the AND I’m asuming noClear would have states 0x00/0xFF right? or did you have something else in mind?

Edit:

Sticking with Clear and sacrificing two bytes. We can use:

    "     neg   %[clear]                           \n\t"

and use

    "     and   r20, %[clear]           ;1         \n\t" 

Hmm… fix timing…saving bytes,Need to get back on that.

2 Likes

Okay here’s my last optimized paintScreen version thats a 100% replacement (no alternate clear states). Technically the core takes up two more bytes. but since the ret instruction is no longer mandatory and I fixed a inline operand typo it saves two more bytes.

  • Got rid of the inpopular CALL RET trick making inline possible again
  • used temp register to clutter one register less
  • changed operand register for clear from “a” to “r” saving 2 bytes on an unneccesary mov “a”, r24 instruction (I’m still wondering why I didn’t notice this before :crazy_face:

Compiles to 24 bytes or 26 bytes inclusive ret and takes 18 cycles per SPI transfer (tested)

void Arduboy2Core::paintScreen(uint8_t image[], bool clear)
{
  asm volatile (
    "     ldi   r26,%[len_lsb]                      \n\t" //for (len = WIDTH * HEIGHT / 8;
    "     ldi   r27,%[len_msb]                      \n\t"
    "1:   ld    __tmp_reg__, Z            ;2        \n\t" //tmp = *(image)
    "     out   %[spdr], __tmp_reg__      ;1        \n\t" //SPDR = tmp
    "     cpse  %[clear], __zero_reg__    ;1/2      \n\t" //if (clear) tmp = 0;
    "     mov   __tmp_reg__, __zero_reg__ ;1        \n\t" 
    "2:   sbiw  r26, 1                    ;2        \n\t" //len --
    "     sbrc  r26, 0                    ;1/2      \n\t" //loop twice for cheap delay
    "     rjmp  2b                        ;2        \n\t"
    "     st    Z+, __tmp_reg__           ;2        \n\t" //*(image++) = tmp
    "     brne  1b                        ;1/2 :18  \n\t" //len > 0
    :
    : [ptr]     "z" (image),
      [spdr]    "I" (_SFR_IO_ADDR(SPDR)),
      [len_msb] "M" (WIDTH * (HEIGHT / 8 * 2) >> 8),   // 8: pixels per byte
      [len_lsb] "M" (WIDTH * (HEIGHT / 8 * 2) & 0xFF), // 2: for delay loop multiplier
      [clear]   "r" (clear)
    : "r26", "r27"
  );

I just tested Dark Under that uses display() and puts paintScreen() inline.

  • The standard Arduboy2 version of paintScreen compiles to 38 bytes inline code (clear=false, clearcode optimized out)
  • My new version compiles to 28 bytes (2 extra bytes for clear = false instruction) that’s 10 bytes less :slight_smile:
2 Likes

Wow, again an awesome move. So much things that I learned just by sharing assembly code with you. I think I will do so more often from now and wait for your comments :star_struck:

We should do these discussions more often. Sad that CALL RET trick has to leave but I can understand that this is nothing for a library.

I will think about the code you shared. I like it. I even see that you managed to make the cycle count the same with and without clear, nice :+1:
Looks pretty tight this time. Great job. You are the true hero of inline assembly. I am excited to see more of your magic. :mage:

1 Like

Thanks. Please do. These things are fun and educational. I’m sure I won’t forget your MUL trick anytime soon.

Added to my list of optimalisation tricks:

CPSE state, __zero_reg__ //if (state) int8_t var = 0; (state =0, <>0)
MOV 'r', __zero_reg__
MUL  'r', state          //if (!state) int8_t var = 0;((state =0, 1; __zero_reg__ remains 0)
AND  'r', state          //if (!state) int8_t var = 0;((state =0, 0xFF)

I m currently optimizing the paintScreen code for the alternate displays. After that the next challenge will probably be drawCompressed(). I had a look at the compiled code of @Pharap’s optimized drawCompressed() and saw that the compiler really struggled with it. With a bit of luck 0.5K can be kicked out of the 0.75K function

2 Likes

Thanks. Glad you liked one of my moves :wink:

Ok, that sounds like a big job but if you really manage to do so then this is a big achievement.
I also have bulky functions in my current project which I think blow up the code unnecessarily.
The day I will post the code of this project, I hope you can have a look at it and give some hints. :sunglasses:

I wouldn’t be surprised if drawCompressed can be shrunk down further. Most of my changes were just obvious stylistic changes like getting rid of the global variable in favour of a local one, turning the struct into a proper class and neatening things up a bit, I left most of the original code intact because I wanted to be a bit cautious and not go overboard with changes.

(To be honest a lot of the variable names are still pretty confusing and there’s hardly any documentation of the format.)

Edit:
A quick glance over it makes me think brow is a likely bottleneck.
Like I said before though, the code’s quite messy so frankly I’m not even sure what some of it is doing (brow included) despite understanding the format.

Yes they are. Some documentation would have been nice. But lack of documentation will make you look at the code more closely.
The readBits part was easy to follow and reasonably to assembly optimize though. Kicked out 100+ bytes of that part already. But the rest is more challenging.

How about starting a new topic if you’re going to continue discussing optimisation of drawCompressed() (like this paintScreen() topic was split from the general Arduboy2 library discussion)?

1 Like

Good idea. I like to see shared code how people think they can optimize or already optimized certain functions. It is enlightening for both the programmer (getting feedback) who is doing the optimization and the people starring at the magic happening (they learn and give feedback).

Beside that maybe a thread about assembler optimization during development would be great. So other people can see if they have the same issue or have a solution for an issue somebody else has.

Yeah I probably should start one or start a general Arduboy2 optimisations thread ?

Separate topics for specific cases may be easier to follow for someone wanting to read through from the beginning at a later date. You can always link from a previous topic to allow easily finding other related topics. (Click on the time/date at the top right of a post that you want to start a new topic from and select: + New Topic)