The new paintScreen() code has now been commited to GitHub for inclusion in the next Arduboy2 library release.
I just discovered that the screen of Evade 2 will roll sideways to the right in game. it seems that the OLED display fails to see the invert() command as commands and outputs it as data on screen causing the roll.
After some testing it seems that only adding:
while (!(SPSR & _BV(SPIF)));
at the end of the function will resolve this issue.
OK, I’ll add this before it’s released.
For anyone still interested in this topic, there’s a small error in the last code listing I posted above.
The line:
"2: sbiw %a[count], 1 ;2 \n\t" //len --
should be
"2: sbiw %A[count], 1 ;2 \n\t" //len --
with a %A instead of %a
The %a will convert to the name of a pointer register pair (X, Y, Z).
The %A will convert to the name of the lower register of a special register pair (r24, r26, r28, r30).
If the compiler chooses X for count, which is 26:r27 (or Y r28:r29 or Z r30:r31) then
sbiw %a[count], 1
will convert to
sbiw X, 1
Even though in this case the proper parameter for the sbiw instruction should be r26, the compiler seems to accept X instead, and compile correctly. The same goes for Y and Z.
However, if the compiler chooses special register pair r24:r25 for count it can’t convert the %a to X, Y or Z so the compile fails with:
error: invalid 'asm': address operand requires constraint for X, Y, or Z register
(And since all assember lines get concatenated as one long string, the compiler doesn’t point to exactly which assembler line caused the problem, making it difficult to find.)
By using %A the compiler will properly convert the special register pair it chooses for count to the proper lower register name:
r26 for X, r28 for Y, r30 for Z and r24 for pair r24:r25 (which doesn’t have a single letter name because it’s not a pointer register).
It’s these kind of "gotcha"s that make programming inline assembler dangerous. It was one specific sketch that I was using for testing a different change that caused this problem to show, long after I had finished testing the new paintScreen() code.
I just came across the Arduboy2 library as I want to integrate it in my current project. For this project I am hunting for performance so I had a look at the “paintScreen” function. It is quite optimized already but I had the following idea:
Lets replace the for loop in paintScreen with the following code:
for (int i = 0; i < HEIGHT * WIDTH / 8; i++) {
uint8_t data = pgm_read_byte(image + i);
while (!(SPSR & _BV(SPIF))) {}
SPDR = data;
}
while (!(SPSR & _BV(SPIF))) {}
The idea is to use the fact that pgm_read_byte might take 3 cycles. So for the second round in the loop we can read and easily use 3 cycles and only after then wait for the SPI to finish. That way we can use the cycles that the “nop” would have otherwise burned.
I have not a good environment to test it but maybe somebody else here has the right setup to test the different between the original code and the one above.
@Mr.Blinky has created PR #17, which includes paintScreen() written in assembler, optimised for speed (not size). I’m currently evaluating this for the next Arduboy2 library release.
EDIT: Sorry, I didn’t read carefully enough to notice that you’re talking about the PROGMEM version. This function is rarely used since it’s not common to want to paint an entire 1024 byte constant image stored in program memory. It think it’s better to consider the size of this function, rather than speed.
Oh sorry, my fault. I referenced the wrong code.
No I meant the function that is used to update the screen, not the PROGMEM version. But anyway, if @Mr.Blinky has an assembly version I will look at it. I am pretty sure he also considered similar optimizations
@Mr.Blinky I had a look at the assembly code. I think you can squeeze out another 2 cycles if you do the same trick that I proposed in the C version. e.g. you ld the new data while spi still transfers the old, so you can replace one of the rjmp +0 with the ld instruction in the loop… like this (not tested nor compiled )
+ asm volatile (
+ " ldi r24,%[len_lsb] \n\t"
+ " ldi r25,%[len_msb] \n\t"
+ " ld r20, Z ;2 \n\t"
+ ".l1: out %[spdr], r20 ;1 \n\t"
+ " cp %[clear], __zero_reg__ ;1 \n\t" //if (clear) *(image++) = 0
+ " breq .l2 ;1/2 : 5/6 \n\t"
+ " st Z+, __zero_reg__ ;2 \n\t"
+ " rjmp .l3 ;2 \n\t"
+ ".l2: \n\t"
+ " adiw r30, 1 ;2 \n\t" // else *(image++)
+ " nop ;1 \n\t"
+ ".l3: \n\t"
+ " ld r20, Z ;2 \n\t"
+ " rjmp .+0 ;2 \n\t"
+ " rjmp .+0 ;2 \n\t"
+ " sbiw r24, 1 ;1 \n\t"
+ " brne .l1 ;1/2 : 18 \n\t"
+ :
+ : [ptr] "z" (image),
+ [spdr] "I" (_SFR_IO_ADDR(SPDR)),
+ [len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8),
+ [len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
+ [clear] "a" (clear)
+ : "r20", "r24", "r25"
+ );
What do you think?
Ah and even a little more with this. (skip store to image if no clear desired, do not inc Z on clear store, will be done anyway by adiw)
+ asm volatile (
+ " ldi r24,%[len_lsb] \n\t"
+ " ldi r25,%[len_msb] \n\t"
+ " ld r20, Z ;2 \n\t"
+ ".l1: out %[spdr], r20 ;1 \n\t"
+ " cpse %[clear], __zero_reg__ ;3 \n\t" //if (clear) *(image) = 0
+ " st Z, __zero_reg__ ;2 \n\t"
+ " adiw r30, 1 ;2 \n\t" // else *(image++)
+ " nop ;1 \n\t"
+ ".l3: \n\t"
+ " ld r20, Z ;2 \n\t"
+ " rjmp .+0 ;2 \n\t"
+ " rjmp .+0 ;2 \n\t"
+ " sbiw r24, 1 ;1 \n\t"
+ " brne .l1 ;1/2 : 18 \n\t"
+ :
+ : [ptr] "z" (image),
+ [spdr] "I" (_SFR_IO_ADDR(SPDR)),
+ [len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8),
+ [len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
+ [clear] "a" (clear)
+ : "r20", "r24", "r25"
+ );
Sorry for posting too much, probably had too much coffee
Maybe too fast now for the SPI?
@veritazz I had code that clocks out SPI data continiously at 16 cycles. But the OLED display couldn’t handle it. it occasionally lost a byte. The display needs the additional two cycles. So it was rewitten to clock out the data continiously at 18 cycles per byte.
The code comes at a cost of 20 extra bytes compared to the C version . But just 2 extra bytes when the clear option is used.
There’s about a 10 cycle 6 cycle per SPI transfer advantage as polling the SPIF bit code takes 12 cycles 8 cycles between SPI transfers according to my logic analyzer.
EDIT:
Timings where not those of Arduboy2 library paintScreen
Ah ok. So if the display controller needs the 18 cycles anyway then there is no need to optimize further.
So the only thing we can do for this function is use less progmem at the same speed. Look’s really tight, maybe not worth the effort
Ok one last try, I think the below code takes maybe (not measured) 2 cycles more that yours (if at all, not sure) but consumes less progmem. What do you think?
+ asm volatile (
+ " ldi r24,%[len_lsb] \n\t"
+ " ldi r25,%[len_msb] \n\t"
+ " ld r20, Z ;2 \n\t"
+ ".l1: out %[spdr], r20 ;1 \n\t"
+ " cpse %[clear], __zero_reg__ ;3 \n\t" //if (clear) *(image) = 0
+ " st Z, __zero_reg__ ;2 \n\t"
+ " adiw r30, 1 ;2 \n\t" // else *(image++)
+ " ld r20, Z ;2 \n\t"
+ ".l2: \n\t"
+ " sbic %[spsr], 7 ;1 \n\t"
+ " rjmp .l2 ;2 \n\t"
+ " sbiw r24, 1 ;1 \n\t"
+ " brne .l1 ;1/2 : 18 \n\t"
+ :
+ : [ptr] "z" (image),
+ [spdr] "I" (_SFR_IO_ADDR(SPDR)),
+ [spsr] "I" (_SFR_IO_ADDR(SPSR)),
+ [len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8),
+ [len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
+ [clear] "a" (clear)
+ : "r20", "r24", "r25"
+ );
I don’t think you can use SPIC
on SPSR
, since it’s not one of the lower 32 I/O addresses.
Damn it. You are right. Then this would need another instruction. Not sure if is still less progmem then before…
Ok I think I need to stop thinking about it for now, thought I could land a lucky shot
Thanks for keep trying to crunch it down.
I had another look at my code to see if I could crunch it down more.I can squeeze out two more bytes by replacing
" rjmp .+0 ;2 \n\t"
" rjmp .+0 ;2 \n\t"
" rjmp .+0 ;2 \n\t"
by reading dummy bytes from progmem:
" lpm __tmp_reg__, z ;3 \n\t"
" lpm __tmp_reg__, z ;3 \n\t"
and… I can remove two more bytes by replacing:
" adiw r30, 1 ;2 \n\t" // else *(image++)
" nop ;1 \n\t"
with another dummy read and Z+
" lpm __tmp_reg__, z+ ;3 \n\t" (Z++ and dummy reads for 3 cycle delay)
I wouldn’t have had another look at my code if it wasn’t for you @veritazz and saved 4 more bytes
OK first a small rectification
In the post earlier I mentioned the wrong timings for Arduboy2 paintScreen funtion. The time gap between SPI transfers in Arduboy2 library is 8 CPU cycles (0.5 microSeconds) instead of 12 CPU cycles.
Heres a screen capture of logic analyzer:
You can see from the ‘T’ indicator that a SPI clock cycle takes 0.125 micro seconds (2 CPU cycles). At the right side you can see that the time between the A1 and A2 timer markers takes 0.625 microseconds. Thats 0.125 micro seconds (2 CPU cycles) for the last SPI clock cycle and 0.5 micro seconds (8 CPU cycles) for the time between SPI transfers.
Below you’ll see the new assembly optimized version that takes 18 CPU cycles:
The last SPI cycle and gap take 0.25 micro seconds together. That’s 0.125 micro seconds (2 CPU cycles) for the last SPI clock cycle and 0.125 micro seconds (2 CPU cycles) for the time between SPI transfers.
Also the line:
while (!(SPSR & _BV(SPIF))); // wait for the last transfer to finish
Is obselete as when the function ends (returns) enough time has passed for the last SPI transfer to complete. Removing that line saves another 6 bytes. So that’s a total of 10 less bytes now
Here’s the new and tested paintScreen version (SSD1306 only version taking up 34 bytes (including return instruction):
void Arduboy2Core::paintScreen(uint8_t image[], bool clear)
{
//data only transfer with clear support at 18 cycles per transfer
asm volatile (
" ldi r24,%[len_lsb] \n\t"
" ldi r25,%[len_msb] \n\t"
".l1: ld r20, Z ;2 \n\t"
" out %[spdr], r20 ;1 \n\t"
" cp %[clear], __zero_reg__ ;1 \n\t" //if (clear) *(image++) = 0
" breq .l2 ;1/2 : 5/6 \n\t"
" st Z+, __zero_reg__ ;2 \n\t"
" rjmp .l3 ;2 \n\t"
".l2: \n\t" // else *(image++)
" lpm __tmp_reg__, z+ ;3 \n\t" //(Z++,dummy read for 3 cycle delay)
".l3: \n\t"
" lpm __tmp_reg__, z ;3 \n\t" //(dummy reads for 3 cycle delays)
" rjmp .+0 ;2 \n\t"
" sbiw r24, 1 ;2 \n\t"
" brne .l1 ;1/2 : 18 \n\t"
:
: [ptr] "z" (image),
[spdr] "I" (_SFR_IO_ADDR(SPDR)),
[len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8),
[len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
[clear] "a" (clear)
: "r20", "r24", "r25"
);
}
Will review the alternative displays this weekend.
Wow that is awesome. I like these kind of optimizations.
Just to challenge you again below is another try from my side to squeeze a few more bytes out of it. Did not count the bytes but is should have 2 less instructions and hopefully the same amount of cycles (… maybe one more, did not have enough coffee till now).
void Arduboy2Core::paintScreen(uint8_t image[], bool clear)
{
//data only transfer with clear support at 18 cycles per transfer
asm volatile (
" ldi r24,%[len_lsb] \n\t"
" ldi r25,%[len_msb] \n\t"
".l1: ld r20, Z ;2 \n\t"
" out %[spdr], r20 ;1 \n\t"
" cpse %[clear], __zero_reg__ ;1/2 \n\t" //if (clear) *(image++) = 0
" st Z, __zero_reg__ ;2 \n\t"
".l2: \n\t" // else *(image++)
" lpm __tmp_reg__, z+ ;3 \n\t" //(Z++,dummy read for 3 cycle delay)
".l3: \n\t"
" lpm __tmp_reg__, z ;3 \n\t" //(dummy reads for 3 cycle delays)
" lpm __tmp_reg__, z ;3 \n\t" //(dummy reads for 3 cycle delays)
" sbiw r24, 1 ;2 \n\t"
" brne .l1 ;1/2 : 18 \n\t"
:
: [ptr] "z" (image),
[spdr] "I" (_SFR_IO_ADDR(SPDR)),
[len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8),
[len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
[clear] "a" (clear)
: "r20", "r24", "r25"
);
}
Your turn
Another thing that came to my mind while counting cycles and bytes is that maybe some games will use interrupts to some extend which will probably make the performance of the cycle accurate function worse. For instance it could happen that an interrupt occurs right after the out instruction. Then the interrupt handler code will burn the cycles for us.
Then after return from interrupt we additionally burn cycles which will lead to less performance as expected. So for this scenario we might come up with another optimized assembly version that is reading the SPSR just to make sure we transfer bytes whenever the SPI send it out.
What do you think?
Arduboy2 - The now recommended alternative to the Arduboy Library[quote=“veritazz, post:93, topic:1887”]
Your turn
[/quote]
@veritazz, @Mr.Blinky, @filmote,
I’m coninuing the paintScreen() discussion in a dedicated topic. We shouldn’t be cluttering the general Arduboy2 topic with something so specific.