Just about every game will use interrupts. The Arduino millisecond timer is always running unless a sketch specifically disables it.
@Mr.Blinky’s code is open loop. If an interrupt occurs, it will just delay an SPI clock/bit transmission. There’s not really anything you can do to improve overall sketch performance when this happens. It’s best to just not worry about it.
The performance of the cycle accurate function will be the same. But you can say that the delay in the cycle accurate function wouldn’t be necessary in case of an interrupt. However since an interrupt during paintScreen doesn’t happen often (max 2 times: 18 cycles /16MHz x 1024 bytes = 1.152 msec) it’s not really relevant.
HaHa nice! you kicked out 4 bytes!!
IIRC I tried CPSE in an earlier version but I couldn’t get the timings right. Also In your version the loop takes 18 cycles without clear and 19 cycles with clear.
Since this naggs me and it’s my turn again …
here’s my counter move version: 18 cycles and even two more bytes less (tested).
void Arduboy2Core::paintScreen(uint8_t image[], bool clear)
{
//data only transfer with clear support at 18 cycles per transfer
asm volatile (
" ldi r24,%[len_lsb] \n\t"
" ldi r25,%[len_msb] \n\t"
".l1: ld r20, Z ;2 \n\t"
" out %[spdr], r20 ;1 \n\t"
" cpse %[clear], __zero_reg__ ;1/2 \n\t" //if (clear) *(image++) = 0
" mov r20, __zero_reg__ ;1 \n\t"
" st Z+, r20 ;2 \n\t"
" rcall .l2 ;3+4 \n\t" //call 'ret' of this function
" sbiw r24, 1 ;2 \n\t"
" brne .l1 ;1/2 : 18 \n\t"
".l2: \n\t"
:
: [ptr] "z" (image),
[spdr] "I" (_SFR_IO_ADDR(SPDR)),
[len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8),
[len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
[clear] "a" (clear)
: "r20", "r24", "r25"
);
//note: do not add C code and declare with __attribute__ ((noinline)) in header file due to the assembly calling the functions own ret instruction
Make sure you add the no inline attribute in Arduboy2core.h:
Man you are a genius. Awesome, this is so much fun! The trick with the ret instruction is unbelievable and a smart move. I need to think about your move. As a reflex I can only write it down with less lines but same amount of flash and cycles:
void Arduboy2Core::paintScreen(uint8_t image[], bool clear)
{
//data only transfer with clear support at 18 cycles per transfer
asm volatile (
" ldi r24,%[len_lsb] \n\t"
" ldi r25,%[len_msb] \n\t"
".l1: ld r20, Z ;2 \n\t"
" out %[spdr], r20 ;1 \n\t"
" cpse %[clear], __zero_reg__ ;1/2 \n\t" //if (clear) *(image++) = 0
" st Z+, __zero_reg_ ;2 \n\t"
" call .l2 ;4+4 \n\t" //call 'ret' of this function
" sbiw r24, 1 ;2 \n\t"
" brne .l1 ;1/2 : 18 \n\t"
".l2: \n\t"
:
: [ptr] "z" (image),
[spdr] "I" (_SFR_IO_ADDR(SPDR)),
[len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8),
[len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
[clear] "a" (clear)
: "r20", "r24", "r25"
);
//note: do not add C code and declare with __attribute__ ((noinline)) in header file due to the assembly calling the functions own ret instruction
I know it is not a valid move, I need to think about it
Maybe tomorrow morning I need to give up…
If the function is put inline it may save 2 or 4 bytes for the call. but you may also loose a few more depending on how many local variables the incapsulating function uses. When the compiler decides it needs another pointer or use lower registers it will stack more registers or use the stack for local variables.
paintScreen seems mostly used twice though. Once in begin/boot and once in main loop.
BTW anyone know how to call a label outside a function directly with inline assembly? (using extern “C” doesn’t work as avr-gcc is the compiler being used instead of avr-g++)
Ok here you go. Not sure if the move is legal but I could shove off another 2 bytes
Therefore I had to change the meaning of the paramter. It is now called noClear. So if set to 1 nothing changes, if it is 0 the buffer gets cleared. Any other value might create nice effects in the buffer (NEW feature, tada!). And the function now takes 18 cycles with or without clear. I think that was it already in your original version but all the optimization might screwed it up.
void Arduboy2Core::paintScreen(uint8_t image[], uint8_t noClear)
{
//data only transfer with clear support at 18 cycles per transfer
asm volatile (
" ldi r24,%[len_lsb] \n\t"
" ldi r25,%[len_msb] \n\t"
".l1: ld r20, Z ;2 \n\t"
" out %[spdr], r20 ;1 \n\t"
" mul r20, %[noClear] ;2 \n\t"
" st Z+, __tmp_reg__ ;2 \n\t"
" rcall .l2 ;3+4 \n\t" //call 'ret' of this function
" sbiw r24, 1 ;2 \n\t"
" brne .l1 ;1/2 : 18 \n\t"
".l2: \n\t"
:
: [ptr] "z" (image),
[spdr] "I" (_SFR_IO_ADDR(SPDR)),
[len_msb] "M" (WIDTH * (HEIGHT / 8) >> 8),
[len_lsb] "M" (WIDTH * (HEIGHT / 8) & 0xFF),
[noClear] "a" (noClear)
: "r20", "r24", "r25"
);
//note: do not add C code and declare with __attribute__ ((noinline)) in header file due to the assembly calling the functions own ret instruction
We can write this version also for 17 cycles (use AND instead of MUL) but I remember you said 18 is the minimum otherwise we will loose data, right?
Not sure if there is a move left, but who knows.
Btw, this is so enlightening. I cannot express how I enjoy this!
HaHa awesome idea to use MUL:D I wouldn’t have thought of it
Though the reversed Clear state would make it incompatible with existing sketches. ie you’de have to edit them. You won’t save two more bytes though as you will need to add an EOR to restore the zero reg.
for using the AND I’m asuming noClear would have states 0x00/0xFF right? or did you have something else in mind?
Edit:
Sticking with Clear and sacrificing two bytes. We can use:
" neg %[clear] \n\t"
and use
" and r20, %[clear] ;1 \n\t"
Hmm… fix timing…saving bytes,Need to get back on that.
Okay here’s my last optimized paintScreen version thats a 100% replacement (no alternate clear states). Technically the core takes up two more bytes. but since the ret instruction is no longer mandatory and I fixed a inline operand typo it saves two more bytes.
Got rid of the inpopular CALL RET trick making inline possible again
used temp register to clutter one register less
changed operand register for clear from “a” to “r” saving 2 bytes on an unneccesary mov “a”, r24 instruction (I’m still wondering why I didn’t notice this before
Compiles to 24 bytes or 26 bytes inclusive ret and takes 18 cycles per SPI transfer (tested)
Wow, again an awesome move. So much things that I learned just by sharing assembly code with you. I think I will do so more often from now and wait for your comments
We should do these discussions more often. Sad that CALL RET trick has to leave but I can understand that this is nothing for a library.
I will think about the code you shared. I like it. I even see that you managed to make the cycle count the same with and without clear, nice
Looks pretty tight this time. Great job. You are the true hero of inline assembly. I am excited to see more of your magic.
MUL 'r', state //if (!state) int8_t var = 0;((state =0, 1; __zero_reg__ remains 0)
AND 'r', state //if (!state) int8_t var = 0;((state =0, 0xFF)
I m currently optimizing the paintScreen code for the alternate displays. After that the next challenge will probably be drawCompressed(). I had a look at the compiled code of @Pharap’s optimized drawCompressed() and saw that the compiler really struggled with it. With a bit of luck 0.5K can be kicked out of the 0.75K function
Ok, that sounds like a big job but if you really manage to do so then this is a big achievement.
I also have bulky functions in my current project which I think blow up the code unnecessarily.
The day I will post the code of this project, I hope you can have a look at it and give some hints.
I wouldn’t be surprised if drawCompressed can be shrunk down further. Most of my changes were just obvious stylistic changes like getting rid of the global variable in favour of a local one, turning the struct into a proper class and neatening things up a bit, I left most of the original code intact because I wanted to be a bit cautious and not go overboard with changes.
(To be honest a lot of the variable names are still pretty confusing and there’s hardly any documentation of the format.)
Edit:
A quick glance over it makes me think brow is a likely bottleneck.
Like I said before though, the code’s quite messy so frankly I’m not even sure what some of it is doing (brow included) despite understanding the format.
Yes they are. Some documentation would have been nice. But lack of documentation will make you look at the code more closely.
The readBits part was easy to follow and reasonably to assembly optimize though. Kicked out 100+ bytes of that part already. But the rest is more challenging.
How about starting a new topic if you’re going to continue discussing optimisation of drawCompressed() (like this paintScreen() topic was split from the general Arduboy2 library discussion)?
Good idea. I like to see shared code how people think they can optimize or already optimized certain functions. It is enlightening for both the programmer (getting feedback) who is doing the optimization and the people starring at the magic happening (they learn and give feedback).
Beside that maybe a thread about assembler optimization during development would be great. So other people can see if they have the same issue or have a solution for an issue somebody else has.
Separate topics for specific cases may be easier to follow for someone wanting to read through from the beginning at a later date. You can always link from a previous topic to allow easily finding other related topics. (Click on the time/date at the top right of a post that you want to start a new topic from and select: + New Topic)