Improve Flash read time

An idea just popped into my head and I want to share it here.
Usually when reading flash we need to:

  1. select the flash
  2. send read command and address to read from (24bit)
  3. send a byte (usually zero) to read a byte from flash (full duplex)
  4. repeat 3 until done
  5. deselect the flash
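
To make the five steps concrete, here is a rough host-side C++ model rather than real Arduboy code. `FakeFlash`, `transfer()` and `readFlash()` are made-up names for illustration, and the 0x03 read opcode is an assumption based on common SPI NOR flash chips.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an SPI NOR flash chip: while selected it accepts
// a read command (0x03 assumed) plus a 24-bit address, then streams data.
struct FakeFlash {
    std::vector<uint8_t> memory;   // must be non-empty before reading
    bool selected = false;
    int headerBytes = 0;           // command/address bytes received so far
    uint32_t address = 0;

    void select()   { selected = true; headerBytes = 0; address = 0; }
    void deselect() { selected = false; }

    // One full-duplex byte exchange: takes the byte sent by the master and
    // returns the byte the master receives at the same time.
    uint8_t transfer(uint8_t out) {
        if (!selected) return 0xFF;                              // nothing drives the line
        if (headerBytes == 0) { headerBytes = 1; return 0xFF; }  // command byte
        if (headerBytes < 4) {                                   // three address bytes, MSB first
            address = (address << 8) | out;
            ++headerBytes;
            return 0xFF;
        }
        return memory[address++ % memory.size()];                // data phase
    }
};

// Steps 1 to 5 from the list above.
void readFlash(FakeFlash& flash, uint32_t addr, uint8_t* dst, size_t len) {
    flash.select();                          // 1. select the flash
    flash.transfer(0x03);                    // 2. read command ...
    flash.transfer(uint8_t(addr >> 16));     //    ... and the 24-bit address
    flash.transfer(uint8_t(addr >> 8));
    flash.transfer(uint8_t(addr));
    for (size_t i = 0; i < len; ++i)
        dst[i] = flash.transfer(0x00);       // 3./4. send a dummy zero, keep the reply
    flash.deselect();                        // 5. deselect the flash
}
```

Note that every received byte costs one sent byte, which is what the rest of the idea builds on.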

Now when we write to the display we do:

  1. select the display
  2. write byte of display data or command
  3. repeat 2 until done
  4. deselect display

The idea is now to combine both, so we will read data from the flash while updating the display:

  1. select the flash
  2. send read command and address to read from (24bit)
  3. select the display
  4. write byte of display data or command, simultaneously receive a byte from flash
  5. repeat 4 until display is updated
  6. deselect display
  7. deselect flash
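
The combined sequence can be sketched the same way on a PC. This is only a simulation under assumptions: `SharedBus`, `startFlashRead()` and `displayWithPrefetch()` are hypothetical names, and the flash is assumed to ignore whatever appears on its input while it is streaming read data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kFrameBytes = 1024;   // 128 x 64 pixels, 1 bit each

// Hypothetical model of the shared SPI bus: every byte clocked out is seen by
// whichever chips are selected. The display only listens; the flash replies.
struct SharedBus {
    std::vector<uint8_t> flashData;    // contents of the flash chip (non-empty)
    std::vector<uint8_t> displayRam;   // what the display has received
    uint32_t flashPos = 0;
    bool flashStreaming = false;       // read command and address already sent
    bool displaySelected = false;

    void startFlashRead(uint32_t addr) {   // steps 1 and 2
        flashPos = addr;
        flashStreaming = true;
    }
    uint8_t transfer(uint8_t out) {
        if (displaySelected) displayRam.push_back(out);
        if (!flashStreaming) return 0xFF;
        return flashData[flashPos++ % flashData.size()];
    }
};

// Steps 3 to 7: push one frame to the display and keep every byte that the
// flash shifts back in during the same transfers.
void displayWithPrefetch(SharedBus& bus, const uint8_t* frame, uint8_t* prefetched) {
    bus.displaySelected = true;                    // 3. select the display
    for (size_t i = 0; i < kFrameBytes; ++i)
        prefetched[i] = bus.transfer(frame[i]);    // 4./5. write one byte, read one byte
    bus.displaySelected = false;                   // 6. deselect display
    bus.flashStreaming = false;                    // 7. deselect flash
}
```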

With this we could read 1 KiB of data almost for free while updating the display. There is only a little overhead at the beginning of the display update. The data could be stored anywhere, maybe even in the frame buffer.

So some games could load a game screen each time they update the display (instead of zeroing it). Then when the update is done the frame buffer is already filled with the game screen and it does not need to be drawn.
I wanted to use it to prefetch data from the flash either for rendering or for music playback.

Sounds reasonable @Mr.Blinky?


Nice, I didn’t think of that. (I did think about reading flash data directly to the display, but that has very limited use.)

That would require a separate clear before writing new display data. But copying data from the buffer to the target location and clearing the buffer could be faster than a separate load-from-flash sequence.

That sounds like one of the most useful things (presetting the screen buffer with a default screen like HUDs, borders, frames, etc.) when there is little spare CPU time left, and it is easy to implement.

Kind of like an embedded flash memcpy in the display routine? Hmmm, I like this idea!

FX::displayPrefetch(uint24_t addr, uint8_t* target, uint16_t length, uint8_t displayMode);

You can simultaneously read and write? On the SPI bus? Is there a byte-long buffer or something that allows this?

You have to simultaneously read and write. There is only one clock line and the controller doesn’t know whether you’re reading or writing. It’s based on the commands that you send for the particular device you’re talking to.

If you don’t expect valid read data when writing, you can ignore or discard it. If you’re reading, you usually just set up to send zeros.

The thing that would make it work in this case, for reading the flash while writing to the display, is the fact that the MISO line for the display isn’t connected (the display doesn’t have one), so you don’t end up with two MISO lines active at the same time.
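
One way to picture why reading and writing always happen together: a simplified model treats the master and the selected device as two 8-bit shift registers joined in a ring. The sketch below assumes MSB-first transfers, and `spiExchange()` is just an illustrative name, not part of any library.

```cpp
#include <cstdint>
#include <utility>

// One SPI byte transfer modelled as two 8-bit shift registers joined in a
// ring: on each of the 8 clock pulses the master shifts its top bit out on
// MOSI while the slave shifts its top bit out on MISO.
// Returns {byte now held by the master, byte now held by the slave}.
std::pair<uint8_t, uint8_t> spiExchange(uint8_t master, uint8_t slave) {
    for (int clock = 0; clock < 8; ++clock) {
        uint8_t mosi = (master >> 7) & 1;        // bit leaving the master
        uint8_t miso = (slave >> 7) & 1;         // bit leaving the slave
        master = uint8_t((master << 1) | miso);  // master takes the slave's bit
        slave  = uint8_t((slave << 1) | mosi);   // slave takes the master's bit
    }
    return {master, slave};
}
```

After eight clocks the two bytes have swapped places, so sending a zero to read a byte and discarding the reply while writing are the same operation seen from two sides.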


In the display routine we do a lot of useful stuff to fill the TX wait time until we can send the next byte. Do you think there is room to squeeze in the copy of SPDR into “target”?

So “target” could be any buffer. Via “displayMode” we can control whether the data from the flash goes into “target” or the frame buffer (I mean the buffer in memory where the display data is stored)?
If “target” points somewhere in memory, the amount of data written to it needs to be limited, I guess (as the display write is always 1 KiB)?

Oh yes. Forgot to think about this. You are totally right! :sweat_smile: Thanks for checking this.

I wouldn’t want anything that slowed down the display routine. It would be better to have a separate function that could be called instead of display() for your needs. Then, you could take as long as you want to do whatever you want.

If your routine is smart enough, for reads less than 1K you could drop CS for the flash after you’ve read as much as you want. For reads more than 1K you could drop CS for the display after 1K has been read from the flash.


There are still a few wait cycles in the display code, so there’s some room. I can’t tell if it all fits in 18 cycles per byte until I’ve played with some code.

target could be sBuffer, so that’s really not necessary. But it should tell whether the display buffer needs to be cleared on the fly or not.

Yes, when the length is < 1K the excess read data is just ignored. There’s no length check needed for length > 1K, as the max read is 1K.

As FX::displayPrefetch() hints, it would be part of the FX library. There will probably also be an FX::display() function that handles the chip selects before and after calling Arduboy2::display().

It doesn’t really matter at what point you drop CS: the moment you do, the flash command is terminated and you cannot continue reading data from the current flash memory pointer.

That wouldn’t be very useful in our case. We want to ensure the whole 1K display is updated in one go every frame.


Great. Thanks for all the good input @Mr.Blinky, @MLXXXp. I’ll try to get something done with the music stuff so that, once something is ready, I can give it a try.
Do we have an FX game that could benefit from this?

I’m not sure I understand what you’re saying. If you want to read more than 1K from flash while writing to the display you:

  1. Set CS for both the display and flash.
  2. Write to the display and read from flash for 1K bytes.
  3. Drop CS for the display.
  4. Continue reading from the flash for as many extra bytes over 1K that you want.

This assumes you have a big enough buffer to hold the read data. If you’re reading to the screen buffer you would have to somehow allocate the screen buffer to be as large as the amount you want to read.
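
A host-side sketch of those four steps, purely as a simulation: `Sim` and `displayAndReadMore()` are hypothetical names, and the flash is assumed to have already received its read command and address.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t kFrameBytes = 1024;

// Hypothetical stand-ins: `flashStream` is the data the flash returns after
// the read command, `displayRam` collects what the display receives.
struct Sim {
    std::vector<uint8_t> flashStream;
    std::vector<uint8_t> displayRam;
    size_t pos = 0;
    bool displaySelected = false;
    uint8_t transfer(uint8_t out) {
        if (displaySelected) displayRam.push_back(out);
        return pos < flashStream.size() ? flashStream[pos++] : 0xFF;
    }
};

// Fills `dst` (sized by the caller) from the flash while the first 1K of
// transfers also carries the frame to the display.
void displayAndReadMore(Sim& sim, const uint8_t* frame, std::vector<uint8_t>& dst) {
    const size_t total = dst.size() > kFrameBytes ? dst.size() : kFrameBytes;
    sim.displaySelected = true;                     // 1. CS active for display and flash
    for (size_t i = 0; i < total; ++i) {
        if (i == kFrameBytes)
            sim.displaySelected = false;            // 3. drop CS for the display
        uint8_t out = i < kFrameBytes ? frame[i] : 0x00;
        uint8_t in = sim.transfer(out);             // 2./4. each transfer reads one flash byte
        if (i < dst.size()) dst[i] = in;            // shorter reads just ignore the rest
    }
    sim.displaySelected = false;
}
```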

Ah, now I understand. You want the display function to continue reading the remaining flash data into memory. :+1:


Looks like I can get all working in 18 cycles/byte. Will be working on this tonight :slight_smile:
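
For a sense of scale, the 18 cycles/byte figure can be turned into a time per display update. The small calculation below assumes the Arduboy's 16 MHz CPU clock; the constant names are only for illustration.

```cpp
#include <cstdint>

constexpr uint32_t kCpuHz         = 16000000;      // assumed: 16 MHz CPU clock
constexpr uint32_t kCyclesPerByte = 18;            // figure quoted in the thread
constexpr uint32_t kFrameBytes    = 128 * 64 / 8;  // 1024 bytes per screen

constexpr uint32_t kCyclesPerFrame = kFrameBytes * kCyclesPerByte;
constexpr uint32_t kMicrosPerFrame = kCyclesPerFrame / (kCpuHz / 1000000);

static_assert(kCyclesPerFrame == 18432, "1024 bytes at 18 cycles each");
static_assert(kMicrosPerFrame == 1152, "about 1.15 ms per display update");
```

So under that assumption the whole update, including the 1K read from flash, takes roughly 1.15 ms.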

Is that a rhetorical question?

Don’t know of any. But I’m realizing that a preset display buffer doesn’t have to be a static image. I can think of a vertical shmup with a super detailed background at no extra cost (MCU cycles) :wink:


That’s a real curiosity for me on this, could you even do level streaming fast enough to actually require this? i.e. you can already load a lot of data between frames.

We were already able to do full motion video at the max frame rate weren’t we?

Oh, you are outperforming me. I will do my best to have something ready so I can try this with background music as well. Once you have something updated, just give me a ping and I will pull the changes into my project.

Not really. My project has a really lame HUD at the moment. There would be a little benefit as well but I had some music prefetch in mind with this. I am still stuck with the converter from .mod to ATMlib2. Just realized that for putting it in the flash I would need a converter from the #defines to binary.

Just tested my 18-cycle version including @MLXXXp’s idea, but ran into an unforeseen problem:

I can only read the received SPI data on the 18th cycle, so I disable OLED CS after I write to the SPI data register (I’m initiating an extra write). This works fine, as the display discards a pending transfer when CS becomes inactive. However, when an interrupt occurs, the extra byte gets transferred and the display is shifted :confused:

let’s have another go at this…


Okay, had to drop the >1K support to maintain 18 cycles/byte.

Use CLEAR_BUFFER as the clear flag to clear the display buffer.
The screen buffer is cleared before the prefetched data is stored.

Here’s the function:

void Cart::displayPrefetch(uint24_t address, uint8_t* target, uint16_t len, bool clear)
{
  // Note: the seek to `address` and the final deselect mentioned later in the thread are not shown in this listing.
  asm volatile
  ( "dbg:\n"
    "   ldi     r30, lo8(%[sbuf])               \n" // uint8_t* ptr = Arduboy2::sBuffer;
    "   ldi     r31, hi8(%[sbuf])               \n"
    "   ldi     r25, hi8(%[end])                \n"
    "   in      r0, %[spsr]                     \n" // wait(); // for 1st target data received (can't enable OLED mid transfer)
    "   sbrs    r0, %[spif]                     \n"
    "   rjmp    .-6                             \n"
    "   cbi     %[csport], %[csbit]             \n" // enableOLED();
    "1:                                         \n" // while (true) {
    "   ld      r0, Z                   ;2  \   \n" // uint8_t displaydata = *ptr;
    "   in      r24, %[spdr]            ;1  /3  \n" // uint8_t targetdata = SPDR;
    "   out     %[spdr], r0             ;1  \   \n" // SPDR = displaydata;
    "   cpse    %[clear], r1            ;1-2    \n" // if (clear) displaydata = 0;
    "   mov     r0, r1                  ;1      \n"
    "   st      Z+, r0                  ;2      \n" // *ptr++ = displaydata;
    "   subi    %A[len], 1              ;1      \n" // if (--len >= 0) *target++ = targetdata;
    "   sbci    %B[len], 0              ;1      \n"
    "   brmi    3f                      ;1-2    \n" // branch ahead and back to 2 to burn 4 cycles
    "   nop                             ;1      \n"
    "   st      %a[target]+, r24        ;2  /11 \n"
    "2:                                         \n"
    "   cpi     r30, lo8(%[end])        ;1  \   \n" // if (ptr >= Arduboy2::sBuffer + WIDTH * HEIGHT / 8) break;
    "   cpc     r31, r25                ;1      \n"
    "   brcs    1b                      ;1-2/4  \n" // }
    "3:                                         \n"
    "   brmi    2b                      ;1-2    \n" // branch only when coming from above brmi
    : [target]   "+e" (target),
      [len]      "+d" (len)
    : [sbuf]     ""   (Arduboy2::sBuffer),
      [end]      ""   (Arduboy2::sBuffer + WIDTH * HEIGHT / 8),
      [clear]    "r" (clear),
      [spsr]     "I" (_SFR_IO_ADDR(SPSR)),
      [spif]     "I" (SPIF),
      [spdr]     "I" (_SFR_IO_ADDR(SPDR)),
      [csport]   "I" (_SFR_IO_ADDR(CS_PORT)),
      [csbit]    "I" (CS_BIT)
    : "r24", "r25", "r30", "r31"
  );
}
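
For anyone who finds the assembly hard to follow, here is a plain C++ model of the loop, pieced together from the comments in the listing. It is a sketch for running on a PC, not a replacement for the real routine: `FakeSpi` is a hypothetical stand-in for SPDR, the timing is not modelled at all, and `len` is taken as a signed value here to mirror the `--len >= 0` test.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kFrameBytes = 128 * 64 / 8;

// Hypothetical stand-in for the SPI data register: reading yields the byte
// already received from the flash, writing sends a byte to the display and
// clocks in the next flash byte.
struct FakeSpi {
    const uint8_t* flash = nullptr;   // bytes the flash will return, in order
    uint8_t* display = nullptr;       // bytes the display has received
    size_t n = 0;
    uint8_t read() const { return flash[n]; }
    void write(uint8_t b) { display[n++] = b; }
};

void displayPrefetchModel(FakeSpi& spi, uint8_t* sBuffer, uint8_t* target,
                          int16_t len, bool clear) {
    for (size_t i = 0; i < kFrameBytes; ++i) {
        uint8_t displaydata = sBuffer[i];
        uint8_t targetdata = spi.read();         // byte clocked in by the previous write
        spi.write(displaydata);                  // send this byte to the display
        if (clear) displaydata = 0;
        sBuffer[i] = displaydata;                // write back (zero when clearing)
        if (--len >= 0) *target++ = targetdata;  // keep only the first `len` bytes
    }
}
```

Because the write-back to the screen buffer happens before the store to `target`, passing the screen buffer itself as `target` leaves it holding the flash data once the update is done.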


This is awesome. I will give it a try tonight. Another silly idea about how to clear the display:

What if we just assign 1 KiB of flash data as an empty screen (all zeroes)? With the above routine the display data would be cleared and we would not need the clear flag.
This might free some cycles (that we have to burn for now) for other fancy stuff.

So you continue to read after the display is updated? Would it work if you read the data beyond 1 KiB before, and the last 1 KiB during, the display update?

Then you’d lose the option to prefetch other data. Each loop takes 18 cycles regardless of whether the clear flag is set or not.

No, I removed that option due to the problem I mentioned earlier, and I haven’t found an acceptable solution (for now).

Only if we remove the seek bit from the function.

With seek outside displayPrefetch (and disable removed), you could just call the function to show an animation of 1K frames.
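
A rough PC-side sketch of how that could play out, assuming the flash stays selected after a single seek and simply keeps streaming, and that each animation frame is 1 KiB. `FlashStream` and `showAndLoadNext()` are invented names for illustration.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kFrameBytes = 128 * 64 / 8;   // one full screen, 1 KiB

// Hypothetical model of a flash chip that stays selected after one seek:
// every byte transferred simply returns the next byte of the stream.
struct FlashStream {
    const uint8_t* data = nullptr;
    size_t pos = 0;
    uint8_t next() { return data[pos++]; }
};

// One display update: send the current frame buffer to the display and, in
// the same pass, refill the buffer with the next 1 KiB from the flash.
void showAndLoadNext(FlashStream& flash, uint8_t* sBuffer, uint8_t* displayRam) {
    for (size_t i = 0; i < kFrameBytes; ++i) {
        displayRam[i] = sBuffer[i];
        sBuffer[i] = flash.next();
    }
}
```

In this model each call shows the frame loaded by the previous call, so the animation runs one update behind the flash read position.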


Sorry for asking. In which repo can I find this?

It’s not in a repo yet. I’m going over some other things, but you can copy the above code in the meantime.
