Bootloader and File Format [Wiki]

My gut instinct is to be worried about that because compiler extensions cause various issues (mainly related to portability, backwards compatibility and compiler bugs), but I think creating an 24-bit type manually would be quite a bit of effort so it’s probably the easiest option for now.

It’s late here so I’ll spend more time thinking about it tomorrow.

My immediate thoughts are are mainly:

  • init should be initialise
    • (And possibly include initialize as a synonym to appease Oxfordians)
  • readWord should be something like read16 or readUint16.
    • The size of a processor word varies between CPU models, but a uint16_t is universally and unambiguously an unsigned 16 bit integer.
  • read should perhaps be either read8, readUint8 or readByte to avoid any ambiguity

There’s probably a contradiction of their etymologies,
but not a contradiction of their definition within the scope of the language.
(Technically the variable can still vary in value, just not at runtime. :P)

Perhaps, but I can’t think of a single good reason to prefer a macro over constexpr for constants,
and I trust the words of the standard C++ foundation about what’s good and what’s bad.

Currently it’s still in the source.

I’m not sure whether it’s worthwhile or not, I’d have to think about it,
but I know for definite that a type alias would be better than a macro.

Theoretically it’s possible that uint24_t could be added to stdint.h as part of the standard,
in which case #ifndef uint24_t wouldn’t actually do anything because uint24_t wouldn’t exist during the preprocessor stage (because it would be a type, not a macro).
Worse still, if other code was using uint24_t and it was expecting the type to behave differently to __uint24 at some point then the code could still compile but introduce some silent bugs.

With a type alias, at least you’d get an immediate error because the type uint24_t would already exist and thus cause a name clash.
(Fail fast, fail early, fail at compile time rather than runtime, blah blah blah.)

Good to know.

Mainly I’m thinking of adding template functions so that instead of:

uint8_t someBuffer[16];
Cart::readBytes(someBuffer, sizeof(someBuffer));

You can just write:

uint8_t someBuffer[16];

(And the same for readDataBlock and readSaveBlock.)

I’d also like to add a template version of writeSavePage that checks that the array you feed to it is large enough to prevent buffer overruns.

// Large enough, runs fine
uint8_t bufferA[256];
Cart::writeSavePage(page, bufferA);

// Buffer not large enough, get a 'static assertion failed' error
uint8_t bufferB[128];
Cart::writeSavePage(page, bufferB);

Also maybe some templates for reading classes/structures like EEPROM’s get.

// Read an array of 16 points
// and the compiler takes care of all the hassle for you
Point points[16];
readData(points, pageAddress);

(Now I see that, I’m thinking it makes more sense to have the page address first.)

And also making sure to mark the buffer on writeSavePage as const,
I.e. static void writeSavePage(uint16_t page, const uint8_t* buffer);
(const effectively means “I promise not to modify this”.)

Only if it has static storage duration (i.e. it’s a global).

There’s not much that can be done about that,
the compiler handles that sort of thing.

I’m not sure what this is supposed to be demonstrating.
(Also, do people actually use memset?)

Wrap what exactly?

See my suggestion about templates earlier in the comments.
readData(state, pageAddress); is certainly doable.

However, it would technically only work for certain objects.
I’m not sure what the full set of requirements would be,
but I’m pretty sure the class would have to be TriviallyCopyable or PODType.

It might be possible.

I don’t know enough about the flash chip and SPI to make any guesses as to what the performance increase would be like (or if there would even be a performance increase).

One problem is that there’s only one flash chip and potentially multiple streams.
Sharing resources can be very hard without threads, locking, mutexes, semaphores et cetera.

Either way it definitely needs to be more OOP:

FXStream stream { pageAddress };
uint8_t value = stream.read8Unsafe();

Thats exactly why I omitted the full word :stuck_out_tongue: Nay I don’t like the long word :slight_smile: I’m just thinking maybe call it boot or begin? as the Arduboy(2) library uses those?

I know it can be confusing to use Word. But when a word is 32-bit it is usually called a long(word).

Yeah when going for read16 we would also need to go for read8. Which kind of reads odd. The number 16 is more magical :slight_smile: taking the Arduboy(2) library in mind maybe we should go for Short? as there’s a delayShort() function in there.

HaHa so true. Unless the code is running on a very instable system :stuck_out_tongue:

Ah it is? wasn’t sure about that.

Yes that would be better. I’m just used at pulling defines out of my hat.

You’re examples make me understand templates better everytime. I haven’t added the ealier templates back after the class change.

Your other ideas sound good.

Yes but a few bytes of data may not be worthwhile. It’s mainly more usefull to store larger data there. Progmem can also be freed by not using certain exlusive functions. Like you could ommit using print functions by drawing text as bitmaps.

That would be nice if we could do that. But that would require a rewrite of avr-libc.

Yeah I know. each SPI read/write cycle requires 18 cycles. I’ll be writing optimized versions for the functions that handle multiple bytes. The one that will bennefit the most of this will be the drawBitmap (or drawSprite)

It’s possible to read a sequence of bytes like a stream using Seek once and then multiple readByte, ReadWord, ReadBytes until the flash chip is deslected.
You could also use something as myvar = SPDR; which would be the fastest. But you’re at the compilers mercy ensuring 17 cycles have passed (unless adding additional SPSR tests loosing some performance again)

Your thoughts are appreciated.

1 Like

Ok I see.

I wanted to show that many games have code that is initializing a whole list of variables to some values before running the game logic. Usually this ends up with quite some code and data from PROGMEM. If we could show that using a kind of state object (either a whole class instance or a struct) then the user could load that data in one shot from external flash. As you wrote you might have some constrains and need to use TriviallyCopyable or PODType classes.

listOfPeopleUsingMemset += veritazz

Maybe “wrap” was the wrong word. I meant to find a good interface in the library that makes it easy for the user to see it is good practice to use the provided library interface to init its game states or not set global variables right away but use the library functions. I think it needs to be kind of self explanatory. Maybe we need to provide application note alongside.

Yes for sure. Just thought we might also rework parts of the Arduboy2 library to reduce PROGMEM usage here and there. It can also serve as application example on how to efficiently use the external memory. I guess the library is the first source of examples for beginners.

Yeah something like that. There will be no mercy I guess but for full performance gain something like that might be required. I agree that for many things the standard functions are sufficient (e.g. huge animations, start/help/… screens, savegames …) but if you have data that is repeatedly read throughout the game logic (e.g. lookup tables) then some nice way to read a byte without actually waiting for it and instead do some other calculations would be great. Maybe I am just thinking about it because my game would benefit here.

Either of those would be better than init.

It depends on the platform.

On some platforms a ‘word’ is 32 bits and a 16 bit value is called a ‘halfword’.

Part of the reason for confusion is that the Windows API defines WORD as 16 bits, DWORD as 32 bits and QWORD as 64 bits because when it was originally written x86 was still primarily 16-bit and the 32-bit changeover was just beginning.
From then on that terminology was kept as backwards compatibility.
There’s a similar story with Intel’s x86/64 assembly language.

readUint8 and readUint16 don’t look that odd.
They probably make more sense than read1 or read2,
and they allow the creation of readInt8 and readInt16.

As an example from a ‘real world’ API,
C#'s BinaryReader class uses ReadByte and ReadUInt16.

Another alternative is readU8 and readU16 (and readS8 and readS16).

Specifying the actual size might not be the ‘prettiest’ way,
but it’s completely unambiguous, which is more important.

The problem with that is that short is another type that varies between platform.
short is allowed to be 32 bits or 64 bits.
Legally char, short, int and long could all be 64 bytes and thus sizeof(long) could legally be 1.
That’s precisely why the fixed width types (uint8_t et cetera) were introduced in the first place.

I’m pretty sure it’s called delayShort because it can only handle short delays (65.535 seconds max).

It depends what happens to the constexpr variable.

A lot of the time if it’s possible to encode the value as an immediate in an assembly instruction, the compiler will choose to do that, so despite being a ‘variable’ the actual object might not exist in memory.

A bit like how a local variable may never be stored on the stack if the compiler decided to keep it in a register for its entire lifetime.

If the variable did end up in memory then yes, it could change value,
but the same could happen with a macro, or possibly even the actual machine code.

Unless I’m looking at the wrong part:

I’m assuming that’s the ‘canonical’ version of cart.h?

If you want I could sit down and explain them properly.

The basics of templates aren’t really that hard to understand.
The only hard parts are how the compiler can infer the template parameters from the arguments and SFINAE.
The only scary part about templates is some of the complex wizardy that they can be used for.
(*hides shrine to Hermaeus Mora*)

The simplest version is that they’re basically blueprints for classes and functions,
and the compiler uses those blueprints to generate classes and functions.
Sometimes it can infer the template parameters,
sometimes they must be explicitly specified.

It depends on the usage and how the compiler optimises it.
Overall I don’t think it will be much of an issue.

With most games it’s the graphics and level data that eat the PROGMEM.

What do you mean by ‘whole class instance’?

class and struct are effectively the same thing.

It’s probably cheaper (or at least no more expensive) to just use a for loop.
Or (in a constructor) you can just use “in-class member initialisers” as demonstrated here.

I think it’s a bit early to conclude that.
We need to actually trial different approaches and see how they behave.
There’s a lot of rules about how things are initialised,
so what you expect to happen might not be the case.

We’ll definitely need examples, documentation and possibly a tutorial.

Beginners don’t tend to read the library source code though, so offloading (for example) the Arduboy logo onto the flash chip wouldn’t really be of an benefit to beginners.

I’m pretty sure most beginners look at existing games rather than the library,
so really we’d need to write some simple example games or just make sure the games we’re developing set a good example.

What you’re essentially asking for is a set of asynchrous read functions.
(To be technical.)

1 Like

To get back on that. You could put all your global (initialized) variables in a structure and then initialize it in setup() function:

struct GameState
 //all your globals here

GameState gameState;

Cart::readBytes(*gameState, size_of(gameState) , gameStateFlashAddress);

It would be prone to bugs though. as you’d have to carefully keep track of your gameState structure and it’s data in the data(script)file.

As on saving PROGMEM It would be better to use the Arduboy2Base class instead of using Arduboy2 class. When creating an Arduboy2 object the virtual functions bootLogoExtra and write will always be included and may take up some ~2K

I’d probably go for begin then. It’s more inline with Arduino and easier for beginners

That’s why I try to ommit numbers in function names.

Yes. and It made me just realise it’s actually not a fully correct naming as a short represents a signed 16-bit number.

I’ll keep your suggestions in mind.

Sorry it was a rhetorical question. I’ll change it an alias in the next version. like this right?

using uint24_t = __uint24;

Thanks. I’ll let you know when I’m ready for that (Just trying to keep my focus on coding the library functions)


After the changes, hopefully something more like:

Cart::readObject(gameStateFlashAddress, gameState);

The reason for putting the address first is because that’s what EEPROM does, so people will be more familiar with it.

Now you mention it, I don’t think I’ve ever seen anyone use bootLogoExtra.
Maybe we should petition to have it removed from the library?

I must admit, I’ve often wished the Arduboy library didn’t pick begin because begin is expected to be used for containers with iterators.

But begin would be easier for beginners so that will probably suffice.

Sometimes it’s the best option though.
Being informative is more important than being ‘pretty’.

Something I forgot to mention in my last comment:
The documentation will have to specify whether the data is read in little endian or big endian format (assuming there won’t be different functions for reading little endian and big endian).

Like I say, I’m pretty sure the choice of the name delayShort is nothing to do with the short type,
it’s just supposed to reflect that the maximum delay length is relatively short.

That’s right. (using is so much clearer than typedef.)

Bear in mind that at the moment uint24_t isn’t actually used elsewhere in the code.

I was thinking about that too. Will swap.

Basically It’s the part that is responsible for printing the username under the Arduboy logo and is used by begin and bootLogoShell derivatives

The flashimage ‘file’ system is big endian, the data files (content) will be be little endian. so the readUInt16 function will read a uint16_t in little endian format.

Now that we’re at it. @Pharap do you know any smart C++ trick to treat a byte as just the MSB and LSB in uint16_t format? myint16_t = ((uint16_t) msb << 8) | lsb; doesn’t really compile well.

I’m aware. I’ve probably read the entire source of the library at least 2-3 times by now.

The point of the function being virtual is so people can write classes that inherit from Arduboy2 that then override that function so they can do their own logo drawing.

But as I say, I don’t think I’ve ever seen anyone use it.
If anyone makes a splash screen or logo they tend to just write their own from scratch.

That’s slightly confusing.
I’m assuming the file system has to be big endian for a reason?

I know of three ways it could be done.

One way is to use a union,
which typically does work but is technically undefined behaviour so demons could fly out of your nose

One way is to use std::memcpy, which is less typing but probably more costly and a bit overkill.
One of the advantages of std::memcpy is that it can circumvent ‘strict aliasing’ rules.
(Not that ‘strict aliasing’ applies here.)

And the final way and which is the way I’d recommend doing it is via ‘type aliasing’.
Basically you take the address of the uint16_t, reinterpret it as an unsigned char * and modify the bytes that way.
Normally reinterpreting a type like that is a bad idea, but char, unsigned char and std::byte are special cases for which the modification is well defined.

uint16_t makeValue(uint8_t mostSignificantByte, uint8_t leastSignificantByte)
	uint16_t result;
	unsigned char * alias = reinterpret_cast<unsigned char *>(&result);
	alias[0] = leastSignificantByte;
	alias[1] = mostSignificantByte;
	return result;

It’s ugly, but it’s supposed to be.

It was easier for debugging and also code wise at some stage. Later on once the project got out and others started playing with it too. I didn’t want to change the indianness.

I’m doubting now if I should stick with little endian. Using big endian have a few advantages.

  • Assembly level code could be a bit more optimized.
  • It would be easier to reconize/read multi byte numbers when editing a datafile in an hex editor in early stages of development (ie there is no data packer tool yet)

I don’t think the programmer has to worry about endianness because he/she will use library functions and the data packer tool will create a datafile from a text based script

Thanks! I tried something similar like the type aliasing in the style of (uint8_t*)&Value[0] but it turns out the compiler decides to put value on the stack wasting even more code.

How so?
I would have thought using the processor’s native endianness would have been easier.

I think the focus should be on what’s more efficient for the Arduboy.

We can always write tools to compensate for any complexity in the data format,
but if the format isn’t efficient for the Arduboy then we’ll be stuck with that inefficiency for a long time.

Doesn’t that kind of contradict the argument about the hex editor?

According to the operator precedence rules,
that should go subscript, address-of, C-style cast,
which probably isn’t what you’d want unless Value is an array of non-uint8_ts.

Also you should avoid C-style casting because it makes it easy to accidentally cast away const.
Aside from which the rules for what it does are too complicated to remember and it has some unspecified behaviour.

There’s a good reason for that.
You can’t take the address of a register.

A fourth option would be to use inline assembly, but your results may vary depending on where and how the code is being used.

A small note.
Technically my earlier code should have actually been something like:

uint16_t makeValue(uint8_t mostSignificantByte, uint8_t leastSignificantByte)
	uint16_t result;
	void * address = static_cast<void *>(&result);
	unsigned char * alias = static_cast<unsigned char *>(address);
	alias[0] = leastSignificantByte;
	alias[1] = mostSignificantByte;
	return result;

But reinterpret_cast does the same thing on all major compilers as far as I’m aware.

AVR doesn’t have 16 bit load/store functions so it doesn’t matter.

efficiency for Arduboy rules.

I was talking about the time there are no tools yet.

Thanks for that. I’ve tried your example and it result in practically the same code

using the stack:
;readUInt16() little endian
1cae:	0e 94 0f 05 	call	0xa1e	;r24 = read()
1cb2:	89 83       	std	Y+1, r24	;alias[0] = read()
1cb4:	0e 94 0f 05 	call	0xa1e	;r24 = read()
1cb8:	8a 83       	std	Y+2, r24	;alias[0] = read()
1cba:	89 81       	ldd	r24, Y+1	;result lsb = r24
1cbc:	9a 81       	ldd	r25, Y+2	;result 

If it was big endian only the indices at std instructions would change.
The code looks okay but there’s also some code involved to prepare the stack:

;preparing the stack:
1692:	cf 93       	push	r28     ;Save Y register pair
1694:	df 93       	push	r29
1696:	00 d0       	rcall	.+0     ;neat trick to create two bytes space on stack
1698:	cd b7       	in	r28, 0x3d   ;make Y point to created space on stack
169a:	de b7       	in	r29, 0x3e

The ‘standard’ code is still more efficient:

uint16_t result;
result = read();
return ((uint16_t)read() << 8) | result;
compiles to:
;readUInt16() little endian
1c9c:	0e 94 0f 05 	call	0xa1e    ;lsb = read()
1ca0:	88 2e       	mov	r8, r24      ;move lsb to temp register
1ca2:	0e 94 0f 05 	call	0xa1e	 ;msb = read()
1ca6:	28 2d       	mov	r18, r8      ;expand lsb to 16-bit
1ca8:	30 e0       	ldi	r19, 0x00	
1caa:	a9 01       	movw	r20, r18 ;move to another temp register
1cac:	58 2b       	or	r21, r24     ;or msb << 8
1cae:	ca 01       	movw	r24, r20 ;move to result register 

Note: If the above code was actually big endian it would use two instructions/4 bytes more so in this case it turns out big endian is less efficient.

However when using assembly it can be better optimized using big endian.
The reason for this is that the read function uses a single lower register (r24) which we can also assign to our result register. With little endian the result value needs to be saved into a temporary register

Here's big endian assembly:
uint16_t result asm("r24");//assign register pair r25:r24 to result

Now the assembly code:

call read     ;r24 = read()
mov r25, r24  ;move to MSB result
call read     ;r24 = read() directly into LSB result

AVR is lacking a lot of things.
The more I learn, the more I realise why they’re so (relatively) cheap.

(Honestly though, the lack of a barrel shifter is still the biggest kick in the teeth.)

Frankly if everyone’s going to be writing their data with hex editors it’s going to be fiddly regardless of the endianness.

I would have expected that to be the case (unless it was inlined).

Should probably be something like:

return ((static_cast<uint16_t>(read()) << 8) | (static_cast<uint16_t>(read()) << 0));

Or at least:

const uint16_t result = read();
return ((static_cast<uint16_t>(read()) << 8) | (result << 0));

At any rate, I would have hoped that would be the most well-optimised way because it’s the way I’d expect people to write it.
The ‘type alias’ approach is very uncommon (and that’s probably a good thing).

Just to check, when you say ‘efficient’ do you mean it’s faster, smaller, or both?

Ok, I see what you mean.
It’s a shame the result can’t be read into r25.

Out of interest, is read too big to be inlined in this case?
It makes sense to not inline it for user code,
but for the readUInt16 function it might make sense to have some or all of read inline.

Yep that works.

It’s not too big. But there wouldn’t be much speed gain (if at all) so I optimize these basic functions more for size.

There’s there could be some speed gain currently when inlining. But at cost of increasing code size.

Basically the read function writes a dummy byte to SPI and then after 18 cycles the received byte can be read out. There will by cycles lost because of polling the SPI status register, return cycles and next read call cycles. the call return cycles could be saved and some of the wait cycles could be used more useful.

But I have another idea wasting cycles without inlining. The seek function will write an inital dummy byte to start shifting in the 1st byte of the selected address into the SPI data register.

The new read function will wait until there is data ready, reads it in and writes another dummy byte to start the next data being shifted in and returns. This way the cycles that where otherwise waisted will now execute program code.

1 Like

Depending on what difference is like it might make sense to have a readUInt16Fast,
but that’s not really a priority at the moment.

Is there somewhere where this is all documented?

I might be able to think of something useful if I understood the process and timings better.

That seems like a good idea.

Simple understanding of SPI:


In a Nutshell

SPI is a full duplex synchronous serial communication protocol.
A master device supplies a clock pulse that clocks out (shift) a data bit and
clocks in a data bit at the same time (although on different edges of the clock pulse)
There’s a chip select (slave select) for the slave device tosychronize with.

Arduboy is set to fastest speed the ATMEGA32U4 can handle and requires two clock cycles (@16MHz) per shifted bit and another 2 cycles* for internal control. The OUT instruction is used to write to the SPI data register (SPDR) and takes 1 cycle.17 cycles must pass before the next write ( OUT) to the SPDR register is executed otherwise data will be lost.

I’ve added it already. Now having problem for the naming. read() function as before and readUInt8() for the ‘cached’ read :stuck_out_tongue:

Since you mentioned readUInt16Fast earlier I’ll append Fast to the ‘cached’ functions or should it be readFastUInt16 :grinning:

1 Like

I understood ‘serial communication protocol’. :P

Seems I’ve got a lot of reading to do…

So it’s 2 cycles * 8 bits = 16 cycles.
(Either there’s no shift register or the shift register is particularly slow.)
I don’t know where the 17th cycle comes from though.

Personally I think the ‘caching’ technique should be called ‘prefetched’ mode,
because the byte is being ‘fetched’ before it’s asked for.
(And apparently I didn’t make that up, “prefetching” is a genuine computer science term/etchnique.)

In which case the functions would then be readUInt8(),
for the original and readPrefetchedUInt8() for the latter.

Or readUInt8Prefetched() if people want to argue about it.

Personally I always stand by putting adjectives before nouns in identifiers,
because that’s how they appear in English sentences,
but I know there’s a lot of people who tend to put the adjective at the end,
similar to the grammar of French.

Cheateau unChateauBleu { Couleur::Bleu }; :P

1 Like

full duplex == sending and receiving at the same time

There is and quite some control logic too. It’s not really slow as it’s driven by 16MHz main clock (fun fact: the actual max SPI speed is CPU clock/4 but there’s a 2X speed boost option for master mode which doubles it to 8MHz) Here’s a block diagram of the SPI logic from the ATMEGA32U4 datasheet:

Let’s say it’s some breathing time the logic needs. and because the SPI clock is CPU clock /2 it cant shift on half cycles

Yes prefectching is better. But I don’t know about calling it ReadPrefetchedUint8 :thinking:
It may be more accurate (although the prefetch could also still be pending) something like ReadFastUInt8 is easier to understand.

What about ReadAhead?

Yes. That is great. Now you could do an Unsafe version of this without waiting. For code portions that are for sure taking at least 17 cycles.

BTW. Can the flash stay selected all the time or does it need the CS toggle to properly send bytes.
E.g. I keep the flash selected all over the time and only for screen updates I deselect it.