Adding Hashing Capabilities to EEPROM

Do you think your hash function could be added to the Arduboy2 lib? It would be great to have a standardised interface to write/read data to EEPROM, based on best practises.

It could, but whether or not it should is a more complex issue…


For a start, you have to ask what the actual objective is.

If the objective is just to have a simple way to produce save data that can detect (with a reasonable though not perfect level of certainty) when it’s been ‘trampled’ by another game then a hash function would achieve that goal.


After that comes the big question of “Which hash function?”, which depending on how seriously you want to treat it can be quite a deep question.

If you want to be nonchalant about it then the hash function I used has been proved to work, which may make it ‘good enough’.

If you want to be more serious about it then you could go to the point of comparing various different hash functions based on characteristics like the size of the hash value they produce*, the amount of machine code they produce, how fast they are to compute, and what mathematical properties they hold.

(* I would expect an argument over whether a 2 or 4 byte hash is more desirable.)


Then there’s the issue of how it should actually be implemented, which includes considerations like:

  • Avoiding name clashes (since Arduboy2 doesn’t use namespaces)
  • Whether the hash calculation facilities should be a separate entity to the eeprom facilities
    • From a ‘proper programmer’ point of view they should be because that would be good separation of concerns and makes it easier to provide alternative hash functions without cluttering up a single class, but on the other hand that might make life more difficult for the beginners and it’s going to litter the global namespace more (again, since Arduboy2 doesn’t use namespaces)
  • Where this would stand in relation to the idea of producing an eeprom API with a ‘commit’ function to assist devices that use flash-simulated eeprom
1 Like

I’d just love a simple interface where I can pass a struct and have it saved or loaded! :crazy_face:

You mean EEPROM.put and EEPROM.get?

… but with error checking magically built in :slight_smile:

Here is some really rough code that seems to work:

Essentially, you extend the StructWithHash.h to include your details (as I did in the TestStruct.h), then call the get() and put() as I did in the main .ino file.

If you get() a structure and the checksum is not valid, it will call the reset() method. You can override that in the class that extends the base StructWithHash.h class.

Yes, it could be cleaned up to be better - a lot better. For example, when calculating the hash, I read the data from the EEPROM when I could have cast the struct to a uint8_t array and read back from it in memory. Also the hash is a simple ‘add em up’ version - nothing fancy.

1 Like

No you can’t, because reset isn’t virtual, so (static_cast<StructWithHash>(t)).reset(); will explicitly call the reset in StructWithHash, which does nothing.

If you remove the static_cast, then it will work providing that the type T provides a reset member function.


I must admit, I find it odd that you effectively end up having to do testStruct.get(offset, testStruct), and that you could end up writing testStructA.get(offset, testStructB) or even something like testStructA.put(offset, 10), which would have some surprising effects.

Mmmm … I have removed the cast as you suggested.

As I mentioned, its really rough code but it demonstrates the point. I wanted to have just one simple class that you extend.

Feel free to contribute. Can I somehow, in the get and put template methods, refer to ‘this’?

1 Like

I was hoping to write a single-header library myself when I have the time to write and test it, but so far I’ve not had time.

You could, but that won’t work either because of the lack of virtuality.
And yet you can’t introduce virtual functions either because the vtable pointer will screw with things.

Really inheritance is the wrong tool for this (which is one of the big issues I have with Pokitto’s cookie system).

The best way is to make the get and put functions either static member functions of a class that doesn’t need to be inherited, or to make them free functions (ideally inside a namespace), which is what I was planning to do (when I have a minute).

When I have chance to implement one, I’ll provide a version that uses your hash as well to demonstrate.

1 Like

Just going to post since this thread is is already here, I noticed on my Arduboy FX, after playing a few games, most games that I load now have EEPROM collision, and the highscores are all messed up. One thing I saw in one of the games, in the highscore section, there were already highscores there but I never played the game yet, and there were sort of two cards and a sort of mask in those highscores.

Finally got chance to implement something over dinner…

The ‘just add everything up’ approach is implemented as SummationHash, with SummationHash<uint16_t> providing the same behaviour @filmote’s implementation was using.

There’s no need to inherit anything, just do

using Hash = Hashing::SummationHash<uint16_t>;
Hashing::EEPROM::putWithHash<Hash>(address, object);

for writing and

using Hash = Hashing::SummationHash<uint16_t>;
if(!Hashing::EEPROM::getWithHash<Hash>(address, object))
{
	// Code in here runs if the hash was invalid
}

for reading. Which is quite close to how EEPROM.get and EEPROM.put would normally be used.


Obviously it’s a bit wordy, but for something you’re only going to be using 2-3 times per game, it doesn’t really matter if it’s a bit long.

You can do things like:

using namespace Hashing;

using Hash = SummationHash<uint16_t>;

EEPROM::putWithHash<Hash>(address, object);

Though be wary that Hashing::EEPROM may end up clashing with ::EEPROM. It seems the compiler’s smart enough to pick the right one, but best be on your toes. I could change the name, but I couldn’t think of a better option that wouldn’t be even longer (e.g. HashedEEPROM).

If it bothers you enough then you can do something like using HashedEEPROM = Hashing::EEPROM;, or whatever other alias you might prefer. After all, type aliases are nothing to be afraid, they’re no more complex than variables.


Alongside simple summation I’ve also included a slightly better algorithm from the works of Donald Knuth, included under the name KnuthHash.

Unless having a 4-byte hash code is a deal breaker, prefer KnuthHash to SummationHash because it’s better quality and thus less likely to give false positives.

I collected about dozen different hashes, compared the size of the code generated (see the statistics here) and Knuth was the cheapest one of reasonable quality. I may include some of the others later, though I’m wary of going overboard. I think as long as there’s a few of varying speed, size and quality then that’s enough really.


Lastly, it works with function objects (functors) and thus lambda expressions, which means you can write custom hash functions without needing to create a dedicated class or struct:

auto hash = [](const /*object type*/ & object) -> /*hash type*/
{
	// Your hash function here
};

Hashing::EEPROM::putWithHash(address, object, hash);

Ideally I would have liked to support a function signature of (const unsigned char *, size_t) too, but Arduino doesn’t really have the right support classes for detecting function signatures (i.e. either stuff from <type_traits> or C++20’s concept and requires), so I had to pick one and this is the better object because it allows for things like return object.getHashCode();.

Besides which, reinterpret_cast<const unsigned char *>(&object) and sizeof(object) are easy enough to write anyway. (Or at least they ought to be for anyone capable of writing a decent custom hash function.)

Just be wary of the return type. You may wish to use an explicit static_cast or -> (as I have done in the example).

2 Likes

Yes, that will happen. Generally there’s no way to avoid it.
A good game will at least offer you the option of wiping the highscore table, but failing that you’d have to upload an EEPROM editor and wipe the relevant bytes of EEPROM.

EEPROM isn’t like a file system, it’s effectively just an array of 1024 bytes that retain their memory after the power is switched off.

Those are actually valid characters in Arduboy2’s font.

The font Arduboy2 uses is based on (though not the same as) the character set of Code Page 437:

Most games’ high score systems only let you use letters and numbers though, and sometimes symbols like underscores, spaces, or hyphens, so those card suit and smiley face characters likely only appear because the game is loading ‘junk data’ - normally you probably wouldn’t be able to use them.

Amazing! I wanted to retrofit ASE by @pmwasson (I’m using it a lot recently). I’m not sure when I will have time… but will post an update when I can.
I think the 32bits is great. Many programs already use 2x char as a magic signature, so this (replacement?) only costs two more bytes.

BTW - I’m thinking for the use case of ASE, it might be desirable to save each sprite as it’s own ‘packet’, so that corruption loss is limited. Aside from the extra overhead, I presume that’d work… :thinking:

Not exactly.

If two games use the same hash function, the same address and the same size of game data then they’ll end up considering each other’s game data to be valid.

Thus a (pseudo-)randomly chosen game identifier could still be useful in some circumstances.
But that depends on the odds of the aforementioned scenario ocurring, which I think is relatively unlikely.
(Unlikely or not though, I still used one in Minesweeper.)


The significance of a 32-bit hash is more to do with collision odds. 32 bits gives you 4,294,967,296 possible hash values compared to just 65,536 possible hash values using 16 bits.

Though the quality of the hash function is also important - a poor hash function (such as unaltered summation) will mean data below a certain size only ends up using a tiny fraction of those possible values.

(I could go into what makes a good hash function and why summation does a bad job, but I think I’m the only one who would actually be interested.)

The only 16-bit ‘quality’ hash function I could find was Fletcher16, which produced a fair bit more machine code than the Knuth function (most likely due to the % 255s).

I also tried a 16-bit one’s complement (the same algorithm used for checksums in UDP), but I’m fairly sure that’s going to have exactly the same flaws as bare summation.


Yes, that would work. As long as you’re careful to include the size of the hash when calculating where the images are to be saved to. E.g. if you were using Hashing::KnuthHash then you’d use sizeof(Hashing::KnuthHash::hash_type) to get the size of the hash.

You could also just make a new struct that contained a Hashing::KnuthHash::hash_type hash and uint8_t array[SpriteBuffer::bufferSize] (or possibly SpriteBuffer since it seems that has no other member variables) and then take the size of that struct with sizeof. A bit of a hack since you wouldn’t be instantiating the struct, but potentially less error prone.

(Bear in mind it would be saving the raw image data, not the generated C++ output.)

1 Like

Yes, absolutely true.

I am keen to see this added to the base library so people have no excuse not to know about it!

1 Like

Unfortunately that’s likely to take more time.

With a separate library dedicated to hashing I could theoretically just keep adding new functions and people can choose the one that works best for them, but for the Arduboy2 library it would likely have to pick only the best 1-3 hash functions to reduce maintenance costs, (which means going through and testing all the candidates to determine their runtime characteristics,) and possibly use a slightly simpler interface (the Arduboy2 library seems allergic to template code for reasons I cannot fathom :P).

The reality is collisions are so unlikely that you should just pick one and run with it. Why give choices? No one except you will be bothered to understand the pros and cons of each choice.

The performance is basically irrelevant as you are probably only saving eeprom changes when a user actually makes a change / beats a high score. Don’t worry about it.

The hash is probably also irrelevant as long as it is optimised for small bursts of data rather than large tracts. Just pick one.

As long as the code is in its own namespace then it should be no issue to merge. I am sure that @MLXXXp will merge it if there is no impact on existing library functions and therefore games.

Don’t over think it. There is no need for multiple hash algorithms and the like. Be autocratic, not accomodating.

The above is blunt I know … but let’s get it done.

1 Like

Even if only one is chosen, it must be chosen methodically, not at random.

Once chosen, it can’t be replaced without breaking games, so it’s better to choose wisely in the first place.

(For the Arduboy2 library anyway.)

In the case of the stand-alone library, because it could potentially be opened up to all of Arduino (or even non-Arduino environments), and with that wider audience someone probably will care.

And even if nobody cares, making the user responsible for choosing the hash function means they can’t complain about my choice or having limited options. :P

In the case of Arduboy2, just the one may well be enough, but it’s not really for me to decide - I can argue one way or the other, but ultimately it’s @MLXXXp that decides.

It likely is, but there’s more to the characteristics of a hash function than its speed, and obviously you don’t want it so horrendously slow that it causes noticeable lag.

That’s one of the things that can’t really be known without testing/analysing a function’s behaviour.

Unfortunately whilst things like PRNGs tend to be well analysed and have published statistics and even proper academic papers written about them, information about the behaviour of particular hash functions is more thin on the ground. (And what is available is loaded with the kind of maths notation and jargon that melts my poor little brain.)

The version for use in Arduboy2 probably wouldn’t be in a separate namespace because the Arduboy2 library doesn’t tend to use namespaces.

I have a vague idea in my head that there would be an Arduboy2EEPROM class that would host the getWithHash/putWithHash functions (or more preferably readWithHash and writeWithHash or updateWithHash).

That class could then also be used to replicate the EEPROM API (or a slightly modified version) and provide a commit function - something that was discussed a little while back as a means of supporting environments with simulated EEPROM.

Again, Arduboy2 isn’t my library to ‘rule’. I don’t hold the ‘krátos’.

It’s for @MLXXXp to make the final decisions - I can only make proposals and arguments.

Though if @MLXXXp is available and interested, I’d be happy to put together some proposals and discuss the matter.

1 Like

I have proposed the following API to @MLXXXp and he seems satisfied with it.

#include <avr/eeprom.h>

struct Arduboy2EEPROM
{
	static void beginWrite()
	{
	}
	
	static void endWrite()
	{
	}

	static unsigned char readByte(uint16_t address)
	{
		return eeprom_read_byte(reinterpret_cast<const unsigned char *>(address));
	}

	static void writeByte(uint16_t address, unsigned char byte)
	{
		eeprom_update_byte(reinterpret_cast<const unsigned char *>(address), byte);
	}

	template<typename Type>
	static void read(uint16_t address, Type & object)
	{
		auto pointer = reinterpret_cast<unsigned char *>(object);
		
		for(size_t index = 0; index < sizeof(object); ++index)
			pointer[index] = readByte(address + index);
	}

	template<typename Type>
	static void write(uint16_t address, const Type & object)
	{
		auto pointer = reinterpret_cast<const unsigned char *>(object);
		
		for(size_t index = 0; index < sizeof(object); ++index)
			writeByte(address + index, pointer[index]);
	}
	
	using hash_type = uint32_t;
	
	template<typename Type>
	static hash_type hash(const Type & object)
	{
		return hash(reinterpret_cast<const unsigned char *>(object), sizeof(object));
	}
	
	static hash_type hash(const unsigned char * data, size_t size)
	{
		hash_type value = size;
		
		for(size_t index = 0; index < size; ++index)
			value = (((value << 5) ^ (value >> 27)) ^ data[index]);
			
		return value;
	}

	template<typename Type>
	static bool readWithHash(uint16_t address, Type & object)
	{
		hash_type storedHash;
	
		read(address, storedHash);
		read(address + sizeof(hash_type), object);
		
		return (storedHash == hash(object));
	}

	template<typename Type>
	static void writeWithHash(uint16_t address, const Type & object)
	{
		write(address, hash(object));
		write(address + sizeof(hash_type), object);
	}

	template<typename Hash, typename Type>
	static bool readWithHash(uint16_t address, Type & object, Hash hash)
	{
		using hash_type = decltype(hash(object));
	
		hash_type storedHash;
	
		read(address, storedHash);
		read(address + sizeof(hash_type), object);
		
		return (storedHash == hash(object));
	}

	template<typename Hash, typename Type>
	static void writeWithHash(uint16_t address, const Type & object, Hash hash)
	{
		using hash_type = decltype(hash(object));
		
		write(address, hash(object));
		write(address + sizeof(hash_type), object);
	}
};
Explanations and Justifications...
  • beginWrite and endWrite for accomodating systems with faux EEPROM.
    • I believe flash-emulated EEPROM would only require the beginWrite (called commit in other APIs), but having beginWrite would facilitate devices with proper file systems, allowing an eeprom file to be opened and closed properly
    • Personally I’d rather have an actual Arduboy2File object of some sort that closes when the object is destroyed so that nobody forgets to call endWrite, but I think less experienced programmers might struggle with it
    • If it were possible to only enable it during debugging then it would probably have been worth having a bool that tracks if begin and end are called to help catch bugs, but that might not be an option
  • readByte and writeByte do what they say on the tin.
    • I’m proposing uint16_t rather than int mostly to keep negative values away from the system. size_t would also be a reasonable candidate, but may be more confusing.
    • They are implemented directly through eeprom_read_byte and eeprom_update_byte - relying on avr-libc, not Arduino’s EEPROM
    • There’s only read and write, no update, and ‘write’ is implemented as ‘update’.
      • Hardly anyone would ever want to write instead of updating so it likely makes sense to just leave that possibility out entirely.
      • I’m presuming that calling them ‘read’ and ‘update’ would probably confuse people more.
  • read and write are the templated versions that read and write whole objects.
    • Using eeprom_read_block and eeprom_update_block might be more efficient in terms of time or memory, I haven’t tested.
    • The compiler should optimise away the loops when sizeof(object) is 1, but I haven’t tested at all.
    • If the compiler doesn’t optimise, there are ways to add special versions for particular types, but depending on what exactly needs to be done it may require a bit more boilerplate. Something like read(uint16_t, const unsigned char &) is easy, something like ‘specialise for all Type where sizeof(Type) is 1’ would potentially be more complicated.
  • hash has a non-templated version that just uses a pointer and a size, and a templated version that accepts any type.
    • The templated version exists to hide the reinterpret_cast and sizeof, which avoids confusing beginners, saves users some work and reduces the risk of error introduced by mistyping.
    • I haven’t tested, but I’m almost certain it’ll be optimised away as if the user had written the reinterpret_cast and sizeof themselves.
    • The hash is reportedly from The Art Of Computer Programming Volume 3, Chapter 6.4, by Donald E. Knuth. I have only verified that it produces very little machine code and is better quality than a plain summation, I have not assessed any other qualities such as speed or behavioural characteristics.
  • readWithhash and writeWithHash are the functions that started this whole escapade - a way to write an object with a prepended hash code and read it back whilst varifying that the hash is correct.
    • readWithHash returns true if the hash was valid and false otherwise - it is the caller’s responsibility to decide how to react to an incorrect hash, lest the library pen the user into a corner.
    • Again, likely to be inlined because each only consists of a few lines and are likely to only be called once each, but I haven’t done any tests to that effect.
  • Two extra versions of readWithHash and writeWithHash that will accept any object that implements an operator(), thus allowing users to provide their own hash functions if necessary.

Does anybody have any quibbles or queries before I begin writing the documentation?

1 Like