Adding Hashing Capabilities to EEPROM

— Sorry I wasn’t very clear; I just mean add an example to the document. You highlight the unlikely (but possible) issue of two games colliding, with the same address and number of bytes stored. This adds some mitigation.

— Great point. That might be the ‘best practice’ example to give ~ constexpr uint16_t signature = 0xABCD with some note about assigning 4 ‘random’ hexadecimal digits…? (Thinking about it, perhaps just 1 byte is sufficient? The odds are already pretty low…?)

The 2x char is just what has a long history of use in the community. Perhaps an advantage is people are more likely to change it for their game? Agreed, the entropy is pretty low (~26*26 =676, in reality nearer to 2^9, so ~9 bits).


— Fair enough. Was just thinking you’ve explained most of the basics in great detail- so it’s ideal for beginners. Although it can be dangerous, having boilerplate code to copy-paste is a great way for newcomers to get started.


— Perfect justification. Understood and agreed.


Complete aside: It’s only just occurred to me that the first 4 bytes of the stored range (with hash) will be getting updated (almost) every write now. (Assuming most data blocks only have a few changed bytes that get written).I don’t think there’s any way to reduce the impact of this, through wear levelling or reducing the hash size. :person_shrugging:

1 Like

Even then, I’m not sure it’s worth doing.It would take quite a bit of extra work to cover all the relevant information:

  • How to use the technique
  • Why it works
  • When it doesn’t work
  • What the alterantives are (e.g. using a different hash function)
  • When/why the alternatives are better

Also, it’s potentially better to handle the issue when it actually happens, since it’ll draw attention to which games have the incompatibility.


Although I’m not intending to add the information to the document, I’ll discuss the situation a bit here briefly.

(It’s much easier to explain here in terms that people familiar with the scenario will understand than trying to explain it in a more accessible way.)

Consider this…

Presume game A and game B have save data consisting of 10 bytes, and are both storing data at address 40, and are both using the built in hash.

Now, game B inserts 2 characters in front of their data, bringing the data size to 12 bytes. Note that the values of the 2 chars don’t actually matter at this point because suddenly games A and B have different sizes, which is (theoretically) enough to change the hash value.

The author of game A also decides to add 2 characters, bringing the sizes of game A’s data and game B’s data to the same value: 12 bytes. Let’s presume they both chose different values for these identifying char values. All is good.

Game C, however, uses 12 bytes of data and stores that data at address 40. (Can you see where this is going?)
Previously it didn’t clash with anything, and now suddenly it’s now clashing with game A and game B.

Suddenly games are chasing each other’s tails.
(Granted it’s statistically very unlikely, but stastistically unlikely scenarios must be considered.)

Now, rewind back to the start a moment. What would have happened if game B had decided to switch to another hash function?

It would no longer be clashing with A, and as long as A didn’t use the same hash function as B, A would also no longer clash with B. (The odds of two hash functions producing the same hash for the same data is probably quite tiny.)

If both games had switched to the same hash function, one of them would have to change again, but they don’t run the risk of getting game C involved because the size of the data didn’t change.

While I think of it, I should point out that the very act of a game’s data increasing in size will actually invalidate existing save data and will run the risk of clashing with other games that were previously using the same address but a different size of data.

Incidentally, I actually future-proofed Minesweeper by building in a mechanism that would allow save data to grow and shrink as necessary without invalidating the hash value.
(I get the feeling you’re going to ask if that could be added to Arduboy2EEPROM and then I’m going to regret telling you… ( -_-'))


Hopefully I’ve just explained why inserting characters is not a complete solution.
Granted, neither is using a different hash really.

I can think of half a dozen other approaches (my favourite is probably to exclusive or the hash value with a randomly chosen per-game or per-user identifier since that seems to combine the best of both worlds), but pretty much everything is going to fall prey to some stastistical anomaly somewhere down the line, and I’m not really a good enough mathematician to statistically prove which occurances are more likely, I can only prove by logic and example.

More accurately, Filmote started doing it for his games and then taught several other people to do it.

Granted it was better than not doing it because it at least detects whether the game has actually been run before, but in the grand scheme of things it’s not a massive improvement.

It’s also a great way to encourage people to copy and paste code without attempting to understand what the code is doing, which is frankly just setting them up for a fall later on, and it allows the no-effort script kiddies to prosper.

The code I wrote intentionally requires more than just the code presented to actually do anything useful. It showcases the important parts such that someone who understands the features used could work out what’s going on and adapt it, but it won’t work through copy and pasting alone, it needs to be modified and integrated.

Well that was easier than I was expecting.

To clarify, if the size of HashType or the hash implementation ever changed, that would be a breaking change, hence I had to give an example of a change that would technically not be a ‘breaking’ change.

(The more I think about it, the more I think perhaps I should go ahead with the struct approach just so there’s no ambiguity as to what the interface/API is supposed to be…)

This is true, but it’s the price that must be paid.

You could do some crazy attempt at wear levelling by moving the data around each time, but that’ll just end up occupting more of the already scarce EEPROM.

Besides which, some games will have save patterns like that anyway. A simple scoreboard might not, but an RPG is potentially going to be overwriting the player coordinates every time. Anything attempting some kind of time stamp will definitely be overwriting that every time.

It could possibly be mitagated by using 4×8-bit hashes or 2×16-bit hashes, and treating the data as 2-4 streams of parallel bytes, which would theoretically make certain bytes of the hash change less frequently, but I would expect that to weaken the hash (though I have no clue by what factor) .

1 Like

Alternative:
Rather than consuming precious EEPROM, an option to pass a unique ‘seed’ (first byte) to the hash function (i.e. not in the main data struct). Would need to pass the same key on write and read. Default is unused / 0x00, perhaps an overloaded function? Then it just uses PROGMEM / RAM if someone wants to pass their unique seed value. Ownership could be managed with a simple shared document… until all 256 values are claimed… :wink:
I’m staking my claim to 0xAD !!

1 Like

— Me: …weakly raises hand and waves… :grimacing:

It’s possible, but I’m still reluctant if it isn’t actually needed.

It would be backwards compatible, so it could be delayed until a problem has actually been demonstrated to exist.

The problem there would be getting people to register their value, since we have users beyond the forum.


To clarify: the user would pick some 4-byte number and then it would be ^ed with the hash. It would have most of the same properties as your suggestion, with a larger number value. I’m as good as certain that it would guard against the scenario where games A and B use the same address and size by giving them different ids, but I am worried about the potential of one game’s hashValue ^ identifier clashing with another games purely because of the nature of ^.

E.g. if hash A were 0x55555555 before the exor, and it were exored with 0xAAAAAAAA (id A) to get 0xFFFFFFFF, and hash B were 0x44444444 before the exor, and it were exored with 0xBBBBBBBB (id B) then the result would also be 0xFFFFFFFF. Surely the odds must be pretty high, but I don’t have the mathematical skill to say how high or how it compares to the odds of injecting a single byte resulting in the same effect.


Seriously though, there are people out there who just copy and paste code they find online without even attempting to understand it.

On StackOverflow I’ve seen some pretty dour “I copied and pasted this code that I found online and it isn’t working, what’s wrong with it?” questions.

They had so many homework questions that someone wrote an ‘open letter’ begging people to be more considerate and to stop cheating themselves out of the learning experience.

Even as a joke, it’s unfair to compare yourself to those kind of self-sabotaging time wasters.
(Fortunately we don’t really get many here on the forum.)


(Of course, if there comes a time that you want to get more into programming or try some trickier stuff, you know there’s plenty of people here who can help you out.)

1 Like

Fair enough. Here’s some data:
Out of a survey of 287 games, of which 140 used EEPROM, there are 5 address+length collisions, affecting 11 games, so ~3.8% of games surveyed.

Details:

Start: 16 , End: 16 —

  1. Ardulem
  2. Blocks
  3. Harambes Revenge

Start: 16 , End: 19 —

  1. Fatsche
  2. Pyoro

Start: 16 , End: 22 —

  1. Pipes
  2. SFCave

Start: 16 , End: 70 —

  1. Shattered Lands: Towers of Perdition
  2. Shattered Lands 2: Sea of Despair
    – Hmmm… this might be deliberate. Not sure how we should handle that @filmote !?..

Start: 100 , End: 134 —

  1. Asteroids
  2. Snake

I’m not sure if ~3% is a rare problem…?
Personally, I will still use & advocate 1–2 bytes of signature.
There’s also a theoretical benefit of passing in more data, regarding ‘spinning-up’ the state of the hash (relevant when only storing a few bytes).


– Yeah, my instinct is that’s not good, but practically may be ok… :man_shrugging:

Harambe’s Revenge is actually an exceptional case because it fills the entirety of EEPROM due to how it’s designed.

I think I asked if we had permission to change that many moons ago and the answer was ‘yes’. I’ll check the PM tomorrow.

It seems that nearly all of these developers are actually starting from the beginning of (permissible) EEPROM instead of trying to spread their data, which is why every single collision in that list has a start address of 16.

In which case, I’d argue that’s the real problem here.

@filmote could probably bump pipes down a few addresses.

Adding hash support would break backwards compatibility anyway, so the rest of these may as well have their addresses moved if/when they’re updated to the hashing technique.

Shattered Lands wasn’t a @filmote game. That’s @tuxinator2009’s work.

But it probably is intentional. I think it’s supposed to be like LodeRunner - multiple executables reusing the same save data.

(I’m not sure Sea of Despair was ever even finished?)

This I would take to be a genuine example, except I happen to know (without even clicking the links) those are both by the same person (because I remember him making them), thus that’s probably a case of copy & pasted code not properly updated.


I’m uncertain. I ran some numbers but couldn’t get my head around what I actually calculated.

I think I discovered that out of the 18,446,744,073,709,551,616 (232×232) possible (ordered) pairs of 32-bit operands to the exor operator, 576,460,752,303,423,488 (1⁄32) will produce the same value. I.e. that each 32-bit number as 576,460,752,303,423,488 possible (ordered) ‘factors’ (or summands?).

This was based on observing that by drawing the exor tables for 1 bit, 2 bit and 3 bit quantities, a duplicate value occupied 1⁄2, 1⁄4 and 1⁄8 of the table respectively. So actually maybe it’s 1⁄64? I don’t know, I’m not very good at maths.

Even if that’s accurate, I’m not sure whether or not that implies any collision statistics. Even if it does, I don’t know what the statistics of the other technique would be either.

I’ll think about it more when I next have chance.

1 Like

I checked.

Among other things, I said:

One user has already created a version that only uses a fixed amount of EEPROM

And the response was:

I think probably it would be a good idea to update the main branch with it, I’m just wondering how much it was tested.

So I’m presuming Akkera’s version was used?

And I think I found the source for that version this time (I didn’t spot it when responding to the other thread earlier):

In which case, yes, it probably collides with the other two, though it actually already uses a three character guard value, so the collision wouldn’t be a problem.

Also, I seem to calculate it as actually using 11 bytes (3 characters plus an 8 byte HighScoreEntry struct), meaning it spans from 16 to 27.

Hang on, these games use only 1 byte at address 16?
Are you sure about that?

Thinking about them again now that I’m not keeling over at 7 to midnight…

Of those 11 clashing games you identified:

  • Asteroids and snake were written by the same person, and are a case where the code was just reused and the address not changed, so I’m discounting those.
  • Shattered Lands 2 was never finished, so that value either wasn’t supposed to be the final value, or it was intended that the player’s data should carry across to the new game, so I’m ruling those out too.

That leaves 7 games that don’t have mitigating circumstances, and all of them start at address 16, so I’m inclined to declare that this problem only exists here because the developers didn’t attempt to pick a different address for their game.

If all those developers had done the proper thing (pick a random address further along the data pool) then your list wouldn’t be anywhere near as long.

Thus I think it’s safe to rule the clashing problem as uncommon enough (for the moment anyway) that it’s not worth adding any kind of extra library support. If it becomes a problem later down the line then it can be reexamined, but for now I think it’s better to just forget about it.

To put it another way: I could sit down and do a bunch of boring and awkward attempts at calculating some statistics to decide what’s best to fix a problem that only (just about) affects 11 games (out of a possible 287 - i.e. 276 games do not have this problem), or I could spend that time focusing on more realistic issues like writing help documents and deciding whether it’s worth having a ‘safe’ and ‘unsafe’ version of the API.

I think it’s also worth pointing out that there’s nothing the Arduboy2EEPROM library will do that people couldn’t do themselves, so if anyone wanted to attempt any of the techniques discussed (the 2-char value, a uint16_t value, an extra byte inserted into the hash calculation, exoring the hash with a uint32_t) they are completely free to do so, providing that they have the skill/ability to do so.

But a guard - like I use - isn’t enough.

No, but the scenario we’re currently discussing is (hypothetically) if every game were using the hash feature that will be provided by the Arduboy2EEPROM library.

Under that scenario, any two games that use exactly the same start address and size of data would end up determining each other’s hashes to be valid (because they’d find the hash in the same place and be attempting to validate the same amount of data) and thus misinterpreting each other’s data. (Which is why @acedent was highlighting 11 games that fall under that collision criteria.)

Some kind of game identifier value (be it 2-3 characters or a numeric identifier) would be enough for the game to know whether or not the data was its own even if the hash gave a false positive.

Ah … sorry I mis-understood. Hence including a guard in the gash would be enough (hopefully) to make it unique.

In this case at least it would mean Harambe’s Revenge wouldn’t have to worry.

The other two games it’s purported to clash with would still have to worry, however after a quick skim I’m still not sure the values attested for Harambe’s Revenge are correct. It looks to me like it should actually start at address 16 and continue to address 26 (i.e. 11 bytes of data).

(Note: should have said 26 in the above comment, not 27. I’m used to dealing with half-open ranges on account of using C++ using half-open iterators, so fully inclusive ranges tend to throw me.)


Summary version of the rest of the conversation:

  • Acedent suggested using a 2-char identifier, as you’ve done with many of your games. (I.e. with the implication that the library would have some kind of support function for that.)
  • I suggested that:
    • A) It wouldn’t really be necessary because I don’t think it’s likely that many games are actually going to end up using both the same address and the same data if everyone is following the advice to pick a randomish address to store their data at
    • B) If there were going to be some second measure, simply exoring a 4-byte ID with the hash might be enough to act as a second layer of protection. The advantage there would be not wasting any extra EEPROM.
  • Acedent also suggested inserting a 1-byte ID value into the hash calculation as an alternative way to avoid using up more EEPROM.

On Wednesday night I made a brief, falling-asleep-at-my-desk attempt to calculate the odds of the exor technique also failing.

This morning I’m back to arguing that I think it’s a waste of time considering a secondary guard (i.e. beyond the hash function) after reexamining the 11 game list and reaffirming that 4 are exceptional circumstances and the rest are all saving their data at address 16, which goes against the ‘pick a random address’ recommendation.

1 Like

I’m ok with the outcome. As you rightly say, this is clearly rare and unusual. It can be fixed easily (by offsetting the start address by 1 byte+)… and personally I’m ok with adding a 2 char signature… (just those bytes will be in the struct and written to EEPROM, rather than passed RAM to the functions).

1 Like

Again, the user could just make their own function for it if they really wanted to.

void saveData(char guardA, char guardB, const PlayerData & playerData)
{
	// Note:
	// The hash won't reflect the guard values,
	// but that doesn't really matter.
	const auto hashValue = Arduboy2EEPROM::hash(playerData);
	
	uintptr_t address = saveAddress;
	
	Arduboy2EEPROM::write(address, hashValue);
	address += sizeof(hashValue);
	
	Arduboy2EEPROM::write(address, guardA);
	address += sizeof(guardA);
	
	Arduboy2EEPROM::write(address, guardB);
	address += sizeof(guardB);
	
	Arduboy2EEPROM::write(address, playerData);
}

That’s the whole point of taking a minimalist approach:
don’t try to predict everything the user will need, just give the user the tools to build the functionality they need.

(Though 2 bytes of RAM isn’t a huge sacrifice if they want to do it the lazy way - including the chars in the struct - and/or don’t know how to write the above function and its ‘load’ equivalent.)

Actually, by passing as a function argument it’s registers that you hope the values to end up in. (For data as small as chars anyway.)

1 Like

Thanks – I appreciate the example implementation.
Although it’s still 2 bytes written to EEPROM rather than a RAM based seed value, passed to the hash function. Not sure there’s any way around that.

Out of curiosity, do you have a dislike for overloaded functions? Perhaps there’s a good reason to avoid them?

Edit: Never mind – the overloaded code gets ugly esp. with the extensive Doyxgen comments… although it might be manageable if the prototypes were split out and we had foo.h and foo.cpp.

Well, I have satisfied my curiosity and now totally agree. Keep It Simple!

Yeah, in fairness you can’t really do that without basically reimplementing the hash function simply because of how the hash function works.

No, I have nothing against overloads.

What I do have is a dislike for adding too many functions, because every extra overload has to be documented and maintained, and I’m the one who has to write that documentation, and either I or MLXXXp (or both of us) will be the one to maintain it.

Ultimately the use of a feature must outweigh the cost of adding the feature.

The cost of adding a feature includes writing, testing, documenting, and maintaining it.
When you add that up, even ten lines of code can become quite expensive.


Hrm, that’s not quite what I was expecting you meant.

I thought you meant trying to insert an extra value into the sequence:

static HashType hash(const unsigned char * data, size_t size, uint8_t extra)
{
	HashType value = size;
	
	value = (((value << 5) ^ (value >> 27)) ^ extra);
	
	for(size_t index = 0; index < size; ++index)
		value = (((value << 5) ^ (value >> 27)) ^ data[index]);
	
	return value;
}

I’m not sure if replacing the initial value with a user-defined one would invalidate any of the properties of the hash. Sometimes it’s only a convinient way of having a non-zero initial value, other times it’s important. I don’t really like messing around with hash algorithms because even the slightest thing can change their characteristics.

By the way your version would violate one of the guarantees I made for the seedless version:

If size is 0, the returned hash code will also be 0.

For your version the guarantee would have to change to:

If size is 0, the returned hash code will also be the value of seed.

I get the impression you didn’t realise how many functions would have to be overloaded.

Technically there’s a way to avoid needing the extra overloads for the templated functions, but it would be detrimental to the clarity of the documentation. It’s much better to have individually documented overloads.

It’s technically possibly to shunt the Doxygen comments into a separate file, but I’m not entirely sure how, and someone still has to write the documentation.

You can only do that for non-template functions.

Template function definitions must remain accessible from the header because the compiler needs to know the full definition ito be able to instantiate the function if requested to do so.

(For more detail see: “Why can’t I separate the definition of my template class from its declaration and put it inside a .cpp file?”.)

You can technically declare the template functions and then define them later on, so you could shunt the definitions into another header, but I try to avoid doing that unless there’s a tangible benefit.

1 Like

– My instinct here, is that it’s a cheap way to seed the hash. I should probably read up on the actual function… here. Chapter 6.4, p.513. (That was the cited source… but haven’t found it yet!?)

Update: From my skim reading, I haven’t found the algorithm in the cited source. Perhaps it’s not by DEK? Or it’s only the 1st edition? However I did find this ~
Here, under the heading ‘A common weakness’.
Here, under ‘Hashing Sequences of Characters’…
This is a nice explanation of how the shift parameters were tuned for ASCII text. I’m not sure if this reduces its merit as a general hash?..

It was cited as “under the topic of sorting and search chapter 6.4.”, no page given.

I’m begining to suspect that that it might not be.

Though if it were, it might be expressed in mathematical terms.
E.g. exor might actually be expressed as addition modulo 2, and shifting might be expressed as multiplication or division by a power of 2.
That would make it harder to recognise.

That C code looks older than I am! It’s using pre-ANSI parameter syntax.
That makes it both invalid C++ and invalid C99 code.

I didn’t read the whole thing, but I did test its assertion about hash("EXXXXXB") == hash("AXXXXXC"), and it’s true:

using System;

namespace ConsoleApplication1
{
    class Program
    {
        static uint hash(string key)
        {
            uint result = (uint)key.Length;

            for (int index = 0; index < key.Length; ++index)
                result = ((result << 5) ^ (result >> 27)) ^ (byte)key[index];

            return result;
        }

        static void Main(string[] args)
        {
            var keys = new string[] { "EXXXXXB", "AXXXXXC" };

            foreach (var key in keys)
                Console.WriteLine("{0} = {1:X}", key, hash(key));

            Console.ReadKey();
        }
    }
}

However, those Xs aren’t accidental - those intermediate values are likely to affect the hash.

Yes, it claims it’s a CRC, which could be correct.
(I’ve never found a satisfactory answer to the question of how CRCs work.)

I’ve got plenty of other hashes to choose from if this one is duff.

Though as you can see, doing it ‘properly’ isn’t easy.
Really it’s a job for a mathematician, which I’m certainly not.
(My highest maths qualification is a B at GCSE level.)

Potentially.

ASCII characters will always have a 0 in bit 7 by definition, and the frequency of values is likely to gravitate towards the letters, space, and digits, with some symbol and many unprintable characters never being encountered.

Arduboy games would not have those restrictions on their save data.
Without an analysis, it would be hard to say exactly what byte values might be expected from Arduboy save data, so I would just presume ‘anything and everything’.


By the way, now that @Mr.Blinky is back I’m going to try to give him some time to catch up a bit and see what his thoughts are on what’s been said in regards to whether the API is suitable for the FX, so I may not reply to any future replies for a little while, just to give him time.

1 Like

One small thing:
I’ve run the only test I really know how to do with a hash, and resolved to replace the supposed ‘Knuth’ hash with John Skeet’s hash because it appears to express superior properties.

I can provide a demo program on request, but you’ll need over 4GB spare to run it because it allocates a 4GB array. (Tricking the compiler into allowing me to allocate an array of that size took a bit of fiddling. I’m still not completely sure why some ways work and others don’t.)

1 Like

I’ve not had time to try, but I found there is a popular hash test suite ‘smhasher’ (like ‘diehard’). Also some more functions here and here.

Is it Jon Skeet’s post on StackOverflow (using magic primes and multiply) ?
If so, its source is ‘Effective Java: Second Edition’, 2008, p.47 onwards, by Joshua Bloch.