Alternative:
Rather than consuming precious EEPROM, an option to pass a unique ‘seed’ (first byte) to the hash function (i.e. not in the main data struct
). Would need to pass the same key on write
and read
. Default is unused / 0x00
, perhaps an overloaded function? Then it just uses PROGMEM / RAM if someone wants to pass their unique seed value. Ownership could be managed with a simple shared document… until all 256 values are claimed…
I’m staking my claim to 0xAD
!!
— Me: …weakly raises hand and waves…
It’s possible, but I’m still reluctant if it isn’t actually needed.
It would be backwards compatible, so it could be delayed until a problem has actually been demonstrated to exist.
The problem there would be getting people to register their value, since we have users beyond the forum.
To clarify: the user would pick some 4-byte number and then it would be ^
ed with the hash. It would have most of the same properties as your suggestion, with a larger number value. I’m as good as certain that it would guard against the scenario where games A and B use the same address and size by giving them different ids, but I am worried about the potential of one game’s hashValue ^ identifier
clashing with another games purely because of the nature of ^
.
E.g. if hash A were 0x55555555
before the exor, and it were exored with 0xAAAAAAAA
(id A) to get 0xFFFFFFFF
, and hash B were 0x44444444
before the exor, and it were exored with 0xBBBBBBBB
(id B) then the result would also be 0xFFFFFFFF
. Surely the odds must be pretty high, but I don’t have the mathematical skill to say how high or how it compares to the odds of injecting a single byte resulting in the same effect.
Seriously though, there are people out there who just copy and paste code they find online without even attempting to understand it.
On StackOverflow I’ve seen some pretty dour “I copied and pasted this code that I found online and it isn’t working, what’s wrong with it?” questions.
They had so many homework questions that someone wrote an ‘open letter’ begging people to be more considerate and to stop cheating themselves out of the learning experience.
Even as a joke, it’s unfair to compare yourself to those kind of self-sabotaging time wasters.
(Fortunately we don’t really get many here on the forum.)
(Of course, if there comes a time that you want to get more into programming or try some trickier stuff, you know there’s plenty of people here who can help you out.)
Fair enough. Here’s some data:
Out of a survey of 287 games, of which 140 used EEPROM, there are 5 address+length collisions, affecting 11 games, so ~3.8% of games surveyed.
Details:
Start: 16 , End: 16 —
Start: 16 , End: 19 —
Start: 16 , End: 22 —
Start: 16 , End: 70 —
- Shattered Lands: Towers of Perdition
-
Shattered Lands 2: Sea of Despair
– Hmmm… this might be deliberate. Not sure how we should handle that @filmote !?..
Start: 100 , End: 134 —
I’m not sure if ~3% is a rare problem…?
Personally, I will still use & advocate 1–2 bytes of signature.
There’s also a theoretical benefit of passing in more data, regarding ‘spinning-up’ the state of the hash (relevant when only storing a few bytes).
– Yeah, my instinct is that’s not good, but practically may be ok…
Harambe’s Revenge is actually an exceptional case because it fills the entirety of EEPROM due to how it’s designed.
I think I asked if we had permission to change that many moons ago and the answer was ‘yes’. I’ll check the PM tomorrow.
It seems that nearly all of these developers are actually starting from the beginning of (permissible) EEPROM instead of trying to spread their data, which is why every single collision in that list has a start address of 16.
In which case, I’d argue that’s the real problem here.
@filmote could probably bump pipes down a few addresses.
Adding hash support would break backwards compatibility anyway, so the rest of these may as well have their addresses moved if/when they’re updated to the hashing technique.
Shattered Lands wasn’t a @filmote game. That’s @tuxinator2009’s work.
But it probably is intentional. I think it’s supposed to be like LodeRunner - multiple executables reusing the same save data.
(I’m not sure Sea of Despair was ever even finished?)
This I would take to be a genuine example, except I happen to know (without even clicking the links) those are both by the same person (because I remember him making them), thus that’s probably a case of copy & pasted code not properly updated.
I’m uncertain. I ran some numbers but couldn’t get my head around what I actually calculated.
I think I discovered that out of the 18,446,744,073,709,551,616 (232×232) possible (ordered) pairs of 32-bit operands to the exor operator, 576,460,752,303,423,488 (1⁄32) will produce the same value. I.e. that each 32-bit number as 576,460,752,303,423,488 possible (ordered) ‘factors’ (or summands?).
This was based on observing that by drawing the exor tables for 1 bit, 2 bit and 3 bit quantities, a duplicate value occupied 1⁄2, 1⁄4 and 1⁄8 of the table respectively. So actually maybe it’s 1⁄64? I don’t know, I’m not very good at maths.
Even if that’s accurate, I’m not sure whether or not that implies any collision statistics. Even if it does, I don’t know what the statistics of the other technique would be either.
I’ll think about it more when I next have chance.
I checked.
Among other things, I said:
One user has already created a version that only uses a fixed amount of EEPROM
And the response was:
I think probably it would be a good idea to update the main branch with it, I’m just wondering how much it was tested.
So I’m presuming Akkera’s version was used?
And I think I found the source for that version this time (I didn’t spot it when responding to the other thread earlier):
In which case, yes, it probably collides with the other two, though it actually already uses a three character guard value, so the collision wouldn’t be a problem.
Also, I seem to calculate it as actually using 11 bytes (3 characters plus an 8 byte HighScoreEntry
struct), meaning it spans from 16 to 27.
Hang on, these games use only 1 byte at address 16?
Are you sure about that?
Thinking about them again now that I’m not keeling over at 7 to midnight…
Of those 11 clashing games you identified:
- Asteroids and snake were written by the same person, and are a case where the code was just reused and the address not changed, so I’m discounting those.
- Shattered Lands 2 was never finished, so that value either wasn’t supposed to be the final value, or it was intended that the player’s data should carry across to the new game, so I’m ruling those out too.
That leaves 7 games that don’t have mitigating circumstances, and all of them start at address 16, so I’m inclined to declare that this problem only exists here because the developers didn’t attempt to pick a different address for their game.
If all those developers had done the proper thing (pick a random address further along the data pool) then your list wouldn’t be anywhere near as long.
Thus I think it’s safe to rule the clashing problem as uncommon enough (for the moment anyway) that it’s not worth adding any kind of extra library support. If it becomes a problem later down the line then it can be reexamined, but for now I think it’s better to just forget about it.
To put it another way: I could sit down and do a bunch of boring and awkward attempts at calculating some statistics to decide what’s best to fix a problem that only (just about) affects 11 games (out of a possible 287 - i.e. 276 games do not have this problem), or I could spend that time focusing on more realistic issues like writing help documents and deciding whether it’s worth having a ‘safe’ and ‘unsafe’ version of the API.
I think it’s also worth pointing out that there’s nothing the Arduboy2EEPROM
library will do that people couldn’t do themselves, so if anyone wanted to attempt any of the techniques discussed (the 2-char
value, a uint16_t
value, an extra byte inserted into the hash calculation, exoring the hash with a uint32_t
) they are completely free to do so, providing that they have the skill/ability to do so.
But a guard - like I use - isn’t enough.
No, but the scenario we’re currently discussing is (hypothetically) if every game were using the hash feature that will be provided by the Arduboy2EEPROM
library.
Under that scenario, any two games that use exactly the same start address and size of data would end up determining each other’s hashes to be valid (because they’d find the hash in the same place and be attempting to validate the same amount of data) and thus misinterpreting each other’s data. (Which is why @acedent was highlighting 11 games that fall under that collision criteria.)
Some kind of game identifier value (be it 2-3 characters or a numeric identifier) would be enough for the game to know whether or not the data was its own even if the hash gave a false positive.
Ah … sorry I mis-understood. Hence including a guard in the gash would be enough (hopefully) to make it unique.
In this case at least it would mean Harambe’s Revenge wouldn’t have to worry.
The other two games it’s purported to clash with would still have to worry, however after a quick skim I’m still not sure the values attested for Harambe’s Revenge are correct. It looks to me like it should actually start at address 16 and continue to address 26 (i.e. 11 bytes of data).
(Note: should have said 26 in the above comment, not 27. I’m used to dealing with half-open ranges on account of using C++ using half-open iterators, so fully inclusive ranges tend to throw me.)
Summary version of the rest of the conversation:
- Acedent suggested using a 2-char identifier, as you’ve done with many of your games. (I.e. with the implication that the library would have some kind of support function for that.)
- I suggested that:
- A) It wouldn’t really be necessary because I don’t think it’s likely that many games are actually going to end up using both the same address and the same data if everyone is following the advice to pick a randomish address to store their data at
- B) If there were going to be some second measure, simply exoring a 4-byte ID with the hash might be enough to act as a second layer of protection. The advantage there would be not wasting any extra EEPROM.
- Acedent also suggested inserting a 1-byte ID value into the hash calculation as an alternative way to avoid using up more EEPROM.
On Wednesday night I made a brief, falling-asleep-at-my-desk attempt to calculate the odds of the exor technique also failing.
This morning I’m back to arguing that I think it’s a waste of time considering a secondary guard (i.e. beyond the hash function) after reexamining the 11 game list and reaffirming that 4 are exceptional circumstances and the rest are all saving their data at address 16, which goes against the ‘pick a random address’ recommendation.
I’m ok with the outcome. As you rightly say, this is clearly rare and unusual. It can be fixed easily (by offsetting the start address by 1 byte+)… and personally I’m ok with adding a 2 char
signature… (just those bytes will be in the struct and written to EEPROM, rather than passed RAM to the functions).
Again, the user could just make their own function for it if they really wanted to.
void saveData(char guardA, char guardB, const PlayerData & playerData)
{
// Note:
// The hash won't reflect the guard values,
// but that doesn't really matter.
const auto hashValue = Arduboy2EEPROM::hash(playerData);
uintptr_t address = saveAddress;
Arduboy2EEPROM::write(address, hashValue);
address += sizeof(hashValue);
Arduboy2EEPROM::write(address, guardA);
address += sizeof(guardA);
Arduboy2EEPROM::write(address, guardB);
address += sizeof(guardB);
Arduboy2EEPROM::write(address, playerData);
}
That’s the whole point of taking a minimalist approach:
don’t try to predict everything the user will need, just give the user the tools to build the functionality they need.
(Though 2 bytes of RAM isn’t a huge sacrifice if they want to do it the lazy way - including the char
s in the struct - and/or don’t know how to write the above function and its ‘load’ equivalent.)
Actually, by passing as a function argument it’s registers that you hope the values to end up in. (For data as small as chars anyway.)
Thanks – I appreciate the example implementation.
Although it’s still 2 bytes written to EEPROM rather than a RAM based seed value, passed to the hash function. Not sure there’s any way around that.
Out of curiosity, do you have a dislike for overloaded functions? Perhaps there’s a good reason to avoid them?
Edit: Never mind – the overloaded code gets ugly esp. with the extensive Doyxgen comments… although it might be manageable if the prototypes were split out and we had foo.h
and foo.cpp
.
Well, I have satisfied my curiosity and now totally agree. Keep It Simple!
Yeah, in fairness you can’t really do that without basically reimplementing the hash function simply because of how the hash function works.
No, I have nothing against overloads.
What I do have is a dislike for adding too many functions, because every extra overload has to be documented and maintained, and I’m the one who has to write that documentation, and either I or MLXXXp (or both of us) will be the one to maintain it.
Ultimately the use of a feature must outweigh the cost of adding the feature.
The cost of adding a feature includes writing, testing, documenting, and maintaining it.
When you add that up, even ten lines of code can become quite expensive.
Hrm, that’s not quite what I was expecting you meant.
I thought you meant trying to insert an extra value into the sequence:
static HashType hash(const unsigned char * data, size_t size, uint8_t extra)
{
HashType value = size;
value = (((value << 5) ^ (value >> 27)) ^ extra);
for(size_t index = 0; index < size; ++index)
value = (((value << 5) ^ (value >> 27)) ^ data[index]);
return value;
}
I’m not sure if replacing the initial value with a user-defined one would invalidate any of the properties of the hash. Sometimes it’s only a convinient way of having a non-zero initial value, other times it’s important. I don’t really like messing around with hash algorithms because even the slightest thing can change their characteristics.
By the way your version would violate one of the guarantees I made for the seedless version:
If
size
is0
, the returned hash code will also be0
.
For your version the guarantee would have to change to:
If
size
is0
, the returned hash code will also be the value ofseed
.
I get the impression you didn’t realise how many functions would have to be overloaded.
Technically there’s a way to avoid needing the extra overloads for the templated functions, but it would be detrimental to the clarity of the documentation. It’s much better to have individually documented overloads.
It’s technically possibly to shunt the Doxygen comments into a separate file, but I’m not entirely sure how, and someone still has to write the documentation.
You can only do that for non-template functions.
Template function definitions must remain accessible from the header because the compiler needs to know the full definition ito be able to instantiate the function if requested to do so.
(For more detail see: “Why can’t I separate the definition of my template class from its declaration and put it inside a .cpp file?”.)
You can technically declare the template functions and then define them later on, so you could shunt the definitions into another header, but I try to avoid doing that unless there’s a tangible benefit.
– My instinct here, is that it’s a cheap way to seed the hash. I should probably read up on the actual function… here. Chapter 6.4, p.513. (That was the cited source… but haven’t found it yet!?)
Update: From my skim reading, I haven’t found the algorithm in the cited source. Perhaps it’s not by DEK? Or it’s only the 1st edition? However I did find this ~
Here, under the heading ‘A common weakness’.
Here, under ‘Hashing Sequences of Characters’…
This is a nice explanation of how the shift parameters were tuned for ASCII text. I’m not sure if this reduces its merit as a general hash?..
It was cited as “under the topic of sorting and search chapter 6.4.”, no page given.
I’m begining to suspect that that it might not be.
Though if it were, it might be expressed in mathematical terms.
E.g. exor might actually be expressed as addition modulo 2, and shifting might be expressed as multiplication or division by a power of 2.
That would make it harder to recognise.
That C code looks older than I am! It’s using pre-ANSI parameter syntax.
That makes it both invalid C++ and invalid C99 code.
I didn’t read the whole thing, but I did test its assertion about hash("EXXXXXB") == hash("AXXXXXC")
, and it’s true:
using System;
namespace ConsoleApplication1
{
class Program
{
static uint hash(string key)
{
uint result = (uint)key.Length;
for (int index = 0; index < key.Length; ++index)
result = ((result << 5) ^ (result >> 27)) ^ (byte)key[index];
return result;
}
static void Main(string[] args)
{
var keys = new string[] { "EXXXXXB", "AXXXXXC" };
foreach (var key in keys)
Console.WriteLine("{0} = {1:X}", key, hash(key));
Console.ReadKey();
}
}
}
However, those Xs aren’t accidental - those intermediate values are likely to affect the hash.
Yes, it claims it’s a CRC, which could be correct.
(I’ve never found a satisfactory answer to the question of how CRCs work.)
I’ve got plenty of other hashes to choose from if this one is duff.
Though as you can see, doing it ‘properly’ isn’t easy.
Really it’s a job for a mathematician, which I’m certainly not.
(My highest maths qualification is a B at GCSE level.)
Potentially.
ASCII characters will always have a 0 in bit 7 by definition, and the frequency of values is likely to gravitate towards the letters, space, and digits, with some symbol and many unprintable characters never being encountered.
Arduboy games would not have those restrictions on their save data.
Without an analysis, it would be hard to say exactly what byte values might be expected from Arduboy save data, so I would just presume ‘anything and everything’.
By the way, now that @Mr.Blinky is back I’m going to try to give him some time to catch up a bit and see what his thoughts are on what’s been said in regards to whether the API is suitable for the FX, so I may not reply to any future replies for a little while, just to give him time.
One small thing:
I’ve run the only test I really know how to do with a hash, and resolved to replace the supposed ‘Knuth’ hash with John Skeet’s hash because it appears to express superior properties.
I can provide a demo program on request, but you’ll need over 4GB spare to run it because it allocates a 4GB array. (Tricking the compiler into allowing me to allocate an array of that size took a bit of fiddling. I’m still not completely sure why some ways work and others don’t.)
I’ve not had time to try, but I found there is a popular hash test suite ‘smhasher’ (like ‘diehard’). Also some more functions here and here.
Is it Jon Skeet’s post on StackOverflow (using magic primes and multiply) ?
If so, its source is ‘Effective Java: Second Edition’, 2008, p.47 onwards, by Joshua Bloch.
I dread to think how long that will take to run if my tests were anything to go by. Though perhaps they’re less brute force.
Either way, I’m not sure it’s worth it, and I’m already convinced that Skeet hash is the right hash for the job.
Yes that one.
Not quite. You’re only looking at the first implementation, which uses some pretty miniscule primes.
Later on in the answer Skeet provides a second one, implied to be of his own devising (hence I named it ‘Skeet hash’, not ‘Bloch hash’) with some much larger numbers, which is what I’ve been testing.
Basically my test consists of:
- Allocate a 4GB array of bytes, all set to zero
- Effectively
uint8_t array[4294967296]
, but I have to dynamically allocate it and jump through some hoops to get the compiler to not reject it.
- Effectively
- For
x
=
0
to4294967295
:value = hash(x);
-
if(array[value] < 255)
++array[value];
- Declare
uint64_t count[256];
- Declare `uint64_t total = 0;
- For
x
=
0
to4294967295
:++count[array[value]];
total += array[value];
-
if(total >= 4294967296)
-
break;
- exit early if all the remaining entries will be 0
-
I ran that on both the Knuth hash and the Skeet hash.
The Knuth hash spat out this:
Knuth:
0 = 4286578688
1 = 0
2 = 0
3 = 0
4 = 0
5 = 0
6 = 0
7 = 0
8 = 0
9 = 0
10 = 0
11 = 0
12 = 0
13 = 0
14 = 0
15 = 0
16 = 0
17 = 0
18 = 0
19 = 0
20 = 0
21 = 0
22 = 0
23 = 0
24 = 0
25 = 0
26 = 0
27 = 0
28 = 0
29 = 0
30 = 0
31 = 0
32 = 0
33 = 0
34 = 0
35 = 0
36 = 0
37 = 0
38 = 0
39 = 0
40 = 0
41 = 0
42 = 0
43 = 0
44 = 0
45 = 0
46 = 0
47 = 0
48 = 0
49 = 0
50 = 0
51 = 0
52 = 0
53 = 0
54 = 0
55 = 0
56 = 0
57 = 0
58 = 0
59 = 0
60 = 0
61 = 0
62 = 0
63 = 0
64 = 0
65 = 0
66 = 0
67 = 0
68 = 0
69 = 0
70 = 0
71 = 0
72 = 0
73 = 0
74 = 0
75 = 0
76 = 0
77 = 0
78 = 0
79 = 0
80 = 0
81 = 0
82 = 0
83 = 0
84 = 0
85 = 0
86 = 0
87 = 0
88 = 0
89 = 0
90 = 0
91 = 0
92 = 0
93 = 0
94 = 0
95 = 0
96 = 0
97 = 0
98 = 0
99 = 0
100 = 0
101 = 0
102 = 0
103 = 0
104 = 0
105 = 0
106 = 0
107 = 0
108 = 0
109 = 0
110 = 0
111 = 0
112 = 0
113 = 0
114 = 0
115 = 0
116 = 0
117 = 0
118 = 0
119 = 0
120 = 0
121 = 0
122 = 0
123 = 0
124 = 0
125 = 0
126 = 0
127 = 0
128 = 0
129 = 0
130 = 0
131 = 0
132 = 0
133 = 0
134 = 0
135 = 0
136 = 0
137 = 0
138 = 0
139 = 0
140 = 0
141 = 0
142 = 0
143 = 0
144 = 0
145 = 0
146 = 0
147 = 0
148 = 0
149 = 0
150 = 0
151 = 0
152 = 0
153 = 0
154 = 0
155 = 0
156 = 0
157 = 0
158 = 0
159 = 0
160 = 0
161 = 0
162 = 0
163 = 0
164 = 0
165 = 0
166 = 0
167 = 0
168 = 0
169 = 0
170 = 0
171 = 0
172 = 0
173 = 0
174 = 0
175 = 0
176 = 0
177 = 0
178 = 0
179 = 0
180 = 0
181 = 0
182 = 0
183 = 0
184 = 0
185 = 0
186 = 0
187 = 0
188 = 0
189 = 0
190 = 0
191 = 0
192 = 0
193 = 0
194 = 0
195 = 0
196 = 0
197 = 0
198 = 0
199 = 0
200 = 0
201 = 0
202 = 0
203 = 0
204 = 0
205 = 0
206 = 0
207 = 0
208 = 0
209 = 0
210 = 0
211 = 0
212 = 0
213 = 0
214 = 0
215 = 0
216 = 0
217 = 0
218 = 0
219 = 0
220 = 0
221 = 0
222 = 0
223 = 0
224 = 0
225 = 0
226 = 0
227 = 0
228 = 0
229 = 0
230 = 0
231 = 0
232 = 0
233 = 0
234 = 0
235 = 0
236 = 0
237 = 0
238 = 0
239 = 0
240 = 0
241 = 0
242 = 0
243 = 0
244 = 0
245 = 0
246 = 0
247 = 0
248 = 0
249 = 0
250 = 0
251 = 0
252 = 0
253 = 0
254 = 0
255 = 8388608
Which basically means nearly everything clashed at least 255 times (potentially a lot more).
(I don’t have the hard drive space to spare to give you the exact distribution, a text representation would be on the order of tens of gigabytes, and it would likely take a few hours to churn that out.)
The Skeet hash on the other hand:
Skeet:
0 = 0
1 = 4294967296
Literal perfection. This is what no collisions looks like.
For 4-byte inputs, the Skeet hash is bijective, whereas the Knuth hash is non-injective, non-surjective.
The reason I thought to do this test is because it’s what I did many moons ago to the xorshift hash I nicknamed ‘SkullHash’ (which also got a perfect result) - the one used in Minesweeper.
(I acually tried to do it in Haskell, did a terrible job of it and ended up sending my computer into a paging panic. On the bright side, I’ll never forget to check if I’m using the lazy or strict foldl
again.)
At any rate, if you want to run the smhasher tests and look for alternatives feel free to, but I am 100% sold on the benefits of the Skeet hash, and intend to use that to replace the Knuth hash .
HashType hash { 2166136261 };
for(size_t index = 0; index < size; ++index)
hash = ((hash * 16777619) ^ pointer[index]);
return hash;
I can move the starting address and repack the game if needed. Where is it safe to save data?