I am trying to implement a fast 256bit integer addition with overflow detection. The fastest code I have so far is:
void add256b(uint32_t* a, uint32_t* b){
uint64_t sum_carry = 0;
for (int8_t i=0; i<8; i++){//calculate a + b
sum_carry += (uint64_t)a[i] + b[i];
a[i] = sum_carry;
if (sum_carry > 4294967295) sum_carry = 1;
else sum_carry = 0;
}
...
}
I am looking for ideas of how to make it faster. I was trying to use the internal carry flag that is bit zero in the SREG variable. With it one should in principle be able to do something like
a[i] += b[i] + carry;
carry = SREG & 0x01;
This would be helpful, because then the carry only needs to be a one byte integer and a[] and b[] could be 64 bit integers, reducing the number of passes through the loop. However this does not work, because SREG may not hold the flags from the prior addition. The C optimizer can mix things up, because it doesnāt understand the dependence of SREG to the prior addition. Any ideas?
Maybe just do inline assembly? An annoying part of C is that carry bits are not catered for, whereas if you code directly for AVR you can add-with-carry (ADC) down at the register level. (Itās an 8-bit processor, so all the 32-bit stuff in C gets converted to 8-bit instructions by the compiler anyway).
If you need something portable this should work, but I havenāt tested it.
Iāve left some comments so it should be easy enough to follow, but basically the idea is that it exploits the fact that when doing a 16-bit addition the lowest bit of the most significant byte is effectively the same as the overflow for an 8-bit addition.
It thus captures that bit and feeds it back into the algorithm.
Note that the body of the loop is branchless, and it could be unrolled to potentially achieve more speed if the compiler isnāt already doing that.
As has been mentioned an assembly version might work out better if you donāt need portability, but if you do need portability thereās no sense in me wasting a trip to the AVR ASM documentation. :P
For the record, the reason using uint64_t and/or uint32_t is a bad idea is that the compiler is likely to be generating a bunch of add instructions in a loop for them anyway, so youāre probably creating more work than if you just do a byte-for-byte addition yourself.
Thank you for the code. This is a clever idea to first ignore the carry and then figure it from the results. However, it isnāt faster on the Arduboy. But I played around with the SREG and the following seems to work and is quite a bit faster than everything else I tested so far (besides dropping down to assembler):
uint8_t add256b(uint32_t* a, const uint32_t* b) {
uint8_t carry1 = 0;
uint8_t carry2 = 0;
for (int8_t i=0; i<8; i++){//calculate a + b
a[i] += b[i];
carry2 = SREG & 0x01;
if (carry1) {
a[i] += carry1;
carry1 = SREG & 0x01;
}
carry1 += carry2;
}
return carry1;
}
The trick was to introduce a second carry and split the add into two lines, one adding a + b and then the second one adding the carry.
Yes, I am exploring cryptography applications, but at this point it is primarily to find the limits of the Arduboy. How long does it take to calculate certain cryptographic primitives? Is it feasible on the Arduboy or not?
Does anybody have an idea how much faster it would get using assembler? Are we talking 10-20% faster? Or 2x faster? Or 10x faster?
I have not considered this option, but it is an interesting thought. One could have a range of similar hardware additions to modify the Arduboy: flash-memory, crypto chip, real time clock, wifi ⦠If each one is only a few bucks that would increase the application space for the Arduboy considerably.
Ok, I will try that if I can figure out how to add inline assembler with the Arduino IDE.
Not really, because the Arduboy already comes in a really nice package. If all one needs to do is add a little extension board then this is much cheaper and easier than a complete redesign with a new microcontroller.
It all depends on how costly such extension boards are and how difficult they are to add. Ideal would be some kind of connector already on the PCB on the Arduboy to which small extension boards can be added to the i2c or SPI bus, and everything is still inside the enclosure.
When it comes down to it, āArduinoā code is just C++ with an extra preprocessing phase and specific GCC flags (e.g. -fpermissive, which is a nightmare to deal with), so normal GCC inline assembly rules apply, but using AVR assembly.
I am not thinking all boards together, but one board of your choosing. If you want more flash, add the flash board, if crypto is needed add a crypto board, if wifi is needed add wifi. Perhaps some functions can be combined on the same tiny board, but already being able to add one function for a few bucks would be a nice way to extend the Arduboy.
Thanks. For the time being it is something I will keep in mind when I run into hard limits. Letās see how much assembler code can speed up my little 256b addition project.