Fast 256 bit integer addition

I am trying to implement a fast 256bit integer addition with overflow detection. The fastest code I have so far is:

void add256b(uint32_t* a, uint32_t* b){
  uint64_t sum_carry = 0;
  for (int8_t i=0; i<8; i++){//calculate a + b
    sum_carry += (uint64_t)a[i] + b[i];
    a[i] = sum_carry;
    if (sum_carry > 4294967295) sum_carry = 1;
    else sum_carry = 0;
  }
  ...
}

I am looking for ideas of how to make it faster. I was trying to use the internal carry flag that is bit zero in the SREG variable. With it one should in principle be able to do something like

a[i] += b[i] + carry;
carry = SREG & 0x01;

This would be helpful, because then the carry only needs to be a one byte integer and a[] and b[] could be 64 bit integers, reducing the number of passes through the loop. However this does not work, because SREG may not hold the flags from the prior addition. The C optimizer can mix things up, because it doesn’t understand the dependence of SREG to the prior addition. Any ideas?

Maybe just do inline assembly? An annoying part of C is that carry bits are not catered for, whereas if you code directly for AVR you can add-with-carry (ADC) down at the register level. (It’s an 8-bit processor, so all the 32-bit stuff in C gets converted to 8-bit instructions by the compiler anyway).

If you want to stay strictly in C++ for portability this code will most likely be faster on AVR than using uint64_t

bool ADC_Inplace(uint8_t *dst, const uint8_t *src, uint8_t len, bool carry = false)
{
	while(len--){
		uint8_t pre = *dst;
		uint8_t post = pre + (*src++) + carry;
		*dst++ = post;
		carry = (post < pre) || (carry && (post == pre));
	}
	return carry;
}

forcing the compiler to do 8x uint64_t math and compare is a killer on AVR.

With the code above you only need 1 function for any size of integer.

If you need faster than this you should really use an unrolled assembly code version.

This seems like a strange thing to want to do.
Is it for cryptography, and does it have to be portable?

If you need something portable this should work, but I haven’t tested it.

I’ve left some comments so it should be easy enough to follow, but basically the idea is that it exploits the fact that when doing a 16-bit addition the lowest bit of the most significant byte is effectively the same as the overflow for an 8-bit addition.

It thus captures that bit and feeds it back into the algorithm.

Note that the body of the loop is branchless, and it could be unrolled to potentially achieve more speed if the compiler isn’t already doing that.

As has been mentioned an assembly version might work out better if you don’t need portability, but if you do need portability there’s no sense in me wasting a trip to the AVR ASM documentation. :P


For the record, the reason using uint64_t and/or uint32_t is a bad idea is that the compiler is likely to be generating a bunch of add instructions in a loop for them anyway, so you’re probably creating more work than if you just do a byte-for-byte addition yourself.

That’s a big number yo

Thank you for the code. This is a clever idea to first ignore the carry and then figure it from the results. However, it isn’t faster on the Arduboy. But I played around with the SREG and the following seems to work and is quite a bit faster than everything else I tested so far (besides dropping down to assembler):

uint8_t add256b(uint32_t* a, const uint32_t* b) {
  uint8_t carry1 = 0;
  uint8_t carry2 = 0;
  for (int8_t i=0; i<8; i++){//calculate a + b
    a[i] += b[i];
    carry2 = SREG & 0x01;
    if (carry1) {
      a[i] += carry1;
      carry1 = SREG & 0x01;
    }
    carry1 += carry2;
  }
  return carry1;
}

The trick was to introduce a second carry and split the add into two lines, one adding a + b and then the second one adding the carry.

Yes, I am exploring cryptography applications, but at this point it is primarily to find the limits of the Arduboy. How long does it take to calculate certain cryptographic primitives? Is it feasible on the Arduboy or not?

Does anybody have an idea how much faster it would get using assembler? Are we talking 10-20% faster? Or 2x faster? Or 10x faster?

Have you considered doing your own mod-chip and instead of adding flash memory, to add some crytpo chip?

I have not considered this option, but it is an interesting thought. One could have a range of similar hardware additions to modify the Arduboy: flash-memory, crypto chip, real time clock, wifi … If each one is only a few bucks that would increase the application space for the Arduboy considerably.

Theoretically if they all sit on the i2c bus it’s quite trivial

I thought that might be the case. ‘256 bits’ practically screams ‘256 bit encryption block’. :P

I’m going to reply to that with a quote from Read Admiral Grace Murray Hopper:

One accurate measurement is worth a thousand expert opinions.

The best I can give you is “it’s probably faster than these portable methods”.

If you don’t care about portability it’s worth concocting an assembly implementation.

If you load the X register with the left hand pointer and the Z register with the right hand pointer you could do something like 32 copies of:

LD r0, X ; Read left
LD r1, Z+ ; Read right
ADC r0, r1 ; Add with carry
ST X+, r0 ; Store to left

A waste of progmem, but potentially very fast.


By that point haven’t you just reinvented a more up market ARM chip?

Going for wifi probably means ending up with an ESP attached for that matter.

If possible it might be worth sticking the one that needs the most throughput on an SPI link.

Ok, I will try that if I can figure out how to add inline assembler with the Arduino IDE.

Not really, because the Arduboy already comes in a really nice package. If all one needs to do is add a little extension board then this is much cheaper and easier than a complete redesign with a new microcontroller.

It all depends on how costly such extension boards are and how difficult they are to add. Ideal would be some kind of connector already on the PCB on the Arduboy to which small extension boards can be added to the i2c or SPI bus, and everything is still inside the enclosure.

When it comes down to it, ‘Arduino’ code is just C++ with an extra preprocessing phase and specific GCC flags (e.g. -fpermissive, which is a nightmare to deal with), so normal GCC inline assembly rules apply, but using AVR assembly.

Useful references:

@MLXXXp and/or @Mr.Blinky would probably be the best people to ask about that.

Personally speaking I’d be surprised if it all fits, but I’m not a hardware person.

I am not thinking all boards together, but one board of your choosing. If you want more flash, add the flash board, if crypto is needed add a crypto board, if wifi is needed add wifi. Perhaps some functions can be combined on the same tiny board, but already being able to add one function for a few bucks would be a nice way to extend the Arduboy.

If you have a specific chip for a specific project and want a little help shoehorning it into the thing just let me know.

In which case the 3D-printed back @Mr.Blinky made for the early Team Falcon members might be of use.

The ‘cartridge’ design might well work for having different cartridges with different board types.

(Unfortunately I don’t have a photo to hand. There should be one somewhere in the Project Falcon section.)

Thanks. For the time being it is something I will keep in mind when I run into hard limits. Let’s see how much assembler code can speed up my little 256b addition project.

1 Like

Ah yes, it was actually @n602 who made the original replacement backplate.

And it evolved over time

2 Likes