Fast 256 bit integer addition

I am not thinking all boards together, but one board of your choosing. If you want more flash, add the flash board, if crypto is needed add a crypto board, if wifi is needed add wifi. Perhaps some functions can be combined on the same tiny board, but already being able to add one function for a few bucks would be a nice way to extend the Arduboy.

If you have a specific chip for a specific project and want a little help shoehorning it into the thing just let me know.

In which case the 3D-printed back @Mr.Blinky made for the early Team Falcon members might be of use.

The ‘cartridge’ design might well work for having different cartridges with different board types.

(Unfortunately I don’t have a photo to hand. There should be one somewhere in the Project Falcon section.)

Thanks. For the time being it is something I will keep in mind when I run into hard limits. Let’s see how much assembler code can speed up my little 256b addition project.

1 Like

Ah yes, it was actually @n602 who made the original replacement backplate.

And it evolved over time

2 Likes

For interest, here are the cycle timings for this on ATMega32U4:

LD r0, X ; Read left = 1 cycle
LD r1, Z+ ; Read right = 2 cycles
ADC r0, r1 ; Add with carry = 1 cycle
ST X+, r0 ; Store to left = 2 cycles

total=6 cycles, or over 256 bits, 32x6=192 cycles.

Arduboy running at 16MHz (16 cycles per microsecond) would therefore run the code in 192/16=12 microseconds. You could do a million 256-bit additions in 12 seconds. HTH.

Thanks. I just timed my fastest non-asm routine and I get 36us, meaning 3x slower than your theoretical asm performance. The micros() function has a resolution of 4us so 36us may not be totally accurate. Maybe there is some other overhead not captured in the theoretical cycle count, but this should be a fairly accurate estimate. 3x faster is worth the trouble in my case.

(If you want a more accurate timing estimate, run your function many times, possibly in a loop, ideally back-to-back copies of the code to minimize the influence of the loop code (or a mix of both) and divide the result by the number of runs. )

Did a more careful averaged timing run and it comes out to 32.4us, or 2.7x slower.

Note when using r1. Make sure you clear r1 (clr r1) at the end of the assembly code as r1 is used as zero register and is expected to be 0

Just for fun, here’s a an assembly version that uses a loop to not waste too much PROGMEM. Each add part takes 6 cycles and 3 cycles for looping:

inline bool add256b(uint8_t * a, uint8_t* b)
{
  uint8_t result;
  uint8_t tmp;
  asm volatile
  ( 
    "   ldi     %[result], 256/8    \n" // use result as bitcounter
    "   clc                         \n" // clear initial carry
    "1:                             \n"
    "   ld      r0, %a[src]+        \n" 
    "   ld      %[tmp], %a[dst]     \n"
    "   adc     %[tmp], r0          \n" // a[] += b[]
    "   st      %a[dst]+, %[tmp]    \n"
    "   dec     %[result]           \n" // bitcounter-- (dec does not affect carry flag)
    "   brne    1b                  \n" // loop until bitcount == 0
    "   sbc     %[result], %[result]\n" // 0 if final carry is clear, 255 if final carry is set
    : [src]    "+e" (b),
      [dst]    "+e" (a),
      [tmp]    "=r" (tmp),
      [result] "=d" (result)
    :
  );
  return result;
}

If you want to unfold the code to save those loop cycles and a few setup cycles:

inline bool add256b(uint8_t * a, uint8_t* b)
{
  uint8_t result;
  asm volatile
  ( 
    "   ld      r0, %a[src]        \n" 
    "   ld      %[result], %a[dst] \n"
    "   add     %[result], r0      \n" // a[0] += b[0]
    "   st      %a[dst], %[result] \n"
    
    "   ldd     r0, %a[src]+1           \n" 
    "   ldd     %[result], %a[dst]+1    \n"
    "   adc     %[result], r0           \n" // a[1] += b[1]
    "   std     %a[dst]+1, %[result]    \n"

    //Repeat above block for itterations 2 to 30 yourself
    
    "   ldd     r0, %a[src]+31          \n" 
    "   ldd     %[result], %a[dst]+31   \n"
    "   adc     %[result], r0           \n" // a[1] += b[31]
    "   std     %a[dst]+31, %[result]   \n"
    
    "   sbc     %[result], %[result]\n" // 0 if final carry is clear, 255 if final carry is set
    : [result] "=r" (result)
    : [src]    "e" (b),
      [dst]    "e" (a)
  );
  return result;
}

in the unfolded version I used LDD and STD versions rather then using the auto increment versions LD/ST X/Y/Z+ it’s a bit more typing but this way the pointers are unchanged which may lead to further optimized code.

2 Likes

Just make a problem sufficiently interesting and someone else will solve it for you ;D

1 Like

This is awesome! One question. After the assembler code I would like to do something with the result of the addition, which should be in ‘a’. When I call the function the result of the addition is indeed in ‘a’. But when I check ‘a’ inside the add256b() function just after the assembler code, then ‘a’ does not hold the correct result. How can I access the result of the addition after the assembler code but still inside the function?

gasp Scoping! My worst nightmare!

Just a tiny correction to your code. The second parameter ‘b’ also needs to be a pointer ‘uint8_t * b’. You are missing the asterisk.

Thats because both pointers point after the the numbers due to the auto increment. To ‘restore’ them add these lines after the sbc line

"   subi     %A[dst], 256/8 \n" // restore a
"   sbci     %B[dst], 0     \n"

"   subi     %A[src], 256/8 \n" // restore b
"   sbci     %B[src], 0     \n"

You don’t need to add these when you use the unrolled example.

Oops typo.

I have the looped version working. But when I try to compile the unrolled version I get this error:

ccwiV4Fp.s: Assembler messages:
ccwiV4Fp.s:4417: Error: pointer register (Y or Z) required

I assume this has something to do with these lines

[src]    "e" (b),
[dst]    "e" (a)

Maybe changing “e” to “z” or “y”? Just guessing, but it doesn’t work with “z” or “y” either.

I only picked r0 and r1 as an example (I probably should have made that clearer).

I wanted to stay away from the higher registers because I know some of the higher registers form X and Z (and can never remember which).

It’s a fair point nonetheless.

I’d suggest being careful about returning bool though, because sometimes the conversion from uint8_t to bool can result in an added cost in terms of progmem.


@wasshuber

It would also be wise to provide wrapper functions for @Mr.Blinky’s functions to make sure that the arrays passed to the function are large enough, as well as marking b as const because it’s not modified at any point.

If you’re only ever dealing with 32-byte arrays then the easiest answer would be something like:

inline bool add256BitUnsafe(uint8_t * a, const uint8_t* b)
{
	// Assembly and stuff
}

inline bool add256Bit(uint8_t (&a)[32], const uint8_t (&a)[32])
{
	return add256BitUnsafe(&a[0], &b[0]);
}

If you’re dealing with arrays larger than that you’d need to use templates, so something like:

inline bool add256BitUnsafe(uint8_t * a, const uint8_t* b)
{
	// Assembly and stuff
}

template<size_t sizeA, size_t sizeB>
inline bool add256Bit(uint8_t (&a)[sizeA], const uint8_t (&a)[sizeB])
{
	static_assert(sizeA > (256 / 8), "Array a is not large enough");
	static_assert(sizeB > (256 / 8), "Array b is not large enough");

	return add256BitUnsafe(&a[0], &b[0]);
}

(There’s better ways to do that than using static_assert, but unfortunately Arduino doesn’t have a readily available <type_traits> implementation.)

Apparently I pasted an earlier version. In the unrolled version It should be:

[src]    "b" (b),
[dst]    "b" (a)

Thanks. Now the error is: can’t find a register in class ‘BASE_POINTER_REGS’ while reloading ‘asm’