Fast 256 bit integer addition

For interest, here are the cycle timings for this on ATMega32U4:

LD r0, X ; Read left = 1 cycle
LD r1, Z+ ; Read right = 2 cycles
ADC r0, r1 ; Add with carry = 1 cycle
ST X+, r0 ; Store to left = 2 cycles

total=6 cycles, or over 256 bits, 32x6=192 cycles.

Arduboy running at 16MHz (16 cycles per microsecond) would therefore run the code in 192/16=12 microseconds. You could do a million 256-bit additions in 12 seconds. HTH.

Thanks. I just timed my fastest non-asm routine and I get 36us, meaning 3x slower than your theoretical asm performance. The micros() function has a resolution of 4us so 36us may not be totally accurate. Maybe there is some other overhead not captured in the theoretical cycle count, but this should be a fairly accurate estimate. 3x faster is worth the trouble in my case.

(If you want a more accurate timing estimate, run your function many times, possibly in a loop, ideally back-to-back copies of the code to minimize the influence of the loop code (or a mix of both) and divide the result by the number of runs. )

Did a more careful averaged timing run and it comes out to 32.4us, or 2.7x slower.

Note when using r1. Make sure you clear r1 (clr r1) at the end of the assembly code as r1 is used as zero register and is expected to be 0

Just for fun, here’s a an assembly version that uses a loop to not waste too much PROGMEM. Each add part takes 6 cycles and 3 cycles for looping:

inline bool add256b(uint8_t * a, uint8_t* b)
{
  uint8_t result;
  uint8_t tmp;
  asm volatile
  ( 
    "   ldi     %[result], 256/8    \n" // use result as bitcounter
    "   clc                         \n" // clear initial carry
    "1:                             \n"
    "   ld      r0, %a[src]+        \n" 
    "   ld      %[tmp], %a[dst]     \n"
    "   adc     %[tmp], r0          \n" // a[] += b[]
    "   st      %a[dst]+, %[tmp]    \n"
    "   dec     %[result]           \n" // bitcounter-- (dec does not affect carry flag)
    "   brne    1b                  \n" // loop until bitcount == 0
    "   sbc     %[result], %[result]\n" // 0 if final carry is clear, 255 if final carry is set
    : [src]    "+e" (b),
      [dst]    "+e" (a),
      [tmp]    "=r" (tmp),
      [result] "=d" (result)
    :
  );
  return result;
}

If you want to unfold the code to save those loop cycles and a few setup cycles:

inline bool add256b(uint8_t * a, uint8_t* b)
{
  uint8_t result;
  asm volatile
  ( 
    "   ld      r0, %a[src]        \n" 
    "   ld      %[result], %a[dst] \n"
    "   add     %[result], r0      \n" // a[0] += b[0]
    "   st      %a[dst], %[result] \n"
    
    "   ldd     r0, %a[src]+1           \n" 
    "   ldd     %[result], %a[dst]+1    \n"
    "   adc     %[result], r0           \n" // a[1] += b[1]
    "   std     %a[dst]+1, %[result]    \n"

    //Repeat above block for itterations 2 to 30 yourself
    
    "   ldd     r0, %a[src]+31          \n" 
    "   ldd     %[result], %a[dst]+31   \n"
    "   adc     %[result], r0           \n" // a[1] += b[31]
    "   std     %a[dst]+31, %[result]   \n"
    
    "   sbc     %[result], %[result]\n" // 0 if final carry is clear, 255 if final carry is set
    : [result] "=r" (result)
    : [src]    "e" (b),
      [dst]    "e" (a)
  );
  return result;
}

in the unfolded version I used LDD and STD versions rather then using the auto increment versions LD/ST X/Y/Z+ it’s a bit more typing but this way the pointers are unchanged which may lead to further optimized code.

2 Likes

Just make a problem sufficiently interesting and someone else will solve it for you ;D

1 Like

This is awesome! One question. After the assembler code I would like to do something with the result of the addition, which should be in ‘a’. When I call the function the result of the addition is indeed in ‘a’. But when I check ‘a’ inside the add256b() function just after the assembler code, then ‘a’ does not hold the correct result. How can I access the result of the addition after the assembler code but still inside the function?

gasp Scoping! My worst nightmare!

Just a tiny correction to your code. The second parameter ‘b’ also needs to be a pointer ‘uint8_t * b’. You are missing the asterisk.

Thats because both pointers point after the the numbers due to the auto increment. To ‘restore’ them add these lines after the sbc line

"   subi     %A[dst], 256/8 \n" // restore a
"   sbci     %B[dst], 0     \n"

"   subi     %A[src], 256/8 \n" // restore b
"   sbci     %B[src], 0     \n"

You don’t need to add these when you use the unrolled example.

Oops typo.

I have the looped version working. But when I try to compile the unrolled version I get this error:

ccwiV4Fp.s: Assembler messages:
ccwiV4Fp.s:4417: Error: pointer register (Y or Z) required

I assume this has something to do with these lines

[src]    "e" (b),
[dst]    "e" (a)

Maybe changing “e” to “z” or “y”? Just guessing, but it doesn’t work with “z” or “y” either.

I only picked r0 and r1 as an example (I probably should have made that clearer).

I wanted to stay away from the higher registers because I know some of the higher registers form X and Z (and can never remember which).

It’s a fair point nonetheless.

I’d suggest being careful about returning bool though, because sometimes the conversion from uint8_t to bool can result in an added cost in terms of progmem.


@wasshuber

It would also be wise to provide wrapper functions for @Mr.Blinky’s functions to make sure that the arrays passed to the function are large enough, as well as marking b as const because it’s not modified at any point.

If you’re only ever dealing with 32-byte arrays then the easiest answer would be something like:

inline bool add256BitUnsafe(uint8_t * a, const uint8_t* b)
{
	// Assembly and stuff
}

inline bool add256Bit(uint8_t (&a)[32], const uint8_t (&a)[32])
{
	return add256BitUnsafe(&a[0], &b[0]);
}

If you’re dealing with arrays larger than that you’d need to use templates, so something like:

inline bool add256BitUnsafe(uint8_t * a, const uint8_t* b)
{
	// Assembly and stuff
}

template<size_t sizeA, size_t sizeB>
inline bool add256Bit(uint8_t (&a)[sizeA], const uint8_t (&a)[sizeB])
{
	static_assert(sizeA > (256 / 8), "Array a is not large enough");
	static_assert(sizeB > (256 / 8), "Array b is not large enough");

	return add256BitUnsafe(&a[0], &b[0]);
}

(There’s better ways to do that than using static_assert, but unfortunately Arduino doesn’t have a readily available <type_traits> implementation.)

Apparently I pasted an earlier version. In the unrolled version It should be:

[src]    "b" (b),
[dst]    "b" (a)

Thanks. Now the error is: can’t find a register in class ‘BASE_POINTER_REGS’ while reloading ‘asm’

compiles fine here (Arduino IDE 1.8.13). The issue is probably triggered by your other code. Could you post the complete function?

You can put your code between two sets of ``` to format a code block.

Why are all the code snippets making the function inline? That is not necessarily a good idea and it’s likely to increase the register pressure hugely.

I am using the same 1.8.13 Arduino IDE. I have now removed from the sketch pretty much everything else except the following function. Still getting the same error:

inline bool add256bAsm2(uint8_t * a, uint8_t * b) {
  uint8_t result;
  asm volatile
  ( 
    "   ld      r0, %a[src]        \n" 
    "   ld      %[result], %a[dst] \n"
    "   add     %[result], r0      \n" // a[0] += b[0]
    "   st      %a[dst], %[result] \n"
    
    "   ldd     r0, %a[src]+1           \n" 
    "   ldd     %[result], %a[dst]+1    \n"
    "   adc     %[result], r0           \n" // a[1] += b[1]
    "   std     %a[dst]+1, %[result]    \n"
    
    "   ldd     r0, %a[src]+2           \n" 
    "   ldd     %[result], %a[dst]+2    \n"
    "   adc     %[result], r0           \n" // a[2] += b[2]
    "   std     %a[dst]+2, %[result]    \n"
    
    "   ldd     r0, %a[src]+3           \n" 
    "   ldd     %[result], %a[dst]+3    \n"
    "   adc     %[result], r0           \n" // a[3] += b[3]
    "   std     %a[dst]+3, %[result]    \n"
    
    "   ldd     r0, %a[src]+4           \n" 
    "   ldd     %[result], %a[dst]+4    \n"
    "   adc     %[result], r0           \n" // a[4] += b[4]
    "   std     %a[dst]+4, %[result]    \n"
    
    "   ldd     r0, %a[src]+5           \n" 
    "   ldd     %[result], %a[dst]+5    \n"
    "   adc     %[result], r0           \n" // a[5] += b[5]
    "   std     %a[dst]+5, %[result]    \n"
    
    "   ldd     r0, %a[src]+6           \n" 
    "   ldd     %[result], %a[dst]+6    \n"
    "   adc     %[result], r0           \n" // a[6] += b[6]
    "   std     %a[dst]+6, %[result]    \n"
    
    "   ldd     r0, %a[src]+7           \n" 
    "   ldd     %[result], %a[dst]+7    \n"
    "   adc     %[result], r0           \n" // a[7] += b[7]
    "   std     %a[dst]+7, %[result]    \n"
    
    "   ldd     r0, %a[src]+8           \n" 
    "   ldd     %[result], %a[dst]+8    \n"
    "   adc     %[result], r0           \n" // a[8] += b[8]
    "   std     %a[dst]+8, %[result]    \n"
    
    "   ldd     r0, %a[src]+9           \n" 
    "   ldd     %[result], %a[dst]+9    \n"
    "   adc     %[result], r0           \n" // a[9] += b[9]
    "   std     %a[dst]+9, %[result]    \n"
    
    "   ldd     r0, %a[src]+10          \n" 
    "   ldd     %[result], %a[dst]+10   \n"
    "   adc     %[result], r0           \n" // a[10] += b[10]
    "   std     %a[dst]+10, %[result]   \n"
    
    "   ldd     r0, %a[src]+11          \n" 
    "   ldd     %[result], %a[dst]+11   \n"
    "   adc     %[result], r0           \n" // a[11] += b[11]
    "   std     %a[dst]+11, %[result]   \n"
    
    "   ldd     r0, %a[src]+12          \n" 
    "   ldd     %[result], %a[dst]+12   \n"
    "   adc     %[result], r0           \n" // a[12] += b[12]
    "   std     %a[dst]+12, %[result]   \n"
    
    "   ldd     r0, %a[src]+13          \n" 
    "   ldd     %[result], %a[dst]+13   \n"
    "   adc     %[result], r0           \n" // a[13] += b[13]
    "   std     %a[dst]+13, %[result]   \n"
    
    "   ldd     r0, %a[src]+14          \n" 
    "   ldd     %[result], %a[dst]+14   \n"
    "   adc     %[result], r0           \n" // a[14] += b[14]
    "   std     %a[dst]+14, %[result]   \n"
    
    "   ldd     r0, %a[src]+15          \n" 
    "   ldd     %[result], %a[dst]+15   \n"
    "   adc     %[result], r0           \n" // a[15] += b[15]
    "   std     %a[dst]+15, %[result]   \n"
    
    "   ldd     r0, %a[src]+16          \n" 
    "   ldd     %[result], %a[dst]+16   \n"
    "   adc     %[result], r0           \n" // a[16] += b[16]
    "   std     %a[dst]+16, %[result]   \n"
    
    "   ldd     r0, %a[src]+17          \n" 
    "   ldd     %[result], %a[dst]+17   \n"
    "   adc     %[result], r0           \n" // a[17] += b[17]
    "   std     %a[dst]+17, %[result]   \n"
    
    "   ldd     r0, %a[src]+18          \n" 
    "   ldd     %[result], %a[dst]+18   \n"
    "   adc     %[result], r0           \n" // a[18] += b[18]
    "   std     %a[dst]+18, %[result]   \n"
    
    "   ldd     r0, %a[src]+19          \n" 
    "   ldd     %[result], %a[dst]+19   \n"
    "   adc     %[result], r0           \n" // a[19] += b[19]
    "   std     %a[dst]+19, %[result]   \n"
    
    "   ldd     r0, %a[src]+20          \n" 
    "   ldd     %[result], %a[dst]+20   \n"
    "   adc     %[result], r0           \n" // a[20] += b[20]
    "   std     %a[dst]+20, %[result]   \n"
    
    "   ldd     r0, %a[src]+21          \n" 
    "   ldd     %[result], %a[dst]+21   \n"
    "   adc     %[result], r0           \n" // a[21] += b[21]
    "   std     %a[dst]+21, %[result]   \n"
    
    "   ldd     r0, %a[src]+22          \n" 
    "   ldd     %[result], %a[dst]+22   \n"
    "   adc     %[result], r0           \n" // a[22] += b[22]
    "   std     %a[dst]+22, %[result]   \n"
    
    "   ldd     r0, %a[src]+23          \n" 
    "   ldd     %[result], %a[dst]+23   \n"
    "   adc     %[result], r0           \n" // a[23] += b[23]
    "   std     %a[dst]+23, %[result]   \n"
    
    "   ldd     r0, %a[src]+24          \n" 
    "   ldd     %[result], %a[dst]+24   \n"
    "   adc     %[result], r0           \n" // a[24] += b[24]
    "   std     %a[dst]+24, %[result]   \n"
    
    "   ldd     r0, %a[src]+25          \n" 
    "   ldd     %[result], %a[dst]+25   \n"
    "   adc     %[result], r0           \n" // a[25] += b[25]
    "   std     %a[dst]+25, %[result]   \n"
    
    "   ldd     r0, %a[src]+26          \n" 
    "   ldd     %[result], %a[dst]+26   \n"
    "   adc     %[result], r0           \n" // a[26] += b[26]
    "   std     %a[dst]+26, %[result]   \n"
    
    "   ldd     r0, %a[src]+27          \n" 
    "   ldd     %[result], %a[dst]+27   \n"
    "   adc     %[result], r0           \n" // a[27] += b[27]
    "   std     %a[dst]+27, %[result]   \n"
    
    "   ldd     r0, %a[src]+28          \n" 
    "   ldd     %[result], %a[dst]+28   \n"
    "   adc     %[result], r0           \n" // a[28] += b[28]
    "   std     %a[dst]+28, %[result]   \n"
    
    "   ldd     r0, %a[src]+29          \n" 
    "   ldd     %[result], %a[dst]+29   \n"
    "   adc     %[result], r0           \n" // a[29] += b[29]
    "   std     %a[dst]+29, %[result]   \n"
    
    "   ldd     r0, %a[src]+30          \n" 
    "   ldd     %[result], %a[dst]+30   \n"
    "   adc     %[result], r0           \n" // a[30] += b[30]
    "   std     %a[dst]+30, %[result]   \n"
    
    "   ldd     r0, %a[src]+31          \n" 
    "   ldd     %[result], %a[dst]+31   \n"
    "   adc     %[result], r0           \n" // a[31] += b[31]
    "   std     %a[dst]+31, %[result]   \n"
    
    "   sbc     %[result], %[result]\n" // 0 if final carry is clear, 255 if final carry is set
    : [result] "=r" (result)
    : [src]    "b" (b),
      [dst]    "b" (a)
  );
  return result;
}

Are you using the board package, I bet @mr.blinky is not sure how much that changes to effect something like that but it is a compile related thing?

Because if you want to put a function body in a header file (such that it can be used in more than one translation unit) it has to be marked inline.
If you don’t use an inline specifier you’ll get a linker error.

As stated by cppreference:

An inline function […] has the following properties:

  1. The definition of an inline function […] must be reachable in the translation unit where it is accessed (not necessarily before the point of access).
  2. An inline function […] with external linkage (e.g. not declared static) has the following additional properties:
    1. There may be more than one definition of an inline function […] in the program as long as each definition appears in a different translation unit and (for non-static inline functions […]) all definitions are identical. For example, an inline function […] may be defined in a header file that is #include'd in multiple source files.
    2. It must be declared inline in every translation unit.
    3. It has the same address in every translation unit.

(I have redacted parts that are specific to C++17 because they are not relevant for Arduino code.)

The other alternative is to declare it in a header and define it in a .cpp file, but that’s typically more hassle than necessary for a small code base.

I think I’m getting inline confused with alwaysinline. It seems very obtuse to me to use the word inline to mean such very different things.

To be clear what I was originally speaking about is inlining where the compiler will insert the code “in line” rather than dispatching it as a function/subroutine with a CALL.

But I’m no longer sure that’s what’s going on here.