AVR optimization (?)

I’m using Ardens to inspect some code and profile and etc. I wanted to talk about some things that are probably known but which tripped me up.

Starting with… why the heck this:

uint8_t b = 1 << a;

Is 10 times slower AND maybe requires a branch or two compared to:

uint8_t b = LUT[a];

where LUT is what you’d expect. Let’s look at the assembly. Please excuse the variable names:




Perhaps my years working on x86 are showing through, but I was extremely surprised to learn there’s no arbitrary left shift, only 1 at a time (hence the loop). I was ALSO surprised to learn the compiler didn’t catch this and generate a lookup table for me, the speed difference is absolutely absurd. I know the -Os flag is used for code size, but still… it’s 8 bytes!! 8 bytes and like 100,000 cycles (not joking).

I’m going to have to go through and replace all my left shifts with something better now. Even multiplication is faster, I just can’t even wrap my head around that…

The other one that I knew off the bat is ofc no hardware division. The generated division assembly takes up nearly 50% of all cycles for my original raycaster, and there’s only 3 in the main loop, amounting to about 300 divisions per frame. 50% of ALL cycles are held up in 13,500 divisions per second, which is… honestly understandable, given there’s no hardware for it. I wish I could remove these divisions, but I don’t think any of them can be precomputed or inverted to multiplication.

Here’s the weird part. The AVR docs say that their recommended implementation should take about 56 cycles on average, and something like 62 max. All my divisions are 2 bytes though, so that does change it. With 1 byte division, the 13,500 divisions should eat up 783000 cycles per second, which is nowhere near the 16 million cycles per second and not even close to 10% usage. So I’m surprised to see that double width is, seemingly, 10 times more complicated. I’m wondering if maybe I should add a custom division routine that copies from the AVR recommended docs… or if maybe my understanding of the division being performed is off.

@Mr.Blinky has a nice version of 1 << (shift & 7) in ArduboyFX that I’ve stolen several times: here.

Sprites drawing routines use multiplication for shifting, as it’s the same shift applied a bunch of times and is much faster as you’ve noted.

Division is tough too. I used approximate methods in ArduGolf.


Oo nice, thank you!

Edit: hang on, I just profiled it. The lookup table is the same speed lol. Ah well, I’m already using the table.

Gosh how have I not seen that one, that’s amazing! I feel so lowly

There’s always a bigger fish… John Carmack is out there somewhere…


AVR processors don’t have a barrel shifter.

I don’t know if it was a cost-cutting measure or just to make the processor simpler to design, but that’s the way it is.

I’m presuming the GCC team did the typical “well of course shifting is always faster than multiplying or look-up tables, no modern CPUs could possibly lack a barrel shifter”.

I’d file this under “incorrect assumptions developers frequently make”, alongside things like “char is always signed”, “CHAR_BIT will never be anything other than 8”, and sizeof(char) will never be the same as sizeof(int)".

Joking aside, I suspect the root of the problem is simply that nobody has actually gone through and written AVR-specific optimisations and that the generic optimisations make those aforementioned assumptions that shifting is always faster than multiplication or division or look-up tables.

The thing that really annoys me is that the compiler won’t replace uint8_t x = y >> 8 (where y is 16 bits or larger) with the equivalent of uint8_t x = reinterpret_cast<unsigned char *>(y)[1] (assuming little endian byte order).

If it did, the multiplication and division in my fixed point library would be a lot faster. Unfortunately the only thing I can do to fix it (other than providing a specialised named function) would sacrifice the constexpr modifier.

(Come to think of it, I probably could try making a special case for AVR that uses multiplication instead of shifting, though I’m reluctant because would feel like I’m having to make up for the inadequacies of the compiler.)

That’s becuase AVR does have a mul (and muls) instruction, though it takes two 8-bit operands and produces a 16-bit output, so your mileage may vary.

(I can’t provide a direct link to the instructions but check the documentation.)

Yep. Welcome to the worlds of embedded systems and CPU design.

(And to think, the Arduboy’s processor is still more powerful than an M6502.)

If you’ve got some simple examples that you could post then I’ll have a look and see if I can think of something, but I’m not much of a mathematician so don’t get your hopes up too much.

1 Like

Is this with an AVR specific toolchain?


I don’t know where one would get GCC, are you using what comes with a linux distro? I would only know to use what comes with the arduino package as I’m on PC. And I don’t know if there is anything specialized there?

Do you mean what I’m using? I’m just using the arduino cli, which is indeed an avr-specific gcc

Yes, I found that out after seeing that it can only shift 1 bit lol but thank you

Considering it’s an AVR-specific gcc I’m not sure if that’s the case, plus how would it know to generate a shift loop if gcc thought “they’d always have a barrel shifter”. It is 100% aware that there’s no barrel shifter; I think it chose no lookup simply because of optimal code size, or perhaps the number of situations where a lookup table is known to work is too low to bother with it.

But, I do know what compiler writers aren’t perfect, especially considering this probably isn’t critical stuff. Using an arduino to move motors and control lights and signals generally doesn’t require anything on the upper edge of the processing power.

So? Do it anyway! Compilers are always inadequate if you’re looking for best performance, you know that.

It’s all in the raycast() function of https://github.com/randomouscrap98/arduboy_collection/blob/master/arduboy_raycast/arduboy_raycast.ino but be warned: I’m a C programmer at heart and I’m not going to jump to change all my macros lol. I’m aware c++ has replacements, just deal with it, it’s a small program :slight_smile:. Also, I’m working on a more complicated version at https://github.com/randomouscrap98/arduboy_collection/blob/master/arduboy_raycast2/arduboy_raycast2.ino but it’s probably going to change frequently (maybe)

More corner case optimizations, I’ll have to study the assembly output but I have a hunch I know what it is.

Using a do-while loop is faster than both a while loop and a for loop. I believe this is because the compiler is generating a “naive” loop for the while/for, where it branches to enter the loop. But because this hardware always assumes the branch is not taken, repetitive loops incur the 1 cycle miss penalty each loop. This does actually add up, but a do-while loop naively branches to exit a loop, so you lose that penalty for most iterations.

But I need to look at the assembly for the real answer and I just haven’t felt like it lol.

That’s ok. I started programming with ANSI C which required all variables to be declared at the top of a code block. I’m still vary much in the habbit of doing this rather than the recommended method of declaring them closer to where they’re actually used.

@Pharap likes to remind of this periodically :P

GCC is the compiler that Arduino IDE (and Arduino CLI) uses as a backend.

Arduino’s basically just some libraries and a bit of extra preprocessing and some ‘glue’ programs to coordinate the underlying dependencies.

(Technically GCC is actually a collection of compilers and the specific compiler used for compiling C++ is called something like g++, but usually everyone just says ‘GCC’ rather than worry about what the specific compilers are actually called.)

The back-end would have to know what to generate for a shift, but what I’m saying is that it may just be the case that they implemented the bare minimum that could possibly work and didn’t spend time trying to create additional optimisations.

It may even be using some kind of fall-back that was put in place for a worst-case scenario.

Have you tried using a different set of optimisation flags to see if it makes a difference?

If that doesn’t work, each of the -O flags is basically just shorthand for various other optimisation flags, so you could try looking to see if any of the individual optimisation flags makes a difference. (Though there are quite a lot of them.)

Though on a brief look, it looks like most of the optimisations that -Os ignores wouldn’t actually make a difference on AVR anyway since they’re to do with alignment and alignment isn’t really an issue on an 8-bit processor.

It also means yet more conditional compilation, and yet another thing on my todo list.

Actually most modern compilers are generally very good. Compiler technology has come on leaps and bounds in the last two decades. It’s not like the days where some compilers would actually produce less code for for (;;) than for while (1).

AVR’s just in an unfortunate position because it behaves differently to a lot of more popular processor families and breaks a lot of typical assumptions. The lack of a barrel shifter is small fry compared to the fact it’s a modified harvard architecture in a sea of von neuman processors.

On first glance, it seems your only division is a reciprocal so there might be the potential to replace it with something like an approximation using Newton’s method, which should use only multiplication and subtraction.

Whether that’s faster or not depends on what the compiler’s spitting out at the moment though, and whether the approximation is good enough is a separate matter that would need to be weighed against the speed.

Just to mention it, absFixed might be better than Arduino’s abs.
absFixed does the equivalent of ((value.getInternal() < 0) ? -value : value) whereas Arduino’s abs does the equivalent of (value > 0 ? value : -value) which may not be as efficient. (< 0 will test the negative flag, whereas > 0 has to look at both the negative and zero flags, so it may incur an extra jump. It might not though, I haven’t actually checked.)

Also, out of interest, have you determined whether getProgmemStruct is more efficient than memcpy_P? I would have expected memcpy_P might be able to make better use of the Z register, but it probably also has to do additional null checks, so I’m not sure how the two would compare.

Just be careful not to do something like abs(++x), that’s all I’m saying.

(I can practically hear the functional programmers sniggering and whispering about referential transparency and mutable state.)

This is mostly a readability/best practices thing, but to be fair I have actually seen a few cases where reducing the scope of the variables has caused an Arduboy game to use less progmem. (Don’t ask me how though, it’s been too long and I never bothered to check the disassembly.)


I don’t remember if it was Shattered Land or someone else’s game, but I remember the reason. It had to do with declaring two values of the same type with an initial value. The first was only used in the top statements while the second was only used in the latter statements (only one was in use at a time). By moving the declarations the compiler was able to determine that the first variable was no longer needed and so it could just reuse that space. This both reduced stack size and code space because having both used one too many registers so they had to be shifted in and out of ram more.

1 Like

Oh true

I was going to, but getting at those flags turned out to be quite a nightmare in vscode. I thought you were supposed to be able to change flags in the arduino.json file but it didn’t work, and people online suggested changing arduino defaults. I figured I might as well leave it, for the reasons you gave (probably wouldn’t make a difference, I’m happy with -Os anyway)

Yes yes, I’ve worked on LLVM and I did my thesis on compiler and hardware optimizations lol I know. And when I write for modern systems, I trust the compiler and don’t try to get in its way. I guess I should’ve clarified: for particularly restricted RISC architectures, you can often write better assembly than compilers, and that becomes especially important at the (often) lower clock speeds these run at.

Something like x86 or even ARM has tons of instructions to utilize, so you often aren’t left going “how optimal is divide”. They’re also usually superscalar, out of order, and massively pipelined, so the hardware usually has a better idea of optimizations it can make for itself even compared to the compiler (in fact, sometimes compilers that output dumber code can run faster than “clever” ones). And regardless, all the optimizations in the world mean nothing when you have a cache miss, and that’s a lot harder to optimize for regardless of hardware or compilers (you just have to write better code), so yes, a lot of factors contribute to “compilers are good” (on modern hardware). That’s not even getting into SIMD or the compiler knowing about the various kinds of branch prediction and stalls on modern hardware and optimizing for specific processors, something a human generally can’t do (unless they’re building against a super specific processor and even then you basically match the compiler… maybe).

Yes I’m aware of how macros work lol don’t worry. Generally if I have to use an argument more than once I avoid the macro if possible, I use the macros for easier inlining, not for ease of use (generally).

I just tried that and it seems to be roughly the same. According to the datasheet, all branch comparisons are 1 cycle for not taken and 2 cycles for taken, regardless of direction of comparison or inclusion of 0 etc.

Is that memcpy for program memory? If so, I just didn’t know it existed lol

Man thank you so much for this, I kept thinking “do I really want to go find this table online or generate it myself and just hope it’s the same”. The answer was no, so I really appreciate it.

1 Like

Strange, I was under the impression that it did.

Though perhaps the cases I’ve heard of where people managed to do it were with Arduino CLI or PlatformIO or something.

If I remember rightly I think there’s a file buried somewhere in %APPDATA% that stores those. (An .ini file, I think.)

I’ve never even so much as taken a computer science course, let alone stepped foot inside a university. My highest qualification is only college level.

I’m just well read. (Enough to know about SIMD, pipelining, branch prediction, caches, parse trees, parser combinators, static single-assignment form, continuation passing style, et cetera.)

Ironically 32-bit x86 and ARM both date back to the 80s, whereas 8-bit AVR debued in '96.
It’s technically modern, it’s just very… trimmed down, I shall say.

Which usually means consume fewer resources, waste less memory, and perhaps resort to manual memory management rather than relying on a GC.

(Or do sneaky things like the small string optimisation.)

Eh, I just trust the compiler to inline functions.

Functions defined in headers usually have to be marked inline anyway, and template functions are more likely to be inlined by nature of how they’re implemented.

Yeah, but the issue is that branch instructions can only branch on one flag and that > 0 should theoretically require both N and Z to be tested, whereas < 0 should only need to test N. I.e. > 0 should produce an extra branch instruction because of the second flag test.

Perhaps I’m missing something though.


Recommended reading:


(Come to think of it, I probably should have mentioned it in this old thread. Maybe I’ll go back and add it later. I was more concerned about the number of people using pgm_read_word instead of pgm_read_ptr.)

1 Like