Sprites: speed of masked -vs- unmasked

Couldn’t find this discussed elsewhere…
I’ve assumed that rendering a sprite with no mask, drawOverwrite(), is faster than using a mask, drawExternalMask(). However, looking over the library code, it’s not clear to me…


Would love any feedback (@MLXXXp ?).

I didn’t write the sprites code (it was ported from Team A.R.G.’s library) and haven’t done any extensive testing myself. You would probably just have to experiment yourself.

In addition to comparing the different functions, note that there are both the Sprites and SpritesB classes. They both provide the same functions but one may offer better size and/or speed, depending on their use in a particular sketch. (The compiler output is always hard to predict.)

If you create an instance, it’s easy to switch between Sprites and SpritesB for testing by using the same instance name for one or the other.


@Pharap may suggest that a using alias can be used for easy switching without needing to instantiate, if that’s your preference. (It won’t affect code size/speed one way or the other.)
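For anyone wanting to try the alias approach, here is a minimal sketch that compiles on a desktop compiler. SpritesStandIn and SpritesBStandIn are hypothetical stand-ins (not library code) for the two classes, assumed here to expose static drawing functions with matching names. In a real sketch you would alias Sprites or SpritesB from Arduboy2.h instead. SpriteRenderer, playerImage and drawPlayer are also just illustrative names.

```cpp
#include <cstdint>

// Hypothetical stand-ins for the two library classes. They only count
// calls so that the example can run off-device.
struct SpritesStandIn {
  static int calls;
  static void drawOverwrite(int16_t, int16_t, const uint8_t*, uint8_t) { ++calls; }
};

struct SpritesBStandIn {
  static int calls;
  static void drawOverwrite(int16_t, int16_t, const uint8_t*, uint8_t) { ++calls; }
};

int SpritesStandIn::calls = 0;
int SpritesBStandIn::calls = 0;

// Change this one line to switch implementations everywhere.
using SpriteRenderer = SpritesStandIn; // or: SpritesBStandIn

// Illustrative 8x8 image: width, height, then one byte per column.
const uint8_t playerImage[] = { 8, 8, 0xFF, 0x81, 0x81, 0x81, 0x81, 0x81, 0x81, 0xFF };

void drawPlayer() {
  // No instance needed; the alias names the class directly.
  SpriteRenderer::drawOverwrite(10, 20, playerImage, 0);
}
```

Every call site goes through the alias, so swapping classes for a size/speed comparison is a one-line change.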


Thanks for the comments.
I’m not using SpritesB, as I understood Sprites was optimised for speed.

I was just surprised there still seems to be a masking step for case SPRITE_UNMASKED, albeit with 0xFF… despite having its own rendering loop.
However, it does skip pgm_read_byte for mask_data, and PROGMEM access is slower…

I can see a lot of effort went into the final case SPRITE_PLUS_MASK, with hand-crafted ASM … and then debugging of r28 and r29 issues. Can we assume this is the fastest way to render masked sprites? It would be great if all of the draw_mode types were reviewed for speed / efficiency - not necessarily for converting to ASM, but purely to document any speed impact of the different choices.

Edit: CC in lords of the pixel: @Dreamer3 & @dxb .

Feel free to benchmark and/or submit PRs, changes or suggestions for optimising the sprites classes. :smiley:


They do different jobs, so unless you have a way to simulate one with the other you’d be comparing apples to oranges.

Comparing drawExternalMask to drawSelfMasked or drawPlusMask might make a bit more sense since drawExternalMask and drawPlusMask are roughly equivalent and both can simulate drawSelfMasked.

But speed is not the only factor to consider, you must also consider memory usage.

The fastest way possible or the fastest way provided by the Arduboy2 library?

The thing you have to remember about Sprites is that it’s attempting to balance being fast with not using too much progmem, and with providing sufficient functionality.

It’s fast enough and small enough to suit the majority of users rather than trying to be fast enough for the few users who need really fast rendering or small enough for the people who are really pushed for space.

In particular, Sprites seems to be geared more towards people who are likely to need more than one of the rendering functions it provides.

Theoretically if you only needed one of the functions Sprites provides it would be possible to write a version that implements that one function in a cheaper and more efficient way than what Sprites does.

That all said, it’s entirely possible that there may be ways to make Sprites faster without making it larger, or make it smaller without making it slower, but any modifications are not to be undertaken lightly because there are literally hundreds of games depending on Sprites, and any changes made must avoid breaking (any of) those games.


Thanks @Pharap; I appreciate your good summary of the issues.
Perhaps my questions seem a little odd, so I can give some specific examples:

  1. To get better FPS, is there an advantage to sacrificing the aesthetics of some of my sprites by not masking them, so that they obscure the background (i.e. images appear more blocky but perhaps render quicker)? Or, if speed is equal, should I preserve the look of my game?
  2. For better speed, should I use drawPlusMask() with interleaved image + mask bytes? Or, if there’s no speed advantage, should I use drawExternalMask(), as it’s (perhaps) easier to maintain the image and mask arrays separately during game development?

I understand the different sprite methods can render different visual results. However, if visuals and data size are not the priority, (rare, I know), which method in Sprites takes the least cycles to render a similar image?

I have no plans on touching that code! :smiley: …just requesting a bit more detail in the documentation. I’m sure great minds have pored over each line of the code and have best balanced it for speed, compiled size and features… I just glanced at the code and was surprised we are still masking an unmasked sprite.

When I was working on Arduroad (the first attempt at Road Trip) a few years ago, I benchmarked drawExternalMask and drawPlusMask and found the latter was about 8% faster (give or take). I did this by simply drawing a thousand images of each type and recording the milliseconds each took.
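That kind of measurement can be sketched roughly as below. This is a desktop-runnable illustration, not the code used for Arduroad: timeDraws is a hypothetical helper, and it uses std::chrono where a sketch running on the device would more likely read millis() before and after the loop.

```cpp
#include <chrono>
#include <cstdint>

// Calls drawOnce() `count` times and returns the elapsed time in milliseconds.
template <typename DrawFn>
double timeDraws(DrawFn drawOnce, uint32_t count) {
  const auto start = std::chrono::steady_clock::now();
  for (uint32_t i = 0; i < count; ++i) {
    drawOnce();
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Timing each drawing function with the same image size, position and count, then comparing the two totals, gives a relative figure like the 8% above. The result only holds for that sketch and compiler version.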

However, I have attempted to use it in other games and come unstuck by a lack of memory. For example, if you have multiple frames of a character which are essentially the same bar some detail it is often more memory efficient to have the frames separate and use a single mask.
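To put rough numbers on that, here is a small worked calculation. It assumes 1-bit images stored as one byte per column per 8-pixel page and ignores any width/height header bytes. The function names are made up for the illustration.

```cpp
#include <cstddef>

// Bytes for one frame of a 1-bit image: one byte per column per 8-pixel page.
constexpr std::size_t frameBytes(std::size_t width, std::size_t height) {
  return width * ((height + 7) / 8);
}

// Separate image frames plus `maskCount` shared masks (external-mask layout).
constexpr std::size_t externalMaskBytes(std::size_t width, std::size_t height,
                                        std::size_t frames, std::size_t maskCount) {
  return frameBytes(width, height) * (frames + maskCount);
}

// Every frame carries its own interleaved mask (plus-mask layout).
constexpr std::size_t plusMaskBytes(std::size_t width, std::size_t height,
                                    std::size_t frames) {
  return frameBytes(width, height) * 2 * frames;
}

// Four 16x16 frames sharing one mask versus four interleaved frames.
static_assert(externalMaskBytes(16, 16, 4, 1) == 160, "4 images + 1 mask");
static_assert(plusMaskBytes(16, 16, 4) == 256, "4 images + 4 masks");
```

In this example the shared mask saves 96 bytes of progmem, and the gap grows with every extra frame.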

It’s a classic balancing act … but then programming is always that.


I couldn’t say for definite without testing, but I would presume drawing without a mask would be faster simply because it should require fewer steps, with the potential exception of drawPlusMask because the fact it’s written in assembly might be enough to give it an edge over the compiler-generated code.

Ultimately it depends on what optimisations the compiler chooses, so even if we ran some benchmarks those benchmarks would only be a vague guideline and would not be generalisable and could not be taken as gospel.

I would have expected drawPlusMask to be marginally faster, partly because it has fewer parameters to deal with and partly under the assumption that the operation implemented in hand-optimised assembly would be faster.

(Not because hand-written assembly is necessarily faster, but because one of the original authors decided that there was cause to use assembly in the first place, which implies they would have verified the speed.)

@filmote’s comment suggests that those expectations are probably correct.

(Note that if it weren’t for drawExternalMask, the actual draw function could do away with two of its parameters since they’re NULL and 0 in all other functions.)

That depends on the definition of ‘similar image’. If your definition is closer to ‘exactly the same’ you’ll probably get a different answer than if your definition is closer to ‘completely different’.

I’d be willing to bet either drawPlusMask or drawOverwrite would be fastest - the former because of its assembly implementation, the latter because it’s theoretically the simplest.

Ultimately that’s just a guess though, you’d be better off just running some benchmarks rather than listening to my armchair guesswork.

The masking going on here is not masking the sprite in the same sense as what drawPlusMask or drawExternalMask do. This is not a ‘sprite mask’.

The masking here is masking the data that’s already in the frame buffer so that only some of the bits in the frame buffer are overwritten.

As you may or may not know, the frame format used by the Arduboy’s screen consists of 8 rows of 128 8-bit columns.


(This is that same old image made years ago by @emutyworks that I dig up to show people every few months. :P)

In the case where a sprite isn’t vertically aligned to a whole column and it thus partially covers one or more columns, it is necessary to mask off the part of the column (more specifically, the 8-bit byte that represents that column) that will be overwritten by the drawing function. It is necessary because the rest of the data in that column must be retained, not overwritten, so the masking operation erases only the bits that are to be overwritten and retains the rest.
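A simplified sketch of that idea for a single sprite column, written for a desktop compiler. This is not the library's code: overwriteColumn and its parameters are hypothetical names, and it assumes bit 0 is the topmost pixel of each byte.

```cpp
#include <cstdint>

// Overwrites one 8-pixel sprite column into two vertically adjacent
// frame-buffer bytes. yOffset (0-7) is how far the sprite starts below
// the top of the upper page.
void overwriteColumn(uint8_t& upperPage, uint8_t& lowerPage,
                     uint8_t spriteByte, uint8_t yOffset) {
  // Sprite bits moved to their final position across both pages.
  const uint16_t shifted = static_cast<uint16_t>(spriteByte << yOffset);
  // 1 = frame-buffer bit the sprite does not cover, so it must be retained.
  const uint16_t keep = static_cast<uint16_t>(~(0xFFu << yOffset));

  upperPage = static_cast<uint8_t>((upperPage & (keep & 0xFF)) | (shifted & 0xFF));
  lowerPage = static_cast<uint8_t>((lowerPage & (keep >> 8)) | (shifted >> 8));
}
```

With yOffset at 0 the keep mask for the upper page is 0x00, so the whole byte is replaced and the lower page is left alone. That is the aligned case where the masking step does no useful work.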

This is also why * mul_amt appears in the line that assigns bitmap_data. This multiplication in both cases is actually supposed to act like a left shift, presumably with the intent that the compiler will opt to use the AVR MUL instruction for the job.
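To illustrate the multiply-as-shift trick: multiplying by a power of two gives the same 16-bit result as shifting left by that many bits. That matters on AVR because its shift instructions move one bit at a time. The sketch below uses hypothetical names (positionByMultiply, mulAmt) that only mirror the library's mul_amt.

```cpp
#include <cstdint>

// Positions a sprite byte by multiplying by a power of two instead of shifting.
// The low byte of the result belongs to the upper page and the high byte to
// the page below it.
uint16_t positionByMultiply(uint8_t spriteByte, uint8_t yOffset) {
  // yOffset is expected to be 0-7, so the multiplier fits in one byte.
  const uint8_t mulAmt = static_cast<uint8_t>(1u << yOffset);
  return static_cast<uint16_t>(spriteByte * mulAmt);
}
```

Whether the compiler actually emits MUL for this depends on the toolchain and optimisation settings, so checking the generated assembly is the only way to be sure.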

Does that clarify the matter?


That’s super helpful- thanks @Pharap @filmote !
It’s enough info for what I was working on.
I’d like to revisit this in the future and do some benchmarking. (Secretly I’m hoping someone else will do this!). Cheers.

Oh come on, it wouldn’t take you long :slight_smile:


I wrote the assembly in question. At the time it way out-performed the C solution (2-3x IIRC, if not an order of magnitude). You have to compare it to the original C solution for itself though; you cannot compare it to the other drawing modes. Sprite plus mask requires 3 16-bit pointers:

  • Display buffer page x (8 pages/rows of VRAM)
  • Display buffer page x + 1
  • Offset into the sprite data

So to do this fast you need the X, Y, and Z registers (the only 16-bit regs on our 8-bit chip)… which the compiler will NOT give you (though perhaps things have changed). One is always reserved for the stack I think? I forget… so since you can’t use fast LOAD/STORE/INCREMENT loops you fall back to SUPER slow array access with tons of CPU time burnt just incrementing values and doing silly array access mechanics.
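For readers who want to see the three pointers in action, here is a plain C++ sketch of the inner loop for one row of columns. It is neither the assembly nor the library's C++ code. drawPlusMaskRow is a hypothetical name, and it assumes the sprite stream alternates image byte then mask byte, with a set mask bit meaning "replace this background pixel".

```cpp
#include <cstddef>
#include <cstdint>

// Draws `width` columns of interleaved (image, mask) bytes across two pages.
// Three pointers advance in lock step: upper page, lower page, sprite stream.
void drawPlusMaskRow(uint8_t* upperPage, uint8_t* lowerPage,
                     const uint8_t* spriteData, std::size_t width,
                     uint8_t yOffset) {
  for (std::size_t i = 0; i < width; ++i) {
    // One stream supplies both values, so one pointer is enough for the sprite.
    const uint16_t image = static_cast<uint16_t>(*spriteData++ << yOffset);
    const uint16_t mask = static_cast<uint16_t>(*spriteData++ << yOffset);

    // Clear the masked background bits, then merge in the image bits.
    *upperPage = static_cast<uint8_t>((*upperPage & ~(mask & 0xFF)) | (image & 0xFF));
    ++upperPage;
    *lowerPage = static_cast<uint8_t>((*lowerPage & ~(mask >> 8)) | (image >> 8));
    ++lowerPage;
  }
}
```

Each iteration is just loads, stores and pointer increments. That is the shape that maps onto load/store-with-post-increment instructions when all three pointers stay in registers.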

The original code can all be found in Git if someone wanted to go back and make an apples to apples before/after comparison (with today’s compiler toolchain) and actually benchmark the old C++ code against the assembly.


Thanks @Dreamer3. The ASM code is really impressive, and a little daunting!
Your comments suggest a big difference in speed between an internal mask (drawPlusMask) and an external mask (drawExternalMask); hopefully this will be easily measurable.

Was there a technical reason to only provide the ASM optimisation for one of the sprite methods? Or was it based on effort -vs- rewards?
Did you also look over the remaining C++ code for efficiency?
Thanks again for your original work and taking time to reply.

Speaking as someone who read it all and wrote an equivalent C++ implementation: it looks daunting at first, but it’s definitely possible to understand if you know enough about AVR assembly and how computers work at a low level (registers, RAM and addressing).

(By the way, if you’re trying to understand it rather than simply testing it, you might have a better time reading my C++ version. It’s a very literal translation from what I recall.)


It should be. Sprite + external mask is the WORST case possible.

It’s really hard, for one - and often times it’s not slow enough to matter. So yes, effort vs rewards… plus there can be no equivalent “perfect” assembly solution for an external mask because that would require 4 16-bit pointers, and we only have three. That is the whole reason I went with the interleaved format in the first place… because we needed a single stream of data vs multiple streams in order to do everything as fast as possible.
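If you prefer to maintain image and mask arrays separately during development, a small desktop-side tool could merge them into a single stream afterwards. The sketch below is a hypothetical helper, not part of any library. It assumes the interleaved order is image byte then mask byte and leaves out any width/height header, so check both against the library documentation before relying on it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Merges separate image and mask bytes into one alternating stream:
// image[0], mask[0], image[1], mask[1], ...
std::vector<uint8_t> interleave(const std::vector<uint8_t>& image,
                                const std::vector<uint8_t>& mask) {
  // Use the shorter length so mismatched inputs cannot read out of bounds.
  const std::size_t count = image.size() < mask.size() ? image.size() : mask.size();
  std::vector<uint8_t> out;
  out.reserve(count * 2);
  for (std::size_t i = 0; i < count; ++i) {
    out.push_back(image[i]);
    out.push_back(mask[i]);
  }
  return out;
}
```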
