Using SSD1327 128x128 4-bit display

I might be able to think of something if I knew what all these magic numbers meant.
(And the variables for that matter…)

So far all I’ve been able to work out is that you’re double buffering scanlines, and drawing tiles and sprites.

1 Like

This is the best I could do without knowing more about the rest of the system:

void drawFullMap2()
{
	uint8_t tileX = 0;
		
	// Number of horizontal lines
	for(uint_fast8_t lineY = 0; lineY < 128; ++lineY)
	{ 
		const uint8_t y1 = (lineY % 2);

		for(uint8_t spriteIndex = 0; spriteIndex < 8; ++spriteIndex)
		{
			// Would have just called this 'sprite' if the sprite array were 'sprites'
			auto & spriteReference = sprite[spriteIndex];
		
			if((lineY >= spriteReference.y) && (lineY <= spriteReference.y + 7))
			{
				spriteReference.offset = (((lineY - spriteReference.y) % 8) * 4);
				
				for(uint8_t offset = 0; offset < 4; ++offset)
					scanLine[y1][spriteReference.x + offset] = (spriteIndex + 1);
			}
		}
		
		// Horizontal line number modulo 8
		const uint16_t jStart = ((lineY % 8) * 4);
		
		// If y1 is 0, y2 is 1
		// If y1 is 1, y2 is 0
		const uint8_t y2 = (1 - y1);
				
		const uint8_t tileY = ((lineY + 1) / 8);
		uint16_t tileIndex = (tileX + (tileY * 16));
		uint8_t scanlineX = 0;
		
		for(uint8_t x = 0; x < 16; ++x, ++tileIndex)
		{
			const uint8_t tile = myMap[tileIndex];
			
			for (uint8_t i = 0, j = jStart; i < 4; ++i, ++j, ++scanlineX)
			{
				scanLine[y2][scanlineX] = 0;
				const auto value = scanLine[y1][scanlineX];
				
				// Default to reading background tile,
				// assuming it's the more frequent occurance.
				uint8_t lineTile = tile;
				uint8_t lineOffset = j;
				
				// Having only one conditional section should improve performance
				// by virtue of having less code to traverse.
				if(value > 0)
				{
					// no longer needed
					// asm volatile( "nop\n" ); 
					const uint8_t spriteIndex = (value - 1);
					
					// Would have just called this 'sprite' if the sprite array were 'sprites'
					auto & spriteReference = sprite[spriteIndex];
					
					lineTile = spriteReference.tile;
					lineOffset = spriteReference.offset;
					++spriteReference.offset;
				}
				
				SPDR = pgm_read_byte(&alltiles[lineTile][lineOffset]);
			}
			
			// This might be able to be removed?
			++tileX;
			if(tileX == 16)
				tileX = 0;
		}
	}
}

In some places I’ve had to resort to auto because I don’t know what the type should be.

I have no way of testing this, and without the accompanying code and environment (or even the definitions of the global variables used) I don’t have a way of verifying that what I’ve written is even correct.

Also, though I’m not completely sure,
I think refering to the scanlines with pointers like in the following code might be more optimal:

void drawFullMap2()
{
	uint8_t tileX = 0;
	
	auto * activeScanline = &scanLine[0];
	auto * otherScanline = &scanLine[1];
		
	// Number of horizontal lines
	for(uint_fast8_t lineY = 0; lineY < 128; ++lineY)
	{ 
		for(uint8_t spriteIndex = 0; spriteIndex < 8; ++spriteIndex)
		{
			// Would have just called this 'sprite' if the sprite array were 'sprites'
			auto & spriteReference = sprite[spriteIndex];
		
			if((lineY >= spriteReference.y) && (lineY <= spriteReference.y + 7))
			{
				spriteReference.offset = (((lineY - spriteReference.y) % 8) * 4);
				
				for(uint8_t offset = 0; offset < 4; ++offset)
					activeScanline[spriteReference.x + offset] = (spriteIndex + 1);
			}
		}
		
		// Horizontal line number modulo 8
		const uint16_t jStart = ((lineY % 8) * 4);
		
		uint8_t scanlineX = 0;
		
		const uint8_t tileY = ((lineY + 1) / 8);
		uint16_t tileIndex = (tileX + (tileY * 16));
		
		for(uint8_t x = 0; x < 16; ++x, ++tileIndex)
		{
			const uint8_t tile = myMap[tileIndex];
			
			for (uint8_t i = 0, j = jStart; i < 4; ++i, ++j, ++scanlineX)
			{
				otherScanline[scanlineX] = 0;
				const auto value = activeScanline[scanlineX];
				
				// Default to reading background tile,
				// assuming it's the more frequent occurance.
				uint8_t lineTile = tile;
				uint8_t lineOffset = j;
				
				// Having only one conditional section should improve performance
				// by virtue of having less code to traverse.
				if(value > 0)
				{
					// no longer needed
					// asm volatile( "nop\n" ); 
					const uint8_t spriteIndex = (value - 1);
					
					// Would have just called this 'sprite' if the sprite array were 'sprites'
					auto & spriteReference = sprite[spriteIndex];
					
					lineTile = spriteReference.tile;
					lineOffset = spriteReference.offset;
					++spriteReference.offset;
				}
				
				SPDR = pgm_read_byte(&alltiles[lineTile][lineOffset]);
			}
			
			// This might be able to be removed?
			++tileX;
			if(tileX == 16)
				tileX = 0;
		}
		
		// If std::swap were available, I'd use that instead
		const auto temporary = activeScanline;
		activeScanline = otherScanline;
		otherScanline = temporary;
	}
}

Theoretically it should save some calculations,
but it depends if the CPU can cope with the added demand for registers.

Also, my instinct is that there might be a more optimal way than using two scanline buffers,
but I don’t know enough about what’s going on to suggest an alternative.

I’m using the scan lines to add the sprite data only.Each scanline will be zeros unless any sprite is detected in that position, I did this to cycle through the 8 sprites only once per scanline, rather than once per pixel. It’s 2 scanlines so that I only need to re-zero places where a sprite is/was and not while it’s currently being used, removing the need to memset the whole scanline.

[edit] your function gets 42fps, the unrolled version of mine is getting 53.

I managed to guess that much,
but it doesn’t tell me how the sprites are actually defined or what the magic numbers mean.
(E.g. what sprite[index].offset means.)

Aha!
While reading this I had a sudden flash of inspiration and reaslised why the use of two scanline buffers was niggling at me.

You don’t need the second scanline buffer,
because you can zero the current scanline buffer entry after discovering that it has a sprite:

if(value > 0)
{
	// no longer needed
	// asm volatile( "nop\n" ); 
	const uint8_t spriteIndex = (value - 1);
	
	// Would have just called this 'sprite' if the sprite array were 'sprites'
	auto & spriteReference = sprite[spriteIndex];
	
	lineTile = spriteReference.tile;
	lineOffset = spriteReference.offset;
	++spriteReference.offset;
	
	// The sprite's existance has been noted,
	// now clear the scanline entry
	activeScanline[scanlineX] = 0;
}

Thus you only need one scanline buffer.

Thus:

void drawFullMap2()
{
	uint8_t tileX = 0;
		
	// Number of horizontal lines
	for(uint_fast8_t lineY = 0; lineY < 128; ++lineY)
	{ 
		const uint8_t y1 = (lineY % 2);

		for(uint8_t spriteIndex = 0; spriteIndex < 8; ++spriteIndex)
		{
			// Would have just called this 'sprite' if the sprite array were 'sprites'
			auto & spriteReference = sprite[spriteIndex];
		
			if((lineY >= spriteReference.y) && (lineY <= spriteReference.y + 7))
			{
				spriteReference.offset = (((lineY - spriteReference.y) % 8) * 4);
				
				for(uint8_t offset = 0; offset < 4; ++offset)
					scanLine[spriteReference.x + offset] = (spriteIndex + 1);
			}
		}
		
		// Horizontal line number modulo 8
		const uint16_t jStart = ((lineY % 8) * 4);
				
		const uint8_t tileY = ((lineY + 1) / 8);
		uint16_t tileIndex = (tileX + (tileY * 16));
		uint8_t scanlineX = 0;
		
		for(uint8_t x = 0; x < 16; ++x, ++tileIndex)
		{
			const uint8_t tile = myMap[tileIndex];
			
			for (uint8_t i = 0, j = jStart; i < 4; ++i, ++j, ++scanlineX)
			{
				const auto value = scanLine[scanlineX];
				
				// Default to reading background tile,
				// assuming it's the more frequent occurance.
				uint8_t lineTile = tile;
				uint8_t lineOffset = j;
				
				// Having only one conditional section should improve performance
				// by virtue of having less code to traverse.
				if(value > 0)
				{
					// no longer needed
					// asm volatile( "nop\n" ); 
					const uint8_t spriteIndex = (value - 1);
					
					// Would have just called this 'sprite' if the sprite array were 'sprites'
					auto & spriteReference = sprite[spriteIndex];
					
					lineTile = spriteReference.tile;
					lineOffset = spriteReference.offset;
					++spriteReference.offset;
					
					scanLine[scanlineX] = 0;
				}
				
				SPDR = pgm_read_byte(&alltiles[lineTile][lineOffset]);
			}
			
			// This might be able to be removed?
			++tileX;
			if(tileX == 16)
				tileX = 0;
		}
	}
}

Did you unroll mine? (If not, you’re not comparing like for like.)
How does it compare to the non-unrolled version of yours?
Which of the two versions did you test?

Also, does it actually work?

This keeps track of which sprite pixels are to be plotted as each sprite is only drawn one line at a time.

A minor tweak to the end (++tileX&=15;) get your function to 51/52fps without unrolling. Very nice :slight_smile:


void drawFullMap2()
{
  uint8_t tileX = 0;
    
  // Number of horizontal lines
  for(uint_fast8_t lineY = 0; lineY < 128; ++lineY)
  { 
    const uint8_t y1 = (lineY % 2);

    for(uint8_t spriteIndex = 0; spriteIndex < 8; ++spriteIndex)
    {
      // Would have just called this 'sprite' if the sprite array were 'sprites'
      auto & spriteReference = sprite[spriteIndex];
    
      if((lineY >= spriteReference.y) && (lineY <= spriteReference.y + 7))
      {
        spriteReference.offset = (((lineY - spriteReference.y) % 8) * 4);
        
        for(uint8_t offset = 0; offset < 4; ++offset)
          scanLine[spriteReference.x + offset] = (spriteIndex + 1);
      }
    }
    
    // Horizontal line number modulo 8
    const uint16_t jStart = ((lineY &7) * 4);
        
    const uint8_t tileY = ((lineY + 1) / 8);
    uint16_t tileIndex = (tileX + (tileY * 16));
    uint8_t scanlineX = 0;
    
    for(uint8_t x = 0; x < 16; ++x, ++tileIndex)
    {
      const uint8_t tile = myMap[tileIndex];
      
      for (uint8_t i = 0, j = jStart; i < 4; ++i, ++j, ++scanlineX)
      {
        const auto value = scanLine[scanlineX];
        
        // Default to reading background tile,
        // assuming it's the more frequent occurance.
        uint8_t lineTile = tile;
        uint8_t lineOffset = j;
        
        // Having only one conditional section should improve performance
        // by virtue of having less code to traverse.
        if(value > 0)
        {
          // no longer needed
          // asm volatile( "nop\n" ); 
          const uint8_t spriteIndex = (value - 1);
          
          // Would have just called this 'sprite' if the sprite array were 'sprites'
          auto & spriteReference = sprite[spriteIndex];
          
          lineTile = spriteReference.tile;
          lineOffset = spriteReference.offset;
          ++spriteReference.offset;
          
          scanLine[scanlineX] = 0;
        }
        
        SPDR = pgm_read_byte(&alltiles[lineTile][lineOffset]);
      }
      
      // This might be able to be removed?
      ++tileX&=15;
      //if(tileX == 16)
      //  tileX = 0;
    }
  }
}

Next step…sound :stuck_out_tongue_winking_eye:

So it’s the index of the column of the sprite to be drawn at that index in the scan line?
Or the row?
Or something else?

Best make that two lines:

++tileX;
tileX %= 16;

The modulo is for the sake of clarity. Leave converting it to an & to your compiler.
(If your compiler doesn’t produce the machine code equivalent of &= 0xF, or something better, don’t change the code to &= 0xF, file a bug report instead.)

And splitting into two lines is to avoid undefined behaviour.

I’m not entirely sure about C++11 because the rules about the sequencing of increments changed,
but in C++98 at least, doing ++i = x was undefined behaviour.

More importantly, nobody even has to consider whether it’s undefined behaviour or not.
The sequencing rules are horribly complicated.


Have you tried removing that part by the way?

I’m about 70% sure it’s not needed, because:

  • The loop runs 16 times, taking x from 0 to 16
  • tileX is being incremented every time x is, until tileX hits 16, at which point it’s reset to 0
  • tileX isn’t actually used anywhere else in that loop

So theoretically it’s 0 both before and after the loop.
That combined with the fact is never even used by the loop implies that outside the loop it’s always 0.
Which in turn would mean that tileX is useless,
because the only other place that actually reads it would do so at a point where it’s guaranteed to be 0.

Just tried, seems fine, getting the top end of 52fps.

1 Like

I wonder how difficult scrolling would be. Other than extending the map by at least 1 tile in both directions…

[edit]
Unrolled @Pharap’s loops a little, got it to 72fps with 8 sprites!
Again, just rename the .zip.bin to just .zip…
128x128_screen_2-_13_1_72fps_sprites.zip.bin (5.9 KB)

[edit] Quick test with transparent sprites…

  if(pgm_read_byte(&alltiles[lineTile][lineOffset])>0){\
    SPDR = pgm_read_byte(&alltiles[lineTile][lineOffset]);\
  }else{\
    SPDR = pgm_read_byte(&alltiles[tile][j]);\
  }\

Gets a respectable 58fps also :slight_smile:

Thankyou @Pharap for turning my spaghetti into actual code :slight_smile:

1 Like

I don’t know about speed and specifics,
but if you want a starting point, here’s a high-level Arduboy example of drawing a scrolling map:

Note that it only draws (width + 1) * (height + 1) tiles,
rather than attempting to draw the whole map.

(I think I have some other examples elsewhere too.)

That’s actually much better than I was expecting.
I was only expecting about 60 fps.

If you don’t want to use a proper GitHub repo, consider using gists:
https://help.github.com/en/github/writing-on-github/creating-gists

You can download them as a .zip without having to worry about the forum’s extension filter.

You should probably cache the result of pgm_read_byte(&alltiles[lineTile][lineOffset]).

I could be wrong, but I’m fairly certain that the compiler can’t actually cache the result of any pgm_read functions because they’re implemented in assembly and the compiler has no knowledge of assembly.

(It’s possible that the assembler might be able to optimise it,
but I’m somewhat doubtful of the assembler’s ability to recognise redundant code.)

If nothing else, I hope this demonstrates that readable code can be fast too.
Good performance comes from good semantics,
good maintainability comes from good presentation.

Swapping to the following, gets 64fps…

  pixel = pgm_read_byte(&alltiles[lineTile][lineOffset]);\
  if(pixel==0){\
    pixel = pgm_read_byte(&alltiles[tile][j]);\
  }\
  SPDR = pixel;\

…very not bad :slight_smile:

1 Like

Assuming the type wasn’t omitted by accident,
consider using more variables with shorter lifetimes.

Contrary to what might seem likely,
reusing variables is actually worse than having lots of short-lived variables because what really matters is the lifetime of a variable.

By reusing the same variable, the variable’s lifetime is artificially extended,
which can force the compiler to have to keep the variable around (e.g. by keeping it in a register while it isn’t needed, or by pushing it to the stack).

By introducing new variables and ceasing to use old ones,
the compiler has an easier time of allocating the registers.
(In particular because some compilers like GCC use single static assignment form as part of their optimisation.)

1 Like

Most, if not all of the variables used are declared in the update screen function, I assume they would get killed at the end of the function.

This is all unmasked right? OR “delete” masked, as in the sprites simply overwrite the WHOLE area, vs being layered onto the background?

The sprites imply overwrite the background, but in the second test (64fps) I’m checking for byte==0, which only checks if both odd and even pixels are zero within each 2 pixel byte. I haven’t tested yet at an individual pixel level, I’m tempted to think it would be too slow.

All locals do indeed get ‘killed’ by the end of the function,
but it’s the register demand during the function that you need to worry about.

AVR chips only have 32x8-bit registers, and certain instructions only work with certain registers.
E.g. ‘add immediate to word’ (roughly reg += constant, where reg is 16-bits wide) only works on 4 of the register pairs (8 of the 32 registers), ‘compare with immediate’ only works for the ‘upper’ 16 registers.

As a result of the constraints, it’s easy for the CPU to end up wasting cycles simply juggling registers around to free up a register that can do what it needs to do.

I think roughly it would be 1 extra branch, possibly 1 extra compare (depending on whether or not an ‘and’ sets the zero flag), definitely 2 extra 'and’s.


I’ve been thinking, if there were some way to avoid having to do this 8 step loop for each scanline you could probably save quite a bit of processing power.

Unfortunately I don’t fully understand what it’s doing,
so I can’t suggest anything yet.
(For one thing I’m still not clear on exactly what SPRITES::offset is.)

(Which reminds me, why is the type SPRITES, plural and in macro case, but the actual array is sprite, singular? Surely the type should be singular because it describes a single entity and the array should be plural since an array contains multiple items?)


Come to think of it, I’m not sure that % 8 is needed:

spriteReference.offset = (((lineY - spriteReference.y) % 8) * 4);

The prior condition:

if((lineY >= spriteReference.y) && (lineY <= spriteReference.y + 7))

Would imply that (lineY - spriteReference.y) can only be between 0 and 7 (inclusive).
(Unless spriteReference.y + 7 overflows of course, but that shouldn’t be possible because of (lineY >= spriteReference.y) and lineY being < 128.)

That only saves you an & 0x7, but at least it’s something.


There’s a potential buffer overrun bug with the current code…

for(uint8_t offset = 0; offset < 4; ++offset)
  scanLine[spriteReference.x + offset] = (spriteIndex + 1);

If spriteReference.x is 125 or more then scanLine[spriteReference.x + offset] overruns the buffer.

But for the SamD am i rite?

Indeed, you just got the fps up to 65 :slight_smile:

1 Like

I assume simply increasing the size of the buffer bye a few bytes would work out faster than checking bounds?

Or do the bounds check once, early. (if possible)

Yes, but you’d be paying for the privilege with RAM, which isn’t exactly plentiful on an AVR chip.
(Though it’s only 3-4 bytes, assuming spriteReference.x is never more than 127, so it might be worth it.)

One alternative would be to wrap the index using modulo (which would compile to an & because you’re using a power-of-two for size), which would have the side effect of sprites drawn near the right edge of the screen wrapping around to the left edge (which depending on how you lookt at it is either a bug or a feature).

But theoretically that would cost an extra 1 or 2 cycles per loop,
which is going in the opposite direction of what you want (unless you think the wrapping is a neat feature).


I was about to say that the ‘proper’ solution is:

for(uint8_t offset = 0; offset < 4; ++offset)
{
  const size_t index = (spriteReference.x + offset);

  if(index >= 128)
    break;

  scanLine[index] = (spriteIndex + 1);
}
And suggest something crazy...
const auto x = spriteReference.x;
const auto index = (spriteIndex + 1);

switch(x)
{
	case 125:
	{
		scanLine[x + 2] = index;
		scanLine[x + 1] = index;
		scanLine[x + 0] = index;
		break;
	}
	case 126:
	{
		scanLine[x + 1] = index;
		scanLine[x + 0] = index;
		break;
	}
	case 127:
	{
		scanLine[x + 0] = index;
		break;
	}
	default:
	{
		scanLine[x + 3] = index;
		scanLine[x + 2] = index;
		scanLine[x + 1] = index;
		scanLine[x + 0] = index;
		break;
	}
}

Or even suggest dropping into assembly to write a jump table
(which admittedly might still be faster than what I’m about to propose),
but then I had a flash of inspiration and came up with this:

for(uint8_t index = spriteReference.x; index < 128; ++index)
  scanLine[index] = (spriteIndex + 1);

Which shouldn’t be much more expensive than what you’ve already got (if not the same even).