Arduboy_VT100

After playing around with using VGA1306 as a serial terminal I took the same code and adapted it to use the Arduboy itself as a serial terminal…


Using an 8x8 font gives a screen size of 16 columns by 8 rows, and also means I was able to bypass the Arduboy pixel buffer entirely and render the characters directly to the screen! :smirk:

Only a subset of the VT100 escape codes are implemented, so not everything works - but it is enough that you can open a terminal and type directly to the Arduboy screen (100% for the geek factor! :smile:). I also managed to get LCDproc (a Linux tool for running ‘widgets’ on LCD character displays) to talk to it like this:

LCDd | tee /dev/ttyACM0

The font array is stored in RAM; my assumption is that this is faster than storing it in PROGMEM?
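
For reference, the PROGMEM route would look something like this (a minimal sketch - the font name and layout here are hypothetical, not the actual Arduboy_VT100 code). Each flash read is an LPM instruction at 3 cycles versus 2 cycles for an LD from RAM, so the per-byte difference is small:

#include <avr/pgmspace.h>
#include <stdint.h>

// Hypothetical column-major 8x8 font kept in flash rather than RAM
const uint8_t font8x8[96][8] PROGMEM =
{
	{ 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 }, // ' '
	// ... remaining glyphs ...
};

// Reading a glyph column from flash costs one LPM (3 cycles)
// instead of one LD from RAM (2 cycles)
uint8_t glyphColumn(uint8_t glyph, uint8_t column)
{
	return pgm_read_byte(&font8x8[glyph][column]);
}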

Not fast enough to matter for a terminal rendering process.

I have a screen library that does the same thing with the 3x5 font. I forget where I put it, though. Any tile-based system is pretty easy to render without a buffer if you write it carefully.
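
For example, something like this works with Arduboy2’s paint8Pixels() (a rough sketch, assuming a column-major 8x8 font covering printable ASCII and a 16x8 character grid - both hypothetical names). An 8x8 glyph is exactly one SSD1306 page tall, so a full redraw is just 1024 bytes streamed straight to the display, no pixel buffer involved:

#include <Arduboy2.h>

Arduboy2 arduboy;

// Hypothetical terminal state: a 16x8 grid of printable ASCII characters
// and a column-major 8x8 font covering the 96 printable characters
char grid[8][16];
uint8_t font8x8[96][8];

// Stream the whole grid to the display with no pixel buffer:
// each character row lands on exactly one SSD1306 page
// (assumes arduboy.begin() or arduboy.boot() has already run)
void renderGrid()
{
	for(uint8_t row = 0; row < 8; ++row)
		for(uint8_t col = 0; col < 16; ++col)
		{
			uint8_t glyph = (uint8_t)(grid[row][col] - ' ');
			for(uint8_t x = 0; x < 8; ++x)
				arduboy.paint8Pixels(font8x8[glyph][x]);
		}
}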

No licence yet?

To be honest, I think a loop might be smaller and faster than memset (likewise with memmove).
I can’t remember exactly, but I’m pretty sure I tested once and found that memset is actually worse than what the compiler does to optimise loops.

I found this to be the case and removed all memsets from my game.

Ah good, I didn’t imagine it then.

On most modern compilers, I would expect memset to perform slightly worse anyway.
I would expect that the compiler can do size and type checking at compile time,
and just inject a word-by-word fill into the loop area.
(Possibly using vector extensions on a CPU that supports them.)
memset would probably do something similar, but it would have to do the work at runtime, which costs time and memory.
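
To illustrate with a (hypothetical) buffer clear - the loop hands the compiler the element type and count at compile time, so it is free to widen or unroll the stores, whereas memset only ever sees a void * and a runtime length:

#include <string.h>
#include <stdint.h>
#include <stddef.h>

uint8_t buffer[1024];

// The compiler knows the type and count here,
// so it can widen or unroll the stores as it sees fit
void clearWithLoop(void)
{
	for(size_t i = 0; i < sizeof(buffer); ++i)
		buffer[i] = 0;
}

// memset receives a void * and a runtime length,
// so any size/alignment handling happens inside memset at runtime
void clearWithMemset(void)
{
	memset(buffer, 0, sizeof(buffer));
}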

This might hold true for AVR, but unless the compiler knew a lot about your size and alignment I don’t see how it could be as fast as optimized assembler on, say, an ARM chip. Those routines can use special instructions to move large chunks of memory [much more than 1 word at a time] (not even counting DMA) and then, as you get towards the end, drop to smaller and smaller pieces until finally you’re moving single bytes. I wouldn’t expect even a smart compiler to turn a byte-at-a-time for loop into a fully optimized memory copy.

This might also be faster with a memcpy written in AVR assembly, but I haven’t looked to see what a for loop that’s just doing a memory copy compiles down to. If it’s not using X and Z directly with post-increment indirect addressing, then it’s much slower than it could be.

Let’s not get smaller and faster confused, too. That might be the problem here. On ARM (which I mentioned) I’m sure the FASTEST memcpy is MUCH MUCH larger than a for loop… because it has to deal with a lot of different sizes of copies to get that speed. So you have to be careful what you’re measuring.

Not sure why that would be the case on AVR though - there isn’t really much AVR can do to push more bytes at a time. Wouldn’t surprise me if memcpy had some sanity checking or something, though, that takes up some space.

Edit: I take that back. Just unrolling the loop 2-4x would have a big impact on speed, and would result in considerably more code in exchange for the better performance.
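
Something along these lines (a hand-unrolled sketch - copying two bytes per iteration with a fix-up for odd lengths, which is essentially what the OPTIMIZE_SPEED memcpy quoted further down does in assembly):

#include <stdint.h>
#include <stddef.h>

// Hand-unrolled copy: handle a possible odd byte first,
// then copy two bytes per iteration - roughly what 2x unrolling buys
void copyUnrolled2(uint8_t * destination, const uint8_t * source, size_t count)
{
	if((count & 1) != 0)
		*destination++ = *source++;
	for(size_t i = 0; i < (count / 2); ++i)
	{
		*destination++ = *source++;
		*destination++ = *source++;
	}
}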

Difficult to know what to do about a licence… the original code was from a MediaFire download linked to from a YouTube video, with no licence provided… and the font is believed to be public domain? Slap on an eye patch and have at it, I’d say! :parrot: :pirate_flag:

Yes, it seems AVR’s libc has optimized code for memcpy (if OPTIMIZE_SPEED was set when it was compiled) that unrolls the loop once, so that each iteration of the loop copies a word instead of a byte. The unoptimized version just copies a single byte at a time.

Relevant portion:

#if OPTIMIZE_SPEED
; 15 words, (14 + len * 6 - (len & 1)) cycles
	sbrs	len_lo, 0
	rjmp	.L_memcpy_start
	rjmp	.L_memcpy_odd
.L_memcpy_loop:
	ld	__tmp_reg__, Z+
	st	X+, __tmp_reg__
.L_memcpy_odd:
	ld	__tmp_reg__, Z+
	st	X+, __tmp_reg__
.L_memcpy_start:
	subi	len_lo, lo8(2)
	sbci	len_hi, hi8(2)
#else
; 11 words, (13 + len * 8) cycles
	rjmp	.L_memcpy_start
.L_memcpy_loop:
	ld	__tmp_reg__, Z+
	st	X+, __tmp_reg__
.L_memcpy_start:
	subi	len_lo, lo8(1)
	sbci	len_hi, hi8(1)
#endif
	brcc	.L_memcpy_loop

…have reached out to smeezekitty to ask about a licence, will wait and see if anything comes of it! :neutral_face:

Have made some updates, added a nice blinking cursor amongst other things:

It’s the compiler’s job to know the size and alignment of any type.

C++ has sizeof and alignof constructs, which the compiler implements,
therefore the compiler has to be able to know the size and alignment of any object (in the current translation unit).

I think you underestimate modern compilers.

There is typically a phase dedicated to assembly optimisation where the optimisations used are tailored to the assembly language (e.g. object code optimisation or peephole optimisation).

Most compilers have different flags purposely to allow people to choose.

Exactly, that’s what I’m saying.

The compiler can do that check at compile time to avoid having to do it at runtime; memcpy cannot.
Thus, regardless of speed, a bare loop should be smaller, as long as the objects being copied have a trivial copy operator.

But with a good compiler, the compiler will end up essentially reimplementing memcpy to replace the loop: it will do all the overhead calculations at compile time and only generate the assembly code that’s actually required.

I.e. If the compiler knows an array is copying an even number of bytes, it will completely elide the extra logic that would be necessary to handle an odd number of bytes.

memcpy can’t do that; it has to exist in its entirety, and it has to contain the overhead calculations, because you’re passing it a void *, not a strongly-typed pointer.

Loop unrolling is a standard compiler optimisation, and in most cases whether it’s used will depend on the flags chosen.
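
For instance (hypothetical signatures): with the first of these, the element type and count are baked in at compile time, so the odd-byte fix-up and the length arithmetic never need to be emitted; with the second, everything must be handled at runtime:

#include <stddef.h>

// count is a compile-time constant: the compiler knows the total
// size and alignment, and can elide any odd-byte handling entirely
template<typename T, size_t count>
void copyArray(T (&destination)[count], const T (&source)[count])
{
	for(size_t i = 0; i < count; ++i)
		destination[i] = source[i];
}

// A void * and a runtime length erase both type and count,
// so all the overhead logic has to live inside the function
void copyBytes(void * destination, const void * source, size_t count);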


That’s precisely what I would have suggested doing.

It’s more courteous to ask, and you can rest easy knowing that the author isn’t going to later say “you can’t use this”, or if they do say that then it’s better that they say it immediately.

(If they say they don’t care, probably best to just shove MIT, Apache 2.0 or BSD 3C on it, and add an extra disclaimer about the original author.)

OK, I think I am going to leave it as it is now; I have made all the changes that were ‘fun’ to make… :smile: But looking at this list of unimplemented escape codes is dizzying!

Here it is running on the ‘big screen’:

I meant if your object is 1024 bytes but you’re doing a for loop one byte at a time… The compiler is assuming a lot if it decides it’s not just a bunch of bytes and tries to treat it as larger chunks. If 1024 is a constant known at compile time I can see how it could, but I don’t think it currently does.

I think you underestimate modern compilers.

Maybe, but I haven’t seen the AVR compiler do something like turn a for loop into highly optimized memcpy code yet - but perhaps I just haven’t been looking.

But with a good compiler, the compiler will end up essentially reimplementing memcpy to replace the loop: it will do all the overhead calculations at compile time and only generate the assembly code that’s actually required.

Your trick here again is saying “with a good compiler”. I don’t think gcc is particularly “bad”, but I do not think it will turn a byte-by-byte for loop into essentially an optimized memcpy using Z and X like the actual stdlib memcpy. At least, I’ve seen it do far stupider things when I’ve disassembled it - though it’s possible a lot has changed in the past year.

Now I think that in the case of memcpy, what the compiler generates is usually plenty fast enough - but I think the memcpy in stdlib could easily be timed and measured, and it would be found to be a lot faster (at the expense of being larger).

Loop unrolling is a standard compiler optimisation

Sure, but I don’t think it’s typically included with -Os. I thought Arduino was using that by default now…

If you won’t take my word for it, have a benchmark:

using microseconds = unsigned long;

// Prevent inlining to avoid other factors influencing the result
// count is passed on the stack
microseconds __attribute__ ((noinline)) testCopyA(void * destination, const void * source, size_t count)
{
	auto start = micros();
	memcpy(destination, source, count);
	auto end = micros();
	return (end - start);
}

// Prevent inlining to avoid other factors influencing the result
// count is a template parameter, and thus becomes hardcoded
template<typename T, size_t count>
microseconds __attribute__ ((noinline)) testCopyB(T (&destination)[count], const T (&source)[count])
{
	auto start = micros();
	for(size_t i = 0; i < count; ++i)
		destination[i] = source[i];
	auto end = micros();
	return (end - start);
}

// Verifies that two arrays contain the same values
template<typename T, size_t count>
bool verify(const T (&a)[count], const T (&b)[count])
{
	for(size_t i = 0; i < count; ++i)
		if(a[i] != b[i])
			return false;
	return true;
}

// Provides "true" or "false" instead of "1" or "0"
const __FlashStringHelper * boolName(bool b)
{
	return b ? F("true") : F("false");
}

// Prevent inlining as a precaution
// Templated test function for easy test generation
template<typename T, size_t size_value>
void __attribute__ ((noinline)) test(void)
{
	using value_type = T;
	constexpr size_t size = size_value;
	
	value_type source[size];
	for(size_t i = 0; i < size; ++i)
		source[i] = (char)i;
		
	{
		Serial.print(F("memcpy: "));
		value_type destA[size];
	microseconds timeA = testCopyA(destA, source, size * sizeof(value_type));
		Serial.println(timeA);
		Serial.print(F("Verify: "));
		Serial.println(boolName(verify(source, destA)));
	}

	{
		Serial.print(F("for loop: "));
		value_type destB[size];
	microseconds timeB = testCopyB(destB, source);
		Serial.println(timeB);
		Serial.print(F("Verify: "));
		Serial.println(boolName(verify(source, destB)));
	}
}

void setup(void)
{
	Serial.begin(9600); // begin the Serial connection (the baud rate is ignored for USB CDC)
	while(!Serial);
	
	Serial.println(F("Test char[512]:"));
	test<char, 512>();
	Serial.println();
	
	Serial.println(F("Test char[511]:"));
	test<char, 511>();
	Serial.println();
	
	Serial.println(F("Test int[64]:"));
	test<int, 64>();
	Serial.println();
	
	Serial.println(F("Test int[63]:"));
	test<int, 63>();
	Serial.println();
}

void loop(void)
{
}

I don’t know what assembly code the compiler is generating
(because I can’t be bothered to dig out my copy of objdump),
but the for loop outperforms memcpy in the char[511], int[64] and int[63] test cases,
and equals it in the char[512] case.

For simple chars the performance is quite similar,
but for ints the for loop does notably better.

The speed does vary, so here’s three separate runs:

Run 1:

Test char[512]:
memcpy: 260
Verify: true
for loop: 260
Verify: true

Test char[511]:
memcpy: 272
Verify: true
for loop: 260
Verify: true

Test int[64]:
memcpy: 68
Verify: true
for loop: 64
Verify: true

Test int[63]:
memcpy: 64
Verify: true
for loop: 52
Verify: true

Run 2:

Test char[512]:
memcpy: 260
Verify: true
for loop: 264
Verify: true

Test char[511]:
memcpy: 260
Verify: true
for loop: 264
Verify: true

Test int[64]:
memcpy: 68
Verify: true
for loop: 48
Verify: true

Test int[63]:
memcpy: 68
Verify: true
for loop: 52
Verify: true

Run 3:

Test char[512]:
memcpy: 260
Verify: true
for loop: 268
Verify: true

Test char[511]:
memcpy: 260
Verify: true
for loop: 260
Verify: true

Test int[64]:
memcpy: 68
Verify: true
for loop: 52
Verify: true

Test int[63]:
memcpy: 68
Verify: true
for loop: 48
Verify: true

But you don’t have to take my word for it, you can run the code yourself.

It often does.

The size of an array is frequently constant and often known.
When writing Arduboy games, people tend to refer to an array directly and use constant-sized arrays.

I’d wager that very few people have been looking.

Few Arduboy games require high levels of speed, and even in the cases where speed is needed, there are usually larger speed savings to be made before reaching the point where assembly needs to be analysed to look for micro-optimisations.

GCC is widely regarded as an example of ‘a good compiler’.
(As are Clang and MSVC.)

Optimisations are always subject to interference from other optimisations and from coding style.

For example, the compiler starts to struggle once an expression gets notably complex or a function gets particularly long.

Sometimes simply splitting a large function into smaller functions can be enough to reduce code size, because the compiler can handle the smaller functions better since it doesn’t have as much context to keep track of.

And of course, sometimes language rules get in the way of optimisations you’d think would be viable.
Often humans can see optimisations that compilers aren’t allowed to make because doing so would change the meaning of the code or violate the rules of the language.

The GCC documentation does not explicitly state either way.
However, other optimisation levels do perform other loop-related optimisations, such as taking advantage of loop invariants and loop unswitching (e.g. -fmove-loop-invariants and -funswitch-loops).
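
If you want to test it without changing the global flags, GCC’s optimize attribute can opt a single function into unrolling even under -Os (a sketch - the attribute is GCC-specific and its exact behaviour varies between versions):

#include <stdint.h>
#include <stddef.h>

// Request loop unrolling for just this function,
// even when the rest of the program is compiled with -Os
__attribute__((optimize("unroll-loops")))
void copyFast(uint8_t * destination, const uint8_t * source, size_t count)
{
	for(size_t i = 0; i < count; ++i)
		destination[i] = source[i];
}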

Really, in order to state anything conclusively you have to disassemble it. Turns out we’re both right. You’re right: the default optimization here (when the compiler knows the type) is really good. I changed the numbers for a fairer comparison (so we’re moving the same amount of data):

Test char[512]:
memcpy: 260
for loop: 260

Test int[256]:
memcpy: 260
for loop: 192

When the compiler knows the data type, it essentially “unrolls” the byte copy into copying two bytes at a time, making the loop faster. What this can’t show you is that the optimized memcpy code does this without knowing the data types in all cases - but Arduino is simply not using the optimized code. So I’m right in theory too: the optimized memcpy code I linked to earlier would be equally fast (if it were actually being used). Unfortunately it’s not (likely because the unoptimized code is smaller, and size is usually more important than speed here).

Now of course you still have things like function call overhead (memcpy) vs inlined code (for loops)… which either is or isn’t fair depending on what exactly you’re trying to say. Obviously in simple cases like these, loops (which get inlined) are going to win there.

I’d still argue that in more advanced cases a platform specific memcpy could have distinct speed advantages… on other architectures there are opportunities to copy larger chunks of memory than even object alignment would have the compiler doing.

But effectively, for Arduboy and AVR (at least on 1.8.7 with the default compile options), it seems for loops carry the day - and they get faster if you use larger data types.

I was also correct in what I was trying to say here. If you tell the compiler to copy 1024 bytes one at a time, that’s exactly what it does. It doesn’t magically turn that into a faster memcpy (as the optimized memcpy code tries to do). It only speeds things up when it KNOWS the data type is, say, int16_t.

My point was only that using a dedicated function (or writing one by hand) can get you those speed savings either way - even when the compiler magic can’t help you.
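
For example, a hand-written word-at-a-time copy through untyped pointers (a sketch - fine on AVR, where unaligned access isn’t an issue, but stricter platforms and strict-aliasing rules need more care):

#include <stdint.h>
#include <stddef.h>

// Copies word-by-word even though the pointers are untyped -
// the speed-up the compiler can't infer from a byte-by-byte loop
void copyWords(void * destination, const void * source, size_t count)
{
	uint16_t * dst = (uint16_t *)destination;
	const uint16_t * src = (const uint16_t *)source;
	for(size_t i = 0; i < (count / 2); ++i)
		dst[i] = src[i];
	// Fix up a trailing odd byte
	if((count & 1) != 0)
		((uint8_t *)destination)[count - 1] = ((const uint8_t *)source)[count - 1];
}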

@Pharap

If you really want to see the compiler start to trip all over itself try to write simple C for loop code that copies from two different memory locations and then merges them into a third location (such as moving a separate sprite AND mask onto the video buffer). Things get really slow because you don’t have enough 16-bit CPU registers. Technically you do (X,Y, Z) but the compiler will never give up Y cause it’s the stack pointer I think? BUT if you drop down to assembly you can get the Y register and get it done crazy fast.