Arduboy Dual Core Concept

Earlier today I was trying to place my Arduino Micro on my gen 1 protoshield, which was sitting on top of an Arduino Leonardo.
Then I started wondering about the entire “dual core” thing.
At first it sounds pretty dumb: why would you want to combine TWO ATmega32U4s to complete tasks that can be done by one?
The reasoning is quite simple: to give it more power.
But having two cores does not automatically give you 100% more performance, since we have to take the overall program flow into account, among other things.
And so I started thinking about this problem: “if I did have a separate 32U4, what would I do to get the maximum performance boost?”
That led me to the big issue some of us currently face: RAM.
And we know that the display buffer eats up 1 KB of RAM. So let’s toss this display buffer over to core No.2.
But instead of sending display content over to core No.2, we send COMMANDS over and have it prepare the graphics.
So now core No.1 will not have to deal with the mess of drawing pixels onto the screen, or remembering what is on the screen to prepare it for the next frame. It will only need to deal with buttons, LEDs, the buzzer, and game logic, and it has 32K of ROM and 2.5K of RAM to do that. The second core will specialize in taking drawing commands and preparing the graphics: a CPU and a GPU!
I figured you could also store bitmaps on core No.2, which is nice.
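
For illustration, here’s a rough sketch of what such a command set could look like (all names and values here are made up):

```cpp
#include <stdint.h>

// Hypothetical command IDs core No.1 would send to core No.2.
enum GpuCommand : uint8_t {
  CMD_CLEAR       = 0x00,
  CMD_SET_COLOR   = 0x01, // 1 data byte: 0 = black, 1 = white
  CMD_DRAW_PIXEL  = 0x02, // 2 data bytes: x, y
  CMD_DRAW_LINE   = 0x03, // 4 data bytes: x1, y1, x2, y2
  CMD_FILL_RECT   = 0x04, // 4 data bytes: x, y, w, h
  CMD_DRAW_BITMAP = 0x05, // 1 data byte: index into core No.2's bitmap table
  CMD_DISPLAY     = 0x06  // push the finished frame to the OLED
};
```
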
To communicate between the two cores we can use the fast SPI interface. However, since SPI is also used by core No.2 to interface with the display, there may be issues (e.g. who is the master/host, switching between recipients, etc.), so we may need to use slower protocols (e.g. I2C, UART).
But considering that the 32U4 can only process one command at a time, using SPI will likely not be an issue.
However, to ensure that the two cores stay in sync, we may need them to send confirmation bytes (e.g. if core 1 sends a drawLine command to core 2 but core 2 hasn’t finished the previous fillRect yet) or pull certain pins up/down to indicate readiness.
This might best be done with I2C, as the ability to “ACK” and then hold the clock line low (stalling the other processor) is built in. However, we need to remember that I2C is fairly slow.
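
A minimal sketch of the readiness handshake on core No.1’s side (the pin number and I2C address are just placeholders I made up):

```cpp
#include <Wire.h>

const uint8_t BUSY_PIN    = 7;    // hypothetical spare GPIO wired between the two chips
const uint8_t GPU_ADDRESS = 0x42; // hypothetical I2C address for core No.2

void setup() {
  Wire.begin();              // core No.1 acts as the I2C master
  pinMode(BUSY_PIN, INPUT);  // core No.2 drives this pin
}

// Block until core No.2 signals idle, then send one command packet.
void sendCommand(const uint8_t *cmd, uint8_t len) {
  while (digitalRead(BUSY_PIN) == HIGH) { } // core No.2 holds this high while drawing
  Wire.beginTransmission(GPU_ADDRESS);
  Wire.write(cmd, len);
  Wire.endTransmission();
}
```
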
But I think if we REALLY want to pull this off, we can get it done.

There’s also another processor that might be a useful contender: the ATTINY85, which, for anyone with a mod-chip FX (not the production FX) in their old Arduboy, is just collecting dust inside the casing. It’s used to update the bootloader on the original Arduboy, and is never used again after that. So with that in mind, you may be able to get it to run in conjunction with the 32u4 for some very specific applications. I don’t know that it would be useful for offloading display protocols, but I don’t know much about the ATTINY at all, in earnest :sweat_smile:

EDIT: I went ahead and moved this to a more appropriate topic than “General”, hope you don’t mind


The big problem with trying to create a ‘dual core’ system by combining two CPUs (which, might I add, isn’t the same thing as using a ‘dual core’ CPU - the implications are very different) is that you then have to start worrying about all the downsides that come with asynchronous processing and threading.

E.g. race conditions, mutexes (or should that be muteces? :P), semaphores…

That’s before factoring in that the two devices won’t have a shared address space like a proper dual core CPU would, so their communication channel (which will probably be either I2C or SPI) will be a major bottleneck.

It’s completely possible, but it will introduce some extra complications and complexities.

This much is probably doable, and generally I think it’s a good idea and I’d certainly like to see someone try it.

However, you’d have to be careful with the timing, and it would be more useful for drawing basic shapes than it would be for drawing sprites, unless you had a means to flash the sprites onto the second ‘core’ at the same time you flash your program data onto the first ‘core’.

SPI only needs to be unavailable to core one during the time in which core two is transmitting a frame buffer to the screen; the rest of the time it should be available.

One option would be for core two to send a signal over SPI back to core one after it’s finished rendering the frame, so core one would then know it’s safe to transmit commands again.

Of course, what would really complicate things is if the FX chip was also added to the mix, since that also uses SPI.

Actually, for the best performance what you’d want to do is to be able to queue up a list of commands into core two’s memory.

If core one had to wait for core two to finish a command before it could send any more commands, that would be a big bottleneck and you wouldn’t really gain as much benefit as you ought to.

Thus it’s best if core one can send lots of commands without having to worry about what core two is actually doing at that time, and core two can take new commands off its command queue as and when it needs to.
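
As a rough illustration (the names and sizes here are my own invention), core two’s side of that queue could be a simple ring buffer, with the receive interrupt as the producer and the render loop as the consumer:

```cpp
#include <stdint.h>

#define QUEUE_SIZE 256 // power of two so the index wraps with a simple mask

volatile uint8_t cmdQueue[QUEUE_SIZE];
volatile uint8_t head = 0; // written by the receive interrupt
volatile uint8_t tail = 0; // written by the render loop

// Called from the receive interrupt for each incoming byte.
void queuePush(uint8_t b) {
  uint8_t next = (head + 1) & (QUEUE_SIZE - 1);
  if (next != tail) {      // drop the byte if the queue is full
    cmdQueue[head] = b;
    head = next;
  }
}

// Called from the render loop; returns true if a byte was available.
bool queuePop(uint8_t *b) {
  if (tail == head) return false;
  *b = cmdQueue[tail];
  tail = (tail + 1) & (QUEUE_SIZE - 1);
  return true;
}
```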

For the record, this is what actual GPUs do.
(Or at the very least, D3D has support for it.)


Thinking about it, you could also theoretically get ‘core two’ to run shader programs. Not that you could ‘shade’ much on a monochrome screen. You could do dithering, I suppose.
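
For example (just a toy sketch), an ordered dither table that ‘core two’ could apply per pixel to fake grey levels on the monochrome display:

```cpp
#include <stdint.h>

// 4x4 Bayer matrix: thresholds for brightness levels 0..15.
const uint8_t bayer4[4][4] = {
  {  0,  8,  2, 10 },
  { 12,  4, 14,  6 },
  {  3, 11,  1,  9 },
  { 15,  7, 13,  5 }
};

// Returns 1 (white) or 0 (black) for a requested brightness level 0..15.
uint8_t ditherPixel(uint8_t x, uint8_t y, uint8_t brightness) {
  return brightness > bayer4[y & 3][x & 3] ? 1 : 0;
}
```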


Sending graphics commands over might take more time than actually drawing them, once you consider the complexity of the protocol, interrupt overhead, SPI/I2C speed, etc.

There’s no SPI DMA / buffer (that 1-byte latch doesn’t count :stuck_out_tongue: ) on the 32U4, so both CPUs will be busy shovelling SPI bytes and waiting for completion (a 16-cycle SPI transfer is too short to be worth entering and leaving an interrupt handler for).
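
For reference, the usual polled transfer on AVR looks like this (assuming SPI has already been configured as master), which is why everything ends up as busy-waiting:

```cpp
#include <avr/io.h>
#include <stdint.h>

// At the top SPI speed of F_CPU/2, one byte takes only 16 cycles.
uint8_t spiTransfer(uint8_t out) {
  SPDR = out;                       // start shifting the byte out
  while (!(SPSR & (1 << SPIF))) { } // spin until the transfer completes
  return SPDR;                      // whatever the other side shifted in
}
```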

And/or you might need some buffering for those commands for efficiency, potentially costing more RAM than it saves.

On top of that, you now have an entire graphics command protocol’s worth of code to add to the ROM of the rendering MCU. Instead of passing gfx data pointers to a function, you need a pointer lookup table (or some weird compilation setup so that your build for CPU1 knows where the graphics are in CPU2’s flash at compile time)
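
E.g. that lookup table on the rendering MCU might end up looking something like this (the sprite data here is placeholder):

```cpp
#include <avr/pgmspace.h>
#include <stdint.h>

// Core 1 only ever sends an index byte, never a flash address, so the
// two builds don't need to know anything about each other's memory layout.
const uint8_t playerSprite[] PROGMEM = { 0x3C, 0x42, 0x81, 0x81, 0x42, 0x3C };
const uint8_t enemySprite[]  PROGMEM = { 0xFF, 0x81, 0xBD, 0xBD, 0x81, 0xFF };

const uint8_t *const bitmapTable[] PROGMEM = {
  playerSprite, // index 0
  enemySprite,  // index 1
};
```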

I think having the GFX code along with the game code on the same MCU will be more efficient.
And much less of a nightmare when setting up the build system.

Audio would be a good candidate. The protocol could be as simple as sending an enum byte that tells the other core which sound effect / music to play, with 255 entries (some of which could be SFX stops as needed) and zero to stop all.

And that’s plenty for what can be heard out of a tiny piezo (too much at the same time and it all becomes indistinct noise)

The CPU can be kept busy enough doing PCM/PWM that offloading it to the other “core” would be a big win.
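
A minimal sketch of what that one-byte protocol could look like on the game MCU’s side (the names and baud rate are arbitrary):

```cpp
#include <stdint.h>

// Zero stops everything; 1..255 index into a sound table on the audio MCU.
const uint8_t SND_STOP_ALL  = 0;
const uint8_t SND_JUMP      = 1;
const uint8_t SND_EXPLOSION = 2;
// ...up to 255 entries

void setup() {
  Serial1.begin(115200); // the 32U4's hardware UART, wired to the audio MCU
}

void playSound(uint8_t id) {
  Serial1.write(id); // a single UART byte is the entire protocol
}
```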

Handling an SD card would be a good candidate too, using the second core’s RAM for dealing with FAT32 entries. Reads can be done on demand without storing the whole sector: slower, but very little RAM usage. However, writes require a read-modify-write of an entire sector to update the file system.

And that would allow streaming audio on the same MCU.

Streaming FMVs back to the main core would still be possible. The main MCU doesn’t need anything more than its already-allocated 1 KB video buffer. The 2nd MCU can do reads on demand as long as the MCUs communicate over something other than SPI; if they do talk over SPI, the 2nd MCU needs a 512-byte buffer so it doesn’t waste time on discarded SD reads.


By “sending graphics commands” I mean “telling the other core what to do”.
And, in our case of a simple Arduboy, it can be as simple as sending a byte to indicate the “what to do” part.
For drawLine, that requires 6 transfers (1 command byte, 4 data bytes, and 1 color byte). However, if we transfer a “drawFastVLine” instead, we get away with 5 transfers. And using commands like “set color” we can shave one transfer off each drawing command.
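
Here’s a rough sketch of how that 6-transfer packet could be sent over SPI (assuming SPI is already set up as master; the encoding is just my own guess):

```cpp
#include <avr/io.h>
#include <stdint.h>

// Hypothetical 6-byte drawLine packet: [command][x1][y1][x2][y2][color]
const uint8_t CMD_DRAW_LINE = 0x03;

// Polled SPI transfer, as discussed earlier in the thread.
static void spiSend(uint8_t b) {
  SPDR = b;
  while (!(SPSR & (1 << SPIF))) { }
}

void sendDrawLine(uint8_t x1, uint8_t y1, uint8_t x2, uint8_t y2, uint8_t color) {
  const uint8_t packet[6] = { CMD_DRAW_LINE, x1, y1, x2, y2, color };
  for (uint8_t i = 0; i < sizeof(packet); i++) {
    spiSend(packet[i]);
  }
}
```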

Pushing the entire screen costs 1K transfers (the 128×64 buffer is 1024 bytes). At 5 to 6 bytes each, that’s the equivalent of about 150 to 200 commands.

Technically we wouldn’t need to worry about buffering on “core1”, and even if we do buffer the commands, we will likely need less than 1K of RAM.
That’s one solid K of bytes freed up from the display buffer. Even at 10 bytes per command we could still store a hundred commands (which we won’t need).
Here’s how I picture the two “cores” working without a buffer:

“core1” sends “draw line from x1,y1 to x2,y2”
“core2” receives the command and starts drawing the line
meanwhile, “core1” continues through the program flow, processing logic (e.g. bullet collisions)
after that, “core1” sends commands to draw 2 circles
“core2” will likely have already finished drawing the line and will now be drawing the first circle. However, because there are two circles to be drawn, “core1” will wait until “core2” finishes the 1st circle, then send the second circle command over, and continue.

During this process, “core1” was effectively halted only for the duration of transferring one drawCircle command plus the drawing of one circle; it never had to slow down to process the drawLine or the other circle itself.

The way to get the most out of this approach is to use sprites, and to space the display calls a few lines apart. That way “core1” will only need to stop for exceptionally large sprites.

The entire story would be a lot easier if the two “cores” shared a memory space. “core1” could just write something to a specific part of SRAM (with pointers to the data) that “core2” recognizes as a command; then “core2” could just grab the content from those pointers and process the drawing command.

I’m not a systems engineer yet, though. It wouldn’t be surprising to learn that my speculations are totally off.