I’m sure I’ve commented here before but I’d like to again - based on reviewing the specs of the Gamebuino META (which I’ve helped kickstart).
- SD Card
- Much larger/faster CPU, such as an M0+ etc. (this has been discussed already)
- Much larger storage space (256kb would be great)
- 128x128, 160x128, or something even higher-res LCD/OLED
But most importantly: either a dedicated graphics co-processor OR a capable LCD/driver wired in NOT over SPI. That means actually using some pins and an 8/16/18-bit parallel interface. Example: ST7735.
The ST7735 is the chip used in the Gamebuino META. It looks like it’s going to be a great system BUT already they are trying to sell “50 fps” as the new “60 fps” (and 25 FPS will evidently be the default on a lot of things) just because of how heavy writing all that data to the screen is over SPI. They are using the 16 bit data mode… but if you use the 18 bit mode, for example, it takes 17ms of SPI time just to push the 0.5 Mbit of data that is one full screen. A frame at 60 FPS is only 16.7ms, so that’s 100% of the available time… i.e., no games at 60 FPS. (That’s just to push the RAM buffer to the display, not counting any game logic or rendering.)
Things are slightly better with the 16 bit data mode. You can push a screen in 11.7ms, leaving about 4.97ms for rendering and game logic (if you wanted to hit 60fps).
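To make the frame-budget arithmetic above easy to replay, here is a rough sketch. The function names are made up for illustration. The first two printed values use the 11.7ms and 17ms figures quoted above; the last one uses an assumed 160x128 panel and an assumed 22MHz SPI clock, so its output is only as good as those assumptions and will not necessarily match the quoted figures.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps


def serial_transfer_ms(width, height, bits_per_pixel, clock_hz):
    """Time to shift one full frame out over a 1-bit serial bus, in ms."""
    total_bits = width * height * bits_per_pixel
    return total_bits / clock_hz * 1000.0


def time_left_ms(fps, transfer_ms):
    """Frame time left for game logic and rendering (negative = frame missed)."""
    return frame_budget_ms(fps) - transfer_ms


# Transfer times as quoted in the post (16 bit and 18 bit modes).
print(round(time_left_ms(60, 11.7), 2))
print(round(time_left_ms(60, 17.0), 2))

# ASSUMPTION: 160x128 pixels, 16 bits per pixel, 22 MHz SPI clock.
print(round(serial_transfer_ms(160, 128, 16, 22_000_000), 2))
```

Swapping in a different resolution, colour depth or clock shows quickly how little headroom a serial link leaves at 60 FPS.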
Of course this is the reason these chips have parallel modes. If you wired up 16 lines to 2 ports on the chip you’d be able to push 16 bits at once instead of one at a time. If you could somehow mix this with DMA you might have a way to push the screen from memory to the driver with very little, if any, CPU involvement. Also let’s not forget that parallel wiring allows you to access the VRAM on most driver chips (which is often read/write) - which creates all sorts of other possibilities - possibly removing the need to buffer a large 16/18 bit color screen in RAM.
I don’t know exactly how much faster it would be… at this point you’d have to calculate the cost of the instruction pipeline, since the CPU itself might become your bottleneck. I also don’t know what the max speed could be. The docs mention the switch speed for the WRX pin itself as a 100ns write cycle with 30ns each for the high and low pulses. So if I assume that is a full flip (pushing one segment of data), that means you can push 8-18 bits per 100ns. Around 10MHz. So if a faster ARM chip (say 48MHz) would still hold around one instruction-ish per cycle… that works out to about 4.8 CPU cycles per write (48MHz / 10MHz) to FETCH, set PORT, and flip WRX (if you can’t use DMA or something cool). If you have 16-bit MOV/LD instructions this still seems feasible (esp. for 16-bit parallel, maybe even 18-bit).
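As a quick sanity check on the cycle budget, dividing the CPU clock by the write rate gives the number of CPU cycles available per parallel write. This is only a sketch: the helper names are hypothetical, and the 48MHz clock and 100ns write cycle are the figures assumed in the paragraph above, not measured values.

```python
def max_write_rate_hz(write_cycle_ns):
    """Highest write strobe rate allowed by a given minimum write cycle."""
    return 1e9 / write_cycle_ns


def cycles_per_write(cpu_hz, write_cycle_ns):
    """CPU cycles that elapse during one write cycle of the display bus."""
    return cpu_hz * write_cycle_ns / 1e9


# 100 ns write cycle -> strobe rate in MHz.
print(max_write_rate_hz(100) / 1e6)

# ASSUMPTION: 48 MHz core retiring roughly one instruction per cycle.
print(round(cycles_per_write(48_000_000, 100), 1))
```

A result well under ten cycles per write suggests the fetch/set-port/strobe loop has to be very tight, which is where DMA would help.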
So now comparing… a 1-bit SPI bus at 22MHz vs a 16-bit parallel bus at 10MHz. By my rough math that would make the parallel interface over 7 times faster, cutting your “transfer” time down to 2-3ms for a whole screen of data. Even 8-bit parallel should come out around 3.5x faster.
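The comparison can be written out as raw bus throughput (bus width times clock). This is a rough sketch with made-up helper names; it ignores command overhead and any CPU bottleneck, and the 160x128, 16 bits per pixel frame is an assumption.

```python
def bus_bits_per_second(bus_width_bits, clock_hz):
    """Raw throughput of a bus that moves bus_width_bits per clock."""
    return bus_width_bits * clock_hz


def speedup(parallel_width_bits, parallel_hz, serial_hz):
    """How many times faster the parallel bus is than a 1-bit serial bus."""
    return bus_bits_per_second(parallel_width_bits, parallel_hz) / serial_hz


def frame_ms(width, height, bits_per_pixel, bus_width_bits, clock_hz):
    """Time to move one full frame over the given bus, in milliseconds."""
    total_bits = width * height * bits_per_pixel
    return total_bits / bus_bits_per_second(bus_width_bits, clock_hz) * 1000.0


# 16-bit and 8-bit parallel at 10 MHz versus 1-bit SPI at 22 MHz.
print(round(speedup(16, 10_000_000, 22_000_000), 2))
print(round(speedup(8, 10_000_000, 22_000_000), 2))

# ASSUMPTION: 160x128 frame at 16 bits per pixel over the 16-bit bus.
print(round(frame_ms(160, 128, 16, 16, 10_000_000), 2))
```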
If any of my math is confusing or doesn’t make any sense, please let me know. I think I got most of it right. 
Wiring the video chip to the CPU this tightly would allow for some seriously impressive graphics pushing, even with a larger high-color LCD/OLED screen. The Gamebuino bills itself as a “hacker” device in that they want to support expansion shields, so they leave all the pins open (as much as possible)… but I don’t think that has ever really been the Arduboy’s goal… it’s more of a “closed shell” device… so we shouldn’t have the same motivation to leave pins unused as they do.
So are there any good reasons NOT to do parallel wiring in the Arduboy 2?
It goes without saying that we should consider 4/8-bit parallel access for the SD card too, if we had enough pins and any chipsets support that. You don’t want the SD card to become the new bottleneck. I know that with SD cards you might eventually have to deal with the read speed of the card itself, but there’s no room to discuss that here.