[WIP] VGA-out (without FPGA)

I wanted to share something I have been working on for the last few days. (The idea came to me during a road trip over the weekend).

The idea is to use a ‘cheap’ (< $2) STM32F103C8 microcontroller to provide a VGA-out adapter for DIY Arduboys similar to the excellent FPGA based one that @uXe came up with: VGA1306 (VGA-out for DIY Arduboys implemented on an FPGA!).

The primary advantage is the low cost (< $2 for a ‘blue pill’ board) and easier toolchain (GNU GCC compatible) compared to the FPGA solution.

What makes this possible is the fast clock speed and excellent peripherals in the STM32F103C8. It has two hardware SPI peripherals which can be clocked up to 1/2 the primary clock speed, and both of which can be used by the DMA controller. It is normally clocked at up to 72 MHz, but can easily be overclocked as high as 128 MHz (or maybe even higher) as long as the USB functionality isn’t needed.

For this proof-of-concept, I am using a slight overclock of 80 MHz, which allows 128 pixels (per line) to be clocked out using hardware SPI over DMA with a /16 prescaler in roughly the same amount of time (25.6 µs) that 640 pixels are normally clocked out for a 640x480@60Hz signal (25.422 µs). This provides a 5x scaled pixel to the monitor’s VGA input. In theory, the clock could be cut down to 40 MHz (with the prescaler changed to /8), 20 MHz (/4), or 10 MHz (/2) to avoid the overclock (I haven’t tested it this way yet).
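The arithmetic behind those clock/prescaler pairs can be sanity-checked in a few lines of plain C (just a sketch; the 25.175 MHz figure is the standard 640x480@60Hz pixel clock, and the function names are made up for illustration):

```c
#include <assert.h>

// Standard 640x480@60Hz VGA: 25.175 MHz pixel clock, so the 640-pixel
// active region lasts 640 / 25.175 ≈ 25.42 us.
double vga_active_us(void) { return 640.0 / 25.175; }

// Time to shift out one 128-pixel line over SPI: the bit clock is
// sysclk / prescaler, so 128 bits take 128 / (sysclk / prescaler)
// microseconds when sysclk is given in MHz.
double spi_line_us(double sysclk_mhz, double prescaler) {
    return 128.0 / (sysclk_mhz / prescaler);
}
```

All four clock/prescaler combinations give the same 5 MHz bit clock, which is why they should be interchangeable in theory.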

The VGA horizontal and vertical sync signals are handled by hardware timers with output pins driven by interrupts, so most of the time the main MCU is sitting idle.

There is plenty of RAM available (20 KB) for double-buffering (both a full input and a full output video buffer). The output buffer has to use horizontal addressing to match what the SPI DMA controller needs for the VGA output, and the input buffer will need to match the vertical addressing that the Arduboy sends over. There should be plenty of time available in the vertical back porch (> 1 ms) to (at least start to) convert the input buffer to the output buffer.
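For anyone curious, a straightforward (unoptimized) sketch of that page-to-scanline conversion might look like this; the exact bit orders (bit 0 = top pixel of each SSD1306 page, MSB = leftmost pixel on output) are assumptions for illustration, not taken from my actual code:

```c
#include <stdint.h>

// Convert an SSD1306-style vertically addressed buffer (8 pages of 128
// column bytes, bit 0 = top pixel of the page) into a horizontally
// addressed buffer (64 rows of 16 bytes, MSB = leftmost pixel) suitable
// for shifting out over SPI MSB-first.
void ssd1306_to_scanlines(const uint8_t in[1024], uint8_t out[1024]) {
    for (int page = 0; page < 8; page++) {
        for (int col = 0; col < 128; col++) {
            uint8_t byte = in[page * 128 + col];
            for (int bit = 0; bit < 8; bit++) {
                int row = page * 8 + bit; // screen row 0..63
                if (byte & (1 << bit))
                    out[row * 16 + col / 8] |= 0x80 >> (col % 8);
                else
                    out[row * 16 + col / 8] &= ~(0x80 >> (col % 8));
            }
        }
    }
}
```

A real version would probably want to process a page at a time so the work can be spread across the back porch.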

So far, I have the entire output side of things working fairly nicely.

I still need to code and wire up the input side of things. I think I can use the second hardware SPI (in slave mode, potentially with DMA) on the input side to pull in the Arduboy’s video buffer. I should be able to just process the ‘data’ bytes and ignore the ‘command’ bytes normally intended for the SSD1306 (similar to the FPGA solution), so in theory this shouldn’t require any software changes on the Arduboy, and it shouldn’t require very much cpu time on the STM32 or cause much interference with the VGA output.

I imagine there will be plenty of pins and cpu time left over to be able to include a NES/SNES controller converter as well. :wink: Actually, there will probably be more RAM and CPU left over than the Arduboy has to begin with. And, there is an Arduino toolchain available (stm32duino), so in theory this one MCU/board might be able to handle everything (with a port of the required libraries).

I’m not quite ready to share code yet, but will do so when things are a little further done and tested.

On a 17" non-widescreen 4x3 (1280x1024 native) monitor with ~10x10 pixels:

On a 22" widescreen 16x10 (1680x1050 native) monitor with ~13x15 pixels:

The STM32F103C8T6 ‘Blue Pill’ board with VGA connector:


That looks really neat … and only a couple of bucks?


$1.67 (free shipping):

For programming, you would also need a ST-Link v2 for ~$1.81 (free shipping):

EDIT: You might be able to program it via USB and/or USB-to-serial, but the ST-Link v2 is fast/easy/cheap.


Full disclosure:

I did not originate the idea of using hardware SPI with DMA on one of these microcontrollers for VGA output.

The following article was inspiration for that part of the idea, and along with the sample code was a great resource to get going and learn how it all works.

Using that code with a slight modification, you can get a 400x300 monochrome VGA output (on a 800x600@56Hz signal) that uses very few cpu cycles (using the standard 72 MHz clock).

My changes and experimentation have been all around finding a way to get the timings right for a 128x64 output using the same concepts.

I also made changes to get this up and running with the free/open-source GNU GCC toolchain instead of the paid Keil toolchain used in the article.


I’ve seen those super cheap stm32 boards floating around on ebay, now I have a reason to pick one up! Looking forward to seeing what you can accomplish with this project.

Great project. The board is dirt cheap too :+1:

It seems there is an Arduino package too

Looking good so far. I have an STM32F103RET6 (Maple) board around here somewhere for when you are ready to share the code.

I installed the IceStorm toolchain with two lines at the command prompt:

pip install apio
apio install -a

and then compiled / built the project in one line:

apio build --size 1k --type hx --pack vq100

…couldn’t be simpler! :grinning:


Yep. That package is great for basic Arduino-like projects.

For this project (at least to start with) it seemed better to get closer to the raw hardware. It’s probably not advisable, and maybe not even possible to do hardware SPI with DMA with the Arduino package.

That STM32F103RET6 MCU (512 KB flash, 64 KB ram, 3x SPI, 72 MHz, 32-bit ARM Cortex-M3) looks more than capable of doing everything that I am asking of the STM32F103C8T6 MCU (64 KB flash, 20 KB ram, 2x SPI, 72 MHz, 32-bit ARM Cortex-M3), and then some!

I was even looking into going cheaper/lower-end… something like the ~$0.70 GD32F130G8U6 or GD32F130F8P6 (64 KB flash, 8 KB ram, 2x SPI, 72 MHz, 32-bit ARM Cortex-M3) should be sufficient (in theory).

I will definitely share the code when I have something more worth sharing.

Cool. Will have to check that out later, although I don’t currently have a Lattice ICE FPGA dev board. I currently have Xlinix and Altera FPGA dev boards (gathering dust), and there was a pretty big learning curve on the software for those.

I was gonna say Arduino is pretty much synonymous with overhead, which is something you just don’t want to worry about with a project like this. It might even come down to having to go to assembly if C doesn’t cut the tight timing.

Arduino may be more convenient to get started? Install the package, upload using serial.
For critical stuff you can always resort to naked functions with inline assembly.

What would it take to get started with this? and what uploader/programmer software is required?

I see a few interesting usages for me:

  • Hooked to an ATMEGA32U4 and used as an arduboy devkit console. With TV/VGA output (with RGB LED simulation), (S)NES controller input, ICSP uploader so full 32K can be used for development

  • hooked to an 128x128 (color) OLED display for a custom pico8-ish handheld? :smiley:

Quick update/sneak preview:

I was busy with other things yesterday and today, but was able to get things mostly working late Thursday night / early Friday morning.

There are still some graphical artifacts that I have to deal with, and I am still having issues with always properly determining where the beginning of the screen is in the incoming SPI data stream. Even in this example, the display is shifted over by one pixel because I am not switching between data/command modes quickly enough. And @Mr.Blinky’s bootloader appears to use page addressing mode (sends in 128 byte chunks instead of the whole screen at once) which isn’t properly supported yet.

I have a few more tricks to try, and more code cleanup to do, but progress nonetheless, and amazing to see it work even this well considering what is happening and the timings involved.


That looks cool

Use a timeout to reset the display buffer to the 1st position when there is no display data for 10 ms or so (it takes ~1.2 ms to output the 1K display data over SPI, but that can be delayed by interrupts).
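In code, that resync could look something like this (just a sketch: the 1 ms tick, the helper names, and where they get called from are all made up for illustration):

```c
#include <stdint.h>

#define IDLE_RESET_TICKS 10 // ~10 ms at a hypothetical 1 ms tick

// Hypothetical receive state: write index into the 1 KB input buffer
// and the tick count when the last SPI byte arrived.
static uint32_t buf_index;
static uint32_t last_byte_tick;

// Called from the SPI receive path for every incoming display byte.
void on_spi_byte(uint32_t now_tick) {
    buf_index = (buf_index + 1) % 1024;
    last_byte_tick = now_tick;
}

// Called periodically (e.g. from a SysTick handler): if no display data
// has arrived for ~10 ms, assume the next byte starts a new frame.
void idle_check(uint32_t now_tick) {
    if (now_tick - last_byte_tick >= IDLE_RESET_TICKS)
        buf_index = 0;
}

uint32_t get_buf_index(void) { return buf_index; }
```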


Yep, came to the same conclusion last night after posting this. Will try it out later today.


Yeah, that definitely helps.

I am getting almost perfect ‘in-game’ results, but the Cathy3k flashcart bootloader is still a bit screwed up.

Even after a bunch of optimizations, I think I am still not transitioning between command/data modes quickly enough and am losing a few bits off the beginning of the SPI data stream.

To make sure I only process ‘data’ and not ‘commands’, I have pin change interrupts on the DC and CS pins running the following code:

void EXTI9_5_IRQHandler() { // B9 (DC) // B12 (CS)
  __disable_irq();
  if ((GPIOB->IDR & 0x1200) == 0x0200) { // CS=0, DC=1
    SPI2->CR1 = 0x8641; // Enable SPI
  } else {
    SPI2->CR1 = 0x8701; // Disable SPI
  }
  __enable_irq();
  EXTI_ClearITPendingBit(EXTI_Line9);
}
which compiles down to:

0800042c <EXTI9_5_IRQHandler>:
void EXTI9_5_IRQHandler() { // B9 (DC) // B12 (CS)
 800042c:	b508      	push	{r3, lr}
static __INLINE void __disable_irq()              { __ASM volatile ("cpsid i"); }
 800042e:	b672      	cpsid	i
  if ((GPIOB->IDR & 0x1200) == 0x0200) { // SS=0, DC=1
 8000430:	4b09      	ldr	r3, [pc, #36]	; (8000458 <EXTI9_5_IRQHandler+0x2c>)
 8000432:	689b      	ldr	r3, [r3, #8]
 8000434:	f403 5390 	and.w	r3, r3, #4608	; 0x1200
 8000438:	f5b3 7f00 	cmp.w	r3, #512	; 0x200
    SPI2->CR1 = 0x8641; // Enable SPI
 800043c:	4b07      	ldr	r3, [pc, #28]	; (800045c <EXTI9_5_IRQHandler+0x30>)
 800043e:	bf0c      	ite	eq
 8000440:	f248 6241 	movweq	r2, #34369	; 0x8641
    SPI2->CR1 = 0x8701; // Disable SPI
 8000444:	f248 7201 	movwne	r2, #34561	; 0x8701
 8000448:	801a      	strh	r2, [r3, #0]
static __INLINE void __enable_irq()               { __ASM volatile ("cpsie i"); }
 800044a:	b662      	cpsie	i
 800044c:	f44f 7000 	mov.w	r0, #512	; 0x200
 8000450:	f000 fbea 	bl	8000c28 <EXTI_ClearITPendingBit>
 8000454:	bd08      	pop	{r3, pc}
 8000456:	bf00      	nop
 8000458:	40010c00 	andmi	r0, r1, r0, lsl #24
 800045c:	40003800 	andmi	r3, r0, r0, lsl #16

I’m at a bit of a loss at the moment on how to make this much better.

I haven’t done much ARM assembly before, but the code looks pretty good to me already at a first glance.

Maybe this is just at the limits of a MCU based SPI slave under software control for the SS pin?

Maybe I could find a way to combine the CS and DC pins in hardware to a single SS pin (SS=0 only when CS=0 and DC=1) and eliminate the software SS control?
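For reference, the combined signal I would need is SS = CS OR NOT DC, which is only low in the one case I care about; a tiny truth-table check (hypothetical helper in plain C, treating each pin as 0/1):

```c
// Combined active-low slave select: SS should be asserted (0) only when
// CS is asserted (0) AND DC indicates data mode (1). In Boolean terms
// SS = CS OR NOT DC, i.e. a NAND of (NOT CS) and DC.
int combined_ss(int cs, int dc) {
    return cs | !dc;
}
```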

I could always try and modify the bootloader to have a short delay between command->data transitions (or stop using page addressing mode), but that just feels like cheating.

Anyone with any other ideas?

I don’t know much about the STM32, so I’m not sure how fast the ISR response is or whether there is a setup time on SPI enable.

How about using a separate interrupt for CS to control SPI enable/disable, if that’s possible,
and using the DC pin interrupt to set a data flag, vector, or DMA controller enable?

It would be trivial to use a single quad NAND chip to do this in hardware.

Yeah, I’m not sure either. I am already using the highest priority for these ISRs. Maybe there is SPI/DMA setup time getting in the way as well.

Well, technically the CS and DC pins already have their own ISRs, although they currently do the same thing… i.e. read the state of port B, mask the appropriate bits, and compare if CS=0 and DC=1. I’m not sure I follow the rest. I don’t think setting a flag would help because there isn’t any other code in play (DMA takes over). Not sure what you mean by vector. I could enable/disable the DMA controller, but that would take even longer than just enabling/disabling the SPI that feeds the DMA. At one point I was doing both, but didn’t see any benefit over just enabling/disabling SPI.

It isn’t quite NAND logic, unless you invert one of the signals first, but I think I can build a circuit with 2 transistors (1 PNP, 1 NPN) and 3 resistors that gives the logic I need. I’ll try and build it up tonight.

I was thinking of sacrificing one of the NAND gates to make an inverter by tying its two inputs together, but yes, transistor-resistor or diode-resistor logic would work just as well.

I mentioned those options as I wasn’t sure how you read the SPI data.

Having a better look at the image, I see that SPI is enabled 2 bits too late (~250 ns) and two bits of the next byte are shifted in at the top.

I have no idea what the timings are, but do you need to disable the interrupts manually? Aren’t they disabled by the hardware upon interrupt entry?

You could also try enabling SPI upon entry:

  SPI2->CR1 = 0x8641; // Enable SPI
  if ((GPIOB->IDR & 0x1200) != 0x0200) { // CS=0, DC=1
    SPI2->CR1 = 0x8701; // Disable SPI
  }

If you use a naked ISR with inline assembly only and reduce register usage, you can speed up the push {r3, lr} I think (now r0, r1, r2, r3 and lr are pushed?)

If you can’t make this faster, you need to poll the pins and SPI continuously and drive VGA using interrupts instead. The VGA interrupt would also poll the pins and SPI at frequent intervals to ensure no SPI data is lost.

What I meant with a vector is that you set up a RAM vector that can be called by the interrupt service routine. You could, for example, use the CS interrupt to set the vector to enable or disable SPI; the DC interrupt would then just jump to that vector.
