Over the christmas break I put together a little demo which renders the utah teapot (240 faces) at >30fps on the arduboy, with 4x4 dithering. I plan to do a bit more before releasing a proper demo (working on music now), but @bateske asked me to share here so here it is.
Download source: https://github.com/a1k0n/arduboy3d/archive/master.zip
Happy to answer questions on how this works. I’ll give a brief overview; sorry about the long post but I’ve been holding this in too long
I started off just rendering an octahedron, as it was simple and the faces were triangles instead of squares (usually I’d use a cube…). And then I extended it to an icosahedron, but then someone pushed me to do more:
That required doing a ton of optimization work, but I’m happy I persevered. In the course of this project, I ended up creating a bespoke statistical profiler which runs in a timer ISR and inspects the call stack, buffers it, and sends it over the USB serial port. I’ll get to that later.
Here’s how it works.
The arduboy screen is a 1024 byte array where each byte corresponds to eight vertically arranged pixels; you can think of it as 8 horizontal “banks” of 8 pixels tall. So the triangle filler works in vertical stripes (instead of the more typical horizontal ones), and the inner loop can plot up to eight pixels at once just by writing a byte to memory and then moving down to the next “bank”. It’s no extra work to plot a dithered pattern; you just write a different byte value for each horizontal column (0x55, 0xaa, etc for a checkerboard).
Another important property to making this look good is that all screen coordinates are computed with sub-pixel accuracy (it’s 4 extra subpixel bits, so the real 128x64 screen becomes a virtual 2048x1024 one). The reason for this is that when plotting lines or triangles, the interior pixels covered by the triangle can change even if the vertices of the triangle map to the same pixels. So slow rotations make subtle changes to the shape of the outline and it really improves the overall aesthetic.
Compromising on rendering accuracy would speed things up a bit, but I’m unwilling to do that and I found better ways to optimize, though ultimately I did compromise precision in 3D coordinates.
I use the same 4x4 matrix as what’s on the wikipedia page, but converted into a 17-level lookup table (from pure black to 15 in-between levels to pure white). A couple lines of numpy generate this.
The teapot has 137 vertices and 240 faces. Naïvely doing the rotation, projection and backface sorting on a microcontroller stretches the frame budget, even without rendering polygons.
The original octahedron used the AVR
float library for all computations, and that ran pretty fast, but then I got a bit more ambitious and had to switch to fixed point math in order to render a teapot. I first did this in a straightforward way, using a lot of 32-bit integer multiplications instead of float multiplications.
However the profiler showed me I was basically spending all my time in the 32-bit multiply and divide routines.
The AVR has a variety of 2-cycle 8x8 -> 16 multiply instructions, and this makes extensive use of that. It does not have any efficient way to perform division, though, so I had to replace all division with either subtract loops (the triangle filler does this) or fake it (the perspective projection is a linear Taylor expansion of k1/(z + k2) ).
The first step in rendering (
DrawObject if you’re looking at the code) is to create a rotation matrix; this is done with a 10-bit precision sine lookup table (another neat trick I discovered: 1024 * (sin(x) - x) is < 256 everywhere, so a 10-bit precise sine LUT fits in 8 bits just by adding x back on), and some 32 bit math to construct a rotation matrix once up front, and then that matrix is quantized down to 8 bits.
The object models are also quantized down to 8 bit coordinates with a little python script which takes an .obj format, rescales the model and quantizes to 8 bits, fuses merged vertices, and also rescales / quantizes face normal vectors. Then the rotation is a 3x3 8-bit matrix multiplied with an 8-bit 3-vector; the AVR has a special instruction (
fmuls) which will do a 1.7 signed fixed point multiplication which is exactly what I needed, and is available in avr-gcc as
Then the vertices are projected with the Taylor expansion trick: 2^20 * x / (2^10 - z) is approximated as x * (2^10 + z) and it’s good enough for a 128x64 screen…
I still need to do 32-bit multiplications to check the face winding order (faces that have clockwise vertices are facing away from the camera, so are not rendered), since the vertices are 16-bits in precision: both because the subpixel screen grid is 2048x1024 but also because some vertices might be off-screen, and we still want to correctly clip the triangles if they are.
The teapot is not a convex object; if you render it without allowing for this, the knob at the top, the handle, or the spout will show through the object like a ghost, unless you either z-buffer (HA! too slow, not enough memory anyway) or draw the faces back-to-front.
I actually implemented a heap sort to do this, which worked perfectly on my computer but would crash on the AVR. Turns out, I’m completely out of memory. Between the frame buffer (1024 bytes), projected vertices (13722 = 548 bytes), stack and other globals, I definitely don’t have another 377 bytes left (137 z coordinates, 240 face orders) to dynamically sort.
Instead, I pre-sort the faces, as I’m not hurting for flash space. The model comes with the faces “baked in” sorted by +x, and also has auxiliary face orders sorted by y and z. At runtime, I figure out which axis is most facing toward (or away from) the camera, choose the face order from the lookup table (reversed if necessary), and render in that order. It’s close enough.
Profiling on Arduboy
I was surprised to find there’s no real decent arduino profiler option.
I ended up hacking together something ugly; I have to manually count the number of ‘push’ instructions in the generated assembly to figure out where the return address on the stack is to get it to work, but once it is working it sends a stream of sampled addresses over the USB port. It also walks the stack a bit and probes the flash for anything that points to a CALL instruction; so that if, for instance, the avr-gcc multiply routine is interrupted, it can find the function which called it and attribute the sample to that function as well.
There is an auxiliary python scripts which collect the data from USB and save the histogram out, and another one to annotate the disassembly with the histogram.
It’s not great, but it’s really really useful.
The code needs tons of cleanup; a lot of junk (including the profiler) is still sitting in the main .ino file. It’s right here:
Feel free to poke at it, use derivatives in your own work. MIT license.
I don’t have binaries to provide at the moment; you’ll have to compile it yourself. It uses the Arduboy2 library but doesn’t use any of its features except getting the screen array pointer and the fast SPI blit.
I am not currently planning to turn it into a generic 3D rendering library; it’s not ready for that yet. I can conceive of ways to extend it to be useful outside this little demo and someday I might. But I have other sub-projects to work on before I get back to this.