Continuing the discussion from VGA1306 (VGA-out for DIY Arduboys implemented on an FPGA!):

Finally found a DE10-Nano locally for a reasonable price, and have managed to find the time to get the beginnings of an ‘Arduboy core’ running on it!

So, with a bit more development time / effort we could bring Arduboy games to FPGA gaming platforms like the MiSTer, the upcoming ‘Analogue Pocket’, or even the new GameBoy-shaped Hackaday Supercon Badge!

This is using a modified version of the Arduboy2 library, with the code running on a ‘soft’ RISC-V microcontroller implemented within the FPGA, and outputting video through the DE10-Nano’s HDMI port… I have bodged menloparkinnovation’s DE10-Nano fork of the FPGArduino project together with an example I found for using the HDMI output:

All of my customisations are made to the 32K/100MHz RISC-V variation of the core, with serial RX/TX mapped to USER_IN[0] and USER_OUT[1] of the MiSTer (aka. SCL / SDA on the DE10-Nano’s Arduino header) which are then connected to one of these USB-to-serial cables:

There is a corresponding Arduino IDE boards package which allows uploading new sketches over this serial connection, but it is also possible to just upload pre-compiled hex files to the bootloader without the Arduino IDE - for example using ‘Send File’ in TeraTerm.

I am going to need help from someone(s) in the MiSTer community to port this project into the MiSTer ‘ecosystem’ though - just getting this far was a lot of effort, but developing specifically for MiSTer has its own additional and intimidating learning curve!


Slowly but surely making progress on this… getting started with coding for the MiSTer platform is a definite culture shock compared to the community here though - very much trial-by-fire! :fire: :flushed:

One thing I did notice when testing was that TEAMarg’s SiNe-DeMo ran much slower than expected - must be because of all the calls to sin(). This is the FPGA core’s implementation of it, which could likely be optimised? I wouldn’t know where to start - but thought that @Pharap may have some insight? :pleading_face:

Do you have a way to measure that to be sure?
It’s better to be safe than sorry.

E.g. It’s possible that the truncations from double to float are also consuming time.

Unfortunately maths isn’t my strong point, and being written in C rather than C++ limits how much I can infer from the code. I also don’t know much about optimising code for FPGAs, because I don’t actually know much about how they physically work compared to standard general-purpose CPUs -
but I’ll have a think about it anyway.

On first read the code looks pretty sensible.
(I’ll resist the urge to review the quality. There are good points and bad points.)

It presumably maps x to quadrant 1 of the unit circle and then performs a 10-step Taylor series approximation of sine.
(I say presumably because I’m not entirely sure what the 4 ifs are doing, I only know for definite what the fmod and for are doing.)

Off the top of my head, a few options:

  • It might be possible to rewrite the code to employ more lookup tables
  • You could reduce the number of iterations of the Taylor series,
    which would give less accurate results in exchange for faster execution time
  • You could make a float version of sin; floats can be faster than doubles and have less precision (so fewer iterations of the Taylor series may be required)

One thing I do know that may be of use is that the sin used on the Arduboy is actually written in assembly.
(It’s part of avr-libc, the documentation is here, and the source code is here.)

Lastly, one surprisingly obvious optimisation.
(Obvious for someone who knows what an alternating series is. :P)
(Assuming that eliminating branches is an optimisation for FPGAs.)
The if in the for loop can be eliminated by simply negating every other entry of the lookup table:

```c
/****************************************************************************
 * libc/math/lib_sin.c
 *
 * This file is a part of NuttX:
 *
 *   Copyright (C) 2012 Gregory Nutt. All rights reserved.
 *   Ported by: Darcy Gong
 *
 * It derives from the Rhombs OS math library by Nick Johnson which has
 * a compatible, MIT-style license:
 *
 * Copyright (C) 2009-2011 Nick Johnson <nickbjohnson4224 at>
 *
 * Permission to use, copy, modify, and distribute this software for any
 * purpose with or without fee is hereby granted, provided that the above
 * copyright notice and this permission notice appear in all copies.
 ****************************************************************************/

/****************************************************************************
 * Included Files
 ****************************************************************************/

#include <sys/types.h>
#include <math.h>

/****************************************************************************
 * Private Data
 ****************************************************************************/

static double _dbl_inv_fact[] =
{
  1.0 / 1.0,                    // 1 / 1!
  -1.0 / 6.0,                   // -1 / 3!
  1.0 / 120.0,                  // 1 / 5!
  -1.0 / 5040.0,                // -1 / 7!
  1.0 / 362880.0,               // 1 / 9!
  -1.0 / 39916800.0,            // -1 / 11!
  1.0 / 6227020800.0,           // 1 / 13!
  -1.0 / 1307674368000.0,       // -1 / 15!
  1.0 / 355687428096000.0,      // 1 / 17!
  -1.0 / 121645100408832000.0,  // -1 / 19!
};

/****************************************************************************
 * Public Functions
 ****************************************************************************/

double sin(double x)
{
  double x_squared;
  double sin_x;
  size_t i;

  /* Move x to [-pi, pi) */

  x = fmod(x, 2 * M_PI);
  if (x >= M_PI)
    {
      x -= 2 * M_PI;
    }

  if (x < -M_PI)
    {
      x += 2 * M_PI;
    }

  /* Move x to [-pi/2, pi/2) */

  if (x >= M_PI_2)
    {
      x = M_PI - x;
    }

  if (x < -M_PI_2)
    {
      x = -M_PI - x;
    }

  x_squared = x * x;
  sin_x = 0.0;

  /* Perform Taylor series approximation for sin(x) with ten terms */

  for (i = 0; i < 10; i++)
    {
      sin_x += x * _dbl_inv_fact[i];
      x *= x_squared;
    }

  return sin_x;
}
```

Completely untested, but I can’t see any logical reason why it wouldn’t work.

While we’re on the subject of circles, tau (τ) is better than pi (π).

Nice catch, thanks! :+1: This didn’t seem to have a noticeable impact on the time, but I’ve added the changes in anyway. I also tried switching from double to float, which didn’t have much impact either… I noticed there was an alternate sinf() function that uses only 6 iterations instead of 10, and that is what really made the difference!

I borrowed the SCREENCOUNTER code from the old Mega-Man demo to get some numbers for milliseconds per frame, and the difference between sin() vs sinf() was around 104ms vs around 23ms per frame!

The AVR function does look nice and compact (and apparently also sticks to 6 iterations) but compiling and running the same sketch on a real Arduboy with the stock Arduboy2 library only gave around 32ms per frame… :confused:

EDIT: looks like trying to implement something the way it was done on the good ol’ ZX Spectrum could be a way to go?

Then either simple branching isn’t a bottleneck or the compiler’s been doing a really good job of optimising.

Whether or not there’s a difference always depends on implementation details,
so whether you’d see a gain or not depends on the CPU.

As far as I’m aware, on a program targeting AVR you wouldn’t see a difference because double is actually implemented with float.
This could be the case for the system you’re targeting too.

Although it does depend on how exactly you did the substitution.
E.g. if you were still calling fmod then the floats would be promoted to double and fmod would be using double internally.

That difference implies to me that there’s more to it than just the loop iterations.
Some of the other factors that could be partially responsible for the time difference:

  • fmodf rather than fmod
  • Smaller lookup table (better data locality if a cache is involved)
  • Fewer iterations gives a greater chance of the loop being unrolled
  • float might be cheaper for multiplication
  • float variables and parameters use less stack space (and possibly fewer registers)

Here’s an interesting idea: Try manually unrolling the loop.
If the unrolled loop is significantly faster than using a plain loop then the loop overhead is a bottleneck,
but if the difference is negligible then the floating point addition and multiplication are the real bottlenecks.

That’s not entirely surprising: AVR doesn’t have an FPU (or any FPGA logic), so its floating point operations are simulated entirely in software.
Not to mention the other factors like its limited pool of registers and its 8-bit register size.

Also AVR code tends to be optimised for size rather than speed.

Hrm… I’ve heard of Chebyshev polynomials but know little about them.

It’s plausible that this approach could be faster,
but I hasten to add that you shouldn’t implement it using macros like the code in the second link does.
Aside from the fact it’s messy, I doubt you’ll get optimal results out of it.

Try this instead (intended to be a .c file):

```c
#include <math.h> // for fmod

// Chebyshev coefficients
static const double sin_constant_0 = 1.276278962;
static const double sin_constant_1 = -0.285261569;
static const double sin_constant_2 = 0.009118016;
static const double sin_constant_3 = -0.000136587;
static const double sin_constant_4 = 0.000001185;
static const double sin_constant_5 = -0.000000007;

// Approximate sine using a Chebyshev polynomial
// Expects input in a (-0.25, 0.25) range
static inline double zx_sin(double x)
{
	const double w = (4 * x);
	const double z = ((2 * (w * w)) - 1);
	const double z0 = 1.0;
	const double z1 = (z0 * z);
	const double z2 = (z1 * z);
	const double z3 = (z2 * z);
	const double z4 = (z3 * z);
	const double z5 = (z4 * z);
	double sum = sin_constant_0;
	sum += (sin_constant_1 * z);
	sum += (sin_constant_2 * ((2 * z2) - z0));
	sum += (sin_constant_3 * ((4 * z3) - (3 * z1)));
	sum += (sin_constant_4 * ((8 * z4) - (8 * z2) + z0));
	sum += (sin_constant_5 * ((16 * z5) - (20 * z3) + (5 * z1)));
	return (sum * w);
}

static const double tau = 6.28318530717958647692528676655900576839433879875021;

double sin(double x)
{
	// Calculate the number of full turns
	const double y = (x / tau);

	// Limit to (-0.25, 0.25) range
	const double z = fmod(y, 0.25);

	// Calculate
	return zx_sin(z);
}
```
I’m not sure if it would compile into smaller code than the macro version,
but it’s cleaner and easier to read.
(If the macro version does compile to smaller or faster code then consider using it, but if the difference is negligible I’d advise erring on the side of readability.)

My usual “this code is untested” warning goes double this time,
because I’m not actually a C programmer.
Everything I know about C comes from studying the differences between C and C++,
I have never read a C tutorial and it is a different language.

If the compiler accepts C++, let me know the version and I will write a superior version.

Hrm… it was actually slower, counter-intuitively, and looks more like a sawtooth wave than a sine?

I adapted the ZX Spectrum code precisely as it was presented in the second link,
so either I’ve made a mistake somewhere (which is possible - as I say, I’m not actually very good at maths) or this approach doesn’t actually work better for the FPGA.

Have you tried compiling the macro version from the link?
If that works better then there’s probably something wrong with my transcription,
if it doesn’t then it’s likely to be that this approach simply doesn’t work as well on FPGAs.


…a month later and have made a lot of progress on this (pre-compiled hex files don’t work at the moment though, sketches need to be re-compiled using the modified Arduboy2 library):

Have managed to add in the ability to load hex files from the SD card using the MiSTer framework’s menu system (the flashing colours are just the address lines tied to RGB during the loading sequence to provide some kind of visual activity indication):


Curious how you have implemented the sound synthesis? It sounds really good through MiSTer, way better than the emulator.

Is there something we can learn from what you’ve done and implement it in the emulator?

Wouldn’t work for the emulator unfortunately - I am kind of cheating here, by re-writing the sound functions to just continually pass across a value for the desired pitch (or zero for off), instead of trying to actually emulate the real timers / interrupts of the 32u4 and interpret those register values! :lying_face:

So, the MiSTer core can’t run the same .hex file that the emulator or a real Arduboy can, only sketches that have been recompiled using the modified library… might be a useful strategy to apply to @SimonMerrett’s SAMD port though?


Where is the math wizard who can give us true hex compatibility!?

It still sounds good though. We talked about doing something like that for the emulator, but I’m not keen on modifying all the games - maybe it’s what has to be done, but I would prefer not to.

Have been working some more on this, collaborating with the author of an ATmega soft core, to be able to move away from the RISC-V core I used and make it truly ATmega32U4 compatible:

…doing some testing with The Curse of AstaroK and getting a weird glitch with the background rendering - anyone have an idea which particular AVR instruction going wrong would cause this part of the rendering to break in this way? @Mr.Blinky?

…here are a couple more photos if it helps - I can’t figure out what would be causing it to glitch in this particular way?

…seems to be a pattern with the enemy name glitching at least, where it goes sequentially through the alphabet!

Cool that you’re using an actual AVR core.

The first thing that came to mind was: do lpm rx,z+ and adiw xx,k correctly handle the carry overflow from LSB to MSB (and XMEGA uses more than 16-bit addressing)?
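To check that carry behaviour in isolation, the register-pair add can be modelled in a few lines of C (purely illustrative - `adiw_model` is a made-up name, not the core’s actual implementation):

```c
#include <stdint.h>

/* Model ADIW: add a small constant to an 8-bit register pair (lo, hi),
 * making sure the carry out of the low byte propagates into the high byte.
 * A core that forgot the carry would update lo and leave hi untouched. */
static uint16_t adiw_model(uint8_t lo, uint8_t hi, uint8_t k)
{
  uint16_t sum = (uint16_t)lo + k;
  uint8_t new_lo = (uint8_t)sum;
  uint8_t new_hi = (uint8_t)(hi + (sum >> 8));  /* propagate the carry */
  return (uint16_t)new_lo | ((uint16_t)new_hi << 8);
}
```

Feeding the soft core the same boundary cases (e.g. a low byte of 0xFF) and comparing against this kind of reference would show quickly whether the pointer increment is at fault.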

Secondly, I wondered if SPI was working properly. But that couldn’t cause the walk-through-the-alphabet thing.

Thirdly, I wondered if the multiply is functioning properly (signed and unsigned), as that’s what’s used to shift pixels to the proper positions (junk bits could easily be shifted/multiplied in).

I’d say see if you can recreate the issue with a simple sketch that moves a single sprite around. It will be easier to track down the issue.

Maybe seeing a video of the glitching will give some more ideas.


I don’t know if it helps, but each time you go into the dungeon the names of all the monsters are randomly generated, so what we see there is probably the whole array of created names. Why it’s doing this is beyond me though.


Try testing some other games.

Identify which ones have the problem and which (if any) don’t,
that should help to narrow down the issue.

In particular, try running some smaller, simpler games,
because that will mean less code to wade through.

Not sure if it’s an issue or not but I use ArdBitmap for this game. Is there any way the compression could be an issue?

I also use some generated music that uses interrupts written by someone … oh yeah, you!


Not for font rendering though, for that you’re using SpritesB:

(Unless the names aren’t being rendered with Font3x6, but I would assume they probably are.)