Arduboy_MiSTer

Continuing the discussion from VGA1306 (VGA-out for DIY Arduboys implemented on an FPGA!):

Finally found a DE10-Nano locally for a reasonable price, and have managed to find the time to get the beginnings of an ‘Arduboy core’ running on it!



So, with a bit more development time / effort we could bring Arduboy games to FPGA gaming platforms like the MiSTer, the upcoming ‘Analogue Pocket’, or even the new GameBoy-shaped Hackaday Supercon Badge!



This is using a modified version of the Arduboy2 library, with the code running on a ‘soft’ RISC-V microcontroller implemented within the FPGA, and outputting video through the DE10-Nano’s HDMI port… I have bodged menloparkinnovation’s DE10-Nano fork of the FPGArduino project together with an example I found for using the HDMI output:

http://www.nxlab.fer.hr/fpgarduino/


All of my customisations are made to the 32K/100MHz RISC-V variation of the core, with serial RX/TX mapped to USER_IN[0] and USER_OUT[1] on the MiSTer (a.k.a. SCL / SDA on the DE10-Nano’s Arduino header), which are then connected to one of these USB-to-serial cables:

There is a corresponding Arduino IDE boards package which allows uploading new sketches over this serial connection, but it is also possible to upload pre-compiled hex files to the bootloader without the Arduino IDE - for example using ‘Send File’ in TeraTerm.

I am going to need help from someone(s) in the MiSTer community to port this project into the MiSTer ‘ecosystem’ though - just getting this far was a lot of effort, but developing specifically for MiSTer has its own additional and intimidating learning curve!


Slowly but surely making progress on this… getting started with coding for the MiSTer platform is a definite culture shock compared to the community here though - very much trial-by-fire! :fire: :flushed:

One thing I did notice when testing was that TEAMarg’s SiNe-DeMo ran much slower than expected - it must be because of all the calls to sin(). This is the FPGA core’s implementation of it, which could likely be optimised? I wouldn’t know where to start - but thought that @Pharap might have some insight? :pleading_face:

Do you have a way to measure that to be sure?
It’s better to be safe than sorry.

E.g. It’s possible that the truncations from double to float are also consuming time.
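
If you don’t already have a harness, something along these lines would give a rough per-call figure.
(Untested as usual, and assuming the Arduino core’s micros() and Serial are available on the soft core.)

#include <math.h>

// Rough benchmark: average the cost of a sin() call over many calls.
// 'volatile' stops the compiler from optimising the calls away.
void timeSin()
{
  volatile double sink = 0.0;
  const unsigned long start = micros();

  for (unsigned int i = 0; i < 1000; ++i)
    sink += sin(i * 0.01);

  const unsigned long elapsed = micros() - start;

  Serial.print(elapsed / 1000);
  Serial.println(" us per call (approx)");
}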

Unfortunately maths isn’t my strong point, the code being written in C rather than C++ limits how much I can infer from it, and I don’t know much about optimising code for FPGAs because I don’t actually know much about how they physically work compared to standard general-purpose CPUs.
I’ll have a think about it anyway though.

On first read the code looks pretty sensible.
(I’ll resist the urge to review the quality. There are good points and bad points.)

It presumably folds x into the [-π/2, π/2) range and then performs a ten-term Taylor series approximation of sine.
(I say presumably because I’m not entirely sure what the 4 ifs are doing; I only know for definite what the fmod and the for loop are doing.)

Off the top of my head, a few options:

  • It might be possible to rewrite the code to employ more lookup tables
  • You could reduce the number of iterations of the Taylor series,
    which would give less accurate results in exchange for faster execution time
  • You could make a float version of sin; floats can be faster than doubles and have less precision, so fewer iterations of the Taylor series may be required (see the sketch below)
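
To illustrate the last two points concretely, a six-term float variant might look something like this.
(An untested sketch - the name sinf_fast is just a placeholder, and the range reduction would mirror the double version quoted below.)

// Untested sketch: six-term float Taylor approximation of sine.
static const float _flt_inv_fact[] =
{
  1.0f / 1.0f,          // 1 / 1!
  1.0f / 6.0f,          // 1 / 3!
  1.0f / 120.0f,        // 1 / 5!
  1.0f / 5040.0f,       // 1 / 7!
  1.0f / 362880.0f,     // 1 / 9!
  1.0f / 39916800.0f,   // 1 / 11!
};

float sinf_fast(float x)
{
  /* Range reduction to [-pi/2, pi/2) omitted;
     it would mirror the double version below. */

  const float x_squared = (x * x);
  float sin_x = 0.0f;

  for (unsigned int i = 0; i < 6; ++i)
    {
      // Alternate between adding and subtracting terms
      if ((i % 2) == 0)
        {
          sin_x += (x * _flt_inv_fact[i]);
        }
      else
        {
          sin_x -= (x * _flt_inv_fact[i]);
        }

      x *= x_squared;
    }

  return sin_x;
}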

One thing I do know that may be of use is that the sin used on the Arduboy is actually written in assembly.
(It’s part of avr-libc, the documentation is here, and the source code is here.)

Lastly, one surprisingly obvious optimisation.
(Obvious for someone who knows what an alternating series is. :P)
(Assuming that eliminating branches is an optimisation for FPGAs.)
The if in the for loop can be eliminated by simply negating every other entry of the lookup table:

/************************************************************************
 * libc/math/lib_sin.c
 *
 * This file is a part of NuttX:
 *
 *   Copyright (C) 2012 Gregory Nutt. All rights reserved.
 *   Ported by: Darcy Gong
 *
 * It derives from the Rhombs OS math library by Nick Johnson which has
 * a compatibile, MIT-style license:
 *
 * Copyright (C) 2009-2011 Nick Johnson <nickbjohnson4224 at gmail.com>
 * 
 * Permission to use, copy, modify, and distribute this software for any
 * purpose with or without fee is hereby granted, provided that the above
 * copyright notice and this permission notice appear in all copies.
 * 
 * THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES
 * WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR
 * ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
 * WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
 * ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF
 * OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
 *
 ************************************************************************/

/************************************************************************
 * Included Files
 ************************************************************************/

#include <sys/types.h>
#include <math.h>

/************************************************************************
 * Private Data
 ************************************************************************/

static double _dbl_inv_fact[] =
{
  1.0 / 1.0,                    // 1 / 1!
  -1.0 / 6.0,                   // -1 / 3!
  1.0 / 120.0,                  // 1 / 5!
  -1.0 / 5040.0,                // -1 / 7!
  1.0 / 362880.0,               // 1 / 9!
  -1.0 / 39916800.0,            // -1 / 11!
  1.0 / 6227020800.0,           // 1 / 13!
  -1.0 / 1307674368000.0,       // -1 / 15!
  1.0 / 355687428096000.0,      // 1 / 17!
  -1.0 / 121645100408832000.0,  // -1 / 19!
};

/************************************************************************
 * Public Functions
 ************************************************************************/

double sin(double x)
{
  double x_squared;
  double sin_x;
  size_t i;

  /* Move x to [-pi, pi) */

  x = fmod(x, 2 * M_PI);
  if (x >= M_PI)
    {
      x -= 2 * M_PI;
    }

  if (x < -M_PI)
    {
      x += 2 * M_PI;
    }

  /* Move x to [-pi/2, pi/2) */

  if (x >= M_PI_2)
    {
      x = M_PI - x;
    }

  if (x < -M_PI_2)
    {
      x = -M_PI - x;
    }

  x_squared = x * x;
  sin_x = 0.0;

  /* Perform Taylor series approximation for sin(x) with ten terms */

  for (i = 0; i < 10; i++)
    {
      sin_x += x * _dbl_inv_fact[i];
      x *= x_squared;
    }

  return sin_x;
}

Completely untested, but I can’t see any logical reason why it wouldn’t work.


While we’re on the subject of circles, tau (τ) is better than pi (π).

Nice catch, thanks! :+1: This didn’t seem to have a noticeable impact on the time, but I have added the changes in anyway. I also tried switching from double to float, which didn’t have much impact either… then I noticed there was an alternate sinf() function that uses only 6 iterations instead of 10, and this is what really made the difference!

I borrowed the SCREENCOUNTER code from the old Mega-Man demo to get some numbers for milliseconds per frame, and the difference between sin() vs sinf() was around 104ms vs around 23ms per frame!

The AVR function does look nice and compact (and apparently also sticks to 6 iterations) but compiling and running the same sketch on a real Arduboy with the stock Arduboy2 library only gave around 32ms per frame… :confused:

EDIT: looks like trying to implement something the way it was done on the good ol’ ZX Spectrum could be a way to go?


Then either simple branching isn’t a bottleneck or the compiler’s been doing a really good job of optimising.

Whether or not there’s a difference always depends on implementation details,
so whether you’d see a gain or not depends on the CPU.

As far as I’m aware, on a program targeting AVR you wouldn’t see a difference because double is actually implemented as float.
This could be the case for the system you’re targeting too.

Although it does depend on how exactly you did the substitution.
E.g. if you were still calling fmod then the floats would be promoted to double and fmod would be using double internally.
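
For what it’s worth, C99’s math.h provides single-precision counterparts like fmodf and sinf, so a float implementation can stay in single precision throughout. E.g. the range reduction (a hypothetical fragment, untested):

#include <math.h>

// fmodf keeps the whole calculation in single precision,
// avoiding any promotion to double.
float reduce_angle(float x)
{
  /* Move x to [-pi, pi) */
  x = fmodf(x, 2.0f * (float)M_PI);

  if (x >= (float)M_PI)
    {
      x -= 2.0f * (float)M_PI;
    }

  if (x < -(float)M_PI)
    {
      x += 2.0f * (float)M_PI;
    }

  return x;
}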

That difference implies to me that there’s more to it than just the loop iterations.
Some of the other factors that could be partially responsible for the time difference:

  • fmodf rather than fmod
  • Smaller lookup table (better data locality if a cache is involved)
  • Fewer iterations give the compiler a greater chance of unrolling the loop
  • float might be cheaper for multiplication
  • float variables and parameters use less stack space (and possibly fewer registers)

Here’s an interesting idea: Try manually unrolling the loop.
If the unrolled loop is significantly faster than using a plain loop then the loop overhead is a bottleneck,
but if the difference is negligible then the floating point addition and multiplication are the real bottlenecks.
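
For clarity, the unrolled version would be mechanical, something like this.
(Untested, reusing the negated-entry _dbl_inv_fact table from the code above, and assuming x has already been range-reduced.)

double sin_unrolled(double x)
{
  const double x_squared = (x * x);
  double sin_x;

  // Ten terms, no loop counter, no branches
  sin_x  = (x * _dbl_inv_fact[0]); x *= x_squared;
  sin_x += (x * _dbl_inv_fact[1]); x *= x_squared;
  sin_x += (x * _dbl_inv_fact[2]); x *= x_squared;
  sin_x += (x * _dbl_inv_fact[3]); x *= x_squared;
  sin_x += (x * _dbl_inv_fact[4]); x *= x_squared;
  sin_x += (x * _dbl_inv_fact[5]); x *= x_squared;
  sin_x += (x * _dbl_inv_fact[6]); x *= x_squared;
  sin_x += (x * _dbl_inv_fact[7]); x *= x_squared;
  sin_x += (x * _dbl_inv_fact[8]); x *= x_squared;
  sin_x += (x * _dbl_inv_fact[9]);

  return sin_x;
}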

That’s not entirely surprising; AVR doesn’t have an FPU (nor any FPGA fabric to offload work to), so its floating point operations are simulated entirely in software.
Not to mention the other factors like its limited pool of registers and its 8-bit register size.

Also AVR code tends to be optimised for size rather than speed.

Hrm… I’ve heard of Chebyshev polynomials but know little about them.

It’s plausible that this approach could be faster,
but I hasten to add that you shouldn’t implement it using macros like the code in the second link does.
Aside from the fact it’s messy, I doubt you’ll get optimal results out of it.

Try this instead (intended to be a .c file):

#include <math.h>

// Chebyshev coefficients
static const double sin_constant_0 = 1.276278962;
static const double sin_constant_1 = -0.285261569;
static const double sin_constant_2 = 0.009118016;
static const double sin_constant_3 = -0.000136587;
static const double sin_constant_4 = 0.000001185;
static const double sin_constant_5 = -0.000000007;

// Approximate sine using a Chebyshev polynomial
// Expects input in a (-0.25, 0.25) range (measured in full turns)
static inline double zx_sin(double x)
{
   const double w = (4 * x);
   const double z = ((2 * (w * w)) - 1);
   
   const double z0 = 1.0;
   const double z1 = (z0 * z);
   const double z2 = (z1 * z);
   const double z3 = (z2 * z);
   const double z4 = (z3 * z);
   const double z5 = (z4 * z);
   
   double sum = sin_constant_0;
   sum += (sin_constant_1 * z);
   sum += (sin_constant_2 * ((2 * z2) - z0));
   sum += (sin_constant_3 * ((4 * z3) - (3 * z1)));
   sum += (sin_constant_4 * ((8 * z4) - (8 * z2) + z0));
   sum += (sin_constant_5 * ((16 * z5) - (20 * z3) + (5 * z1)));
 
   return (sum * w);
}

static const double tau = 6.28318530717958647692528676655900576839433879875021;

double sin(double x)
{
   // Calculate the number of full turns
   const double y = (x / tau);

   // Limit to (-0.25, 0.25) range
   const double z = fmod(y, 0.25);

   // Calculate
   return zx_sin(z);
}

I’m not sure if it would compile into smaller code than the macro version,
but it’s cleaner and easier to read.
(If the macro version does compile to smaller or faster code then consider using it, but if the difference is negligible I’d advise erring on the side of readability.)

My usual “this code is untested” warning goes double this time,
because I’m not actually a C programmer.
Everything I know about C comes from studying the differences between C and C++;
I have never read a C tutorial, and C really is a different language.

If the compiler accepts C++, let me know which standard it supports and I will write a superior C++ version.
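
P.S. If you want a quick way to sanity check the output on a PC before timing it on the core, something like this would let you eyeball the shape.
(Also untested, and it assumes you rename the sin wrapper above to something like sin_approx so it doesn’t clash with the C library’s version.)

#include <stdio.h>
#include <math.h>

// The renamed Chebyshev wrapper from above (hypothetical name)
double sin_approx(double x);

// Tabulate the approximation against the C library's sine
// over one full turn, to check the shape of the curve.
int main(void)
{
  for (int i = 0; i <= 32; ++i)
    {
      const double angle = ((6.283185307179586 * i) / 32);
      printf("%2d: approx = % f, libm = % f\n",
             i, sin_approx(angle), sin(angle));
    }

  return 0;
}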

Hrm… it was actually slower, counter-intuitively, and the output looks more like a sawtooth wave than a sine?

I adapted the ZX Spectrum code precisely as it was presented in the second link,
so either I’ve made a mistake somewhere (which is possible - as I say, I’m not actually very good at maths) or this approach doesn’t actually work better for the FPGA.

Have you tried compiling the macro version from the link?
If that works better then there’s probably something wrong with my transcription;
if it doesn’t then it’s likely that this approach simply doesn’t work as well on FPGAs.
