On Randomness

To avoid de-railing the choplifter thread, I’m opening this thread to discuss random number generators on the Arduboy.

Idea: Remove the random and regain 300 bytes of Flash
Here’s a simple test program:

#include <Arduboy2.h>

Arduboy2 arduboy;

unsigned long jsr;

void setup() {
  arduboy.boot();
  arduboy.clear();
  power_adc_enable();
  ADCSRA |= _BV(ADSC);
  while (bit_is_set(ADCSRA, ADSC)) { }
  jsr = ((unsigned long)ADC << 16) + micros();
  power_adc_disable();
  
  // randomSeed( seed ); // uncomment when testing regular random()

}

void drawPixel(int16_t x, int16_t y )
{
  if (x < 0 || x > (WIDTH-1) || y < 0 || y > (HEIGHT-1))
  {
    return;
  }

  uint16_t row_offset;
  uint8_t bit;

  uint8_t row = (uint8_t)y / 8;
  row_offset = (row*WIDTH) + (uint8_t)x;
  bit = _BV((uint8_t)y % 8);

  Arduboy2::sBuffer[row_offset] ^=   bit;
}

// Marsaglia's XOR-shift RNG, sans left-shift
// http://www.cse.yorku.ca/~oz/marsaglia-rng.html
long xrandom( long min, long max ){

    (jsr^=(jsr*(1<<17)), jsr^=(jsr>>13), jsr^=(jsr*(1<<5)));
    
    return jsr%(max-min) + min;
}

// same as above, but with 8-bit input/output. 32-bit seed.
uint8_t xrand8( uint8_t min, uint8_t max ){

    (jsr^=(jsr*(1<<17)), jsr^=(jsr>>13), jsr^=(jsr*(1<<5)));
    
    return jsr%(max-min) + min;
}

void loop() {
  uint8_t x, y;
  /* * /
  // standard random.
  // total: 6606
  x = random(0, 128);
  y = random(0, 64);
  /**/
  /* * /
  // total: 6280
  x = xrandom(0, 128);
  y = xrandom(0, 64);
  /* */
  /* */
  // total: 6280
  x = xrand8(0, 128);
  y = xrand8(0, 64);
  /* */
  
  drawPixel(x, y);
  arduboy.display();
}

The standard random produces white noise in this test (therefore passes), LCGs fail terribly (they form a visible pattern), and a XOR-shift RNG passes using less code than the standard random. The seed is still 32 bits in the 8-bit version because changing its size would impact the RNG’s period.

I tried it as a drop-in replacement in Karateka and freed 316 bytes, by means of a macro:
#define random( min, max ) xrand8( min, max )

Can it be made even smaller?

3 Likes

Does this help explain my leftie hostages?

Not really, we still don’t know anything about the distribution or implementation of random.

All we know is that random is a significantly large function.

I did an objdump on random, but it’s of limited use:

00000000 <do_random>:
   0:	a0 e0       	ldi	r26, 0x00	; 0
   2:	b0 e0       	ldi	r27, 0x00	; 0
   4:	e0 e0       	ldi	r30, 0x00	; 0
   6:	f0 e0       	ldi	r31, 0x00	; 0
   8:	00 c0       	rjmp	.+0      	; 0xa <do_random+0xa>
   a:	c8 2f       	mov	r28, r24
   c:	d9 2f       	mov	r29, r25
   e:	68 81       	ld	r22, Y
  10:	79 81       	ldd	r23, Y+1	; 0x01
  12:	8a 81       	ldd	r24, Y+2	; 0x02
  14:	9b 81       	ldd	r25, Y+3	; 0x03
  16:	61 15       	cp	r22, r1
  18:	71 05       	cpc	r23, r1
  1a:	81 05       	cpc	r24, r1
  1c:	91 05       	cpc	r25, r1
  1e:	01 f4       	brne	.+0      	; 0x20 <do_random+0x20>
  20:	64 e2       	ldi	r22, 0x24	; 36
  22:	79 ed       	ldi	r23, 0xD9	; 217
  24:	8b e5       	ldi	r24, 0x5B	; 91
  26:	97 e0       	ldi	r25, 0x07	; 7
  28:	2d e1       	ldi	r18, 0x1D	; 29
  2a:	33 ef       	ldi	r19, 0xF3	; 243
  2c:	41 e0       	ldi	r20, 0x01	; 1
  2e:	50 e0       	ldi	r21, 0x00	; 0
  30:	00 d0       	rcall	.+0      	; 0x32 <do_random+0x32>
  32:	82 2e       	mov	r8, r18
  34:	93 2e       	mov	r9, r19
  36:	a4 2e       	mov	r10, r20
  38:	b5 2e       	mov	r11, r21
  3a:	27 ea       	ldi	r18, 0xA7	; 167
  3c:	31 e4       	ldi	r19, 0x41	; 65
  3e:	40 e0       	ldi	r20, 0x00	; 0
  40:	50 e0       	ldi	r21, 0x00	; 0
  42:	00 d0       	rcall	.+0      	; 0x44 <do_random+0x44>
  44:	c2 2e       	mov	r12, r18
  46:	d3 2e       	mov	r13, r19
  48:	e4 2e       	mov	r14, r20
  4a:	f5 2e       	mov	r15, r21
  4c:	9b 2d       	mov	r25, r11
  4e:	8a 2d       	mov	r24, r10
  50:	79 2d       	mov	r23, r9
  52:	68 2d       	mov	r22, r8
  54:	2c ee       	ldi	r18, 0xEC	; 236
  56:	34 ef       	ldi	r19, 0xF4	; 244
  58:	4f ef       	ldi	r20, 0xFF	; 255
  5a:	5f ef       	ldi	r21, 0xFF	; 255
  5c:	00 d0       	rcall	.+0      	; 0x5e <do_random+0x5e>
  5e:	02 2f       	mov	r16, r18
  60:	13 2f       	mov	r17, r19
  62:	24 2f       	mov	r18, r20
  64:	35 2f       	mov	r19, r21
  66:	bf 2d       	mov	r27, r15
  68:	ae 2d       	mov	r26, r14
  6a:	9d 2d       	mov	r25, r13
  6c:	8c 2d       	mov	r24, r12
  6e:	80 0f       	add	r24, r16
  70:	91 1f       	adc	r25, r17
  72:	a2 1f       	adc	r26, r18
  74:	b3 1f       	adc	r27, r19
  76:	b7 ff       	sbrs	r27, 7
  78:	00 c0       	rjmp	.+0      	; 0x7a <do_random+0x7a>
  7a:	01 97       	sbiw	r24, 0x01	; 1
  7c:	a1 09       	sbc	r26, r1
  7e:	b0 48       	sbci	r27, 0x80	; 128
  80:	88 83       	st	Y, r24
  82:	99 83       	std	Y+1, r25	; 0x01
  84:	aa 83       	std	Y+2, r26	; 0x02
  86:	bb 83       	std	Y+3, r27	; 0x03
  88:	68 2f       	mov	r22, r24
  8a:	79 2f       	mov	r23, r25
  8c:	8a 2f       	mov	r24, r26
  8e:	9b 2f       	mov	r25, r27
  90:	9f 77       	andi	r25, 0x7F	; 127
  92:	cd b7       	in	r28, 0x3d	; 61
  94:	de b7       	in	r29, 0x3e	; 62
  96:	ec e0       	ldi	r30, 0x0C	; 12
  98:	00 c0       	rjmp	.+0      	; 0x9a <random_r>

0000009a <random_r>:
  9a:	00 d0       	rcall	.+0      	; 0x9c <random_r+0x2>
  9c:	08 95       	ret

0000009e <random>:
  9e:	80 e0       	ldi	r24, 0x00	; 0
  a0:	90 e0       	ldi	r25, 0x00	; 0
  a2:	00 d0       	rcall	.+0      	; 0xa4 <random+0x6>
  a4:	08 95       	ret

000000a6 <srandom>:
  a6:	60 93 00 00 	sts	0x0000, r22	; 0x800000 <srandom+0x7fff5a>
  aa:	70 93 00 00 	sts	0x0000, r23	; 0x800000 <srandom+0x7fff5a>
  ae:	80 93 00 00 	sts	0x0000, r24	; 0x800000 <srandom+0x7fff5a>
  b2:	90 93 00 00 	sts	0x0000, r25	; 0x800000 <srandom+0x7fff5a>
  b6:	08 95       	ret

I have just copied the code into the IDE to start playing with it.

Hi!

I agree: the stock RNG can be made smaller.

I get an error when compiling this. It should probably read jsr = ((unsigned long)ADC &lt;&lt; 16) + micros();?

Comma expression, why? :cry: Oh I see it’s the way the article it was lifted from defines it. It doesn’t help readability. :stuck_out_tongue_closed_eyes:

nitpicking you mean the shift register (and the seed) are still 32 bit /nitpicking

I get what you say and I wonder if we can get away using only 16bits for the shift register and an even simpler LSFR.

But we do! :wink:

http://svn.savannah.gnu.org/viewvc/avr-libc/trunk/avr-libc/libc/stdlib/random.c?view=markup

I think most Arduboy games don’t need high quality RNG and games that do can provide their own. Take ArduboyBeep, it is a good idea even if very limited because many games don’t need more.

Speaking of barebone, this seems to give decent results and is quite compact. It has a loop but it may still be faster than the 32 bit version in the OP.

// 16-bit maximal-period Galois LFSR
uint16_t xrand16( uint8_t min, uint8_t max ) {
    uint8_t res = 0;

    for (uint8_t i=0;i<8;i++) {
        const uint8_t msb = (jsr>>8) & 0x80;

        jsr += jsr;
        if (msb)
                jsr ^= 0x002D;

        res |= msb;
        res >>= 1;
    }
    return res%(max-min) + min;
}
2 Likes

I keep forgetting avr-libc is a separate project to Arduino.
The function names match up, so that looks likely.

Judging by the comments they’re using MINSTD, which is a Lehmer LCG.
(It’s been criticised in the past but the relevant publication is longwinded so I didn’t read it.)

Unfortunately that still features modulo bias, but it should be fairly cheap.

Out of interest, where did you get the code/parameters for the GLFSR?

As usual, if anyone comes up with randomising function(s) that are better/smaller than the standard ones, I’ll consider adding them to the Arduoby2 library.

3 Likes

This is the same LFSR shown here http://codebase64.org/doku.php?id=base:flexible_galois_lfsr and is featured on wikipedia https://en.wikipedia.org/wiki/Linear-feedback_shift_register#Galois_LFSRs for some reason I cannot recall, bits are shifted to the left unlike the examples in the above links, but they are all equivalent.

D’oh. Missed that when cleaning up the code.

Yeah, it’s hideous. I didn’t want to change it too much from the original, though. The one change I did was using jsr*(1<<17) instead of jsr<<17. Adding the multiply results in smaller code. :stuck_out_tongue:

I hadn’t seen the Galois LFSR, yet. A 16-bit shift register should be good enough for most games.

Another thing to consider, but I haven’t tested yet, is to move the modulo out of the function:
#define random( min, max ) (xrand16()%((max)-(min))+(min))
The arithmetic can then be optimized away, depending on its values. I think that, in theory, the compiler could replace things like rand()%(9-1)+1 with rand()&7+1.

Yes, it’s definitely desirable. I just assumed that would be done at some point before “release”.

Maybe modifying (jsr^=(jsr*(1<<17)), jsr^=(jsr>>13), jsr^=(jsr*(1<<5))); to use 16bit storage instead of 32bit would end up being both faster and compact. I really don’t like the loop in my Galois LSFR based variant.

I gave that a try, replacing the shifts with smaller values:

uint16_t xrand16( uint16_t min, uint16_t max ){

    jsr^=jsr*(1<<11);
    jsr^=jsr>>9;
    jsr^=jsr*(1<<3);
    
    return jsr%(max-min) + min;
}

I don’t know how to calculate the ideal shifts, so it was trial and error. Most of the trials resulted in really short periods that fail the white noise test badly.

11, 9, 3 look good at first, but eventually it produces a pattern:
ArduboyRecording%20(1)

A little while later this happens:
ArduboyRecording

I assume that the period has ended and the repetition of previous values results in the black screen.

Disassembly contains two small loops that iterate 3 times each. The two calls get inlined and the modulo is done with an AND op.
The entire sketch takes up 6208 bytes, which should be taken with a grain of salt, since the test reflects a very specific case: does not need modulo and GCC might not inline in larger sketches.

Karateka makes for a more realistic test, which takes up 25204 bytes (instead of 25688) with this dropped in, freeing 484 bytes. Here it is not inlined. Moving the modulo out, it drops further to 25166 bytes, 522 bytes freed. The opponent AI appears to be unaffected.


I also tested the GLFSR, which takes up 6270 bytes. It also produces a black screen, and does so more quickly than the 16-bit XOR. I guess this means it has a shorter period.

I found it interesting that it does not get inlined and the compiler is forced to add all this for modulo:


000017da <__divmodhi4>:
__divmodhi4():
    17da:	97 fb       	bst	r25, 7
    17dc:	07 2e       	mov	r0, r23
    17de:	16 f4       	brtc	.+4      	; 0x17e4 <__divmodhi4+0xa>
    17e0:	00 94       	com	r0
    17e2:	07 d0       	rcall	.+14     	; 0x17f2 <__divmodhi4_neg1>
    17e4:	77 fd       	sbrc	r23, 7
    17e6:	09 d0       	rcall	.+18     	; 0x17fa <__divmodhi4_neg2>
    17e8:	0e 94 01 0c 	call	0x1802	; 0x1802 <__udivmodhi4>
    17ec:	07 fc       	sbrc	r0, 7
    17ee:	05 d0       	rcall	.+10     	; 0x17fa <__divmodhi4_neg2>
    17f0:	3e f4       	brtc	.+14     	; 0x1800 <__divmodhi4_exit>

000017f2 <__divmodhi4_neg1>:
    17f2:	90 95       	com	r25
    17f4:	81 95       	neg	r24
    17f6:	9f 4f       	sbci	r25, 0xFF	; 255
    17f8:	08 95       	ret

000017fa <__divmodhi4_neg2>:
    17fa:	70 95       	com	r23
    17fc:	61 95       	neg	r22
    17fe:	7f 4f       	sbci	r23, 0xFF	; 255

00001800 <__divmodhi4_exit>:
    1800:	08 95       	ret

00001802 <__udivmodhi4>:
__udivmodhi4():
    1802:	aa 1b       	sub	r26, r26
    1804:	bb 1b       	sub	r27, r27
    1806:	51 e1       	ldi	r21, 0x11	; 17
    1808:	07 c0       	rjmp	.+14     	; 0x1818 <__udivmodhi4_ep>

0000180a <__udivmodhi4_loop>:
    180a:	aa 1f       	adc	r26, r26
    180c:	bb 1f       	adc	r27, r27
    180e:	a6 17       	cp	r26, r22
    1810:	b7 07       	cpc	r27, r23
    1812:	10 f0       	brcs	.+4      	; 0x1818 <__udivmodhi4_ep>
    1814:	a6 1b       	sub	r26, r22
    1816:	b7 0b       	sbc	r27, r23

00001818 <__udivmodhi4_ep>:
    1818:	88 1f       	adc	r24, r24
    181a:	99 1f       	adc	r25, r25
    181c:	5a 95       	dec	r21
    181e:	a9 f7       	brne	.-22     	; 0x180a <__udivmodhi4_loop>
    1820:	80 95       	com	r24
    1822:	90 95       	com	r25
    1824:	bc 01       	movw	r22, r24
    1826:	cd 01       	movw	r24, r26
    1828:	08 95       	ret

I dont know if this might be of some use

Link is broken, but googling the filename gives this:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.192.6584&rep=rep1&type=pdf

It’s about Hash functions, which are related. Hello, Commander uses something similar to generate terrain procedurally.

1 Like

Oh huh

There’s the main link

Another nice little paper

uint16_t Ent(uint16_t y){
  uint16_t seed = y * 1103515245 + 12345;
  return seed;
}

@FManga By the way, I spotted a slight bug in xrandom.
(1 << 17) overflows the size of int (which is 16 bits large on Arduboy) so the result isn’t what it’s supposed to be. You need to use (static_cast<uint32_t>(1) << 17) (or equivalent, e.g. (1UL << 17)).


I think the wikipedia one might end up smaller since one of the branches can be eliminated using the lfsr ^= (-lsb) & 0xB400u; bithack.

Again, that’s going to introduce modulo bias.

Either way please make it a function and not a macro.
Function macros have no concept of scope and end up ‘poisoning’ all scopes.

The compiler can still probably optimise it if it’s a function since it’ll be small enough to be inlined and then the compiler will be able to figure out the rest.

One of the reasons you’re seeing patterns there is that 9 isn’t a prime number.

17, 13 and 5 are all prime numbers for a reason.
11 and 3 are fine, but 9 isn’t prime. Try 5 or 7 instead.

It has a period of 65535 according to wkipedia, so most likely it’s due to the difference in bit length.

Most of that didn’t add much, however that last article had a couple of lines that got me thinking:

While the LFSR alone can give us pseudo-random data, we actually want to get truly random data whenever possible. In our scenario, the LFSR serves as an entropy pool: we collect actual random data and use it to reseed the LFSR.
Simply put, as long as we feed more entropy into the LFSR than we take out of it, we will actually get random results.

So maybe either of the proposed systems would work alright providing that there was a mechanism in place to track their period and once the period expires, re-seed them with more pin noise.
It wouldn’t be enough for scientific levels of randomness, but it would be enough to mitigate any issues arising from period lengh.


If anyone cares about modulo bias, here’s how to fix it:

// Modified version of OpenBSD's arc4random_uniform
template< typename UInt, typename Random >
UInt uniformRandom(Random random, UInt max)
{
	const UInt min = -max % max;
	UInt result = 0;
	while(result < min)
		result = random();
	return result % max;
}

template< typename UInt, typename Random >
UInt uniformRandom(Random random, UInt min, UInt max)
{
	return min + uniformRandom(random, (max - min));
}

Here’s a demo program

// Modified version of OpenBSD's arc4random_uniform
template< typename UInt, typename Random > UInt uniformRandom(Random random, UInt max)
{
	const UInt min = -max % max;
	UInt result = 0;
	while(true)
	{
		result = random();
		if(result >= min) return result % max;
	}
}

template< typename UInt, typename Random > UInt uniformRandom(Random random, UInt min, UInt max)
{
	return min + uniformRandom(random, (max - min));
}

template< typename UInt, typename Random > UInt nonUniformRandom(Random random, UInt max)
{
	return random() % max;
}

template< typename UInt, typename Random > UInt nonUniformRandom(Random random, UInt min, UInt max)
{
	return min + (random() % (max - min));
}

uint32_t xrandomstate = 1;
uint32_t xrandom(void)
{
  xrandomstate ^= xrandomstate * ((uint32_t)1 << 17);
  xrandomstate ^= xrandomstate >> 13;
  xrandomstate ^= xrandomstate * ((uint32_t)1 << 5);
  return xrandomstate;
}

void setup()
{
  while(!Serial);
  
  xrandomstate = 1;
  for(uint8_t i = 0; i < 20; ++i)
  {
    uint32_t value = nonUniformRandom(xrandom, 0, 128);
    Serial.println(value);
  }
  Serial.println();
  
  xrandomstate = 1;
  for(uint8_t i = 0; i < 20; ++i)
  {
    uint32_t value = uniformRandom(xrandom, 0, 128);
    Serial.println(value);
  }
  Serial.println();
}

void loop()
{
}
2 Likes

This can be achieved with code like:

1 Like

Arg, I keep forgetting int is 16 bits.

11,7,3 produces noise, but it cycles much more quickly than 11,9,3.
Primes give patterns too:
11,5,3 (a lot of diagonal lines, not sure if visible in gif):
ArduboyRecording%20(3)
11,13,3:
ArduboyRecording%20(2)
7,11,3:
ArduboyRecording%20(4)
13,11,5:
ArduboyRecording%20(5)

1 Like

A short period isn’t too much of an issue because of the re-seeding technique, but I guess that’s evidence that there’s more to it than mere primality.


Hrm, in hindsight I notice that:
17 = 16 + 1 = (4 * 4) + 1,
13 = 12 + 1 = (3 * 4) + 1,
5 = 4 + 1 = (1 * 4) + 1,
so maybe that’s also why those were chosen.

It’d be an issue for games that don’t want true random. The game I’m working on now depends on being able to change the seed and getting a known sequence afterwards.


Btw, I was using a macro previously precisely because of the “poisoning”. Since I’m just testing, I wanted to quickly replace all references to random, so macros do that well enough. In more permanent code, we’d have to see if we really want to replace random or provide something like arduboy.random(), which games would have to opt-in to. Migrating from the default to the other might be annoying in existing games.


As for modulo bias, for games that need PRNs generated quickly, the bias might be the lesser problem. The algorithm for removing modulo bias requires two full modulos and a loop that doesn’t run in constant time. In contrast, the biased implementation allows the compiler to use a simple mask in the ideal case (which won’t have a bias).

It might be good to have arduboy.random(min, max) and arduboy.uniformRandom(min, max), or something along those lines.