Coding in ASM - What kind of 8 bit instruction set?


(Damian Caynes) #1

A friend and I are 8 bit coders, I have experience with 6510 on the c64 and he also has z80 experience. We’d like to tackle coding the Arduboy in asm, but I was just wondering, what kind of 8 bit instruction set does it have?


(Damian Caynes) #2

Ok I did some googling, apparently the Arduboy uses an AVR RISC, which I have no experience with.

But I did find the instruction set manual - http://www.atmel.com/Images/doc0856.pdf


#3

Here is the datasheet on the chip as well.
http://www.atmel.com/Images/Atmel-7766-8-bit-AVR-ATmega16U4-32U4_Datasheet.pdf

I’m curious to see how this will turn out. I will eventually want to convert some functions in the future to assembly for tighter control.


(Damian Caynes) #4

At a quick glance it looks fairly familiar and has quite a few handy extra opcodes compared to 6502, I know my mate is going to get really into this because he’s a really hardcore hacker type, who also loves making classic 8 bit games :slight_smile:

We’ll be putting together an ASM library of common routines we use when making games for the Arduboy.


#5

Cool. If you guys build out any fast routines for memory transfers let us know. It’s the area I’m most interested in, since I’m looking to build an easy and fast to use tile map system.


(Damian Caynes) #6

For sure that’ll be one of our first experiments.


(Josh Goebel) #7

Of course (as with anything) you’re going to get more speed out of the assembler… you can use the load/store opcodes with the built-in increment/decrement of the X, Y and Z pointer registers, where the compiler will tend to do its own 16 bit math (for ints, which is common with memory locations)… this isn’t the part of the system that’s currently slowing anything down though. I already wrote a screen blank routine in assembler (copying zeros into the buffer). I think I computed by hand that it would be 16x faster than the C implementation - but in a typical app if drawing is 99% of CPU and copying the buffer over SPI is 1% then you’d want to first tackle the 99% to have the most impact.

Oh I remember… a lot of the speed was that I could repeat instructions: 4 store opcodes in a row that copy 0s and increment is way faster than a tight loop that is checking a bound condition 4x as often. Plus if you copy 4 at a time you can use 8 bit math for your counter, not 16. And when all was said and done it came out to something like 18 bytes of assembly vs 24 bytes of compiled C. It could no doubt be a few bytes smaller still, at the cost of speed.
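For anyone who wants to see the shape of that trick without reading AVR assembly, here is a rough C sketch of the same idea. It is not the actual routine described above: the name `clear_buffer_unrolled` and the `SCREEN_BYTES` constant are made up for illustration, and it assumes a 1024-byte (128x64, 1 bit per pixel) buffer. Four stores per pass means 256 passes, so the counter fits in 8 bits.

```c
#include <stdint.h>

#define SCREEN_BYTES 1024  /* assumed: 128x64 pixels at 1 bit per pixel */

/* Hypothetical helper, not the library's real screen blank routine. */
void clear_buffer_unrolled(uint8_t *buf)
{
    uint8_t count = 0;        /* 8-bit counter: wraps back to 0 after 256 passes */
    do {
        *buf++ = 0;           /* four stores per pass, so the loop test */
        *buf++ = 0;           /* runs a quarter as often as a           */
        *buf++ = 0;           /* one-byte-per-pass loop would           */
        *buf++ = 0;
    } while (--count != 0);   /* 256 passes x 4 bytes = 1024 bytes */
}
```

Whether the compiler turns this into post-increment stores as tight as hand-written assembly is not guaranteed, so it is worth checking the disassembly before trusting it.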

The issue I’ve seen with the core lib as it stands now is functions that call other functions that call other functions - and doing so with a crazy long parameter list… the compiler then has to push those onto the stack - or push registers onto the stack to free up registers for the params. If you disassemble it you’ll see it’s not uncommon for many of our functions to start by pushing 16 registers on the stack then end by popping them off… so if you call that function in a tight loop from another function - boom you’re now potentially 32 times slower - just because of your function calling overhead.

I only started with the screen blank to pick something easy that had an obvious implementation and way to be improved.

Someone recommended we always inline setPixel, for example… I tried it and my compiled sketch was like 200 bytes larger (not as bad as I thought)… though I didn’t bench it to see about the speed improvements. setPixel is actually one of the better functions with only 3 args.


#8

Nothing personal Dreamer3 but what you stated I already know and in great detail (I read the instruction set the first day I came to the forums in a thread we were talking about the screen fill function being bad/slow). =) I guess it doesn’t hurt for others to read to be more informed on what’s going on though.

As far as I can tell (and I could be wrong on this, since I am going off a Microview, a mini OLED Arduino device, at the moment while I wait for my devkit from the Kickstarter), the display packs its pixels vertically (1x8 per byte). Because of that, my main point of focus is going to be compositing an 8x8 tile (8 bytes) based on data from 4 tile blocks (also 8 bytes each) in a fast enough manner. The faster I make it, the more CPU can be spent on logic or quasi timing to resolve the vsync issue. My library will be more focused on those who are used to working on GBC/GBA/VB era hardware (where I got my start).

Regardless, it’s interesting to see what the two of you come up with in assembly. As with all CPUs, sometimes fewer instructions is not the fastest.


(Josh Goebel) #9

What type of compositing are you talking about? Sounds like you’re talking about a bit more than a tile based rendering engine.


#10

Since this is a bit-based display, where 1 byte is 8 pixels and not one, scrolling an 8x8 tile vertically by 1 pixel takes you from 8 writes to 16, with some additional bit masking math so you don’t overwrite what is already set. Since the whole background layer will be composed of a tile map of 8x8 tiles which can be scrolled horizontally or vertically, I will want to write a faster system to handle the drawing. Simply doing drawBitmap() per tile in the map would be horribly inefficient.
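To make the 8 versus 16 writes concrete, here is a small C sketch of drawing one 8x8 tile at an arbitrary pixel row. It is only an illustration under assumptions that are not confirmed in this thread: the buffer is 128 bytes per 8-pixel-tall row ("page"), bit 0 of each byte is the top pixel of that page, and the function name `draw_tile_8x8` is made up. It does no clipping, so the tile must be fully on screen.

```c
#include <stdint.h>

#define WIDTH 128  /* assumed: bytes per 8-pixel-tall page */

/* Hypothetical helper: draw an 8x8 tile at column x, pixel row y (no clipping). */
void draw_tile_8x8(uint8_t *buf, const uint8_t tile[8], uint8_t x, uint8_t y)
{
    uint8_t shift = y & 7;                               /* offset inside the page */
    uint8_t *top = buf + (uint16_t)(y >> 3) * WIDTH + x;
    uint8_t i;

    if (shift == 0) {                                    /* page aligned: 8 plain writes */
        for (i = 0; i < 8; i++) top[i] = tile[i];
        return;
    }

    uint8_t *bottom = top + WIDTH;                       /* tile spills into the next page */
    uint8_t top_keep = (uint8_t)(0xFF >> (8 - shift));   /* pixels above the tile */
    uint8_t bottom_keep = (uint8_t)(0xFF << shift);      /* pixels below the tile */

    for (i = 0; i < 8; i++) {                            /* 16 read-modify-writes */
        top[i] = (uint8_t)((top[i] & top_keep) | (tile[i] << shift));
        bottom[i] = (uint8_t)((bottom[i] & bottom_keep) | (tile[i] >> (8 - shift)));
    }
}
```

The unaligned path is where the extra cost comes from: every column needs two reads, two masks and two writes instead of one plain store.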


(Josh Goebel) #11

I fixed a lot of drawBitmap’s issues in my megaman demo branch. I think I got it down from 5000 micros to around 3000 micros (333 theoretical FPS) for rendering a full screen from 4-5 16x32 tiles. drawBitmap is already trying to do the buffer writes as efficiently as possible - its problem is that it pushes all the bounds checking into the inner loop. In megaman I moved most of the checks out of the inner loop. The inner loop should just copy data to the display in a tight loop. You might be surprised how fast it already is (at least for C). Of course for rendering 8x8s which will MOSTLY be fully onscreen you really do need a specialized function… since you could entirely skip bounds checking for 90% of the tiles. Only the edge ones would need any extra effort…
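As a rough illustration of what moving the checks out of the inner loop can look like (this is not the code from the megaman branch), the C sketch below clips a one-page-tall strip of bytes once, up front, and then runs a copy loop with no tests in it. The name `blit_strip` and the 128x8-page layout are assumptions made for the example.

```c
#include <stdint.h>

#define WIDTH 128  /* assumed: bytes per page */
#define PAGES 8    /* assumed: 8 pages of 8 pixel rows */

/* Hypothetical helper: copy w bytes of src into page `page`, starting at
   column x. x may be negative or run past the right edge. */
void blit_strip(uint8_t *buf, const uint8_t *src, int16_t x, int16_t page, int16_t w)
{
    int16_t start = 0;                 /* first source byte that is visible */
    int16_t end = w;                   /* one past the last visible byte */

    if (page < 0 || page >= PAGES) return;
    if (x < 0) start = (int16_t)-x;               /* clip left edge once */
    if (x + w > WIDTH) end = (int16_t)(WIDTH - x); /* clip right edge once */
    if (start >= end) return;                      /* nothing visible */

    uint8_t *dst = buf + page * WIDTH + x + start;
    const uint8_t *s = src + start;
    for (int16_t n = (int16_t)(end - start); n > 0; n--) {
        *dst++ = *s++;                 /* inner loop: copy only, no bounds tests */
    }
}
```

A tile routine that knows a tile is fully on screen could skip even the up-front clipping and call the copy loop directly.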

Love to see your code and any benchmarks as you’re working on it. :slight_smile:


(Josh Goebel) #12

Of course you could read both tiles then do the bit math and still keep 8 writes… but I’m guessing the memory writes aren’t what are going to cost ya the most here.
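A minimal C sketch of that idea, assuming the same layout as before (bit 0 is the top pixel of each byte) and a vertical scroll offset of 0 to 7 pixels; the name `compose_cell` is made up for the example. Each page-aligned 8x8 cell on screen is built from the tile above and the tile below, so it still costs only 8 writes and never has to read the screen buffer.

```c
#include <stdint.h>

/* Hypothetical helper: build one page-aligned 8x8 screen cell from two
   vertically adjacent tiles, scrolled up by `shift` pixels (0..7). */
void compose_cell(uint8_t *dst, const uint8_t upper[8], const uint8_t lower[8], uint8_t shift)
{
    for (uint8_t i = 0; i < 8; i++) {
        /* the upper tile's remaining rows slide to the top of the cell,
           the lower tile's first rows fill in from the bottom */
        dst[i] = (uint8_t)((upper[i] >> shift) | (lower[i] << (8 - shift)));
    }
}
```

With `shift` equal to 0 the lower tile contributes nothing after the cast back to 8 bits, so the aligned case needs no special handling.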


#13

I think memory writes are fine. It’s the reduced overhead and fewer loops that will save me the most.

Yep. Just as soon as that kickstarter ends and I get my devkit. =)


(Josh Goebel) #14

OR you might be going about this all wrong. Just draw the full 16x8 grid of tiles with 8x8s… THEN if there is a Y offset, hardware shift the screen up by that many pixels and then draw the bottom rows of tiles again, masking at the correct position. Of course any sprites drawn later would need their x,y calculated to take the screen hardware shift into consideration… but you’re drawing 128 tiles and fewer (I’d guess) sprites, so putting the hard math in the sprites rather than in every single tile draw could be a win.

I assume you plan on keeping the 1k screen buffer - or else you won’t know what to mask off… correct?


(Mike McRoberts) #15

How did you get on with this? Written anything in ASM for the Arduboy yet?


(Kevin) #16

There is something similar to ASM actually running on a sort of virtual machine, that’s close! :slight_smile:

http://community.arduboy.com/t/abasm-dp1-program-the-arduboy-on-the-arduboy