Of course (as with anything) you're going to get more speed out of assembler... you can use the load/store opcodes with the built-in increment/decrement on the X, Y and Z pointer registers, where the compiler will tend to do its own 16-bit math (for ints, which is common with memory locations)... this isn't the part of the system that's currently slowing anything down, though. I already wrote a screen blank routine in assembler (copying zeros into the buffer). I think I computed by hand that it would be 16x faster than the C implementation - but in a typical app, if drawing is 99% of CPU and copying the buffer over SPI is 1%, then you'd want to tackle the 99% first to have the most impact. Oh, I remember... a lot of the speed was that I could repeat instructions: something like 4 store opcodes in a row, each copying a 0 and incrementing, is way faster than a tight loop that checks its bound condition 4x as often. Plus if you copy 4 at a time you can use 8-bit math for your counter, not 16. And when all was said and done it came out to something like 18 bytes of assembly vs 24 bytes of compiled C. It could no doubt be a few bytes smaller still, at the cost of speed.
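To make the unrolling idea concrete, here's a rough C sketch of the same trick (not my actual assembler routine). The 1024-byte buffer size and the function names are assumptions for illustration. With 4 stores per pass, 1024 bytes takes 256 passes, which is exactly what an 8-bit counter gives you if you let it wrap.

```c
#include <stdint.h>

/* Assumption: a 1024-byte frame buffer (e.g. 128x64 pixels, 1 bit each). */
#define BUF_SIZE 1024

/* Plain version: one store per pass, 16-bit index compared on every pass. */
void clear_simple(uint8_t *buf)
{
    for (uint16_t i = 0; i < BUF_SIZE; i++) {
        buf[i] = 0;
    }
}

/* Unrolled version: four stores per pass, so the bound check runs a
   quarter as often and the pass count fits in an 8-bit counter. */
void clear_unrolled(uint8_t *buf)
{
    uint8_t passes = 0;      /* wraps 0 -> 255 -> ... -> 0, i.e. 256 passes */
    do {
        *buf++ = 0;
        *buf++ = 0;
        *buf++ = 0;
        *buf++ = 0;
    } while (--passes != 0);
}
```

Whether the compiler turns this into the same tight store-with-increment sequence you'd write by hand depends on the compiler and flags, so check the disassembly before counting on it.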
The issue I've seen with the core lib as it stands now is functions that call other functions that call other functions - and doing so with a crazy long parameter list... the compiler then has to push those onto the stack - or push registers onto the stack to free up registers for the params. If you disassemble, you'll see it's not uncommon for many of our functions to start by pushing 16 registers onto the stack and end by popping them off... so if you call that function in a tight loop from another function - boom, you're now potentially 32 times slower - just because of your function calling overhead.
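Here's a rough sketch of the shape of the problem, in C. Everything in it is made up for illustration (the function names, the 128x64 buffer, the byte layout) and it is not the core lib's code. The first line routine pays the call overhead once per pixel; the second does the same work with the loop inside, so the address and mask are worked out once.

```c
#include <stdint.h>

/* Assumptions: 128x64 1-bit buffer, each byte holding 8 vertical pixels. */
#define WIDTH  128
#define HEIGHT 64

static uint8_t fb[WIDTH * HEIGHT / 8];

/* Hypothetical per-pixel routine (no bounds checking). */
void set_pixel(uint8_t x, uint8_t y, uint8_t color)
{
    uint8_t *cell = &fb[(uint16_t)(y / 8) * WIDTH + x];
    uint8_t mask = (uint8_t)(1u << (y % 8));
    if (color) *cell |= mask; else *cell &= (uint8_t)~mask;
}

/* Call per pixel: every pixel pays for the call, the return, and
   whatever registers get saved and restored around it. */
void hline_calls(uint8_t x, uint8_t y, uint8_t w, uint8_t color)
{
    for (uint8_t i = 0; i < w; i++) {
        set_pixel((uint8_t)(x + i), y, color);
    }
}

/* Loop inside: address and mask are computed once, no calls in the loop.
   Assumes x + w <= WIDTH. */
void hline_direct(uint8_t x, uint8_t y, uint8_t w, uint8_t color)
{
    uint8_t *cell = &fb[(uint16_t)(y / 8) * WIDTH + x];
    uint8_t mask = (uint8_t)(1u << (y % 8));
    while (w--) {
        if (color) *cell |= mask; else *cell &= (uint8_t)~mask;
        cell++;
    }
}
```

Both versions should leave the buffer in the same state; how much faster the second one is in practice would need benchmarking on the actual hardware.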
I only started with the screen blank to pick something easy that had an obvious implementation and an obvious way to improve it.
Someone recommended we always inline setPixel, for example... I tried it and my compiled sketch was like 200 bytes larger (not as bad as I thought)... though I didn't bench it to see what the speed improvement was. setPixel is actually one of the better functions, with only 3 args.
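For anyone who wants to try the inlining experiment, a minimal sketch of what "always inline" can look like with GCC is below. The `set_pixel` body, the buffer layout and the helper names are assumptions for illustration, not the library's real code; the only real thing here is GCC's `always_inline` function attribute.

```c
#include <stdint.h>

/* Assumptions: 128x64 1-bit buffer, each byte holding 8 vertical pixels. */
#define WIDTH  128
#define HEIGHT 64

static uint8_t frame[WIDTH * HEIGHT / 8];

/* Forced inline: the body is pasted into every caller, so there is no
   call/return, at the cost of repeating the code at each call site. */
static inline __attribute__((always_inline))
void set_pixel(uint8_t x, uint8_t y, uint8_t color)
{
    uint8_t *cell = &frame[(uint16_t)(y / 8) * WIDTH + x];
    uint8_t mask = (uint8_t)(1u << (y % 8));
    if (color) *cell |= mask; else *cell &= (uint8_t)~mask;
}

/* Hypothetical helper to read a pixel back (handy for checking). */
uint8_t get_pixel(uint8_t x, uint8_t y)
{
    return (uint8_t)((frame[(uint16_t)(y / 8) * WIDTH + x] >> (y % 8)) & 1u);
}

/* Hypothetical caller: with set_pixel inlined, this loop has no calls.
   Assumes n <= HEIGHT. */
void draw_diagonal(uint8_t n)
{
    for (uint8_t i = 0; i < n; i++) {
        set_pixel(i, i, 1);
    }
}
```

The size hit scales with the number of call sites, so comparing the compiled size and timing a drawing-heavy loop before and after is the only way to know if it's worth it.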