maemo.org - Talk - EMULib Source with Maemo Support

Quote:

Originally Posted by fms (Post 237185)

Understood. Will fix for the next version.

Good. Also let me know if you have problems with tearing on the latest Diablo firmware; there should not be any.

By the way, your assembly code is not good for ARM11. For example LibARM.s contains lots of chunks of code like this:

Code:

        mov r14,r5,lsr #16
        orr r14,r14,r14,lsl #16
        mov r12,r5,lsl #16
        orr r12,r12,r12,lsr #16

The problem is that a shifted register operand is an "Early Reg" and increases latency by 1 cycle (see the ARM11 TRM, section "Cycle Timings and Interlock Behavior", if you are interested in improving performance). In this particular case you get a 2-cycle penalty from pipeline stalls: after a register is modified, you need to wait one extra cycle before you can use it as a shifted operand, and that happens for each of the two orr instructions. Simply reordering the instructions is faster (4 cycles instead of 6), presumably with no harm to other ARM cores, and it is surely also better for superscalar cores such as Cortex-A8 because it allows dual issue:

Code:

        mov r14,r5,lsr #16
        mov r12,r5,lsl #16
        orr r14,r14,r14,lsl #16
        orr r12,r12,r12,lsr #16
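To make the 6 versus 4 cycle count concrete, here is a toy model in C. It is only a sketch under simplifying assumptions of my own, not a description of the real ARM11 pipeline: single issue, in order, every result usable one cycle after its instruction issues, and a shifted operand needing one extra cycle of distance from its producer. The type Insn, the function count_cycles and the two arrays are hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only (assumed rules, not taken from the ARM11 TRM tables):
 * - one instruction issues per cycle, in order;
 * - a result can be read normally one cycle after its producer issues;
 * - a source used as a shifted operand ("Early Reg") needs one more cycle. */
typedef struct {
    int dst;          /* destination register number (0..15) */
    int src;          /* source register number (0..15) */
    int src_shifted;  /* 1 if src is used as a shifted operand */
} Insn;

/* Returns the cycle in which the last instruction issues (first issue = 1). */
int count_cycles(const Insn *code, size_t n)
{
    int ready[16] = {0};  /* first cycle each register is readable normally */
    int cycle = 0;

    for (size_t i = 0; i < n; i++) {
        assert(code[i].dst >= 0 && code[i].dst < 16);
        assert(code[i].src >= 0 && code[i].src < 16);

        int earliest = ready[code[i].src] + (code[i].src_shifted ? 1 : 0);
        cycle += 1;              /* next issue slot */
        if (cycle < earliest)
            cycle = earliest;    /* stall until the operand is available */
        ready[code[i].dst] = cycle + 1;
    }
    return cycle;
}

/* mov r14,r5,lsr #16 / orr r14,r14,r14,lsl #16 / mov r12,... / orr r12,... */
const Insn original_order[4] = {
    {14, 5, 1}, {14, 14, 1}, {12, 5, 1}, {12, 12, 1}
};

/* Both mov instructions first, then both orr instructions. */
const Insn reordered[4] = {
    {14, 5, 1}, {12, 5, 1}, {14, 14, 1}, {12, 12, 1}
};
```

Under these assumptions count_cycles(original_order, 4) gives 6 and count_cycles(reordered, 4) gives 4, matching the numbers above. For real code the TRM timing tables and a profiler remain the authority.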

The ARM11 pipeline is not that complex (much simpler than x86 cores for sure), and it is usually possible to predict how it will behave and how to make code run faster on it.

To make life easier and to check that you have scheduled instructions properly without missing anything, you can use oprofile and collect CYCLES_DATA_STALL events. Because of the pipeline properties, they do not point exactly to the poorly scheduled instruction but are reported with some delay. So if you are looking at 'opannotate' output and see spikes of CYCLES_DATA_STALL samples, the offending code is usually a few lines above. Checking the ARM11 TRM helps to understand exactly why you got the pipeline stall.

Optimizations that improve memory access performance are also important. ARM processors usually don't allocate a cache line on a write miss, but use a write buffer to store data to memory. This means that special care needs to be taken with writes to memory, as they may become a bottleneck. On OMAP1710 (Nokia 770) and OMAP2420 (Nokia N800/810) it happens that 16-byte-aligned stores of exactly 4 registers with the STM instruction are able to make use of burst transfers, and performance is much better (roughly twice as fast). So, somewhat counterintuitively, instead of

Code:

        LDM {set of 8 registers}
        STM {set of 8 registers}

it is better to use

Code:

        LDM {set of 8 registers}
        STM {set of first 4 registers}
        STM {set of last 4 registers}

That applies, of course, only if the destination address is 16-byte aligned. In other cases burst writes are not used, the memory bus is not used efficiently, and the code is slower. You can also check the following code which implements this trick: https://garage.maemo.org/plugins/scm...er&view=markup
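Since the trick depends on the destination being 16-byte aligned, a copy routine typically has to handle a few leading bytes first. Here is a minimal C sketch of that bookkeeping. It is not taken from the linked code; CopyPlan and plan_copy are hypothetical names for this illustration, and the actual 4-register STM stores would still have to be written in assembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: how a copy of len bytes splits up around
 * 16-byte boundaries of the destination address. */
typedef struct {
    size_t head;  /* bytes to write before dst reaches a 16-byte boundary */
    size_t body;  /* bytes that can be written as whole 16-byte groups */
    size_t tail;  /* leftover bytes after the last whole group */
} CopyPlan;

CopyPlan plan_copy(uintptr_t dst_addr, size_t len)
{
    CopyPlan p;

    /* Distance to the next 16-byte boundary; 0 if already aligned. */
    p.head = (size_t)((16 - (dst_addr & 15)) & 15);
    if (p.head > len)
        p.head = len;

    /* Round the remainder down to a multiple of 16 bytes (4 registers). */
    p.body = (len - p.head) & ~(size_t)15;
    p.tail = len - p.head - p.body;

    assert(p.head + p.body + p.tail == len);
    return p;
}
```

For example, a destination of 0x1004 with 64 bytes to copy splits into 12 head bytes, 48 body bytes (three aligned groups of 4 registers) and 4 tail bytes. Only the body part would be a candidate for the burst writes described above.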

I'm not sure whether the same burst write optimization is useful on other ARM processors, though, because it may be too platform/microarchitecture specific.