OK, so if 16-bit multiplication is enough for IDCT processing, then one could get a 2x speedup by coding the critical parts in assembler. That is not as large an improvement as with SSE.
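To make the 16-bit point concrete, here is a rough sketch in plain C of the arithmetic involved. It assumes Q15 fixed-point coefficients and models a packed dual 16x16 multiply-accumulate, the kind of operation ARMv6 provides as SMLAD (the ARM1136 core in the OMAP 2420 is ARMv6). The function names (`pack16`, `dual_mac16`, `idct_rotate_q15`) are made up for illustration and are not taken from mplayer; whether this maps onto its actual IDCT is an assumption.

```c
#include <stdint.h>

/* cos(pi/4) in Q15 fixed point: round(0.70710678 * 32768) = 23170 */
#define C4_Q15 23170

/* Pack two signed 16-bit values into one 32-bit word (lo in bits 0-15). */
static uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Portable sign-extending extraction of each 16-bit half. */
static int16_t lo16(uint32_t w)
{
    uint32_t u = w & 0xFFFFu;
    return (int16_t)(u >= 0x8000u ? (int32_t)u - 0x10000 : (int32_t)u);
}

static int16_t hi16(uint32_t w)
{
    return lo16(w >> 16);
}

/* Models a dual multiply-accumulate: acc + lo(a)*lo(b) + hi(a)*hi(b).
 * Each product is 16x16 -> 32 bits, so two multiplies fit in one step. */
static int32_t dual_mac16(uint32_t a, uint32_t b, int32_t acc)
{
    return acc + (int32_t)lo16(a) * lo16(b) + (int32_t)hi16(a) * hi16(b);
}

/* One IDCT-style term: (x0 + x1) * cos(pi/4), rounded back from Q15.
 * Assumption: >> on a negative int32_t is an arithmetic shift, which is
 * what common compilers do but is implementation-defined in C. */
static int16_t idct_rotate_q15(int16_t x0, int16_t x1)
{
    int32_t acc = dual_mac16(pack16(x0, x1), pack16(C4_Q15, C4_Q15), 0);
    return (int16_t)((acc + (1 << 14)) >> 15);
}
```

For example, `idct_rotate_q15(100, -50)` computes 50 * 0.7071 and returns 35. The 2x figure comes from doing two such multiplies per instruction, so the real gain depends on how much of the IDCT can be arranged into pairs like this.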
But do you know whether decoding is actually the bottleneck, rather than display rendering or reading from the SD card?
The OMAP 2420 has a 32KB/32KB cache. Is that enough for the performance-critical part of mplayer?
I should get one the next time I travel, so I can start playing with assembler again, even though the last time was 12 years ago on an Intel 8051.