Spent two days (evenings) implementing per-op rotation. Result: it doesn't give any performance advantage. I'm not sure why: either my code is very dirty and unoptimized, doing tens of matrix multiplications per frame, or there is some other reason. But that's it.