Yes, this would work. I would interpolate between frames (perhaps this is what you meant by 'smooth the movement') and, from the accumulated values, set a best guess for a point slightly beyond the rendered frame, close enough not to produce discernible lag or strange 'drifting'. I would definitely use the change in acceleration across many of these samples to help determine that best guess. Linear interpolation would be acceptable because the time interval between guesses is minuscule. I would also compute a weighted average of the differing values to best estimate where the device is relative to the head, and tick at regular intervals.
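To make that concrete, here is a rough sketch of the three pieces I have in mind: a weighted average over the separate estimates, a short look-ahead guess, and linear interpolation toward that guess on each tick. Everything here is an assumption for illustration: the names (`Estimate`, `fuse`, `predict`, `lerp`), the 1-D position, the constant-acceleration look-ahead, and the weights and timings in the usage lines are all made up, and a real tracker would work in 3-D with tuned values.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """One source's guess at the device position relative to the head (hypothetical 1-D units)."""
    position: float
    weight: float  # assumed confidence in this source; higher means more trusted


def fuse(estimates):
    """Weighted average of the per-source position estimates."""
    total = sum(e.weight for e in estimates)
    if total <= 0:
        raise ValueError("at least one estimate needs a positive weight")
    return sum(e.position * e.weight for e in estimates) / total


def predict(position, velocity, acceleration, lookahead):
    """Best guess a short time beyond the rendered frame (constant-acceleration assumption)."""
    return position + velocity * lookahead + 0.5 * acceleration * lookahead ** 2


def lerp(a, b, t):
    """Linear interpolation from a to b, with t clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    return a + (b - a) * t


# Usage: one regular tick (all numbers are placeholders).
fused = fuse([Estimate(position=0.10, weight=2.0), Estimate(position=0.14, weight=1.0)])
target = predict(fused, velocity=0.5, acceleration=-0.2, lookahead=0.02)
rendered = lerp(0.09, target, t=0.5)  # move the rendered position halfway toward the guess
```

Keeping the look-ahead small is what keeps plain linear interpolation good enough; a larger look-ahead would make any error in the guess show up as the drifting mentioned above.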
I wonder what the power consumption would be for something like this. The camera would have to be running, but I suppose if the majority of the tracking code ran on a DSP, it could be done quite efficiently. My only concern would be the facial recognition: I expect darker faces would be more challenging due to the reduced contrast.