In it, Michael explains how you can use Premiere Pro, Soundbooth, Adobe Media Encoder, and Flash to automatically generate a text transcript based on speech in a video, and then use that transcript as a captioning file within Flash.
One thing I'd like to point out (without actually having gone through the process myself) is that it may be preferable to keep the produced XML timed caption file separate from the video rather than embedding the cue points into the video, if that's what's happening. This would give you more flexibility if you had to, for example, provide multiple language tracks for one piece of video content, and it even opens the door to switching the caption language at runtime.
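To make the idea concrete, here is a rough sketch in Python (not ActionScript, and not the format the Adobe tools actually export). It assumes a made-up caption markup with `begin` and `end` attributes in seconds, and the names `load_track` and `caption_at` are purely illustrative. It only shows how captions kept outside the video can be looked up by playhead time, one list per language.

```python
import bisect
import xml.etree.ElementTree as ET

# Hypothetical caption markup; the real exported file may look different.
EN_XML = """<captions lang="en">
  <caption begin="0.0" end="2.5">Hello and welcome.</caption>
  <caption begin="3.0" end="5.0">Let's get started.</caption>
</captions>"""

ES_XML = """<captions lang="es">
  <caption begin="0.0" end="2.5">Hola y bienvenidos.</caption>
  <caption begin="3.0" end="5.0">Empecemos.</caption>
</captions>"""


def load_track(xml_text):
    """Parse one caption file into (language code, cues sorted by start time)."""
    root = ET.fromstring(xml_text)
    cues = sorted(
        (float(c.get("begin")), float(c.get("end")), c.text or "")
        for c in root.findall("caption")
    )
    return root.get("lang"), cues


def caption_at(cues, t):
    """Return the caption text active at time t (in seconds), or None."""
    starts = [begin for begin, _, _ in cues]
    # Index of the last cue that starts at or before t.
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and cues[i][0] <= t < cues[i][1]:
        return cues[i][2]
    return None


# One video, several caption tracks, keyed by language.
tracks = dict(load_track(x) for x in (EN_XML, ES_XML))

print(caption_at(tracks["en"], 1.0))
print(caption_at(tracks["es"], 4.0))
```

Because the lookup is driven by the current playhead time rather than by cue points baked into the video, switching languages at runtime amounts to picking a different entry from `tracks`.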
Check out Michael's article here.


It appears that the tutorial does embed cue points, so, as you mentioned, an external XML file is the way to go. But then you have to deal with the "jumping to nearest keyframe" problem.
Either way, this has amazing potential!