Summary:
Poddar et al. use HMMs together with spoken keywords to recognize complementary keyword-gesture pairs in weather reporting. Their multi-modal system combines speech and gesture recognition to improve recognition accuracy.
HMMs work well for distinguishing 20 isolated gestures from weather narrators, but accuracy drops to between 50% and 80% on continuous gesture sequences. Their co-occurrence analysis studies which keywords occur with which gestures, how frequently they co-occur, and how the gestures and keywords align temporally. A table they present shows that certain gesture types (contour, area, and pointing) are strongly associated with certain keywords ("here", "direction", "location"). Recognition accuracy improves when both video (gesture) and audio (keyword) evidence are used.
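To make the fusion idea concrete, here is a minimal sketch of how keyword co-occurrence could rescore HMM gesture hypotheses. This is not the authors' implementation: the gesture classes and keywords echo their table, but the probabilities, function names, and the simple log-linear combination are illustrative assumptions.

```python
import math

# Hypothetical co-occurrence table in the spirit of the paper's:
# P(keyword | gesture class). The exact values here are made up.
COOCCURRENCE = {
    "pointing": {"here": 0.6, "location": 0.3, "direction": 0.1},
    "contour":  {"direction": 0.5, "here": 0.2, "location": 0.3},
    "area":     {"location": 0.5, "here": 0.3, "direction": 0.2},
}

def rescore(hmm_loglik, keywords, weight=1.0, floor=1e-3):
    """Combine per-class HMM log-likelihoods with keyword evidence.

    hmm_loglik: dict mapping gesture class -> log P(video | class)
    keywords:   spoken keywords temporally aligned with the gesture
    Returns the gesture class with the highest combined score.
    """
    scores = {}
    for cls, ll in hmm_loglik.items():
        kw_term = sum(
            math.log(COOCCURRENCE[cls].get(kw, floor)) for kw in keywords
        )
        scores[cls] = ll + weight * kw_term
    return max(scores, key=scores.get)

# Example: the video evidence is ambiguous between "contour" and
# "area", but the co-occurring keyword "location" tips it to "area".
print(rescore({"pointing": -12.0, "contour": -9.1, "area": -9.3},
              keywords=["location"]))
```

The point of the sketch is just that audio evidence can break ties the gesture HMM alone cannot, which matches the accuracy gain the paper reports.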
Discussion:
This paper does not add much in the way of new solutions, but it was written ten years ago and did show some nice results indicating that combining speech and gesture can improve recognition. Since the authors did not use an automatic speech recognition system, adding one would introduce its own recognition errors and would likely produce results that differ from the reported accuracies.
1 comment:
So adding speech to gestures makes it easier. Who knew? Seems pretty obvious, but like you said, it /was/ ten years ago. I don't like their manual mapping of words to meaning. I think if they tried this automatically, they'd do even worse than 60% accuracy (which is already bad).