Summary:
Mäntyjärvi et al. apply discrete HMMs to accelerometer data for gesture recognition. A previous study by the authors indicated that users prefer either defining their own gestures or using intuitive ones.
The authors add noise to the training gestures to improve recognition of user-defined gestures under certain conditions. This is meant to speed up the training process, since fewer gestures need to be "drawn" by the user. They also compare Gaussian noise against uniform noise as the source of the perturbations, but the difference turns out to be negligible.
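To make the idea concrete, here is a minimal sketch of the kind of noise-based augmentation the paper describes: take the few gestures a user actually drew and generate perturbed copies for training. The function name, noise scale, and number of copies are my own choices, not the paper's.

```python
import numpy as np

def augment_gesture(accel, n_copies=5, noise="gaussian", scale=0.05):
    """Make noisy copies of one accelerometer gesture (a T x 3 array of x/y/z samples).

    Sketch of the paper's idea: rather than asking the user to repeat a gesture
    many times, perturb the few examples they provided. The scale here is arbitrary.
    """
    copies = []
    for _ in range(n_copies):
        if noise == "gaussian":
            jitter = np.random.normal(0.0, scale, size=accel.shape)
        else:  # uniform noise, the alternative the authors compare against
            jitter = np.random.uniform(-scale, scale, size=accel.shape)
        copies.append(accel + jitter)
    return copies

# One user-drawn gesture becomes six training examples.
gesture = np.random.randn(40, 3)              # stand-in for a 40-sample recording
training_set = [gesture] + augment_gesture(gesture)
```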
Discussion:
This paper changed course in the middle and moved from customization to noise addition. The gesture set they tested on was very simple and could be handled by Rubine's recognizer. I'd like to see some of the data users created and the differences between the user-defined gestures and the DVD gestures.
Monday, March 31, 2008
Thursday, March 27, 2008
Gesture Recognition with a Wii Controller
Summary:
Schlomer et al. showed that the Wii controller is pretty good at recognizing tennis gestures.
Discussion:
Here's a good evaluation study.
SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction
Summary:
Murayama et al. presented a two-handed computer control device that allowed the manipulation of on-screen objects. The system, called SPIDAR G&G, consisted of two balls, each suspended by six strings inside a horseshoe-shaped frame. The user moved these balls with six degrees of freedom, and the motion was translated to a cursor or object on the computer. Small motors pulled on the strings to resist movement, providing haptic feedback. Each ball also included a pressure button that detected grip.
The authors evaluated the system using a pointer and a target object; users had to manipulate both with the two balls in order to accomplish a goal. Three people tested the system, and using two SPIDAR balls, as opposed to one ball and a keyboard, let the users manipulate the objects faster. Haptic feedback also helped.
Discussion:
Although the system sounds interesting, I have a lot of issues with the evaluation. The authors used only three participants, all familiar with VR interfaces, which is quite low. A greater concern is that the system was only tested against another form of itself: SPIDAR G&G was compared only against SPIDAR G plus a keyboard, when it really should have been compared to a mouse and keyboard, a joystick and mouse, two joysticks, a trackball, or any number of more common peripherals. As it stands, I have no basis to say that the suspended-ball manipulation method is any better than traditional interfaces. The only definite conclusions are that two balls are better than one and that haptic feedback from the balls is beneficial.
Sunday, March 23, 2008
Taiwan sign language (TSL) recognition based on 3D data and neural networks
Summary:
Lee and Tsai implemented a vision-based hand gesture recognition system to classify 20 TSL hand signs. The system used hand features based on visual distances, with 8 reflective markers placed on the hand to assist these measurements. The features were then fed into a back-propagation neural network (BPNN) with the 15 features as inputs and the 20 gesture probabilities as outputs.
The features included the distances between a wrist point and the fingertips, and the distances between each pair of fingers (spread).
Ten students tested the system and produced 2788 gestures, half of which went to training and the other half to testing. The authors tested neural networks with two hidden layers varying in size from 25 x 25 to 250 x 250 nodes. The best results came from the BPNN with 250 x 250 hidden nodes, with a testing accuracy of 94.65%. Two gestures were heavily confused because the only difference between them was the apparent length of the fingers (i.e., the fingers were bent in one gesture).
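As a rough illustration of the network shape (not the authors' actual implementation), here is how the 15-feature, two-hidden-layer, 20-class setup might look with scikit-learn. The data below is random filler; only the layer sizes and the input/output counts come from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))         # placeholder distance features (15 per gesture)
y = rng.integers(0, 20, size=200)      # placeholder labels for the 20 TSL signs

# Two hidden layers of 250 nodes each, the best-performing size reported.
bpnn = MLPClassifier(hidden_layer_sizes=(250, 250),
                     activation="logistic",    # classic back-propagation style sigmoid units
                     max_iter=300)
bpnn.fit(X, y)
print(bpnn.predict(X[:5]))
```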
Discussion:
This was a pretty decent use of neural nets, and I'm glad the authors gave results at different hidden-layer sizes along with the recognition rates for each gesture. In fact, now that I think about it, I'm just glad they gave results at all. These are definitely the best results I've seen and quite promising; one of their main remaining issues is finding a good feature to distinguish bent fingers from straight ones.
The differences between the 150 x 150 and 250 x 250 networks are statistically insignificant, but they might become more significant once more gestures are added. I especially like that there is little discrepancy between the training and testing sets, which hopefully indicates that their approach works for the general user.
Labels: hand gesture, neural networks, sign language, vision
Tuesday, March 18, 2008
Wiizards: 3D Gesture Recognition for Game Play Input
Summary:
Kratz, Smith, and Lee use Wiimotes in a game where two wizards cast spells to damage one another. Each spell consists of a series of gestures and modifiers, and a wizard can block a spell by performing a blocking gesture and then mimicking their opponent's casting gestures.
Wii controller accelerometer data is used to gather 3-dimensional gravitational readings along the x, y, and z axes. An observation vector is a collection of these data values, and Gaussians are fit to the observations to determine distribution probabilities. Classification maximizes the probability that a given gesture was performed, given the observation data.
Without training, their 15-state HMM achieves around 50% accuracy and varies widely. Training can boost the accuracy to around 90%, but it cannot be performed in a real-time environment.
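The classification step, as I read it, is the standard one-HMM-per-gesture, pick-the-highest-likelihood approach. Here is a sketch using hmmlearn (my choice of library, not necessarily theirs):

```python
import numpy as np
from hmmlearn.hmm import GaussianHMM   # library choice is mine; the paper does not name one

def train_gesture_models(examples_by_gesture, n_states=15):
    """Fit one Gaussian-emission HMM per gesture from lists of (T x 3) accelerometer arrays."""
    models = {}
    for name, examples in examples_by_gesture.items():
        X = np.vstack(examples)                     # stack sequences for hmmlearn
        lengths = [len(e) for e in examples]        # remember where each sequence ends
        m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)
        models[name] = m
    return models

def classify(models, observation):
    """Return the gesture whose HMM assigns the observation the highest log-likelihood."""
    return max(models, key=lambda name: models[name].score(observation))
```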
Discussion:
I'm curious how long it actually takes the system to train. The axis of the training figure did not specify units; if training only takes 30 seconds, that is not much longer than an initial load screen (and it would only have to happen once). If it takes 30 minutes, then we have a problem.
The number of gestures in the system also compounds this time factor: even 10 seconds of training per gesture works out to over 16 minutes for 100 gestures, which is unacceptable.
TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning
Summary:
Lieberman and Breazeal created a system to train human motor movements. Their hardware uses small vibrotactile actuators built into a sensor-equipped vest and sleeve. For each joint/sensor, if the angle of the student's joint differs from the (known) angle of the teacher's joint, feedback is given at that joint. Larger errors increase the vibrating 'force field' effect.
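The feedback rule is simple enough to sketch: for each joint, compare the student's angle to the teacher's and drive the actuator harder as the error grows. The deadband and gain values below are invented for illustration; the paper only describes the proportional idea.

```python
def actuator_intensities(student_angles, teacher_angles, deadband=2.0, gain=0.05):
    """Map per-joint angle errors (degrees) to 0..1 vibration intensities."""
    intensities = []
    for student, teacher in zip(student_angles, teacher_angles):
        error = abs(student - teacher)
        if error <= deadband:
            intensities.append(0.0)                                   # close enough: no buzz
        else:
            intensities.append(min(1.0, gain * (error - deadband)))   # stronger buzz for larger error
    return intensities

# Joint 2 is 30 degrees off, so its actuator buzzes hardest.
print(actuator_intensities([10, 45, 90], [12, 15, 88]))
```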
To test the system, the authors split 40 subjects into a visual-only group and a visual-plus-haptic group. The subjects were asked to mimic the movements of a teacher. Overall, the error for the feedback group was much lower than the error for the visual-only group, even after repeated trials.
Discussion:
Well-written paper, clear subject matter, and a damn good results and evaluation section. I don't have much more to say than that.
Sunday, March 16, 2008
AStEINDR
S:
ST-Isomap PCA LLE LSCTN KNTN CTN ATN MDS SCTN DOF WTF
D:
I might come back to this later when I'm not so mad.
Articulated Hand Tracking by PCA-ICA Approach
Summary:
Kato et al. used Independent Component Analysis (ICA) to find basis vectors for hand motion features. The authors first use Principal Component Analysis (PCA) to reduce the dimensionality of their data, and then use ICA to find a set of vectors that are statistically independent from each other (i.e., basis vectors).
Data on 20 joint angles was collected with a glove. The authors then concatenated the data from the 20 sensors into one large vector: each sensor was sampled at 100 time points, and the data from all 20 sensors was merged into a 2000-dimensional vector.
ICA is used to find basis vectors for the hand such that a linear combination of those vectors produces a desired hand movement. The basis vectors U are found through a weight matrix W and a sample of motion data X (where X is a matrix of hand motions). A neural learning algorithm (in this case, gradient descent) is used to calculate the weights. The resulting 5 basis vectors correspond to the movement of each finger individually.
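A sketch of that pipeline using scikit-learn, with FastICA standing in for the paper's gradient-descent learning rule; the data is random filler, but the shapes follow the paper (each motion flattened into a 2000-dimensional vector, reduced and then unmixed into 5 components).

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA   # FastICA stands in for the paper's neural ICA rule

# Random filler data with the shape described in the paper: each row is one
# hand motion, 20 sensors x 100 time steps flattened into 2000 dimensions.
motions = np.random.randn(60, 2000)

pca = PCA(n_components=5)
reduced = pca.fit_transform(motions)              # dimensionality reduction first

ica = FastICA(n_components=5, random_state=0)
weights = ica.fit_transform(reduced)              # per-motion weights on the independent components
basis = ica.mixing_.T @ pca.components_           # the 5 basis motions, mapped back to 2000-D
```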
The authors then deviated from their abstract and discussed actually tracking a hand using particle filtering. A hand's current position can be estimated from its prior positions, so each basis vector can be used to predict where the corresponding finger will be, given its prior positions. The authors also segment the hand out of an image by thresholding the image and overlaying a hand model to find the hand's location.
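For the tracking part, here is a generic predict/update/resample cycle of the kind the authors describe, purely as an illustration; the image-likelihood function (comparing a rendered hand model to the segmented image) is assumed to exist and is not shown.

```python
import numpy as np

def particle_filter_step(particles, weights, image_likelihood, motion_noise=0.02):
    """One generic particle-filter cycle over basis-vector weight hypotheses.

    `image_likelihood(p)` is assumed to score how well hypothesis p matches the
    segmented hand image; that part is not implemented here.
    """
    # Predict: diffuse each particle (a real tracker would also use estimated velocity).
    particles = particles + np.random.normal(0.0, motion_noise, size=particles.shape)
    # Update: re-weight particles by how well they explain the current image.
    weights = weights * np.array([image_likelihood(p) for p in particles])
    weights = weights / weights.sum()
    # Resample: draw particles in proportion to their weights.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], np.full(len(particles), 1.0 / len(particles))
```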
There are no results.
Discussion:
There are no results.
The basis vectors seem obvious, but I'm glad that ICA found them.
There are no results.
Thursday, March 6, 2008
A Hidden Markov Model Based Sensor Fusion Approach for Recognizing Continuous Human Grasping Sequences
Summary:
Bernardin et al. created a system to recognize human grasping gestures using CyberGlove and pressure sensor data. Basic grasps are divided into 14 types according to Kamakura's grasping primitives, which include "5 power grasps, 4 intermediate grasps, 4 precision grasps, and one thumbless grasp".
To recognize these grasps, an 18-sensor CyberGlove is used along with fingertip and palm sensors: 14 pressure sensors are sewn into a glove worn underneath the CyberGlove. The sensor data is passed into HMMs for recognition, with a 9-state HMM built for each grasp using HTK. After each grasp, the grasped object must be released.
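The fusion itself is just concatenation of the two sensor streams at each time step, something like the sketch below; feature scaling and the actual HTK model definitions are omitted.

```python
import numpy as np

def fuse_frame(glove_frame, pressure_frame):
    """Join one CyberGlove reading (18 joint angles) with the 14 tactile readings
    into a single 32-dimensional observation vector for the per-grasp HMMs."""
    return np.concatenate([np.asarray(glove_frame, dtype=float),
                           np.asarray(pressure_frame, dtype=float)])

# One fused observation per time step; a grasp then becomes a (T x 32) sequence.
obs = fuse_frame(np.zeros(18), np.zeros(14))
assert obs.shape == (32,)
```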
On a total of 112 training gestures from 4 users, the user-dependent models achieved between 77% and 92% accuracy, whereas a user-independent model was in the low 90s for all 4 users. This is most likely due to the increase in training data when all the user data is combined.
Discussion:
I thought the use of HMMs in this paper was actually quite good. The problem I have with HMMs is that they fall apart badly when data is not properly segmented. In the case of grasps, though, it is unlikely that somebody will go from one grasp to another without releasing the object they are holding, so in most general cases the computer can assume that a lack of tactile input from the palm indicates the grasp has ended.
The 3D Tractus: A Three-Dimensional Drawing Board
Summary:
Lapides et al. designed and built a Tablet PC stand that can move vertically, creating a 3D drawing platform that updates the on-screen view as the table is moved. The authors state that using the 3D Tractus allows for a "direct mapping between physical and virtual spaces."
The frame of the 3D Tractus consists of aluminum bars and a table top, along with a counterweight that balances the weight of the tablet and lets the table top slide up and down more easily. The counterweight has to be tuned for each tablet's weight, and a height sensor is built into the frame.
The drawing software takes the height of the table into account when rendering the view for the user. The system uses line width as a depth cue, drawing farther lines thin and closer lines thick, and an orthographic (cube) projection is used to convey 3D depth as well. Nothing in the sketch is displayed above the current tablet surface.
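The depth cue amounts to a linear mapping from a stroke's recorded height to its drawn width, something like the sketch below; the actual width range is my own guess.

```python
def stroke_width(z, z_min, z_max, w_near=4.0, w_far=1.0):
    """Map a stroke's depth to a drawing width: closer strokes thick, farther strokes thin.

    The width range is invented; the paper only states that line width is the depth cue.
    """
    if z_max <= z_min:
        return w_near
    t = (z - z_min) / (z_max - z_min)      # 0 at the deepest stroke, 1 at the surface
    return w_far + t * (w_near - w_far)

# A stroke near the current tablet surface draws almost at full width.
print(stroke_width(z=0.35, z_min=0.0, z_max=0.40))
```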
Discussion:
Although the idea of having a tactile way to sketch in 3D sounds appealing, the system could be implemented much better without a movable desk. Having a z-axis button, wheel, or other control in the software would alleviate the issues with custom counterweights, the height constraint, awkward hand/arm positioning, and lack of mobility.
The system is also rather constrained for large sketches: the user can move in any direction within the tablet's plane, but vertical travel is limited to something like 40 centimeters.
Labels: 3D inference, sketching, user interfaces, user study
Wednesday, March 5, 2008
Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series
Summary:
Kadous created TClass, which uses metafeatures of gesture data to form a syntactic representation of a gesture.
The system was tested on two data sets: Auslan and Nintendo data. Auslan is an Australian sign language whose signs differ from ASL, although the overall ideas of hand shape, location, and movement are present in the language. The Nintendo data comes from a Powerglove test set comprising 95 basic hand movements.
Tests were conducted using a Powerglove (P5) and a Flock of Birds tracker. In Kadous's tests, the initial error rate for TClass was extremely high compared to the best error rates (for both data sets). Using AdaBoost, the system's accuracy became more tolerable, but it was never as good as a carefully hand-picked set of features.
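For reference, boosting over per-gesture feature vectors is about this simple with off-the-shelf tools; the metafeature extraction that TClass actually performs is not reproduced here, so the features below are random placeholders of plausible shape.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 30))        # placeholder fixed-length feature vectors
y_train = rng.integers(0, 95, size=500)     # placeholder labels for 95 hand movements

# The default base learner is a decision stump; boosting combines 100 of them.
boosted = AdaBoostClassifier(n_estimators=100)
boosted.fit(X_train, y_train)
print(boosted.predict(X_train[:3]))
```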
Discussion:
I have mixed feelings about this system. I like the addition of readable metafeatures in TClass, but I don't quite know what to make of the system's poor accuracy in some cases. The presentation of the accuracy results was confusing: the author gives horrible results first, then semi-poor results when using AdaBoost, but the horrible results also include a TClass-with-AdaBoost (AB) field, so what the hell is going on? Also, the explanation that the Nintendo dataset is "hard" does not fly; if a "naive" algorithm beats you, you cannot blame poor results on the test set.
Nevertheless, I think research in this area of trying to find results that are both accurate and understandable is worthwhile.
Labels: adaboost, decision tree, gesture, hand gesture, sign language
Tuesday, March 4, 2008
Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures
Summary:
Ogris et al. use ultrasonics to track hand motion in a 3D environment. The ultrasonics, when combined with other data from motion sensors, can greatly improve recognition rates.
Ultrasonic systems emit a sound beacon, which is then reflected back to sensors. Because ultrasonics use sound waves, the beacon is susceptible to reflection, occlusion, and timing issues: reflections occur when the wave bounces off a surface at an odd angle, occlusions are blocked signals, and the timing issues stem from the time it takes the sound to travel back and forth. These problems limit ultrasonics to controlled, indoor scenarios. Placing the sensors on hands or other fast-moving appendages is also problematic, since all of the above issues are aggravated by fast motion.
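The ranging principle is plain time-of-flight, roughly as below; the temperature dependence of the speed of sound is one more reason these systems behave best in controlled indoor settings.

```python
def tof_distance(round_trip_seconds, temperature_c=20.0):
    """Convert an ultrasonic round-trip time into a distance in metres.

    Uses the standard approximation c ~ 331.3 + 0.606 * T m/s for the speed of sound in air.
    """
    speed_of_sound = 331.3 + 0.606 * temperature_c
    return speed_of_sound * round_trip_seconds / 2.0   # halve: the pulse travels out and back

print(tof_distance(0.003))   # a 3 ms echo corresponds to roughly 0.5 m at room temperature
```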
To test the ultrasonics, the authors used a bicycle repair setup where the performer had 3 ultrasonic sensors and 9 gyroscopes on their arms, legs, and body. The performer then made various bicycle repair gestures, such as screwing/unscrewing, pumping, and wheel spinning.
Using a k-nearest-neighbor (kNN) classifier, the system's accuracy jumps when ultrasonic data is added to the motion-sensor data.
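The comparison amounts to training the same kNN classifier on motion-sensor features alone and then on motion plus ultrasonic position features. A sketch with random filler data, where only the classifier choice and the 9-gyroscope/3-ultrasonic channel counts follow the paper:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
gyro = rng.normal(size=(400, 9))          # 9 gyroscope channels (filler values)
ultra = rng.normal(size=(400, 3))         # 3 ultrasonic hand-position channels (filler values)
labels = rng.integers(0, 6, size=400)     # placeholder gesture classes (screwing, pumping, ...)

knn_motion = KNeighborsClassifier(n_neighbors=5).fit(gyro, labels)                     # motion only
knn_fused = KNeighborsClassifier(n_neighbors=5).fit(np.hstack([gyro, ultra]), labels)  # motion + ultrasonic
```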
Discussion:
The use of ultrasonics probably does help the system, but I am still not convinced that the ultrasonics themselves are useful. More sensors can almost always improve the accuracy of a system; however, since the authors "overlapped" the gyroscopes with ultrasonic sensors at some points, the accuracy jump must come from the sensor type and not just the quantity.
My main issue is that ultrasonics seem to have an incredibly low sampling rate, or at least the sensors the authors used were quite poor. Furthermore, noise problems (from bouncing signals, background sound, or fast-moving sensors) seem to heavily detract from the usefulness of ultrasonics.