Monday, April 28, 2008

Invariant features for 3-D gesture recognition

Summary:

Campbell et al. use HMMs to compare a list of candidate features for recognizing a set of T'ai Chi gestures performed by users in a swivel chair; a hand gesture's change in polar coordinates provided the highest recognition rate for the 18 gestures tested.


Discussion:

Performing T'ai Chi in a chair kind of defeats the purpose of T'ai Chi. That's like trying to study race car drivers by observing people who take the bus.

Wednesday, April 23, 2008

FreeDrawer - A Free-Form Sketching System on the Responsive Workbench

Summary:

Wesche et al. created a 3D sketching tool for drawing the skeleton of a model. A user can draw curves in a virtual space: a new curve can be drawn anywhere, but additional curves must be merged with the present model. Curves can be altered on a local or global scale, surfaces can be filled in at closed curve loops, and surfaces can also be smoothed.


Discussion:

This paper had some nice pictures, but very little material was actually presented. How does the computer know where the pen point is? How does the user interact with the pen? Range of motion? Gestures?

Interacting with human physiology

Summary:

The authors Pavlidis et al. propose a system to monitor humans for stress levels and altered psychological states using high-end infrared cameras. This system could then be used for a variety of purposes such as stress management of UIs, illness detection, or lie detection.

The system follows the user's face through tandem tracking of a few small, key sections of the face: the nose, forehead, and temporal regions. The tracker models each region by its center of mass and orientation. Blood flow in the face is tracked through a perfusion model and a directional model; the model involves a set of differential equations that measure the "volumetric metabolic heat" flow in the face. Other measurements include pulse, heat transfer in specific areas, and breathing rate.


Discussion:

The ideas behind this system are great, although talking with Pavlidis showed us that there are issues with the current system's usability. Sweat and minor body temperature fluctuations can alter the system's reliability (since the system is trying to measure minor fluctuations). Unfortunately, the cost for one of these high-end cameras is $60k, so we won't be seeing this any time soon.

3D Object Modeling Using Spatial and Pictographic Gestures

Summary:

Nishino et al. designed a 3D object modeling system that uses stereoscopic glasses, CyberGloves, and polhemus trackers.

The system allows the creation of superellipsoids that can have smooth or squarish parameters. These primitive shapes can be bent, stretched, twisted, and merged with other shapes. Hand postures control these actions, such as grasping and pointing. Virtual hands are displayed on a 200-inch arched screen, along with the object, in stereoscopic mode. The virtual hands allow the user to easily see where they can touch and modify the 3D model.

The authors tested the system by having users model two types of objects: a symmetric one (a bottle) and an asymmetric one (a teapot).
Creating the objects took up to 120 minutes. The stored objects were much smaller than those produced by a competing program, Open Inventor.


Discussion:

For a paper in 1998, this was a pretty advanced system and seemed to offer some benefits over other systems. I would have liked to have seen feedback from the users, though, since I'm not sure how hard the system is to use.

Toward Natural Gesture/Speech HCI: A Case Study of Weather Narration

Summary:

Poddar et al. use HMMs and speech to recognize complementary keyword-gesture pairs in weather reporting. The multi-modal system would combine speech and gesture recognition to improve recognition accuracy.

HMMs work well for separating 20 isolated gestures from weather narrators, but continuous gestures drop the accuracy to between 50-80% for sequences. Co-occurrence analysis seeks to study the meaning behind keywords occurring with gestures, the frequency behind these occurrences, and the temporal alignment of the gestures and keywords. A table they presented shows that certain types of gestures (contour, area, and pointing) are more heavily associated with certain keywords ("here", "direction", "location"). The accuracy of recognizing the gestures can improve with both video (gesture) and audio (keywords).


Discussion:

This paper does not necessarily contribute much toward a solution, but it was written ten years ago and did show some nice evidence that combining speech and gesture can improve recognition. Since the authors did not use an actual speech recognition system, the errors introduced by one would likely produce accuracies that differ from those reported.

Thursday, April 17, 2008

Discourse Topic and Gestural Form

Summary:

Eisenstein et al. applied an unsupervised learning technique and a Bayesian network model to study the correlation between gestures and presentation topics.

Their system looks at "interest points" within a video image, where each interest point is assumed to come from a mixture model. Interest points from a similar model are clustered together to create a codebook. A hidden variable determines whether an observed gesture codeword comes from a topic-specific or a speaker-specific distribution. The authors use a Bayesian model to learn which distribution each gesture belongs to, based on Gaussians of feature vectors.

The system was tested with fifteen users giving 33 presentations picked from five topics. The experiments show that with correct labels, the topic-specific gestures account for 12% of the gestures, whereas corrupting these labels drops the average to 3%.


Discussion:

This paper is a good start to a longer study on how to incorporate topic-specific gestures into recognition systems. Finding these gestures can help computers understand what topics might be presented, as well as which speaker is presenting a topic or whether a speaker is veering off-topic. The system could then be used for speech training, presentation classification, or assistance (Clippy).

Monday, April 14, 2008

Feature selection for grasp recognition from optical markers

Summary:

Chang et al. reduced the number of markers needed on a vision-based hand grasp system from 30 to 5 while retaining around a 90% recognition rate.

Six different grasps are used for classification: cylindrical, spherical, lumbrical, two-finger pinch, tripod, and lateral tripod. The posterior probability for a class y_k is modeled with a softmax function: exp(w_k · x) for that class's weights and the observation x, divided by the sum of exp(w_j · x) over all classes.

The weight values are determined by maximum conditional likelihood estimation from the training set of observations and classes (X, Y); the weights are found by gradient-based optimization of the log likelihood. Input features are selected using a "sequential wrapper algorithm" that examines one feature at a time with respect to a target class.
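
As a concrete picture of that classifier (not the authors' code; a minimal numpy sketch with made-up marker features and a made-up learning rate):

    import numpy as np

    def softmax_posteriors(X, W):
        """p(y_k | x) = exp(w_k . x) / sum_j exp(w_j . x) for each row of X."""
        scores = X @ W.T                               # (n_samples, n_classes)
        scores -= scores.max(axis=1, keepdims=True)    # numerical stability
        e = np.exp(scores)
        return e / e.sum(axis=1, keepdims=True)

    def ascent_step(X, Y_onehot, W, lr=0.1):
        """One gradient step that increases the conditional log likelihood."""
        P = softmax_posteriors(X, W)
        grad = (Y_onehot - P).T @ X / len(X)
        return W + lr * grad

    # toy data: 20 grasp examples, 10 marker features, 6 grasp classes
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 10))
    y = rng.integers(0, 6, size=20)
    Y = np.eye(6)[y]
    W = np.zeros((6, 10))
    for _ in range(200):
        W = ascent_step(X, Y, W)
    print(softmax_posteriors(X, W).argmax(axis=1))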

The grasp data consists of 38 objects being grasped while wearing the full set of 30 markers. An "optimal", small set of markers was then chosen by forward and backward selection.

The results indicate that the small set of 5 markers has a 92-97% "accuracy retention" rate.


Discussion:

Reducing the number of markers using forward and backward selection is nice, but simply having a few more markers increases the accuracy to the actual plateau point. From 10 markers on there is almost no change in accuracy, but between 5 and 10 markers the accuracy can jump 5%, or 1 in 20, which is a huge difference when taking user frustration into account.

Glove-TalkII--A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls

Summary:

Fels and Hinton created Glove-TalkII, a system designed to synthesize voice using complicated glove and feet controls.

The artificial vocal tract (AVT) is controlled using a CyberGlove, ContactGlove, polhemus sensor, and foot pedal. The ContactGlove controls 9 stop consonants, such as CH, T, and NG. The foot pedal controls the volume of the speech. Hand position corresponds to a vowel sound. Hand postures map to nonstop consonant phonemes.

The neural networks used include a vowel/consonant network to determine if the sensors are reading a vowel or consonant, and then separate vowel and consonant networks to distinguish between the phonemes.

A single user had to undergo 100 hours of training to be able to use the system.


Discussion:

Impractical. I'm shocked that they had someone train on the system for 100 hours, and the fact that it takes a person that long to learn should indicate that this is a poor way to synthesize voice. The person's final voice is even described as "intelligible and somewhat natural-sounding", which is not a good compliment.

Requiring a person to walk around with a one-handed keyboard and type their words is a better solution. The keyboard wouldn't even have a foot pedal.

Wednesday, April 9, 2008

RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas

Summary:

Kim et al. use dual-direction antennas to find the direction of arrival for RF signals transmitted from an RFID tag. The two spiral antennas are perpendicular to each other and their signal strengths are different depending on the angle to the RFID tag.

Obstacles between the antennas and the tag increase the error in determining the direction, though the object can still be tracked. In the experimental results the approach worked fairly well.


Discussion:

It works pretty well for its domain. Probably less accurate for incredibly small movements (e.g., finger bends). Seems like every now and then it goes crazy off-track (Figure 8).

Friday, April 4, 2008

Gesture Recognition Using an Acceleration Sensor and Its Application to Musical Performance Control

Summary:

Sawada and Hashimoto use accelerometer data to extract features of gestures and create a music tempo system.

The feature extraction is basic: projections onto certain planes, such as xy or yz, and the bounding box of the acceleration values. Changes in acceleration are measured using a fuzzy partition of radial angles.

The authors recognize or classify gestures using squared error. The actual gesture recognition is trivial.

The music tempo program is where the paper is more interesting, as the system has to predict in real time where a beat has been hit. Systems already existed where a marker is placed on a baton, but the visual processing of those systems usually had a delay of 0.1 s (with 1997 computational power). In the authors' system, gestures for up, down, and diagonal swings are used to indicate tempo, and other gestures can map to other elements of conducting.

A score is stored in the computer and the user conducts to the score. Often the computer and human are slightly off, and the two try to balance to each other. A simple function for balancing the tempo is given.
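
The paper's exact balancing equation isn't reproduced here, but the idea reads like a weighted pull of the playback tempo toward the conductor's detected tempo on each beat; a hypothetical sketch:

    def balance_tempo(system_tempo, human_tempo, alpha=0.3):
        """Pull the playback tempo part of the way toward the conductor's
        detected tempo (alpha is a made-up smoothing factor, not the paper's)."""
        return system_tempo + alpha * (human_tempo - system_tempo)

    tempo = 120.0                                 # beats per minute stored with the score
    for detected in [118, 116, 117, 121, 125]:    # tempos implied by successive swings
        tempo = balance_tempo(tempo, detected)
        print(round(tempo, 1))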


Discussion:

The system they use isn't a true conducting system since it relies on defined (and trained) gestures, but the ideas behind the tempo system are good and the simple execution and equations are appreciated.

Wednesday, April 2, 2008

Activity Recognition using Visual Tracking and RFID

Summary:

Krahnstoever et al. use RFID tags in conjunction with computer vision tracking to interpret what is happening within a scene.

A person model tracks a human's movement through their head and hands. The head is a 3D Cartesian coordinate location, and each hand is described in spherical coordinates (r, phi, theta) with respect to the head. The motion models for the head, p(X_h^t | X_h^{t-1}), and for the hands, p(X_q^t | X_h^{t-1}), had to be learned, as did the priors p(X_q | X_h). Both hands and head are segmented using skin color.

Each pixel within a given image frame can belong to either the background or the foreground (body part). The likelihood for an image given the observations is taken to be the Improved Iterative Scaling (IIS) of the image section and bounding box of a body part, summed over the parts and sections. I have no idea how IIS works.

RFID tags provide movement and orientation information in 3D spaces. The amount of charge the RFID tag receives depends on its angle to the wave source: a perpendicular angle receives no energy and a parallel angle receives the most. The tag then transmits its ID, orientation, and field strength in its response signal.

The authors use the RFID information along with the hand and head positions to interpret what is happening in a scene. Agents are somehow used to do this.


Discussion:

The RFID information looks like it helps recognize what is happening within a scene, but I would have liked to have seen a comparison between a pure vision system and a system with the RFID. This could be a bit difficult, but it might help the strength of the paper.

I also would have liked an actual description of the activity agent system.

Monday, March 31, 2008

Enabling fast and effortless customisation in accelerometer based gesture interaction

Summary:

Mäntyjärvi et al. apply discrete HMMs to accelerometer data for gesture recognition. The authors' previous study indicated that users prefer defining their own gestures, or at least prefer intuitive ones.

The authors add noise to the gesture examples to increase recognition of user-defined gestures under certain conditions. This supposedly speeds up the training process since fewer gestures need to be "drawn". Adding Gaussian noise rather than uniform noise might improve recognition. But not really.


Discussion:

This paper changed course in the middle, moving from customization to noise addition. The gesture set they tested on was super easy and could be handled by Rubine's recognizer. I'd like to see some of the data users created and the differences between the user-defined gestures and the DVD-control gestures.

Thursday, March 27, 2008

Gesture Recognition with a Wii Controller

Summary:

Schlomer et al. showed that the Wii controller is pretty good at recognizing tennis gestures.


Discussion:

Here's a good evaluation study.

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction

Summary:

Murayama et al. presented a two-handed computer control device that allowed the manipulation of on-screen objects. The system, called SPIDAR G&G, consisted of two balls, each suspended by six strings inside a horseshoe-shaped frame. The user moved these balls with six degrees of freedom, and the motion was translated onto a cursor or object on the computer. Small motors tensioned the strings to provide resistance and force feedback, and each ball included a pressure button that detected grip.

The authors evaluated the system using a pointer and a target object. The users had to manipulate the pointer and object with both balls in order to accomplish a goal. Three people tested their system and found that the use of two SPIDAR balls, as opposed to one and a keyboard, allowed the users to manipulate the objects faster. Also, haptic feedback helped.


Discussion:

Although the system sounds interesting, I have a lot of issues with the evaluation. The authors used only three people, all familiar with VR interfaces, which is quite low. A greater concern is that the system was only tested against another form of itself: SPIDAR G&G was only compared against SPIDAR G + keyboard, when it really should have been compared to a mouse and keyboard interface, or a joystick and mouse, or two joysticks, or a roller ball, or any number of more common peripherals. As it stands, I have no basis to say that the suspended-ball manipulation method is any better than traditional interfaces. The only definite conclusions are that two balls are better than one, and that haptic feedback from the balls is beneficial.

Sunday, March 23, 2008

Taiwan sign language (TSL) recognition based on 3D data and neural networks

Summary:

Lee and Tsai implemented a vision-based hand gesture recognition system to classify 20 TSL hand signs. The system used hand features based on visual distances, with 8 reflective markers placed on the hand to assist these readings. The features are then fed into a back-propagation neural network (BPNN) with the 15 features as inputs and the 20 gesture probabilities as outputs.

The features used include the distances between a wrist point and the finger tips, and the distances between each finger pair (spread).

10 students tested the system and produced 2788 gestures, of which half went to training and the other half to testing. The authors tested on neural networks with 2 hidden layers varying in size from 25 x 25 to 250 x 250. The best results were with the BPNN with 250 x 250 hidden nodes, with a testing accuracy of 94.65%. Two gestures were heavily confused because the only difference was the length of the finger shown (i.e., the fingers were bent in one gesture).
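
For scale, a back-propagation network of that shape is easy to stand up today. A sketch with scikit-learn: the 15 input features, 20 output signs, 2788 samples, 50/50 split, and 250 x 250 hidden layers come from the paper; the random stand-in data and everything else is mine:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2788, 15))          # 15 distance/spread features per sample
    y = rng.integers(0, 20, size=2788)       # 20 TSL signs

    # two hidden layers of 250 nodes each, mirroring the best BPNN in the paper
    net = MLPClassifier(hidden_layer_sizes=(250, 250), max_iter=500)
    net.fit(X[:1394], y[:1394])              # half for training
    print("test accuracy:", net.score(X[1394:], y[1394:]))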


Discussion:

This was a pretty decent use of neural nets, and I'm glad that they gave the results at different hidden layer sizes and the recognition rates for each gesture. In fact, now that I think about it, I'm just glad they gave results. These are definitely the best results I've seen and quite promising; one of their main remaining issues is finding a good feature to distinguish between bent and non-bent fingers.

The differences between 150x150 and 250x250 are statistically insignificant, but they might be more significant when more gestures are added. I especially like that there is little discrepancy between training and testing sets, which hopefully indicates that their approach works for the general user.

Tuesday, March 18, 2008

Wiizards: 3D Gesture Recognition for Game Play Input

Summary:

Kratz, Smith, and Lee use Wiimotes in a game where two wizards cast spells to damage one another. Each spell consists of a series of gestures and modifiers, and a wizard can block a spell by performing a blocking gesture and then mimicking their opponent's casting gestures.

Wii controller accelerometer data provides a 3-dimensional acceleration (gravitational) reading along the x, y, and z axes. An observation vector is a collection of these values, and Gaussians are fit to the observations to model their distributions. Classification picks the gesture that maximizes the probability that it was performed, given the observation data.
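
The classification rule is just "pick the gesture model that gives the observation sequence the highest likelihood." A rough sketch using the hmmlearn library (my choice of library, not necessarily theirs; the gesture names and toy data are invented):

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_models(gesture_examples, n_states=15):
        """One continuous-observation HMM per gesture, trained on (x, y, z)
        acceleration sequences."""
        models = {}
        for name, sequences in gesture_examples.items():
            X = np.vstack(sequences)
            lengths = [len(s) for s in sequences]
            models[name] = GaussianHMM(n_components=n_states).fit(X, lengths)
        return models

    def classify(models, observation):
        """Return the gesture whose model maximizes log P(observation | model)."""
        return max(models, key=lambda name: models[name].score(observation))

    rng = np.random.default_rng(0)
    toy = {"fire": [rng.normal(size=(40, 3)) for _ in range(5)],
           "shield": [rng.normal(loc=1.0, size=(40, 3)) for _ in range(5)]}
    models = train_models(toy, n_states=3)
    print(classify(models, rng.normal(size=(40, 3))))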

Without training, their system's HMM model with 15 states has around 50% accuracy and varies widely. Training can boost the accuracy to around 90%, but training cannot be performed in a real-time environment.


Discussion:

I'm curious how long it actually takes the system to train. The axis of the training figure did not specify units, and if it only takes 30 seconds to train, that is not much longer than an initial load screen (and it would only have to happen once). If it takes 30 minutes to train, then we have a problem.

Also, the number of gestures in the system would hurt this time factor. Even 10 seconds over 100 gestures is unacceptable.

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning

Summary:

Lieberman and Breazeal created a system to train human motor movements. Their hardware uses small vibrotactile actuators built into a sensory vest and sleeve. For each joint/sensor, if the angle of the student's joint differs from the (known) teacher's joint angle, feedback is given at that joint. Higher errors increase the vibrating "force field" effect.

To test their system, the authors had 40 subjects split into a pure visual group and a visual/haptic group. These subjects were given the task to mimic movements of a teacher. Overall, the error for the feedback group was much less than the error for the visual group, even after repeated trials.

Discussion:

Well-written paper, clear subject matter, and a damn good results and evaluation section. I don't have much more to say than that.

Sunday, March 16, 2008

AStEINDR

S:

ST-Isomap PCA LLE LSCTN KNTN CTN ATN MDS SCTN DOF WTF

D:

I might come back to this later when I'm not so mad.

Articulated Hand Tracking by PCA-ICA Approach

Summary:

Kato et al. used Independent Component Analysis (ICA) to find basis vectors for hand motion features. The authors first use Principal Component Analysis (PCA) to reduce the dimensionality of their data, and then they use ICA to find a set of vectors that are statistically independent from each other (i.e., basis vectors).

Data on 20 joint angles was collected with a glove. Each sensor was sampled across 100 time points, and the authors concatenated the data from all 20 sensors into one 2000-dimensional vector per motion.

ICA is used to find the basis vectors for a hand such that a linear combination of these vectors will produce a desired hand movement. The basis vectors U are found through a weight matrix W and a sample of motion data X (where X is a matrix of hand motions). A neural learning algorithm (in this case, gradient descent) is used to calculate the weights. The resulting 5 basis vectors are the movement of each finger individually.
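
The PCA-then-ICA pipeline is easy to reproduce in spirit. A sketch with scikit-learn on fake glove data, where the 20-angle-by-100-timestep layout and the five components come from the paper, and FastICA stands in for their neural learning rule:

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    rng = np.random.default_rng(0)
    # each row is one motion sample: 20 joint angles x 100 time steps, flattened
    motions = rng.normal(size=(60, 20 * 100))

    pca = PCA(n_components=5)
    reduced = pca.fit_transform(motions)            # dimensionality reduction first
    ica = FastICA(n_components=5, random_state=0)
    weights = ica.fit_transform(reduced)            # statistically independent components

    # approximate basis motions back in the original 2000-dimensional angle space
    basis = ica.mixing_.T @ pca.components_
    print(basis.shape)    # (5, 2000): one basis vector per finger-like motion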

The authors then deviated from their abstract and discussed actually tracking a hand using particle filtering. A hand's current position can be estimated from its prior positions, so each basis vector can estimate where its finger will be given those prior positions. The authors also segment the hand out of an image by thresholding the image and overlaying a hand model to find the hand location.

There are no results.

Discussion:

There are no results.
The basis vectors seem obvious, but I'm glad that ICA found them.
There are no results.

Thursday, March 6, 2008

A Hidden Markov Model Based Sensor Fusion Approach for Recognizing Continuous Human Grasping Sequences

Summary:

Bernardin et al. created a system to recognize human grasping gestures using a CyberGlove and pressure sensor data. Basic grasps are divided into 14 categories according to Kamakura's grasping primitives: "5 power grasps, 4 intermediate grasps, 4 precision grasps, and one thumbless grasp".

To recognize these grasps, an 18-sensor CyberGlove is used along with fingertip and palm pressure sensors: 14 pressure sensors are sewn into a glove worn under the CyberGlove. The sensor data is passed into HMMs for recognition, with a 9-state HMM built for each gesture using HTK. After each grasp, the grasped object must be released.

On a total of 112 training gestures from 4 users, the user-dependent models achieved between 77 and 92% accuracy, whereas the user-independent model was in the low 90s for all 4 users. This is most likely due to the increase in training data when all of the user data is combined.


Discussion:

I thought the use of HMMs in this paper was actually quite good. The problem I have with HMMs is that they are absolutely horrible and explode when data is not properly segmented. In the case of grasps, though, it is unlikely that somebody will go from one grasp to another without releasing the object they are holding, so for most general cases the computer can assume that a lack of tactile input from the palm indicates a grasp has ended.

The 3D Tractus: A Three-Dimensional Drawing Board

Summary:

Lapides et al. designed and built a Tablet PC stand that can move vertically, allowing for a 3D drawing platform that switches the screen's view as the table is moved. The authors state that using the 3D Tractus will allow for a "direct mapping between physical and virtual spaces."

The frame of the 3D Tractus consists of aluminum bars and a table top, along with a counterweight that balances the weight of the tablet and allows the table top to slide up and down more easily. The counterweight has to be tuned for each tablet's weight. A height sensor is built into the frame.

The drawing software for the system takes into account the height of the table when displaying a viewing angle to the user. The system uses line width as a depth cue, with farther lines thin and closer lines thick. An orthographic (cube) projection is used to demonstrate 3D depth, as well. Also, nothing of the sketch is displayed above the current tablet surface.


Discussion:

Although the idea of having a tactile way to sketch in 3D sounds appealing, the system could be implemented much better without a tactile, movable desk. Instead, having a z-axis button/wheel/control in the software will alleviate the issues with custom counterweights, a height constraint, awkward hand/arm positioning, and lack of mobility.

Also, the system is rather constrained for any large sketches, since the user can move in the tablet's plane without limit, but the vertical axis is limited to something like 40 centimeters.

Wednesday, March 5, 2008

Temporal Classification: Extending the Classification Paradigm to Multivariate Time Series

Summary:

Kadous created TClass, which uses metafeatures of gesture data to form a syntactic representation of a gesture.

The system was tested on two data sets: Auslan and Nintendo data. Auslan is an Australian sign language whose signs differ from ASL's, although the overall ideas of hand shape, location, and movement are present in the language. The Nintendo data comes from a PowerGlove test set comprising 95 basic hand movements.

Tests were conducted using a Powerglove (P5) and a Flock of Birds. On Kadous's tests, the initial error rate for TClass was extremely high compared to the best error rates (for both sets of data). Using AdaBoost, the system's accuracy became more tolerable, but it was never as good as a fine, hand-picked set of features.

Discussion:

I have mixed feelings about this system. I like the addition of metafeatures that are readable with TClass, but I also don't quite know what to make of the system's poor accuracy in some cases. The presentation of the accuracy results was confusing: the author gave horrible results first, then semi-poor results afterward when using AdaBoost, but the horrible results also included a TClass-with-AdaBoost (AB) field, so what the hell is going on? Also, the explanation that the Nintendo dataset is "hard" does not fly; if a "naive" algorithm beats you, you cannot blame poor results on the test set.

Nevertheless, I think that research in this area of trying to find both accurate and understandable results is worthwhile.

Tuesday, March 4, 2008

Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures

Summary:

Ogris et al. use ultrasonics to track hand motion in a 3D environment. The ultrasonics, when combined with other data from motion sensors, can greatly improve recognition rates.

Ultrasonics emit a sound beacon, which is then reflected back to sensors. Because ultrasonics use sound waves, the beacon is susceptible to reflection, occlusion, and temporal issues. Reflection is where the wave reflects off a surface at an odd angle, occlusions are blocked signals, and the temporal issues involve the time it takes for the sound to bounce back and forth. These issues limit ultrasonics to controlled, indoor scenarios. Placing the sensors on hands or other moving appendages is also a problem with ultrasonics, since all of the above problems can occur with fast moving parts.

To test the ultrasonics, the authors used a bicycle repair setup where the performer had 3 ultrasonic sensors and 9 gyroscopes on their arms, legs, and body. The performer then made various bicycle repair gestures, such as screwing/unscrewing, pumping, and wheel spinning.

Using a k-nearest-neighbor (kNN) approach to classification, the accuracy of the system jumps when using ultrasonics as opposed to just using motion sensors.


Discussion:

The use of ultrasonics probably does help the system. I am still not convinced that the ultrasonics themselves are useful, though. More sensors can almost always improve the accuracy of a system, but since they "overlapped" the gyroscopes with ultrasonics at certain points, the accuracy jump must come from the sensor type and not the quantity.

My main issue is that ultrasonics seem to have an incredibly low sampling rate, or at least the sensors the authors were using were quite poor. Furthermore, noise problems (from bouncing signals, background sonics, or fast-moving sensors) seem to heavily detract from the usefulness of ultrasonics.

Wednesday, February 27, 2008

American Sign Language Recognition in Game Development for Deaf Children

Summary:

Brashear et al. use GT2k to create an American Sign Language game for deaf children. The system, called CopyCat, teaches language skills to children by having them sign various sentences to interact with a game environment.

A Wizard of Oz study was used to gather data and design their interface. A desk, mouse, and chair were used in the study, along with a pink glove. The students pushed a button and then signed a gesture, and the data was collected using the glove and an IEEE 1394 video camera. The users were 9- to 11-year-olds.

The hand is pulled from the video image by its bright color. The image pixel data is converted to a HSV color space histogram, which is used to binarize the data and find the hand. Accelerometers are also used to track hand movement in x, y, and z positions.

The data from five children was analyzed for user-dependent and -independent models. The user-dependent models were validated with a 90/10 (training/testing) split, with word accuracy in the low 90s and sentence accuracy around 70%. The standard deviation for the sentence accuracy is very high, at approximately 12%.

User-independent models were lower with an average word accuracy of 86.6% and a sentence accuracy of 50.64%.


Discussion:

I like the authors' user study with the Wizard of Oz to collect real-world data from children. The system's performance (in essence, GT2k's performance) was very low on sentences, which indicates that segmentation is the largest issue with the toolkit. I'm also worried about the 90/10 split for the user-dependent models. That is a huge ratio of training to testing data, and it might be skewing the results to show higher-than-normal accuracy.

A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence

Summary:

Sagawa and Takeuchi created a Japanese Sign Language recognition system that uses "rule-based matching" and segments gestures based on hand velocity and direction changes.

Thresholds on changes in the direction vector handle the segmentation. Determining which hand (or both) is being used for a gesture is also handled by the direction and velocity change thresholds.

The system achieved 86.6% accuracy for signed words, and 58% accuracy for signed sentences.


Discussion:

There's not much to discuss with this paper. The "nugget" of research is with the use of direction and velocity changes to segment the gestures. I became more interested in this paper since I learned it was published a year before Sezgin's, but not by much.

Monday, February 25, 2008

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition

Summary:

Researchers from Georgia Tech have created a gesture toolkit called GT2k. The purpose behind GT2k is to allow researchers to focus on system development instead of recognition. The toolkit works in conjunction with the Hidden Markov Model Toolkit (HTK) to provide HMM tools to a developer. GT2k usage can be divided into four categories: preparation, training, validation, and recognition.

Preparation involves the developer setting up an initial gesture model, semantic gesture descriptions, and gesture examples. Each model is a separate HMM, and GT2k allows either automatic model generation for novices, or user-generated for experts. Grammars for the model are created in a rule-based fashion and allow for the definition of complex gestures based on simpler ones. Data collection is done with whatever sensing devices are needed.

Training the GT2k models can be done in two ways: cross-validation and leave-one-out. Cross-validation involves separating the data into 2/3 for training and 1/3 for testing. Leave-one-out involves training on the entire set minus one data element, and repeating this process for each element in the set. The results for cross-validation are computed in a batch, whereas the overall statistics for leave-one-out are calculated by each model's performance.
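
The two training regimes are just different ways of splitting the example set; a quick sketch of the bookkeeping (only the 2/3 - 1/3 split comes from the paper):

    import random

    def cross_validation_split(examples, seed=0):
        """GT2k-style split: roughly 2/3 training, 1/3 testing, scored as one batch."""
        shuffled = examples[:]
        random.Random(seed).shuffle(shuffled)
        cut = (2 * len(shuffled)) // 3
        return shuffled[:cut], shuffled[cut:]

    def leave_one_out_splits(examples):
        """Train on everything except one example, test on that example; repeat."""
        for i in range(len(examples)):
            yield examples[:i] + examples[i + 1:], [examples[i]]

    data = list(range(12))             # stand-ins for labeled gesture examples
    train, test = cross_validation_split(data)
    print(len(train), len(test))       # 8 4
    print(sum(1 for _ in leave_one_out_splits(data)))    # 12 rounds of training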

Validation checks to see that the training provided a model that is "accurate enough" for recognition. The process uses substitution, insertion, and deletion errors to calculate this accuracy.

Recognition occurs once valid data is received by a trained model. GT2k abstracts this process away from the user of the system and calculates the likelihood of each model using the Viterbi algorithm.

The remainder of the paper listed possible applications for GT2k including: a gesture panel for controlling a car stereo, a blink recognition system, a mobile sign language system, and a "smart" workshop that understands what actions a user is performing.


Discussion:

GT2k seems like a good system that can help beginning researchers more easily add HMMs into their gesture systems without worrying about implementation issues. Yet, the applications mentioned for GT2k are rather weak in both their concept and their results. HMMs are really only "needed" for one of the applications (sign language), whereas the other applications can be done more easily with simple techniques or moving the sensors away from a hand gesture.

This was a decent paper in writing style, presentation, and (possibly) contribution, but I'm curious to know what researchers have used GT2k and the systems they have created with it.

As a side note, I also am unclear as to why leave-one-out training is good, since with a large data set training the system could take a hell of a long time.

Computer Vision-Based Gesture Recognition For An Augmented Reality Interface

Summary:

Storring et al. from Aalborg University built an augmented reality system intended to provide a "less obtrusive and more intuitive" interface.

The gestures used in the system are mapped to the hand signs for the numbers 0-6 (a fist, index finger, index and middle fingers, etc.). This gesture set can be recognized in a 2D plane with a camera. For these gestures to work, the hand must be segmented from the image. The authors use normalized RGB values, called chromaticities, to minimize the variance in color intensity. The chromaticity distributions for the background and for skin are modeled as 2D Gaussians. The hands are assumed to span between a minimum and a maximum number of pixels.
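
Normalized RGB is cheap to compute per pixel. A sketch of the segmentation idea, where the skin-model numbers are invented and the box test is a crude stand-in for the paper's 2D Gaussians:

    import numpy as np

    def chromaticities(image):
        """Normalized r = R/(R+G+B) and g = G/(R+G+B) per pixel; dividing out the
        intensity is what makes the values tolerant to lighting changes."""
        rgb = image.astype(float)
        total = rgb.sum(axis=2, keepdims=True) + 1e-6
        norm = rgb / total
        return norm[..., 0], norm[..., 1]

    def skin_mask(image, mean=(0.45, 0.31), std=(0.03, 0.02), k=2.5):
        """Mark pixels whose chromaticity lies near an assumed skin distribution."""
        r, g = chromaticities(image)
        return (np.abs(r - mean[0]) < k * std[0]) & (np.abs(g - mean[1]) < k * std[1])

    frame = np.random.randint(0, 256, size=(120, 160, 3), dtype=np.uint8)
    print(skin_mask(frame).sum(), "candidate skin pixels")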

Gestures are found by counting the number of fingers. A polar transformation counts the number of spikes (fingers) currently shown on the hand. Click gestures are detected by checking the change in the hand's bounding box width between the regular index finger gesture and a "thumb click" addition.


Discussion:

For a system that is supposed to be less obtrusive and more intuitive than current interfaces, virtual reality with unintuitive gestures does not seem like a good solution. Using "finger numbers" is a poor choice, and having a gigantic head-mounted display with cameras is probably less comfortable than looking at a computer screen. Furthermore, if the authors are focusing on using head equipment, why not just use gloves to increase the gesture possibilities?

Thursday, February 21, 2008

3D Visual Detections of Correct NGT Sign Production

Summary:

Lichtenauer et al. created an interactive Dutch sign language (NGT) system that helps train children to produce signs correctly. Their system has various requirements, including working under mixed lighting, being user independent, responding immediately, adapting to skill level, and being invariant across valid ways of producing a sign.

The authors' system uses two cameras to digitally track a person's head and hands, and a touch screen is placed in front of the user for software interactivity. The skin color of the person is first determined by finding the face, which is done by having the system's operator click a pixel inside the face and a pixel just outside the head. These pixels then provide a way to train the system's skin color model, which is a Gaussian perpendicular in RGB space. The face and hands are each given separate left and right RGB distributions, since the authors assume a light source will typically come from one direction, such as an open window. Hands are detected by their number of skin pixels, and the motion of a hand starts the tracking.

The system uses fifty 2D and 3D properties (features) related to hand location and movement. These properties are assumed to be independent, and base classifiers for each feature are computed and summed together to get a total classification value. The base classifiers use Dynamic Time Warping (DTW) to find the correspondence between two feature signals over time, and they are trained with the "best" 50% of the training set for each feature. A sign is classified as correct if the average classifier probability for a class is above a threshold.
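
DTW itself is a small dynamic program for aligning two signals that differ in speed; a minimal sketch:

    import numpy as np

    def dtw_distance(a, b):
        """Dynamic time warping cost between two 1-D feature trajectories."""
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]

    # a slow and a fast version of "the same" movement still align cheaply...
    print(dtw_distance([0, 1, 2, 3, 2, 1], [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1]))
    # ...while a different movement does not
    print(dtw_distance([0, 1, 2, 3, 2, 1], [3, 3, 3, 0, 0, 0]))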

The authors report achieving "95% true positives" on their data.


Discussion:

In class, we have already discussed the issue with the 95% true positive rate: the system is set up so that each sign is known in advance and the user is supposed to produce that correct sign, so always returning true would produce 100% accuracy.

I think the larger issue is that the classifier itself needs to be tested independent of the system. Theoretically, a separate classifier can be fine tuned for each gesture so that it can correctly recognize a single gesture 100% of the time. The issues involved with using a generic classifier will then be avoided.

Wednesday, February 20, 2008

Television Control by Hand Gesture

Summary:

Freeman and Weissman devised a way to control a TV with hand gestures using computer vision. In their system, the user's hand acts as a mouse. The user moves their open hand in front of the camera, palm facing toward the television, and the computer detects their hand and maps it to an on-screen mouse. When the user holds their hand over a control for a brief time period, the control is executed. Closing their hand or moving it out of the computer's vision deactivates the mouse.

Hand movement is detected by checking the angle difference between two vectors of pixels, where the pixels come from an image frame and an offset version of it. The dx and dy components of the image gradient are calculated, and the resulting orientation measure holds up under different lighting conditions.
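
A small numpy sketch of the orientation idea; the frame comparison below (a mean absolute angle difference) is my simplification, not necessarily the paper's exact matching step:

    import numpy as np

    def orientation_map(gray):
        """Per-pixel gradient direction; ratios of dy and dx survive global
        lighting changes better than raw intensities do."""
        dy, dx = np.gradient(gray.astype(float))
        return np.arctan2(dy, dx)

    def angle_difference(frame_a, frame_b):
        """Mean absolute orientation difference between two frames, wrapped to [0, pi]."""
        d = np.abs(orientation_map(frame_a) - orientation_map(frame_b))
        return np.minimum(d, 2 * np.pi - d).mean()

    a = np.random.rand(60, 80)
    print(angle_difference(a, np.roll(a, 2, axis=1)))    # small shift, small difference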

Discussion:

This paper was quaint. The actual algorithms used were rather simple, but the concept of controlling a TV via hand waving intrigued me. My main concern is that this application would train people watching a TV to not make any sudden movements so that the on-screen menu would not appear. Also, it forces people to walk through a living room slowly so that the TV does not catch their hand in any rapid movements. Some better gestures would benefit this system, such as twisting motions for channel or volume control.

A Survey of Hand Posture and Gesture Recognition Techniques and Technology

Summary:

This paper by LaViola presented a summary of key gesture recognition techniques. Hand posture and gesture recognition was divided up into several categories: feature extraction, statistics, models, and learning approaches. Some approaches, such as template matching, are more suited for postures, whereas HMMs are used solely for gestures. Feature extraction is used for both, but the feature set can be computationally heavy for the large dimension spaces.

Possible applications for gestures and postures include sign language, presentation assistance, 3D modeling, and virtual environments.


Discussion:

This paper is a good summary of current techniques and their strengths and weaknesses. There's not much to discuss, since summarizing an 80-page summary is rather dull and pointless, but I will be referring back to this paper for future work.

Monday, February 18, 2008

Real-time Locomotion Control by Sensing Gloves

Summary:

Komura and Lam propose using P5 gloves to control character motion. The authors feel that "walking fingers" can provide a more tangible interface for controlling locomotion than traditional joystick or keyboard techniques.

The authors use a P5 glove for gesture capture, and the user first calibrates the fingers by moving them in time with a walking animation displayed on a computer screen. The calibration uses a simple function comparing the cycle of the user's fingers to the cycle of the animation.

After calibration, the user's fingers should be in-sync with the walking motions. For animating quadrupeds, there might need to be a phase shift between the back and front legs.

To test their system, the authors used a CyberGlove and had users play mock games with characters jumping and navigating a maze. Their results showed that navigating with the glove is potentially easier in terms of the number of collisions in a maze, and the glove and keyboard controls allow maze navigation in approximately the same time.


Discussion:

There's not much to say about this paper. The results that they gave were odd, since User 2 completed the maze with a keyboard in 18 seconds but had 22 collisions, and with the glove in 31 seconds with 3 collisions. I'm not sure what to make of that data...

Other than that, the research aspect of this paper basically took the sinusoidal cycle of the fingers and mapped it to the animation's cycle. It might make navigating in certain games easier, but only if you need to control the character's speed with better precision.

Wednesday, February 13, 2008

Shape Your Imagination: Iconic Gestural-Based Interaction

Summary:

Marsh and Watt performed a user study to determine how people represent different types of objects using only hand gestures. Gestures can be either substitutive (where the gestures act as if the object is being interacted with) or virtual (which describe the object in a virtual world).

The authors had 12 subjects of varying academic degrees and majors make gestures for the primitives circle, triangle, square, cube, cylinder, sphere, and pyramid. The users also gestured the complex and compound shapes for football, chair, French baguette, table, vase, car, house, and table-lamp. The users were told to describe the shapes using only non-verbal hand gestures.

Overall, users used virtual hand depictions (75%) over substitutive (17.9%), with some objects having both gestures (7.1%). 3D shapes were always expressed with two hands, whereas primitives had some one-handed gestures (27.8%), like circle. Some objects were too hard for certain users to gesture, such as chair (4) and French baguette (1).


Discussion:

The user study was interesting in some respects, such as seeing how the majority of people describe objects by their virtual shapes, but overall I was disappointed by the paper. Images showing the various stages of depiction would have really helped, as well as actual answers from the questionnaire.

I was confused as to whether the authors were looking for only hand gestures or allowed full body movement, since the authors say they asked the users for hand gestures, yet they did not seem to care that many users walked around the room. That's a pretty large detail to gloss over.

A Dynamic Gesture Recognition System for the Korean Sign Language (KSL)

Summary:

Kim, Jang, and Bien use fuzzy min-max neural networks to recognize a small set of 25 basic Korean Sign Language gestures. The authors use two data gloves, each with 10 flex sensors, 3 location (x, y, z) sensors, and 3 orientation (pitch, yaw, roll) sensors.

Kim et al. find that the 25 gestures they use contain 10 different direction types, D1 through D10.

The authors also discovered that the data often deviates within 4 inches of other data, so the x and y coordinates are split into 8 separate regions from -16 to 16 inches, with 4-inch ticks. The change in x, y direction (CD) is recorded for each time step simply as + and - symbols, and this data is recorded over four steps. CD change templates are then made for the 10 directions, D1 ... D10.

The 25 gestures contain 14 different hand postures based on finger flex position. These flex values are sent to a fuzzy min-max neural network (FMNN) that separates the flex angles within a 10-dimensional "hyper box".

To classify a full gesture, the change of direction is first taken and compared against the templates, and then the flex angles are run through the FMNN. If the total (accuracy/probability) value is above a threshold, the gesture is classified.
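
My reading of that hierarchical decision, as a hypothetical sketch; the templates, thresholds, and the stand-in for the FMNN are all invented:

    def encode_direction(samples):
        """Turn four (dx, dy) steps into the paper's +/- change symbols."""
        return tuple(('+' if dx >= 0 else '-', '+' if dy >= 0 else '-')
                     for dx, dy in samples)

    DIRECTION_TEMPLATES = {
        "D1": (('+', '+'),) * 4,    # made-up templates for illustration only
        "D2": (('-', '+'),) * 4,
    }

    def classify(samples, flex_angles, flex_network, threshold=0.8):
        """Match the direction-change template first, then ask the flex-angle
        network; reject if nothing matches or the confidence is too low."""
        code = encode_direction(samples)
        for direction, template in DIRECTION_TEMPLATES.items():
            if code == template:
                posture, confidence = flex_network(flex_angles)
                if confidence >= threshold:
                    return direction, posture
        return None

    dummy_network = lambda angles: ("hand_shape_3", 0.9)    # stand-in for the FMNN
    print(classify([(1, 2), (2, 1), (1, 1), (3, 2)], [0.1] * 10, dummy_network))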

The authors achieve approximately 85% accuracy.


Discussion:

Although this paper had some odd sections and interesting choices, such as making the time step 1/15th of a second and defining gestures over 4/15ths of a second, the overall idea is quaint. I appreciate that the algorithm separates the data into two categories--direction change and flex angle--and uses the two components hierarchically to choose gestures.

I still do not like the use of neural networks, but if they work I am willing to forgive. My annoyance is also alleviated by the fact that the authors provide thresholds and numerical values for some equations within the network.

I'm very curious why they chose those 10 directions (from the figure). D1 and D8 could be confused if the user is sloppy, and D4 and D7 could be confused with their unidirectional counterparts if the user does their gestures slower than 1/4 of a second. Which is, of course, absurd.

Monday, February 11, 2008

A Survey of POMDP Applications

Summary:

Cassandra's survey summarizes some uses for partially observable Markov decision processes (POMDPs). MDPs are useful in artificial intelligence and planning applications. The overall structure of these problems involves states and transitions between them, with costs associated with the transitions and states. The goal is to find an optimal policy that maximizes reward (or minimizes cost) over the transitions taken.

The POMDP model consists of:
  • States
  • Actions
  • Observations
  • A state transition function
  • An observation function
  • An immediate reward function
Cassandra's paper focuses on examples of using POMDPs, but he describes them in more detail here: http://www.pomdp.org/pomdp/index.shtml. Basically, they are MDP problems in which you cannot observe the entire state.
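
For reference, the model is just a bundle of those six pieces; a minimal sketch of a toy machine-maintenance POMDP (all of the numbers are invented):

    from dataclasses import dataclass
    from typing import Callable, Sequence

    @dataclass
    class POMDP:
        states: Sequence[str]
        actions: Sequence[str]
        observations: Sequence[str]
        transition: Callable[[str, str, str], float]    # P(s' | s, a)
        observe: Callable[[str, str, str], float]       # P(o | s', a)
        reward: Callable[[str, str], float]             # R(s, a)

    machine = POMDP(
        states=["working", "broken"],
        actions=["run", "repair"],
        observations=["good part", "bad part"],
        transition=lambda s, a, s2: (
            1.0 if a == "repair" and s2 == "working" else
            0.9 if (s, a) == ("working", "run") and s2 == "working" else
            0.1 if (s, a) == ("working", "run") and s2 == "broken" else
            1.0 if (s, a) == ("broken", "run") and s2 == "broken" else 0.0),
        observe=lambda s2, a, o: 0.9 if (s2 == "working") == (o == "good part") else 0.1,
        reward=lambda s, a: {"run": 1.0, "repair": -5.0}[a] if s == "working" else -1.0,
    )
    print(machine.transition("working", "run", "broken"))    # 0.1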

Some example applications include:
  • Machine maintenance - parts of the machine are modeled as states, and the goal is to minimize the repair costs or maximize the up-time on the machine.
  • Autonomous robots - robots need to navigate or accomplish a goal with a set of actions, and the world is not always observable
  • Machine vision - determining where to point the higher-resolution focus (i.e., fovea) of the computer image, e.g., at specific parts such as people's hands and heads.
POMDPs have a number of limitations. One is that the states need to be discrete; although continuous states can be discretized, some domains handle this step poorly. The main issue with POMDPs is their computational cost: they become intractable rather quickly, since solving them scales exponentially with the size of the problem.


Discussion:

This paper had little to nothing to do with what we've been currently discussing in class. Although POMDPs are interesting from a theoretical standpoint, their intractability is a huge factor for avoiding them in any practical domain. I've been trying to think of how to even apply them to gesture recognition, and one idea I came up with included modeling hand positions as states for a single gesture, but then it just becomes an HMM with a reward function, and I'm not sure how beneficial a reward function is when taking the computation costs into account.

Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs*

Summary:

Song and Kim's paper proposes a way to use a sliding window for HMM gesture recognition. A window of size 3 slides across the observation sequence O, and a probability estimate for a gesture is taken to be the average of the partial observation probabilities at each timestep in the window. The algorithm also performs "forward spotting", which involves the difference between the maximum probability over the gesture models and the probability of a "non-gesture" model at the same timestep. The non-gesture is a wait class consisting of an intermediate, junk state. As long as the "best" gesture probability exceeds the non-gesture probability by some threshold, the gesture is classified accordingly.

The authors also use accumulative HMMs, which basically take the power set of continuous segmentations within a window and find the combination that produces the highest probability for a gesture.
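
My reading of the windowed scoring plus the non-gesture threshold, as a hypothetical sketch (the scoring functions below are stand-ins for the trained HMMs):

    import numpy as np

    def window_score(score_fn, observations, t, width=3):
        """Average the model's partial-sequence scores over a window ending at t."""
        start = max(0, t - width + 1)
        return np.mean([score_fn(observations[start:k + 1]) for k in range(start, t + 1)])

    def spot_gesture(gesture_models, non_gesture, observations, t, margin=1.0):
        """Accept the best gesture only if it beats the junk/non-gesture model
        by some threshold; otherwise report nothing at this time step."""
        scores = {g: window_score(fn, observations, t) for g, fn in gesture_models.items()}
        best = max(scores, key=scores.get)
        if scores[best] - window_score(non_gesture, observations, t) > margin:
            return best
        return None

    obs = list(range(10))
    models = {"arms_out": lambda seq: float(len(seq)),        # toy scorers
              "left_arm": lambda seq: float(len(seq)) - 2.0}
    print(spot_gesture(models, lambda seq: 0.0, obs, t=5))    # "arms_out"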

The set of gestures that the authors classify consists of 8 simple arm position gestures (e.g., arms out, left arm out, etc.). They report recognition rates between 91% and 95%, depending on their choice of thresholds.


Discussion:

The system might work fine, but I really cannot tell because their test set is so simple. The 8 gestures they present are easily separable, and template matching algorithms could distinguish between them with ease. I also feel that their system becomes intractable as you start adding more gestures or gestures that vary widely in time length--adding more gestures adds overhead to the probability calculations, and varying lengths would likely force the window to be made larger, which would explode the power set step.

Thursday, February 7, 2008

A similarity measure for motion stream segmentation and recognition

Li, C. and B. Prabhakaran (2005). A similarity measure for motion stream segmentation and recognition. Proceedings of the 6th International Workshop on Multimedia Data Mining: Mining Integrated Media and Complex Data. Chicago, Illinois, ACM.


Summary:

Li and Prabhakaran propose a way to "segment" streams of motion data by using singular value decomposition (SVD). SVD is similar to principal component analysis (PCA): the technique finds the underlying geometric structure of a matrix (i.e., its singular vectors and values). By comparing the singular vectors of matrices storing motion data, the matrices can be scored for similarity by measuring the angular differences (dot products) between these vectors.

The authors store motion data in a matrix consisting of columns of features and rows of timesteps. The first 6 eigenvectors are used when comparing matrix similarity; this value was empirically determined. The segmentation part of the paper involves separating this stream of data after every l timesteps, and then comparing the similarity of the segmented matrix to stored eigenvectors and values for a known motion.
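
The similarity test boils down to comparing the leading singular vectors of two motion matrices; a numpy sketch, where k = 6 comes from the paper and the unweighted averaging is my simplification:

    import numpy as np

    def leading_vectors(motion, k=6):
        """Right singular vectors of a (timesteps x features) motion matrix."""
        _, _, vt = np.linalg.svd(motion, full_matrices=False)
        return vt[:k]

    def similarity(motion_a, motion_b, k=6):
        """Mean absolute dot product between corresponding singular vectors;
        values near 1.0 mean the motions share the same geometric structure."""
        va, vb = leading_vectors(motion_a, k), leading_vectors(motion_b, k)
        return float(np.mean(np.abs(np.sum(va * vb, axis=1))))

    rng = np.random.default_rng(0)
    walk = rng.normal(size=(100, 22))     # 100 timesteps of 22 joint-angle features
    print(similarity(walk, walk + 0.01 * rng.normal(size=walk.shape)))    # near 1
    print(similarity(walk, rng.normal(size=(80, 22))))                    # lower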

To test their system, the authors merged individual motions together into a "stream" of data and inserted noise in between motions. The authors noted that the number of eigenvectors needed to distinguish between matrices (originally k = 6) varied depending on the data collection method. The paper reported recognition rates in the mid 90s, but these results depend on how similar the motions are to one another.


Discussion:

Although the paper has little to do with segmentation, the actual algorithm for comparing motion data seems interesting and appears to achieve relatively accurate results. I would like to know the actual motions that users performed, since I have no idea what motions are required in Taiqi and Indian dances. They also did not mention the number of people involved in the data capturing, and I assume this number to be close to 1 since they needed a user to wear a motion suit.

Wednesday, February 6, 2008

Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation

Ip, H. H. S., K. C. K. Law, et al. (2005). Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation. Proceedings of the 11th International Multimedia Modelling Conference, 2005.

Summary:

Ip et al. created Cyber Composer, a music generation program controlled via hand gestures. The authors' motivation is to inspire both musicians and casual listeners to experience music in a new way.

The authors split music composition into three parts: melody, rhythm, and tone. The melody is the "main" part of the music and mainly includes the treble parts, such as the singer. The rhythm keeps the beat of the music and is played by the drums and bass. Tonal accompaniment involves creating harmony across all parts.

In order to keep the tone (harmony) of the music interesting and flowing, the authors create a small "chord affinity" matrix that describes certain chord lead/following strengths. During music composition, chords are automatically chosen with high affinity. Melody notes are also chosen automatically to create musical "tension".

The system was implemented using two 22-sensor CyberGloves and two Polhemus positioning receivers. MIDI was used to produce the musical notes.

The seven gestures used in the system include rhythm, pitch, pitch-shifting, dynamics, volume, dual-instrument mode, and cadence. Rhythm is controlled by the flexing of the right wrist. Pitch is controlled by right-hand height, and it is reset at the beginning of each bar. The user can also "shift" the pitch by performing a similar gesture. Note dynamics and volume are controlled by the right-hand finger flex, with fully flexed fingers forcing forte notes. Dual-instrument mode allows a harmony melody or unison melody to be played along with the main instrument; this mode is activated using the left hand. To end the piece, the left-hand fingers are closed.

There are no results.


Discussion:


This paper aroused me. Some of the gestures they defined were intuitive, such as opening and closing of the fingers for volume and moving the hand up and down for notes. Other gestures just seem awkward, such as the ambiguous dual-instrument mode and constantly flapping your wrist (ouch?) to drive the melody.

I'm familiar with building music composition programs (including "smart" programs that use musical theory to assist composition), and I think this program was trying to market itself as something that it could never become. A music tool has to be either robust to allow experts to use it, sacrifice some features to become simple for novices, or fun for just the casual listener. In the expert category I would place Finale, and on the casual end I would place music games such as Guitar Hero. Novice programs are harder to come by, and the tool I worked on was ImproVisor--a system that used intelligent databases to analyze input notes and determine if the notes "sounded good".

Cyber Composer is trying to do everything at once and failing. The lack of any results, even a casual comment from an offhand user, tells me that the system is rather convoluted to use or poor for composition. The hand gestures cannot really control notes in a way that experts would use the system, novices will not understand the theory behind why their hand waving sounds good or bad, and casual musicians will probably have no idea what is going on.


Sunday, February 3, 2008

Hand Tension as a Gesture Segmentation Cue

Philip A. Harling and Alistair D. N. Edwards. Hand tension as a gesture segmentation cue. Progress in Gestural Interaction: Proceedings of Gesture Workshop '96, pages 75--87, Springer, Berlin et al., 1997


Summary:

Harling and Edwards describe a way to segment hand gestures based on hand tension. The basic idea is that as a user dynamically moves between static postures a and b, their hand will reach a "relaxed", low-tension minimum position c that is less tense than either a or b.

Smaller details:
  1. To find the tension for each finger, the authors use Hooke's Law and treat a finger as if it were a spring (a rough sketch follows this list)
  2. The total hand tension is the sum of the finger tensions
  3. They used a Mattel PowerGlove
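
Here is that sketch: a minimal version of the tension cue, assuming flex readings normalized so that 0 is the relaxed position (the spring constants and the sample numbers are invented):

    def finger_tension(flex, rest=0.0, k=1.0):
        """Hooke's law: treat the finger as a spring whose tension grows with
        displacement from its relaxed position."""
        return k * abs(flex - rest)

    def hand_tension(flex_readings):
        """Total hand tension is the sum over the five fingers."""
        return sum(finger_tension(f) for f in flex_readings)

    # tension dips between two postures: segment at the local minimum
    frames = [[0.9, 0.8, 0.9, 0.9, 0.8],    # posture a (tense)
              [0.2, 0.1, 0.2, 0.1, 0.1],    # relaxed transition
              [0.8, 0.9, 0.9, 0.9, 0.9]]    # posture b (tense)
    print([round(hand_tension(f), 2) for f in frames])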

Discussion:

The idea behind the paper was actually quite good for segmenting between static postures. I have a feeling that hand tension will not work as well for moving gestures, since there would be spurious segmentation points within the gesture.

I'm disappointed at the lack of results. I can forgive other papers that were user studies, but I cannot forgive a paper that does not report easily obtainable results when the authors spent 8 pages discussing a topic I summarized in one sentence. Segmentation is a simple task to gather data for, and a published paper should at least attempt to report an accuracy number.

On a technical note, I'm curious as to how hand tension is affected by the type of glove worn. I have a feeling that my "hand relaxed" position is going to be different for a P5 glove than it will be for a CyberGlove or even a CyberGlove with a Flock of Birds attached. All the extra weight will most likely force my hand into resting upon the equipment for support.

A Multi-Class Pattern Recognition System for Practical Finger Spelling Translation

Hernandez-Rebollar, J. L., R. W. Lindeman, et al. (2002). A multi-class pattern recognition system for practical finger spelling translation. Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on.

Summary:

Hernandez-Rebollar et al. present a two-part paper: a glove (the AcceleGlove) and a test platform for the glove that uses decision trees.

The AcceleGlove contains 5 accelerometers, one placed at the middle joint of each finger. Each accelerometer measures x and y angles, yielding a total of 10 sensor readings every 10 milliseconds. The raw data matrix of x and y values is transformed into separate Xg, Yg, and yi values. Xg (global x) measures finger orientation, roll, and spread. Yg measures how bent the fingers are. The third component classifies the hand posture into three values: closed, horizontal, and vertical; it is actually just the index finger's y component (only for the ASL letters 'F' and 'D' is the index finger an unreliable proxy for this measurement).

To classify a posture, the decision tree first splits the letters into vertical, horizontal, and closed groups. Each group is then partitioned further (e.g., rolled, flat, pinky up), and these subsections distinguish between the actual letters.
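A hedged sketch of a decision tree in this spirit; the thresholds, feature scaling, and letter groupings are invented, since the paper does not spell out its exact splits:

```python
def classify_letter(xg, yg, index_y):
    """Toy two-level decision tree in the spirit of the paper's classifier.
    xg, yg, index_y are the global-x, global-y, and index-finger-y features;
    all thresholds and letter groups here are made up for illustration."""
    # Level 1: coarse hand orientation from the index finger's y component.
    if index_y < -0.5:
        orientation = "closed"
    elif index_y < 0.5:
        orientation = "horizontal"
    else:
        orientation = "vertical"

    # Level 2: finer split on the global features, then a lookup among a few letters.
    if orientation == "vertical":
        return "rolled group (e.g. R, U, V)" if xg > 0.3 else "flat group (e.g. B, F)"
    if orientation == "horizontal":
        return "G/H-like group" if yg > 0.0 else "P/Q-like group"
    return "closed group (e.g. A, E, S)"

print(classify_letter(xg=0.4, yg=0.1, index_y=0.8))
```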

They mention a 100% recognition rate for 21 of the letters, with 78% being the worst per-letter accuracy.


Discussion:

I like this paper for 2 main reasons:
  1. There are no HMMs
  2. They did not use a CyberGlove
The paper's results and decision tree theory are a bit lacking, but I think that the ideas behind the paper were good and refreshingly different from the ochoish other papers we've read.

I'm curious as to how well the glove they designed can work with gestures instead of postures. The glove polls each accelerometer sequentially, which could be a problem with very quick gestures. This issue is probably not too important, but it might introduce slightly more error than a batch poll.

I'm also curious as to how they designed their decision tree. The intuition behind the partitioning is not made clear, except for the main partition of open/close/horizontal.

Wednesday, January 30, 2008

A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models

Qing, C., A. El-Sawah, et al. (2005). A dynamic gesture interface for virtual environments based on hidden Markov models. Haptic Audio Visual Environments and their Applications, 2005. IEEE International Workshop on.


Summary:

The authors of this paper used the HMM & CyberGlove dynamic duo in conjunction with standard deviations.

Qing et al. claim that using the standard deviation of finger positions lets them address the "gesture spotting" (segmentation) problem in a continuous data stream. The glove data is sampled at 10 Hz, and then the standard deviation of each sensor is calculated. The standard deviations also collapse a series of observation vectors into a single vector, which is then vector quantized (VQ) to obtain a discrete symbol.
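As I read it, the feature step looks roughly like this: compute per-sensor standard deviations over a window of glove frames, then vector quantize that single vector against a codebook. The window length and codebook below are placeholders:

```python
import numpy as np

def stddev_feature(window):
    """window: (T, S) array of T glove frames with S sensors -> one S-dim std-dev vector."""
    return np.std(np.asarray(window), axis=0)

def quantize(vector, codebook):
    """Vector quantization: return the index of the nearest codebook entry."""
    distances = np.linalg.norm(codebook - vector, axis=1)
    return int(np.argmin(distances))

# Toy example: 10 Hz sampling, a 1-second window, 22 sensors, 8-symbol codebook (all assumed).
rng = np.random.default_rng(0)
window = rng.random((10, 22))
codebook = rng.random((8, 22))

feature = stddev_feature(window)
symbol = quantize(feature, codebook)
print(symbol)  # discrete observation fed to the HMM
```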

The three gestures they used to test their system controlled the rotation of a cube: bending one finger, bending two fingers, and a twisting motion of the thumb.


Discussion:

Sigh, no results. I have no idea how the system actually solves the gesture spotting problem; they have just traded the "is this observation the start of a gesture?" problem for a "does this standard deviation vector look like the start of a gesture?" problem.

Also, with only three gestures, standard deviations might work for distinguishing between them. But if the hand is moving continually, the standard deviation of every finger will fluctuate wildly.

I now know more about the bone structure of a hand.

Online, Interactive Learning of Gestures for Human/Robot Interfaces

Lee, C. and X. Yangsheng (1996). Online, interactive learning of gestures for human/robot interfaces. Robotics and Automation, 1996. Proceedings., 1996 IEEE International Conference on.


Summary:

Lee and Yangsheng created an HMM system that allows gestures to be updated online. If the system is confident about a gesture (i.e., its score passes a threshold), it performs the action associated with the gesture. Otherwise, the system asks the user to confirm the gesture. The HMM is then updated using the Baum-Welch algorithm (an EM algorithm for re-estimating an HMM's transition and observation probabilities from data).

Their system uses a CyberGlove to capture hand gestures. Gesture data is resampled and smoothed before vector quantization, and gestures are segmented by having the user stop or remain still for a short time.

Gestures are evaluated with a logarithmic score based on the probability of the observation sequence under the model. If the score is below a threshold the gesture is considered correct; if it is above the threshold it is considered suspect or incorrect.
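In code, the decision rule would look something like the sketch below; the score definition and the -2 cutoff (borrowed from the "V" value they plot) are my simplifications:

```python
def decide(log_score, threshold=-2.0):
    """Interactive-learning decision rule as I read it: scores below the threshold are
    trusted and executed; scores above it are suspect, so the user is asked to confirm
    (and the confirmed example is fed back into Baum-Welch re-estimation).
    The exact score normalization and threshold are assumptions, not the paper's."""
    if log_score < threshold:
        return "execute gesture"
    return "ask user to confirm, then update HMM with Baum-Welch"

print(decide(-5.3))   # -> execute gesture
print(decide(-1.1))   # -> ask user to confirm, then update HMM with Baum-Welch
```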

The domain for testing the system was 14 sign language letters that were distinct enough to be used with VQ.


Discussion:

I'm very confused by the graphs they give. They mention that if their "V" values corresponding to the correct/incorrect threshold are below -2, then the gesture is correct. Yet their graphs only show 2 examples ever even bordering on the -2 mark; all other values were way below -2. Does this mean that their system was always confident?

I also have an issue with telling the computer what the correct gesture is. Although I've done almost the exact same thing in recent work, hand-gesturing systems are geared toward non-keyboard-monitor use. For instance, to control a robot, I'd probably be looking at the robot and not a monitor. In the field I would not want to turn around, find my keyboard, punch up the correct gesture, and continue.

Monday, January 28, 2008

An Architecture for Gesture-Based Control of Mobile Robots

Iba, S., J. M. V. Weghe, et al. (1999). An architecture for gesture-based control of mobile robots. Intelligent Robots and Systems, 1999. IROS '99. Proceedings. 1999 IEEE/RSJ International Conference on.


Summary:


Iba et al. describe a gesture-based control scheme for robots. HMMs are used to define seven gestures: closed fist, open hand, wave left, wave right, pointing, opening, and "wait". These gestures correspond to actions that a robot can take, such as accelerating and turning.

The mobile robot that the system uses has IR sensors, sonar sensors, a camera, and a wireless transmitter. The gesture capturing is done with a CyberGlove with 18 sensors.

Gesture recognition is performed with an HMM-based recognizer. The recognizer first preprocesses the sensor data, reducing the 18-dimensional sensor readings to a 10-dimensional feature vector. The derivative of each feature is computed as well, producing a 20-dimensional column vector. Each column is then reduced to a "codeword," mapping the input to one of 32 possible symbols. This codebook is trained offline, and at runtime each feature vector is mapped to its nearest codeword.
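A sketch of that preprocessing pipeline, with a stand-in projection matrix and codebook in place of the offline-trained ones:

```python
import numpy as np

rng = np.random.default_rng(1)
PROJECTION = rng.random((10, 18))   # stand-in for the paper's 18 -> 10 feature reduction
CODEBOOK = rng.random((32, 20))     # stand-in for the offline-trained 32-entry codebook

def to_codeword(raw, prev_feat):
    """raw: 18 glove sensor values; prev_feat: previous 10-dim feature vector."""
    feat = PROJECTION @ np.asarray(raw)       # 18-dim sensors -> 10-dim features
    deriv = feat - prev_feat                  # finite-difference derivative
    column = np.concatenate([feat, deriv])    # 20-dim column vector
    distances = np.linalg.norm(CODEBOOK - column, axis=1)
    return int(np.argmin(distances)), feat    # nearest codeword index (0..31)

prev = np.zeros(10)
codeword, prev = to_codeword(rng.random(18), prev)
print(codeword)   # discrete symbol handed to the HMM gesture spotter
```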

The HMM takes a sequence of codewords and determines which gesture the user is performing. It is important to note that if no suitable gesture is found, the recognizer can return "none". To overcome some HMM problems, the "wait state" is the first node in the model and transitions to the other 6 gestures. If no gesture is currently seen, the wait state is the most probable. As more observations push the gesture toward another state, the correct gesture probability is altered and the gesture spotter picks the gesture with the highest score.


Discussion:

I'd have liked to know the intuition behind using 32 codewords. The inclusion of the wait state is also odd in combination with the "opening" gesture, which does not seem to be mapped to anything; technically, opening is just a wait+1 step toward either the closed or open state. I don't have much more to say on this one.

HoloSketch: A Virtual Reality Sketching / Animation Tool

Deering, Michael F. HoloSketch: A Virtual Reality Sketching/Animation Tool. (1995) ACM Transactions on Computer-Human Interaction.

Summary:


Deering's 3D VR system, HoloSketch, aimed to allow the creation of three-dimensional objects in a virtual reality environment. Users donned VR goggles with a supercool 960x680 20'' CRT monitor and interacted with the virtual world via a six-axis wand. The head-tracking goggles allow the user to look around images hovering in front of them.

HoloSketch prides itself on displaying stable images that do not "float" or "swim" as the user moves their head. This is accomplished with a highly accurate absolute orientation tracker in the goggles, helped by the flat-screen CRT and software corrections for interocular distance.

A good chunk of the paper focused on user interactions, such as menu navigation. Deering's system uses a 3D pie (radial) menu that is activated by holding down the right button on the wand. The user can then navigate the menu while holding the button and "poke" it to activate submenus and items. To create an object, the user first selects a primitive from the menu and then places it by hitting a button on the wand. The user can then rotate, scale, and position the object using a combination of wand-waving and keyboard buttons.

Users can also create animations with the system. Some animations require still shots of slightly altered objects that can be grouped temporally (like a VR flipbook). Other animations can be added to objects or groups, such as a rotor property or blinking colors.

An artist tested the system for a month and provided feedback. Overall the artist found the tool easy to work with after a few days, although some features available in other applications were missing from HoloSketch. One issue Deering noticed was that users barely moved their heads when trying to view an object; people are so used to keeping their heads still that examining an object from different angles was not intuitive.


Discussion:

HoloSketch seems like an interesting application and provides a variety of ideas, some of which I believe are beneficial, while others are not. The "poking" of menus seems intuitive, and if the system has high absolute accuracy this should work well. Yet Deering mentions that users' arms get tired and unstable, and supporting the arm and wrist is out of the question when you are trying to make an environment feel natural. Instead, HoloSketch has a button that reduces jitter when activated, which seems like a hack: a quick fix for a potentially serious usability issue.

I also understand why people would not want to constantly move their heads around the display. If the display were on a round table this would be a non-issue, but constantly shifting in a chair and leaning in different directions is a strain on the user. Furthermore, the 20" CRT is not a large enough screen for the user to "see" all around the object; I would have liked to know the actual viewing angle.

Overall, though, I liked the system and the paper itself was well-written and gave a good overview of the features.

Thursday, January 24, 2008

An Introduction to Hidden Markov Models

Summary:

Rabiner and Juang's paper on Hidden Markov Models (HMMs) introduces the models, defines the three main problems associated with HMMs, and provides examples for utilizing HMMs.

HMMs are time-dependent models that consist of observations and hidden states. As an example, the authors discuss coin-flip models in which coins of varying bias (the states) are selected by probabilistic transitions that determine which coin is flipped next. One person continuously flips coins and records the data; another person receives only the outcomes of the flips, i.e., O = O1, ..., OT. Which coin is being flipped is hidden from the observer.
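To make the coin example concrete, here is a tiny two-coin sampler: the hidden state is which biased coin is being flipped, and the observer only ever sees heads or tails. The probabilities are made up:

```python
import random

# Hidden states: two coins with different head probabilities (values invented).
HEAD_PROB = {"coin1": 0.8, "coin2": 0.3}
# Transition probabilities between coins.
TRANSITION = {"coin1": {"coin1": 0.7, "coin2": 0.3},
              "coin2": {"coin1": 0.4, "coin2": 0.6}}

def sample(T=10, start="coin1"):
    """Generate T observations; the state sequence stays hidden from the observer."""
    state, observations = start, []
    for _ in range(T):
        observations.append("H" if random.random() < HEAD_PROB[state] else "T")
        nxt = TRANSITION[state]
        state = random.choices(list(nxt), weights=list(nxt.values()), k=1)[0]
    return observations

print(sample())   # e.g. ['H', 'H', 'T', 'H', ...] -- only the flips are observable
```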

Rabiner and Juang define three main elements of HMMs as:

1) HMMs have a finite number of states, N
2) A "new" state is entered at time, t, depending on a given transition probability distribution.
3) Observable output is made after each transition, and this output depends on the current state.

The formal notation for an HMM is:

T = the time length of the observable sequences (i.e., how many observations seen)
N = the number of states
M = the number of observation symbols (if observations are discrete)
Q = the states {q1, q2, ... , qN}
V = the observations {v1, v2, ... , vM}
A = the state transition probability distribution {aij}, aij = P(qj at t + 1 | qi at t), i.e., the probability of being in qj given that we were in qi at the previous time step
B = the observation symbol probability distribution in state j, {bj(k)}, bj(k) = P(vk at t | qj at t)
π = the initial state distribution {πi}, πi = P(qi at t = 1)

The three problems for HMMs are:

1) Evaluation: given an observation sequence O = O1, ..., OT and a model, compute P(O | model), the probability of the observation sequence.
2) Decoding: given O and a model, find the state sequence that best explains the observations.
3) Learning: adjust the model parameters A, B, and π to maximize P(O | model).

Solutions to these problems are presented in the paper, but mathematical symbols are difficult to represent in the blog, and many of the images used are illegible. Instead, I'll jump to the authors' discussion of uses and issues.

One issue with HMMs is underflow, since the forward and backward variables αt(i) and βt(i) approach zero very quickly (they are products of probabilities between 0 and 1). Another issue is how to actually structure an HMM, i.e., what are the states and transitions?
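The usual fix for underflow is to rescale the forward variables at every time step (or work in log space). A minimal scaled forward pass, using the notation above with invented numbers:

```python
import numpy as np

def forward_scaled(A, B, pi, obs):
    """Scaled forward algorithm: A is NxN, B is NxM, pi has length N, obs is a list of
    observation indices. Returns log P(O | model) without underflow, since alpha_t is
    renormalized at each step and the scale factors are accumulated in log space."""
    alpha = pi * B[:, obs[0]]
    log_prob = np.log(alpha.sum())
    alpha /= alpha.sum()                      # rescale so alpha_t sums to 1
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        log_prob += np.log(alpha.sum())       # accumulate the scale factor
        alpha /= alpha.sum()
    return log_prob

# Tiny 2-state, 2-symbol example (numbers invented).
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(forward_scaled(A, B, pi, obs=[0, 1, 0, 0, 1]))   # log-likelihood of the sequence
```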

HMMs are good for modeling sequential information where the current state relies only on the previous (or previous 2) states. These models, such as for isolated word recognition, are easy to build and not too computationally intensive. People usually do not insert random sounds into the middle of a word, so the probability distributions for these models are easy to build.


Discussion:

Overall the HMM paper is a good overview of HMMs. I really don't have much to say about this paper, except that I wish I had page 14 and I wish that the figures were readable.

As far as HMMs in hand gestures go, I have always shied away from using HMMs because I feel that the power you get from them is offset by heavy constraints and a large overhead in implementation and computation time. The class could theoretically model some types of sign gestures with HMMs, but I guess we'll see what data the class collects and whether any sort of probability distributions present themselves.


Wednesday, January 23, 2008

American Sign Language Finger Spelling Recognition System

Allen, J., Pierre, K., and Foulds, R. American Sign Language Finger Spelling Recognition System. (2003) IEEE.

Summary:

Allen et al. created an ASL recognition system using neural networks and an 18-sensor CyberGlove. The authors propose that a wearable glove recognition system can help translate ASL into English and assist deaf (and even blind) people by allowing them to converse with hearing people.

The authors used a character set of 24 letters, omitting 'J' and 'Z' because those letters require arm motion; the remaining 24 characters use only static hand postures. Data from the CyberGlove was collected and recognized in a Matlab program, and a second program, written in LabVIEW, output the corresponding audio for each recognized character.

The ASLFSR recognition system is a perceptron network with an 18x24 input (18 sensors, one sample per each of the 24 characters) and a desired 24x24 output (an identity matrix over the recognized symbols). The network was trained with an "adapt" function.
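A rough sketch of a single-layer perceptron set up this way: 18 flex inputs, 24 outputs with identity-matrix targets, and the classic perceptron update. The paper trains in Matlab with "adapt"; this NumPy version, with random stand-in data, just shows the shape of the problem:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((24, 18))   # stand-in training data: one 18-sensor sample per letter
Y = np.eye(24)             # identity-matrix targets, one row per letter

W = np.zeros((18, 24))
b = np.zeros(24)

def predict(x):
    """Hard-threshold outputs: the perceptron 'clobbers' real-valued activations to 0/1."""
    return (x @ W + b > 0).astype(float)

# Classic perceptron learning rule, one sample at a time.
for epoch in range(50):
    for x, y in zip(X, Y):
        error = y - predict(x)
        W += 0.1 * np.outer(x, error)
        b += 0.1 * error

letter_index = int(np.argmax(X[3] @ W + b))   # strongest output unit for one sample
print(chr(ord('A') + letter_index))           # naive index->letter (the real set skips 'J' and 'Z')
```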

The system worked well for a single user, with recognition accuracy of up to 90%.


Discussion:

The authors claim that they can achieve a better level of accuracy by training the network on data from multiple subjects, but I completely disagree. That's like saying a hand-tailored suit fits alright, but the pin-stripe at the blue light special is better since it has been designed for the average Joe.

To improve their accuracy they should improve their model. Perceptrons are not very powerful since they hard-threshold ("clobber") their outputs, and using different neurons (Adalines?) might improve the results. Also, neural networks sometimes work better with more than 2 layers, and data from 18 non-distinct inputs would probably benefit from even a 3-layer NN. Multi-layer NNs are notoriously tricky to design "well" (i.e., guess and check).

Flexible Gesture Recognition for Immersive Virtual Environments

Deller, M., A. Ebert, et al. (2006). Flexible Gesture Recognition for Immersive Virtual Environments. Information Visualization, 2006. IV 2006. Tenth International Conference on.

Summary:

Deller et al. used hand gestures with a P5 glove to control various aspects of a desktop environment. The glove allows users to manipulate virtual objects in three dimensions.

The apparatus the authors used is the P5 glove, which has 5 finger sensors and an infrared tracking system. The glove was used to create hand gestures, where a gesture is a hand posture held for approximately half a second. Gestures are stored as sensor-vector templates, and each new gesture is compared against the gesture library via a simple distance measurement.
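A minimal sketch of this kind of template matching, assuming a gesture is just a 5-value flex vector (the templates and distance threshold are invented):

```python
import math

# Hypothetical gesture templates: 5 flex values per posture (names and numbers invented).
TEMPLATES = {
    "point": [0.1, 0.9, 0.9, 0.9, 0.9],
    "fist":  [0.9, 0.9, 0.9, 0.9, 0.9],
    "open":  [0.1, 0.1, 0.1, 0.1, 0.1],
}

def recognize(sample, max_distance=0.6):
    """Return the closest template by Euclidean distance, or None if nothing is close."""
    best_name, best_dist = None, float("inf")
    for name, template in TEMPLATES.items():
        dist = math.dist(sample, template)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None

print(recognize([0.15, 0.85, 0.9, 0.88, 0.95]))   # -> 'point'
```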

The authors had users test the system.


Discussion:

The approach to hand gestures is simple, such as using a distance measure for gesture classification. Using a more complex classifier might improve the accuracy, but with only 5 sensors the gestures might be simple and distinct enough that a simple solution is sufficient.

I hope that presenting some results, at least in user study form, is the norm for the remaining papers we read. I cannot really take anything from this paper since I'm not sure if anything works well. The methods are so simple that I can implement them quickly, but it would be nice to have a baseline to compare to.

Tuesday, January 22, 2008

Environmental Technology: Making the Real World Virtual

Krueger, M. W. (1993). "Environmental technology: making the real world virtual." Commun. ACM 36(7): 36-37.

Summary:

Krueger's short paper described applications possible with a sensor-filled environment. Krueger focused on having a human be the mechanism for interaction, i.e., a person's hand and body would interact with non-wearable sensing equipment.

One application had a user interact with a 1000-sensor room to project images onto a screen. Depending on a user's position, the user would be projected into a maze or control musical notes.
Another application showed hand projections from two people miles away via a teleconference. The two people could interact in a shared space and discuss objects by pointing at them.

A "windshield" application allowed a user to "fly" across a graphical world by manipulating their hand positions. This application existed in Kreuger's VIDEOPLACE environment, which is basically a collection of these types of virtual world creations and interactions.


Discussion:

Krueger's paper mentions a great number of interesting applications but does not discuss any in detail. Since the applications are listed as references, I'll have to look them up sometime. From the paper it sounds like some of the applications are impressive, but they were also created in the 70s and 80s, so they may be limited by the networking and graphics capabilities of the time. I'm also interested to see what he has done since then.