Wednesday, February 27, 2008

American Sign Language Recognition in Game Development for Deaf Children

Summary:

Brashear et al. use GT2k to create an American Sign Language game for deaf children. The system, called CopyCat, teaches language skills to children by having them sign various sentences to interact with a game environment.

A Wizard of Oz study was used to gather data and design their interface. A desk, mouse, and chair were used in the study, along with a pink glove. The students pushed a button and then signed a gesture, and the data was collected using the glove and an IEEE 1394 video camera. The users were 9- to 11-year-olds.

The hand is segmented from the video image using its bright color. The image pixel data is converted to an HSV color space histogram, which is used to binarize the data and find the hand. Accelerometers are also used to track hand movement in x, y, and z positions.
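To make the color segmentation concrete, here is a minimal sketch of histogram back-projection on an HSV frame (numpy only; the bin counts and threshold are my own placeholders, not values from the paper):

    import numpy as np

    def segment_hand(hsv_frame, glove_hist, threshold=0.3):
        """Binarize an HSV frame using a hue/saturation histogram of the glove color.

        hsv_frame:  H x W x 3 array, hue in [0, 180) and saturation in [0, 256)
        glove_hist: normalized 2D histogram over (hue, saturation) built from glove pixels
        """
        hue_bins, sat_bins = glove_hist.shape
        h = (hsv_frame[..., 0].astype(int) * hue_bins // 180).clip(0, hue_bins - 1)
        s = (hsv_frame[..., 1].astype(int) * sat_bins // 256).clip(0, sat_bins - 1)
        likelihood = glove_hist[h, s]      # back-projection: histogram weight per pixel
        return likelihood > threshold      # boolean hand mask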

The data from five children was analyzed for user-dependent and user-independent models. The user-dependent models were validated with a 90/10 (training/testing) split, giving word accuracy in the low 90s and sentence accuracy around 70%. The standard deviation for sentence accuracy is very high, at approximately 12%.

User-independent models performed worse, with an average word accuracy of 86.6% and a sentence accuracy of 50.64%.


Discussion:

I like the authors' Wizard of Oz study for collecting real-world data from children. The system's performance (in essence, GT2k's performance) was very low on sentences, which suggests that segmentation is the largest issue with the toolkit. I'm also worried about the 90/10 split for the user-dependent models. That is a huge ratio of training to testing data, and it might be skewing the results toward higher-than-normal accuracy.

A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence

Summary:

Sagawa and Takeuchi created a Japanese Sign Language recognition system that uses "rule-based matching" and segments gestures based on hand velocity and direction changes.

Thresholds on changes in the hands' direction vectors and velocities drive the segmentation. The system must also determine which hand (or both) is performing a gesture, and this too is decided using the direction and velocity change thresholds.
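As I read it, the segmentation rule boils down to something like the sketch below (my interpretation, not the authors' exact algorithm, and the speed/angle thresholds are made-up numbers):

    import numpy as np

    def segment_by_motion(positions, speed_thresh=0.05, angle_thresh=np.radians(60)):
        """Mark candidate word boundaries where the hand slows down or changes direction.

        positions: T x 3 array of hand positions per frame (units are whatever the tracker gives).
        Returns frame indices that look like segment boundaries.
        """
        velocity = np.diff(positions, axis=0)
        speed = np.linalg.norm(velocity, axis=1)
        boundaries = []
        for t in range(1, len(velocity)):
            if speed[t] < speed_thresh:               # hand nearly stops
                boundaries.append(t)
            elif speed[t - 1] > 1e-9:
                cos_a = np.dot(velocity[t - 1], velocity[t]) / (speed[t - 1] * speed[t])
                if np.arccos(np.clip(cos_a, -1.0, 1.0)) > angle_thresh:
                    boundaries.append(t)              # sharp change of direction
        return boundaries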

The system achieved 86.6% accuracy for signed words, and 58% accuracy for signed sentences.


Discussion:

There's not much to discuss with this paper. The "nugget" of research is with the use of direction and velocity changes to segment the gestures. I became more interested in this paper since I learned it was published a year before Sezgin's, but not by much.

Monday, February 25, 2008

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition

Summary:

Researchers from Georgia Tech have created a gesture toolkit called GT2k. The purpose behind GT2k is to allow researchers to focus on system development instead of recognition. The toolkit works in conjunction with the Hidden Markov Model Toolkit (HTK) to provide HMM tools to a developer. GT2k usage can be divided into four categories: preparation, training, validation, and recognition.

Preparation involves the developer setting up an initial gesture model, semantic gesture descriptions, and gesture examples. Each model is a separate HMM, and GT2k allows either automatic model generation for novices, or user-generated for experts. Grammars for the model are created in a rule-based fashion and allow for the definition of complex gestures based on simpler ones. Data collection is done with whatever sensing devices are needed.

Training the GT2k models can be done in two ways: cross-validation and leave-one-out. Cross-validation involves separating the data into 2/3 for training and 1/3 for testing. Leave-one-out involves training on the entire set minus one data element, and repeating this process for each element in the set. The results for cross-validation are computed in one batch, whereas the overall statistics for leave-one-out are aggregated from the performance on each held-out element.
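The two training schemes boil down to how the data gets split; a quick sketch of what I take them to mean:

    import numpy as np

    def cross_validation_split(n_samples, rng=np.random.default_rng(0)):
        """GT2k-style cross-validation as described: 2/3 training, 1/3 testing."""
        idx = rng.permutation(n_samples)
        cut = (2 * n_samples) // 3
        return idx[:cut], idx[cut:]

    def leave_one_out_splits(n_samples):
        """Yield (train_indices, test_index) pairs, one per example."""
        for i in range(n_samples):
            yield [j for j in range(n_samples) if j != i], i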

Validation checks to see that the training provided a model that is "accurate enough" for recognition. The process uses substitution, insertion, and deletion errors to calculate this accuracy.
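I assume this is the usual HTK-style measure, where accuracy comes from the total word count N and the three error counts:

    def word_accuracy(n_words, substitutions, deletions, insertions):
        """HTK-style accuracy: Acc = (N - S - D - I) / N."""
        return (n_words - substitutions - deletions - insertions) / n_words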

Recognition occurs once valid data is received by a trained model. GT2k abstracts this process away from the user of the system and calculates the likelihood of each model using the Viterbi algorithm.

The remainder of the paper listed possible applications for GT2k including: a gesture panel for controlling a car stereo, a blink recognition system, a mobile sign language system, and a "smart" workshop that understands what actions a user is performing.


Discussion:

GT2k seems like a good system that can help beginning researchers more easily add HMMs into their gesture systems without worrying about implementation issues. Yet, the applications mentioned for GT2k are rather weak in both their concept and their results. HMMs are really only "needed" for one of the applications (sign language), whereas the other applications could be handled more easily with simpler techniques or by moving the sensors off the hand entirely.

This was a decent paper in writing style, presentation, and (possibly) contribution, but I'm curious to know what researchers have used GT2k and the systems they have created with it.

As a side note, I also am unclear as to why leave-one-out training is good, since with a large data set training the system could take a hell of a long time.

Computer Vision-Based Gesture Recognition For An Augmented Reality Interface

Summary:

Storring et al. from Aalborg University built an augmented reality system intended to provide a "less obtrusive and more intuitive" interface.

The gestures used in the system are mapped to the hand signs for 0-6 (i.e., a fist, index finger only, index and middle fingers, and so on). This gesture set can be recognized in a 2D plane with a camera. In order for these gestures to work, the hand needs to be segmented from the image. The authors use normalized RGB values, called chromaticities, to minimize the variance of the color intensity. The distributions of the background and skin chromaticities are found and modeled as 2D Gaussians. The hands are assumed to occupy between a minimum and a maximum number of pixels.
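A rough sketch of the chromaticity idea (the Mahalanobis threshold and the exact decision rule are my own placeholders, not the authors'):

    import numpy as np

    def chromaticity(rgb):
        """Normalized (r, g) chromaticities; the blue component is redundant and intensity is discarded."""
        rgb = rgb.astype(float)
        return (rgb / (rgb.sum(axis=-1, keepdims=True) + 1e-9))[..., :2]

    def skin_mask(frame_rgb, skin_mean, skin_cov, mahalanobis_thresh=3.0):
        """Label pixels as skin if their chromaticity is close to a 2D Gaussian skin model."""
        rg = chromaticity(frame_rgb).reshape(-1, 2)
        diff = rg - skin_mean
        d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(skin_cov), diff)  # squared Mahalanobis
        return (d2 < mahalanobis_thresh ** 2).reshape(frame_rgb.shape[:2])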

Gestures are found by counting the number of fingers. A polar transformation counts the number of spikes (fingers) currently extended on the hand. Click gestures can be found by checking the bounding box width of the hand, which changes between the regular index-finger gesture and a "thumb click" addition.
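My rough take on the polar counting step, with made-up parameters (the paper's transform is more careful than this):

    import numpy as np

    def count_fingers(mask, radius_frac=0.7, n_angles=360):
        """Count fingers by sampling the hand mask on a circle around its centroid;
        each contiguous run of 'on' samples is a spike (a finger, or the wrist)."""
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            return 0
        cy, cx = ys.mean(), xs.mean()
        r = radius_frac * np.hypot(ys - cy, xs - cx).max()
        angles = np.linspace(0, 2 * np.pi, n_angles, endpoint=False)
        ry = (cy + r * np.sin(angles)).round().astype(int).clip(0, mask.shape[0] - 1)
        rx = (cx + r * np.cos(angles)).round().astype(int).clip(0, mask.shape[1] - 1)
        samples = mask[ry, rx]
        spikes = int(np.sum(samples & ~np.roll(samples, 1)))   # rising edges around the circle
        return max(spikes - 1, 0)   # crude correction: one spike is usually the wrist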


Discussion:

For a system that is supposed to be less obtrusive and more intuitive than current interfaces, augmented reality with unintuitive gestures does not seem like a good solution. Using "finger numbers" is a poor choice, and having a gigantic head-mounted display with cameras is probably less comfortable than looking at a computer screen. Furthermore, if the authors are focusing on using head equipment, why not just use gloves to increase the gesture possibilities?

Thursday, February 21, 2008

3D Visual Detection of Correct NGT Sign Production

Summary:

Lichtenauer et al. created an interactive Dutch sign language (NGT) system that helps train children to use the correct gesture. Their system has various requirements, including working under mixed lighting, user independence, immediate response, adaptation to skill level, and invariance across valid signs.

The authors' system uses two cameras to digitally track a person's head and hands, and a touch screen is placed in front of the user for software interactivity. The person's skin color is first determined by finding the face, which is done by having a system operator click one pixel inside the face and one pixel just outside the head. These pixels then provide a way to train the system's skin color model, which is a Gaussian perpendicular in RGB space. The face and hands are separated into Left and Right RGB distributions; the authors feel that a light source will typically come from one direction, such as an open window. Hands are detected through their number of skin pixels, and the motion of a hand starts the tracking.

The system uses fifty 2D and 3D properties (features) related to hand location and movement. These properties are assumed to be independent, and base classifiers for each feature are computed and summed together to get a total classification value. These base classifiers use Dynamic Time Warping (DTW) to find the correspondence between two feature signals over time. The classifiers are trained with the "best" 50% of the training set for each feature. A sign is classified as correct if the average classifier probability for a class is above a threshold.
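For reference, the DTW distance at the heart of each base classifier looks roughly like this (the score combination at the end is my own stand-in, not the authors' exact sum rule):

    import numpy as np

    def dtw_distance(a, b):
        """Classic dynamic time warping between two 1D feature signals."""
        n, m = len(a), len(b)
        cost = np.full((n + 1, m + 1), np.inf)
        cost[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = abs(a[i - 1] - b[j - 1])
                cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
        return cost[n, m]

    def sign_is_correct(feature_signals, templates, threshold):
        """Average per-feature evidence (here just exp(-DTW)) and accept if above a threshold."""
        scores = [np.exp(-dtw_distance(feature_signals[f], templates[f])) for f in templates]
        return np.mean(scores) > threshold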

The results from the authors mention that they achieve "95% true positives" of the data.


Discussion:

In class, we have already discussed the issue of having a 95% true positive rate, since the system is set up so that each symbol is known and the user is supposed to gesture the correct sign. Always returning true would produce 100% accuracy.

I think the larger issue is that the classifier itself needs to be tested independently of the system. Theoretically, a separate classifier can be fine-tuned for each gesture so that it correctly recognizes that single gesture 100% of the time. The issues involved with using a generic classifier would then be avoided.

Wednesday, February 20, 2008

Television Control by Hand Gesture

Summary:

Freeman and Weissman devised a way to control a TV with hand gestures using computer vision. In their system, the user's hand acts as a mouse. The user moves their open hand in front of the camera, palm facing toward the television, and the computer detects their hand and maps it to an on-screen mouse. When the user holds their hand over a control for a brief time period, the control is executed. Closing their hand or moving it out of the computer's vision deactivates the mouse.

Hand movement is detected by comparing the orientation of pixel vectors between an image frame and an offset version of it. The dx and dy components of the image gradient are calculated, and the resulting orientation is relatively insensitive to different lighting scenarios.
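The orientation computation itself is simple; something along these lines (a sketch of the general idea, not the authors' template-matching code):

    import numpy as np

    def gradient_orientation(gray):
        """Per-pixel gradient orientation. Only the direction of (dx, dy) is kept,
        which is why the measure is fairly insensitive to overall lighting changes."""
        dy, dx = np.gradient(gray.astype(float))
        return np.arctan2(dy, dx)

    def orientation_difference(frame_a, frame_b):
        """Mean absolute orientation difference between two frames, wrapped to [0, pi]."""
        diff = np.abs(gradient_orientation(frame_a) - gradient_orientation(frame_b))
        return np.minimum(diff, 2 * np.pi - diff).mean()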

Discussion:

This paper was quaint. The actual algorithms used were rather simple, but the concept of controlling a TV via hand waving intrigued me. My main concern is that this application would train people watching a TV to not make any sudden movements so that the on-screen menu would not appear. Also, it forces people to walk through a living room slowly so that the TV does not catch their hand in any rapid movements. Some better gestures would benefit this system, such as twisting motions for channel or volume control.

A Survey of Hand Posture and Gesture Recognition Techniques and Technology

Summary:

This paper by LaViola presents a summary of key gesture recognition techniques. Hand posture and gesture recognition is divided into several categories: feature extraction, statistics, models, and learning approaches. Some approaches, such as template matching, are more suited to postures, whereas HMMs are used solely for gestures. Feature extraction is used for both, but the feature set can be computationally heavy in large-dimensional spaces.

Possible applications for gestures and postures include sign language, presentation assistance, 3D modeling, and virtual environments.


Discussion:

This paper is a good summary of current techniques and their strengths and weaknesses. There's not much to summarize here, since summarizing an 80-page survey is rather dull and pointless, but I will be referring back to this paper for any future work.

Monday, February 18, 2008

Real-time Locomotion Control by Sensing Gloves

Summary:

Komura and Lam propose using P5 gloves to control character motion. The authors feel that "walking fingers" can provide a more tangible interface for controlling motion than traditional joystick or keyboard techniques.

The authors use a P5 glove for gesture capture, and the user first calibrates the fingers by moving them in time with a walking animation displayed on a computer screen. Calibration is done with a simple function that compares the cycle of the user's fingers against the cycle of the animation.

After calibration, the user's fingers should be in-sync with the walking motions. For animating quadrupeds, there might need to be a phase shift between the back and front legs.

To test their system, the authors used a CyberGlove and had users play mock games with characters jumping and navigating a maze. Their results showed that navigating with the glove is potentially easier in terms of the number of collisions in a maze, and the glove and keyboard controls allow maze navigation in approximately the same time.


Discussion:

There's not much to say about this paper. The results that they gave were odd, since User 2 completed the maze with a keyboard in 18 seconds but had 22 collisions, and with the glove in 31 seconds with 3 collisions. I'm not sure what to make of that data...

Other than that, the research aspect of this paper basically took the sinusoidal cycle of a finger and mapped it to the animation's cycle. It might make navigating in certain games easier, but only if you need to control the speed of the character with better precision.

Wednesday, February 13, 2008

Shape Your Imagination: Iconic Gestural-Based Interaction

Summary:

Marsh and Watt performed a user study to determine how people represent different types of objects using only hand gestures. Gestures can be either substitutive (where the gestures act as if the object is being interacted with) or virtual (which describe the object in a virtual world).

The authors had 12 subjects of varying academic degree and major make gestures for the primitives circle, triangle, square, cube, cylinder, sphere, and pyramid. The users also gestured the complex and compound shapes football, chair, French baguette, table, vase, car, house, and table-lamp. The users were told to describe the shapes using only non-verbal hand gestures.

Overall, users used virtual hand depictions (75%) over substitutive (17.9%), with some objects having both gestures (7.1%). 3D shapes were always expressed with two hands, whereas primitives had some one-handed gestures (27.8%), like circle. Some objects were too hard for certain users to gesture, such as chair (4) and French baguette (1).


Discussion:

The user study was interesting in some respects, such as seeing how the majority of people describe objects by their virtual shapes, but overall I was disappointed by the paper. Images showing the various stages of depiction would have really helped, as well as actual answers from the questionnaire.

I was confused as to whether the authors were looking for only hand gestures or allowed full body movement, since they told the users they wanted hand gestures but did not seem to care that many users walked around the room. That's a pretty large detail that they glossed over.

A Dynamic Gesture Recognition System for the Korean Sign Language (KSL)

Summary:

Kim, Jang, and Bien use fuzzy min-max neural networks to recognize a small set of 25 basic Korean Sign Language gestures. The authors use two data gloves, each with 10 flex sensors, 3 location sensors (x, y, z), and 3 orientation sensors (pitch, yaw, roll).

Kim et al. find that the 25 gestures they use contain 10 different direction types, which are shown in a figure in the paper.

The authors also discovered that the data often has deviations within 4 inches of other data, so the x and y coordinates are split into 8 separate regions from -16 to 16 inches, with 4-inch ticks. The change in x, y direction (CD) is recorded for each time step simply as + and - symbols, and this data is recorded for four steps. CD change templates are then made for the 10 directions, D1 ... D10.
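A toy version of the CD encoding as I understand it (the quantization details here are my guesses):

    import numpy as np

    def cd_sequence(xy, tick=4.0, steps=4):
        """Encode the change of direction (CD) as '+', '-', or '0' per axis over a few steps.

        xy: (steps+1) x 2 array of hand positions in inches; tick is the 4-inch region size.
        """
        regions = np.floor(np.asarray(xy, dtype=float) / tick)
        delta = np.diff(regions[:steps + 1], axis=0)
        symbols = np.where(delta > 0, '+', np.where(delta < 0, '-', '0'))
        return [''.join(row) for row in symbols]        # e.g. ['+0', '+0', '0-', '0-']

    def match_direction(cd, templates):
        """Return the direction label (D1..D10) whose CD template matches exactly, if any."""
        for label, template in templates.items():
            if cd == template:
                return label
        return None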

The 25 gestures contain 14 different hand gestures based on finger flex position. This flex value is sent to a fuzzy min-max neural network (FMMN) that separates the flex angles within a 10-dimensional "hyper box".
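For the FMMN step, a simplified hyperbox membership in the spirit of Simpson's fuzzy min-max networks (not the exact formulation in the paper) would be:

    import numpy as np

    def hyperbox_membership(flex, box_min, box_max, gamma=4.0):
        """Membership is 1 inside the box [box_min, box_max], and decays linearly
        (at rate gamma) with distance outside it, averaged over the flex-angle dimensions."""
        below = np.maximum(0.0, box_min - flex)
        above = np.maximum(0.0, flex - box_max)
        return float(np.clip(1.0 - gamma * (below + above), 0.0, 1.0).mean())

    # Usage: pick the hand shape whose hyperbox gives the highest membership.
    # best_shape = max(boxes, key=lambda k: hyperbox_membership(flex, *boxes[k]))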

To classify a full gesture, the change of direction is first compared against the templates, and then the flex angles are run through the FMMN. If the total (accuracy/probability) value is above a threshold, the gesture is classified.

The authors achieve approximately 85% accuracy.


Discussion:

Although this paper had some odd sections and interesting choices, such as making the time step 1/15th of a second and having gestures over 4/15ths of a second, the overall idea is quaint. I appreciate that the algorithm separates the data into two categories--direction change and flex angle--and separates the two components to hierarchically choose gestures.

I still do not like the use of neural networks, but if they work I am willing to forgive. My annoyance is also alleviated by the fact that the authors provide thresholds and numerical values for some equations within the network.

I'm very curious why they chose those 10 directions (from the figure). D1 and D8 could be confused if the user is sloppy, and D4 and D7 can be confused with their unidirectional counterparts if the user does their gestures slower than 1/4 of a second. Which is, of course, absurd.

Monday, February 11, 2008

A Survey of POMDP Applications

Summary:

Cassandra's survey summarizes some uses for partially observable Markov decision processes (POMDPs). MDPs are useful in artificial intelligence and planning applications. The overall structure of these problems involves states and transitions between the states, with costs associated with the transitions and states. The goal of the agent is to find an optimal solution (policy) to the problem in the fewest transitions.

The POMDP model consists of:
  • States
  • Actions
  • Observations
  • A state transition function
  • An observation function
  • An immediate reward function
Cassandra's paper focuses on examples of using POMDPs, but he describes them in more detail here: http://www.pomdp.org/pomdp/index.shtml. Basically, they are MDP problems in which you cannot observe the entire state.
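In code, the model is just that tuple plus a belief update; a toy container (not a solver, and not Cassandra's notation):

    from dataclasses import dataclass
    from typing import Callable, Sequence

    @dataclass
    class POMDP:
        states: Sequence[str]
        actions: Sequence[str]
        observations: Sequence[str]
        transition: Callable[[str, str, str], float]   # P(s' | s, a)
        observe: Callable[[str, str, str], float]      # P(o | a, s')
        reward: Callable[[str, str], float]            # R(s, a)

    def belief_update(model, belief, action, observation):
        """Bayes-filter update of the belief (a dict state -> probability) after
        taking an action and receiving an observation."""
        new_belief = {}
        for s2 in model.states:
            predicted = sum(belief[s] * model.transition(s, action, s2) for s in model.states)
            new_belief[s2] = model.observe(observation, action, s2) * predicted
        total = sum(new_belief.values()) or 1.0
        return {s: p / total for s, p in new_belief.items()}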

Some example applications include:
  • Machine maintenance - parts of the machine are modeled as states, and the goal is to minimize the repair costs or maximize the up-time on the machine.
  • Autonomous robots - robots need to navigate or accomplish a goal with a set of actions, and the world is not always observable
  • Machine vision - deciding where to direct the higher-resolution region (i.e., the fovea) of the computer image, such as on the hands and heads of people.
POMDPs have a number of limitations. One limitation is that the states need to be discrete. Although continuous states can be discretized, some domains can have trouble with this step. The main issue with POMDPs is their computational limits: they become intractable rather quickly since exact solutions scale exponentially with the problem size.


Discussion:

This paper had little to nothing to do with what we've been currently discussing in class. Although POMDPs are interesting from a theoretical standpoint, their intractability is a huge factor for avoiding them in any practical domain. I've been trying to think of how to even apply them to gesture recognition, and one idea I came up with included modeling hand positions as states for a single gesture, but then it just becomes an HMM with a reward function, and I'm not sure how beneficial a reward function is when taking the computation costs into account.

Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs*

Summary:

Song and Kim's paper proposes a way to use a sliding window for HMM gesture recognition. A window of size 3 slides across the observation sequence O, and the probability estimate for a gesture is the average of the partial-observation probabilities at each timestep in the window. The algorithm also performs "forward spotting", which involves the difference between the maximum probability for a gesture found and the probability of a "non-gesture" at the same timestep. The non-gesture is a wait class that consists of an intermediate, junk state. As long as the "best" gesture probability is greater than the non-gesture probability by some threshold, the gesture is classified accordingly.
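My reading of the spotting rule, sketched with placeholder inputs (the log-likelihoods would come from running each HMM's forward pass over the current window, and the margin threshold is arbitrary):

    import numpy as np

    def spot_gestures(window_loglikes, nongesture_loglikes, margin=0.0):
        """At each step, accept the best gesture model only if it beats the
        non-gesture (wait/garbage) model by some margin.

        window_loglikes:     T x G array of per-window log-likelihoods for G gesture HMMs
        nongesture_loglikes: length-T array of log-likelihoods for the non-gesture model
        """
        labels = []
        for t in range(len(nongesture_loglikes)):
            best = int(np.argmax(window_loglikes[t]))
            if window_loglikes[t, best] - nongesture_loglikes[t] > margin:
                labels.append(best)
            else:
                labels.append(None)     # stay in the "wait" class
        return labels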

The authors also use accumulative HMMs, which basically take the power set of continuous segmentations within a window and find the combination that produces the highest probability for a gesture.

The set of gestures that the authors classify consists of 8 simple arm position gestures (e.g., arms out, left arm out, etc.). They report recognition rates between 91% and 95%, depending on their choice of thresholds.


Discussion:

The system might work fine, but I really cannot tell because their test set is so simple. The 8 gestures they present are easily separable, and template-matching algorithms could distinguish between them with ease. I also feel that their system becomes intractable as you add more gestures or gestures that vary widely in time length--adding more gestures adds overhead to the probability calculations, and varying the length would likely require a larger window, which would explode the power-set step.

Thursday, February 7, 2008

A similarity measure for motion stream segmentation and recognition

Li, C. and B. Prabhakaran (2005). A similarity measure for motion stream segmentation and recognition. Proceedings of the 6th International Workshop on Multimedia Data Mining: Mining Integrated Media and Complex Data. Chicago, Illinois, ACM.


Summary:

Li and Prabhakaran propose a way to "segment" streams of motion data using singular value decomposition (SVD). SVD is similar to principal component analysis (PCA): the technique finds the underlying geometric structure of a matrix (its singular vectors and values). By using the singular vectors and values of matrices storing motion data, the matrices can be compared for similarity by measuring the angular differences (dot products) between the corresponding vectors.

The authors store motion data in a matrix consisting of columns of features and rows of timesteps. The first 6 singular vectors are used when comparing matrix similarity; this value was determined empirically. The segmentation part of the paper involves splitting the stream of data after every l timesteps, and then comparing the similarity of the segmented matrix to the stored singular vectors and values of a known motion.
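A rough sketch of that comparison (the exact weighting and normalization in the paper may differ):

    import numpy as np

    def motion_similarity(motion_a, motion_b, k=6):
        """Compare two motion matrices (rows = timesteps, columns = features) by the
        alignment of their leading right singular vectors, weighted by singular values."""
        _, s_a, vt_a = np.linalg.svd(motion_a - motion_a.mean(axis=0), full_matrices=False)
        _, s_b, vt_b = np.linalg.svd(motion_b - motion_b.mean(axis=0), full_matrices=False)
        k = min(k, len(s_a), len(s_b))
        alignments = np.abs(np.sum(vt_a[:k] * vt_b[:k], axis=1))   # |dot| of matching vectors
        weights = s_a[:k] + s_b[:k]
        return float(np.sum(weights * alignments) / np.sum(weights))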

To test their system, the authors merged individual motions into a "stream" of data and inserted noise in between motions. The authors noted that the number of singular vectors needed to distinguish between matrices (originally k = 6) varied depending on the data collection method. The paper reported recognition rates in the mid 90s, but these results depend on how similar the motions are to one another.


Discussion:

Although the paper has little to do with segmentation, the actual algorithm for comparing motion data seems interesting and appears to achieve relatively accurate results. I would like to know the actual motions that users performed, since I have no idea what motions are required in Taiqi and Indian dances. They also did not mention the number of people involved in the data capturing, and I assume this number to be close to 1 since they needed a user to wear a motion suit.

Wednesday, February 6, 2008

Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation

Ip, H. H. S., K. C. K. Law, et al. (2005). Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation. Proceedings of the 11th International Multimedia Modelling Conference, 2005.

Summary:

Ip et al. created Cyber Composer, a music generation program controlled via hand gestures. The authors' motivation is to inspire both musicians and casual listeners to experience music in a new way.

The authors split music composition into three parts: melody, rhythm, and tone. The melody is the "main" part of the music and mainly includes the treble parts, such as the singer. The rhythm keeps the beat of the music and is played by the drums and bass. Tonal accompaniment involves creating harmony across all parts.

In order to keep the tone (harmony) of the music interesting and flowing, the authors create a small "chord affinity" matrix that describes certain chord lead/following strengths. During music composition, chords are automatically chosen with high affinity. Melody notes are also chosen automatically to create musical "tension".
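The chord-affinity idea is essentially a weighted transition table; a toy example with invented values (not the paper's actual matrix):

    import numpy as np

    CHORDS = ['I', 'IV', 'V']
    AFFINITY = np.array([[0.2, 0.4, 0.4],    # I  -> I, IV, V
                         [0.3, 0.1, 0.6],    # IV -> I, IV, V
                         [0.7, 0.2, 0.1]])   # V  -> I, IV, V

    def next_chord(current, rng=np.random.default_rng()):
        """Pick the next chord with probability proportional to its lead/follow affinity."""
        row = AFFINITY[CHORDS.index(current)]
        return CHORDS[rng.choice(len(CHORDS), p=row / row.sum())]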

The system was implemented using two 22-sensor CyberGloves and two Polhemus positioning receivers. MIDI was used to produce the musical notes.

The seven gestures used in the system include rhythm, pitch, pitch-shifting, dynamics, volume, dual-instrument mode, and cadence. Rhythm is controlled by the flexing of the right wrist. Pitch is controlled by right-hand height, and it is reset at the beginning of each bar. The user can also "shift" the pitch by performing a similar gesture. Note dynamics and volume are controlled by the right-hand finger flex, with fully flexed fingers forcing forte notes. Dual-instrument mode allows a harmony melody or unison melody to be played along with the main instrument; this mode is activated using the left hand. To end the piece, the left-hand fingers are closed.

There are no results.


Discussion:


This paper piqued my interest. Some of the gestures they defined were intuitive, such as opening and closing the fingers for volume and moving the hand up and down for pitch. Other gestures just seem awkward, such as the ambiguous dual-instrument mode and constantly flapping your wrist (ouch?) to drive the rhythm.

I'm familiar with building music composition programs (including "smart" programs that use musical theory to assist composition), and I think this program was trying to market itself as something that it could never become. A music tool has to be either robust to allow experts to use it, sacrifice some features to become simple for novices, or fun for just the casual listener. In the expert category I would place Finale, and on the casual end I would place music games such as Guitar Hero. Novice programs are harder to come by, and the tool I worked on was ImproVisor--a system that used intelligent databases to analyze input notes and determine if the notes "sounded good".

Cyber Composer is trying to do everything at once and failing. The lack of any results, even a casual comment from an offhand user, tells me that the system is either rather convoluted to use or poor for composition. The hand gestures cannot really control notes in a way that experts would use, novices will not understand the theory behind why their hand waving sounds good or bad, and casual musicians will probably have no idea what is going on.


Sunday, February 3, 2008

Hand Tension as a Gesture Segmentation Cue

Philip A. Harling and Alistair D. N. Edwards. Hand tension as a gesture segmentation cue. Progress in Gestural Interaction: Proceedings of Gesture Workshop '96, pages 75--87, Springer, Berlin et al., 1997


Summary:

Harling and Edwards describe a way to segment hand gestures based on hand tension. The basic idea is that as a user dynamically moves between static postures a and b, their hand will reach a "relaxed", low-tension minimum position c that is less tense than either a or b.

Smaller details:
  1. To find the tension for each finger, the authors use Hooke's Law and treat a finger as if it were a spring
  2. The total hand tension is the sum of the finger tensions (sketched below)
  3. They used a Mattel PowerGlove
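A minimal sketch of that computation (the spring constant and the use of an absolute flex offset are my assumptions):

    def hand_tension(flex, rest, k=1.0):
        """Hooke's law per finger: tension grows with how far each flex reading is
        from that finger's relaxed position; total tension is the sum."""
        return sum(k * abs(f - r) for f, r in zip(flex, rest))

    def tension_minima(tension_series):
        """Indices where total hand tension hits a local minimum (candidate segment points)."""
        return [t for t in range(1, len(tension_series) - 1)
                if tension_series[t] <= tension_series[t - 1]
                and tension_series[t] <= tension_series[t + 1]]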

Discussion:

The idea behind the paper is actually quite good for segmenting between static postures. I have a feeling that hand tension will not work well for moving gestures, since there would be spurious segmentation points within the gesture.

I'm disappointed at their lack of results. I can forgive other papers that were user studies, but I cannot forgive a paper that does not report easily obtainable results when they spent 8 pages discussing a topic that I summarized in one sentence. Segmentation is rather simple to gather data for, and a published paper should at least attempt to find an accuracy number.

On a technical note, I'm curious as to how hand tension is affected by the type of glove worn. I have a feeling that my "hand relaxed" position is going to be different for a P5 glove than it will be for a CyberGlove or even a CyberGlove with a Flock of Birds attached. All the extra weight will most likely force my hand into resting upon the equipment for support.

A Multi-Class Pattern Recognition System for Practical Finger Spelling Translation

Hernandez-Rebollar, J. L., R. W. Lindeman, et al. (2002). A multi-class pattern recognition system for practical finger spelling translation. Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on.

Summary:

Hernandez-Rebollar et al. have a two part paper: they present a glove (AcceleGlove) and they have a test platform for the glove that uses decision trees.

The AcceleGlove contains 5 accelerometers placed at the middle joint of each finger. Each accelerometer measures x and y angles, yielding a total of 10 sensor readings every 10 milliseconds. The raw data matrix of x and y values is transformed into separate Xg, Yg, and yi values. Xg (x global) measures the finger orientation, roll, and spread. Yg measures the overall finger bend of the hand. The third component classifies the hand into three values: closed, horizontal, and vertical. This third component is actually just the index finger's y-component (only in the ASL letters 'F' and 'D' is the index finger not accurate for this measurement).

To classify a posture/gesture, the decision tree first splits the letters into vertical, horizontal, and closed. The gestures are then classified further as rolled, flat, or pinky-up, and these branches finally distinguish between the actual letters.
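A toy version of that hierarchy, with placeholder thresholds and letters (the paper's actual splits are not spelled out here, so this is only illustrative):

    def classify_letter(index_y, xg, is_rolled, pinky_up):
        """First split on the index finger's y-component (vertical / horizontal / closed),
        then on coarse hand attributes. Thresholds and letter assignments are made up."""
        if index_y > 0.5:
            branch = 'vertical'
        elif index_y > -0.5:
            branch = 'horizontal'
        else:
            branch = 'closed'

        if branch == 'vertical':
            return 'I' if pinky_up else ('R' if is_rolled else 'U')
        if branch == 'horizontal':
            return 'G' if xg > 0 else 'H'
        return 'S' if is_rolled else 'A'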

They mention a 100% recognition rate for 21 of the gestures, with 78% being the accuracy of the worst gesture.


Discussion:

I like this paper for 2 main reasons:
  1. There are no HMMs
  2. They did not use a CyberGlove
The paper's results and decision-tree theory are a bit lacking, but I think the ideas behind the paper were good and refreshingly different from the other papers we've read.

I'm curious as to how well the glove they designed can work with gestures instead of postures. The glove polls each accelerometer sequentially, which could be a problem with very quick gestures. This issue is probably not too important, but it might provide slightly more error than a batch poll.

I'm also curious as to how they designed their decision tree. The intuition behind the partitioning is not made clear, except for the main partition of open/close/horizontal.