Wednesday, January 30, 2008

A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models

Qing, C., A. El-Sawah, et al. (2005). A dynamic gesture interface for virtual environments based on hidden Markov models. Haptic Audio Visual Environments and their Applications, 2005. IEEE International Workshop on.


Summary:

The authors of this paper used the HMM & CyberGlove dynamic duo in conjunction with standard deviations.

Qing et al. claim that using the standard deviation of finger positions allows them to fix the "gesture spotting" (segmentation/fragmentation) issue with a continuous data stream. The glove data is sampled at 10Hz, and then the standard deviations of each sensor are calculated. The standard deviations also help transform a series of vectors (observations) into a single vector. They then take this vector and perform vector quantization (VQ) on it to get a discrete value.
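
Here is a minimal sketch of how I read that pipeline: collapse a short window of glove samples into one per-sensor standard-deviation vector, then quantize it against a codebook. The 18-sensor count, window length, and codebook are placeholders of mine, not values from the paper.

```python
import numpy as np

def std_feature(window):
    """Collapse a window of glove samples (T x n_sensors) into one
    feature vector: the per-sensor standard deviation."""
    return np.std(window, axis=0)

def quantize(feature, codebook):
    """Vector quantization: map the feature vector to the index of the
    nearest codeword (Euclidean distance)."""
    dists = np.linalg.norm(codebook - feature, axis=1)
    return int(np.argmin(dists))

# Toy usage: 1 second of data at 10 Hz from a hypothetical 18-sensor glove,
# quantized against a hypothetical 32-entry codebook.
rng = np.random.default_rng(0)
window = rng.normal(size=(10, 18))      # 10 samples x 18 sensors
codebook = rng.normal(size=(32, 18))    # placeholder; would be trained offline
symbol = quantize(std_feature(window), codebook)
print(symbol)
```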

The three gestures they used to test their system controlled the rotation of a cube. The gestures were bending one finger, bending two fingers, and a twisting motion with the thumb.


Discussion:

Sigh, no results. I have no idea how the system actually solves the gesture spotting problem, because they are just trading the "is this observation the start of a gesture?" problem for a "does this standard deviation vector look like it might be the start of a gesture?" problem.

Also, with only three gestures, standard deviations might work for distinguishing between them. But if the hand is moving continuously, the standard deviation for every finger will be fluctuating wildly.

I now know more about the bone structure of a hand.

Online, Interactive Learning of Gestures for Human/Robot Interfaces

Lee, C. and Y. Xu (1996). Online, interactive learning of gestures for human/robot interfaces. Robotics and Automation, 1996. Proceedings., 1996 IEEE International Conference on.


Summary:

Lee and Xu created an HMM system that allows gestures to be updated online. If the system is confident about a gesture (i.e., its score falls on the correct side of a threshold), then the system performs the action associated with that gesture. Otherwise, the system asks the user to confirm the gesture. The HMM is then updated using the Baum-Welch algorithm (an EM algorithm for estimating an HMM's transition and observation probabilities from data).

Their system uses a CyberGlove to capture the hand gestures. The gestures are first captured from the glove, then resampled and smoothed before performing vector quantization. Gestures are segmented by having the user stop or remain still for a short time.

Gestures are evaluated with a log-scale score based on the probability of the observation sequence given the model. If the score is below a threshold, the gesture is considered correct; if it is above the threshold, it is considered suspect or incorrect.
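
Roughly, I picture the decision rule and online loop like this. The scoring, retraining, and ask-the-user functions are placeholders I made up; only the threshold logic (lower score = trusted, as described above) and the Baum-Welch update step come from the paper's description.

```python
def classify_or_confirm(score, threshold=-2.0):
    """Decision rule as I understand it: scores below the threshold are
    trusted, scores above it are 'suspect' and trigger a confirmation
    request. 'score' is a log-scale evaluation of the observation
    sequence under the best-matching gesture model."""
    return "accept" if score < threshold else "ask_user"

def interactive_step(observations, models, score_fn, retrain_fn, ask_user_fn):
    """One pass of the online loop: score every gesture model, either act on
    the best one or ask for confirmation, then update that model with the
    new example. score_fn, retrain_fn, and ask_user_fn stand in for the HMM
    scoring (forward algorithm), Baum-Welch re-estimation, and user prompt."""
    scores = {name: score_fn(m, observations) for name, m in models.items()}
    best = min(scores, key=scores.get)      # lower = more confident, per the convention above
    if classify_or_confirm(scores[best]) == "ask_user":
        best = ask_user_fn(observations)    # user supplies the true gesture label
    retrain_fn(models[best], observations)  # Baum-Welch update with the new example
    return best
```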

The domain for testing the system was 14 sign language letters that were distinct enough to be used with VQ.


Discussion:

I'm very confused by the graphs they give. They mention that if their "V" values corresponding to the correct/incorrect threshold are below -2, then the gesture is correct. Yet their graphs only show 2 examples ever even bordering on the -2 mark; all other values were way below -2. Does this mean that their system was always confident?

I also have an issue with telling the computer what the correct gesture is. I've done almost exactly the same thing in recent work, but hand-gesture systems are geared toward settings without a keyboard and monitor. For instance, to control a robot, I'd probably be looking at the robot and not a monitor. In the field I would not want to turn around, find my keyboard, punch in the correct gesture, and continue.

Monday, January 28, 2008

An Architecture for Gesture-Based Control of Mobile Robots

Iba, S., J. M. V. Weghe, et al. (1999). An architecture for gesture-based control of mobile robots. Intelligent Robots and Systems, 1999. IROS '99. Proceedings. 1999 IEEE/RSJ International Conference on.


Summary:


Iba et al. describe a gesture-based control scheme for robots. HMMs are used to define seven gestures: closed fist, open hand, wave left, wave right, pointing, opening, and "wait". These gestures correspond to actions that a robot can take, such as accelerating and turning.

The mobile robot that the system uses has IR sensors, sonar sensors, a camera, and a wireless transmitter. The gesture capturing is done with a CyberGlove with 18 sensors.

Gesture recognition is performed with an HMM-based recognizer. The recognizer first preprocesses the sensor data, reducing the 18-dimensional sensor readings to a 10-dimensional feature vector. The derivative of each feature is computed as well, producing a 20-dimensional vector. Each vector is then reduced to a "codeword", one of 32 possibilities. The codebook is trained offline, and at runtime the feature vectors are mapped to their nearest codewords.
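
A sketch of that preprocessing chain as I understand it. The actual 18-to-10 projection and the codebook are not something I can reproduce from the paper, so both are just placeholders here.

```python
import numpy as np

def preprocess(raw_prev, raw_curr, projection, codebook, dt=0.1):
    """Map two consecutive 18-dim glove readings to a codeword index:
    project to 10 dims, append the 10 finite-difference derivatives to get
    a 20-dim feature vector, then pick the nearest of the 32 codewords."""
    feat_prev = projection @ raw_prev            # 10-dim feature
    feat_curr = projection @ raw_curr
    deriv = (feat_curr - feat_prev) / dt         # 10 derivatives
    vector = np.concatenate([feat_curr, deriv])  # 20-dim vector
    return int(np.argmin(np.linalg.norm(codebook - vector, axis=1)))

# Toy usage with made-up projection and codebook (trained offline in the paper).
rng = np.random.default_rng(1)
projection = rng.normal(size=(10, 18))
codebook = rng.normal(size=(32, 20))
print(preprocess(rng.normal(size=18), rng.normal(size=18), projection, codebook))
```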

The HMM takes a sequence of codewords and determines which gesture the user is performing. It is important to note that if no suitable gesture is found, the recognizer can return "none". To sidestep some HMM problems, the "wait" state is the first node in the model and transitions to the other six gestures. If no gesture is currently being made, the wait state is the most probable. As more observations push toward another state, the probabilities shift and the gesture spotter picks the gesture with the highest score.
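
The spotting rule, as I read it, boils down to something like the following. This is a simplified stand-in for their single-HMM architecture: the margin and scores are made up, and the real system folds the wait state into one model rather than comparing separate scores.

```python
def spot_gesture(gesture_scores, wait_score, margin=0.0):
    """Return the best-scoring gesture only if it beats the wait state by
    some margin; otherwise report that no gesture was seen. Scores are
    assumed to be log-likelihoods from per-gesture paths, which is my own
    simplification of the paper's single-HMM setup."""
    best_name = max(gesture_scores, key=gesture_scores.get)
    if gesture_scores[best_name] - wait_score > margin:
        return best_name
    return "none"

# Toy usage with made-up log scores.
print(spot_gesture({"open": -12.0, "fist": -9.5, "point": -11.0}, wait_score=-10.0))
```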


Discussion:

I'd have liked to know the intuition behind using 32 codewords. The inclusion of the wait state is also odd in combination with the "opening" state, which does not seem to be mapped to anything. So technically the opening state is a wait+1 for either the closed or open state. I don't have much more to say on this one.

HoloSketch: A Virtual Reality Sketching / Animation Tool

Deering, M. F. (1995). HoloSketch: a virtual reality sketching/animation tool. ACM Transactions on Computer-Human Interaction.

Summary:


Deering's 3D VR system, HoloSketch, aimed to allow the creation of three-dimensional objects in a virtual reality environment. Users donned head-tracked VR goggles, viewed a supercool 960x680 20'' CRT monitor, and interacted with the virtual world via a six-axis wand. The head tracking allows the user to look around images hovering in front of them.

HoloSketch prides itself on displaying stable images that do not "float" or "swim" as the user moves their head. This is accomplished with a highly accurate absolute orientation tracker in the goggles; the use of a flat-screen CRT helps as well, along with software corrections for interocular distance.

A good chunk of the paper focused on user interactions, such as menu navigation. Deering's system uses a 3D pie (radial) menu that can be activated by holding down the right click on the wand. The user can then navigate the menu while holding the button and "poke" the menu to activate submenus and items. To create and draw objects, the user first selects a primitive from the menu and then places the primitive by hitting a button on the wand. The user can then rotate, size, and position the object using a combination of wand-waving and keyboard buttons.
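
For the pie menu, I imagine the selection logic reduces to mapping the wand's offset from the menu center onto one of N equal sectors. Here is a toy version of that mapping; the item names and geometry are entirely my own, not HoloSketch's.

```python
import math

def pie_menu_item(dx, dy, items):
    """Map a 2D offset of the wand tip from the menu center to one of the
    radial menu items, by dividing the circle into equal sectors."""
    angle = math.atan2(dy, dx) % (2 * math.pi)
    sector = int(angle / (2 * math.pi / len(items)))
    return items[sector]

# Toy usage with hypothetical menu items.
print(pie_menu_item(0.2, 0.7, ["draw", "edit", "group", "animate", "color", "file"]))
```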

Users can also create animations with the system. Some animations require still shots of slightly altered objects that can be grouped temporally (like a VR flipbook). Other animations can be added to objects or groups, such as a rotor property or blinking colors.

An artist tested the system for a month and provided feedback. Overall the artist found the tool easy to work with after a few days, although some of the features available in other applications were missing from HoloSketch. One issue Deering noticed was that users barely moved their heads when trying to view an object; people are so used to keeping their heads still that examining an object from different angles was not intuitive.


Discussion:

HoloSketch seems like an interesting application and provides a variety of ideas, some of which I believe are beneficial, while others are not. The "poking" of menus seems intuitive, and if the system has high absolute accuracy this should work well. Yet Deering mentioned that users' arms can get tired and unsteady, and supporting the arm and wrist is out of the question when you are trying to make the environment natural. Instead, HoloSketch had a button that, when activated, somehow reduced the jitter, which seems like a hack: a quick fix for a potentially serious issue with using the system.

I also understand why people would not want to constantly move their head around the display. If the display were on a round table this would be a non-issue, but constantly shifting around in a chair and leaning in different directions is a strain on the user. Furthermore, the 20" CRT is not so large a screen that the user could "see" all around the object; I would have liked to know the actual viewing angle.

Overall, though, I liked the system and the paper itself was well-written and gave a good overview of the features.

Thursday, January 24, 2008

An Introduction to Hidden Markov Models

Rabiner, L. R. and B. H. Juang (1986). An introduction to hidden Markov models. IEEE ASSP Magazine 3(1).

Summary:

Rabiner and Juang's paper on Hidden Markov Models (HMMs) introduces the models, defines the three main problems associated with HMMs, and provides examples for utilizing HMMs.

HMMs are time-dependent models that consist of observations and hidden states. As an example, the authors discuss coin-flip models in which coins with different biases are the states, and transitions probabilistically determine which coin will be flipped next. One person continuously flips the coins and records the data, while another person only receives the sequence of outcomes, O = O1, O2, ..., OT; which coin produced each outcome is hidden from the observer.
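
To make the coin example concrete, here is a tiny simulation of a two-coin HMM. The transition probabilities and coin biases are numbers I made up; the point is just that the observer only ever sees the heads/tails sequence, never which coin was flipped.

```python
import numpy as np

rng = np.random.default_rng(2)

A = np.array([[0.7, 0.3],       # P(next coin | current coin)
              [0.4, 0.6]])
p_heads = np.array([0.5, 0.9])  # coin 0 is fair, coin 1 is biased toward heads
pi = np.array([0.5, 0.5])       # initial coin choice

def simulate(T):
    """Generate T observations; the coin (state) sequence stays hidden."""
    states, obs = [], []
    s = rng.choice(2, p=pi)
    for _ in range(T):
        states.append(s)
        obs.append("H" if rng.random() < p_heads[s] else "T")
        s = rng.choice(2, p=A[s])
    return states, obs

hidden_states, observations = simulate(10)
print("".join(observations))    # this is all the observer gets to see
```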

Rabiner and Juang define three main elements of HMMs as:

1) HMMs have a finite number of states, N
2) A "new" state is entered at time, t, depending on a given transition probability distribution.
3) Observable output is made after each transition, and this output depends on the current state.

The formal notation for an HMM is:

T = the time length of the observable sequences (i.e., how many observations seen)
N = the number of states
M = the number of observation symbols (if observations are discrete)
Q = the states {q1, q2, ... , qN}
V = the observations {v1, v2, ... , vM}
A = the state transition probability distribution {aij}, aij = P(qj at t + 1 | qi at t). The probability we are in qj at the next timestep given that we were in qi at the current one.
B = the observation symbol probability distribution in state j, {bj(k)}, bj(k) = P(vk at t | qj at t)
pi = the initial state distribution, pi(i) = P(qi at t = 1)
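
Since the blog can't render the notation nicely, here is the same definition written out as plain numpy arrays. The uniform numbers are only placeholders to show the shapes and constraints.

```python
import numpy as np

N, M = 3, 4                    # number of states, number of observation symbols
A = np.full((N, N), 1.0 / N)   # A[i, j] = P(q_j at t+1 | q_i at t)
B = np.full((N, M), 1.0 / M)   # B[j, k] = P(v_k at t | q_j at t)
pi = np.full(N, 1.0 / N)       # pi[i]   = P(q_i at t = 1)

# Rows of A and B, and pi itself, must each sum to one.
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1) and np.isclose(pi.sum(), 1)
```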

The three problems for HMMs are:

1) Evaluation: given an observation sequence O = O1, O2, ..., OT and a model, compute the probability that the model produced the observation sequence, P(O | model).
2) Decoding: given an observation sequence and a model, find the state sequence that best explains the observations.
3) Learning: adjust the model parameters (A, B, pi) to maximize P(O | model).

Solutions to these problems are presented in the paper, but mathematical symbols are difficult to represent in this blog, and many of the images used are illegible. Instead, I'll jump to the authors' discussion of uses and issues.

One issue with HMMs is underflow, since the values of the forward and backward variables, alpha_t(i) and beta_t(i), approach zero very quickly (they are products of probabilities between 0.0 and 1.0). Another issue is how to actually build HMMs, i.e., what are the states and transitions?
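
A standard fix for the underflow is to rescale the forward variables at each time step and accumulate the log of the scale factors. Here is a minimal sketch of that scaled forward pass, using the discrete A, B, pi notation from above; the model values are just the uniform placeholders again.

```python
import numpy as np

def scaled_forward_loglik(obs, A, B, pi):
    """Compute log P(O | model) with per-step scaling so the forward
    variables alpha_t(i) never underflow to zero."""
    alpha = pi * B[:, obs[0]]
    c = alpha.sum()
    alpha /= c
    loglik = np.log(c)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # sum over previous states, then emit
        c = alpha.sum()
        alpha /= c                      # rescale to avoid underflow
        loglik += np.log(c)             # accumulate log of the scale factor
    return loglik

# Toy usage with the placeholder model from the notation sketch above.
N, M = 3, 4
A = np.full((N, N), 1.0 / N)
B = np.full((N, M), 1.0 / M)
pi = np.full(N, 1.0 / N)
print(scaled_forward_loglik([0, 2, 1, 3, 0], A, B, pi))
```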

HMMs are good for modeling sequential information where the current state relies only on the previous (or previous 2) states. These models, such as for isolated word recognition, are easy to build and not too computationally intensive. People usually do not insert random sounds into the middle of a word, so the probability distributions for these models are easy to build.


Discussion:

Overall the HMM paper is a good overview of HMMs. I really don't have much to say about this paper, except that I wish I had page 14 and I wish that the figures were readable.

As far as HMMs in hand gestures go, I have always shied away from using HMMs because I feel that the power you get from them is offset by heavy constraints and a large overhead in implementation and computation time. The class could theoretically model some types of sign gestures with HMMs, but I guess we'll see what data the class gets, to see if any sort of probability distribution presents itself.


Wednesday, January 23, 2008

American Sign Language Finger Spelling Recognition System

Allen, J., Pierre, K., and Foulds, R. American Sign Language Finger Spelling Recognition System. (2003) IEEE.

Summary:

Allen et al. created an ASL recognition system using neural networks and an 18-sensor CyberGlove. The authors propose that a wearable glove recognition system can help translate ASL into English and assist deaf (and even blind) people by allowing them to converse with hearing people.

The authors used a character set of 24 letters, omitting 'J' and 'Z' because those two require arm motion; the remaining 24 characters use only hand positions. Data from the CyberGlove was collected and recognized in a Matlab program, and a second program written in LabVIEW would output the corresponding audio for a recognized character.

The recognition system for ASLFSR is a perceptron network whose input is an 18x24 matrix (18 sensor values for each of the 24 characters) and whose desired output is the 24x24 identity matrix (one entry per recognized symbol). The network was trained with Matlab's "adapt" function.
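
A toy stand-in for that classifier (not their Matlab code): a single-layer perceptron over 18 sensor values with 24 output classes. The training data here is random placeholder noise just so the sketch runs.

```python
import numpy as np
from sklearn.linear_model import Perceptron

rng = np.random.default_rng(3)
letters = list("ABCDEFGHIKLMNOPQRSTUVWXY")   # 24 letters, no J or Z

# Placeholder training data: 20 fake glove readings (18 sensors) per letter.
X = np.vstack([rng.normal(loc=i, scale=0.3, size=(20, 18)) for i in range(24)])
y = np.repeat(letters, 20)

clf = Perceptron(max_iter=1000).fit(X, y)    # one-vs-all single-layer perceptron
print(clf.predict(rng.normal(loc=5, scale=0.3, size=(1, 18))))   # should print ['F']
```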

The system worked well for a single user, with accuracy of up to 90%.


Discussion:

The authors claim that they can achieve a better level of accuracy by training the network on data from multiple subjects, but I completely disagree. That's like saying a hand-tailored suit fits alright, but the pin-stripe at the blue light special is better since it has been designed for the average Joe.

To improve their accuracy they should improve their model. Perceptrons are not that powerful, since they clobber values with a hard threshold, and using different neurons (Adalines?) might improve the results. Also, neural networks sometimes work better with more than two layers, and data from 18 non-distinct inputs would probably benefit from even a 3-layer NN. Multi-layer NNs are, however, notoriously tricky to design "well" (i.e., guess and check).
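
If one wanted to try the multi-layer suggestion, a quick swap to a small MLP on the same kind of placeholder data might look like this. The hidden-layer size and the toy data are my own guesses, not anything from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(3)
letters = list("ABCDEFGHIKLMNOPQRSTUVWXY")   # same 24-letter toy setup as above
X = np.vstack([rng.normal(loc=i, scale=0.3, size=(20, 18)) for i in range(24)])
y = np.repeat(letters, 20)

# One hidden layer of 32 units: the 3-layer (input / hidden / output) network suggested here.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000).fit(X, y)
print(mlp.score(X, y))   # training accuracy on the toy data
```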

Flexible Gesture Recognition for Immersive Virtual Environments

Deller, M., A. Ebert, et al. (2006). Flexible Gesture Recognition for Immersive Virtual Environments. Information Visualization, 2006. IV 2006. Tenth International Conference on.

Summary:

Deller et al. used hand gestures captured with a P5 glove to control various aspects of a desktop environment. The glove allows users to manipulate virtual objects in three dimensions.

The apparatus that the authors used is the P5 glove, which has 5 finger sensors and an infrared tracking system. The glove was used to create hand gestures, where a gesture is a hand posture held for approximately half a second. Gestures are stored as sensor-vector templates, and each new gesture is compared against the gesture library via a simple distance measurement.
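
A minimal sketch of that template matcher, assuming 5 finger-bend values and a few stored templates. The gesture names, numbers, and rejection threshold are all mine, not the authors'.

```python
import numpy as np

templates = {                      # hypothetical stored gesture templates (5 bend sensors)
    "point": np.array([0.1, 0.9, 0.9, 0.9, 0.9]),
    "fist":  np.array([0.9, 0.9, 0.9, 0.9, 0.9]),
    "open":  np.array([0.1, 0.1, 0.1, 0.1, 0.1]),
}

def classify(sample, templates, max_dist=0.5):
    """Return the closest template by Euclidean distance, or None if nothing
    is close enough (the threshold is a made-up value)."""
    name, dist = min(((n, np.linalg.norm(sample - t)) for n, t in templates.items()),
                     key=lambda pair: pair[1])
    return name if dist <= max_dist else None

print(classify(np.array([0.15, 0.85, 0.95, 0.9, 0.88]), templates))   # -> "point"
```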

The authors had users test the system.


Discussion:

Their approach to hand gestures is simple, for example the use of a plain distance measure for gesture classification. Using a more complex classifier might improve their accuracy, but with only 5 sensors the gestures may be simple and distinct enough that a simple solution is sufficient.

I hope that presenting some results, at least in user study form, is the norm for the remaining papers we read. I cannot really take anything from this paper since I'm not sure if anything works well. The methods are so simple that I can implement them quickly, but it would be nice to have a baseline to compare to.

Tuesday, January 22, 2008

Environmental Technology: Making the Real World Virtual

Krueger, M. W. (1993). "Environmental technology: making the real world virtual." Commun. ACM 36(7): 36-37.

Summary:

Krueger's short paper described applications possible with a sensor-filled environment. Krueger focused on having a human be the mechanism for interaction, i.e., a person's hand and body interact with non-wearable sensory equipment.

One application had a user interact with a 1000-sensor room to project images onto a screen. Depending on a user's position, the user would be projected into a maze or control musical notes.
Another application showed hand projections from two people miles away via a teleconference. The two people could interact in a shared space and discuss objects by pointing at them.

A "windshield" application allowed a user to "fly" across a graphical world by manipulating their hand positions. This application existed in Kreuger's VIDEOPLACE environment, which is basically a collection of these types of virtual world creations and interactions.


Discussion:

Krueger's paper mentions a great number of interesting applications but does not discuss any in detail. Since the applications are listed as references, I'll have to look them up sometime. From the paper it sounds like some of the applications are impressive, but they were also created in the 70s and 80s, so they may have been limited by the networking and graphics capabilities of the time. I'm also interested to see what he has done since this.