Monday, April 28, 2008

Invariant features for 3-D gesture recognition

Summary:

Campbell et al. use HMMs with a list of candidate features to find a good recognition rate for a set of T'ai Chi gestures performed by users in a swivel chair; a hand gesture's change in polar coordinates provided the highest recognition rate for the 18 gestures tested.
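
To make the winning feature concrete, here is a minimal sketch of what a "change in polar (spherical) coordinates" feature might look like, assuming hand positions arrive as (x, y, z) samples; the function name and details are my own, not the paper's.

```python
import numpy as np

def polar_velocity_features(positions):
    """Sketch: convert a hand trajectory (N x 3 array of x, y, z samples)
    into frame-to-frame changes in spherical coordinates (dr, dtheta, dphi),
    one plausible reading of the paper's best-performing feature."""
    x, y, z = positions[:, 0], positions[:, 1], positions[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)                               # radial distance
    theta = np.arctan2(y, x)                                      # azimuth
    phi = np.arccos(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))  # inclination
    feats = np.stack([r, theta, phi], axis=1)
    return np.diff(feats, axis=0)   # per-frame deltas, fed to the HMMs
```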


Discussion:

Performing T'ai Chi in a chair kind of defeats the purpose of T'ai Chi. That's like trying to study race car drivers by observing people who take the bus.

Wednesday, April 23, 2008

FreeDrawer - A Free-Form Sketching System on the Responsive Workbench

Summary:

Wesche et al. created a 3D sketching tool for building the curve skeleton of a model. A user can draw curves in a virtual space. A new curve can be drawn anywhere, but additional curves must be merged with the existing model. Curves can be altered on a local or global scale. Surfaces can be filled in at closed curve loops, and surfaces can also be smoothed.


Discussion:

This paper had some nice pictures, but very little material was actually presented. How does the computer know where the pen point is? How does the user interact with the pen? Range of motion? Gestures?

Interacting with human physiology

Summary:

The authors, Pavlidis et al., propose a system that monitors humans for stress levels and altered psychological states using high-end infrared cameras. The system could then be used for a variety of purposes, such as stress-aware user interfaces, illness detection, or lie detection.

The system tracks the user's face with a tandem tracker that follows several small, key sections of the face: the nose, forehead, and temporal regions. The tracker models each region by its center of mass and orientation. Blood flow in the face is tracked through a perfusion model and a directional model; the model involves a set of differential equations that measure the "volumetric metabolic heat" flow in the face. Other measurements tracked include pulse, heat transfer in specific regions, and breathing rate.
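
As a rough illustration of the per-region state the tracker keeps (center of mass plus orientation), here is a small sketch over a segmented facial-region mask; this is an assumed helper of mine, not the authors' tracker.

```python
import numpy as np

def region_state(mask):
    """Model a segmented facial region (2D boolean mask) by its center of
    mass and orientation (angle of the principal axis of the pixel cloud),
    mirroring the per-region state described above."""
    ys, xs = np.nonzero(mask)
    center = np.array([xs.mean(), ys.mean()])
    pts = np.stack([xs - center[0], ys - center[1]])   # centered 2 x N points
    cov = pts @ pts.T / len(xs)                        # 2 x 2 covariance
    _, evecs = np.linalg.eigh(cov)
    orientation = np.arctan2(evecs[1, -1], evecs[0, -1])  # major-axis angle
    return center, orientation
```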


Discussion:

The ideas behind this system are great, although talking with Pavlidis showed us that there are issues with the current system's usability. Sweat and minor body-temperature fluctuations can hurt the system's reliability, since the system itself is trying to measure small thermal changes. Unfortunately, one of these high-end cameras costs $60k, so we won't be seeing this any time soon.

3D Object Modeling Using Spatial and Pictographic Gestures

Summary:

Nishino et al. designed a 3D object modeling system that uses stereoscopic glasses, CyberGloves, and Polhemus trackers.

The system allows the creation of superellipsoids that can have smooth or squarish parameters. These primitive shapes can be bent, stretched, twisted, and merged with other shapes. Hand postures control these actions, such as grasping and pointing. Virtual hands are displayed on a 200-inch arched screen, along with the object, in stereoscopic mode. The virtual hands allow the user to easily see where they can touch and modify the 3D model.
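
For reference, the standard superellipsoid parameterization makes the "smooth or squarish" knob explicit: two exponents control the roundness of the cross sections. A small sketch (parameter names are mine, not the paper's):

```python
import numpy as np

def superellipsoid_point(eta, omega, a=(1.0, 1.0, 1.0), e1=0.3, e2=0.3):
    """Surface point of a superellipsoid at angles eta, omega.
    Exponents near 0 give squarish, blocky shapes; values near 1 give a
    smooth ellipsoid. a = (ax, ay, az) are the axis lengths."""
    def spow(base, exp):
        return np.sign(base) * (np.abs(base) ** exp)   # signed power
    x = a[0] * spow(np.cos(eta), e1) * spow(np.cos(omega), e2)
    y = a[1] * spow(np.cos(eta), e1) * spow(np.sin(omega), e2)
    z = a[2] * spow(np.sin(eta), e1)
    return x, y, z
```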

The authors tested the system by having users attempt to model two types of objects: symmetric and asymmetric. The symmetric object was a bottle, and the asymmetric object was a teapot.
Creating the objects took up to 120 minutes. The stored objects were much smaller than those produced by a competing program, Open Inventor.


Discussion:

For a paper in 1998, this was a pretty advanced system and seemed to offer some benefits over other systems. I would have liked to have seen feedback from the users, though, since I'm not sure how hard the system is to use.

Toward Natural Gesture/Speech HCI: A Case Study of Weather Narration

Summary:

Poddar et al. use HMMs and speech to recognize complementary keyword-gesture pairs in weather reporting. The multi-modal system would combine speech and gesture recognition to improve recognition accuracy.

HMMs work well for separating 20 isolated gestures from weather narrators, but continuous gestures drop the accuracy to 50-80% for sequences. Co-occurrence analysis studies the meaning behind keywords occurring with gestures, the frequency of these occurrences, and the temporal alignment of the gestures and keywords. A table they present shows that certain types of gestures (contour, area, and pointing) are heavily associated with certain keywords ("here", "direction", "location"). The accuracy of recognizing the gestures improves when both video (gesture) and audio (keywords) are used.
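
A toy version of that co-occurrence analysis, just to show the idea: count how often each (gesture type, keyword) pair appears in temporally aligned segments and turn the counts into conditional probabilities that could rescore ambiguous gesture hypotheses. This is my illustration, not the paper's exact method.

```python
from collections import Counter

def cooccurrence_table(aligned_pairs):
    """aligned_pairs: list of (gesture_type, keyword) tuples taken from
    temporally aligned video/audio segments, e.g. [("pointing", "here"), ...].
    Returns P(keyword | gesture type) estimates."""
    counts = Counter(aligned_pairs)
    totals = Counter(g for g, _ in aligned_pairs)
    return {(g, k): c / totals[g] for (g, k), c in counts.items()}
```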


Discussion:

This paper does not necessarily offer much in the way of solutions, but it was written ten years ago and did show some nice results indicating that combining speech and gesture can improve recognition. Since the authors did not use an actual speech recognition system, adding one (and its errors) would likely produce results that differ from the accuracies given.

Thursday, April 17, 2008

Discourse Topic and Gestural Form

Summary:

Eisenstein et al. applied an unsupervised learning technique and a Bayesian network model to study the correlation between gestures and presentation topics.

Their system looks at "interest points" within a video image, where each interest point is said to have come from a mixture model. Interest points from a similar model are clustered together to create a codebook. A hidden variable determines whether an observed gesture codeword comes from a topic-specific or a speaker-specific distribution. The authors use a Bayesian model to learn which distribution each gesture belongs to, based on Gaussians over feature vectors.
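
The codebook step is the easiest part to picture. The paper fits a mixture model; as a stand-in, here is a sketch that clusters interest-point descriptors with k-means so each point gets a discrete codeword (my simplification, not the authors' model).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors, n_codewords=50):
    """descriptors: N x D array of interest-point feature vectors.
    Clusters them into n_codewords groups; each point's cluster id acts as
    its gesture codeword in the downstream Bayesian model."""
    km = KMeans(n_clusters=n_codewords, n_init=10, random_state=0)
    codewords = km.fit_predict(descriptors)   # one codeword id per point
    return km, codewords
```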

The system was tested with fifteen users giving 33 presentations picked from five topics. The experiments show that with correct labels, the topic-specific gestures account for 12% of the gestures, whereas corrupting these labels drops the average to 3%.


Discussion:

This paper is a good start to a longer study on how to incorporate topic-specific gestures into recognition systems. Finding these gestures can help computers understand what topics might be presented, which speaker is presenting a topic, or whether a speaker is veering off-topic. The system could then be used for speech training, presentation classification, or assistance (Clippy).

Monday, April 14, 2008

Feature selection for grasp recognition from optical markers

Summary:

Chang et al. reduced the number of markers needed on a vision-based hand grasp system from 30 to 5 while retaining around a 90% recognition rate.

Six different grasps are used for classification: cylindrical, spherical, lumbrical, two-finger pinch, tripod, and lateral tripod. The posterior probability for a class y_k is modeled with a softmax function: exp(w_k · x) for the observation x and class weights w_k, divided by the sum of exp(w_j · x) over all classes.
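
A minimal numerical sketch of that softmax posterior (generic multinomial logistic regression, not the authors' code):

```python
import numpy as np

def softmax_posteriors(x, W):
    """Posterior p(y_k | x) = exp(w_k . x) / sum_j exp(w_j . x).
    x: feature vector of marker positions (length D);
    W: K x D array with one weight row per grasp class."""
    scores = W @ x
    scores -= scores.max()        # subtract max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()      # length-K vector of class probabilities
```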

The weight values are determined by maximum conditional likelihood estimation from the training set of observations and classes (X, Y); gradient descent is used to optimize the log likelihood with respect to the weights. Input features are found using a "sequential wrapper algorithm" that examines one feature at a time with respect to a target class.

Grasp data was collected from 38 objects being grasped while wearing the full set of 30 markers. An "optimal" small set of markers was then chosen by forward and backward selection.
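
Greedy forward selection is simple enough to sketch. The `score(subset)` helper below is an assumption of mine (it would train the softmax classifier on that marker subset and return cross-validated accuracy); backward selection works the same way but removes markers instead.

```python
def forward_select(candidate_markers, score, k=5):
    """Repeatedly add the marker that most improves the recognition score
    until k markers are chosen. candidate_markers: list of marker ids;
    score: assumed callable mapping a marker subset to an accuracy."""
    chosen = []
    while len(chosen) < k:
        best = max((m for m in candidate_markers if m not in chosen),
                   key=lambda m: score(chosen + [m]))
        chosen.append(best)
    return chosen
```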

The results indicate that the small set of 5 markers has a 92-97% "accuracy retention" rate.


Discussion:

Reducing the number of markers using forward and backward selection is nice, but simply having a few more markers increases the accuracy to the actual plateau point. From 10 markers on there is almost no change in accuracy, but between 5 and 10 markers the accuracy can jump 5%, or 1/20, which is a huge difference when taking user frustration into account.

Glove-TalkII--A Neural-Network Interface which Maps Gestures to Parallel Formant Speech Synthesizer Controls

Summary:

Fels and Hinton created Glove-TalkII, a system designed to synthesize speech using complicated glove and foot controls.

The artificial vocal tract (AVT) is controlled using a CyberGlove, a ContactGlove, a Polhemus sensor, and a foot pedal. The ContactGlove controls 9 stop consonants, such as CH, T, and NG. The foot pedal controls the volume of the speech, hand position corresponds to a vowel sound, and hand postures map to nonstop consonant phonemes.

The neural networks used include a vowel/consonant network that decides whether the sensors are reading a vowel or a consonant, plus separate vowel and consonant networks that distinguish between the phonemes.
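
The routing between those networks can be pictured with a short sketch; the three `*_net` callables are assumed stand-ins, and I show a hard vowel/consonant switch for simplicity even though the real system may blend the two outputs.

```python
def synthesize_controls(sensor_frame, vc_net, vowel_net, consonant_net):
    """Two-stage arrangement described above: a gating network decides
    vowel vs. consonant, then the matching specialist network maps the
    glove/Polhemus readings to formant-synthesizer control parameters."""
    p_vowel = vc_net(sensor_frame)        # probability in [0, 1]
    if p_vowel > 0.5:
        return vowel_net(sensor_frame)    # vowel formant controls
    return consonant_net(sensor_frame)    # consonant formant controls
```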

A single user had to undergo 100 hours of training to be able to use the system.


Discussion:

Impractical. I'm shocked that they had someone train with the system for 100 hours, and the fact that it takes a person that long to learn the system should indicate that this is a poor way to synthesize speech. The person's final voice is still only described as "intelligible and somewhat natural-sounding," which is not a good compliment.

Requiring a person to walk around with a one-handed keyboard and type their words is a better solution. The keyboard wouldn't even have a foot pedal.

Wednesday, April 9, 2008

RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas

Summary:

Kim et al. use dual-direction antennas to find the direction of arrival for RF signals transmitted from an RFID tag. The two spiral antennas are perpendicular to each other and their signal strengths are different depending on the angle to the RFID tag.
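
A toy version of that direction-finding idea: map the normalized difference between the two antennas' signal strengths to a bearing angle. The linear mapping and the ±45° range are my assumptions (assuming non-negative, linear-scale signal strengths), not the authors' calibration.

```python
import numpy as np

def estimate_bearing(rssi_left, rssi_right):
    """Estimate the tag's bearing from two crossed directional antennas:
    equal strengths mean the tag is straight ahead; an imbalance pushes the
    estimate toward the stronger antenna's side."""
    diff = (rssi_left - rssi_right) / max(rssi_left + rssi_right, 1e-9)
    return np.deg2rad(45.0) * diff   # assumed linear map to +/- 45 degrees
```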

Obstacles between the antennas and the tag increase the error in determining the direction, but the object can still be tracked. The experimental results show that it works pretty well.


Discussion:

It works pretty well for its domain. Probably less accurate for incredibly small movements (e.g., finger bends). Seems like every now and then it goes crazy off-track (Figure 8).

Friday, April 4, 2008

Gesture Recognition Using an Acceleration Sensor and Its Application to Musical Performance Control

Summary:

Sawada and Hashimoto use accelerometer data to extract gesture features and create a music tempo system.

The feature extraction is basic: projections onto certain planes, such as xy or yz, and the bounding box of the acceleration values. Changes in acceleration are measured using a fuzzy partition of radial angles.
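
A small sketch of features in that spirit, for an N x 3 array of (ax, ay, az) samples: per-plane projections, their bounding boxes, and a sector histogram over radial angles standing in for the fuzzy partition. The sector count and plane choices are my assumptions.

```python
import numpy as np

def acceleration_features(acc, n_sectors=8):
    """Build a simple feature vector from raw acceleration samples:
    bounding boxes of the xy and yz projections plus a normalized histogram
    of radial angles in the x-y plane."""
    xy, yz = acc[:, [0, 1]], acc[:, [1, 2]]
    bbox = lambda p: np.concatenate([p.min(axis=0), p.max(axis=0)])
    angles = np.arctan2(xy[:, 1], xy[:, 0])                   # radial angles
    hist, _ = np.histogram(angles, bins=n_sectors, range=(-np.pi, np.pi))
    return np.concatenate([bbox(xy), bbox(yz), hist / max(len(acc), 1)])
```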

The authors recognize (classify) gestures using squared error; the gesture recognition itself is trivial.

The music tempo program is where the paper gets more interesting, since the system has to predict when a beat has been hit in real time. Systems already existed where a marker is placed on a baton, but the visual processing in these systems usually has a delay of 0.1 s (with 1997 computational power). In the authors' system, gestures for up, down, and diagonal swings are used to indicate tempo, and other gestures can map to other elements of conducting.

A score is stored in the computer and the user conducts to it. The computer and human are often slightly off, and the two try to adapt to each other; a simple function for balancing the tempo is given.
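
I don't have the paper's exact balancing function in front of me, but the mutual-adjustment idea can be sketched as an exponential nudge of the playback tempo toward the tempo implied by the user's latest beat gestures; the smoothing weight is an assumption of mine.

```python
def balance_tempo(system_tempo, gesture_tempo, alpha=0.3):
    """Move the system's tempo (beats per minute) part of the way toward
    the tempo estimated from the user's recent beat gestures.
    alpha: assumed smoothing weight, not the authors' parameter."""
    return (1.0 - alpha) * system_tempo + alpha * gesture_tempo
```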


Discussion:

The system they use isn't a true conducting system, since it relies on predefined (and trained) gestures, but the ideas behind the tempo system are good, and the simple execution and equations are appreciated.

Wednesday, April 2, 2008

Activity Recognition using Visual Tracking and RFID

Summary:

Krahnstoever et al. use RFID tags in conjunction with computer vision tracking to interpret what is happening within a scene.

A person model tracks a human's movement through the head and hands. The head is a 3D Cartesian coordinate location, and each hand is described in spherical coordinates (r, phi, theta) with respect to the head. The motion models for the head, p(X_h^t | X_h^(t-1)), and the hands, p(X_q^t | X_h^(t-1)), had to be learned, as did the priors p(X_q | X_h). Both hands and head are segmented using skin color.
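
The head-relative hand representation is easy to make concrete. A small sketch, where the (phi = azimuth, theta = inclination) convention is my assumption rather than the paper's stated one:

```python
import numpy as np

def hand_relative_to_head(hand_xyz, head_xyz):
    """Convert a hand's Cartesian position into (r, phi, theta) spherical
    coordinates relative to the head, matching the person model above."""
    d = np.asarray(hand_xyz, float) - np.asarray(head_xyz, float)
    r = np.linalg.norm(d)                       # distance from head
    phi = np.arctan2(d[1], d[0])                # azimuth around the head
    theta = np.arccos(d[2] / r) if r > 0 else 0.0   # inclination from vertical
    return r, phi, theta
```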

Each pixel within a given image frame can belong to either the background or the foreground (body part). The likelihood for an image given the observations is taken to be the Improved Iterative Scaling (IIS) of the image section and bounding box of a body part, summed over the parts and sections. I have no idea how IIS works.

RFID tags provide movement and orientation information in 3D space. The amount of energy the RFID tag receives depends on its angle to the wave source: a perpendicular angle receives no energy, while a parallel angle receives the most. The tag then transmits its ID, orientation, and the received field strength.
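
That orientation dependence is often approximated with a cos^2 falloff; here is a toy model of it (a textbook-style approximation, not the paper's measurement model).

```python
import numpy as np

def relative_tag_power(angle_rad):
    """Relative power coupled into the tag as a function of the angle
    between the tag antenna and the reader's field polarization: parallel
    (angle 0) couples fully, perpendicular (pi/2) gets essentially nothing."""
    return np.cos(angle_rad) ** 2
```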

The authors use the RFID information along with the hand and head positions to interpret what is happening in a scene. Agents are somehow used to do this.


Discussion:

The RFID information looks like it helps recognize what is happening within a scene, but I would have liked to see a comparison between a pure vision system and the system with RFID. This could be a bit difficult, but it would strengthen the paper.

I also would have liked an actual description of the activity agent system.