Monday, November 26, 2007

SketchREAD: A Multi-Domain Sketch Recognition Engine

Summary:

Alvarado's SketchREAD is a sketch recognition system built on Bayesian networks. The engine can be tuned to run in multiple domains, and the Bayesian networks allow small recognition errors to be corrected.

SketchREAD uses a geometric sketching language, much like the one found in LADDER, to describe simple domain shapes. The context in which these shapes appear within a domain, such as how arrows are used to connect lineages together in family trees, sits at a higher level than simple geometric recognizers. Trying every possible combination of strokes to find the "best" fit for all the shapes is time consuming, so SketchREAD instead models this context with Bayesian networks.

Shapes themselves have hypotheses linking to primitives and constraints. For instance, the hypothesis for an Arrow would cause three Lines and the constraints between them. Higher-level context models can also be portrayed, such as a Mother-Son link causing a Mother, a Son, and a Line. Partial hypotheses can also be generated by incorporating "virtual" nodes, which are primitive hypotheses not linked to observations.

To generate hypotheses, SketchREAD has three steps:
  1. Bottom-up: Strokes that the user draws are recognized as primitives and low-level shapes
  2. Top-down: System attempts to find subshapes missing from possible interpretations. Strokes can be reinterpreted.
  3. Pruning: Unlikely interpretations are removed from considerations.
As an example, Alvarado proposes that an ellipse is drawn in a family tree domain. This ellipse is recognized as a low-level shape, and then an interpretation for the ellipse is created, as well as partial interpretations for Mother-Son, Mother-Daughter, etc. These partial hypotheses are templates, and the drawn shape is fit into a single slot of the template; later, the shape can be shuffled within the template. To keep the number of interpretations from exploding and becoming intractable, high-level hypotheses are generated in the Bayesian network only from complete templates. Also, any polyline is assumed to be part of only one shape/interpretation.
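The three-stage loop above can be sketched in code. This is purely my own illustration of the idea, not Alvarado's implementation: the template structures, function names, and the pruning threshold are all invented, and the `recognize` and `likelihood` callbacks stand in for the low-level recognizer and the Bayesian network.

```python
# Hypothetical sketch of SketchREAD's per-stroke hypothesis loop:
# bottom-up recognition, top-down template completion, then pruning.

def process_stroke(stroke, templates, hypotheses, recognize, likelihood,
                   prune_threshold=0.1):
    # 1. Bottom-up: interpret the new stroke as a primitive or low-level shape.
    primitive = recognize(stroke)
    hypotheses.append({"shape": primitive, "slots": [stroke]})

    # 2. Top-down: fit the primitive into an open slot of each partial template
    #    (e.g. the Ellipse slot of a Mother-Son template), noting which
    #    subshapes are still missing from the interpretation.
    for template in templates:
        if primitive in template["open_slots"]:
            partial = {"shape": template["name"],
                       "slots": [primitive],
                       "missing": [s for s in template["open_slots"]
                                   if s != primitive]}
            hypotheses.append(partial)

    # 3. Pruning: drop interpretations the Bayesian network rates as unlikely.
    hypotheses[:] = [h for h in hypotheses if likelihood(h) >= prune_threshold]
    return hypotheses
```

A real system would of course let a shape fill different slots over time (the "shuffling" described above) and only build full Bayesian network nodes for complete templates.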

In the domain of family trees, SketchREAD improves over baseline performance in symbol recognition, reducing recognition errors by over 50%. Circuit diagrams provide a harder domain, and here SketchREAD reduces the number of errors by 17% over the baseline. The time it takes to process each stroke also increases with the number of strokes already drawn.

Discussion:

Although SketchREAD improves accuracy in the tested domains, the final accuracy is not yet good enough for use in any complex domain's interface, which was one of the goals of the system; Alvarado acknowledges this in the paper's discussion. Also, restricting each polyline to a single interpretation greatly hurts circuit diagram domains, since many circuit symbols can be drawn with a single stroke.

Monday, November 19, 2007

Ink Features for Diagram Recognition

Summary:

Patel et al. use ink features to divide a sketch into text and shape components via a dividing tree. The authors defined forty-six features for use in the divider, all of which are listed in an appendix.

Sketch data was collected from 26 people, each drawing 9 sketches. Each stroke was then labeled as being part of text or part of a shape, and these sketches were analyzed to determine the features most relevant for distinguishing the two classes. The authors used a statistical partitioning technique available in the R statistical package to divide the data into two components in the feature space and determine the most relevant features. A classification tree was then built with the most relevant feature at the root, where each node defines a threshold that separates the space into text and shape components.
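To make the threshold-tree idea concrete, here is a toy version of recursive partitioning. This is my own sketch, not the paper's R/rpart code: the feature vectors, labels, and greedy misclassification criterion are stand-ins (rpart uses more sophisticated impurity measures and pruning).

```python
# Toy recursive partitioning: each node picks the feature and cutoff that
# best separates "text" strokes from "shape" strokes.

def best_split(samples):
    """samples: list of (feature_vector, label). Returns (feature, threshold, error)."""
    best = None
    n_feats = len(samples[0][0])
    for f in range(n_feats):
        values = sorted({x[f] for x, _ in samples})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lbl for x, lbl in samples if x[f] <= thr]
            right = [lbl for x, lbl in samples if x[f] > thr]
            # misclassifications if each side predicts its majority label
            err = (min(left.count("text"), left.count("shape"))
                   + min(right.count("text"), right.count("shape")))
            if best is None or err < best[2]:
                best = (f, thr, err)
    return best

def build_tree(samples, depth=3):
    labels = [lbl for _, lbl in samples]
    majority = max(set(labels), key=labels.count)
    if depth == 0 or len(set(labels)) == 1:
        return majority                      # leaf: predict majority label
    split = best_split(samples)
    if split is None:
        return majority
    f, thr, _ = split
    left = [s for s in samples if s[0][f] <= thr]
    right = [s for s in samples if s[0][f] > thr]
    if not left or not right:
        return majority
    return (f, thr, build_tree(left, depth - 1), build_tree(right, depth - 1))

def classify(tree, x):
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if x[f] <= thr else right
    return tree
```

The appeal of this family of classifiers, as the discussion below notes, is that once trained they classify a stroke with a handful of comparisons.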

The authors obtained great results with the simple classification tree technique. The new classifier greatly reduced the overall number of misclassified strokes compared to both Microsoft's classifier and InkKit's. Although in the test results the new classifier misclassifies shapes a bit more than either baseline system, its text misclassification is much lower.

Discussion:

The actual accuracy numbers are not impressive, since the system still misclassifies too many strokes to be considered "accurate", but the system's improvement is very impressive and pushes text/shape classification research in a new direction. The authors' analysis of the features (and inclusion of the features in an appendix!) was appreciated, and they discussed which features the Rpart partitioning system found most helpful. Since this model should run very quickly after training, it could easily be combined with other models to improve text/shape recognition even more.

Bayesian Networks Without Tears

Summary:

Charniak provides a decent overview of Bayes nets in this publication. Bayesian networks model causality and the probabilities associated with cause-and-effect relationships. Since Bayesian networks are a highly graphical and visual model, it would be rather convoluted to try and describe them in detail here. Instead, I will provide brief comments on topics brought up in the paper.

Bayesian networks are DAGs, or directed acyclic graphs. The reason for this is so that there are no infinite cause-and-effect loops between two or more nodes. The nodes themselves are states of the world, and arcs (edges) in the graph specify the causal connections between them. Root nodes in the graph must have prior probabilities specified, while every other node needs probabilities conditioned on its parents; these probabilities are defined by experts or derived from empirical data.

One of the main attractions of using Bayesian networks is the incredible savings in calculating joint probabilities for the graph. In a typical case with n binary variables, 2^n - 1 joint probabilities are needed to specify a complete distribution. Yet, in a Bayesian network, the model grows exponentially only with the number of causal connections entering each node. If a node has only 1 incoming connection, then the node needs 2 conditional probabilities; 2 connections require 4, 3 connections require 8, etc. This savings is attributed to the independence assumptions present in Bayesian networks. Two nodes are dependent on each other only if there is a d-connecting path between them, where a d-connecting path is defined as a path where either the two nodes are linear or diverging from a source, or, given the evidence E, the nodes are converging to a node in the evidence.
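The parameter savings can be made concrete with a quick count. This is my own illustration, not from Charniak's paper: it compares the 2^n - 1 parameters of a full joint distribution against the per-node conditional probability tables of a factorized network.

```python
# Parameter counts: full joint distribution vs. Bayesian network
# factorization over binary variables.

def full_joint_params(n):
    """A complete joint over n binary variables needs 2^n - 1 free parameters."""
    return 2 ** n - 1

def bayes_net_params(parent_counts):
    """Each binary node with k parents needs 2^k conditional probabilities
    (one P(node = True | parent assignment) per row of its table)."""
    return sum(2 ** k for k in parent_counts)

# A chain of 10 binary variables: one root plus nine nodes with one parent each.
n = 10
parents = [0] + [1] * 9
print(full_joint_params(n))       # 1023
print(bayes_net_params(parents))  # 1 + 9 * 2 = 19
```

The gap widens rapidly: at n = 20 the full joint needs over a million parameters, while the chain-structured network still needs only 39.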

Wednesday, November 14, 2007

Sketch Interpretation Using Multiscale Models of Temporal Patterns

Summary:

Certain sketch domains contain appropriate temporal information that can assist in symbol recognition. For instance, digital circuit diagrams can be highly time-dependent when restricted to certain symbols. Resistors are typically drawn in order, as are capacitors and batteries. Using HMMs to take advantage of this temporal information can improve sketch recognition accuracy.

Sezgin uses an HMM modeled with DBNs to maximize the likelihood of the observable features given the grouping's label. The DBN model takes observables as input, obtained through features computed on the grouping's primitives, and infers the probability of a stroke-level model given the observables. The observables are also modeled with a mixture of Gaussians, although I'm not sure what the mixture model is used for. When this DBN is combined into an HMM, the two other nodes added are an object hypothesis and an ending hypothesis. The object hypothesis predicts the object type (Resistor, Wire, etc.), whereas the ending hypothesis predicts when the user has finished drawing the symbol.

Inference in a DBN is linear in the sequence length, whereas inference in an HHMM (hierarchical HMM) is O(T^3). Therefore, Sezgin converts the model to a DBN before inference is conducted; this step was not explained. During training, the use of continuous variables could cause numerical underflow during belief propagation, so a specialized algorithm, Lauritzen-Jensen belief propagation, was used to avoid the instability issues.
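To see why a single sweep over the stroke sequence is linear in T, here is a standard HMM forward pass. This is not Sezgin's model (his uses continuous Gaussian-mixture observables and extra hypothesis nodes); the states and probabilities below are invented for illustration.

```python
# Standard HMM forward algorithm: one pass over T observations,
# constant work per step for a fixed state set, hence O(T) overall.

def forward(init, trans, emit, observations):
    """init[s]: P(s); trans[s][s2]: P(s2 | s); emit[s][o]: P(o | s).
    Returns the total probability of the observation sequence."""
    alpha = {s: init[s] * emit[s][observations[0]] for s in init}
    for obs in observations[1:]:
        alpha = {s2: sum(alpha[s1] * trans[s1][s2] for s1 in alpha)
                     * emit[s2][obs]
                 for s2 in init}
    return sum(alpha.values())
```

With hypothetical "wire" and "resistor" states emitting "straight" or "zigzag" stroke features, each new stroke only updates the running alpha table rather than re-scoring the whole history.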

Overall, the model worked well in the domain and improved the recognition (lowered the error rates) for all 8 participants involved in the test. Since the model relies on time, any interspersing (drawing two or more objects simultaneously) introduces errors. This causes primitives to be missed in sketches, with over 6% of the primitives missed on average due to this issue.

Discussion:

Relying on time data is tricky in sketch recognition, since temporal information can only be exploited in certain domains. Circuit diagram recognition is not necessarily one of these domains, as shown by the interspersing data. Increasing the model's order beyond first-order might account for some of these issues, but then the model could not run in real-time, which was a major selling point of the system.

Saturday, November 10, 2007

Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes

Summary:

Wobbrock et al.'s $1 recognizer is a system designed to be a simple recognizer that does not require a deep mathematical background. With this recognizer, novices to sketch recognition can use simple gestures in their interfaces.

The $1 recognizer has four steps: (1) resampling, (2) rotation, (3) scaling, and (4) classification. Points in a gesture are resampled into N equidistant points defined by the developer. The gesture is then rotated so that the line between the center of the gesture and the starting point is at the 0 degree position, i.e. the center-start point axis is at 3 o'clock. The gesture is then scaled to fit within a square of some size and translated so that the center of the gesture is at the (0,0) origin.

Finally, the gesture is classified by first calculating the average distance between points in the gesture and the points in all templates (known gestures). This is called the path-distance by the authors. A score is determined directly from this path distance and the scaled square's size.
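The four steps can be sketched compactly. This is my own reconstruction of the pipeline described above, with assumed defaults (N = 64 resampled points, a 250-unit square); the original paper's pseudocode differs in details such as its golden-section search over candidate rotations.

```python
import math

def centroid(pts):
    return (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))

def resample(pts, n=64):
    """Step 1: resample the stroke into n equidistant points."""
    path = sum(math.dist(a, b) for a, b in zip(pts, pts[1:]))
    interval = path / (n - 1)
    pts, new_pts, acc = list(pts), [pts[0]], 0.0
    i = 1
    while i < len(pts):
        d = math.dist(pts[i - 1], pts[i])
        if acc + d >= interval:
            t = (interval - acc) / d
            q = (pts[i - 1][0] + t * (pts[i][0] - pts[i - 1][0]),
                 pts[i - 1][1] + t * (pts[i][1] - pts[i - 1][1]))
            new_pts.append(q)
            pts.insert(i, q)   # the new point starts the next segment
            acc = 0.0
        else:
            acc += d
        i += 1
    while len(new_pts) < n:    # guard against float rounding at the end
        new_pts.append(pts[-1])
    return new_pts[:n]

def rotate_to_zero(pts):
    """Step 2: rotate so the centroid-to-first-point axis lies at 0 degrees."""
    cx, cy = centroid(pts)
    theta = math.atan2(pts[0][1] - cy, pts[0][0] - cx)
    cos, sin = math.cos(-theta), math.sin(-theta)
    return [((x - cx) * cos - (y - cy) * sin + cx,
             (x - cx) * sin + (y - cy) * cos + cy) for x, y in pts]

def scale_and_translate(pts, size=250.0):
    """Step 3: scale to a size x size square, then center on the origin."""
    xs, ys = [x for x, _ in pts], [y for _, y in pts]
    w, h = (max(xs) - min(xs)) or 1.0, (max(ys) - min(ys)) or 1.0
    pts = [(x * size / w, y * size / h) for x, y in pts]
    cx, cy = centroid(pts)
    return [(x - cx, y - cy) for x, y in pts]

def path_distance(a, b):
    """Step 4: average point-to-point distance to a preprocessed template."""
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)
```

Classification then picks the template with the smallest path distance, converting it to the score mentioned above by normalizing against the half-diagonal of the scaled square.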

Overall, the recognizer has good results with around 98% accuracy for simple gestures. The classification relies heavily on the number of templates in the system.

Discussion:

This recognizer really does not introduce anything new to sketch recognition, but it does wrap up some basic vision and sketch recognition techniques into a simple (and given) algorithm. The technique used by $1 is not much different than general template/bitmap comparison, except using the points is a bit nicer when working with single strokes. On the other hand, bitmap comparison allows for multiple strokes in any order. This technique also relies heavily on visual differences between gestures, and because scale and rotation are normalized away, it cannot support distinct gestures that differ only in size or orientation. For instance, a forward slash drawn with a mouse or pen is a common gesture for "page forward" in a web browser, while a backward slash indicates "page back". This recognizer cannot distinguish these cases.

Wednesday, November 7, 2007

Speech and Sketching: An Empirical Study of Multimodal Interaction

Summary:

In this paper, Adler and Davis explore multimodal speech and sketch interfaces through a user study. Their goal is to allow the computer to provide feedback to the user as the user talks and draws, and the computer will influence the design during this process by asking questions and clarifying information. Having the computer understand everything about the design is not the goal; instead, the computer should know enough to ask motivating questions when necessary in order to engage the user. The system also does not want to constrain the user's drawing or speech style.

The user study conducted involved 18 users in a Wizard-of-Oz setup. The users were asked to design a floor plan, a full adder, an AC/DC transformer, and a digital circuit. Sketches were drawn on Tablet PCs in software that allowed drawing and highlighting in 5 different colors. During the study, the experimenter sat at a table across from the user. The study was filmed, and the audio, video, and sketching components of the study were synchronized.

The study showed some interesting results concerning color, questions, and speech timing. Users tended to rely on multiple colors to indicate portions of the sketch. The color linked parts of the sketch together, referred back to previous parts, and reflected the real-world colors of objects. When speaking, users typically had phrase and word repetition when they were thinking aloud. This could allow the computer to discern key words from the user-computer dialogue. Responses from computer questions also caused the user to repeat the questions, and simple questions could prompt more information than what was asked. Some users even redesigned their drawings after simple questions were asked, such as inquiring if two objects were similar. Speech and sketching started simultaneously in the study. Yet, certain parts of the speech, such as an entire phrase, tended to start before the sketch, and certain key words said alone tended to be heard after a sketch was started.

Discussion:

The two best components of Adler's study show how computers can assist humans during design steps by relying on the human design and thought process, instead of having an actual understanding. In lieu of training the computer to understand all of the components of a design, basic understanding of object similarity and grouping should be enough to produce a motivating dialogue. Also, the fact that the user constantly repeats words provides the computer with an indication of important information without the need of a large vocabulary.

I wish the study also went into more interface issues, such as when the computer should ask a question (e.g. during sketching, during a pause, etc.). Also, it would have been beneficial to see the average pause time of a user and if the user was speaking or mumbling during the pause by going "hmm" or something similar. Do the pauses for sketch indicate that the user is speaking, and do pauses for speaking indicate the user is sketching? Do the pauses for both modes line up?

Monday, November 5, 2007

Three main concerns in sketch recognition and an approach to addressing them

Summary:

Mahoney and Fromherz discuss problems involved in matching models of hand-drawn sketches of stick figures. The figures are simple polylines, but they can appear in any configuration alongside other figures or distracting objects in the background.

The system input (the drawn figure) is highly variable, and the authors define several problems arising from this variability. Failures of co-termination involve strokes over- or under-shooting one another (i.e., the endpoints of two disjoint strokes that are supposed to be connected do not touch). Articulation problems are encountered when strokes are over- or under-segmented in preprocessing. Interaction with background context occurs when the figure to be matched must be recognized against a background of context strokes, other figures, or other noisy and distracting data.

In this paper, the matching process involves creating a graph of the model to find (figure) and searching for a mapping between the model and a data subgraph. Ambiguity is handled by adding alternative, plausible substructures to the graph. This happens through proximity linking, virtual junction splitting, spurious segment jumping, and continuity tracing. All four of these methods involve creating new links in the structure by searching for subtle connections, splitting current strokes, merging strokes together, and creating new stroke segments.

Subgraph matching is translated into a constraint satisfaction problem (CSP), where each stroke (node) in the graph is a variable, and the constraints are edges between the nodes. The final match tries to have the smallest link length, and the length of the segments should be in the appropriate ratios. Matching can also be influenced by a priori knowledge that defines certain components.
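The CSP formulation can be illustrated with a bare-bones backtracking matcher. This is my own toy version, not Mahoney and Fromherz's system: it only enforces injectivity and edge preservation, ignoring the link-length and ratio preferences described above.

```python
# Toy subgraph-matching CSP: model strokes are variables, data strokes are
# their domains, and model edges are constraints the assignment must preserve.

def match(model_nodes, model_edges, data_nodes, data_edges):
    """Find an injective mapping model node -> data node preserving edges."""
    def connected(x, y):
        return (x, y) in data_edges or (y, x) in data_edges

    def consistent(assignment, var, val):
        if val in assignment.values():       # mapping must be injective
            return False
        for a, b in model_edges:             # every model edge must map to
            if a == var and b in assignment and not connected(val, assignment[b]):
                return False                 # an existing data connection
            if b == var and a in assignment and not connected(assignment[a], val):
                return False
        return True

    def backtrack(assignment):
        if len(assignment) == len(model_nodes):
            return dict(assignment)
        var = next(v for v in model_nodes if v not in assignment)
        for val in data_nodes:
            if consistent(assignment, var, val):
                assignment[var] = val
                result = backtrack(assignment)
                if result:
                    return result
                del assignment[var]
        return None

    return backtrack({})
```

A full matcher would rank the feasible mappings by link length and segment-length ratios, and fold in any a priori knowledge, rather than returning the first consistent one.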

Discussion:

The use of a CSP seems to work very well in this case. The fact that the system can discern a stick figure in a sea of seemingly random lines is rather amazing, since I can barely see the figure myself.

One issue is that this system cannot seem to discern "fixed" figures, i.e., a square versus a parallelogram. For "loose" figures, the strokes must also be connected endpoint to endpoint. For a sketch with only a few strokes this can be simple, but if the user were trying to draw, say, a centipede, the system could find many good matches within a collection of strokes.

Constellation Models for Sketch Recognition

Summary:

In Sharon and van de Panne's paper, sketch objects are represented with constellation models. A constellation model is a visual model that captures individual and pairwise features between strokes. For their model the site features include:
  • The x-coordinate of the stroke's bounding box center
  • The y-coordinate of the stroke's bounding box center
  • The length of the bounding box diagonal
  • The angle of the bounding box diagonal
Interaction features include:
  • Delta x between the strokes
  • Delta y between the strokes
  • The minimum distance between the endpoints of stroke a and any point on stroke b
  • The minimum distance between the endpoints of stroke b and any point on stroke a
Since the interaction features are pairwise, the model scales with the number of strokes by O(n^2). The authors chose to allow labels to be optional or mandatory in order to reduce the run time of the model.

Each element of the feature vectors (site F and interaction G) have their mean and covariances computed for the training data set. The labels in the training data then have and individual probabilistic model computed for the label's features.

For the entire sketch, a likelihood function estimates the probability of an entire labeling L. This function is given in the paper. The overall function tries to maximize the probability of an individual label, as well as the probability of the interactions between two individual labels.

A maximize likelihood search tries to maximize the likelihood function defined, i.e. the find maximum probable labeling. The search is multipass, with the first pass labeling only mandatory strokes objects. All possible label assignments are then searched for using a "branch-and-bound search tree [where each] node in the search tree represents a partial labeling of the sketch." The depth of the tree corresponds to the number of labels applied. The search advances by choosing the best assignment of mandatory labels, and the "cost" of this assignment bounds the rest of the search. The multipass algorithm also bounds branches of the tree before a full labeling is found. If a new likelihood is worse than the bound, then the branches associated with the likelihood's labeling are pruned. If no complete solution is found with the bounding, the bound is loosened until a labeling is found.

In the results section, the multipass thresholding/bounding greatly reduces the time required to find a complete labeling.

Discussion:

Although the paper discussed how using a branch-and-bound search tree with multipass thresholding improves the runtime, no mention was given to the accuracy of the final system. I've worked with systems similar to this before, and I know that the accuracy can drop drastically as the number of labels increases. The authors seem to curb this by having "optional" labels, and by forcing each component to be drawn in a single stroke, but results for how the system's accuracy scales with the number of labels would have been benefitial.