Monday, October 29, 2007

Envisioning Sketch Recognition: A Local Feature Based Approach to Recognizing Informal Sketches

Summary:

Oltmans's Ph.D. thesis uses computer vision techniques to recognize freely-drawn symbols and sketches. Freely-drawn sketches do not constrain the user's drawing style, so many issues need to be taken into account. For instance, stroke overtracing is a large problem in free-form sketches, especially when doodles or notes are involved. Noise is also more prevalent, and temporal data cannot be used because strokes can be drawn in any order.

To combat these issues, Oltmans uses computer-vision-based techniques that lessen the problems of overtracing and noise while ignoring any temporal features of the sketch. The technique, which Oltmans dubs the "bullseye," consists of a radial partition of the space around a point. The radial partitions correspond to buckets in a histogram that counts the number of stroke points falling within each section/slice. Stroke points are preprocessed so that they are roughly equidistant from one another. Each symbol has a corresponding set of bullseye patterns stored in the system's "codebook." Matching a series of computed histograms against the trained codebook patterns allows for symbol classification. The system is trained using an SVM on labeled data.
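
To make the idea concrete, here is a rough Python sketch of computing one bullseye histogram around a point. The ring and wedge counts and the 40-pixel radius are my own illustrative choices, not necessarily the exact parameters from the thesis:

```python
import math

import numpy as np

def bullseye_histogram(points, center, radius=40.0, n_rings=4, n_wedges=8):
    """Count resampled stroke points falling into each ring/wedge bucket
    around `center`; points outside `radius` are ignored."""
    hist = np.zeros((n_rings, n_wedges))
    cx, cy = center
    for x, y in points:
        dx, dy = x - cx, y - cy
        dist = math.hypot(dx, dy)
        if dist >= radius:
            continue
        ring = int(dist / radius * n_rings)
        # map the angle from (-pi, pi] onto wedge indices 0..n_wedges-1
        wedge = int((math.atan2(dy, dx) + math.pi) / (2 * math.pi) * n_wedges) % n_wedges
        hist[ring, wedge] += 1
    return hist.flatten()
```

Classification then amounts to comparing histograms like this against the stored codebook patterns and feeding the match scores to the SVM.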

Finding a symbol within an entire sketch, though, is rather difficult. Oltmans finds "candidate regions" for a symbol by looking at small, overlapping windows of points and then combining these windows into larger regions. The larger regions are found using EM clustering techniques to group the smaller windows together, and any cluster that is considered too large is split into smaller clusters. The bounding box of a cluster is then taken to be the symbol's region, and the bullseye histograms are computed for the ink within that box.
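
A minimal sketch of how that region-finding step could look, assuming the small windows have already been scored and we only keep the centers of promising ones; the cluster count and the size threshold are made up for illustration and the thesis's exact splitting criteria may differ:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def candidate_regions(window_centers, max_extent=200.0, n_clusters=3):
    """Group centers of promising sliding windows into candidate regions using
    EM (Gaussian mixture) clustering; re-split any cluster whose bounding box
    is too large, then return bounding boxes (x0, y0, x1, y1)."""
    X = np.asarray(window_centers, dtype=float)
    labels = GaussianMixture(n_components=min(n_clusters, len(X))).fit_predict(X)
    boxes = []
    for k in set(labels):
        pts = X[labels == k]
        (x0, y0), (x1, y1) = pts.min(axis=0), pts.max(axis=0)
        if max(x1 - x0, y1 - y0) > max_extent and len(pts) > 2:
            # oversized cluster: re-cluster its points into two smaller groups
            sub = GaussianMixture(n_components=2).fit_predict(pts)
            for j in (0, 1):
                p = pts[sub == j]
                if len(p):
                    boxes.append((*p.min(axis=0), *p.max(axis=0)))
        else:
            boxes.append((x0, y0, x1, y1))
    return boxes
```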

Overall, the system works well for individual symbols (94.4%), especially when compared to existing systems on noisy data. The system fared slightly worse when taking the entire sketch into account, achieving an accuracy of 92.3%.

Discussion:

The use of vision techniques to classify freely-drawn symbols is a good idea because stroke data has so much noise. Using vision mapping, whether it is histogram or simple pixel overlay, tends to avoid corner segmentation and overtracing issues.

I find the results for Oltmans's full-sketch tests slightly skewed. The vast majority of the shapes tested in the full sketches were wire segments and resistors. Since the wires are broken down into smaller segments, there were roughly 9,000 wires out of only approximately 14,000-15,000 shapes total. The accuracy for resistors (2,000 shapes) was also extremely high, but the accuracy for the rest of the shapes was around 60-80%. So over two-thirds of the shapes were easy to detect with high accuracy because they were either a straight line (wire) or visually unique (resistor). It seems like the system has a very hard time distinguishing between similar shapes, such as batteries vs. capacitors, mainly because the histograms are just not fine-grained enough to catch small variations in noisy data.

I'm also interested to see how the histograms could be tweaked depending on the scale of the image. The histograms used were hard-coded to 40-pixel radii, and having a variable histogram size would possibly help.

Wednesday, October 24, 2007

Naturally Conveyed Explanations of Device Behavior

Summary:

Oltmans and Davis present ASSISTANCE, a multimodal system capable of understanding simple 2D physics diagrams. The diagrams can contain bodies, pin joints, springs, pulleys, and rods. Arrows, as well as verbal cues, are used to describe the movement of objects.

In ASSISTANCE, the user first draws the system they want to model. Then, the user verbally describes the system while pointing at objects in the drawing. ASSISTANCE constantly updates its interpretation of the drawing, and the user can ask for the computer's interpretation at any time. This interpretation is a "causal model" for the drawn system (i.e. a sequence of cause and effect actions).

To generate the causal model, ASSISTANCE first finds the degree of freedom each object has, such as rotation or translation freedom. The system then utilizes the verbal description of the system, as well as any arrows the user draws. Verbal information is parsed to separate key objects and actions. For example, the phrase "Body 2 pushes Body 3" will parse into "Body 2", "pushes", and "Body 3". These verbal phrases, as well as the drawn bodies and arrows, are converted into propositional statements, and ASSISTANCE performs reasoning using a forward-chaining algorithm and a truth maintenance system.
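
A toy illustration of the forward-chaining part (ASSISTANCE also layers a truth maintenance system on top of this, which I am not showing, and the fact and rule representations here are invented):

```python
def forward_chain(facts, rules):
    """Apply rules (functions from the current fact set to new facts) until
    no new facts can be derived -- a bare-bones forward-chaining loop."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            for fact in rule(facts):
                if fact not in facts:
                    facts.add(fact)
                    changed = True
    return facts

def push_causes_motion(facts):
    # hypothetical rule: "X pushes Y" implies "Y moves, caused by X"
    return {("moves", f[2], "caused-by", f[1]) for f in facts if f[0] == "pushes"}

model = forward_chain({("pushes", "body-2", "body-3")}, [push_causes_motion])
```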

Often, the same action will be described in multiple ways, such as with a verbal description and an arrow indicating movement. When this happens, the two events are merged. The system assumes that only one motion can affect a body, so multiple descriptions affecting the same body would indicate that the descriptions describe the same event.

The final causal model is created by examining the causal events and constructing the most likely model for the system, given the description. To do this, ASSISTANCE uses known causal events and plausible causal events, along with constraint propagation. Events that do not have a cause are considered to be plausible and require an implicit cause from an outside force. The system tries to minimize these plausible causes, and the model is complete once every event has a cause.

Discussion:

ASSISTANCE seems to be a great system, and I'm really curious how users evaluated it. There was no formal evaluation for this paper, but since we're reading his thesis next week I'll find out what users say shortly.

I love multimodal systems, but I also understand why there are not many multimodal applications commercially available. Being able to describe a drawing verbally and with gestural cues is great, and using both input modes can improve the system's accuracy when the two modes reinforce each other with information. On the other hand, if the system does not force users to use all the input modes, then the recognition accuracy for each individual mode still has to be very high, since the system cannot count on the modes backing each other up.

Wednesday, October 17, 2007

Ambiguous Intentions: a Paper-like Interface for Creative Design

Summary:

Gross and Do describe the Electronic Cocktail Napkin program that allows users to draw ambiguously as if they were sketching a design on a piece of scrap paper. The Napkin then takes the sketch, examines objects within the ambiguous drawing, interprets the drawing's context, and recognizes the objects.

Domain recognizers are user-defined: the user draws examples and the Napkin identifies the symbols and patterns. Symbols can span multiple domains, such as a circle acting as a table in a floor-plan domain and the same circle representing a node in a graph domain.

As a user draws on a new, blank Napkin, the system decides whether each stroke or symbol is recognized, is ambiguous, or has failed to be recognized. A symbol is recognized if it can be interpreted in only one domain (the known, current domain); it is ambiguous if it can be interpreted in multiple domains. Otherwise, the system cannot recognize the symbol at that time and stores the information for later. A user can also manually specify a symbol, if need be.
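
The three-way outcome could be imagined roughly like this; recognizers_by_domain is a hypothetical mapping from domain name to a recognizer function, not the Napkin's actual interface:

```python
def classify_glyph(glyph, recognizers_by_domain):
    """A glyph recognized in exactly one domain is 'recognized', in several
    domains 'ambiguous', and in none 'unrecognized' (kept around in case a
    later context makes sense of it)."""
    matches = {domain: label
               for domain, recognize in recognizers_by_domain.items()
               if (label := recognize(glyph)) is not None}
    if len(matches) == 1:
        return "recognized", matches
    if len(matches) > 1:
        return "ambiguous", matches
    return "unrecognized", None
```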

Symbols are related to each other via constraints specific to a domain. For instance, graphs and circuits have connectedness constraints, and floor plans can have perpendicularity constraints for walls. Low-level recognition is accomplished with "glyphs," strokes normalized onto a 3x3 grid, along with additional features. Glyphs can also be grouped into configurations, which sit above individual glyphs in the hierarchy.
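
Here is roughly how the grid part of a glyph could be computed; the paper's glyphs include more than this (the features mentioned above), so this only shows the grid-normalization idea:

```python
def glyph_cells(points, grid=3):
    """Normalize a stroke to its bounding box and record which cells of a
    grid x grid lattice the pen passes through."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    w = max(max(xs) - x0, 1e-6)   # avoid division by zero for dots and straight lines
    h = max(max(ys) - y0, 1e-6)
    cells = set()
    for x, y in points:
        col = min(int((x - x0) / w * grid), grid - 1)
        row = min(int((y - y0) / h * grid), grid - 1)
        cells.add((row, col))
    return cells
```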

Discussion:

Overall, I think the Electronic Cocktail Napkin was a good idea that was too far ahead of its time and too tied down to one setup. The authors themselves mention that the system suffered from not having access to the ideal hardware, namely LCD digitizing displays the user could draw on directly. The equipment the Napkin actually used was a separate digitizing pad and display, which breaks the entire mapping from napkin to sketch pad.

Monday, October 15, 2007

Graphical Input Through Machine Recognition of Sketches

Summary:

Herot's short paper gave a brief, but comprehensive, look at sketch interaction systems in the mid-1970s.

The paper first looks at a general recognizer, HUNCH, which tries to see whether accurate knowledge can be obtained without using a specific domain. The system takes data drawn on a large tablet with a special pencil, and the raw input data is recorded by the computer. HUNCH used another application, called STRAIT, that found corners in the data by examining the user's pen speed. The system also used a process called latching to snap the endpoints of nearby lines together. Unfortunately, HUNCH had problems with consistency between different users. Users drawing at different pen speeds produced different corners, and the latching technique sometimes distorted an intended image, such as oversnapping the lines of a cube. The system also handled overtraced lines by merging them together, provided some 3D image inference through techniques the paper does not explain, and could create floor plans by looking at boxed rooms and doorways.
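
Herot does not give STRAIT's actual algorithm or thresholds, but the pen-speed idea can be sketched like this (the slow_fraction threshold is entirely made up):

```python
import math

def speed_corners(points, times, slow_fraction=0.4):
    """Flag a point as a corner candidate when the local pen speed drops
    below a fraction of the stroke's mean speed."""
    if len(points) < 3:
        return []
    speeds = []
    for i in range(1, len(points) - 1):
        (xa, ya), (xb, yb) = points[i - 1], points[i + 1]
        dt = (times[i + 1] - times[i - 1]) or 1e-6
        speeds.append(math.hypot(xb - xa, yb - ya) / dt)
    mean_speed = sum(speeds) / len(speeds)
    return [i + 1 for i, s in enumerate(speeds) if s < slow_fraction * mean_speed]
```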

Context is an important part of a sketch, and Herot recognizes this by noting that data interpretations should carry context information. The context should be specified to the computer so as to avoid the issue of recognizing the domain. Herot briefly mentions a top-down process for recognizing sketches within a context architecture.

Lastly, Herot mentions that user input is a key component of a sketch recognition system that should not be ignored. More complex interfaces need to be developed so that a user can interact with a program and correct mistakes, and corner finding algorithms need to be tuned for an individual user.

Discussion:

Although none of the topics mentioned in Herot's paper are new to me, the fact that all of these issues were discussed in a paper written in 1976 is surprising. For instance, I had been under the assumption that using pen speed to detect corners was a relatively new fad.

I am also very surprised that the system tried (and, from the one example, succeeded at) incorporating 3D image analysis. I remember reading a paper about using edges and vertices to detect whether an image is 3D, but I cannot seem to recall the author involved, so it's hard for me to construct a timeline for that research.

Wednesday, October 10, 2007

Perceptually Based Learning of Shape Descriptions for Sketch Recognition

Summary:

Veselova and Davis present a system that trains a recognizer from only one drawn example of a symbol. The system relies on perceptual information about symbols, learned from previous experiments conducted with humans. Goldmeier is the main source of Veselova's perceptual information; his research shows that people rely on singularities when distinguishing between objects. Singularities are important geometric properties that are highly sensitive to change, such as parallelism, vertical and horizontal lines, and symmetry. Goldmeier's research also shows that an object's vertical axis of symmetry is more important than its horizontal axis.

Veselova and Davis use this singularity information, as well as other constraints they define to be important, to create a ranked list of constraints. They also use work from Arnheim, who describes tension lines, the important vertical, horizontal, and diagonal lines of an image. Gestalt principles for grouping are also used in their system to group small objects into larger wholes, and these groupings are hierarchical.

The user study for the system involved a group of people comparing a description against 20 shapes, 10 of which agreed with the description and 10 of which did not. The humans ranked which shapes agreed and disagreed with the description, and the recognition system did the same. When the vast majority of humans agreed on an example, the system performed very well and matched the human performance. Overall, the system performed about 5% worse than humans when matching examples to descriptions.


Discussion:

I really enjoyed this paper. Using cognitive science techniques such as Gestalt groupings and preservation of symmetry is not necessarily new to computer science (ScanScribe), but the explanation behind the principles was decent for the short length of the paper.

I am surprised that the system did not incorporate "must not" constraints, since adding negated constraints shouldn't have been too difficult, and negation is important in perception.

Tuesday, October 9, 2007

Interactive Learning of Structural Shape Descriptions from Automatically Generated Near-miss Examples

Summary:

Hammond and Davis present a debugging system in LADDER for shape definition errors. Errors can be either syntactical or conceptual, where a syntactical error is an error within the definition expressions themselves, whereas a conceptual error deals with over- or under-constrained definitions. An under-constrained definition finds many false positives due to too few constraints, whereas an over-constrained definition has many false negatives by allowing only very specific instances of a shape to be recognized.

The paper does not go into depth on handling syntactical errors, but the GUI for LADDER notifies the developer of any syntactical errors by highlighting definition problems in red. For over-constrained errors, the system first needs to have a drawn example and a definition that positively recognizes that example. Once this is given, the system scales and rotates the drawn shape and asks the user whether the new shape should be positively recognized. The system removes constraints from the initial list if needed. Then, the system generates near-miss examples by looking at the remaining constraints and testing them one at a time. Each constraint is negated, and if the user indicates that the resulting image should be recognized, then the tested constraint is removed from the list.

For under-constrained errors, the system again requires a drawn example and a definition. A list of possible constraints is then generated for the image and the constraints are tested one at a time by adding the negation of the constraint into the current constraint list. If the new set with the negation constraint should be positively recognized, then the constraint is not added to the system.
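
Both the over-constrained and under-constrained checks boil down to negating one constraint at a time and asking the user about the generated near-miss. A minimal sketch of that logic, where generate_near_miss and ask_user stand in for LADDER's shape generator and GUI prompt (they are assumptions, not LADDER's real API):

```python
def debug_constraints(constraints, candidate_constraints, generate_near_miss, ask_user):
    """Sketch of the near-miss debugging loops: drop a constraint if its negated
    example should still be recognized (over-constrained); add a candidate
    constraint if its negated example should be rejected (under-constrained)."""
    kept = list(constraints)
    for c in constraints:                        # over-constrained check
        example = generate_near_miss(kept, negate=c)
        if ask_user(example):                    # user: still a valid shape?
            kept.remove(c)                       # constraint was too strict
    for c in candidate_constraints:              # under-constrained check
        example = generate_near_miss(kept, negate=c)
        if not ask_user(example):                # user: negation breaks the shape?
            kept.append(c)                       # constraint was missing
    return kept
```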

To generate the shapes during the testing process, each shape is converted into a set of numerical features that can be used in equations to stretch, rotate, and translate parts of the shape.

Discussion:

Knowing that this system existed in LADDER before I started creating my shapes would have been helpful. The idea behind eliminating or adding constraints is relatively simple, since a constraint and its negation cannot both exist in a definition.

Yet, after working with LADDER, I have found that one of my most frequent errors is not over- or under-constraint, but a problem with the constraints themselves. For instance, when drawing a rectangle, I want the endpoints of lines to be touching or close to touching. But I cannot edit the value for "close," so I have to either force the user to draw a rectangle in a certain fashion so I know that the lines are connected, or eliminate the close-endpoints constraint. Neither option is optimal, and I'll experiment with the LADDER debugging system to see whether it finds constraints I did not know existed.

Wednesday, October 3, 2007

LADDER, a sketching language for user interface developers

Summary:

Hammond and Davis created a sketching language, LADDER, that allows developers to easily design a geometric shape recognition system. The language handles shape drawing, displaying, and editing, and shapes are defined using primitives, constraints, and other shapes. Many primitives and constraints are already defined for the user, with primitives such as Lines, Arcs, and Spirals. Constraints can apply to an individual shape, such as whether it has a positive slope, and they can also describe interactions between shapes, such as whether two shapes intersect.

Shape definitions are hierarchical and incorporate geometric information about primitives and their relationship to each other. For instance, an arrow is defined as a shaft and a head, where the shaft is defined by a line and the head is defined as two lines meeting at an acute angle. One of the shaft points must also touch the head where the two head lines meet. Shape groups are also formed from shapes that are often drawn together. One example the authors give involves arrows touching polygons, which are defined as forces within a mechanical engineering domain. Shapes can also have editing constraints. A shape can be allowed to be "rubber-banded", or stretched, on a fixed point, rotated around a center, or forced to remain static if the system designer chooses.
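
To make the arrow example concrete, here is how its definition might be written out as plain data; LADDER has its own definition syntax, so the names and structure below are only illustrative:

```python
# Hypothetical rendering of the paper's arrow definition; LADDER's real syntax differs.
arrow = {
    "components": {
        "shaft": "Line",
        "head1": "Line",
        "head2": "Line",
    },
    "constraints": [
        ("acuteAngle", "head1", "head2"),        # the two head lines meet at an acute angle
        ("coincident", "shaft.p2", "head1.p1"),  # a shaft endpoint touches the head...
        ("coincident", "shaft.p2", "head2.p1"),  # ...where the two head lines meet
    ],
    "editing": [
        ("rubberBand", "shaft.p1"),              # illustrative editing behaviors
        ("rotate", "center"),
    ],
}
```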

Shapes are displayed to the user with their ideal strokes. If only lines are allowed in a domain, then only shapes made from lines can be drawn. Creating these ideal shapes happens through a series of mathematical equations: the features of a shape are quantified, and the constraints are translated into equations that those features are plugged into.


Discussion:

The LADDER system provides a great foundation for geometric recognition systems. It seems as if entire applications can be built relatively quickly, as long as the domains involved do not have complex shapes that cannot be broken down into primitives.

Since I haven't used the system much, I'm not sure how tough it is to build complex applications. I also do not know how easy it would be to integrate the language into an existing application, or whether LADDER is only designed to build complete systems. Having the ability to pick and choose what LADDER offers might also be beneficial; for instance, I might not want the system to clean strokes as they are drawn. The paper did not provide many details about customizability in this regard.