Watch What I Do: Programming By Demonstration
Chapter 24: Using Voice Input to Disambiguate Intent
Programming by demonstration systems use mouse input and the point-and-click paradigm as the primary form of user interaction. During a demonstration, multiple valid interpretations of a user's actions can be made, and a major problem in programming by demonstration is disambiguating which of these interpretations the user intended. Most systems use a fixed model for determining the most "plausible inference," but unfortunately there is no guaranteed way to identify the user's intent correctly or to resolve the ambiguity without further assistance [Cypher 86]. Part of the problem stems from the limited amount of information the mouse can convey. The point-and-click style of interaction restricts the user's ability to demonstrate their intent successfully, which in turn causes ambiguity. As a result, other means for specifying detailed information, such as dialog boxes, pop-up menus, gravity points, grids, and keyboard input, have been invented. While these alternatives offer a solution to the problem, their use is often unnatural. Keyboard input requires that the user put down the mouse in order to type. Dialog boxes interrupt the user's concentration by forcing them to switch back and forth between different modes of interaction. To make matters worse, using these secondary methods can be just as ambiguous. For example, when using a grid to align two objects, should a system infer that the objects are positioned at grid unit (X, Y), or aligned relative to one another? If a menu command is chosen, does the system record the action of selecting the operation as part of the demonstration? Having recognized this problem, some systems employ interactive techniques such as snap-dragging [Bier 89] and semantic gravity points [Lieberman 92b] to disambiguate mouse input, instead of using a secondary method.
Interaction techniques which can be used while the mouse action is taking place may provide a more natural and effective solution, since they allow the user to indicate intent in a way that does not disrupt them from their primary task. One such technique is voice input. This chapter highlights an experimental extension to the Mondrian system described in Chapter 16. It explores the potential of voice input as a convenient means for disambiguating intent by allowing users to control how the system interprets their mouse actions.
Voice Input's Role

Voice input has been incorporated into the system as an additional input mechanism. Interface tools for drawing and object manipulation are invoked using the mouse. Voice input allows users to control how the system interprets their actions by issuing interactive audio advice, called voice commands [Articulate 90], while the current mouse operation is taking place. Voice commands are predefined descriptions of how a mouse action should be interpreted; they contain information which the system uses to modify its execution.
Figure 1. A sequential draw and …
Typical uses of voice input have included tasks which replace mouse commands altogether. Users will issue a voice command to close a window rather than clicking its close box, or to select menu items and icons in the same manner. While operations such as these provide an alternative to mouse interaction, they do not utilize the full range of interaction which multimodal input devices can accommodate. Maulsby's proposed use of audio as a tool for giving verbal hints about which features a system should focus on during a demonstration, in his Turvy experiment described in Chapter 11, begins to touch upon the potential of voice input.
Why Voice Input?
In everyday interaction, we use many different forms of communication [Stifelman 92]. We talk, listen, and perform hand gestures. When giving road directions, we illustrate the route visually with our hands while describing it verbally. Current secondary input methods are solely dependent on manual interaction; when users perform an action, their hands are already preoccupied with the mouse. Voice input provides users with a natural, easy-to-use method of communication which mirrors our normal social interaction. One of the powerful advantages voice input has over secondary methods such as dialog boxes is its ability to work in parallel with the mouse. When using mouse input, users are restricted to performing operations in a sequential order (figure 1). Voice commands, on the other hand, can be defined as a sequence of operations and issued while a mouse action is in progress (figure 2). As each command is given, the system uses it to modify the current mouse action, allowing users to customize a general operation, such as "drawing a rectangle," into a highly specific, intentional action, such as "drawing a rectangle centered on the screen, with the dimensions 20 pixels by 200 pixels." Several voice commands may be given during an operation to describe specific parts. Users can invoke a drawing operation, issue one command to take care of an alignment problem, then another to indicate an object's dimensions. The more specific the voice commands become, the less likely it is that the system will misinterpret the mouse action's intent.
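To make this idea concrete, here is a minimal sketch of voice commands acting as modifiers on a pending mouse operation. It is written in Python rather than Mondrian's Lisp, and the class, command names, and coordinates are invented for illustration only:

```python
# Illustrative sketch: voice commands issued during a mouse drag act as
# modifiers that refine the pending operation's description.

class DrawRectangle:
    """A mouse-driven draw operation that voice commands can refine."""
    def __init__(self, x, y, width, height):
        self.x, self.y = x, y
        self.width, self.height = width, height

    def apply(self, command, reference=None):
        # Each voice command rewrites part of the pending description.
        if command == "align-left" and reference is not None:
            self.x = reference.x                    # snap to the reference's left edge
        elif command.startswith("set-size"):
            _, w, h = command.split()               # e.g. "set-size 20 200"
            self.width, self.height = int(w), int(h)
        return self

title = DrawRectangle(100, 10, 180, 24)
rulebar = DrawRectangle(103, 40, 175, 4)          # a rough, imprecise drag
rulebar.apply("align-left", reference=title).apply("set-size 20 200")
print(rulebar.x, rulebar.width, rulebar.height)   # 100 20 200
```

The rough drag near (103, 40) is transformed into a precisely aligned, precisely sized rectangle without the user ever putting down the mouse.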
An Example of Using Voice Input
As an example of how voice input's use can disambiguate intent, consider the following page layout problem. Using a programming by demonstration system, a designer would like to graphically demonstrate how to create a particular layout style, such as that found in figure 3, and have the system generalize this into a function that could be used to format new documents in the same style. The criteria for creating this layout are that it contain three objects, a title and two rulebars; that the top rulebar's length equal the length of the title; that these two objects be left aligned; and that the bottom rulebar be centered horizontally with the title, with a length of 200 pixels. At present, Mondrian only contains drawing routines for colored rectangles. In the following example, we will assume that the system's graphical language also includes the lexical category of text. The designer will use voice input to control the size and positioning of the graphical objects drawn while demonstrating the example, and as a means of informing the system that the distinct geometric relationships between the objects, stated above, should be noted when defining the procedure.
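As a rough sketch of what the generalized function might compute, the four stated constraints can be expressed directly. The `Rect` class, function name, and vertical offsets below are assumptions for illustration; only the constraints themselves come from the example:

```python
# Hypothetical sketch of the layout function the designer wants the system
# to learn: the two rulebars are derived entirely from the title.
from dataclasses import dataclass

@dataclass
class Rect:
    x: float
    y: float
    width: float
    height: float

    @property
    def center_x(self):
        return self.x + self.width / 2

def layout_from_title(title: Rect) -> tuple[Rect, Rect]:
    """Derive both rulebars from the title, as in the demonstrated style."""
    top = Rect(title.x, title.y + title.height + 4,   # left-aligned with the
               title.width, 2)                        # title, same length
    bottom = Rect(title.center_x - 100,               # centered on the title,
                  top.y + 20, 200, 2)                 # fixed 200-pixel length
    return top, bottom

title = Rect(50, 30, 120, 18)
top, bottom = layout_from_title(title)
print(top.x == title.x, top.width == title.width, bottom.width)  # True True 200
```

Because every value is computed from the title argument, re-applying the function to a new title produces the same style with different dimensions, which is exactly the generalization the demonstration is meant to teach.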
Figure 2. Voice commands used to size and align an object.
Figure 3. A layout with distinct geometric relationships.
What Happened When the Voice Commands Were Issued
The system combines the information from the voice commands with the mouse action to create a description of the operation. From the mouse, the system knew that the user was drawing a rectangular object at a certain location. The voice command "Align-left" told the system that this object's position needed to be altered and aligned with another object, even though it was supplied with an initial location from the mouse. Since the title was selected as the reference point, the system then looked at its position to get the appropriate value and applied it to the drawing action to position the rulebar properly, as the designer had intended. Built into Mondrian is a fixed set of heuristics for inferring spatial relationships such as "above," "below," "left," "right," and "centered," and dimensional relationships such as "half-of" and "one-to-one." Yet the system only invokes these heuristics if the dimensions or positions of two arguments are equal within a tolerance value. During the demonstration, when the designer began to draw the top rulebar, its left edge was not close enough to the title's edge for the system to recognize this relationship on its own.
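The tolerance test behind these heuristics can be sketched as follows. This is a Python illustration, not Mondrian's code, and the tolerance value is an assumption:

```python
# Sketch of the tolerance-gated inference described above: a relational
# heuristic fires only when two values agree to within a tolerance;
# otherwise the relationship goes unnoticed without further help.

TOLERANCE = 3  # pixels; Mondrian's actual value is not specified here

def infer_left_aligned(a_left, b_left, tolerance=TOLERANCE):
    """Return True when two left edges are close enough to count as aligned."""
    return abs(a_left - b_left) <= tolerance

print(infer_left_aligned(100, 102))  # True: close enough, alignment inferred
print(infer_left_aligned(100, 112))  # False: too far apart to infer anything
```

In the second case the demonstration alone is silent, which is precisely where the "Align-left" voice command supplies the intent the mouse could not convey.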
Figure 4. Re-application of the layout function on new arguments.
This chapter has briefly illustrated how voice input can aid programming by demonstration systems in disambiguating a user's intent. Unfortunately, there are a number of disadvantages to using this method of input in general. Depending on the scope of its use, voice input can often pose more problems than other means of input, for several reasons. In the Turvy experiment described in Chapter 11, users actually said they would rather select from a menu or dialog box than use voice input, since they did not trust the capabilities of the voice input device. Recognition of audio sounds in some systems is often poor, and misinterpretation can make voice worse for input than menus, since its failure rate is often higher, or than keyboard entry, since typing has the advantage of being more forgiving. Using voice input also involves three levels of interpretation: the actual audio sound, its English translation, and its representation as a computational action. In systems which use a rich vocabulary of voice commands, users also have the problem of remembering the commands and knowing which terms they are allowed to say. The other serious problem with using voice input is that the voice commands themselves may be ambiguous. Depending on what they are defined to do and the order in which they are issued, their use can result in more ambiguity than other methods. In Mondrian, the default behavior of the system is to establish relationships between the reference point and other objects. If this method were not employed and the user issued the voice command "Align-left," how would the system know what to align and where to align it to? What if the user issued two conflicting commands, "Align-left" and "Align-center," or issued them in the wrong order? Should the system default to performing and remembering these commands sequentially? If so, then the advantages of using voice input would be no different from those of existing secondary input methods.
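The sequential fallback raised in the last question can be sketched concretely. This Python fragment is hypothetical, not Mondrian's behavior; it shows why simply replaying commands in order is unsatisfying:

```python
# Sketch of a naive sequential policy for conflicting voice commands:
# each alignment command silently overwrites the previous one, so
# "Align-left" followed by "Align-center" behaves exactly like a user
# clicking two menu items in a row -- no better than secondary input.

def resolve_alignment(commands):
    """Last-command-wins: no conflict detection, no parallelism."""
    alignment = None
    for cmd in commands:
        if cmd in ("align-left", "align-center", "align-right"):
            alignment = cmd  # later commands quietly replace earlier ones
    return alignment

print(resolve_alignment(["align-left", "align-center"]))  # align-center
print(resolve_alignment(["align-center", "align-left"]))  # align-left
```

The order-dependence of the result is the ambiguity the text warns about: two utterances of the same two commands produce different layouts.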
The voice input device used in this work was the Voice Navigator [Articulate 90]. Voice Navigator enables users to communicate with a Macintosh computer by triggering interface actions, normally achieved through manual interaction, when specific audio cues are given. In order to use the system, a voice file containing recorded audio samples of an individual user's voice and a language file with explicit interface actions must be predefined beforehand. Since the device does not perform true voice recognition, users must train the system about the correspondence between sounds and actions by recording the audio cues and linking them to the actions using software which accompanies the device. Since Mondrian runs in the Macintosh Common Lisp environment, the voice commands' actions were defined as Lisp functions. When the voice commands were executed, their actions were expanded into low-level function calls which were sent to a buffer, enabling the system to directly copy the buffer and evaluate it.
This chapter illustrates how voice input can offer a possible solution to the ambiguity problem in programming by demonstration. While it is not a complete solution, it can provide users with a convenient mechanism for controlling how a system interprets and executes actions, which in turn decreases the number of plausible inferences it must make. As voice commands are issued and applied, general mouse actions are transformed into highly specialized operations which can successfully convey the user's intent to the system.
The author would like to thank Muriel Cooper, director of the Visible Language Workshop, and Henry Lieberman for their support and encouragement of this work. Special thanks to B.C. Krishna for helpful comments and discussion. The Visible Language Workshop at MIT is sponsored in part by research grants from Alenia Corporation, Apple Computer Inc., DARPA, Digital Equipment Corporation, NYNEX, and Paws Inc.
[Articulate 90] Voice Navigator: Quick Start, Instruction Manual, Articulate Systems Inc., Cambridge, MA, 1990.