Abstract
This chapter describes interfaces that enable users to combine digital pen and speech input for interacting with computing systems. Such interfaces promise natural and efficient interaction, taking advantage of skills that users have developed over many years. Many applications for such systems have been explored, such as speech and pen systems for computer-aided design (CAD), with which an architect can sketch to create and position entities while speaking information about them. For instance, a user could draw a hardwood floor outline while saying "three-fourths inch thick heart pine." In response, the CAD system would create a floor of the correct shape, thickness, and materials, while also updating the list of materials to purchase for the job. Then the user could touch the floor and say "finish with polyurethane." The user of such a system could concentrate on creating the planned building, without breaking that concentration to navigate a complex interface menu system. In fact, multimodal CAD systems like Think3 are preferred by users, and have been documented to significantly increase their productivity by speeding up interaction by 23% [Engineer Live 2013, Price 2004].
This chapter will discuss how speech and pen multimodal systems have been built, and also how well they have performed. Pen input here includes such devices as light pens, styluses, wireless digital pens, and digital pens that can write on paper while either storing digital data or streaming it to a receiver [Anoto 2016]. We will also occasionally refer to other devices that can, like digital pens, provide a continuous stream of <x, y> coordinates, such as tracked laser pointers, finger input on touch-screens, and the ubiquitous mouse. Pen input devices can be used for a number of communicative functions, such as handwriting letters and numbers, drawing symbols, sketching diagrams or shapes, pointing, or gesturing (e.g., drawing an arrow to scroll a map). See the Glossary for defined terms.
This chapter begins by discussing users' multimodal speech and pen interaction patterns, and the documented advantages of this type of multimodal system (Section 10.2). Section 10.3 describes the simulation infrastructure that is ideally required for prototyping new systems, and the process of collecting multimodal data resources. In terms of system development, Sections 10.4 and 10.5 outline general signal processing and information flow, and major architectural components. Section 10.6 describes implemented approaches to multimodal fusion and semantic integration. Section 10.7 presents examples of multimodal speech and pen systems, some of which are commercial applications [Tumuluri 2017], with the Sketch-Thru-Plan system provided as a walk-through case study. The chapter concludes in Section 10.8 with a discussion of future directions for research and development. As an aid to comprehension, readers are referred to the Glossary for newly introduced terms throughout the chapter, and also to the Focus Questions at the end of the chapter.