DOI: 10.1145/3452918.3458806

Mixing Modalities of 3D Sketching and Speech for Interactive Model Retrieval in Virtual Reality

Published: 23 June 2021


Sketch and speech are intuitive interaction methods that convey complementary information and have been used independently for 3D model retrieval in virtual environments. While sketch has been shown to be an effective retrieval method, not all collections are easily navigable using this modality alone. We design a new, challenging database for sketch, comprising 3D chairs in which each of the components (arms, legs, seat, back) is independently colored. To overcome the limitations of sketch-only retrieval, we implement a multimodal interface for querying 3D model databases within a virtual environment. We base the sketch component on the state of the art in 3D sketch retrieval, and use a Wizard-of-Oz style experiment to process the voice input. In this way, we avoid the complexities of natural language processing, which frequently requires fine-tuning to be robust. We conduct two user studies and show that hybrid search strategies emerge from the combination of interactions, fostering the advantages provided by both modalities.
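As an illustration of the kind of hybrid query the abstract describes, the following is a minimal sketch, not the authors' implementation. It assumes each chair carries a numeric descriptor (a stand-in for a learned sketch/shape descriptor) and a color label per part; the names `Chair`, `hybrid_search`, `descriptor`, and `spoken_filters` are hypothetical. Spoken constraints filter the collection by part color, and the sketch descriptor ranks whatever remains.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Chair:
    name: str
    descriptor: Tuple[float, ...]  # hypothetical stand-in for a learned descriptor
    colors: Dict[str, str]         # part name (arms, legs, seat, back) -> color label


def hybrid_search(
    chairs: List[Chair],
    sketch_descriptor: Optional[Tuple[float, ...]] = None,
    spoken_filters: Optional[Dict[str, str]] = None,
    top_k: int = 3,
) -> List[Chair]:
    """Filter by spoken part/color constraints, then rank by sketch distance."""
    spoken_filters = spoken_filters or {}
    # Speech step: keep only chairs whose parts match every spoken constraint.
    candidates = [
        c for c in chairs
        if all(c.colors.get(part) == color for part, color in spoken_filters.items())
    ]
    # Sketch step: order the remaining chairs by Euclidean distance to the query.
    if sketch_descriptor is not None:
        candidates.sort(key=lambda c: math.dist(c.descriptor, sketch_descriptor))
    return candidates[:top_k]


# Toy data, invented purely for illustration.
chairs = [
    Chair("chair_a", (0.1, 0.9), {"arms": "red", "legs": "blue", "seat": "green", "back": "red"}),
    Chair("chair_b", (0.8, 0.2), {"arms": "red", "legs": "black", "seat": "green", "back": "blue"}),
    Chair("chair_c", (0.2, 0.8), {"arms": "blue", "legs": "blue", "seat": "red", "back": "red"}),
]

results = hybrid_search(chairs, sketch_descriptor=(0.1, 0.85), spoken_filters={"legs": "blue"})
print([c.name for c in results])  # ['chair_a', 'chair_c']
```

Either argument can be omitted, so the same function covers sketch-only, speech-only, and combined queries, which mirrors the idea of users mixing the two modalities as needed.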








Published In

IMX '21: Proceedings of the 2021 ACM International Conference on Interactive Media Experiences
June 2021
331 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.



Association for Computing Machinery

New York, NY, United States


Author Tags

  1. CNN
  2. HCI
  3. Sketch
  4. Virtual Reality


  • Research-article
  • Research
  • Refereed limited


IMX '21

Acceptance Rates

Overall Acceptance Rate 69 of 245 submissions, 28%



Article Metrics

  • Downloads (Last 12 months): 55
  • Downloads (Last 6 weeks): 13
Reflects downloads up to 03 Jan 2025



Cited By

  • (2024) Hey Building! Novel Interfaces for Parametric Design Manipulations in Virtual Reality. Proceedings of the ACM on Human-Computer Interaction 8:ISS, pp. 330-355. https://doi.org/10.1145/3698140. Online publication date: 24-Oct-2024
  • (2024) pARam: Leveraging Parametric Design in Extended Reality to Support the Personalization of Artifacts for Personal Fabrication. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-22. https://doi.org/10.1145/3613904.3642083. Online publication date: 11-May-2024
  • (2024) Take a Seat, Make a Gesture: Charting User Preferences for On-Chair and From-Chair Gesture Input. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-17. https://doi.org/10.1145/3613904.3642028. Online publication date: 11-May-2024
  • (2024) DreamCodeVR: Towards Democratizing Behavior Design in Virtual Reality with Speech-Driven Programming. 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR), pp. 579-589. https://doi.org/10.1109/VR58804.2024.00078. Online publication date: 16-Mar-2024
  • (2024) Toward More Comprehensive Evaluations of 3D Immersive Sketching, Drawing, and Painting. IEEE Transactions on Visualization and Computer Graphics 30:8, pp. 4648-4664. https://doi.org/10.1109/TVCG.2023.3276291. Online publication date: Aug-2024
  • (2024) Dream Mesh: A Speech-to-3D Model Generative Pipeline in Mixed Reality. 2024 IEEE International Conference on Artificial Intelligence and eXtended and Virtual Reality (AIxVR), pp. 345-349. https://doi.org/10.1109/AIxVR59861.2024.00059. Online publication date: 17-Jan-2024
  • (2023) Speech-Augmented Cone-of-Vision for Exploratory Data Analysis. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1-18. https://doi.org/10.1145/3544548.3581283. Online publication date: 19-Apr-2023
  • (2023) Style-aware Augmented Virtuality Embeddings (SAVE). 2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR), pp. 163-172. https://doi.org/10.1109/VR55154.2023.00032. Online publication date: Mar-2023
  • (2022) ShapeFindAR: Exploring In-Situ Spatial Search for Physical Artifact Retrieval using Mixed Reality. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, pp. 1-12. https://doi.org/10.1145/3491102.3517682. Online publication date: 29-Apr-2022
  • (2022) Structure-Aware 3D VR Sketch to 3D Shape Retrieval. 2022 International Conference on 3D Vision (3DV), pp. 383-392. https://doi.org/10.1109/3DV57658.2022.00050. Online publication date: Sep-2022
