DIY assistant: a multi-modal end-user programmable virtual assistant

Authors: Michael H. Fischer, Giovanni Campagna, Monica S. Lam

PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation

Pages 312 - 327

https://doi.org/10.1145/3453483.3454046

Published: 18 June 2021

Abstract

While Alexa can perform over 100,000 skills, its capability covers only a fraction of what is possible on the web. Individuals need and want to automate a long tail of web-based tasks which often involve visiting different websites and require programming concepts such as function composition, conditional, and iterative evaluation. This paper presents DIYA (Do-It-Yourself Assistant), a new system that empowers users to create personalized web-based virtual assistant skills that require the full generality of composable control constructs, without having to learn a formal programming language.

With DIYA, the user demonstrates their task of interest in the browser and issues a few simple voice commands, such as naming the skills and adding conditions on the action. DIYA turns these multi-modal specifications into voice-invocable skills written in the ThingTalk 2.0 programming language we designed for this purpose. DIYA is a prototype that works in the Chrome browser. Our user studies show that 81% of the proposed routines can be expressed using DIYA. DIYA is easy to learn, and 80% of users surveyed find DIYA useful.
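The control constructs named in the abstract (function composition, conditional evaluation, and iterative evaluation) can be illustrated with a small sketch. This is not ThingTalk 2.0 and not DIYA's implementation; the `Skill` class, the `compose` helper, and the `PRICES` table (a stand-in for a value read from a web page) are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

# Hypothetical stand-in for data a demonstration would read from a website.
PRICES = {"flour": 3.0, "saffron": 18.0, "sugar": 2.5}


@dataclass
class Skill:
    """A named routine with an optional condition (illustrative model only)."""
    name: str
    body: Callable[[str], Optional[str]]
    condition: Optional[Callable[[str], bool]] = None

    def run(self, item: str) -> Optional[str]:
        # Conditional evaluation: skip the action when the condition fails.
        if self.condition is not None and not self.condition(item):
            return None
        return self.body(item)

    def run_each(self, items: Iterable[str]) -> List[str]:
        # Iterative evaluation: apply the skill to every item, dropping skips.
        results = (self.run(item) for item in items)
        return [r for r in results if r is not None]


def compose(name: str, first: Skill, second: Skill) -> Skill:
    """Function composition: feed the output of `first` into `second`."""
    def body(item: str) -> Optional[str]:
        intermediate = first.run(item)
        return None if intermediate is None else second.run(intermediate)
    return Skill(name, body)


# A "demonstrated" lookup with a spoken-style condition: only cheap items.
lookup = Skill(
    "look up price",
    lambda item: f"{item}: ${PRICES[item]:.2f}",
    condition=lambda item: PRICES[item] < 10,
)
announce = Skill("announce", lambda text: text.upper())
announce_cheap = compose("announce cheap items", lookup, announce)

cheap = lookup.run_each(["flour", "saffron", "sugar"])
announced = announce_cheap.run_each(PRICES)
```

Here `cheap` holds the results for the two items under the price threshold, and `announced` shows the same condition carried through a composed skill that is then applied to every item.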

Index Terms

DIY assistant: a multi-modal end-user programmable virtual assistant
1. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Natural language interfaces
  2. Ubiquitous and mobile computing
    1. Ubiquitous and mobile devices
      1. Personal digital assistants
2. Software and its engineering
  1. Software notations and tools
    1. Context specific languages
      1. Domain specific languages
      2. Programming by example

Published In

PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation

June 2021

1341 pages

ISBN:9781450383912

DOI:10.1145/3453483

General Chair: Stephen N. Freund (Williams College, USA)

Program Chair: Eran Yahav (Technion, Israel)

Copyright © 2021 Owner/Author.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PLDI '21: 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation

June 20 - 25, 2021

Virtual, Canada

Acceptance Rates

Overall Acceptance Rate 406 of 2,067 submissions, 20%
