Interpreting Natural Language Instructions Using Language, Vision, and Behavior

Published: 11 August 2014

Abstract

We define the problem of automatic instruction interpretation as follows. Given a natural language instruction, can we automatically predict what an instruction follower, such as a robot, should do in the environment to follow that instruction? Previous approaches to automatic instruction interpretation have required either extensive domain-dependent rule writing or extensive manually annotated corpora. This article presents a novel approach that leverages a large amount of unannotated, easy-to-collect data from humans interacting in a game-like environment. Our approach uses an automatic annotation phase based on artificial intelligence planning, for which two annotation strategies are compared: one based on behavioral information and the other based on visibility information. The resulting annotations are used as training data for different automatic classifiers. The approach rests on the intuition that interpreting a situated instruction can be cast as a classification problem: choosing among the actions that are possible in the situation. Classification combines language, vision, and behavior information. Our empirical analysis shows that machine learning classifiers achieve 77% accuracy on this task on available English corpora and 74% on similar German corpora. Finally, including human feedback in the interpretation process boosts performance to 92% for the English corpus and 90% for the German corpus.
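
To make the classification framing concrete, here is a minimal sketch of choosing among the actions possible in a situation. It is not the article's implementation: the `Example` record, the `interpret` function, the action strings, and the token-overlap (Jaccard) scoring are all hypothetical stand-ins chosen for illustration, and the abstract does not specify the features or classifiers at this level of detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """Hypothetical training record: an instruction paired with the action
    that an automatic annotation phase assigned to it."""
    instruction: str
    action: str


def tokens(text):
    # Lowercased whitespace tokens; a deliberately crude language feature.
    return set(text.lower().split())


def jaccard(a, b):
    # Token-set overlap between two instructions, in [0, 1].
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def interpret(instruction, possible_actions, training):
    """Pick, among the actions possible in the current situation, the one
    whose annotated instructions are most similar to the new instruction."""
    best_action, best_score = None, -1.0
    for action in possible_actions:
        score = max(
            (jaccard(instruction, ex.instruction)
             for ex in training if ex.action == action),
            default=0.0,
        )
        if score > best_score:
            best_action, best_score = action, score
    return best_action, best_score


# Toy data (invented for this sketch, not taken from any corpus).
training = [
    Example("press the red button", "press(red_button)"),
    Example("go through the door", "move(door)"),
    Example("push the blue button", "press(blue_button)"),
]

action, score = interpret(
    "hit the red button",
    ["press(red_button)", "move(door)"],
    training,
)
print(action, score)
```

Restricting the candidates to `possible_actions` is what ties the decision to the situation: the same instruction can resolve to a different action when the set of executable actions changes.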

    Published In

    ACM Transactions on Interactive Intelligent Systems  Volume 4, Issue 3
    Special Issue on Multiple Modalities in Interactive Systems and Robots
    October 2014
    115 pages
    ISSN:2160-6455
    EISSN:2160-6463
    DOI:10.1145/2660857
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 11 August 2014
    Accepted: 01 April 2014
    Revised: 01 March 2014
    Received: 01 March 2013
    Published in TIIS Volume 4, Issue 3

    Author Tags

    1. Natural language interpretation
    2. action recognition
    3. multimodal understanding
    4. situated virtual agent
    5. unsupervised learning
    6. visual feedback

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Cited By

    • (2021) BehavE: Behaviour Understanding Through Automated Generation of Situation Models. KI 2021: Advances in Artificial Intelligence, 362-369. DOI: 10.1007/978-3-030-87626-5_27. Online publication date: 27-Sep-2021
    • (2020) Towards Automated Generation of Semantic Annotation for Activity Recognition Problems. 2020 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), 1-6. DOI: 10.1109/PerComWorkshops48775.2020.9156147. Online publication date: Mar-2020
    • (2018) Extracting Planning Operators from Instructional Texts for Behaviour Interpretation. KI 2018: Advances in Artificial Intelligence, 215-228. DOI: 10.1007/978-3-030-00111-7_19. Online publication date: 30-Aug-2018
    • (2014) Introduction to the Special Issue on Machine Learning for Multiple Modalities in Interactive Systems and Robots. ACM Transactions on Interactive Intelligent Systems (TiiS) 4:3, 1-6. DOI: 10.1145/2670539. Online publication date: 14-Oct-2014
