
Interpreting Natural Language Instructions Using Language, Vision, and Behavior

Published: 11 August 2014

    Abstract

    We define the problem of automatic instruction interpretation as follows. Given a natural language instruction, can we automatically predict what an instruction follower, such as a robot, should do in the environment to follow that instruction? Previous approaches to automatic instruction interpretation have required either extensive domain-dependent rule writing or extensive manually annotated corpora. This article presents a novel approach that leverages a large amount of unannotated, easy-to-collect data from humans interacting in a game-like environment. Our approach uses an automatic annotation phase based on artificial intelligence planning, for which two different annotation strategies are compared: one based on behavioral information and the other based on visibility information. The resulting annotations are used as training data for different automatic classifiers. This algorithm is based on the intuition that the problem of interpreting a situated instruction can be cast as a classification problem of choosing among the actions that are possible in the situation. Classification is done by combining language, vision, and behavior information. Our empirical analysis shows that machine learning classifiers achieve 77% accuracy on this task on available English corpora and 74% on similar German corpora. Finally, the inclusion of human feedback in the interpretation process is shown to boost performance to 92% for the English corpus and 90% for the German corpus.
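
    The classification framing in the abstract can be illustrated with a minimal sketch. It is not the authors' classifier: the `Example` record, the `interpret` function, the action-label strings, the Jaccard token overlap used as the language signal, and the 0.1 visibility bonus are all hypothetical choices made for illustration. The sketch only shows the idea of choosing among the actions that are possible in the current situation by combining a language cue with a visibility cue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One automatically annotated training pair (hypothetical format)."""
    instruction: str
    action: str


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens of an instruction."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-set overlap, used here as a stand-in language feature."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def interpret(instruction, training, possible_actions, visible_actions):
    """Return the possible action whose training instruction matches best.

    Returns None when no training example maps to a possible action.
    """
    query = tokens(instruction)
    best_action, best_score = None, -1.0
    for ex in training:
        if ex.action not in possible_actions:
            continue  # only actions executable in this situation compete
        score = jaccard(query, tokens(ex.instruction))
        if ex.action in visible_actions:
            score += 0.1  # assumed bonus for actions on visible objects
        if score > best_score:
            best_action, best_score = ex.action, score
    return best_action


training = [
    Example("press the red button", "press(red_button)"),
    Example("go through the door", "move(door)"),
    Example("push the blue button", "press(blue_button)"),
]
```

    With this toy data, calling `interpret("press the red button on the wall", training, {"press(red_button)", "move(door)"}, {"press(red_button)"})` selects `press(red_button)`, because the blue-button action is not possible in the situation and the door instruction shares only one token with the query.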

    Cited By

    • (2021) BehavE: Behaviour Understanding Through Automated Generation of Situation Models. KI 2021: Advances in Artificial Intelligence, 362-369. DOI: 10.1007/978-3-030-87626-5_27. Online publication date: 27-Sep-2021
    • (2020) Towards Automated Generation of Semantic Annotation for Activity Recognition Problems. 2020 IEEE International Conference on Pervasive Computing and Communications Workshops (PerCom Workshops), 1-6. DOI: 10.1109/PerComWorkshops48775.2020.9156147. Online publication date: Mar-2020
    • (2018) Extracting Planning Operators from Instructional Texts for Behaviour Interpretation. KI 2018: Advances in Artificial Intelligence, 215-228. DOI: 10.1007/978-3-030-00111-7_19. Online publication date: 30-Aug-2018
    • (2014) Introduction to the Special Issue on Machine Learning for Multiple Modalities in Interactive Systems and Robots. ACM Transactions on Interactive Intelligent Systems (TiiS) 4:3, 1-6. DOI: 10.1145/2670539. Online publication date: 14-Oct-2014

      Information

      Published In

      ACM Transactions on Interactive Intelligent Systems, Volume 4, Issue 3
      Special Issue on Multiple Modalities in Interactive Systems and Robots
      October 2014
      115 pages
      ISSN: 2160-6455
      EISSN: 2160-6463
      DOI: 10.1145/2660857
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 11 August 2014
      Accepted: 01 April 2014
      Revised: 01 March 2014
      Received: 01 March 2013
      Published in TIIS Volume 4, Issue 3

      Author Tags

      1. Natural language interpretation
      2. action recognition
      3. multimodal understanding
      4. situated virtual agent
      5. unsupervised learning
      6. visual feedback

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months): 11
      • Downloads (Last 6 weeks): 2
      Reflects downloads up to 10 Aug 2024

