Define, Evaluate, and Improve Task-Oriented Cognitive Capabilities for Instruction Generation Models

Lingjun Zhao^*

Khanh Nguyen^*

Hal Daumé III

^* denotes equal contribution.

Dataset

Instructions generated by base speaker models and pragmatic speaker models ("generated_instr"), as well as human annotations with wayfinding scores ("human_annotation", "human_scores").

Abstract

Recent work studies the cognitive capabilities of language models through psychological tests designed for humans. While these studies are helpful for understanding the general capabilities of these models, there is no guarantee that a model possessing sufficient capabilities to pass those tests would actually use those capabilities in performing real-life tasks. In this work, we formulate task-oriented cognitive capabilities, which are human-like cognitive capabilities that language models leverage to perform tasks. These capabilities are (i) the ability to quickly generate good candidate utterances (the search capability), and (ii) the ability to predict how a listener interprets those utterances and choose the most appropriate one (the pragmatic capability). We design an evaluation scheme for comparing these capabilities of a language model with those of a human. Applying this scheme to examine various models in a navigation instruction generation problem, we find that their pragmatic capability is severely lacking. This insight leads us to augment them with better models of the listener and obtain a significant boost of 11% in success rate in guiding real humans. Our work advocates for having a principled procedure for aligning language models with humans that involves (i) formulating task-oriented capabilities, (ii) devising a method to quantify their deficiency, and (iii) iteratively improving them.

We aim to build speaker agents that can guide humans through natural language to accomplish goals. Standard evaluation that positions agents on a one-dimensional scale is not helpful for directing the development of the evaluated agents (a). We propose a framework called "bounded pragmatic agent" that can characterize the operations of both AI-based and human speakers (b). Viewing AI-based agents and humans through this unifying lens enables us to compare them on more fine-grained capabilities (c), and better instruct future development of these agents towards leveling with human performance (d).

The cognitive process of a bounded pragmatic speaker. In every task, the speaker first imagines a trajectory it wants to convey to the human listener. To reduce the search space, it then uses the base speaker to generate a small set of relevant candidate instructions. After that, it employs the theory-of-mind model to simulate how the human listener would follow each instruction in the candidate set. The speaker then selects the candidate instruction that causes the theory-of-mind listener to generate the trajectory most similar to the intended trajectory. Finally, the output instruction is sent to the human listener for real execution in the environment.
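The selection procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the names `pragmatic_speaker`, `path_similarity`, `toy_base_speaker`, and `toy_listener` are hypothetical, trajectories are assumed to be tuples of node IDs, and the position-wise overlap used as a similarity measure is a stand-in for whatever trajectory metric is actually used.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a trajectory is a sequence of node IDs in a navigation graph.
Trajectory = Tuple[str, ...]


def path_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Toy similarity: fraction of aligned positions that match (hypothetical metric)."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def pragmatic_speaker(
    intended: Trajectory,
    base_speaker: Callable[[Trajectory, int], List[str]],
    tom_listener: Callable[[str], Trajectory],
    num_candidates: int = 5,
) -> str:
    """Pick the candidate whose simulated execution best matches the intended path."""
    # Step 1: the base speaker proposes a small candidate set (search capability).
    candidates = base_speaker(intended, num_candidates)
    if not candidates:
        raise ValueError("base speaker returned no candidates")
    # Step 2: the theory-of-mind listener simulates each candidate, and the
    # speaker keeps the one with the most similar trajectory (pragmatic capability).
    return max(
        candidates,
        key=lambda instr: path_similarity(tom_listener(instr), intended),
    )


# Toy stand-ins for the learned models (purely illustrative).
def toy_base_speaker(trajectory: Trajectory, n: int) -> List[str]:
    pool = ["go straight then turn left", "turn right at the door", "go straight"]
    return pool[:n]


def toy_listener(instruction: str) -> Trajectory:
    routes = {
        "go straight then turn left": ("a", "b", "c"),
        "turn right at the door": ("a", "d"),
        "go straight": ("a", "b"),
    }
    return routes[instruction]


best = pragmatic_speaker(("a", "b", "c"), toy_base_speaker, toy_listener)
print(best)
```

In this toy setup, the first candidate is chosen because the simulated listener reproduces the intended path exactly when following it. Keeping the candidate set small reflects the "bounded" aspect: the listener model is only queried on a handful of instructions rather than the full space of utterances.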

Using an ensemble of listeners as the theory-of-mind model generally yields large improvements over base speaker models when communicating instructions to humans.
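One plausible way to use several listeners as a single theory-of-mind model is to average their scores for each candidate instruction. The sketch below assumes this averaging scheme for illustration; the page does not specify how the ensemble is combined, and the names `ensemble_score`, `choose_with_ensemble`, `overlap`, `listener_a`, and `listener_b` are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence, Tuple

Trajectory = Tuple[int, ...]           # assumption: a path is a tuple of node IDs
Listener = Callable[[str], Trajectory]  # maps an instruction to a predicted path


def overlap(a: Sequence[int], b: Sequence[int]) -> float:
    """Toy similarity: fraction of aligned positions that match."""
    if not a and not b:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def ensemble_score(
    instruction: str, intended: Trajectory, listeners: Sequence[Listener]
) -> float:
    """Average similarity to the intended path across all listeners (assumed scheme)."""
    return mean(overlap(listener(instruction), intended) for listener in listeners)


def choose_with_ensemble(
    candidates: Sequence[str], intended: Trajectory, listeners: Sequence[Listener]
) -> str:
    """Return the candidate with the highest average listener agreement."""
    return max(candidates, key=lambda c: ensemble_score(c, intended, listeners))


# Two toy listeners that disagree about instruction "x" (purely illustrative).
def listener_a(instruction: str) -> Trajectory:
    return {"x": (1, 2), "y": (1, 2, 3)}[instruction]


def listener_b(instruction: str) -> Trajectory:
    return {"x": (1, 2, 3), "y": (1, 2, 3)}[instruction]


intended_path = (1, 2, 3)
chosen = choose_with_ensemble(["x", "y"], intended_path, [listener_a, listener_b])
print(chosen)
```

Averaging rewards instructions that every listener interprets correctly: here instruction "x" misleads one of the two listeners, so "y" is preferred even though a single listener alone might rate them equally.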
