Abstract
Reinforcement learning research has focused on motor control, visual, and game tasks with increasingly impressive performance. Success in these tasks indicates exciting theoretical advances for artificial intelligence, and much of what has been learned has aided vision, robotics, and other research areas. Although it has not enjoyed the same level of success, reinforcement learning research in natural language domains may prove just as useful. Research with this objective focuses on text-based games because of their simplified and learnable structure. This research lacks the backing and well-defined baselines developed in other areas of reinforcement learning, but recent work with text-based games promises improved standards and performance. This article reviews the current state of reinforcement learning for text-based games and considers the potential benefits of this important research area.
I. Introduction
Reinforcement learning (RL) is a field of computer science research that is theoretically rich but still rarely applied in practice. Despite this, industry and academic leaders have invested heavily in it. RL has had the most success in virtual environments such as video games. Using RL techniques, researchers have achieved superhuman performance in games such as Go, Chess, and Atari. Later work extended these techniques to simulated robotics, teaching agents to walk and to perform complicated navigation and control tasks. Most recently, researchers have created agents for video games such as Dota 2 and StarCraft 2 that compete with professional players. Unfortunately, the techniques used in these virtual environments are not easily transferable to the physical world.
Although it has mainly been limited to simulated environments, RL research has helped computer scientists learn valuable principles. Techniques developed or improved for RL are applicable to other areas of machine learning, including computer vision techniques and new network architectures. Although machine learning is only loosely based in biological realities, RL techniques also provide insights for neuroscientists studying the human brain. RL is not yet efficient or practical in the real world, and superhuman performance in video games may not seem that important. However, the theoretical potential of RL and the usefulness of its insights in other domains justify the hype and funding for this field.
Natural language processing (NLP) tasks, such as conversational artificial intelligence and personal voice assistants, have proven more resistant to RL techniques. This is largely because of the enormous number of words and phrases that make up the action space in natural language domains: the large vocabulary of natural languages makes action selection very difficult. Additionally, understanding the current state of a conversation is more complicated than looking at a chess board or reading the joint sensors of a robot. To simplify these problems, researchers have used text-based games to train RL agents in the natural language domain. As this research grows and improves, text-based games may benefit more complicated natural language tasks, just as earlier RL research has begun to benefit physical tasks in the real world.
Contents
- Background
- Reinforcement Learning
- Natural Language Processing
- Text-based RL
- State Space
- Action Space
- Frameworks
II. Article Body
Background
The research analyzed in this paper requires some understanding of recent work in RL and NLP. Driven by the rise of deep learning, these fields, variations of which have been studied since the mid-1950s, are being augmented and advanced dramatically.
Reinforcement Learning
The standard RL environment includes a set of states, a set of actions, and partly random transition dynamics that define which state results from taking a given action in a given state. Transitions also yield rewards, and the agent's objective is to maximize the total reward it collects. An agent acting in this environment learns to do so by understanding the state space and transition dynamics.
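This interaction can be summarized as a simple loop: observe the state, choose an action, then receive a reward and the next state. The sketch below illustrates the loop with a toy two-room environment and a random placeholder policy; the environment and its dynamics are invented purely for illustration.

```python
import random

def step(state, action):
    """Toy, partly random dynamics invented for illustration: two rooms (0 and 1).
    Action 1 tries to move to the other room; being in room 1 yields a reward."""
    if action == 1 and random.random() < 0.8:
        next_state = 1 - state
    else:
        next_state = state
    reward = 1.0 if next_state == 1 else 0.0
    return next_state, reward

# The agent-environment loop: observe the state, act, receive a reward and the next state.
state, episode_return = 0, 0.0
for t in range(100):
    action = random.choice([0, 1])        # placeholder random policy; RL learns a better one
    state, reward = step(state, action)
    episode_return += reward
print("episode return:", episode_return)
```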
Q-Learning is an RL algorithm that learns to act in an environment by learning a Q-value for each state-action pair.1 This Q-value is the expected discounted sum of future rewards obtainable from the current state after taking a specific action and then acting optimally. Traditional Q-Learning learns Q-values through iterative updates while gaining experience interacting with the environment.
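A minimal sketch of tabular Q-learning on the same toy two-room environment as above; the learning rate, discount factor, and exploration rate are arbitrary illustrative values.

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1      # learning rate, discount factor, exploration rate
ACTIONS = [0, 1]

def step(state, action):
    """Same toy two-room dynamics as in the previous sketch."""
    next_state = 1 - state if action == 1 and random.random() < 0.8 else state
    return next_state, (1.0 if next_state == 1 else 0.0)

q = defaultdict(float)                       # Q-value for each (state, action) pair
state = 0
for t in range(10_000):
    # Epsilon-greedy: mostly exploit the current Q-values, sometimes explore.
    if random.random() < EPSILON:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: q[(state, a)])
    next_state, reward = step(state, action)
    # Iterative update toward the bootstrapped target r + gamma * max_a' Q(s', a').
    best_next = max(q[(next_state, a)] for a in ACTIONS)
    q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
    state = next_state

print({sa: round(v, 2) for sa, v in q.items()})
```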
Deep RL incorporates deep learning techniques into traditional RL algorithms to learn optimal actions in more complex state and action spaces. Whereas traditional Q-learning requires thorough exploration of the environment, the incorporation of deep learning allows generalization to novel state-action pairs in environments where it is infeasible to explore every possibility. In 2013, DeepMind used a deep Q-network (DQN) to train agents to play Atari games at superhuman levels.2 This was previously impractical with traditional RL because the visual state space of Atari games is so large. Instead of learning an explicit mapping from each state-action pair to a Q-value, DQN uses a deep neural network to generalize to novel states based on similar states. Later work has had success in games like Chess, Go, and StarCraft, as well as in various robotic control tasks.3,4
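A compact sketch of the DQN idea, assuming PyTorch: a neural network maps a state vector to one Q-value per action, and its parameters are nudged toward a bootstrapped target rather than storing a table entry per state. Replay buffers, target networks, and the other details of the full algorithm are omitted, and the sizes are placeholders.

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS, GAMMA = 8, 4, 0.99   # illustrative sizes

# The Q-network generalizes across states instead of tabulating them.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64), nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(state, action, reward, next_state, done):
    """One gradient step toward the TD target r + gamma * max_a' Q(s', a')."""
    q_sa = q_net(state)[action]
    with torch.no_grad():
        target = reward + GAMMA * q_net(next_state).max() * (1.0 - done)
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Example transition with random placeholder tensors.
s, s2 = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
dqn_update(s, action=1, reward=torch.tensor(1.0), next_state=s2, done=torch.tensor(0.0))
```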
Natural Language Processing
NLP involves processing human language with machines. This includes analyzing, interpreting, and generating both spoken and written language. Speech recognition, text-to-speech synthesis, and language translation are examples of tasks in the natural language domain.
One of the most powerful trends in modern NLP is the use of word embeddings.5 This technique has become especially popular with the success of deep learning because deep networks can exploit the semantic structure of the embedding space. Word embeddings map words, or tokens, to n-dimensional vectors. These embeddings lie in an embedding space that captures semantic meaning and relationships between words. The space is typically learned by training a model to predict a word from the words that surround it in a large reference corpus.
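A small sketch of what an embedding space makes possible, using NumPy and a handful of made-up three-dimensional vectors; real embeddings are learned from large corpora and have hundreds of dimensions.

```python
import numpy as np

# Tiny made-up embedding table; real embeddings are learned, not hand-written.
embeddings = {
    "door": np.array([0.9, 0.1, 0.0]),
    "gate": np.array([0.8, 0.2, 0.1]),
    "apple": np.array([0.0, 0.9, 0.4]),
}

def cosine_similarity(a, b):
    """Words with similar meanings should have vectors pointing in similar directions."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["door"], embeddings["gate"]))   # high: related words
print(cosine_similarity(embeddings["door"], embeddings["apple"]))  # low: unrelated words
```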
Another popular structure for NLP is the knowledge graph.6 Knowledge graphs represent entities as nodes and relationships as edges of a graph. Several models exist for creating knowledge graphs from text: they identify entities and the relationships between them that the text suggests and record them as a graph for later use. These graphs can be used to analyze relationships in text, such as (the dog, has, a bone) or (my family, traveled to, the city).
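A minimal sketch of a knowledge graph stored as subject-relation-object triples, using networkx; the triples here are hand-written for illustration, whereas extraction models produce them automatically from text.

```python
import networkx as nx

# Each edge stores the relation between two entities.
kg = nx.MultiDiGraph()
triples = [
    ("dog", "has", "bone"),
    ("family", "traveled to", "city"),
    ("lock", "is on", "gate"),
]
for subject, relation, obj in triples:
    kg.add_edge(subject, obj, relation=relation)

# Query the graph: what do we know about the gate?
for subject, obj, data in kg.in_edges("gate", data=True):
    print(subject, data["relation"], obj)   # e.g. "lock is on gate"
```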
While not unique to NLP, recurrent neural networks (RNNs) are often used to represent text data.7 RNNs map inputs to outputs like other neural networks, but they also take a hidden state from the network's previous forward pass as input. Each time the network runs forward, it produces an output and a hidden state to use during the next pass. This hidden state acts as a memory unit that stores information from previous passes. RNNs are commonly used as encoders for domains with variable-sized input such as text: a variable-length sequence of word embeddings can be passed through an RNN one embedding at a time, and the final hidden state can then act as a representation of the entire sequence. Various improvements have been made to RNNs, such as attention and multi-headed attention.8,9
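A sketch of an RNN used as a sequence encoder, assuming PyTorch: a sequence of word embeddings is fed through a GRU, and the final hidden state serves as a fixed-size representation of the whole sequence. The vocabulary and dimensions are placeholder values.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM = 1000, 32, 64   # placeholder sizes

embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)    # token id -> word embedding
encoder = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)

def encode(token_ids):
    """Return a single vector summarizing a variable-length token sequence."""
    embedded = embedding(token_ids.unsqueeze(0))   # (1, seq_len, EMBED_DIM)
    _, final_hidden = encoder(embedded)            # final_hidden: (1, 1, HIDDEN_DIM)
    return final_hidden.squeeze()                  # (HIDDEN_DIM,)

sentence = torch.tensor([5, 42, 7, 318])           # token ids for some observation text
print(encode(sentence).shape)                      # torch.Size([64])
```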
Text-based RL
Creating RL agents that operate in the natural language, or text, domain comes with problems not found in other RL tasks. The unknown state space and extremely large action space make working with natural language especially difficult. One way to simplify these problems is to work in structured and simplified text-based game environments, which provide a well-defined objective and a smaller vocabulary.
RL research in text-based games is relatively new, largely unexplored, and has not received as much attention as other RL research. For this reason, researchers tend to approach the problems found in text-based RL in diverse ways. This diversity is valuable because it encourages exploration of new methods that may benefit NLP research, but it also makes it difficult to compare research and build on previous work. The following sections present research that represents the current state of text-based RL, areas for future work, and insights from text-based RL for other fields.
State Space
In most RL tasks, the state is directly observed by the agent and contains all the information necessary to act optimally. In the case of Atari games, the visual output from the game adequately represents the state. For text games, the state is usually unknown and only partially observed. On each turn the game provides observations that reveal only part of the current state. As shown in Figure 1, an agent playing a text-based game receives a new observation partially describing the new state after each command. The agent learns to represent the current state with whatever observation and historical information it has and to act accordingly.
One way to represent the state of text-based games is with an RNN. This method was first applied to text-based games in 2015.10 With an RNN, the agent can learn to remember previously seen observations that contribute to the current state of the game, such as an objective given at the beginning of the game or the contents of the room the agent just left. To do this, each observation string is transformed into a series of word embeddings and encoded into a single vector. This vector is then passed into the RNN, where it is combined with a remembered hidden state. The RNN outputs a state representation that combines the current observation with the previous observations the agent chose to remember. This state can then be passed into a DQN or any other RL algorithm.
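A sketch of this idea, assuming PyTorch and reusing the embedding/GRU pattern above: each turn, the new observation is encoded into a vector and folded into a hidden state carried across turns, and the resulting state vector is what a DQN would consume. The sizes and helper names are illustrative, not the architecture from the cited work.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMBED_DIM, HIDDEN_DIM, STATE_DIM = 1000, 32, 64, 64

embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
obs_encoder = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)   # encodes one observation
state_rnn = nn.GRUCell(HIDDEN_DIM, STATE_DIM)                   # carries memory across turns

def encode_observation(token_ids):
    """Collapse one observation string (as token ids) into a single vector."""
    _, h = obs_encoder(embed(token_ids.unsqueeze(0)))
    return h.squeeze(0)                                          # (1, HIDDEN_DIM)

game_state = torch.zeros(1, STATE_DIM)                           # memory at the start of the game
for observation in [torch.tensor([4, 17, 9]), torch.tensor([8, 2, 55, 3])]:
    obs_vec = encode_observation(observation)
    game_state = state_rnn(obs_vec, game_state)   # combine new observation with remembered state
    # game_state would now be passed to a DQN (or any RL algorithm) to choose a command
```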
Most text-based RL methods have used similar approaches to represent the complicated and partially hidden state of text-based games. Work from 2018 combined a multi-headed attention RNN with knowledge graphs to represent the state of text-based games.11 The authors used Stanford's OpenIE to convert each observation into knowledge-graph triples before combining it with previous observations to form the game's state. The knowledge graph not only captures the semantic meanings of word embeddings but also incorporates the relationships between those words. Word embeddings may help an agent understand that doors and gates are similar things, but knowledge graphs can help agents understand that the lock in the room is on the gate and that the golden chest is behind it.
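A sketch of maintaining a knowledge-graph state across turns, extending the networkx graph above. The extract_triples helper is a hypothetical, hard-coded stand-in for an OpenIE-style extractor, not the cited system.

```python
import networkx as nx

def extract_triples(observation):
    """Hypothetical stand-in for an OpenIE-style extractor over observation text."""
    canned = {
        "There is a gate here. A lock is on the gate.": [("lock", "is on", "gate")],
        "The golden chest is behind the gate.": [("golden chest", "is behind", "gate")],
    }
    return canned.get(observation, [])

state_graph = nx.MultiDiGraph()     # persists for the whole game
for observation in [
    "There is a gate here. A lock is on the gate.",
    "The golden chest is behind the gate.",
]:
    for subject, relation, obj in extract_triples(observation):
        state_graph.add_edge(subject, obj, relation=relation)

# The agent's state now records that the lock is on the gate and the chest is behind it.
print(list(state_graph.edges(data="relation")))
```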
While these have been effective ways to represent the state of text-based games, they are not advanced enough to capture all the complex meaning that humans are able to derive from words. Future work on state spaces may involve more complex models such as differentiable neural computers, which encourage structured memory through a von Neumann-like architecture.12 Perhaps new models created for solving text-based games will prove transferable to other RL domains and machine learning problems.
Action Space
The action space of text-based games is a problem more specific to the text domain, and for this reason it has been the focus of most research in text-based RL. In general, the action space of text-based RL is essentially every possible sentence or phrase in a language. Text-based games are attractive to RL research in the language domain because the action space can be reduced to the set of valid commands for the game. This brings all possible commands down to millions rather than googols. Still, learning optimal actions for each state out of millions of choices is inefficient.
The 2015 work that first used RNNs for text-based games also introduced a creative way to handle large action spaces.10 For its Fantasy World agent, the set of all possible actions was defined by combinations of thirty-seven objects with six verbs, totaling 222 possible commands. Instead of training a DQN on each of these 222 commands, the network was trained to score verbs and objects separately, and the scores of each verb-object pair were then combined into a command's Q-value. This method was effective for the simple Fantasy World, but its commands were limited to verb-object pairs. Even so, it provided a starting point for future research.
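A sketch of this factorization, assuming PyTorch: rather than one output per full command, the network has a verb head and an object head over a shared state representation, and the two scores are combined per verb-object pair (shown here as a simple product, one of several possible combinations, not necessarily the one used in the cited work). The sizes mirror the 6 verbs and 37 objects described above.

```python
import torch
import torch.nn as nn

STATE_DIM, NUM_VERBS, NUM_OBJECTS = 64, 6, 37     # 6 * 37 = 222 possible commands

verb_head = nn.Linear(STATE_DIM, NUM_VERBS)       # scores for verbs, given the state
object_head = nn.Linear(STATE_DIM, NUM_OBJECTS)   # scores for objects, given the state

def command_q_values(state):
    """Combine verb and object scores into a (verb, object) Q-value table."""
    verb_q = verb_head(state)                             # (NUM_VERBS,)
    object_q = object_head(state)                         # (NUM_OBJECTS,)
    return verb_q.unsqueeze(1) * object_q.unsqueeze(0)    # (NUM_VERBS, NUM_OBJECTS)

state = torch.randn(STATE_DIM)                     # e.g. the output of the recurrent encoder
q_table = command_q_values(state)
verb_idx, obj_idx = divmod(int(q_table.argmax()), NUM_OBJECTS)
print("best command:", verb_idx, obj_idx)          # indices into the verb and object vocabularies
```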
Many current approaches limit the action space to a list of commands that are useful and common in text-based games. These lists often vary by game and contain hundreds of commands rather than millions. This would not be nearly enough for an agent to communicate and function in the real world, but it serves the purposes of research and simplified models. After creating these smaller but still large sets of commands, these methods prune the list through elimination or ranking, which can be approached in several ways. Affordance extraction ranks commands by using word embeddings to measure which actions a situation affords.13 The knowledge graph method ranks commands by counting how often words from a command appear in the knowledge graph and its relationships.11 Action elimination networks learn an elimination signal from the current state via contextual bandits.14 Pruning the action space in smaller text domains should scale to larger and more realistic action spaces as well.
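A rough sketch of embedding-based command ranking in the spirit of these pruning approaches, using NumPy: candidate commands are scored by the similarity between their word embeddings and those of the current observation, and only the top few survive. The embedding table and scoring rule are illustrative, not the method from any cited paper.

```python
import numpy as np

# Tiny illustrative embedding table (real systems use learned embeddings).
emb = {
    "open": np.array([0.9, 0.1]), "gate": np.array([0.8, 0.3]),
    "eat":  np.array([0.0, 0.9]), "lock": np.array([0.7, 0.4]),
    "sing": np.array([0.1, 0.1]),
}

def embed_text(text):
    """Average the embeddings of known words in a string."""
    vectors = [emb[w] for w in text.split() if w in emb]
    return np.mean(vectors, axis=0)

def rank_commands(observation, candidates, top_k=2):
    """Keep the candidate commands most similar to the current observation."""
    obs_vec = embed_text(observation)
    def score(cmd):
        c = embed_text(cmd)
        return float(obs_vec @ c / (np.linalg.norm(obs_vec) * np.linalg.norm(c)))
    return sorted(candidates, key=score, reverse=True)[:top_k]

print(rank_commands("a lock is on the gate", ["open gate", "eat lock", "sing"]))
```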
Another recent paper explored the possibility of generating a new action space at each turn of a text-based game.15 This was approached as a supervised learning problem using generated text datasets. A model was trained on the game's context and entities to produce a list of possible commands, and it successfully produced valid commands for new situations similar to those it was trained on. Although the model has not yet been tested on text-based games, it should be able to generate valid and meaningful actions for an agent's current state. This approach removes the problem of large action spaces by generating its own action space, but the RL agent is then limited by the capability of its action generation model. If the agent tries to act in a domain unfamiliar to the generation model, it will be unable to produce valid or meaningful commands.
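A very rough sketch of the generation idea, assuming PyTorch: a small decoder conditioned on an encoded game context emits a command token by token. This is a generic conditional generator, not the architecture from the cited paper, and the vocabulary and sizes are placeholders.

```python
import torch
import torch.nn as nn

VOCAB = ["<start>", "<end>", "open", "take", "gate", "key"]   # placeholder command vocabulary
VOCAB_SIZE, EMBED_DIM, CONTEXT_DIM = len(VOCAB), 16, 64

token_embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
decoder = nn.GRUCell(EMBED_DIM, CONTEXT_DIM)       # hidden state initialized from the context
output_head = nn.Linear(CONTEXT_DIM, VOCAB_SIZE)   # predicts the next command token

def generate_command(context_vector, max_len=4):
    """Greedily decode one candidate command conditioned on the game context."""
    hidden = context_vector.unsqueeze(0)            # (1, CONTEXT_DIM)
    token = torch.tensor([VOCAB.index("<start>")])
    words = []
    for _ in range(max_len):
        hidden = decoder(token_embed(token), hidden)
        token = output_head(hidden).argmax(dim=-1)  # pick the most likely next token
        if VOCAB[int(token)] == "<end>":
            break
        words.append(VOCAB[int(token)])
    return " ".join(words)

context = torch.randn(CONTEXT_DIM)                  # e.g. the encoded observation/state
print(generate_command(context))                    # untrained, so the output is arbitrary
```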
Much of the work that needs to be done in text-based RL lies with the action space. Both action pruning and action generation seem to be valid methods for managing the extremely large text domain. Action pruning is attractive because most methods learn to act from scratch with no prior knowledge or training necessary, but these methods need to be able to scale well to be effective in larger text domains. Action generation seems promising because it dismisses the problem of a large action space, but agents are limited by their generation model that needs to be trained on numerous supervised examples. Even if text-based RL never makes it to real world applications, maybe some combination of these two approaches will prove to be effective when solving NLP problems in the real world.
Frameworks
RL for text-based games has only recently started to establish standards and baselines. Until recently, researchers generated their own games or game frameworks, which made it difficult to compare results across research teams and to build on another team's work. In 2018, Microsoft released a text-based game framework called TextWorld.16 TextWorld provides an RL framework for several common text-based games as well as auto-generated random games of customizable difficulty and size, as demonstrated in Figure 2. This makes it an ideal framework for simple text-based RL.
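A brief sketch of how an agent might interact with a TextWorld game through its gym-style interface; this follows the library's documented usage, but details may differ across versions, and the game file path and random-agent policy are placeholders.

```python
import random

import gym
import textworld.gym

# Register a generated game file (the path is a placeholder) as a gym environment.
env_id = textworld.gym.register_game(
    "tw_games/example.ulx",
    request_infos=textworld.EnvInfos(admissible_commands=True),
)
env = gym.make(env_id)

obs, infos = env.reset()                  # the opening observation text
score, done = 0, False
while not done:
    # Placeholder policy: choose a random admissible command each turn.
    command = random.choice(infos["admissible_commands"])
    obs, score, done, infos = env.step(command)
print("final score:", score)
```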
In 2019, Facebook AI Research released Learning in Interactive Games with Humans and Text (LIGHT), a text-based game platform similar to TextWorld.17 This framework also aims to serve as a standard point of comparison for text-based RL. Whereas TextWorld creates simple text environments in which individual agents act, LIGHT is a large-scale game in which agents can both talk and act with other agents and humans, adding a focus on multi-agent tasks and conversation alongside actions. With these frameworks, and others yet to come, future research in text-based RL should be easier to compare and build upon.
III. Conclusion
Text-based games provide an efficient framework in which agents can learn to act in a text domain. Techniques developed, and yet to be developed, for text-based games will serve to improve NLP and our understanding of human language. Much of the currently published research on text-based games is promising but difficult to compare and decipher; frameworks like TextWorld and LIGHT should help in this regard. State and action spaces are the primary areas that require further work and contributions. Through this and other research, techniques for text-based RL will be able to move from text-based games to larger, more complicated, and more useful domains.
IV. Notes
1. Christopher John Cornish Hellaby Watkins. “Learning from delayed rewards.” PhD diss., King’s College, Cambridge, 1989.
2. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. “Human-level control through deep reinforcement learning.” Nature 518, no. 7540 (2015): 529.
3. John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal policy optimization algorithms.” arXiv preprint arXiv:1707.06347 (2017).
4. David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser et al. “Mastering the game of Go with deep neural networks and tree search.” Nature 529, no. 7587 (2016): 484.
5. Jeffrey L. Elman. “Distributed representations, simple recurrent networks, and grammatical structure.” Machine learning 7, no. 2-3 (1991): 195-225.
6. Thomas R. Gruber. “Toward principles for the design of ontologies used for knowledge sharing?.” International journal of human-computer studies 43, no. 5-6 (1995): 907-928.
7. Sepp Hochreiter, and Jürgen Schmidhuber. “Long short-term memory.” Neural computation 9, no. 8 (1997): 1735-1780.
8. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural machine translation by jointly learning to align and translate.” arXiv preprint arXiv:1409.0473 (2014).
9. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” In Advances in Neural Information Processing Systems, pp. 5998-6008. 2017.
10. Karthik Narasimhan, Tejas Kulkarni, and Regina Barzilay. “Language understanding for text-based games using deep reinforcement learning.” arXiv preprint arXiv:1506.08941 (2015).
11. Prithviraj Ammanabrolu, and Mark O. Riedl. “Playing Text-Adventure Games with Graph-Based Deep Reinforcement Learning.” arXiv preprint arXiv:1812.01628 (2018).
12. Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo et al. “Hybrid computing using a neural network with dynamic external memory.” Nature 538, no. 7626 (2016): 471.
13. Nancy Fulda, Daniel Ricks, Ben Murdoch, and David Wingate. “What can you do with a rock? affordance extraction via word embeddings.” arXiv preprint arXiv:1703.03429 (2017).
14. Tom Zahavy, Matan Haroush, Nadav Merlis, Daniel J. Mankowitz, and Shie Mannor. “Learn what not to learn: Action elimination with deep reinforcement learning.” In Advances in Neural Information Processing Systems, pp. 3566-3577. 2018.
15. Ruo Yu Tao, Marc-Alexandre Côté, Xingdi Yuan, and Layla El Asri. “Towards Solving Text-based Games by Producing Adaptive Action Spaces.” arXiv preprint arXiv:1812.00855 (2018).
16. Marc-Alexandre Côté, Ákos Kádár, Xingdi Yuan, Ben Kybartas, Tavian Barnes, Emery Fine, James Moore et al. “TextWorld: A learning environment for text-based games.” arXiv preprint arXiv:1806.11532 (2018).
17. Jack Urbanek, Angela Fan, Siddharth Karamcheti, Saachi Jain, Samuel Humeau, Emily Dinan, Tim Rocktäschel, Douwe Kiela, Arthur Szlam, and Jason Weston. “Learning to Speak and Act in a Fantasy Text Adventure Game.” arXiv preprint arXiv:1903.03094 (2019).