Introduction
It seems like most people discussing generative AI for video games either hype it up as the inevitable future of gaming or dismiss it as just another fad. The former camp is composed of optimistic venture capitalists trying to get in on the next big thing, ambitious startup founders rushing low-hanging-fruit products to market, and video game developers rebranding their years-old systems as state-of-the-art AI. The latter camp includes players disillusioned by recent attempts at applying the technology and long-time developers aware of the disconnect between the promises of generative AI and the realities of game development. However, while many people understand that the situation is more complicated, that nuance rarely comes across in the discussion.
An excellent contribution to this discussion recently came from researchers at the University of Malta and New York University, who published “Large Language Models and Games: A Survey and Roadmap” (Gallotta et al., 2024). The paper provides a thorough analysis of previous uses of large language models (LLMs) in games and proposes some potential future directions. As thorough as the survey is, I believe there are still many ideas left to discuss regarding the application of LLMs in games. These applications go far beyond the generative AI NPCs that (somewhat frustratingly) currently dominate the gaming industry’s conversation around LLMs for games.
Safety and Risk
While some safety issues and risks that come with generative AI are unavoidable, many can be easily mitigated. For example, this post proposes several LLM use cases that provide guarantees as to what can and cannot be generated by an LLM. Keep in mind that different use cases of LLMs will come with varying levels of risk.
LLMs are a new technology and will change significantly in the coming years. Legal issues and potential regulation are still being resolved, and LLM providers may change their support or even go out of business in the near future. Diversifying dependencies or doing more development in-house mitigates these issues but comes at an increased cost.
Another major concern when deploying generated text in games is the lack of control over what content gets generated. From inappropriate content to inaccurate hallucinations to immersion-breaking mistakes, vanilla LLMs can generate undesirable text. This problem is exacerbated when users are allowed to provide input directly to the LLM. While alignment methods can decrease the likelihood of undesirable content (Bai et al., 2022), they are expensive and still do not provide a guarantee. However, the problem of content curation can be greatly mitigated or solved completely by better controlling the inputs and outputs of the LLM.
By limiting what a player can input or using LLMs in a way that does not require direct user input, game developers can more easily control generated content. If a guarantee is needed, developers can use a simple grammar or template to constrain the text that an LLM generates. This increased control of outputs may be especially well suited for limited LLM applications within more traditional games.
With this in mind, many of the existing and proposed LLM applications in this post provide various amounts of control over LLM inputs and outputs as well as control over the development and deployment process. In general, increased control over these factors reduces risk at the expense of open-endedness or cost.
Existing Applications
One of the first experiences to apply LLMs to gaming was AI Dungeon (Latitude, 2019). AI Dungeon has been around since the release of GPT-2 in 2019 and still has many active users today. In this entirely text-based adventure, players take actions and speak using text inputs, and an LLM provides a response. Since the LLM generations are not grounded in any game state, the experience is akin to collaborative storytelling. However, the company is working on new experiences with grounded gameplay.
More recently, Inworld’s Origins demo (Inworld, 2023) and other games like it utilize AI NPCs that players speak to directly. The AI NPC then uses an LLM to respond to the player. This gives players the freedom to approach a conversation as they see fit instead of selecting from pre-written responses. In this and similar games, players have very open-ended conversations (sometimes too much so) but are still limited in where the high-level narrative can go.
Other games, such as Suck Up (Proxima, 2023) and Talk to Me Human (Least Significant Bit, 2024), embrace LLM generation to the point where it becomes the central gameplay mechanic. In these games, the quirks of LLM generation become a type of “emergent gameplay” rather than a bug.
Finally, AI People (GoodAI, 2024) creates LLM-powered simulations that include NPC dialogues, behaviors, and relationships. While it’s not available to play at the time of writing this post, it is an example of the potential of LLMs to create interesting sandbox and simulation games. There are still major technical challenges to overcome before LLM-powered simulations are immersive and cost effective.
Generative AI NPCs
Recent game demos and early access games have created novel gaming experiences using AI-first gameplay. Many of these AI-first games let players speak (or type) to AI NPCs that reply dynamically using LLMs. While interesting, all of the current examples of AI NPCs bring to light unique challenges that need to be overcome before this type of gameplay becomes mainstream.
Dynamic Conversation
I’ve heard some players refer to gameplay with AI NPCs as “railroading” while others call it too open-ended. These critiques seem to be at odds, but both are valid and stem from two distinct issues.
First, players feel gameplay is too open-ended because of an overwhelming amount of freedom. Games often advertise the quantity of choices provided to players, but too many choices can start to feel like work and detract from gameplay. After all, games are meant to be fun first, and realism isn’t always fun. A simple solution to prevent player choice overload is to guide players with generated dialogue options or suggestions.
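As a rough illustration of this idea, a game could ask the model for a small, fixed number of suggested replies and present those in the dialogue UI. The sketch below is hypothetical: the `llm` function stands in for whatever completion API a studio uses, and a hand-written fallback keeps the UI functional if the model’s output fails to parse.

```python
import json

def llm(prompt: str) -> str:
    # Placeholder for a real completion API call; returns canned JSON here.
    return '["Ask about the stolen amulet", "Offer to help", "Say goodbye"]'

FALLBACK_OPTIONS = ["Continue", "Ask a question", "Leave"]

def suggest_replies(npc_line: str, scene_summary: str, n: int = 3) -> list[str]:
    """Generate up to n short player reply suggestions for a dialogue UI."""
    prompt = (
        f"Scene: {scene_summary}\n"
        f"NPC says: {npc_line}\n"
        f"Return a JSON list of {n} short player replies."
    )
    try:
        options = json.loads(llm(prompt))
        if isinstance(options, list) and options:
            return [str(o) for o in options[:n]]
    except json.JSONDecodeError:
        pass
    # Hand-written defaults keep the UI working if generation misbehaves.
    return FALLBACK_OPTIONS[:n]

print(suggest_replies("Please, you have to find my amulet!", "A village tavern at night"))
```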
Second, many AI NPCs have been given strict talking points to act as safety guardrails for generation and to ensure the plot moves forward. However, it breaks player immersion when NPCs refuse to speak about topics outside of the predetermined set of talking points. As mentioned previously, limiting user input is an alternative form of safety guardrail, albeit one that reduces player freedom. There are also more immersive methods for maintaining narrative control than railroading NPC dialogues; I discuss one such method in the “Goal-Oriented Narrative Planning” section.
In-Game Grounding
Another issue with generative dialogue systems is their inability to reason about their surroundings. In general, this problem is referred to as “grounding” and is a difficult challenge when AI is “embodied” in settings such as video games (Nottingham et al., 2023).
During dialogue, players often refer to things that they can see in the game, but LLM-powered NPCs are not natively aware of these objects. While vision language models are getting better at processing visual input (Li et al., 2022), they are expensive and more prone to errors on rendered images. Alternatively, a thorough description of the scene can help ground an LLM, but too much grounding information can decrease the quality of generated dialogue. Instead, grounding systems should provide only the relevant grounding information as needed. This can be done as a preprocessing step (Nottingham et al., 2024a) or by letting the LLM actively query for it (Schick et al., 2023).
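As a minimal sketch of the preprocessing approach, the relevance filter below uses naive word overlap to decide which scene objects make it into the prompt. A production system would presumably use embeddings or a learned selector instead, and the prompt format is purely illustrative.

```python
def relevant_objects(player_line: str, scene: dict[str, str], limit: int = 3) -> list[str]:
    """Rank scene objects by naive word overlap with the player's utterance."""
    words = set(player_line.lower().split())
    ranked = sorted(scene.items(),
                    key=lambda kv: -len(words & set(kv[1].lower().split())))
    return [f"{name}: {desc}" for name, desc in ranked[:limit]]

def grounded_prompt(player_line: str, scene: dict[str, str]) -> str:
    """Build an NPC prompt containing only the grounding the reply needs."""
    facts = "\n".join(relevant_objects(player_line, scene))
    return f"Nearby objects:\n{facts}\n\nPlayer: {player_line}\nNPC:"

scene = {
    "rusty sword": "an old rusty sword hanging above the fireplace",
    "map": "a worn map of the northern coast pinned to the wall",
    "stew pot": "a bubbling stew pot hanging over the fire",
}
print(grounded_prompt("Where does that map on the wall lead?", scene))
```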
In addition to environmental objects, players can also refer to past events or character backgrounds that were not anticipated during development. Due to a “yes response bias”, LLMs tend to agree with whatever players assert, making NPCs inconsistent during dialogues if adequate guardrails are not in place. A better approach is to maintain a database of in-game knowledge and facts. Then, whenever an NPC needs to generate new information, the system can verify that it does not conflict with existing knowledge and store the new information for later use.
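A minimal sketch of such a knowledge store might look like the following, where `llm_judge` is a placeholder for a model call constrained to answer yes or no about consistency:

```python
def llm_judge(question: str) -> bool:
    # Placeholder for an LLM call constrained to answer "yes" or "no";
    # it always agrees here so the snippet runs on its own.
    return True

class KnowledgeBase:
    """Store established in-game facts and vet new ones before use."""

    def __init__(self) -> None:
        self.facts: list[str] = []

    def try_add(self, new_fact: str) -> bool:
        """Accept a newly generated fact only if it fits existing canon."""
        known = "\n".join(self.facts) or "(no established facts yet)"
        consistent = llm_judge(
            f"Established facts:\n{known}\n\n"
            f"Is this new statement consistent with them? {new_fact}"
        )
        if consistent:
            self.facts.append(new_fact)  # becomes canon for future dialogue
        return consistent

kb = KnowledgeBase()
kb.try_add("The blacksmith's daughter left for the capital last spring.")
```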
Immersive Interactions
Existing examples of generative dialogue for AI NPCs suffer from non-immersive generated text and text-to-speech systems. Today’s text-to-speech systems are not quite human-level, as they often fail to generate natural-sounding intonation. This technology will continue to improve, but using text alone or combining text with pseudo-speech may be a better approach for many games today.
At times, the generated text from AI NPCs feels unnatural since most off-the-shelf LLMs are not optimized for dialogue. The most accessible LLMs today are trained with various instruction-following and human-alignment stages. While this makes them easy to use and general purpose, their output tends to be overly polite and verbose. Developers should use dialogue-optimized LLMs when possible.
Alternative LLM Applications
AI NPCs dominate the conversation around LLMs for games because (1) they are the most obvious use case, (2) they represent a technological dream, and (3) they are well suited for startup funding. However, there are many other interesting applications for LLMs in games such as adaptive tutorials, in-game coaches, match commentators, and NPC barks. Many of these use cases do not suffer from the same challenges that AI NPCs do, and they can be more easily integrated into traditional game design. Also, since many of these applications are more narrow than open-ended dialogue, they can utilize different approaches that provide increased safety and control.
Development Tools
There are many methods for applying generative AI to speed up and improve the game development process. Generative AI can assist with concept generation, prototyping, and placeholder content without ever being used in a final product. Generative AI can also be leveraged in game development tools to increase the quantity and quality of in-game content. Generative AI tools such as ChatGPT and GitHub Copilot can already be applied to various aspects of game development. However, many tasks common to game design merit gaming-specific tooling that utilizes generative AI.
For example, LLMs such as ChatGPT are excellent at writing linear dialogue, but most dialogue used in game design is not linear. When writing a new dialogue option, game writers need to consider what the alternative options are, where the dialogue is going, and how the dialogue branches might merge together in the future. Naively inputting this information into an LLM is not as effective as utilizing an architecture created to handle such non-linear dependencies, and further research and development of such specialized tools will be of great benefit to the game industry.
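To make this concrete, below is a sketch of what such a tool might hand the model. The graph representation and prompt format are my own illustration, not any existing tool’s API: the point is that the tool, not the writer, assembles the sibling branches and the downstream merge point into the context.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueNode:
    """One branch point in a non-linear dialogue graph."""
    speaker: str
    text: str
    children: list["DialogueNode"] = field(default_factory=list)

def branch_prompt(node: DialogueNode, merge_target: str) -> str:
    """Give the LLM the non-linear context a writer keeps in their head:
    the current line, the sibling branches that already exist, and the
    story beat where all branches must eventually reconverge."""
    siblings = "\n".join(f"- {c.text}" for c in node.children) or "(none yet)"
    return (
        f"{node.speaker}: {node.text}\n"
        f"Existing response branches:\n{siblings}\n"
        f"All branches must eventually reach: {merge_target}\n"
        f"Write one new response branch distinct from those above."
    )

root = DialogueNode("Guard", "No one enters the keep after dark.")
root.children.append(DialogueNode("Player", "I carry a letter from the duke."))
print(branch_prompt(root, "The guard relents but quietly alerts the captain."))
```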
Constrained Generation
Constrained text generation limits the words that an LLM can output when generating text. This can range from blacklisting specific words to requiring that output follows a specific grammar or template. Using a system like this, game developers can control what type of output is generated without needing to explicitly enumerate and program every possible input and output combination.
For example, game developers often use brief, isolated pieces of dialogue called “barks” to communicate NPC intent and reactions to the player. These are typically designed to be as independent as possible from the game state so that they can be used in a variety of situations. However, an LLM provided with recent barks from other NPCs and the current game state could generate situationally relevant barks, improving the illusion that background NPCs are aware of their surroundings and conversing with each other. On top of this, templates can be designed to constrain what bark content can be generated while remaining far less tedious than manually writing barks for every situation. This same approach can be applied elsewhere, such as to commentators in sports games.
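As a sketch of how a bark template might combine with constrained generation, the example below whitelists every slot filler, so the full space of possible barks is enumerable and reviewable ahead of time. Here `llm_pick` is a stand-in for a model call constrained to the listed choices (random here so the snippet runs on its own).

```python
import random

# Every slot is filled from a whitelist, so the full space of possible
# barks can be enumerated and reviewed before shipping.
BARK_TEMPLATE = "Did you {verb} that {noun} near the {place}?"
SLOT_CHOICES = {
    "verb": ["see", "hear"],
    "noun": ["explosion", "stranger", "dragon"],
    "place": ["gate", "market"],
}

def llm_pick(slot: str, choices: list[str], context: str) -> str:
    # Stand-in for an LLM call constrained to `choices`; a real system
    # would pick the filler most relevant to `context`.
    return random.choice(choices)

def generate_bark(context: str) -> str:
    fillers = {slot: llm_pick(slot, choices, context)
               for slot, choices in SLOT_CHOICES.items()}
    # Guaranteed safe: only whitelisted words ever reach the player.
    return BARK_TEMPLATE.format(**fillers)

print(generate_bark("A dragon was just spotted over the market district."))
```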
In other settings, a game developer may want to allow players to say anything to a system but system outputs do not need to be open-ended. For example, a “smart” coach or tutorial system can be designed to answer arbitrary player questions via constrained outputs from an LLM. By designing a set of templated responses and conditioning on game knowledge and the player’s question, LLMs can generate relevant factual information to assist a player.
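A minimal sketch of this pattern: the player’s question is free-form, but the system only ever responds with one of a handful of vetted templates filled from trusted game data. The keyword-based `classify_question` below stands in for an LLM call constrained to output one template key.

```python
# The player can type anything, but the coach can only ever respond with
# one of these vetted templates, filled from trusted game data.
RESPONSE_TEMPLATES = {
    "item_location": "You can find {item} in {location}.",
    "ability_usage": "Use {ability} when {situation}.",
    "unknown": "I'm not sure about that one. Check your quest log for hints.",
}
GAME_FACTS = {
    "item_location": {"item": "the grappling hook", "location": "the old mill"},
    "ability_usage": {"ability": "Parry", "situation": "an enemy flashes red"},
}

def classify_question(question: str) -> str:
    # Stand-in for an LLM call constrained to output one template key.
    q = question.lower()
    if "where" in q:
        return "item_location"
    if "when" in q or "how" in q:
        return "ability_usage"
    return "unknown"

def coach_answer(question: str) -> str:
    key = classify_question(question)
    return RESPONSE_TEMPLATES[key].format(**GAME_FACTS.get(key, {}))

print(coach_answer("Where do I get the grappling hook?"))
```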
Personalized Content
Besides the ability to generate coherent text, the power of LLMs comes from their ability to adapt based on textual instructions or demonstrations. Including demonstrations in the input of an LLM, called in-context learning (Brown et al., 2020), can be a powerful way to condition LLM outputs on feedback (Nottingham et al., 2024b). This means that LLMs can change their behavior at runtime according to player preferences. This ability can be used to create anything from adaptive behavior trees to personalized procedural content generation, giving every player a custom experience.
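As a simple sketch, in-context personalization can be as direct as formatting a log of (content, reaction) pairs into the prompt as demonstrations. The format below is illustrative:

```python
def personalization_prompt(feedback_log: list[tuple[str, str]], request: str) -> str:
    """Format (content, player reaction) pairs as in-context demonstrations
    so the next generation is conditioned on this player's tastes."""
    demos = "\n\n".join(
        f"Content: {content}\nPlayer reaction: {reaction}"
        for content, reaction in feedback_log
    )
    return (
        "Past content and how this player reacted:\n\n"
        f"{demos}\n\n"
        f"Generate new content this player will enjoy: {request}"
    )

log = [
    ("A stealth-focused side quest", "loved it, replayed it twice"),
    ("A long escort mission", "abandoned it halfway through"),
]
print(personalization_prompt(log, "a side quest for the harbor district"))
```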
Controlling AI companions in a first-person game can be tedious, and programming automatic behaviors may not suit every player’s playstyle. Instead, players could “train” their AI companions to behave as desired by reinforcing or disincentivizing behaviors. By using an LLM to generate behaviors instead of a traditional game AI system, behaviors can be conditioned on individual player preferences to continuously adapt and grow with the player.
Recently, researchers demonstrated using LLMs for procedural content generation and level design (Sudhakaran et al., 2024). By leveraging individual player feedback on previous levels, LLMs can be conditioned on the favorite levels of specific player types when generating future levels. This same concept can be used for generating quests, dungeons, and enemies in various genres.
Goal-Oriented Narrative Planning
Many of the previously discussed applications either constrain LLMs to generate content that fits inside an existing narrative (e.g., Inworld’s Origins; constrained generation) or allow LLMs to freely generate content in an open-ended simulation (e.g., AI Dungeon; AI People). However, in many cases, we would like to combine the open-endedness of LLM-powered interactions with human-written narrative. In other words, how can we direct AI NPC dialogues toward plot points without railroading the conversation? And how can we give players freedom to shape the game’s narrative while still arriving at a satisfying resolution?
For the time being, a naive implementation of an AI NPC may be able to chat with a player about arbitrary subjects, but the conversation will be aimless. If instructed to address a certain topic, LLMs tend to do so with a vigor that disrupts the natural flow of the conversation. Instead, LLM-powered planning techniques can naturally guide dialogue toward desired narrative goals. An LLM-powered planner can sample and search potential future responses to steer conversation toward topics that will naturally resolve goals. This process is similar to goal-oriented action planning (Orkin, 2003), where actions are dialogue utterances, action preconditions are the “naturalness” of an utterance as measured by generation probabilities, and a search heuristic measures the similarity between an utterance and the goal or how well the current utterance resolves the goal.
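A bare-bones version of this planner might look like the sketch below: sample several candidate replies along with their generation log-probabilities, score each by naturalness plus similarity to the narrative goal, and emit the best one. Both `sample_candidates` and the word-overlap `goal_similarity` are stand-ins (a real system would sample from the LLM and use embedding similarity), and the weighting is arbitrary.

```python
import re

def sample_candidates(dialogue: str, n: int = 4) -> list[tuple[str, float]]:
    # Stand-in for sampling n candidate replies from an LLM along with
    # their total log-probabilities (canned values so the snippet runs).
    return [
        ("Nice weather on the coast lately.", -8.0),
        ("Speaking of ships, the harbormaster has been acting strange.", -11.0),
        ("Have you heard about the missing cargo?", -10.0),
        ("THE HARBORMASTER IS THE THIEF.", -25.0),
    ][:n]

def goal_similarity(utterance: str, goal: str) -> float:
    # Stand-in for semantic similarity (e.g. cosine similarity between
    # sentence embeddings); crude word overlap keeps this self-contained.
    u = set(re.findall(r"\w+", utterance.lower()))
    g = set(re.findall(r"\w+", goal.lower()))
    return len(u & g) / max(len(g), 1)

def plan_utterance(dialogue: str, goal: str, weight: float = 10.0) -> str:
    """Pick the reply that stays natural (high log-probability) while
    nudging the conversation toward the narrative goal."""
    return max(sample_candidates(dialogue),
               key=lambda c: c[1] + weight * goal_similarity(c[0], goal))[0]

goal = "reveal suspicion about the harbormaster and the missing cargo"
print(plan_utterance("Player: How's business at the docks?", goal))
```

A fuller planner would search several dialogue turns deep rather than greedily committing to a single utterance.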
In the future, a similar approach could be used to create an AI game master capable of directing players towards desired game goal states. Game master goals may include arriving at a significant encounter or initiating a pre-written plot point. Game master actions may include spawning enemies near players or initiating dialogues with certain NPCs. By conditioning the game master on summaries of the in-game state, an LLM can be used to predict the effect of actions on the state and search for future states that map to goals.
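A one-step lookahead sketch of this idea, with placeholder functions standing in for the LLM’s state prediction and for the goal-closeness score:

```python
import re

def predict_state(state: str, action: str) -> str:
    # Stand-in for an LLM call that writes the likely next state summary
    # given the current state summary and a game master action.
    return f"{state} Then the game master {action}."

def goal_score(state: str, goal: str) -> int:
    # Stand-in for a closeness-to-goal measure over state summaries.
    s = set(re.findall(r"\w+", state.lower()))
    g = set(re.findall(r"\w+", goal.lower()))
    return len(s & g)

def choose_action(state: str, goal: str, actions: list[str]) -> str:
    """One-step lookahead: simulate each candidate action with the LLM
    and keep the one whose predicted state looks closest to the goal."""
    return max(actions, key=lambda a: goal_score(predict_state(state, a), goal))

state = "The party is resting at the roadside inn."
goal = "the party confronts the bandit chief at the ruined watchtower"
actions = [
    "spawns a wounded traveler who mentions bandits at the ruined watchtower",
    "starts a brawl in the tavern",
    "has the innkeeper offer the party a discount",
]
print(choose_action(state, goal, actions))
```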
The Roadmap
I hope that many of the proposals and ideas in this post are helpful in stimulating conversation and advancing the application of LLMs in games. Some, such as AI NPCs, are already a hot topic of discussion. Others, like AI game masters, are exciting but somewhat further from being realized. Personally, I believe that constrained generation in areas such as AI tutorials, coaches, commentators, and barks has the greatest short-term potential for application in modern game studios. I also believe that leveraging in-context learning based on player feedback to create personalized experiences is an exciting and underexplored application that merits research. While search-based planning methods also have great potential for impact, they would increase the cost of an already expensive technology.
The future of generative AI in games is not clear, but I do not think this technology is as ephemeral as other recent trends (looking at you, web3). I believe an increasing number of enjoyable and innovative games that utilize generative AI will come from indie studios over the next few years, although much of what we see in the next year will likely be more rushed demos. If AI gaming startups position themselves well, they have the potential to impact the next generation of games. However, the needs of every game are so different that I worry most AI gaming startups will not be flexible enough to survive. Only a few large gaming companies are currently investing heavily in generative AI research, and I expect that applications of generative AI in AAA games are still years out and, for the next 5+ years, will be secondary to the main content of the game. My hope is that some mainstream successes of generative AI from indie developers will emerge and increase overall interest.
About Me
I began working on LLMs in games as academic research in 2019, applying GPT-2 to classic text-based games. Since then, I’ve continued researching how to improve the use of LLMs in interactive settings and worked on gaming projects such as LLM-powered Minecraft bots. I’ve also had the opportunity to do reinforcement learning work at gaming-adjacent companies such as Nvidia and Unity, and to improve LLMs for AI Dungeon at the game company Latitude. I’ll also be researching generative AI for games at Riot Games during summer 2024.
I would be grateful for feedback on these ideas and open to discussions about any of these topics. Keep an eye out for new AI gaming project updates from me, and follow me on Twitter if you’d like to stay connected.
Kolby Nottingham
References
Bai, Yuntao, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain et al. “Training a helpful and harmless assistant with reinforcement learning from human feedback.” arXiv preprint arXiv:2204.05862 (2022).
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan et al. “Language models are few-shot learners.” Advances in neural information processing systems (2020).
Gallotta, Roberto, Graham Todd, Marvin Zammit, Sam Earle, Antonios Liapis, Julian Togelius, and Georgios N. Yannakakis. “Large Language Models and Games: A Survey and Roadmap.” arXiv preprint arXiv:2402.18659 (2024).
GoodAI. “AI in Games.” https://www.goodai.com/ai-in-games/ (2024).
Inworld. “Unscripted AI NPCs in a first-of-its-kind Unreal Engine demo.” https://inworld.ai/blog/origins-unreal-engine-demo (2023).
Latitude. “AI Dungeon.” https://aidungeon.com/ (2019).
Least Significant Bit. “Talk to Me Human.” https://talktomehuman.com/ (2024).
Li, Junnan, Dongxu Li, Caiming Xiong, and Steven Hoi. “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation.” In International Conference on Machine Learning (2022).
Nottingham, Kolby, Prithviraj Ammanabrolu, Alane Suhr, Yejin Choi, Hannaneh Hajishirzi, Sameer Singh, and Roy Fox. “Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling.” In International Conference on Machine Learning, (2023).
Nottingham, Kolby, Yasaman Razeghi, Kyungmin Kim, J. B. Lanier, Pierre Baldi, Roy Fox, and Sameer Singh. “Selective Perception: Learning Concise State Descriptions for Language Model Actors.” In North American Chapter of the Association for Computational Linguistics (2024a).
Nottingham, Kolby, Bodhisattwa Prasad Majumder, Bhavana Dalvi Mishra, Sameer Singh, Peter Clark, and Roy Fox. “Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills.” In International Conference on Machine Learning (2024b).
Orkin, Jeff. “Applying goal-oriented action planning to games.” AI game programming wisdom 2 (2003): 217-228.
Schick, Timo, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. “Toolformer: Language models can teach themselves to use tools.” In Advances in Neural Information Processing Systems 36 (2023).
Sudhakaran, Shyam, Miguel González-Duque, Matthias Freiberger, Claire Glanois, Elias Najarro, and Sebastian Risi. “MarioGPT: Open-ended text2level generation through large language models.” In Advances in Neural Information Processing Systems 36 (2024).