Murmuring into the void
~/ruminations

Toward Universal Board Game AI

Ch 2. How Dominion got Solved

Wherein the author hoots and jibbers excitedly.

In the first chapter I explained how we can make an AI for any game that loosely resembles chess, even long games with many possible moves that'd be extremely expensive to explore. You can also make AI for games with hidden info and randomness (like card games) with some techniques I'll mention.

This still left an enormous gap between those sorts of games, and modern board games with heaps of components and expansions and cards with unique abilities. Good luck fitting those into the simple fixed-size input-output model we talked about last chapter. Most modern games skip AI, or make a quite bad AI, so (as far as I know) the Dominion app broke a lot of ground.

You keep saying Dominion. What's Dominion?

Dominion is a deck-building card game. Everyone starts with a bad deck of cards and improves their deck over time. There are over 800 cards, and they can do all kinds of things and interact with each other. Some cards are only good in combination, or only good at the end of the game. Which cards are available varies between games, so a card that's great in one game might be unhelpful in another. You can play many cards on one turn, and many cards cause players to make choices.

So we need

  • To be capable of long-term planning and of quickly narrowing down many choices (we want AlphaZero, covered last chapter)
  • To be able to handle hidden information and randomness (Not new or crazy, just makes things complicated so I left it out of chapter 1. I'll gloss over it.)
  • To efficiently learn hundreds of cards (using a technique called Embeddings)
  • To somehow encode the highly varied state into a model for AlphaZero to predict with (using Transformer architecture, relatively new and most notably used in language models)
  • And to somehow get pretty much any arbitrary choice out of that model. No longer can you say there is some fixed number of possible legal moves, like 9 in tic-tac-toe. A card could ask you to choose literally anything, from naming a number for an auction, to picking two cards from hand, to choosing what order to trigger effects in, etc. (a bizarre solution)

For the rest of the post, I'll talk about these in order. I'll be brief!

Credit due: I'm only able to write about how the Dominion AI works (rather than scratch my head, slack jawed) because amazingly, they publicly explained how they made it, and patiently answered many of my questions about the parts they didn't cover.

Hidden Info and Randomness

Technically those are separate but they come up together a lot - like a shuffled deck of cards. Both problems have been solved for a long time, but I left them out of chapter 1 because they make MCTS weirder.

One key thing is that a state + an action no longer leads to a single following state. If you play top left in tic-tac-toe the result is always the original board plus your piece in that box. If you draw a card or roll a die there's a range of possible results. No longer can we see the MCTS tree as a state connected by actions to other states - now each action can lead to a number of following states and we only learn which it is by running the game simulator and rolling that virtual die.

Hidden info is quite related. Anything that's unknown gets shuffled before each simulation, so that drawing a card inside the simulation gives a plausible random result. For AlphaZero, this does mean we need some way to signify "face down card", which is another little wrinkle.

Information Leaking: If you didn't randomize something that should be hidden, the MCTS simulations would use the real information, and the AI would end up cheating since the action scores are based on information it shouldn't be able to see.
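
As a concrete (and heavily simplified) sketch of that shuffling step - the zone names here are invented for illustration, and the real game has many more hidden zones:

```python
import random

def determinize(state, rng):
    """Re-randomize everything the current player can't see, before
    running one MCTS simulation. Zone names are hypothetical."""
    s = state.copy()
    # Pool together every card that's hidden from the current player...
    hidden = s["opponent_hand"] + s["opponent_deck"] + s["my_deck"]
    rng.shuffle(hidden)
    # ...and deal them back out, keeping only the counts fixed, so the
    # simulation can't peek at what's actually there.
    n_hand, n_deck = len(s["opponent_hand"]), len(s["opponent_deck"])
    s["opponent_hand"] = hidden[:n_hand]
    s["opponent_deck"] = hidden[n_hand:n_hand + n_deck]
    s["my_deck"] = hidden[n_hand + n_deck:]
    return s
```

Every fresh simulation gets its own re-shuffled copy, so over many simulations the tree averages over what the hidden cards could plausibly be.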

Note that much of the time, hidden info is not actually random! While your deck of cards is probably random, your opponent's hand is a product of all the actions they've taken and their own strategy. Getting a more accurate distribution of what the hidden state could be is a rabbithole of a problem, and it means that games based on opponent-prediction or bluffing are particularly challenging.

How to Learn Many Cards Efficiently

The machine learning model needs to map parts of the input, like "black pawn" onto a list of numbers it can use later for the prediction. You can think of that list of numbers as being a way of representing the 'meaning' of the input - everything that helps it score well during training will be represented, but it's quite hard for a human to tell what the numbers mean.

Chess has two colors times six types for a total of twelve piece inputs to map - but Dominion has hundreds of cards, and that first step of mapping each card in play from its ID number to its meaning gets to be slow and expensive. Embeddings are just a way of saving those mappings into a quick lookup table, like a dictionary.

LLMs also use embeddings, to map words to their meanings.
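
That lookup table really is about as simple as it sounds. A minimal sketch - the sizes are arbitrary choices, and a real table would be trained rather than left random:

```python
import numpy as np

rng = np.random.default_rng(0)

N_CARDS = 800  # roughly how many distinct cards exist
DIM = 64       # length of each "meaning" vector - an arbitrary choice

# One trainable row of numbers per card ID. It starts random;
# training nudges each row toward a useful representation.
embedding_table = rng.normal(size=(N_CARDS, DIM))

def embed(card_ids):
    """Map card IDs to their meaning vectors - just an array lookup."""
    return embedding_table[card_ids]

vectors = embed([3, 141, 3])  # same ID -> same vector, shared meaning
```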

How to Represent Extremely Diverse States

This is where it starts getting well outside of the "simple game" assumptions I made in chapter 1. Traditional machine learning models take a fixed-sized list of numbers and train weights to work with each slot in the list. If the top left tic-tac-toe square is slot 0 and top center is 1, it'll learn the relationship between them from the weights tied to those slots. This relies on the game being fully represented with a fixed list like that, where each slot always represents the same thing.

In Dominion, you have a variable number of players, with a variable-sized deck, hand, and discard pile. The center of the table has a variable number of piles with variable numbers of cards in them. Also you can have any number of many kinds of tokens on things. Extremely awkward to try to have a fixed index for everything.

Dominion handles this with Transformers, which is a rather new and exciting machine learning technique that's also used by language models to understand natural text (words are similarly not organized into fixed slots).

I don't want to make this long and technical or need to find pictures, so I'll leave you with an incomplete but hopefully compelling gesture at the technique.

Very short version: Split the content into a variable length list of parts. Each part is fixed-size, so each pair of parts can do normal fixed-size machine learning with each other. Every part updates every other part, many times, so that eventually each part is affected by the important context from the rest of the game. If a card is usually good, but another card is available which counters it, after the context updates the card won't look so good anymore.

Less short but hopefully tolerable: Rather than the entire input being a fixed-size list of numbers representing the entire current game position, you can split the game up into individual parts, called "tokens", and make a list of numbers for each. This is far, far more flexible.

You can have different kinds of tokens, and different kinds can have different lengths. Dominion makes a token for every card (with information like the card's ID, location, any counters on it, etc), as well as a token for each player (how many victory point tokens they have, etc), and one token for "the game" for any global information. Trainable machine-learning math converts all these to be same-length lists representing the token's 'meaning'.

Now since all the tokens have fixed-size, every token can use the same trainable parameters to combine their value with every other token. Based on the token's values the weights figure out if they're relevant to each other, and how they affect each other. The output is another list of tokens that has been contextualized by the rest. In a language model, the word "basket" preceding the word "ball" dramatically changes the meaning. In Dominion, it can spot synergies or counters this way.
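
Stripped of every refinement (multiple heads, normalization, feed-forward layers, and so on), the core of that token-mixing step looks something like this sketch, with untrained random weights standing in for learned ones:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 8  # per-token vector length, tiny for illustration

# Trainable projection matrices; real ones would be learned.
Wq, Wk, Wv = (rng.normal(size=(DIM, DIM)) for _ in range(3))

def attention(tokens):
    """One round of self-attention: every token scores its relevance
    to every other token, then pulls in a weighted mix of their
    values. Works for ANY number of tokens - that's the point."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(DIM)                # pairwise relevance
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax per token
    return weights @ v                             # contextualized tokens
```

Feed it 3 tokens or 300 and it returns the same number back, each now updated by the others; stacking several such rounds gives every token deeper and deeper context.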

So Dominion makes a card token for every card anywhere in the game: a long list of numbers containing the card's Embedding (see above), its location, how many tokens of each type it has, etc. Then every card updates every other card in several rounds. At the end there is still one token for each card, but now the numbers have been updated to contain everything the policy and value outputs need.

Dominion also makes a token for each player and a token for "the game itself", which are kept in the same list. The "game" token is mapped by one final trained layer down to a single number, and that's the Value output.

Why is the game token able to predict how good the board position is? Why does any of this work, in general? I glossed over how machine learning works in chapter 1. Briefly, it starts by not working at all. The math is full of many numbers called parameters, which start off random and useless. You run the model on example data where you already know the correct answer, and measure how terrible a job it's doing, and use some calculus to figure out for every parameter in the math, how that parameter should be adjusted to make the answer better. You fiddle slightly with every knob, and try again with another example where you know the answer, and repeat until the knobs are very well fiddled and the pile of math does a good job.

But what about the Policy output??

This is the most boggling part of the implementation. For simple games, there's a fixed number of actions at any stage in the game, and you just give them a score. Dominion cards can give totally arbitrary choices like "choose two (of the unknown count) of cards from your hand to discard." Trying to imagine how this could be done was the most hair-tugging, floor-pacing part of the investigation.

When you're at a choice point, and you need the new policy: Input the state into the transformer architecture, as described above, and save the output tokens. The model does not output a policy itself, because it has no idea what the legal actions are. The legal actions could be anything, so there's no fixed size list that could evaluate them.

But the code for the game rules does know the legal actions (it'd have to, to run the game). So, rather than the machine learning model outputting the policy directly, the game code goes through each legal move, and asks another separate machine learning model specifically for evaluating a single possible action.

The first model took the game state, and output a processed version of the state where each card was updated from context, which was also used to get the Value. The Policy Model is a second model which uses some of the outputs from the first model, based on what the game code decides is necessary. I think this is a pretty bizarre architecture - you could see it like a model with traditional code running in the middle.

The game code gives the Policy Model the relevant cards from the first model, now updated with context, and it outputs values for different "verbs" in the game, like "discard", "play", "draw" - actually a very long list.

For example, if the legal action to evaluate is "Buy a Village", it'd take the context-updated token representing the top Village card for sale, and pass it to the model. The model returns a list of numbers for different verbs, and the code checks the output for "Buy" for the input card "Village", and that's how good it is for you to buy a village right now. Every legal move is scored that way, individually, bespoke to the type of action, and eventually you have a score for every legal move - a full policy.
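
Here's roughly how I picture that scoring loop - the names are made up, the verb list is a tiny illustrative subset, and the weights would really be trained:

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 8
VERBS = ["play", "buy", "discard", "trash"]  # illustrative subset

# Hypothetical policy head: one contextualized card token in,
# one score per verb out.
W_policy = rng.normal(size=(DIM, len(VERBS)))

def score_action(card_token, verb):
    """Score a single legal (verb, card) action."""
    return (card_token @ W_policy)[VERBS.index(verb)]

def policy_for(legal_moves, tokens):
    """The game code, not the model, supplies legal_moves as
    (verb, token_index) pairs; their scores get softmaxed
    into one probability per legal move."""
    raw = np.array([score_action(tokens[i], verb) for verb, i in legal_moves])
    p = np.exp(raw - raw.max())
    return p / p.sum()
```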

Frankly there are many edge cases that don't fit nicely into "noun-verb", and the full system is a lot more gnarly than I'm making it sound, but it does work.


So ~superhuman AI was made for a game that looked impossible to make an AI for. Dominion definitely isn't the hardest game to solve, but it's up there. At the end of Chapter 1 we were able to solve anything that vaguely resembled checkers, all at once with the same techniques. So can we now solve basically every game? Could I make it easy to get a powerful AI for any game from just its rules? Wow, find out in the next chapter. (Coming later)


Toward Universal Board Game AI

Chapter 1: Simple Games

Wherein the author dumps a great deal of context necessary to understand his upcoming Main Point.

I play against the AI on the Dominion app most days, and it beats me around 80% of the time despite me having played for years. That's fantastic. You get the benefit of a high-level playing group, always available, faster than a human opponent would play. With access to AI like this, a game developer could evaluate game balance and optimal strategies automatically. It also gives me an exciting feeling of technological victory over the world. Achieving that kind of AI is very hard, and I'm going to write about it.

It's a bit rude to call Chess or Go "simple games", but I'll do it anyway because I'm comparing them with things like Magic: The Gathering (~30k cards, absurd interactions).

I don't mean that these games are easy, or lack depth, or don't take a lifetime to master. By simple games, I'm gesturing towards games with

  • Consistent and constrained rules and structure. Chess doesn't suddenly get a DLC adding portals or weather.
  • Consistent and constrained action spaces. TicTacToe and Go have one potential action per space, and any arbitrary Chess move could be represented by its start and end points. In complex modern board games you might be asked to choose between anything that the game rules could support.

Simple game AI is pretty much taken care of! You can lose to AI at any simple game of your choice.

TicTacToe and a Game Tree

A superhuman TicTacToe AI isn't much fun, but imagine you wanted to make one anyway. TicTacToe starts with 9 possible actions, then 8 responses to each of those actions, then 7 responses, etc., in a big tree. 9x8x7x6x5x4x3x2x1 makes about 360k total lines of play, which is frankly not that much for a computer. On every turn, the AI could easily look at every possible course of action and every possible counteraction until every possible ending state, and pick one that's most beneficial. In other words, if there exists any path by which the AI can guarantee a victory (or more likely a draw...), it will take it.
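
That brute-force search fits in a screenful of Python. A sketch (memoized so it finishes instantly; boards are 9-character strings, which is just one convenient choice):

```python
from functools import lru_cache

LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(b):
    for i, j, k in LINES:
        if b[i] != " " and b[i] == b[j] == b[k]:
            return b[i]
    return None

@lru_cache(maxsize=None)
def best_outcome(b, player):
    """Explore every course of action and counteraction: return the
    result of perfect play ('X', 'O', or 'draw') from board b with
    `player` to move."""
    w = winner(b)
    if w:
        return w
    moves = [i for i, c in enumerate(b) if c == " "]
    if not moves:
        return "draw"
    other = "O" if player == "X" else "X"
    results = [best_outcome(b[:m] + player + b[m+1:], other) for m in moves]
    if player in results:  # some path guarantees a win - take it
        return player
    if "draw" in results:  # otherwise settle for a draw
        return "draw"
    return other           # otherwise the opponent wins no matter what
```

Run on the empty board, it confirms the famous result: perfect play from both sides is a draw.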

That alternating between thinking of what you should do, and thinking how the opponent should respond (and how you should respond to that, and so on) is the general architecture behind game AI. There are all kinds of different versions, but at their heart that's what they're doing.

TicTacToe sucks though, let's play something else. Connect4 looks similar on the surface and doesn't seem much harder, but instead of a state-tree of 9x8x7x... it's more like 7x7x7x7x... for maybe 35 turns. That comes out to about "38" followed by 28 "0"s, or 380 billion billion billion. That rules out brute-forcing the whole game like we did for TicTacToe, so what can be done?

We still want a tree to give us a sense of which actions lead to good outcomes, but now we have to build the tree efficiently. After some seconds our human opponent will get impatient, so we have a limited number of tree-building-steps to work with.

What makes a useful tree?

  • If there's a good move nearby, we want to take it, so we want to explore all the nearby game states somewhat to find out what's promising. If there's a winning move available and we never look at it, we throw the win away. Think of this as exploring actions to find out what's good.
  • The moves that look more promising are the ones more likely to be chosen in the end, so we get more value from exploring them further. We take a good looking action, the opponent takes a good looking counter-action. Exploring unpromising actions is less likely to change our decision, so they can get less attention. Think of this as exploiting actions we already think are good.

So we want to somehow balance exploring the environment with exploiting the best moves.

Fortunately there's a very popular and powerful algorithm that does exactly that, Monte Carlo Tree Search (MCTS to save my fingers). Every time it has to choose an action branch on the tree to explore, it gives every option a score based on "how much has this action been ignored compared to the other actions at this state" (preferring exploring) + "how much has this action led to good outcomes for the current player at this state" (preferring exploiting).
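
The standard formula for that score is called UCT; a minimal version, with the usual exploration constant left as a tunable knob:

```python
import math

def selection_score(node_visits, action_visits, action_value_sum, c=1.4):
    """UCT-style score for one action at a tree node:
    exploit (average outcome so far) + explore (a bonus that shrinks
    as the action soaks up more visits)."""
    if action_visits == 0:
        return math.inf  # never-tried actions get looked at first
    exploit = action_value_sum / action_visits
    explore = c * math.sqrt(math.log(node_visits) / action_visits)
    return exploit + explore
```

Raising c leans the search toward exploring; lowering it leans toward exploiting.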

But how does MCTS know which actions are good? Obviously if a move wins you the game, it's a good move, but earlier in the game a win might be many turns away.

There's no easy answer to this! If you could perfectly tell how good any game state was just by looking at it, you could just look at every action from your current position and take the one that leads to the best state. Instead we have to lean on rough estimates.

  • The "Monte Carlo" part is alluding to randomness (because casinos, I guess), and the default solution proposed with the algorithm was "random rollouts", where you figure out how good a state is by having both players play randomly until someone wins. That kind of works because the player closer to winning will hopefully be more likely to win randomly, but it's very noisy and slow.
  • You could also hand-write some code to estimate, like in Connect4 having 3-in-a-rows is good and your opponent having them is bad. You could probably write a pretty good estimate if you're an expert and have time to experiment, but that's hard.
  • Machine learning, which I'll get to later under "AlphaZero"
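
A random rollout from that first bullet is only a few lines. Connect4 itself would take real plumbing, so here's the idea on a toy stand-in game invented purely so the sketch runs:

```python
import random

class TakeawayGame:
    """Toy stand-in for Connect4: players alternately remove 1 or 2
    stones; whoever takes the last stone wins."""
    def __init__(self, stones=5, player=0):
        self.stones, self.player = stones, player
    def is_over(self):
        return self.stones == 0
    def winner(self):
        return 1 - self.player  # the player who just moved took the last stone
    def legal_moves(self):
        return [n for n in (1, 2) if n <= self.stones]
    def apply(self, n):
        return TakeawayGame(self.stones - n, 1 - self.player)

def random_rollout(state, rng):
    """The 'Monte Carlo' estimate: both players play uniformly random
    legal moves to the end, and we report who won. One rollout is very
    noisy; averaging many gives a rough value for the starting state."""
    while not state.is_over():
        state = state.apply(rng.choice(state.legal_moves()))
    return state.winner()
```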

So you repeatedly pick a part of the tree to expand, selecting an action based on how under-explored it looks and how promising that part of the tree looks. Eventually, you run out of time and you pick the move that's currently most promising! Even with random rollouts that'll make a decent Connect4 AI right there, and you can always wait longer to give it more simulation time.

I think I won't write a new section for Chess because it branches off from the destination I'm writing towards, but in short, powerful Chess engines use:

  • A different tree-search algorithm called minimax (usually with alpha-beta pruning) instead of MCTS
  • Really good board evaluation estimates, which more recently added machine learning
  • Really efficient game handling to make the simulations fast

Go and AlphaGo

So you're feeling pretty good, thinking you've got a nice general MCTS game player. You can plug in Checkers, Othello, Chess (with some difficulty), pretty much any 'simple' game you can think of.

Go is basically just one of those games with a really big board, so it seems like it should work. It does not.

MCTS was able to do okay on Connect4 with 7 possible actions per turn and maybe 35 turns.

Go has a 19x19 board (361 spaces) and the game usually lasts ~200 turns. Multiplying down the shrinking choices, 361x360x359x..., comes to around "1" followed by 768 "0"s of total tree size. It's pretty hard to fathom how large that number is. MCTS is cooked.

  • In the time Connect4 MCTS could try every action, then every counter-action to each of those, and every counter-counter-action to each of those, Go MCTS would have only tried each square once, so MCTS can't explore effectively without burning out its sim budget.
  • The default solution of random rollouts is pretty awful for Go. It'll take ages to get to the end of the game, and the enormous board is mostly made up of dumb moves. Without decent value estimates, MCTS can't exploit effectively.

Basically it looks impossible, but you probably know Google already cracked it a while ago.

Machine Learning

This is where the machine learning comes in. Machine learning makes a mathematical formula to map some input to some output, like mapping photos to "is cat"/"is dog" labels. Inside the formula are many adjustable parameters, which start off random, causing random output. When the formula gets a wrong answer, using a bit of calculus, it's possible to see how all of those parameters should be changed so it would do better. Run that long enough with enough examples and eventually the formula stops outputting garbage and starts being useful.

Machine learning models usually want a fixed-size list of numbers as an input and another fixed-size list of numbers as an output. In between, it does various transformations usually involving yet more fixed-size lists of numbers.

In this case we want the machine learning model to help MCTS play go. For now, let's imagine we already have an enormous mountain of expert play data for the model to learn from.

The input is going to be a list of 361 numbers, like the 19x19 board flattened into a line. Maybe -1 for black, +1 for white, and 0 for empty. Normally, flattening the board would lose all the row/column information, but you could have the architecture inside the model understand it like a grid, so it's okay.
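
That encoding step is nearly a one-liner; a sketch using the numbers above:

```python
import numpy as np

EMPTY, BLACK, WHITE = 0, -1, +1

def encode(board_2d):
    """Flatten a 19x19 Go board (a grid of -1/0/+1 values) into the
    fixed list of 361 numbers the model expects as input."""
    return np.asarray(board_2d, dtype=float).reshape(-1)
```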

What do we want the output to be?

Value

For one thing, we want to get the value of the board state, because rollouts aren't going to work here. The value estimate is just a single number, so we can add a single number output to the end of the machine learning model (such as 1 for "black wins" and 0 for "white wins"), which we can use in MCTS to see if the board is good for the current player. Initially it just gives out random values anywhere from 0 to 1, but we can train it on the boards in our data mountain.

Say for a given board that it outputs 0.31 as its prediction randomly, when the correct answer is 1 because black won. The model is then updated so that for boards like that one it'd give an answer closer to 1. We want many small nudges, because we don't want it to just memorize which states black won after - we want it to have a deep, nuanced understanding of what situations lead to black winning.
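
Those "small nudges" are just gradient steps. A toy value head on a TicTacToe-sized board makes it concrete - one weight per slot, squashed to a 0-to-1 prediction, with everything here invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy value head: board vector in, one win-probability out.
w = rng.normal(size=9) * 0.1  # 9 slots, like flattened TicTacToe

def value(board):
    return 1 / (1 + np.exp(-board @ w))  # squash to (0, 1)

def train_step(board, outcome, lr=0.5):
    """Nudge the weights so value(board) moves toward the known
    outcome (1 = first player won, 0 = lost)."""
    global w
    err = value(board) - outcome  # how wrong the prediction was
    w -= lr * err * board         # gradient of the log loss

board = np.array([1., -1, 0, 1, 0, 0, -1, 0, 0])
before = value(board)
for _ in range(50):
    train_step(board, 1.0)  # pretend the first player won from here
```

After fifty nudges the prediction for that board has crawled from roughly a coin flip up toward 1 - and with millions of varied boards instead of one, the same mechanism produces a general estimate rather than a memorized answer.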

So now we can plug that value estimate into MCTS and after it explores a state it can ask the model for the value estimate to guide future exploitation. This is tonnes faster and more accurate than a rollout. MCTS goes from "blind until it gets to the endgame" to being able to see the consequences of its moves.

Policy

But actually that's still not good enough! We can now see the value of the state after an action, but we need to spend a simulation to get to that state to see the result. The Go board is so large that there just isn't time to be checking everything. With the machine learning value estimate we can tell when a simulation wasn't helpful, but we only see that after doing the simulation.

If only MCTS could tell how good an action looked before simulating it.

Head back to our machine learning model and add a second output (yes you can do that), this time representing "at a given state, which action should I take?" or more precisely, "what action would an expert from our data mountain take in this state?" This is called the policy.

That may sound weird, because if this model learned to predict perfectly, we wouldn't even need MCTS anymore to play at expert level! It'd be very hard for it to get that good though, so instead it's used as an estimate to guide MCTS.

The output will be a list of 361 numbers, one for every possible action. Again, it starts by outputting largely random numbers like [0.14, 0.85...], when the correct answer is all '0's except for a single '1' representing the action the expert actually took. The model is adjusted so that the number in the slot where the correct answer is 1 gets closer to 1 and all the others get closer to 0. What matters is that after training, the good actions have larger outputs than the bad actions. Imagine there are 3 equally good actions and the rest are terrible: after much training, we'd expect the good actions to return just under 0.33 each (since they'd have been picked by the expert about 1/3 of the time each), and the others to be close to 0.
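
The same nudging works for the 361-slot policy output. Here's a toy 9-slot version (softmax over scores, cross-entropy-style nudges, all sizes invented):

```python
import numpy as np

rng = np.random.default_rng(4)

W = rng.normal(size=(9, 9)) * 0.1  # toy: TicTacToe-sized policy head

def policy(board):
    """Board in, one probability per possible action out."""
    logits = board @ W
    p = np.exp(logits - logits.max())
    return p / p.sum()

def train_step(board, expert_action, lr=0.5):
    """Cross-entropy nudge: the expert's slot gets pushed toward 1,
    every other slot toward 0."""
    global W
    target = np.zeros(9)
    target[expert_action] = 1.0
    W -= lr * np.outer(board, policy(board) - target)

board = np.array([1., -1, 0, 0, 0, 0, 0, 0, 0])
for _ in range(100):
    train_step(board, expert_action=4)  # the expert always takes the center
```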

You might think the output could be a single number, rather than 361. It could output 5 for the 5th space or something. The problem is that would put all the actions on the same spectrum in a troublesome way. If both action 1 and 3 were often picked by the experts, the output might average to 2. Gross.

Okay so now when MCTS is picking which action to look into, in addition to using its explore + exploit score, it can also add in the machine learning model's recommendation. Now instead of every unvisited action having the same score, it already has a sense of which of them are more worth exploring, and it can jump right into exploring the better parts.

So if the model learns well and you have a powerful policy and value prediction, MCTS can race through even enormous spaces like Go's board. The policy lets it waste much less time evaluating bad actions, and the value gives it a much clearer view of which paths it's exploring are worth more attention. Neat.

But that all relied on an extremely convenient mountain of expert data. What do you do if you're not so lucky? What if, instead of a tonne of data, you have ... zero.

AlphaZero

I think this section will be shorter?

We already have our enhanced MCTS framework, and we just need to replace where we get the data from. We don't have experts anymore. We need to learn expert play in a cave with a box of scraps.

Our AI needs to be the expert that it learns from.

Start off with the model stupid as usual, but skip the step where you train the model and jump right into playing some games. Take your stupid-model-enhanced MCTS and make two copies with fun names like player_0 and player_1. Make them play against each other, terribly, and take notes. These losers are our experts.

After they've played for a while, train the machine learning model just like before, but using our self-made data. The key thing is, this data is still better than nothing. The values from those games do give some indication for which states were good, and the policies (action choices) do show which actions score well from MCTS simulation. It won't learn to be an expert in one step, but it'll learn to be slightly less bad. Then play more games and train again, and do that over and over again and it should hopefully become less bad each time until eventually it wipes the floor with you.
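
In outline, with every hard part hidden behind a made-up name, that loop is just:

```
# Pseudocode - every function name here is hypothetical shorthand.
model = randomly_initialized_model()
for generation in range(many_generations):
    games = [self_play_one_game(mcts_guided_by(model))
             for _ in range(many_games)]
    # Each game yields (state, mcts_policy, final_outcome) triples.
    data = [triple for game in games for triple in game]
    train(model, data)  # the same supervised training as before
```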

Get comfortable waiting, because Google's superhuman AlphaZero Go player was trained over millions of games spread out across many powerful processors, which I estimate might cost about 200k USD on rented infrastructure.

So that sounds dismal. Yet somehow the Dominion AI, another extraordinarily difficult game to play, was trained in only a couple months on unspecialized home hardware. Next post is all about the Dominion AI and the other ways in which it is incredible.
