SMITSIMAX
In most board games and strategy games, players move simultaneously: you commit to a plan before you see what your opponent picked. Standard MCTS can't handle that cleanly because it builds one tree where moves alternate. SMITSIMAX gives each player their own search tree, lets them select independently, combines the moves for simulation, and backpropagates results to each tree separately. The result is a surprisingly effective algorithm for simultaneous-move, multi-agent games, and it ships in competitive game AI today.
01Why simultaneous moves
Most real games aren't turn-based. In StarCraft, both players issue commands at the same time. In Pokรฉmon, both trainers pick a move before either executes. In competitive Rocket League, all six cars accelerate, boost, and jump on the same tick. Card games like Poker have hidden information that creates the same structural problem: you choose without knowing what your opponent chose.
The game-theory term is simultaneous-move game. The classic example is Rock-Paper-Scissors: both players reveal at the same time. There's no information advantage from going second, and no single "best" move. The only stable strategy is a Nash equilibrium, a probability distribution over actions.
Minimax doesn't apply: it assumes one player moves, then the other responds, creating an information asymmetry that doesn't exist in simultaneous games. Standard MCTS has the same problem, as we'll see in ยง4. SMITSIMAX is the fix: a variant of MCTS designed from the ground up for games where everyone acts at once.
A working understanding of simultaneous-move MCTS from first principles. The standard MCTS loop, why it breaks on simultaneous moves, and the decoupled-tree fix. UCB1 selection, information sets, the convergence question (does it find Nash?), regret matching as the theoretically grounded alternative, and practical tuning. Six live widgets, including a dual-tree visualizer that animates the algorithm step by step, a Rock-Paper-Scissors playground where you play against a Smitsimax agent, and a convergence race comparing UCB1, Exp3, and regret matching on a matrix game.
02A short history
The academic lineage is clear: UCB1 (2002) โ UCT (2006) โ CadiaPlayer's decoupled trees (2007) โ formal analysis of SM-MCTS variants (2013โ2016) โ practitioner adoption under the "Smitsimax" name (2018) โ neural-network integration (2024โ2025). The algorithm MSmits named in 2018 maps closely to Decoupled UCT from the academic literature, with the implementation difference that Smitsimax uses entirely separate tree data structures rather than a shared tree with decoupled statistics.
03Standard MCTS
Before fixing MCTS for simultaneous moves, we need the baseline. Standard MCTS builds a search tree incrementally over many iterations. Each iteration has four phases:
- Selection. Starting from the root, follow the tree down using a selection policy (UCB1, typically) to pick which child to visit at each internal node. Continue until you reach a node that has unexpanded children.
- Expansion. Add one (or more) child nodes to the tree, representing moves not yet explored.
- Simulation (rollout). From the new node, play random moves for both sides until the game ends. Record the outcome.
- Backpropagation. Walk back up from the new node to the root, updating every node's visit count and win/score tally.
After the time budget runs out, the root's most-visited child is the chosen move. The widget below shows this loop building a tree for a simple two-player game. Each node shows its visit count and win rate; edges are moves.
The key property: each node in the tree corresponds to a unique game state. Player 1 moves, then Player 2 responds, then Player 1 again. The tree interleaves moves. This works for chess, Go, and any game where players alternate turns.
04The simultaneous-move problem
When both players choose at the same time, the alternating-move assumption breaks. Consider a game where Player 1 picks from {A, B, C} and Player 2 picks from {X, Y, Z}, simultaneously. The outcome depends on the joint action (A,X), (A,Y), ..., (C,Z), nine combinations total.
There are two naive ways to force this into a standard MCTS tree, and both are broken:
Approach 1: Sequential UCT (SUCT)
Serialize the simultaneous game into a sequential one. Player 1 picks first, then Player 2 responds. This creates a valid tree, but Player 2 now has an information advantage that doesn't exist in the real game. In Rock-Paper-Scissors under SUCT, Player 2 always sees Player 1's throw before choosing and always wins. The algorithm "solves" a different, easier game than the one being played.
Approach 2: Joint-action tree
Each node has one child per joint action. For 3 actions per player, that's 9 children. For 10 actions per player, 100 children. For k players with m actions each, mk children per node. The branching factor explodes exponentially in the number of players. A four-player game with 20 actions each has 160,000 children per node. The tree is too wide to explore meaningfully in any reasonable time budget.
05Decoupled trees
Smitsimax solves both problems with one structural change: give each player their own search tree. Instead of one tree with interleaved or joint moves, player 1 has a tree where each node represents one of player 1's moves, and player 2 has a separate tree where each node represents one of player 2's moves. In a four-player game, there are four independent trees.
On each iteration:
- Each player independently descends their own tree using UCB1, selecting a move at each depth level.
- The selected moves are combined into a joint action.
- The game state is advanced using that joint action.
- This repeats to max depth (or until terminal).
- The final state is scored.
- Each player's tree is backpropagated independently with that score.
The branching factor per tree is just m (the number of actions for that player), not mk. No player sees another's selection before choosing. And because UCB1 operates independently per tree, each player's search is symmetric: the quality of opponent modeling matches the quality of self-play.
This is the core insight. Each tree "sees" the opponent's strategy only through the statistical distribution of outcomes across iterations. If Player 1's tree discovers that move A leads to high scores, Player 2's tree will eventually discover that its counter to A is best, which in turn shifts Player 1's statistics, and so on. The trees co-adapt through the shared simulation results, without ever directly observing each other's structure.
06UCB1 and selection
Selection within each tree uses the formula, the same bandit algorithm that drives standard MCTS. For a child node i with parent p:
The first term is exploitation: favor children with high average reward. The second term is exploration: favor children that have been visited less often relative to the parent. The constant C balances the two. In the original UCB1 paper[1], C = โ2 gives a theoretical regret bound. In practice, Smitsimax implementations often use C values between 0.3 and 1.4, tuned to the game.
Smitsimax adds a common modification: score normalization. Because different games produce scores on different scales, the exploitation term is normalized to [0, 1]:
The widget below lets you adjust C and watch how the selection policy shifts. Low C produces greedy, exploitative play. High C spreads visits more evenly, finding moves that are underexplored but potentially strong.
One practical detail from MSmits's implementation[9]: the first ~10 visits to a node use random selection rather than UCB1. Early UCB1 scores are noisy (a node visited once with a lucky win has a reward of 1.0), and random selection in the first few visits avoids committing to a path based on a single sample.
07The full algorithm
Putting it together. The pseudocode below is for k agents in a simultaneous-move game. Each agent has its own tree rooted at a node whose children correspond to that agent's legal moves. The game engine advances the state given a joint action (one move per agent).
function SmitsimaxSearch(initialState, timeBudget):
trees = []
for each agent i in 0..k-1:
trees[i] = new TreeNode(root=true)
lowestScore = [+INF for each agent]
highestScore = [-INF for each agent]
while time < timeBudget:
state = initialState.clone()
currentNodes = [trees[i] for each agent i]
// โโ Selection + Expansion + Simulation (to max depth) โโ
for depth in 0..maxDepth-1:
jointAction = []
for each agent i:
node = currentNodes[i]
if node.visits == 1:
node.expand(state.legalMoves(i))
child = node.selectChild() // random if visits < 10, else UCB1
child.visits += 1
currentNodes[i] = child
jointAction[i] = child.move
state.advance(jointAction)
if state.isTerminal():
break
// โโ Scoring โโ
scores = state.evaluate() // one score per agent
// โโ Update normalization bounds โโ
for each agent i:
lowestScore[i] = min(lowestScore[i], scores[i])
highestScore[i] = max(highestScore[i], scores[i])
// โโ Backpropagation โโ
for each agent i:
node = currentNodes[i]
while node != null:
node.totalScore += scores[i]
node = node.parent
// โโ Choose best move โโ
for each agent i:
bestMove[i] = trees[i].mostVisitedChild().move
return bestMove
This pseudocode omits several production concerns: handling variable-length games (some branches terminate early, some agents eliminated before others; covered in ยง13), tree reuse between turns, parallelization (virtual loss, root parallelism), and domain-specific rollout policies. A real Smitsimax bot also needs fast state cloning and incremental legal-move generation.
Why there is no separate expansion phase
Classical MCTS tutorials teach a four-phase loop: select down to the first unexplored node, expand it, rollout randomly from there, backpropagate. Readers often wonder whether Smitsimax should do the same: stop descending when a tree hits an unexplored node, then simulate randomly for the remaining depth.
It cannot, because the trees are decoupled. The game state at each depth is determined by the joint action of all agents. If Player 1's tree is only 3 nodes deep but Player 2's is 8 deep, the iteration still must advance the shared game state to maxDepth (or terminal) so that both trees receive a score. Player 1's tree simply gets expanded along the way: when a node has visits == 1 it creates children, selects one (randomly for the first few visits), and the loop continues. There is no point at which one player's tree "stops" while the others keep descending, because the simulation engine needs a move from every agent at every depth to advance the state.
Concretely: when a tree is shallower than the current iteration depth, it expands a new child at each remaining level and selects randomly among the fresh children (the parentVisits < 10 guard in selectChild). The tree grows one node deeper per depth-step, every iteration, until it spans the full maxDepth. A "shallow" tree never pauses or defers to a rollout policy; it just has more levels where selection is random rather than UCB1-guided.
The pseudocode above reflects this directly: the depth loop always runs to maxDepth, and expansion is folded into selection (line if node.visits == 1: node.expand(...)). No iteration terminates early due to reaching an unexplored node. Early termination only happens when the game state itself is terminal.
The selection function
function selectChild(node, parentVisits, C, sMin, sMax):
if parentVisits < 10:
return randomChild(node)
bestScore = -INF
bestChild = null
scale = sMax - sMin
if scale == 0: scale = 1 // avoid division by zero early on
for each child in node.children:
exploitation = (child.totalScore / child.visits) / scale
exploration = C * sqrt(ln(parentVisits) / child.visits)
ucb = exploitation + exploration
if ucb > bestScore:
bestScore = ucb
bestChild = child
return bestChild
08Information sets
A node in a standard MCTS tree maps to exactly one game state. A node in a Smitsimax tree does not. Consider Player 1's tree at depth 2. On iteration 37, Player 1 selected moves AโB, while Player 2 (in their own tree) selected XโY. On iteration 52, Player 1 selected the same AโB, but Player 2 selected XโZ this time. Player 1's node for "B" was visited in both iterations, but the underlying game state was different because the opponent made different choices.
This is analogous to an in game theory: a set of game states that are indistinguishable from a player's perspective. Player 1 knows they played AโB but doesn't know what Player 2 did, so all game states consistent with Player 1 playing AโB belong to the same information set. The Smitsimax node for "B" aggregates statistics over this entire information set.
This has a consequence: you cannot store state-dependent information in a Smitsimax node. In standard MCTS, you can cache the game state at each node. In Smitsimax, the game state at a node varies between iterations. Every iteration must re-simulate from the root to reconstruct the actual game state at each depth.
Smitsimax requires that each player's legal moves are independent of the other players' choices at the same turn. If Player 2's choice of X can make Player 1's move C illegal (it was legal when Player 2 chose Y), the tree is invalid: node C appears as a child in Player 1's tree, but the simulation will fail on iterations where Player 2 happens to pick X. Most simultaneous-move games satisfy this constraint naturally (each player picks from their own fixed action set), but games with interaction-dependent move legality need a different approach[9].
09Rock-Paper-Scissors playground
Rock-Paper-Scissors is the simplest simultaneous-move game. Two players, three actions each, symmetric payoffs. The unique Nash equilibrium is to play each action with equal probability (1/3, 1/3, 1/3). Any deviation from uniform can be exploited by an opponent who shifts weight toward the counter-action.
The widget below lets you play RPS against a Smitsimax agent. The agent runs 1,000 iterations of Smitsimax before each round, building a tree of depth 1 (one simultaneous round). The bar chart on the right shows the agent's current strategy (visit distribution over Rock, Paper, Scissors). Against a rational opponent, the strategy should converge toward 1/3 each, but the agent will adapt to exploit any pattern in your play.
10Convergence and Nash
A natural question: does Smitsimax converge to a as iterations increase?
The answer is no. Lisรฝ et al. proved at NeurIPS 2013[5] that UCB1 (and by extension UCT) does not converge to Nash equilibrium even in simple one-stage simultaneous-move games. The problem is that UCB1 is deterministic: given the same statistics, it always picks the same child. A deterministic selection policy can be exploited in a simultaneous-move game because the opponent can predict your choice. Nash equilibria in simultaneous-move games are typically mixed strategies (probability distributions), and a deterministic policy can only converge to a pure strategy.
For selection policies to converge to Nash, they need to be Hannan consistent in the adversarial sense. Two algorithms satisfy this:
- Exp3[12] (Exponential-weight algorithm for Exploration and Exploitation): selects stochastically, weighting actions by their exponential cumulative reward. Convergence proven but slow in practice.
- Regret Matching[13] (Hart and Mas-Colell, 2000): selects with probability proportional to positive cumulative regret. If both players use regret matching in a two-player zero-sum game, their average strategies converge to Nash.
Paradoxically, Decoupled UCT (Smitsimax) is competitive with or outperforms Exp3 and regret matching in many tested games, despite lacking convergence guarantees[6][7]. The relative ranking is game-dependent: DUCT wins in some games, regret matching in others. The likely reason DUCT does well in practice: convergence guarantees are about asymptotic behavior as iterations approach infinity. In finite time budgets (50 ms to 1 second), UCB1's ability to aggressively exploit promising branches gives it a practical advantage. The convergence gap matters most when your opponent is specifically trying to exploit your non-Nash strategy over many rounds.
11Regret matching
is the theoretically sound alternative to UCB1 for simultaneous-move selection. The idea: after each round, compute the regret for not having played each action, then select with probability proportional to accumulated positive regret.
For action a after round t:
The selection probability for action a on round t+1:
If all regrets are zero or negative, play uniformly at random. The convergence theorem (Hart and Mas-Colell, 2000[13]): if both players in a two-player zero-sum game use regret matching, their average strategies converge to a Nash equilibrium.
This is the foundation of (CFR)[14], the algorithm that solved heads-up limit Texas Hold'em[15] and underlies most modern poker AI. CFR extends regret matching to extensive-form (sequential) games by computing counterfactual regrets at each decision point. Smitsimax can use regret matching instead of UCB1 at each tree node, gaining convergence guarantees at the cost of weaker finite-time exploitation.
In practice, competitive Smitsimax implementations almost always use UCB1. The convergence guarantee matters when you need a provably optimal strategy (poker, where any exploitable leak costs real money) but matters less in time-limited game AI where "good enough in 50 ms" beats "theoretically optimal at infinity."
12Tuning
Five parameters control Smitsimax performance:
1 ยท Exploration constant (C)
The UCB1 exploration parameter. Too low, the search gets stuck in a local optimum. Too high, it wastes iterations exploring bad moves. MSmits reports[9] using values between 0.3 and 1.0 depending on the game. A good starting point is C = 0.7; tune from there.
2 ยท Max search depth
How many turns ahead to simulate per iteration. Deeper search sees further but spends more time per iteration, reducing the total number of iterations. For fast games with a 50 ms time budget and roughly 50,000 simulations per second, a depth of 10โ20 is typical. Longer games (100+ turns) benefit from shallower search with more iterations.
3 ยท Random selection threshold
The number of visits to a node before switching from random to UCB1 selection. MSmits uses roughly 10[9]. Setting this to 0 (pure UCB1 from the start) causes the algorithm to over-commit to whichever child wins the first rollout.
4 ยท Score normalization
Normalizing the exploitation term by (Smax โ Smin) keeps the UCB1 formula balanced regardless of score scale. Without normalization, the exploration constant C has a different effective strength for different games or different stages of the same game.
5 ยท Rollout policy
The original Smitsimax description expands all the way to max depth on every iteration (no random rollout). MSmits later found that adding random rollouts after expansion gave a small improvement[9]. The tradeoff is simulation fidelity vs iteration count: guided rollouts produce more accurate estimates per iteration but are slower, reducing the total number of iterations in a fixed time budget.
13Variable-depth trees
The pseudocode in ยง7 loops from depth 0 to maxDepth and breaks on terminal states. In real games, branches terminate at different depths: a snake crashes on turn 3, a unit dies on turn 5, while other branches run to the full depth limit. Three bias mechanisms emerge that the basic algorithm does not handle.
The bias
Score-scale mismatch. Games that accumulate score over time produce larger raw numbers at deeper depths. A branch terminating at depth 3 might score 15; one reaching depth 20 might score 180. Backpropagating raw scores to the same parent averages numbers from incompatible scales. The deep branch dominates the mean even when both strategies perform equally well per turn.
Visit-count skew. Under a time budget, shallow branches complete faster (less cloning, fewer move-generations per iteration). A depth-3 branch accumulates visits more quickly than a depth-20 branch. UCB1's exploration term shrinks with visit count, so the shallow branch gets exploited earlier and the deep branch stays under-explored, regardless of actual quality.
Subtree-size uncertainty. UCB1 tracks one kind of uncertainty: how many times a node has been visited. It ignores another: how much unexplored tree remains below each child. A node above a massive, barely-touched subtree has far more remaining uncertainty than a node above a shallow, fully-sampled subtree, even at identical visit counts. Moerland et al. (2018) formalize this as tree-structure uncertainty and propose a modified UCB formula that accounts for it[19].
Depth-discounted backpropagation
The most common fix in competitive implementations: apply a discount factor ฮณ (0 < ฮณ < 1) during backpropagation. Each step from leaf toward root multiplies the score by ฮณ.
function backpropagate(currentNodes, scores, gamma):
for each agent i:
discountedScore = scores[i]
node = currentNodes[i]
while node != null:
node.totalScore += discountedScore
discountedScore *= gamma // each step toward root attenuates the signal
node = node.parent
With ฮณ = 0.97, a score evaluated at depth 10 reaches the root multiplied by 0.9710 โ 0.74. A score at depth 3 is multiplied by 0.973 โ 0.91. Near-term outcomes propagate more strength to the root than distant ones. This is the same mechanism as the discount factor in reinforcement learning, where ฮณ encodes diminishing value of future rewards[9].
Tradeoff: ฮณ < 1 biases toward short-sighted play. If the optimal strategy requires sacrificing short-term score for a large payoff at depth 15, discounting undervalues that line. In competitive CodinGame implementations, values between 0.95 and 0.99 are common; 0.97 is a reasonable starting point.
Depth-normalized evaluation
Instead of (or alongside) discounting, normalize the evaluation output so that scores at different depths are comparable before backpropagation.
function evaluateNormalized(state, currentDepth):
rawScores = state.evaluate()
for each agent i:
// Convert to per-turn value: removes depth-proportional scaling
rawScores[i] = rawScores[i] / max(currentDepth, 1)
return rawScores
For games with cumulative scoring, dividing by depth converts raw score to "score per turn," which is depth-invariant. For games with a known maximum achievable score at each depth, normalizing by that maximum maps everything to [0, 1]. This interacts with the score normalization from ยง12: compute the Smin/Smax bounds over normalized scores, not raw scores, otherwise the running bounds will track the depth-scale difference rather than the strategic difference.
Agent elimination
In multi-agent games, agents get eliminated at different points. A dead agent can't act, but the game continues for surviving agents. Two approaches for the dead agent's tree:
Pass-through nodes. When an eliminated agent's tree is descended during an iteration, select a special "pass" child at each depth beyond the elimination point. The pass child carries no action; the game state advances using only the surviving agents' moves. Statistics still accumulate on the pass child, because the dead agent's tree needs to learn whether dying at that point led to a good or bad final outcome (e.g., 1st place vs. 4th).
// Inside the depth loop from ยง7:
for each agent i:
node = currentNodes[i]
if not state.isAlive(i):
// Dead agent: descend through a pass node, contribute no action
child = node.getOrCreatePassChild()
jointAction[i] = NO_ACTION
else:
if node.visits == 1:
node.expand(state.legalMoves(i))
child = node.selectChild()
jointAction[i] = child.move
child.visits += 1
currentNodes[i] = child
Freeze the tree. Stop selecting and expanding the dead agent's tree entirely after elimination. Simpler to implement, but the agent's tree cannot learn from post-death outcomes. If scoring depends on final placement or team result, pass-through nodes are more informative.
Proven terminals
When a simulation reaches a terminal state (game over), that result is exact, not a statistical estimate. Winands, Bjรถrnsson, and Saito (2008) introduced MCTS-Solver[18], which propagates proven game outcomes up the tree as hard values rather than folding them into the running average. When a branch consistently reaches the same terminal outcome, the node is marked as proven and excluded from further exploration.
In Smitsimax, proven results are harder to establish than in sequential games. A node in one player's tree maps to an information set, not a single game state (ยง8). A terminal outcome in one iteration may not recur in the next because the opponent's concurrent selections lead to a different game state at the same tree node. A node can only be marked proven if the outcome holds across all possible opponent paths. The practical application is narrower: if an agent's tree consistently reaches terminal death at a given depth regardless of opponent play, penalize that subtree during selection to avoid wasting iterations on it.
For a broader treatment of backup operators, Dam et al. (2020) show that replacing the arithmetic mean with a power mean (exponent p > 1) during backpropagation reduces the negative bias inherent in average-based backup[20]. The paper addresses MCTS backup generally, not variable-depth trees specifically, but the bias it corrects is exacerbated when deep and shallow evaluations are mixed. The technique requires bounded rewards in [0, 1] (compatible with the normalization from ยง6 and ยง12) and adds a single tuning parameter.
14Case studies
1 ยท Code of Kutulu (CodinGame, 2018)
A four-player simultaneous-move game set in a Lovecraftian maze. Players control explorers fleeing from wandering monsters and a minion controlled by another player. MSmits placed 4th in the contest using the first implementation of Smitsimax[9]. Four trees, one per agent (explorer + minion), each selecting movement and ability actions independently. The scoring function combined distance-from-monsters, sanity level, and survival time.
2 ยท Coders Strike Back (CodinGame, ongoing)
A pod-racing game with two pods per player (four pods total). Each pod has a continuous action space (target angle + thrust), discretized to a manageable set. MSmits reached top 10 using Smitsimax with four trees (one per pod). The key insight: allied pods cooperated emergently. One pod would act as a blocker, shielding the other from opponents, without any explicit cooperation logic. The cooperation emerged from the shared evaluation function: both allied pods receive the same score, so the blocker's tree converges on defensive moves because they lead to higher team scores.
3 ยท SNAKEBYTE (CodinGame Winter Challenge, 2026)
A multi-snake simultaneous-move game. The first-place solution by "frictionless" used Smitsimax with one tree per snake, simulating roughly 50,000 turns per 50 ms time budget. The fifth-place solution by "sZoom" used Smitsimax with 4,000โ6,000 iterations per 45 ms and roughly 250,000 total simulations. Both confirmed the multi-tree structure in their postmortems.
4 ยท CadiaPlayer (GGP, 2007โ2008)
Bjรถrnsson and Finnsson's CadiaPlayer[3] used decoupled trees for simultaneous-move games in the AAAI General Game Playing competition, winning in both 2007 and 2008. The implementation predates the Smitsimax name by a decade and is the direct academic antecedent.
5 ยท Goofspiel (academic benchmark)
Goofspiel (the Game of Pure Strategy) is the primary benchmark for SM-MCTS research[7]. Both players simultaneously bid cards to win prize cards. Lanctot et al. tested all SM-MCTS variants on Goofspiel and found DUCT (the Smitsimax structure) competitive with or ahead of Sequential UCT and Exp3 in win rate within typical computation budgets, though regret matching performed better in some game configurations.
15When it breaks
Smitsimax is not universally applicable. It breaks or performs poorly in several situations:
- Move-dependent legality. If Player 2's choice can make Player 1's moves illegal, the tree structure is invalid. Player 1's tree contains nodes for moves that may not be legal depending on what Player 2 does. The simulation will either crash or produce undefined behavior on those iterations. Games with blocking, capture-prevention, or interaction-dependent move sets need a different approach (ISMCTS with determinization, or joint-action MCTS with pruning).
- State-dependent evaluation. You can't attach a heuristic that depends on the game state to a Smitsimax tree node, because the game state at that node varies between iterations. Heuristics must go into the rollout or terminal evaluation, not the selection policy.
- Deep sequential dependencies. If the game's strategic depth comes from long sequences of actions where each action depends heavily on the opponent's prior action, Smitsimax's independence assumption weakens. As noted by competitive player "blasterpoard" in a 2024 CodinGame postmortem: "opponents' 2nd move will heavily depend on whether I chose to dive," making the independence between trees a liability in games with strong inter-player causal chains.
- Very large action spaces without discretization. Smitsimax needs a finite set of children per node. Continuous action spaces (e.g., "pick any angle from 0ยฐ to 360ยฐ and any thrust from 0 to 100") must be discretized, and the discretization granularity directly affects tree branching factor and search quality.
- Highly asymmetric information. Smitsimax handles simultaneous moves (both players choose without seeing each other's choice) but not general hidden information (where players have different knowledge of the game state, as in card games). For that, ISMCTS[4] with determinization is the standard approach.
16Pitfalls
Bugs in Smitsimax implementations, ranked by frequency:
- Forgetting to re-simulate from root each iteration. Because nodes don't map to unique game states, you can't cache the state at a node and pick up from there. Every iteration must clone the initial state and re-simulate the full path of joint actions from depth 0. Caching the state at a Smitsimax node gives you the state from a previous iteration with a different opponent path, producing garbage.
- Updating the wrong tree's statistics. When backpropagating, each tree receives its own agent's score. If you accidentally give Player 2's score to Player 1's tree, the search converges on the opponent's goal instead of your own.
- UCB1 with zero visits. A child node with zero visits produces a division-by-zero in the exploration term. Ensure every child is visited at least once before applying UCB1, or handle the zero-visit case explicitly (return +โ to force exploration of unvisited children).
- Not normalizing scores. If your game's scores range from 0 to 10,000, the exploitation term dominates UCB1 at any reasonable C value, and the search degenerates into greedy exploitation. Normalize by (Smax โ Smin) or use a fixed known range.
- Random seed in simulation producing the same rollout every iteration. If your game simulation is deterministic and you don't introduce randomness in the rollout, every iteration from the same joint-action path produces the same outcome. The tree never explores alternative outcomes and converges prematurely. Use a different random seed per iteration.
- Expanding too early or too late. Expanding a node on its first visit (before any statistics exist) means UCB1 immediately faces fresh children with no data. Expanding too late wastes iterations replaying the same path. The sweet spot is to expand on the first visit but use random selection until visit count reaches a threshold (~10).
- Allied agents fighting each other. In team games, allied agents must share the same score. If each agent maximizes a different objective, allies' trees will converge on conflicting strategies. The fix is a shared evaluation function that rewards team outcomes, not individual ones.
17What's next
- ISMCTS for hidden information. If your game has fog of war, hidden cards, or private knowledge, combine Smitsimax's decoupled trees with ISMCTS's determinization[4]: sample a consistent world state at the start of each iteration, then run Smitsimax on that sample.
- Neural network evaluation. Replace the random rollout with a neural-network value estimator, as in AlphaZero. Becker and Sunberg's Simultaneous AlphaZero[11] does exactly this, with a separate policy head per player and a regret-optimal solver for the matrix game at each node.
- CFR integration. For games where Nash equilibrium convergence matters (poker, auctions), use CFR[14] instead of UCB1 as the selection policy. Online Outcome Sampling (OOS) is the search variant of CFR suitable for real-time play.
- Equilibrium approximation at scale. Yu et al.'s 2024 work[10] integrates no-regret learning with neural networks, approximating coarse correlated equilibrium at each tree node. This is the current frontier for combining theoretical soundness with practical performance.
- Multi-objective Smitsimax. Some games have multiple objectives (score, survival, territory). Pareto-front approaches to multi-objective MCTS apply to Smitsimax: maintain multiple scores per node and select based on a weighted combination or Pareto dominance.
For the game AI programmer, the most useful next step is to implement Smitsimax on a simple simultaneous-move game (Goofspiel, or a grid-based chase game) and tune it. The algorithm is simple enough to implement in an afternoon; the complexity is in the game-specific evaluation function and the tuning of C, depth, and rollout policy.
Sources
- Peter Auer, Nicolรฒ Cesa-Bianchi, Paul Fischer. "Finite-time Analysis of the Multiarmed Bandit Problem." Machine Learning 47(2โ3):235โ256, 2002. DOI: 10.1023/A:1013689704352. link.springer.com/article/10.1023/A:1013689704352. The UCB1 formula that drives selection in MCTS and Smitsimax.
- Levente Kocsis, Csaba Szepesvรกri. "Bandit Based Monte-Carlo Planning." In Proc. 17th European Conference on Machine Learning (ECML), pp. 282โ293, 2006. DOI: 10.1007/11871842_29. link.springer.com/chapter/10.1007/11871842_29. Introduces UCT (Upper Confidence bounds applied to Trees), making MCTS practical.
- Yngvi Bjรถrnsson, Hilmar Finnsson. "CadiaPlayer: A Simulation-Based General Game Player." IEEE Transactions on Computational Intelligence and AI in Games 1(1):4โ15, March 2009. DOI: 10.1109/TCIAIG.2009.2018702. ieeexplore.ieee.org/document/4804823. CadiaPlayer used separate trees per player for simultaneous-move games in GGP, the first documented implementation of what would later be called Decoupled UCT.
- Peter I. Cowling, Edward J. Powley, Daniel Whitehouse. "Information Set Monte Carlo Tree Search." IEEE Transactions on Computational Intelligence and AI in Games 4(2):120โ143, 2012. DOI: 10.1109/TCIAIG.2012.2200894. ieeexplore.ieee.org/document/6203567. ISMCTS for hidden-information games; searches over information sets rather than game states.
- Viliam Lisรฝ, Vojtฤch Kovaลรญk, Marc Lanctot, Branislav Boลกanskรฝ. "Convergence of Monte Carlo Tree Search in Simultaneous Move Games." In Advances in Neural Information Processing Systems 26 (NeurIPS), 2013. papers.nips.cc. Proves UCB/UCT does not converge to Nash equilibrium in simultaneous-move games; epsilon-Hannan consistent algorithms (Exp3, RM) do under additional conditions.
- Mandy J. W. Tak, Marc Lanctot, Mark H. M. Winands. "Monte Carlo Tree Search Variants for Simultaneous Move Games." In Proc. IEEE Conference on Computational Intelligence and Games (CIG), 2014. Systematic comparison of SUCT, DUCT, Exp3, and RM on simultaneous-move games.
- Marc Lanctot, Viliam Lisรฝ, Mark H. M. Winands. "Monte Carlo Tree Search in Simultaneous Move Games with Applications to Goofspiel." In Computer Games, pp. 28โ43, Springer, 2014. link.springer.com/chapter/10.1007/978-3-319-14923-3_3. Detailed analysis of SM-MCTS variants on Goofspiel and nine GGP games.
- Branislav Boลกanskรฝ, Viliam Lisรฝ, Marc Lanctot, Jiลรญ ฤermรกk, Mark H. M. Winands. "Algorithms for Computing Strategies in Two-Player Simultaneous Move Games." Artificial Intelligence 237:1โ40, 2016. DOI: 10.1016/j.artint.2016.03.005. sciencedirect.com/science/article/pii/S0004370216300285. Comprehensive survey covering normal-form solvers, double oracle, and all SM-MCTS variants for simultaneous-move games.
- Michiel Smits (MSmits). "Smitsimax." CodinGame/Tech.io playground, 2018. github.com/Innoble/playground-aez0swj4. The practitioner tutorial that named the algorithm and described the implementation details (score normalization, random-visit threshold, expansion timing).
- Ryan Yu, Alex Olshevsky, Peter Chin. "Tree Search for Simultaneous Move Games via Equilibrium Approximation." arXiv:2406.10411, June 2024. Published in Proc. GameSec 2025, Springer, 2026. arxiv.org/abs/2406.10411. Integrates no-regret learning with neural networks for simultaneous-move tree search; approximates coarse correlated equilibrium.
- Tyler Becker, Zachary Sunberg. "Simultaneous AlphaZero: Extending Tree Search to Markov Games." arXiv:2512.12486, December 2025. arxiv.org/abs/2512.12486. AlphaZero extended to simultaneous-move games with separate policy heads per player and a regret-optimal solver at each node.
- Peter Auer, Nicolรฒ Cesa-Bianchi, Yoav Freund, Robert E. Schapire. "The Nonstochastic Multiarmed Bandit Problem." SIAM Journal on Computing 32(1):48โ77, 2002. DOI: 10.1137/S0097539701398375. epubs.siam.org/doi/10.1137/S0097539701398375. Introduces the Exp3 algorithm for adversarial bandits.
- Sergiu Hart, Andreu Mas-Colell. "A Simple Adaptive Procedure Leading to Correlated Equilibrium." Econometrica 68(5):1127โ1150, 2000. DOI: 10.1111/1468-0262.00153. onlinelibrary.wiley.com/doi/10.1111/1468-0262.00153. Introduces regret matching; proves convergence to correlated equilibrium.
- Martin Zinkevich, Michael Johanson, Michael Bowling, Carmelo Piccione. "Regret Minimization in Games with Incomplete Information." In Advances in Neural Information Processing Systems 20 (NeurIPS), 2007. papers.nips.cc. Introduces Counterfactual Regret Minimization (CFR) for extensive-form games.
- Michael Bowling, Neil Burch, Michael Johanson, Oskari Tammelin. "Heads-up Limit Hold'em Poker is Solved." Science 347(6218):145โ149, January 2015. DOI: 10.1126/science.1259433. science.org/doi/10.1126/science.1259433. CFR+ solves heads-up limit Hold'em; the first imperfect-information game of nontrivial size to be essentially solved.
- Vojtฤch Kovaลรญk, Viliam Lisรฝ, Marc Lanctot, Branislav Boลกanskรฝ. "Analysis of Hannan Consistent Selection for Monte Carlo Tree Search in Simultaneous Move Games." Machine Learning 109(1):1โ50, 2020. DOI: 10.1007/s10994-019-05832-z. link.springer.com/article/10.1007/s10994-019-05832-z. Follow-up analysis showing additional conditions needed for HC algorithms to converge in SM-MCTS.
- Dennis J. N. J. Soemers et al. "Ludii Example AI: ExampleDUCT.java." Ludii General Game System. github.com/Ludeme/LudiiExampleAI. Reference implementation of Decoupled UCT for simultaneous-move games in the Ludii framework.
- Mark H. M. Winands, Yngvi Bjรถrnsson, Jahn-Takeshi Saito. "Monte-Carlo Tree Search Solver." In Proc. 6th International Conference on Computers and Games (CG), pp. 25โ36, Springer, 2008. DOI: 10.1007/978-3-540-87608-3_3. link.springer.com/chapter/10.1007/978-3-540-87608-3_3. Introduces proven-value propagation for MCTS: terminal wins and losses are backed up as exact values rather than statistical samples, and proven-loss children are excluded from selection.
- Thomas M. Moerland, Joost Broekens, Aske Plaat, Catholijn M. Jonker. "Monte Carlo Tree Search for Asymmetric Trees." arXiv:1805.09218, 2018. arxiv.org/abs/1805.09218. Formalizes tree-structure uncertainty as a second type of uncertainty that standard UCB does not account for, arising from asymmetric termination of search trees. Proposes a modified UCB formula and reports improved efficiency on OpenAI Gym and Atari 2600 environments.
- Tuan Dam, Pascal Klink, Carlo D'Eramo, Jan Peters, Joni Pajarinen. "Generalized Mean Estimation in Monte-Carlo Tree Search." In Proc. 29th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2397โ2404, 2020. ijcai.org/proceedings/2020/332. Replaces arithmetic-mean backup in MCTS with a power mean (exponent p > 1), reducing the negative bias inherent in average-based value estimation. The paper addresses MCTS backup generally; the correction also benefits variable-depth settings where the bias is amplified.