These annotations were generated using Stockfish 16.1, one of the strongest chess engines.

Here's an example of our data format:

| Example Data | P: | 6k1/7p/ 4P1q1 /1pb1Q2p/ 2p1b3 /2P4P/ PP4P1 /R6K w \- \- 9 38 | M: | e5g5 a1g1 e5b8 e5h2 e5e4 | E: | \-999.97 \-2.97 \-1.63 \-6.59 \-5.95 | B: | e5b8 |
| :---- | ----- | :---- | ----- | :---- | :---- | :---- | :---- | ----- |
| Field Explanation | Prefix | State ([FEN](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation)) \+ padding | Delimiter | Top 5 Moves, shuffled \+ padding | Delimiter | Top 5 Moves Eval \+ padding | Delimiter | Best Move |
| Inference | Prompt | | Generated Chain-of-Thought Tokens | | | | | Action |
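
The following is a minimal sketch of how such a sample can be assembled with [python-chess](https://github.com/niklasf/python-chess) and a local Stockfish binary. The MultiPV depth, the shuffling of candidates, and the mate-score mapping are assumptions about the generation pipeline, and the field padding is omitted.

```python
# Sketch: building one ROOK-style training sample from a FEN with Stockfish.
import random
import chess
import chess.engine

STOCKFISH = "/usr/local/bin/stockfish"  # path to a local Stockfish binary (adjust)

def rook_sample(fen: str, engine: chess.engine.SimpleEngine, depth: int = 20) -> str:
    board = chess.Board(fen)
    # Top 5 candidate lines from Stockfish (MultiPV).
    infos = engine.analyse(board, chess.engine.Limit(depth=depth), multipv=5)
    moves, evals = [], []
    for info in infos:
        moves.append(info["pv"][0].uci())
        # Centipawns from the side to move; mates map to roughly +/-1000 after /100,
        # which reproduces values like -999.97 in the example row above.
        cp = info["score"].pov(board.turn).score(mate_score=100000)
        evals.append(round(cp / 100, 2))
    best = moves[0]                                        # MultiPV line 1 is the best move
    order = random.sample(range(len(moves)), len(moves))   # shuffle candidates for the CoT fields
    m = " ".join(moves[i] for i in order)
    e = " ".join(str(evals[i]) for i in order)
    return f"P: {fen} M: {m} E: {e} B: {best}"

with chess.engine.SimpleEngine.popen_uci(STOCKFISH) as engine:
    print(rook_sample("6k1/7p/4P1q1/1pb1Q2p/2p1b3/2P4P/PP4P1/R6K w - - 9 38", engine))
```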
We began by creating a new dataset arbiter-6m (inspired by the interface design of reinforcement learning environments), capturing:
* A score-signal for reinforcement learning (reward)
* Whether the game ended and if the move was legal (termination, truncation)

| Example Data | 5R2/6R1/8/ 3P4/p7/ 1b2R2P/ 2p3P1 /6K1 b \- \- 0 58 | b3d5 | e1e3 a2b3 f7f2 b5b4 g5g7 b4c3 f2f8 c3c2 d4d5 b3d5 | 5R2/6R1/ 8/3b4/ p7/4R2P/ 2p3P1/ 6K1 w \- \- 0 59 | 0.001 | 0 | 0 |
| :---- | ----- | :---- | :---- | ----- | :---- | :---- | :---- |
| Field Explanation | Last State ([FEN](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation)) | Action | Action History (maxlen 10\) for [repetitions](https://en.wikipedia.org/wiki/Threefold_repetition) | Observation (new State) | Reward (-1 loss or illegal, 0.5 draw, 1 win, 0.001 valid) | Termination (bool, True if game ends by [WLD](https://en.wikipedia.org/wiki/Chess_scoring)) | Truncation (bool, True if game ends by illegal action) |
| Inference | Prompt | | | Generated Environment Update | | | |
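
The transition fields above map directly onto a python-chess game step. Below is a minimal sketch of serializing one such transition; the "+" delimiter between fields, the illegal-move handling, and the reward perspective are assumptions of this sketch.

```python
# Sketch: serializing one arbiter-6m style transition with python-chess.
from collections import deque
import chess

def arbiter_sample(board: chess.Board, action_uci: str, history: deque) -> str:
    last_fen = board.fen()
    history.append(action_uci)                   # action history includes the current action
    move = chess.Move.from_uci(action_uci)
    if move not in board.legal_moves:
        # Illegal action: reward -1, truncation flag set, state left unchanged.
        return "+".join([last_fen, action_uci, " ".join(history), last_fen, "-1", "0", "1"])
    board.push(move)
    outcome = board.outcome(claim_draw=True)
    if outcome is None:
        reward, termination = 0.001, 0           # valid move, game continues
    elif outcome.winner is None:
        reward, termination = 0.5, 1             # draw
    else:
        reward, termination = 1, 1               # the mover just delivered mate
    return "+".join([last_fen, action_uci, " ".join(history),
                     board.fen(), str(reward), str(termination), "0"])

history = deque(maxlen=10)
board = chess.Board("5R2/6R1/8/3P4/p7/1b2R2P/2p3P1/6K1 b - - 0 58")
print(arbiter_sample(board, "b3d5", history))
```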

**Table 2**: Results overview and dataset scaling

# V— OUTDATED BELOW THIS LINE —V

---

# ROOK, ArbiterSim, and RookWorld: Advancing Strategic Reasoning in Language Models

By Jonathan Rahn & Qi Sun

## Introduction

In the realm of artificial intelligence, the challenge of strategic reasoning has long been a cornerstone for measuring and advancing AI capabilities. Chess, with its complex rules and vast decision space, has served as an ideal benchmark for testing AI systems. Today, we're excited to introduce a suite of innovative language models that push the boundaries of strategic reasoning and learn a world model of chess.

Our project comprises four key components:

1. ROOK (Reasoning Over Organized Knowledge): A chess-playing language model
2. ArbiterSim: A chess environment simulator
3. RookWorld: A unified model for both chess play and environment simulation
4. RookWorld Evol: An experimental self-improving AI system

| | Experts | Step \#1 | Step \#2 | Step \#3 | Step \#4 |
| :---- | ----- | ----- | ----- | ----- | ----- |
| **Policy** | Stockfish 16.1 | ROOK CLF | Stockfish 16.1 | RookWorld LM | RookWorld Evol LM |
| **Environment** | python-chess | python-chess | ArbiterSim LM | | |

**Figure 1**: Overview

Let's dive into each of these components and explore their capabilities and implications for AI research. For more technical and implementation details, please consult the public GitHub repo.

## ROOK: A Language Model that Plays Chess

ROOK is a decoder transformer model with a classification head, trained from scratch to play chess following [Ruoss et al. 2024](https://arxiv.org/pdf/2402.04494). What sets ROOK apart is its training on a synthetic dataset that incorporates chain-of-thought evaluation from [Stockfish 16.1](https://github.com/official-stockfish/Stockfish), a leading chess engine, improving sample efficiency over standard behavioral cloning.

| Example Data | P: | 6k1/7p/4P1q1 /1pb1Q2p/2p1b3 /2P4P/PP4P1 /R6K w \- \- 9 38 | M: | e5g5 a1g1 e5b8 e5h2 e5e4 | E: | \-999.97 \-2.97 \-1.63 \-6.59 \-5.95 | B: | e5b8 |
| :---- | ----- | :---- | ----- | :---- | :---- | :---- | :---- | ----- |
| Field Explanation | Prefix | State ([FEN](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation)) \+ padding | Delimiter | Top 5 Moves, shuffled \+ padding | Delimiter | Top 5 Moves Eval \+ padding | Delimiter | Best Move |
| Inference | Prompt | | Generated Chain-of-Thought Tokens | | | | | Action |

**Figure 2**: ROOK data representation.
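
At inference time, only the final `B:` field matters for acting. Here is a small sketch of extracting the action from a generated completion, assuming the delimiters shown in the table above and UCI-formatted moves.

```python
# Sketch: pulling the action out of a ROOK completion at inference time.
import re

def extract_action(completion: str) -> str | None:
    match = re.search(r"B:\s*([a-h][1-8][a-h][1-8][qrbn]?)", completion)
    return match.group(1) if match else None

generated = ("P: 6k1/7p/4P1q1/1pb1Q2p/2p1b3/2P4P/PP4P1/R6K w - - 9 38 "
             "M: e5g5 a1g1 e5b8 e5h2 e5e4 "
             "E: -999.97 -2.97 -1.63 -6.59 -5.95 B: e5b8")
assert extract_action(generated) == "e5b8"
```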

### Training and Performance

We've conducted extensive experiments with ROOK, scaling from small datasets of 20,000 samples up to 40 million samples. The largest datasets were generated on the [Tsubame 4.0 Supercomputer](https://www.hpcwire.com/off-the-wire/tokyo-techs-tsubame4-0-supercomputer-now-operational/) with generous support from Tokyo Institute of Technology. Here are some key findings:

* With 5 million unique training samples (approximately 770 million tokens), ROOK achieves basic chess-playing capabilities after 3 epochs (2.3bn tokens) of training:
* 11.5% accuracy on the BIG-bench "[Checkmate in One](https://github.com/google/BIG-bench/blob/main/bigbench/benchmark_tasks/checkmate_in_one/README.md)" task
* 13.4% accuracy in predicting the best move
* 39.6% accuracy in predicting the top 5 moves
* An average of 41.4 legal half-moves in self-play, with only 2.4% illegal moves
* Training on the same data without the Chain-of-Thought evaluation tokens leads to …

These results demonstrate ROOK's ability to understand chess positions and generate legal, strategic moves without explicit coding of chess rules.
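The self-play legality numbers above can be measured with python-chess acting as referee. Below is a sketch of that measurement; `rook_move(fen)` stands in for a call to the trained model and is an assumption of this example.

```python
# Sketch: python-chess referees a self-play game where the model plays both sides.
import chess

def self_play_legal_halfmoves(rook_move, max_plies: int = 200) -> tuple[int, bool]:
    board = chess.Board()
    for ply in range(max_plies):
        uci = rook_move(board.fen())          # model proposes an action for the side to move
        try:
            move = chess.Move.from_uci(uci)
        except ValueError:
            return ply, False                 # unparseable output counts as illegal
        if move not in board.legal_moves:
            return ply, False                 # illegal move ends the rollout
        board.push(move)
        if board.is_game_over(claim_draw=True):
            return ply + 1, True              # regular game end
    return max_plies, True
```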

### Prior Work and Discussion (WIP)

* *State-of-the-Art Chess Policies ([Ruoss et al. 2024](https://arxiv.org/pdf/2402.04494)):*
* *Stockfish 16 (0.05s)*
* *Tournament ELO: 2706*
* *AlphaZero (Convolution \+ Residual Architecture):*
* *Tournament ELO with MCTS (400 sims): 2502*
* *Tournament ELO with MCTS (800 sims): \~3500 ([Silver et al. 2017](https://arxiv.org/pdf/1712.01815))*
* *Grandmaster-Level Chess without Search (Decoder Transformer):*
* *14.7% overlap of Test-Set with Training-Set*
* *270M parameter model on 15.3b Action-Value samples (2.67 epochs)*
* *Action Accuracy: 69.4%*
* *Tournament ELO: 2299*
* *Ablation 9M parameter model on 40M samples*
* *Action Accuracy: \~55% (Figure A6)*
* *Chess Policies with LMs (Presser GPT-2, ChessGPT & Checkmate in One, Chess Transformer)*
* *Reasoning in LMs: Physics of LMs & CoT pre-training, [Scratchpads](https://arxiv.org/pdf/2112.00114)*

## ArbiterSim: Simulating the Chess Environment

ArbiterSim uses a generative GPT-2 architecture trained with the [karpathy/llm.c](https://github.com/karpathy/llm.c) library and takes us a step further by learning to simulate the chess environment itself. Trained on rollouts from ROOK self-play in an environment based on the [python-chess library](https://github.com/niklasf/python-chess), ArbiterSim can predict the next board state, game outcomes, and legality of moves.

| Example Data | 5R2/6R1/8 /3P4/p7/1b2R2P /2p3P1/6K1 b \- \- 0 58 | b3d5 | e1e3 a2b3 f7f2 b5b4 g5g7 b4c3 f2f8 c3c2 d4d5 b3d5 | 5R2/6R1/8 /3b4/p7/4R2P /2p3P1/6K1 w \- \- 0 59 | 0.001 | 0 | 0 |
| :---- | ----- | :---- | :---- | ----- | :---- | :---- | :---- |
| Field Explanation | Last State ([FEN](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation)) | Action | Action History (maxlen 10\) for [repetitions](https://en.wikipedia.org/wiki/Threefold_repetition) | Observation (new State) | Reward (-1 loss or illegal, 0.5 draw, 1 win, 0.001 valid) | Termination (bool, True if game ends by [WLD](https://en.wikipedia.org/wiki/Chess_scoring)) | Truncation (bool, True if game ends by illegal action) |
| Inference | Prompt | | | Generated Environment Update | | | |

**Figure 3**: Arbiter data representation. All fields concatenated into a string, delimited by “+”.
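
Used as an environment, the model is prompted with the first three fields of Figure 3 and generates the remaining four. A minimal sketch of this round trip follows; `generate(prompt)` stands in for the trained GPT-2 model and the exact delimiter handling is an assumption.

```python
# Sketch: one environment step through an ArbiterSim-style model.
def arbiter_step(generate, fen: str, action: str, history: list[str]):
    prompt = "+".join([fen, action, " ".join(history)]) + "+"
    completion = generate(prompt)
    new_fen, reward, terminated, truncated = completion.strip().split("+")[:4]
    return new_fen, float(reward), terminated == "1", truncated == "1"
```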

### Performance Metrics

After training from scratch on 2 million samples, ArbiterSim achieves sufficient accuracy to function as a game environment:

* 92.3% accuracy in predicting the exact next state
* 99.76% Normalized Levenshtein Similarity for next state predictions (metric sketched after this list)
* 98.93% accuracy in predicting game rewards
* 99.04% and 99.89% accuracy in predicting game termination and truncation, respectively
* Self-play of ROOK in ArbiterSim leads to 80% games terminated and 20% truncated
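
The Normalized Levenshtein Similarity above compares predicted and ground-truth FEN strings. The sketch below assumes the common normalization 1 - edit_distance / max(len(prediction), len(target)); the exact normalization used for the reported numbers may differ.

```python
# Sketch: Normalized Levenshtein Similarity on FEN strings.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_levenshtein_similarity(pred: str, target: str) -> float:
    if not pred and not target:
        return 1.0
    return 1.0 - levenshtein(pred, target) / max(len(pred), len(target))

pred   = "5R2/6R1/8/3b4/p7/4R2P/2p3P1/6K1 w - - 0 59"
target = "5R2/6R1/8/3b4/p7/4R2P/2p3P1/6K1 w - - 0 59"
print(normalized_levenshtein_similarity(pred, target))  # 1.0 for an exact match
```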

![Figure 4](/images/blog/rook-2.png)
**Figure 4**: ROOK \+ ArbiterSim Self-Play Loop

### Prior Work and Discussion (WIP)

* *World Models: VAE-RNN, LM, Diffusion as Generator, [DL RL and WM](https://www.sciencedirect.com/science/article/pii/S0893608022001150)*
* *Emergent world Models (Chess, Karvonen), Emergent World Representation (Othello, Li)*
* *World Models with LMs: [Foundation Models for RL](https://arxiv.org/pdf/2302.09419) (pg 53), [Reasoning via Planning](https://arxiv.org/pdf/2305.14992)*
* *Training Policies in World Models: [Dreamer](https://arxiv.org/pdf/1912.01603), [Utility of dreaming](https://journals.sagepub.com/doi/pdf/10.1177/1059712319896489), [Learning to Predict](https://proceedings.neurips.cc/paper_files/paper/2019/file/15cf76466b97264765356fcc56d801d1-Paper.pdf)?*

## RookWorld: Unifying Policy and Environment

RookWorld represents a significant leap forward by combining the capabilities of ROOK and ArbiterSim into a single language model. Through the use of prompt prefixes, RookWorld can switch between acting as a chess player and simulating the chess environment.

| Example Data | P: | 6k1/7p/4P1q1 /1pb1Q2p/2p1b3 /2P4P/PP4P1 /R6K w \- \- 9 38 | M: | e5g5 a1g1 e5b8 e5h2 e5e4 | E: | \-999.97 \-2.97 \-1.63 \-6.59 \-5.95 | B: | e5b8 |
| :---- | ----- | :---- | ----- | :---- | :---- | :---- | :---- | ----- |
| Field Explanation | Prefix | State ([FEN](https://en.wikipedia.org/wiki/Forsyth%E2%80%93Edwards_Notation)) \+ padding | Delimiter | Top 5 Moves, shuffled \+ padding | Delimiter | Top 5 Moves Eval \+ padding | Delimiter | Best Move |
| Inference | Prompt | | Generated Chain-of-Thought Tokens | | | | | Action |

**Figure 2**: ROOK data representation. Concatenation of several fields into a string.

### Comparative Performance

RookWorld, trained from scratch on 7 million samples (5M from ROOK task, 2M from Arbiter task), outperforms its predecessors in several benchmarks:

* 13.7% accuracy on the BIG-bench "Checkmate in One" task (vs. 11.5% for ROOK)
* 16.6% accuracy in predicting the best move (vs. 13.4% for ROOK)
* 99.61% accuracy in predicting the exact next state (vs. 92.3% for ArbiterSim)
* RookWorld Policy can play over 50 consecutive legal moves in RookWorld Environment (actions and states verified with python-chess)

![Figure 5](/images/blog/rook-3.png)
**Figure 5**: RookWorld Self-Play Loop
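
A sketch of the self-play loop in Figure 5 is shown below: one model serves as both policy and environment, switched by prompt prefix. The `P:` policy prefix follows the ROOK format above; the `A:` arbiter prefix, the `generate` call, and the field parsing are assumptions of this sketch.

```python
# Sketch: RookWorld self-play with task switching via prompt prefixes.
from collections import deque
import re

def rookworld_selfplay(generate, start_fen: str, max_plies: int = 200):
    fen, history = start_fen, deque(maxlen=10)
    for _ in range(max_plies):
        # Policy task: generate the chain of thought and read the action from "B:".
        policy_out = generate(f"P: {fen} ")
        action = re.search(r"B:\s*(\S+)", policy_out).group(1)
        history.append(action)
        # Environment task: same model, arbiter-style prompt, different prefix.
        env_out = generate("A: " + "+".join([fen, action, " ".join(history)]) + "+")
        fen, reward, terminated, truncated = env_out.strip().split("+")[:4]
        if terminated == "1" or truncated == "1":
            return fen, float(reward), terminated == "1"
    return fen, 0.0, False
```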

These results demonstrate that the unified model not only matches but exceeds the performance of specialized models in both chess playing and environment simulation tasks.

### Prior Work and Discussion (WIP)

* *Multi-Task Learning: Caruana 1997, [Ruder 2017](https://arxiv.org/pdf/1706.05098), [Radford 2019](https://insightcivic.s3.us-east-1.amazonaws.com/language-models.pdf)*
* *Generator & Judge/Reward Model: [LLMs-as-Judge](https://arxiv.org/pdf/2306.05685), [Hibbard 2015](https://cdn.aaai.org/ocs/ws/ws0019/10072-45897-1-PB.pdf)?, [Meta-Rewarding LMs](https://arxiv.org/pdf/2407.19594), contrast with [finetuning Judges](https://arxiv.org/pdf/2310.17631)*
* *Synthetic Training Data: [Textbooks](https://arxiv.org/pdf/2306.11644), [GenAI for Synth Data](https://arxiv.org/pdf/2403.04190)*

## RookWorld Evol: Towards Self-Improving AI

Our most ambitious component, RookWorld Evol, is still a work in progress. The goal is to create a system that can improve its own chess-playing ability through self-play within its learned environment. By selecting only the actions of the winning side from self-play rollouts and continuing training, we aim to achieve stepwise policy improvement without further human intervention.
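
Below is a sketch of the winner-filtering step described above. The rollout record layout and the `P: ... B: ...` target format are assumptions, and whether the chain-of-thought fields are regenerated for the new samples is not specified here.

```python
# Sketch: keep only the winning side's (state, action) pairs from a self-play rollout.
import chess

def winning_side_samples(rollout: list[tuple[str, str]], winner: bool) -> list[str]:
    """rollout: (fen_before_move, action_uci) pairs; winner: chess.WHITE or chess.BLACK."""
    samples = []
    for fen, action in rollout:
        if chess.Board(fen).turn == winner:   # keep only the winner's decisions
            samples.append(f"P: {fen} B: {action}")
    return samples
```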

### Prior Work and Discussion (WIP)

* [*LLMs can Selfimprove*](https://arxiv.org/pdf/2210.11610)*, [STaR](https://arxiv.org/pdf/2203.14465)*
* [*WizardLM*](https://arxiv.org/pdf/2304.12244) *Evol-Instruct, Orca [AgentInstruct](https://arxiv.org/pdf/2407.03502)*

## Implications and Future Directions

The success of ROOK, ArbiterSim, and RookWorld has significant implications for AI research:

1. **Strategic Reasoning**: These models demonstrate that language models can learn complex strategic reasoning tasks without explicit coding of domain rules.

2. **World Modeling**: ArbiterSim and RookWorld show that language models can accurately simulate complex environments, opening possibilities for AI systems that can reason about hypothetical scenarios and learn to act as a judge on their own generated outputs.

3. **Unified Architectures**: RookWorld's success in combining policy and environment modeling shows how multi-task learning can produce multi-purpose AI systems that exceed single-task performance.

4. **Self-Improvement**: RookWorld Evol, while still in development, points towards the possibility of AI systems that can autonomously improve their capabilities.

Future work will focus on scaling these models to larger architectures, improving their performance, and exploring applications beyond chess. We're particularly excited about the potential of these techniques in other strategic domains and decision-making systems.

Since all models presented above use the standard GPT-2 architecture, without customized architecture or even tokenization, we're also looking forward to seeing whether interleaving this training data into standard language model pre-training has effects on language model benchmarks.

## Conclusion

ROOK, ArbiterSim, and RookWorld represent significant steps forward in AI's ability to reason strategically and model complex environments. As we continue to refine these models and develop RookWorld Evol, we're moving closer to AI systems that can engage in sophisticated strategic reasoning across a wide range of domains.

We invite the AI research community to build upon our work. All code, datasets, and pre-trained models are available in our public repository and on Hugging Face. Together, we can push the boundaries of what's possible in artificial intelligence and strategic reasoning.

## Benchmark Results

Policy Evaluations:

| Model | Dataset (Samples) | Steps (Epochs) | Action Accuracy | [Checkmate in One](https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/checkmate_in_one) Accuracy | [Lichess Puzzle](https://nicholas.carlini.com/writing/2023/chess-llm.html) Accuracy | Self-play Legal Moves |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| ROOK (124M) | ROOK 20k | 5,000 (2) | 0.6% | 0.0% | | 3.5 (28.3%) |
| ROOK (124M) | ROOK 260k | 18,752 (1) | 3.8% | 4.7% | | 14.2 (7.0%) |
| ROOK (124M) | ROOK 709k | 51,481 (1) | 7.4% | 4.8% | | 17.7 (5.6%) |
| ROOK (124M) | *ROOK 709k (no M)* | 51,481 (1) | 7.2% | 4.9% | | 14.2 (7.0%) |
| ROOK (124M) | *ROOK 709k (no E)* | 51,481 (1) | 7.0% | 4.0% | | 20.0 (5%) |
| ROOK (124M) | *ROOK 709k (no ME)* | 51,481 (1) | 8.2% | 5.8% | | 13.6 (7.4%) |
| ROOK (124M) | ROOK 709k | 102,962 (2) | 7.8% | 5.5% | | 23.6 (4.2%) |
| ROOK (124M) | ROOK 709k | 154,443 (3) | 8.8% | 7.0% | 1.4% | 23.5 (4.3%) |
| ROOK (124M) | ROOK 5M | 11,646 (1) | 12.0% | 9.0% | 2.0% | 28.8 (3.5%) |
| ROOK (124M) | ROOK 5M | 34,932 (3) | 13.4% | 11.5% | 3.9% | 41.4 (2.4%) |
| ROOK (124M) | *ROOK 5M (no ME)* | 34,932 (3) | 14.6% | 14.9% | 5.1% | 30.3 (3.3%) |
| ROOK (124M) | ROOK 40M | 278,154 (3) | 22.2% | 24.4% | 10.1% | 112.1 (0.9%) |
| RookWorld (124M) | RookWorld 7M | 47,203 (3) | 16.6% | 13.7% | 3.5% | 36.3 (2.7%) |
| **RookWorld (124M)** | **RookWorld 46M** | **529,400 (5)** | **26.2%** | **32.1%** | \- | \- |
| ROOK (124M) | *RookWorld 7M (no ME)* | 47,203 (3) | 14.4% | 15.7% | 5.3% | 31.2 (3.2%) |
| RookWorld Evol (124M) | RW 7M \+ 1.3M RW Policy Self-play | 56,031 (3) | 13.2% | 13.89% | 4.3% | 37.9 (2.6%) |
| [Feng 2023](https://arxiv.org/pdf/2306.09200) ChessGPT-Base (3B, w/o suffix, MC) | 28.1M documents | | | 13.6% | | |
| [Feng 2023](https://arxiv.org/pdf/2306.09200) ChessGPT-Base (3B, w/o suffix, ESM) | 28.1M documents | | | 26.5% | | |
| [Ruoss 2024](https://arxiv.org/pdf/2402.04494) (9M, BC) | 40M | | \~55% | | 65.7% | |
| [Ruoss 2024](https://arxiv.org/pdf/2402.04494) (270M, AV) | 15.3B | | 69.4% | | 93.5% | |

Environment Evaluations:

## Related Work

Direct Inspirations:
Related Work:
* [\[2408.14837\] Diffusion Models Are Real-Time Game Engines](https://arxiv.org/abs/2408.14837)
* [\[2407.02466\] PWM: Policy Learning with Large World Models](https://arxiv.org/abs/2407.02466)

## Public Code & Datasets

* Data generation
* [https://github.com/jorahn/rook/tree/main/dev/data](https://github.com/jorahn/rook/tree/main/dev/data)
