A Rust tokenizer using the ungreedy 6-branch algorithm. Produces 4-15% fewer tokens than BPE at the same vocabulary size, with 8-9x faster tokenization throughput.
TokenBeast is a ground-up Rust rewrite of TokenMonster. Training finishes faster while matching or surpassing Go TokenMonster's compression quality.
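To illustrate what "ungreedy" means here, the toy sketch below contrasts greedy longest-match tokenization with a branch search that considers several candidate matches at each position and keeps whichever split yields fewer tokens. This is an illustrative sketch only, not the actual 6-branch implementation (it assumes every single character is in the vocabulary and ignores capcode):

```python
def greedy(text, vocab):
    # always take the longest vocabulary match at the current position
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
    return out

def ungreedy(text, vocab, branches=6):
    # consider up to `branches` candidate matches at this position and
    # recursively pick the branch that yields the fewest total tokens
    if not text:
        return []
    matches = sorted(
        (text[:j] for j in range(1, len(text) + 1) if text[:j] in vocab),
        key=len, reverse=True)[:branches]
    return min(([m] + ungreedy(text[len(m):], vocab, branches)
                for m in matches), key=len)

vocab = {"a", "b", "c", "d", "ab", "bcd"}
print(greedy("abcd", vocab))    # ['ab', 'c', 'd'] — greedy commits too early
print(ungreedy("abcd", vocab))  # ['a', 'bcd'] — one token fewer
```

Greedy grabs `ab` and is then forced into three tokens; the ungreedy search spends one extra branch evaluation to find the two-token split.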
Two algorithms are available via `--algorithm`:
TokenBeast (default) is a single-pass iterative distillation algorithm. It starts with millions of candidate tokens extracted from the corpus, then iteratively removes the lowest-scoring tokens until the target vocabulary size is reached. It protects short tokens from premature removal, uses a gradual removal schedule, and finishes with a tournament refinement phase that gives rejected tokens a second chance.
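The core distillation loop can be sketched as follows. This is a hypothetical simplification of the idea, not the real implementation: the `score` function, the pruning fraction, and the short-token protection and tournament phases described above are all omitted or stubbed out here.

```python
def distill(candidates, target_size, score, prune_fraction=0.05):
    """Iteratively remove the lowest-scoring tokens until target size is hit.

    `score` maps a token to its usefulness (e.g. how much it reduces the
    corpus token count); a gradual schedule removes only a small slice of
    the worst tokens per pass so rankings can be re-evaluated in between.
    """
    vocab = set(candidates)
    while len(vocab) > target_size:
        ranked = sorted(vocab, key=score)  # worst tokens first
        # remove at most `prune_fraction` of the vocab, never overshooting
        n_remove = min(max(1, int(len(vocab) * prune_fraction)),
                       len(vocab) - target_size)
        for tok in ranked[:n_remove]:
            vocab.remove(tok)
    return vocab

# toy run: score tokens by length, keep the 2 "best"
print(distill(["a", "bb", "ccc", "dddd", "eeeee"], 2, len))
```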
TokenMonster is a faithful Rust port of the Go multi-worker training algorithm. It uses parallel workers with dataset strips, union-based voting for token removal, and random swap refinement. Use this when you want results comparable to the original Go implementation.
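A toy sketch of the union-based voting step (illustrative only; function and variable names are invented): each worker scores tokens against its own dataset strip and nominates its weakest ones, and the union of all nominations is removed.

```python
def vote_removals(worker_scores, n_nominate):
    """Merge per-worker removal votes by union.

    `worker_scores` is one dict per worker mapping token -> score on that
    worker's dataset strip; each worker nominates its `n_nominate` weakest
    tokens, and any token nominated by any worker is marked for removal.
    """
    removals = set()
    for scores in worker_scores:
        worst = sorted(scores, key=scores.get)[:n_nominate]
        removals.update(worst)
    return removals

# two workers disagree about which token is weakest; both nominees go
strips = [{"the": 1, "qu": 5, "ing": 3}, {"the": 2, "qu": 1, "ing": 9}]
print(vote_removals(strips, n_nominate=1))
```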
See docs/algorithm-comparison.md for a detailed breakdown of how each algorithm works and where they differ.
All systems were trained on the same data: tiny_shakespeare (1.1 MB).
| System | chr/tok | MB/s | vs SentencePiece |
|---|---|---|---|
| SentencePiece BPE | 3.679 | 3.5 | baseline |
| TokenMonster (Go, fast) | 3.927 | 29.0 | +6.8% |
| TokenMonster (Go, full) | 3.762 | 30.6 | +2.3% |
| TokenBeast (Rust) | 3.812 | 32.6 | +3.6% |

| System | chr/tok | MB/s | vs SentencePiece |
|---|---|---|---|
| SentencePiece BPE | 4.000 | 3.2 | baseline |
| TokenMonster (Go, fast) | 4.338 | 25.9 | +8.4% |
| TokenMonster (Go, full) | 4.246 | 28.6 | +6.2% |
| TokenBeast (Rust) | 4.591 | 30.4 | +14.8% |
Every trainer beats SentencePiece at both sizes, and TokenBeast's advantage grows with vocabulary size: it trails Go TokenMonster at the smaller size but leads every system at the larger one. All variants tokenize at 26-33 MB/s, roughly 8-9x faster than SentencePiece.
Full benchmark tables including Rust TokenMonster results and vocabulary overlap analysis are in docs/benchmarks.md.
| Crate | Description |
|---|---|
| `tokenbeast` | Core library: vocab loading, tokenization, encoding/decoding |
| `tokenbeast-train` | Training library + CLI: dictionary extraction and vocabulary training |
| `tokenbeast-py` | Python bindings via PyO3: tokenization, training, and HuggingFace integration |
```bash
cd crates/tokenbeast-py
pip install -e .

# With HuggingFace support
pip install -e ".[hf]"
```

```python
from tokenbeast import Vocab

vocab = Vocab.load("results/final.tb")
ids = vocab.tokenize("Hello world")
text = vocab.decode_str(ids)
print(f"{len(ids)} tokens, round-trip: {text}")
```

```python
from tokenbeast import extract, train

# Extract a dictionary from training data
extract(dataset="data.txt", output="tokens.dict")

# Train a vocabulary (releases the GIL)
train(
    dataset="data.txt",
    dictionary="tokens.dict",
    vocab_size=4096,
    algorithm="tokenbeast",  # or "tokenmonster"
    dir="results",
)
```

```python
from tokenbeast import TokenBeastTokenizer

tok = TokenBeastTokenizer(vocab_file="results/final.tb")
encoded = tok("Hello world")
print(encoded["input_ids"])
print(tok.decode(encoded["input_ids"]))

# Save and reload
tok.save_pretrained("/tmp/my_tokenizer")
tok2 = TokenBeastTokenizer.from_pretrained("/tmp/my_tokenizer")
```

Requires `transformers`. Install with `pip install -e ".[hf]"` or `pip install transformers`.
```bash
tokenbeast-train extract \
  --dataset data.txt \
  --output tokens.dict
```

```bash
# TokenBeast (default)
tokenbeast-train train \
  --dataset data.txt \
  --dictionary tokens.dict \
  --vocab-size 4096 \
  --dir results

# TokenMonster
tokenbeast-train train \
  --algorithm tokenmonster \
  --dataset data.txt \
  --dictionary tokens.dict \
  --vocab-size 4096 \
  --dir results_tm \
  --workers 8
```

```bash
tokenbeast-train eval \
  --vocab results/final.tb \
  --dataset data.txt
```

| Parameter | Default | Description |
|---|---|---|
| `--algorithm` | `tokenbeast` | `tokenbeast` (alias: `tb`) or `tokenmonster` (alias: `tm`) |
| `--dataset` | required | Path to training data file |
| `--dictionary` | required | Path to token dictionary (from `extract`) |
| `--vocab-size` | 32000 | Target vocabulary size |
| `--dir` | `results` | Output directory for vocab snapshots |
| `--workers` | 8 | Number of parallel workers (TokenMonster only) |
| `--percentage` | 15 | Percentage of dataset to use as scoring strips |
| `--midway-target` | 0 | When to switch from strips to full dataset (0 = 6x vocab_size) |
| `--keep-trying` | 1000 | Max no-improvement attempts in refinement phase |
| `--capcode` | 2 | Capcode encoding level (0=none, 1=basic, 2=full) |
| `--charset` | `utf8` | Character set: `utf8`, `utf16`, `none` |
| `--level` | `clean` | Optimization level: `unfiltered`, `clean`, `balanced`, `consistent`, `strict` |
| `--seed-vocab` | none | Optional seed vocabulary to warm-start training |
```bash
cargo build --release
```

The training binary will be at `target/release/tokenbeast-train`.