TokenBeast

A Rust tokenizer using the ungreedy 6-branch algorithm. Produces 4-15% fewer tokens than BPE at the same vocabulary size, with 8-9x faster tokenization throughput.

TokenBeast is a ground-up Rust rewrite of TokenMonster. Training finishes faster while matching or surpassing Go TokenMonster's compression quality.

Training Algorithms

Two algorithms are available via --algorithm:

TokenBeast (default) is a single-pass iterative distillation algorithm. It starts with millions of candidate tokens extracted from the corpus, then iteratively removes the lowest-scoring tokens until the target vocabulary size is reached. It protects short tokens from premature removal, uses a gradual removal schedule, and finishes with a tournament refinement phase that gives rejected tokens a second chance.
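The distillation loop described above can be sketched roughly as follows. This is a hypothetical illustration of the idea, not the actual TokenBeast implementation; all names and parameters are invented:

```python
# Hypothetical sketch of iterative distillation: start from a large candidate
# pool and repeatedly drop small batches of the lowest-scoring tokens until
# the target vocabulary size is reached. Short tokens are protected from
# removal, mirroring the safeguard described above.

def distill(candidates, target_size, protected_len=2, batch_frac=0.05):
    """candidates: dict mapping token -> score (higher = more useful)."""
    vocab = dict(candidates)
    while len(vocab) > target_size:
        removable = [t for t in vocab if len(t) > protected_len]
        if not removable:
            break  # only protected short tokens remain
        # Gradual schedule: shrink in small batches rather than all at once.
        n_remove = max(1, int((len(vocab) - target_size) * batch_frac))
        removable.sort(key=lambda t: vocab[t])  # lowest score first
        for t in removable[:n_remove]:
            del vocab[t]
    return vocab
```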

TokenMonster is a faithful Rust port of the Go multi-worker training algorithm. It uses parallel workers with dataset strips, union-based voting for token removal, and random swap refinement. Use this when you want results comparable to the original Go implementation.
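The union-based voting step can be pictured like this (illustrative names only, not the actual worker code): each worker scores tokens on its own dataset strip and nominates its weakest candidates, and the union of all nominations is removed in that round.

```python
# Hypothetical sketch of union-based voting across parallel workers.

def union_vote(worker_scores, n_per_worker):
    """worker_scores: one dict of token -> score per worker."""
    nominated = set()
    for scores in worker_scores:
        # Each worker nominates its n lowest-scoring tokens for removal.
        worst = sorted(scores, key=scores.get)[:n_per_worker]
        nominated.update(worst)
    return nominated
```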

See docs/algorithm-comparison.md for a detailed breakdown of how each algorithm works and where they differ.

Results

Trained on tiny_shakespeare (1.1 MB). All systems trained on the same data.

Vocab 4096

| System | chr/tok | MB/s | vs SentencePiece |
| --- | --- | --- | --- |
| SentencePiece BPE | 3.679 | 3.5 | baseline |
| TokenMonster (Go, fast) | 3.927 | 29.0 | +6.8% |
| TokenMonster (Go, full) | 3.762 | 30.6 | +2.3% |
| TokenBeast (Rust) | 3.812 | 32.6 | +3.6% |

Vocab 8192

| System | chr/tok | MB/s | vs SentencePiece |
| --- | --- | --- | --- |
| SentencePiece BPE | 4.000 | 3.2 | baseline |
| TokenMonster (Go, fast) | 4.338 | 25.9 | +8.4% |
| TokenMonster (Go, full) | 4.246 | 28.6 | +6.2% |
| TokenBeast (Rust) | 4.591 | 30.4 | +14.8% |

TokenBeast beats the SentencePiece baseline at both sizes and tops the table at vocab 8192, with its advantage growing as the vocabulary grows. All variants tokenize at 26-33 MB/s, roughly 8-9x faster than SentencePiece.
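For reference, the "vs SentencePiece" column is simply the percent improvement in chr/tok over the SentencePiece baseline at the same vocabulary size:

```python
# Percent improvement in characters-per-token over a baseline tokenizer.

def vs_baseline(chr_per_tok, baseline_chr_per_tok):
    return 100.0 * (chr_per_tok / baseline_chr_per_tok - 1.0)

# e.g. TokenBeast at vocab 8192: vs_baseline(4.591, 4.000) is roughly +14.8
```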

Full benchmark tables including Rust TokenMonster results and vocabulary overlap analysis are in docs/benchmarks.md.

Crates

| Crate | Description |
| --- | --- |
| tokenbeast | Core library: vocab loading, tokenization, encoding/decoding |
| tokenbeast-train | Training library + CLI: dictionary extraction and vocabulary training |
| tokenbeast-py | Python bindings via PyO3: tokenization, training, and HuggingFace integration |

Quick Start (Python)

```shell
cd crates/tokenbeast-py
pip install -e .

# With HuggingFace support
pip install -e ".[hf]"
```

Tokenize

```python
from tokenbeast import Vocab

vocab = Vocab.load("results/final.tb")
ids = vocab.tokenize("Hello world")
text = vocab.decode_str(ids)
print(f"{len(ids)} tokens, round-trip: {text}")
```

Train

```python
from tokenbeast import extract, train

# Extract a dictionary from training data
extract(dataset="data.txt", output="tokens.dict")

# Train a vocabulary (releases the GIL)
train(
    dataset="data.txt",
    dictionary="tokens.dict",
    vocab_size=4096,
    algorithm="tokenbeast",  # or "tokenmonster"
    dir="results",
)
```

HuggingFace Integration

```python
from tokenbeast import TokenBeastTokenizer

tok = TokenBeastTokenizer(vocab_file="results/final.tb")
encoded = tok("Hello world")
print(encoded["input_ids"])
print(tok.decode(encoded["input_ids"]))

# Save and reload
tok.save_pretrained("/tmp/my_tokenizer")
tok2 = TokenBeastTokenizer.from_pretrained("/tmp/my_tokenizer")
```

Requires `transformers`. Install with `pip install -e ".[hf]"` or `pip install transformers`.

Quick Start (CLI)

Extract a dictionary

```shell
tokenbeast-train extract \
  --dataset data.txt \
  --output tokens.dict
```

Train a vocabulary

```shell
# TokenBeast (default)
tokenbeast-train train \
  --dataset data.txt \
  --dictionary tokens.dict \
  --vocab-size 4096 \
  --dir results

# TokenMonster
tokenbeast-train train \
  --algorithm tokenmonster \
  --dataset data.txt \
  --dictionary tokens.dict \
  --vocab-size 4096 \
  --dir results_tm \
  --workers 8
```

Evaluate a vocabulary

```shell
tokenbeast-train eval \
  --vocab results/final.tb \
  --dataset data.txt
```

Training Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| --algorithm | tokenbeast | tokenbeast (alias: tb) or tokenmonster (alias: tm) |
| --dataset | required | Path to training data file |
| --dictionary | required | Path to token dictionary (from extract) |
| --vocab-size | 32000 | Target vocabulary size |
| --dir | results | Output directory for vocab snapshots |
| --workers | 8 | Number of parallel workers (TokenMonster only) |
| --percentage | 15 | Percentage of dataset to use as scoring strips |
| --midway-target | 0 | When to switch from strips to full dataset (0 = 6x vocab_size) |
| --keep-trying | 1000 | Max no-improvement attempts in refinement phase |
| --capcode | 2 | Capcode encoding level (0=none, 1=basic, 2=full) |
| --charset | utf8 | Character set: utf8, utf16, none |
| --level | clean | Optimization level: unfiltered, clean, balanced, consistent, strict |
| --seed-vocab | none | Optional seed vocabulary to warm-start training |
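Per the table above, a --midway-target of 0 means the switch point defaults to 6x the target vocabulary size. A tiny helper making that rule explicit (illustrative only, not the actual CLI code):

```python
# Resolve the effective midway target: 0 means "6 x vocab_size".

def effective_midway_target(midway_target, vocab_size):
    return midway_target if midway_target > 0 else 6 * vocab_size
```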

Building

```shell
cargo build --release
```

The training binary will be at target/release/tokenbeast-train.
