Installation & requirements

We use NLLB models for training and inference, with Hugging Face Transformers as the primary framework.

You need to have the following installed:

  • Python version >= 3.10

  • PyTorch (a CUDA-enabled build is recommended for training)

  • For training new models, you’ll also need an NVIDIA GPU and NCCL
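
The requirements above can be checked from Python before installing anything else. The following is a minimal sketch, not part of the project: the helper names python_ok and torch_report are hypothetical, and it assumes torch.distributed.is_nccl_available() is a sufficient signal that your PyTorch build includes NCCL.

```python
import sys


def python_ok(min_version=(3, 10)):
    """Return True if the running interpreter meets the minimum version."""
    return sys.version_info[:2] >= tuple(min_version)


def torch_report():
    """Collect PyTorch/CUDA/NCCL facts; values stay None if torch cannot be loaded."""
    report = {"torch": None, "cuda": None, "nccl": None}
    try:
        import torch
        import torch.distributed as dist
    except (ImportError, OSError):
        # torch is not installed, or its native libraries failed to load.
        return report
    report["torch"] = torch.__version__
    # True only if a CUDA build is installed and a usable GPU/driver is visible.
    report["cuda"] = torch.cuda.is_available()
    # Reports whether this PyTorch build was compiled with NCCL support.
    report["nccl"] = bool(dist.is_available() and dist.is_nccl_available())
    return report


if __name__ == "__main__":
    print("Python >= 3.10:", python_ok())
    for key, value in torch_report().items():
        print(f"{key}: {value}")
```

A value of None for torch means PyTorch could not be imported. False for cuda or nccl is fine for inference on CPU, but training as described here needs both.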

Install core NLLB dependencies:

pip install torch transformers datasets evaluate sentencepiece sacrebleu accelerate

The NLLB checkpoints used in this project include:

  • facebook/nllb-200-distilled-600M

  • facebook/nllb-200-1.3B (optional larger variant)
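
As an illustration of how these checkpoints are typically loaded with Transformers, here is a hedged sketch of a single translation call. The translate helper, the CHECKPOINTS mapping, and the default language pair (eng_Latn to fra_Latn) are assumptions made for the example, not the project's own code.

```python
# Checkpoint names as listed above; the "small"/"large" keys are illustrative only.
CHECKPOINTS = {
    "small": "facebook/nllb-200-distilled-600M",
    "large": "facebook/nllb-200-1.3B",
}


def translate(text, src_lang="eng_Latn", tgt_lang="fra_Latn", size="small"):
    """Translate one sentence with an NLLB checkpoint (downloads it on first use)."""
    name = CHECKPOINTS[size]
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(name, src_lang=src_lang)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # NLLB selects the target language by forcing its code as the first token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=64,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Calling translate("The weather is nice today.") returns the French translation as a string. Pass size="large" to try the 1.3B variant, which needs noticeably more memory.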

Optional: install fairseq (used in this project for preprocessing/binarization compatibility):

pip install fairseq

Optional preprocessing tools:

git clone https://github.com/moses-smt/mosesdecoder
git clone https://github.com/rsennrich/subword-nmt

Also recommended:

  • For large datasets: pip install pyarrow

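  • If you train inside Docker, increase the container's shared memory (pass --ipc=host or --shm-size to docker run); PyTorch data-loading workers rely on shared memory, and the Docker default is often too small for stable training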

Quick verification:

python -c "from transformers import AutoTokenizer; t=AutoTokenizer.from_pretrained('facebook/nllb-200-distilled-600M'); print('NLLB tokenizer OK')"
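
The one-liner above only exercises the tokenizer. For a broader check, the sketch below reports the installed version of every package named in this section using only the standard library. The CORE and OPTIONAL lists mirror the pip commands above; the helper names are hypothetical.

```python
from importlib import metadata

# Distribution names taken from the pip install commands in this section.
CORE = ["torch", "transformers", "datasets", "evaluate",
        "sentencepiece", "sacrebleu", "accelerate"]
OPTIONAL = ["fairseq", "pyarrow"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def missing(packages):
    """Return the names from `packages` that are not installed."""
    return [name for name, version in installed_versions(packages).items()
            if version is None]


if __name__ == "__main__":
    for name, version in installed_versions(CORE + OPTIONAL).items():
        print(f"{name:15} {version or 'NOT INSTALLED'}")
    print("Missing core packages:", missing(CORE) or "none")
```

This only confirms that the packages are present. It does not confirm that their versions are compatible with each other.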