Training
========

We start from a pre-trained model, i.e. one trained on a large general-purpose text corpus, and fine-tune it (continue training) on a specialized dataset, in our case medical terminology. As the pre-trained model we use the convolutional (CNN) WMT'14 English-to-French model distributed with fairseq. The training script we use is shown below:

.. highlight:: bash

::

    FAIRSEQ=~/fairseq
    PRETRAINED_MODEL=~/wmt14.en-fr.fconv-py
    SEED=1
    EXP_NAME=fine-tune

    # Language pair and the vocabularies shipped with the pre-trained model;
    # reusing them keeps token IDs consistent with the checkpoint.
    SRC=en
    TRG=fr
    SRC_VOCAB=$PRETRAINED_MODEL/dict.$SRC.txt
    TRG_VOCAB=$PRETRAINED_MODEL/dict.$TRG.txt
    PRETRAINED_MODEL_FILE=$PRETRAINED_MODEL/model.pt

    # Raw parallel corpus and the destination for the binarized data
    CORPUS_DIR=~/data
    DATA_DIR=~/data-bin
    TRAIN_PREFIX=$CORPUS_DIR/train
    DEV_PREFIX=$CORPUS_DIR/valid

    mkdir -p $CORPUS_DIR
    mkdir -p $DATA_DIR

    ######################################
    # Preprocessing
    ######################################
    # Binarize the corpus, reusing the pre-trained model's vocabularies
    # (--srcdict/--tgtdict) instead of building new ones.
    CUDA_VISIBLE_DEVICES=0 fairseq-preprocess \
        --source-lang $SRC \
        --target-lang $TRG \
        --trainpref $TRAIN_PREFIX \
        --validpref $DEV_PREFIX \
        --destdir $DATA_DIR \
        --srcdict $SRC_VOCAB \
        --tgtdict $TRG_VOCAB \
        --workers $(nproc)

    ######################################
    # Training
    ######################################
    # Continue training from the pre-trained checkpoint; --reset-optimizer
    # discards the optimizer state stored in the checkpoint so that
    # fine-tuning starts with a fresh optimizer.
    CUDA_VISIBLE_DEVICES=0 fairseq-train $DATA_DIR \
        --restore-file $PRETRAINED_MODEL_FILE \
        --lr 0.5 --clip-norm 0.1 --dropout 0.1 --max-tokens 3000 \
        --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
        --lr-scheduler fixed --force-anneal 50 \
        --arch fconv_wmt_en_fr \
        --reset-optimizer \
        --skip-invalid-size-inputs-valid-test \
        --save-dir checkpoints/fconv_wmt_en_fr_saved
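
The script assumes the raw parallel corpus is already in ``$CORPUS_DIR``: ``fairseq-preprocess`` reads plain-text files named ``<prefix>.<lang>``, with line ``n`` of the source file aligned to line ``n`` of the target file, and the text must be tokenized (and sub-word segmented, if the pre-trained vocabulary uses sub-words) consistently with the supplied dictionaries. A minimal sketch of the expected layout; the sample sentence pair is purely illustrative::

    $ ls ~/data
    train.en  train.fr  valid.en  valid.fr

    # Each line of train.en is aligned with the same line of train.fr, e.g.:
    $ paste ~/data/train.en ~/data/train.fr | head -n 1
    myocardial infarction	infarctus du myocarde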
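
After training, the fine-tuned model can be tried out directly. A minimal sketch, assuming the ``checkpoint_best.pt`` file that ``fairseq-train`` writes to ``--save-dir`` by default; the beam and batch sizes are illustrative, and interactive input must be tokenized the same way as the training data::

    # Translate and score the binarized validation set
    CUDA_VISIBLE_DEVICES=0 fairseq-generate ~/data-bin \
        --path checkpoints/fconv_wmt_en_fr_saved/checkpoint_best.pt \
        --gen-subset valid --beam 5 --batch-size 32

    # Or translate ad-hoc input from stdin
    echo "myocardial infarction" | fairseq-interactive ~/data-bin \
        --path checkpoints/fconv_wmt_en_fr_saved/checkpoint_best.pt \
        --source-lang en --target-lang fr --beam 5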