Preprocessing
=============

Datasets
--------

The following table presents the parallel datasets that we first preprocess and then use to train the resulting models.

.. list-table::
   :widths: 25 25 25
   :header-rows: 1

   * - Rounds
     - Model
     - Description
   * - 1st and 2nd rounds
     - CNN
     - 2020 datasets: ICD-10, CHU Rouen, ORDO, ACAD, MEDDRA, ATC, MESH, ICD-O, DBPEDIA, ICPC, ICF
   * - 3rd round
     - CNN
     - Cleaned to remove bilingual sentences that lead to ambiguities (e.g. ICPC is not structured in a way that is relevant for a training set)
   * - 4th round
     - CNN
     - 3rd round + PatTR corpus (patents database)
   * - 5th round
     - CNN
     - 3rd round + Medline (training2) and Scielo datasets
   * - 6th round
     - Transformer
     - Same data as the 5th round, with a Transformer architecture
   * - Ensemble
     - CNNs
     - An ensemble of the three CNN models from the 3rd, 4th and 5th rounds

Download pretrained model
-------------------------

We also need to download the pretrained model from
https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md:

.. code-block:: bash

   mkdir -p data-bin
   curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin

Running preprocessing
---------------------

Assuming our parallel datasets are in the files ``training.fr`` and ``training.en``, we are now ready to execute the preprocessing script:

.. highlight:: bash

::

    SCRIPTS=mosesdecoder/scripts
    TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
    CLEAN=$SCRIPTS/training/clean-corpus-n.perl
    NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
    REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
    BPEROOT=subword-nmt/subword_nmt
    BPE_TOKENS=50000

    src=en
    tgt=fr
    lang=en-fr
    tmp=tmp
    prep=data_2021

    mkdir -p $tmp $prep

    echo "pre-processing train data..."
    for l in $src $tgt; do
        rm -f $tmp/train.tags.$lang.tok.$l
        # normalize punctuation, strip non-printing characters, tokenize
        cat data/training.$l | \
            perl $NORM_PUNC $l | \
            perl $REM_NON_PRINT_CHAR | \
            perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l
    done

    echo "splitting train and valid..."
    for l in $src $tgt; do
        # hold out every 500th sentence for validation
        awk '{if (NR%500 == 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/valid.$l
        awk '{if (NR%500 != 0) print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l
    done

    TRAIN=$tmp/train.fr-en
    # reuse the BPE codes shipped with the pretrained model extracted above
    BPE_CODE=data-bin/wmt14.en-fr.fconv-py/bpecodes
    rm -f $TRAIN
    for l in $src $tgt; do
        cat $tmp/train.$l >> $TRAIN
    done

    # no need to learn new BPE codes, since we reuse the pretrained ones:
    #echo "learn_bpe.py on ${TRAIN}..."
    #python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE

    for L in $src $tgt; do
        for f in train.$L valid.$L; do
            echo "apply_bpe.py to ${f}..."
            python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f
        done
    done
    #python $BPEROOT/apply_bpe.py -c $BPE_CODE < test.$lang.tok.en > data/bpe.test

    # drop sentence pairs that are empty, longer than 5000 tokens, or whose
    # source/target length ratio exceeds 1.5
    perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt $prep/train 1 5000
    perl $CLEAN -ratio 1.5 $tmp/bpe.valid $src $tgt $prep/valid 1 5000
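After the script finishes, it is worth checking that the cleaned corpus is still parallel, i.e. that the English and French sides of each split have the same number of lines. A minimal sanity check, assuming the outputs landed in ``data_2021`` as above:

.. code-block:: bash

   # both sides of each split should report the same line count
   wc -l data_2021/train.en data_2021/train.fr data_2021/valid.en data_2021/valid.fr

   # eyeball a few aligned sentence pairs side by side
   paste data_2021/train.en data_2021/train.fr | head -3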
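Before training, fairseq also needs the BPE-encoded text in its binarized format. The following is a sketch based on the fairseq translation README linked above; it assumes the pretrained archive extracted into ``data-bin`` ships ``dict.en.txt`` and ``dict.fr.txt``, which we reuse so that our data maps onto the pretrained model's vocabulary:

.. code-block:: bash

   # binarize the cleaned, BPE-encoded corpus, reusing the pretrained dictionaries
   fairseq-preprocess \
       --source-lang en --target-lang fr \
       --trainpref data_2021/train --validpref data_2021/valid \
       --srcdict data-bin/wmt14.en-fr.fconv-py/dict.en.txt \
       --tgtdict data-bin/wmt14.en-fr.fconv-py/dict.fr.txt \
       --destdir data-bin/data_2021 \
       --workers 8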