Preprocessing

Datasets

The following table lists the parallel datasets that we first preprocess and then use to train the resulting models.

Rounds	Model	Description
1st and 2nd rounds	CNN	2020 Datasets: ICD-10, CHU Rouen, ORDO, ACAD, MEDDRA, ATC, MESH, ICD-O, DBPEDIA, ICPC, ICF
3rd round	CNN	Cleaning to remove bilingual sentence pairs that lead to ambiguities (e.g. ICPC is not structured suitably for use in a training set)
4th round	CNN	3rd round + PatTR corpus (patents database)
5th round	CNN	3rd round + Medline (training2), Scielo datasets
6th round	Transformer	5th round datasets, with Transformer architecture
Ensemble	CNNs	Ensemble of the 3 CNN models from the 3rd, 4th, and 5th rounds
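
Because every dataset above is parallel, line N of the English file must be the translation of line N of the French file. The sketch below is one way to confirm that two files have the same number of lines before preprocessing. The function name check_parallel is hypothetical, and the check only compares line counts; it cannot detect pairs that are shifted or mistranslated.

```shell
# Sketch: verify that two parallel files have the same number of lines.
# check_parallel is a hypothetical helper, not part of Moses or fairseq.
check_parallel() {
    local n_src n_tgt
    # $(( ... )) strips the padding that some wc implementations add
    n_src=$(( $(wc -l < "$1") ))
    n_tgt=$(( $(wc -l < "$2") ))
    if [ "$n_src" -ne "$n_tgt" ]; then
        echo "line count mismatch: $1 has $n_src, $2 has $n_tgt" >&2
        return 1
    fi
    echo "$n_src sentence pairs"
}
```

For example, check_parallel data/training.en data/training.fr prints the number of sentence pairs, or returns a non-zero status if the two files differ in length.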

Download pretrained model

We also need to download the pretrained English-French convolutional model listed in the fairseq translation README (https://github.com/facebookresearch/fairseq/blob/main/examples/translation/README.md):

mkdir -p data-bin
curl https://dl.fbaipublicfiles.com/fairseq/models/wmt14.v2.en-fr.fconv-py.tar.bz2 | tar xvjf - -C data-bin
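
The preprocessing script reads the bpecodes file that ships with this model, so it may be worth confirming that the download and extraction succeeded. A minimal sketch follows. The helper check_model_dir is hypothetical, and the directory name data-bin/wmt14.en-fr.fconv-py is an assumption about what the archive unpacks to.

```shell
# Sketch: report files that are missing or empty in a model directory.
# check_model_dir is a hypothetical helper; pass the directory first,
# then the file names you expect to find inside it.
check_model_dir() {
    local dir=$1
    shift
    local missing=0
    local f
    for f in "$@"; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return $missing
}
```

For example, check_model_dir data-bin/wmt14.en-fr.fconv-py bpecodes returns 0 only if the BPE codes file exists and is not empty.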

Running preprocessing

Assuming our parallel datasets are in the files training.fr and training.en (under data/, where the script reads them), we are now ready to run the preprocessing script:

SCRIPTS=mosesdecoder/scripts
TOKENIZER=$SCRIPTS/tokenizer/tokenizer.perl
CLEAN=$SCRIPTS/training/clean-corpus-n.perl
NORM_PUNC=$SCRIPTS/tokenizer/normalize-punctuation.perl
REM_NON_PRINT_CHAR=$SCRIPTS/tokenizer/remove-non-printing-char.perl
BPEROOT=subword-nmt/subword_nmt
BPE_TOKENS=50000

src=en
tgt=fr
lang=en-fr
tmp=tmp
orig=orig

# create the working directory and the output directory used by the cleaning step
mkdir -p $tmp data_2021

echo "pre-processing train data..."
for l in $src $tgt; do
    # -f: do not fail on the first run, when the file does not exist yet
    rm -f $tmp/train.tags.$lang.tok.$l
    cat data/training.$l | \
        perl $NORM_PUNC $l | \
        perl $REM_NON_PRINT_CHAR | \
        perl $TOKENIZER -threads 8 -a -l $l >> $tmp/train.tags.$lang.tok.$l
done

echo "splitting train and valid..."
for l in $src $tgt; do
    awk '{if (NR%500 == 0)  print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/valid.$l
    awk '{if (NR%500 != 0)  print $0; }' $tmp/train.tags.$lang.tok.$l > $tmp/train.$l
done

TRAIN=$tmp/train.fr-en
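# BPE codes shipped with the pretrained model; adjust this path to where
# the archive was extracted (data-bin/ in the download step above)
BPE_CODE=../wmt14.en-fr.fconv-py/bpecodes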

rm -f $TRAIN
for l in $src $tgt; do
    cat $tmp/train.$l >> $TRAIN
done

# Learning new BPE codes is disabled: the codes of the pretrained model are reused.
#echo "learn_bpe.py on ${TRAIN}..."
#python $BPEROOT/learn_bpe.py -s $BPE_TOKENS < $TRAIN > $BPE_CODE

for L in $src $tgt; do
    for f in train.$L valid.$L; do
        echo "apply_bpe.py to ${f}..."
        python $BPEROOT/apply_bpe.py -c $BPE_CODE < $tmp/$f > $tmp/bpe.$f
    done
done

#python $BPEROOT/apply_bpe.py -c $BPE_CODE < test.$lang.tok.en > data/bpe.test

# Keep only sentence pairs with 1 to 5000 tokens per side and a length ratio of at most 1.5
perl $CLEAN -ratio 1.5 $tmp/bpe.train $src $tgt data_2021/train 1 5000
perl $CLEAN -ratio 1.5 $tmp/bpe.valid $src $tgt data_2021/valid 1 5000
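
The train/valid split in the script holds out every 500th line for validation. Because the rule depends only on the line number, applying it to both languages keeps the sentence pairs aligned. The sketch below isolates that step so it can be tried on a small file. The function name split_every and its arguments are hypothetical, and the period of 500 is taken from the script.

```shell
# Sketch of the split step: every n-th line goes to the validation file,
# all other lines go to the training file.
# $1 input, $2 train output, $3 valid output, $4 period (default 500)
split_every() {
    local n=${4:-500}
    # create both outputs even if one of them receives no lines
    : > "$2"
    : > "$3"
    awk -v n="$n" -v train="$2" -v valid="$3" \
        '{ if (NR % n == 0) print > valid; else print > train }' "$1"
}
```

For example, running split_every on a 1000-line file with the default period sends lines 500 and 1000 to the validation file and the remaining 998 lines to the training file.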