Preprocessing

Datasets

The following table lists the datasets used for model training versions 1 and 2 (CNN v1 and CNN v2), with the size of each. SNOMED is included only in version 2.

| Dataset | Model Training Version 1 (CNN v1) | Model Training Version 2 (CNN v2) |
|---|---|---|
| ATC | 6946 | 6946 |
| CIM11_MMS | 36984 | 36984 |
| CISP | 1434 | 1434 |
| EMDN | 8344 | 8344 |
| ICD-10 | 11443 | 11443 |
| ICF | 1665 | 1665 |
| NCIT | 106602 | 106602 |
| ORDO | 15859 | 15859 |
| SMS | 25888 | 25888 |
| STANDARD_TERMS | 1297 | 1297 |
| SNOMED | - | 215769 |
| Total | 216462 | 432231 |
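
The totals can be re-derived from the per-dataset counts. The snippet below is a small sanity check, not part of the training pipeline; the variable names are illustrative, and it assumes the v2 data is the v1 data plus SNOMED, which is what the table implies.

```shell
#!/usr/bin/env bash
# Per-dataset counts copied from the table above (ATC through STANDARD_TERMS).
v1_counts=(6946 36984 1434 8344 11443 1665 106602 15859 25888 1297)
snomed=215769  # assumption: SNOMED is the only addition in v2

v1_total=0
for n in "${v1_counts[@]}"; do
    v1_total=$((v1_total + n))
done
v2_total=$((v1_total + snomed))

echo "v1 total: $v1_total"   # prints "v1 total: 216462"
echo "v2 total: $v2_total"   # prints "v2 total: 432231"
```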

Running preprocessing

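The following script shuffles and splits the raw parallel files, applies the BPE codes of the WMT14 English-French fconv model, and binarizes the result with fairseq-preprocess, reusing that model's source and target dictionaries: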

#!/usr/bin/env bash
set -e

###############################################
# CONFIG — CHANGE THESE PATHS
###############################################
# RAW_EN=data/medterms.en        # raw English terms
# RAW_FR=data/medterms.fr

# RAW_EN=data/medterms_plus_snomed.en        # raw English terms
# RAW_FR=data/medterms_plus_snomed.fr        # raw French terms

RAW_EN=data/initial_unique_english.txt        # raw English terms
RAW_FR=data/initial_unique_french.txt        # raw French terms

BPE_CODES="E:/icd11/models/wmt14.en-fr.fconv-py/bpecodes"        # BPE codes from WMT fconv model
SRC_DICT="E:/icd11/models/wmt14.en-fr.fconv-py/dict.en.txt"        # source dictionary from WMT model
TGT_DICT="E:/icd11/models/wmt14.en-fr.fconv-py/dict.fr.txt"         # target dictionary from WMT model

# DESTDIR=data-bin/enfr-medterms # output bin directory
# SPLIT_DIR=data/splits          # store temporary splits
# DESTDIR=data-bin/enfr-medterms-plus-snomed # output bin directory
# SPLIT_DIR=data/splits-snomed          # store temporary splits
DESTDIR=data-bin/initial # output bin directory
SPLIT_DIR=data/initial          # store temporary splits

VALID_RATIO=0.01
TEST_RATIO=0.01

mkdir -p "$SPLIT_DIR"
mkdir -p "$DESTDIR"

###############################################
# 1. SPLIT RAW PARALLEL DATA
###############################################

echo "[1/4] Splitting $RAW_EN / $RAW_FR into train/valid/test"

paste "$RAW_EN" "$RAW_FR" | shuf > "$SPLIT_DIR/all.txt"

total=$(wc -l < "$SPLIT_DIR/all.txt")
valid=$(python - <<EOF
print(max(1, int($total * $VALID_RATIO)))
EOF
)
test=$(python - <<EOF
print(max(1, int($total * $TEST_RATIO)))
EOF
)
# Hold out the last (valid + test) lines so that train, valid and test do not overlap.
train=$((total - valid - test))

head -n $train "$SPLIT_DIR/all.txt" > "$SPLIT_DIR/train.txt"
tail -n $((valid + test)) "$SPLIT_DIR/all.txt" | head -n $valid > "$SPLIT_DIR/valid.txt"
tail -n $test "$SPLIT_DIR/all.txt" > "$SPLIT_DIR/test.txt"

cut -f1 "$SPLIT_DIR/train.txt" > "$SPLIT_DIR/medterms.train.en"
cut -f2 "$SPLIT_DIR/train.txt" > "$SPLIT_DIR/medterms.train.fr"

cut -f1 "$SPLIT_DIR/valid.txt" > "$SPLIT_DIR/medterms.valid.en"
cut -f2 "$SPLIT_DIR/valid.txt" > "$SPLIT_DIR/medterms.valid.fr"

echo "Train: $train lines"
echo "Valid: $valid lines"
echo "Test : $test lines (written to $SPLIT_DIR/test.txt, not binarized)"

###############################################
# 2. APPLY WMT BPE TO THE TRAIN AND VALID SPLITS
###############################################

echo "[2/4] Applying WMT BPE codes"

for prefix in medterms.train medterms.valid; do
    for lang in en fr; do
        in_file="$SPLIT_DIR/$prefix.$lang"
        out_file="$SPLIT_DIR/$prefix.bpe.$lang"
        echo "BPE: $in_file -> $out_file"

        subword-nmt apply-bpe \
            -c "$BPE_CODES" \
            < "$in_file" \
            > "$out_file"
    done
done

###############################################
# 3. RENAME BPE FILES TO FAIRSEQ FORMAT
###############################################

echo "[3/4] Preparing fairseq prefixes"

for split in train valid; do
    mv "$SPLIT_DIR/medterms.$split.bpe.en" "$SPLIT_DIR/medterms.$split.en"
    mv "$SPLIT_DIR/medterms.$split.bpe.fr" "$SPLIT_DIR/medterms.$split.fr"
done

###############################################
# 4. RUN FAIRSEQ PREPROCESS WITH WMT DICTS
###############################################

echo "[4/4] Running fairseq-preprocess"

fairseq-preprocess \
    --source-lang en --target-lang fr \
    --trainpref "$SPLIT_DIR/medterms.train" \
    --validpref "$SPLIT_DIR/medterms.valid" \
    --destdir "$DESTDIR" \
    --srcdict "$SRC_DICT" \
    --tgtdict "$TGT_DICT" \
    --workers 8

echo ""
echo "========================================="
echo "Done! Binarized data ready in: $DESTDIR"
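
The split arithmetic in step 1 can be sanity-checked on toy data before running the full pipeline. The sketch below assumes the training set should exclude both held-out portions (train = total - valid - test). The scratch directory and the 200-line input are hypothetical stand-ins for the real shuffled file, and the 1% ratios are computed with integer division instead of the Python one-liner.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical scratch directory and toy input: 200 unique lines.
tmp=$(mktemp -d)
seq 1 200 > "$tmp/all.txt"

total=$(wc -l < "$tmp/all.txt")
valid=$((total / 100))   # 1% validation, integer division
test=$((total / 100))    # 1% test
train=$((total - valid - test))

# Same head/tail layout as step 1: train first, then valid, then test.
head -n "$train" "$tmp/all.txt" > "$tmp/train.txt"
tail -n $((valid + test)) "$tmp/all.txt" | head -n "$valid" > "$tmp/valid.txt"
tail -n "$test" "$tmp/all.txt" > "$tmp/test.txt"

# Lines present in more than one split; 0 means the splits are disjoint.
overlap=$(cat "$tmp/train.txt" "$tmp/valid.txt" "$tmp/test.txt" | sort | uniq -d | wc -l | tr -d ' ')

echo "train=$train valid=$valid test=$test overlap=$overlap"
# prints "train=196 valid=2 test=2 overlap=0"

rm -rf "$tmp"
```

A non-zero overlap would mean held-out lines also appear in the training data, which would make validation scores look better than they are.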