Preprocessing
Datasets
The following table lists the datasets used to train model versions 1 and 2 (CNN v1 and CNN v2), with the count for each dataset.
| Dataset | Model Training Version 1 (CNN v1) | Model Training Version 2 (CNN v2) |
|---|---|---|
| ATC | 6946 | 6946 |
| CIM11_MMS | 36984 | 36984 |
| CISP | 1434 | 1434 |
| EMDN | 8344 | 8344 |
| ICD-10 | 11443 | 11443 |
| ICF | 1665 | 1665 |
| NCIT | 106602 | 106602 |
| ORDO | 15859 | 15859 |
| SMS | 25888 | 25888 |
| STANDARD_TERMS | 1297 | 1297 |
| SNOMED | | 215769 |
| Total | 216462 | 432231 |
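As a quick sanity check, the totals can be recomputed from the per-dataset counts. The sketch below hard-codes the counts from the table (the variable names are illustrative and not part of the project's scripts) and shows that the CNN v2 total equals the CNN v1 total plus the SNOMED count.

```bash
#!/usr/bin/env bash
# Counts copied from the table; variable names are illustrative only.
v1_total=0
for n in 6946 36984 1434 8344 11443 1665 106602 15859 25888 1297; do
    v1_total=$((v1_total + n))
done

# SNOMED is the only dataset that differs between the two versions.
snomed=215769
v2_total=$((v1_total + snomed))

echo "CNN v1 total: $v1_total"
echo "CNN v2 total: $v2_total"
```

Running it prints 216462 for CNN v1 and 432231 for CNN v2, matching the Total row.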
Running preprocessing
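The following script shuffles and splits the raw parallel term files, applies the BPE codes shipped with the WMT14 en-fr fconv model, and binarizes the result with fairseq-preprocess using that model's dictionaries: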
#!/usr/bin/env bash
set -e
###############################################
# CONFIG — CHANGE THESE PATHS
###############################################
# RAW_EN=data/medterms.en # raw English terms
# RAW_FR=data/medterms.fr
# RAW_EN=data/medterms_plus_snomed.en # raw English terms
# RAW_FR=data/medterms_plus_snomed.fr # raw French terms
RAW_EN=data/initial_unique_english.txt # raw English terms
RAW_FR=data/initial_unique_french.txt # raw French terms
BPE_CODES="E:/icd11/models/wmt14.en-fr.fconv-py/bpecodes" # BPE codes from WMT fconv model
SRC_DICT="E:/icd11/models/wmt14.en-fr.fconv-py/dict.en.txt" # source dictionary from WMT model
TGT_DICT="E:/icd11/models/wmt14.en-fr.fconv-py/dict.fr.txt" # target dictionary from WMT model
# DESTDIR=data-bin/enfr-medterms # output bin directory
# SPLIT_DIR=data/splits # store temporary splits
# DESTDIR=data-bin/enfr-medterms-plus-snomed # output bin directory
# SPLIT_DIR=data/splits-snomed # store temporary splits
DESTDIR=data-bin/initial # output bin directory
SPLIT_DIR=data/initial # store temporary splits
VALID_RATIO=0.01
TEST_RATIO=0.01
mkdir -p "$SPLIT_DIR"
mkdir -p "$DESTDIR"
###############################################
# 1. SPLIT RAW PARALLEL DATA
###############################################
echo "[1/4] Splitting $RAW_EN / $RAW_FR into train/valid/test"
paste "$RAW_EN" "$RAW_FR" | shuf > "$SPLIT_DIR/all.txt"
total=$(wc -l < "$SPLIT_DIR/all.txt")
valid=$(python - <<EOF
print(max(1, int($total * $VALID_RATIO)))
EOF
)
test=$(python - <<EOF
print(max(1, int($total * $TEST_RATIO)))
EOF
)
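# Train must exclude both held-out portions; subtracting only $valid
# would leave the validation lines inside the training set.
train=$((total - valid - test))
head -n "$train" "$SPLIT_DIR/all.txt" > "$SPLIT_DIR/train.txt"
tail -n "$((valid + test))" "$SPLIT_DIR/all.txt" | head -n "$valid" > "$SPLIT_DIR/valid.txt"
tail -n "$test" "$SPLIT_DIR/all.txt" > "$SPLIT_DIR/test.txt"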
cut -f1 "$SPLIT_DIR/train.txt" > "$SPLIT_DIR/medterms.train.en"
cut -f2 "$SPLIT_DIR/train.txt" > "$SPLIT_DIR/medterms.train.fr"
cut -f1 "$SPLIT_DIR/valid.txt" > "$SPLIT_DIR/medterms.valid.en"
cut -f2 "$SPLIT_DIR/valid.txt" > "$SPLIT_DIR/medterms.valid.fr"
echo "Train: $train lines"
echo "Valid: $valid lines"
echo "Test : $test lines (kept in $SPLIT_DIR/test.txt, not binarized)"
###############################################
# 2. APPLY WMT BPE TO ALL SPLITS
###############################################
echo "[2/4] Applying WMT BPE codes"
for prefix in medterms.train medterms.valid; do
    for lang in en fr; do
        in_file="$SPLIT_DIR/$prefix.$lang"
        out_file="$SPLIT_DIR/$prefix.bpe.$lang"
        echo "BPE → $in_file → $out_file"
        subword-nmt apply-bpe \
            -c "$BPE_CODES" \
            < "$in_file" \
            > "$out_file"
    done
done
###############################################
# 3. RENAME BPE FILES TO FAIRSEQ FORMAT
###############################################
echo "[3/4] Preparing fairseq prefixes"
for split in train valid; do
    mv "$SPLIT_DIR/medterms.$split.bpe.en" "$SPLIT_DIR/medterms.$split.en"
    mv "$SPLIT_DIR/medterms.$split.bpe.fr" "$SPLIT_DIR/medterms.$split.fr"
done
###############################################
# 4. RUN FAIRSEQ PREPROCESS WITH WMT DICTS
###############################################
echo "[4/4] Running fairseq-preprocess"
fairseq-preprocess \
    --source-lang en --target-lang fr \
    --trainpref "$SPLIT_DIR/medterms.train" \
    --validpref "$SPLIT_DIR/medterms.valid" \
    --destdir "$DESTDIR" \
    --srcdict "$SRC_DICT" \
    --tgtdict "$TGT_DICT" \
    --workers 8
echo ""
echo "========================================="
echo "Done! Binarized data ready in: $DESTDIR"
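To check that a head/tail split of this kind yields disjoint train, validation, and test sets, the arithmetic can be exercised on a toy file. The following is a minimal sketch under stated assumptions: the temporary directory, the 200-line input generated with `seq`, and the integer-division 1% ratios are all hypothetical stand-ins for the real data and for the Python-based rounding used in the script.

```bash
#!/usr/bin/env bash
set -e

# Toy stand-in for the shuffled parallel file: 200 unique lines (hypothetical data).
workdir=$(mktemp -d)
seq 1 200 > "$workdir/all.txt"

total=$(wc -l < "$workdir/all.txt" | tr -d ' ')
valid=$((total / 100))            # 1% for validation (simplified rounding)
test=$((total / 100))             # 1% for testing (simplified rounding)
train=$((total - valid - test))   # train excludes both held-out portions

head -n "$train" "$workdir/all.txt" > "$workdir/train.txt"
tail -n "$((valid + test))" "$workdir/all.txt" | head -n "$valid" > "$workdir/valid.txt"
tail -n "$test" "$workdir/all.txt" > "$workdir/test.txt"

# Lines that occur in more than one split; 0 means the splits are disjoint.
overlap=$(cat "$workdir/train.txt" "$workdir/valid.txt" "$workdir/test.txt" \
    | sort | uniq -d | wc -l | tr -d ' ')
# Total lines across the three splits; equal to $total when nothing is dropped.
covered=$(cat "$workdir/train.txt" "$workdir/valid.txt" "$workdir/test.txt" \
    | wc -l | tr -d ' ')

echo "train=$train valid=$valid test=$test overlap=$overlap covered=$covered"
rm -rf "$workdir"
```

Because the toy lines are unique, any line shared between two splits shows up in `uniq -d`. On real term lists that contain duplicate pairs, the same check would also count those duplicates, so it is best run on deduplicated input.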