Results
The following table summarizes evaluation results across CNN and NLLB variants.
| Model / Dataset | Num Rows | Exact Acc Top-1 | Norm Acc Top-1 | Exact Acc Top-5 | Norm Acc Top-5 | Gold subset Pred Top-5 |
|---|---|---|---|---|---|---|
| CNN (Initial dataset): v1 | 721 | 24.69% | 37.31% | 48.68% | 52.15% | 56.31% |
| CNN (New dataset): v2 | 721 | 15.67% | 32.45% | 29.54% | 43.13% | 47.02% |
| CNN (New dataset + SNOMED): v3 | 721 | 28.16% | 40.50% | 44.38% | 53.40% | 57.98% |
| NLLB (Initial dataset): v4 | 721 | 24.41% | 37.59% | 45.35% | 51.32% | 54.37% |
| NLLB (New dataset): v5 | 721 | 17.06% | 33.43% | 30.10% | 46.32% | 51.46% |
| NLLB (New dataset + SNOMED): v6 | 721 | 26.63% | 42.02% | 47.71% | 55.20% | 60.75% |

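
To make the effect of adding SNOMED easier to read, the sketch below computes the percentage-point change between the "New dataset" and "New dataset + SNOMED" rows for both model families. The values are copied from the table. The dictionary keys (`norm_top1`, `norm_top5`, `gold_top5`) and the helper `delta_points` are illustrative names, not part of the project code.

```python
# Metric values (percent) copied from the results table.
# Key names are illustrative shorthand for the table columns.
RESULTS = {
    "v2": {"norm_top1": 32.45, "norm_top5": 43.13, "gold_top5": 47.02},
    "v3": {"norm_top1": 40.50, "norm_top5": 53.40, "gold_top5": 57.98},
    "v5": {"norm_top1": 33.43, "norm_top5": 46.32, "gold_top5": 51.46},
    "v6": {"norm_top1": 42.02, "norm_top5": 55.20, "gold_top5": 60.75},
}


def delta_points(before: str, after: str) -> dict:
    """Return the per-metric change in percentage points, rounded to 2 places."""
    return {
        metric: round(RESULTS[after][metric] - RESULTS[before][metric], 2)
        for metric in RESULTS[before]
    }


cnn_gain = delta_points("v2", "v3")    # CNN: New dataset -> New dataset + SNOMED
nllb_gain = delta_points("v5", "v6")   # NLLB: New dataset -> New dataset + SNOMED

for name, gain in (("CNN", cnn_gain), ("NLLB", nllb_gain)):
    print(name, gain)
```

On these three metrics, adding SNOMED raises NLLB by 8.59, 8.88, and 9.29 points and CNN by 8.05, 10.27, and 10.96 points, so both families benefit, and v6 ends with the highest absolute values.
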
NLLB configuration rationale
The NLLB configuration was selected from measured benchmark behavior on the project validation set.
Why NLLB: it gave better normalized and coverage-focused quality with the updated datasets, and the strongest run was NLLB (New dataset + SNOMED): v6.

Key result takeaways:

* Norm Acc Top-1: 42.02% (best among compared runs)
* Norm Acc Top-5: 55.20% (best among compared runs)
* Gold subset Pred Top-5: 60.75% (best among compared runs)

SNOMED impact for NLLB (v5 -> v6):

* Norm Acc Top-1: 33.43% -> 42.02%
* Norm Acc Top-5: 46.32% -> 55.20%
* Gold subset Pred Top-5: 51.46% -> 60.75%

Why smaller inference parameters: `num_beams=2` and `max_new_tokens=48` provide faster CPU response while preserving term-level translation quality for short terminology inputs.

Training-side intent: conservative fine-tuning settings (moderate learning rate, gradient accumulation, fixed epoch budget) improve medical terminology fidelity without unnecessary production latency or compute cost.
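
As a rough illustration of how the inference settings above might be applied, the sketch below passes `num_beams=2` and `max_new_tokens=48` to a `generate` call. It assumes the model is served through the Hugging Face Transformers seq2seq interface, which this section does not state. The helper `translate_term`, the constant `GENERATION_KWARGS`, and the `tgt_lang` argument are hypothetical names; the checkpoint path and language codes are not given here and are left to the caller.

```python
# Decoding settings named in this section.
GENERATION_KWARGS = {"num_beams": 2, "max_new_tokens": 48}


def translate_term(model, tokenizer, term: str, tgt_lang: str) -> str:
    """Translate one short terminology string with the small-beam settings.

    Assumption: `model` and `tokenizer` follow the Hugging Face Transformers
    seq2seq interface, and `tgt_lang` is an NLLB language code that exists
    in the tokenizer vocabulary.
    """
    inputs = tokenizer(term, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # NLLB selects the output language via the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        **GENERATION_KWARGS,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

With a real checkpoint, `model` and `tokenizer` would typically come from `AutoModelForSeq2SeqLM.from_pretrained(...)` and `AutoTokenizer.from_pretrained(...)`. A beam width of 2 keeps per-step decoding work low, and the 48-token cap bounds worst-case latency, which suits short terminology inputs on CPU.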