Results

The following table summarizes evaluation results across CNN and NLLB variants.

Model / Dataset	Num Rows	Exact Acc Top-1	Norm Acc Top-1	Exact Acc Top-5	Norm Acc Top-5	Gold subset Pred Top-5
CNN (Initial dataset): v1	721	24.69%	37.31%	48.68%	52.15%	56.31%
CNN (New dataset): v2	721	15.67%	32.45%	29.54%	43.13%	47.02%
CNN (New dataset + SNOMED): v3	721	28.16%	40.50%	44.38%	53.40%	57.98%
NLLB (Initial dataset): v4	721	24.41%	37.59%	45.35%	51.32%	54.37%
NLLB (New dataset): v5	721	17.06%	33.43%	30.10%	46.32%	51.46%
NLLB (New dataset + SNOMED): v6	721	26.63%	42.02%	47.71%	55.20%	60.75%
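
The rows above can be compared programmatically. The following is a minimal sketch, with the scores copied by hand from the table; the helper names `best_run_per_metric` and `point_gain` are illustrative and not part of the project code.

```python
# Scores (percent) copied from the results table; run keys are the table's version tags.
METRICS = [
    "Exact Acc Top-1",
    "Norm Acc Top-1",
    "Exact Acc Top-5",
    "Norm Acc Top-5",
    "Gold subset Pred Top-5",
]
RESULTS = {
    "v1": [24.69, 37.31, 48.68, 52.15, 56.31],  # CNN (Initial dataset)
    "v2": [15.67, 32.45, 29.54, 43.13, 47.02],  # CNN (New dataset)
    "v3": [28.16, 40.50, 44.38, 53.40, 57.98],  # CNN (New dataset + SNOMED)
    "v4": [24.41, 37.59, 45.35, 51.32, 54.37],  # NLLB (Initial dataset)
    "v5": [17.06, 33.43, 30.10, 46.32, 51.46],  # NLLB (New dataset)
    "v6": [26.63, 42.02, 47.71, 55.20, 60.75],  # NLLB (New dataset + SNOMED)
}


def best_run_per_metric(results):
    """Return the run with the highest score for each metric."""
    best = {}
    for i, metric in enumerate(METRICS):
        best[metric] = max(results, key=lambda run: results[run][i])
    return best


def point_gain(results, before, after):
    """Return the percentage-point change per metric between two runs."""
    return {
        metric: round(results[after][i] - results[before][i], 2)
        for i, metric in enumerate(METRICS)
    }


if __name__ == "__main__":
    for metric, run in best_run_per_metric(RESULTS).items():
        print(f"{metric}: best run is {run}")
    for metric, gain in point_gain(RESULTS, "v5", "v6").items():
        print(f"{metric}: v5 -> v6 change of {gain:+.2f} points")
```

Keeping the numbers in one structure makes it easy to re-check which run leads on each column whenever a new variant is added to the table.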

NLLB configuration rationale

The NLLB configuration was chosen based on benchmark results measured on the project validation set.

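Why NLLB: with the updated datasets it gave better normalized and coverage-focused quality than CNN, and the strongest run on those metrics was NLLB (New dataset + SNOMED): v6. On exact-match accuracy v6 is close but not the leader: CNN v3 is ahead on Exact Acc Top-1 (28.16% vs 26.63%) and CNN v1 on Exact Acc Top-5 (48.68% vs 47.71%).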
Key result takeaways (v6):
* Norm Acc Top-1: 42.02% (best among compared runs)
* Norm Acc Top-5: 55.20% (best among compared runs)
* Gold subset Pred Top-5: 60.75% (best among compared runs)
SNOMED impact for NLLB (v5 -> v6):
* Norm Acc Top-1: 33.43% -> 42.02% (+8.59 points)
* Norm Acc Top-5: 46.32% -> 55.20% (+8.88 points)
* Gold subset Pred Top-5: 51.46% -> 60.75% (+9.29 points)
Why smaller inference parameters: num_beams=2 and max_new_tokens=48 reduce CPU response time while preserving term-level translation quality, because terminology inputs are short.
Training-side intent: conservative fine-tuning settings (moderate learning rate, gradient accumulation, fixed epoch budget) aim to improve medical terminology fidelity without adding unnecessary compute cost or production latency.
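
The inference settings named above (num_beams=2, max_new_tokens=48) could be applied roughly as follows. This is a sketch under assumptions: it presumes the model and tokenizer follow the Hugging Face seq2seq interface, that the target language is selected with `forced_bos_token_id` as is usual for NLLB checkpoints, and that `translate_term` and the `target_lang` argument are hypothetical names. Only the two decoding values come from the text.

```python
from typing import Any

# Decoding values stated in the text; everything else in this sketch is illustrative.
GENERATION_KWARGS = {"num_beams": 2, "max_new_tokens": 48}


def translate_term(model: Any, tokenizer: Any, term: str, target_lang: str) -> str:
    """Translate one short terminology string with the small decoding budget.

    Assumptions: `model` and `tokenizer` expose the Hugging Face seq2seq
    interface (loaded elsewhere), and `target_lang` is a language code that
    the tokenizer knows as a token, e.g. a FLORES-200 code for NLLB.
    """
    inputs = tokenizer(term, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # NLLB-style models pick the output language from the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(target_lang),
        **GENERATION_KWARGS,
    )
    # Decode the single best sequence and drop special tokens.
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

A narrow beam and a short token cap suit this workload because each input is a single term rather than a full sentence, so a wider search or a longer output budget would mostly add latency.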