Results

The following table summarizes evaluation results across CNN and NLLB variants.

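| Model / Dataset                  | Num Rows | Exact Acc Top-1 | Norm Acc Top-1 | Exact Acc Top-5 | Norm Acc Top-5 | Gold subset Pred Top-5 |
|----------------------------------|----------|-----------------|----------------|-----------------|----------------|------------------------|
| CNN (Initial dataset): v1        | 721      | 24.69%          | 37.31%         | 48.68%          | 52.15%         | 56.31%                 |
| CNN (New dataset): v2            | 721      | 15.67%          | 32.45%         | 29.54%          | 43.13%         | 47.02%                 |
| CNN (New dataset + SNOMED): v3   | 721      | 28.16%          | 40.50%         | 44.38%          | 53.40%         | 57.98%                 |
| NLLB (Initial dataset): v4       | 721      | 24.41%          | 37.59%         | 45.35%          | 51.32%         | 54.37%                 |
| NLLB (New dataset): v5           | 721      | 17.06%          | 33.43%         | 30.10%          | 46.32%         | 51.46%                 |
| NLLB (New dataset + SNOMED): v6  | 721      | 26.63%          | 42.02%         | 47.71%          | 55.20%         | 60.75%                 |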
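For readers who want to re-check the comparisons drawn from these figures, the following is a minimal sketch that stores the reported percentages and computes the best run per metric and the change between two runs. The short metric keys and the helper names `best_run` and `delta` are illustrative assumptions, not part of the project code.

```python
# Reported percentages, copied from the results above.
# The metric keys are shorthand introduced for this sketch only.
METRICS = ["exact_top1", "norm_top1", "exact_top5", "norm_top5", "gold_top5"]

RESULTS = {
    "v1": [24.69, 37.31, 48.68, 52.15, 56.31],  # CNN (Initial dataset)
    "v2": [15.67, 32.45, 29.54, 43.13, 47.02],  # CNN (New dataset)
    "v3": [28.16, 40.50, 44.38, 53.40, 57.98],  # CNN (New dataset + SNOMED)
    "v4": [24.41, 37.59, 45.35, 51.32, 54.37],  # NLLB (Initial dataset)
    "v5": [17.06, 33.43, 30.10, 46.32, 51.46],  # NLLB (New dataset)
    "v6": [26.63, 42.02, 47.71, 55.20, 60.75],  # NLLB (New dataset + SNOMED)
}


def best_run(metric: str) -> str:
    """Return the run label with the highest score for the given metric."""
    i = METRICS.index(metric)
    return max(RESULTS, key=lambda run: RESULTS[run][i])


def delta(metric: str, before: str, after: str) -> float:
    """Return the change in percentage points between two runs."""
    i = METRICS.index(metric)
    return round(RESULTS[after][i] - RESULTS[before][i], 2)


if __name__ == "__main__":
    for metric in METRICS:
        print(metric, "best:", best_run(metric), "v5->v6:", delta(metric, "v5", "v6"))
```

Note that the best run differs by metric: v6 leads on the normalized and gold-subset columns, while the two exact-match columns are led by CNN runs (v3 for Top-1, v1 for Top-5).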

NLLB configuration rationale

The NLLB configuration was selected based on benchmark results measured on the project validation set.

  • Why NLLB: with both updated datasets it scored higher than the CNN on the normalized and coverage-focused metrics (Norm Acc Top-1, Norm Acc Top-5, Gold subset Pred Top-5), and the strongest run overall was NLLB (New dataset + SNOMED): v6.

  • Key result takeaways:
      * Norm Acc Top-1: 42.02% (best among compared runs)
      * Norm Acc Top-5: 55.20% (best among compared runs)
      * Gold subset Pred Top-5: 60.75% (best among compared runs)

  • SNOMED impact for NLLB (v5 -> v6):
      * Norm Acc Top-1: 33.43% -> 42.02%
      * Norm Acc Top-5: 46.32% -> 55.20%
      * Gold subset Pred Top-5: 51.46% -> 60.75%

  • Why smaller inference parameters: num_beams=2 and max_new_tokens=48 provide faster CPU response while preserving term-level translation quality for short terminology inputs.

  • Training-side intent: conservative fine-tuning settings (moderate learning rate, gradient accumulation, fixed epoch budget) improve medical terminology fidelity without unnecessary production latency or compute cost.
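As an illustration of the inference settings named above, here is a minimal sketch of how `num_beams=2` and `max_new_tokens=48` could be passed to a Hugging Face `transformers` seq2seq model. It assumes an NLLB-style checkpoint and tokenizer have already been loaded elsewhere. The function name `translate_term` and the `target_lang_code` parameter are hypothetical, and the project's actual inference code may differ.

```python
from typing import Any

# Inference settings quoted in the rationale above.
GENERATION_KWARGS = {"num_beams": 2, "max_new_tokens": 48}


def translate_term(model: Any, tokenizer: Any, term: str, target_lang_code: str) -> str:
    """Translate one short term with a small beam and a capped output length.

    `model` and `tokenizer` are assumed to be an already-loaded NLLB-style
    seq2seq model and its tokenizer; loading them is out of scope here.
    """
    inputs = tokenizer(term, return_tensors="pt")
    # NLLB-style checkpoints select the output language through the first
    # generated token, so the target language code is forced as BOS.
    forced_bos = tokenizer.convert_tokens_to_ids(target_lang_code)
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=forced_bos,
        **GENERATION_KWARGS,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Keeping the generation settings in one dictionary makes it easy to benchmark alternative values (for example a wider beam) against the same validation rows before changing the production defaults.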