Scoring function

The API returns a trust_score for each translation. This score combines the model confidence with terminology evidence from the post-processing lookup.

Formula

The implementation uses three cases:

  • Exact full terminology match: return 1.0.

  • No terminology match: return the model confidence.

  • Partial terminology match: combine the model confidence with the terminology match signal.

The weighting scheme is:

\[\begin{split}trust\_score = \begin{cases} 1.0, & \text{if the full translation exactly matches a terminology term} \\ C, & \text{if no terminology match is found} \\ 0.6C + 0.4M, & \text{if partial terminology matches are found} \end{cases}\end{split}\]

where:

\[M = 0.7R_{coverage} + 0.3R_{count}\]

and:

  • C is the model confidence used internally by the scoring function.

  • M is the terminology match signal.

  • R_coverage is the proportion of the translated text covered by matched terminology spans.

  • R_count = min(number_of_matches / 5, 1.0).
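The two ratios that feed M can be sketched as follows. This is a minimal illustration, not the deployed implementation: it assumes that coverage is measured in characters, that matched spans are (start, end) offsets with an exclusive end, and that overlapping spans are counted only once. The names coverage_ratio, count_ratio, and match_signal are hypothetical.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def coverage_ratio(text_length: int, spans: Iterable[Span]) -> float:
    """R_coverage: fraction of the text covered by at least one span.

    Assumption: overlapping or nested spans are counted once.
    """
    if text_length <= 0:
        return 0.0
    covered = 0
    current_end = 0  # right edge of the region already counted
    for start, end in sorted(spans):
        start = max(start, current_end)  # skip the part already counted
        end = min(end, text_length)      # ignore anything past the text
        if end > start:
            covered += end - start
            current_end = end
    return covered / text_length


def count_ratio(number_of_matches: int) -> float:
    """R_count = min(number_of_matches / 5, 1.0), as defined above."""
    return min(number_of_matches / 5, 1.0)


def match_signal(text_length: int, spans: Iterable[Span]) -> float:
    """M = 0.7 * R_coverage + 0.3 * R_count."""
    span_list: List[Span] = list(spans)
    return (0.7 * coverage_ratio(text_length, span_list)
            + 0.3 * count_ratio(len(span_list)))


if __name__ == "__main__":
    # Illustrative values only: a 40-character text with two overlapping spans.
    print(match_signal(40, [(0, 10), (5, 20)]))
```

In the illustrative call, the two spans cover 20 of 40 characters, so R_coverage is 0.5, R_count is 2 / 5 = 0.4, and M is approximately 0.47. Because R_count saturates at five matches, further matches raise M only through additional coverage.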

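In summary, exact terminology matches are trusted fully, and translations with no terminology support rely only on the model confidence. Partial matches pull the score toward the terminology match signal: because 0.6C + 0.4M differs from C by 0.4(M - C), the score rises above the model confidence when M exceeds C and falls below it otherwise.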
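The three cases can be written as a short Python function. This is a sketch based only on the formula in this section. The function name, argument names, and the way an exact match is signalled (a boolean flag) are assumptions made for illustration and may differ from the deployed code.

```python
def trust_score(confidence: float,
                exact_match: bool,
                coverage: float,
                number_of_matches: int) -> float:
    """Combine model confidence C with terminology evidence.

    confidence        -- C, the model confidence
    exact_match       -- True if the full translation exactly matches a term
    coverage          -- R_coverage, in [0, 1]
    number_of_matches -- count of matched terminology spans
    """
    if exact_match:
        return 1.0                      # case 1: exact full match
    if number_of_matches == 0:
        return confidence               # case 2: no terminology support
    r_count = min(number_of_matches / 5, 1.0)
    m = 0.7 * coverage + 0.3 * r_count  # terminology match signal M
    return 0.6 * confidence + 0.4 * m   # case 3: partial match


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise each branch.
    print(trust_score(0.9, True, 1.0, 1))
    print(trust_score(0.9, False, 0.0, 0))
    print(trust_score(0.9, False, 0.8, 3))
```

With the hypothetical inputs C = 0.9, R_coverage = 0.8, and three matches, M is 0.7 × 0.8 + 0.3 × 0.6 = 0.74 and the partial-match score is 0.6 × 0.9 + 0.4 × 0.74 ≈ 0.836, which is below C because M is below C.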

Remote API evaluation

The scenarios below were evaluated against the remote endpoint https://anstranslation2.ddns.net/translate, which serves the deployed NLLB model checkpoint /app/models/checkpoint-16539.

| Scenario | Terms | With highlights | Exact score 1.0 | Average trust_score |
|---|---|---|---|---|
| Exact term with a terminology match | 1 | 1 | 1 | 1.000000 |
| Sentence/label with exact terms matched within terminology | 50 | 50 | 1 | 0.933450 |
| No terminology matches | 6 | 0 | 0 | 0.925313 |

Scenario examples

The table shows representative examples only. The complete 50-label scenario 2 run is stored in scoring_function_remote_api_results.txt.

| Scenario | Input | Output | trust_score | Highlights |
|---|---|---|---|---|
| Scenario 1 | heart failure | insuffisance cardiaque | 1.0 | 1 |
| Scenario 2 | patient with heart failure and respiratory distress | patient avec insuffisance cardiaque et detresse respiratoire | 0.956598 | 5 |
| Scenario 2 | type 2 diabetes mellitus with hypertension | diabete sucre de type 2 avec hypertension | 0.974716 | 3 |
| Scenario 2 | migraine with aura | migraine avec aura | 1.0 | 1 |
| Scenario 2 | burn injury with skin infection | brulure traumatique avec infection cutanee | 0.839952 | 3 |
| Scenario 3 | anti-MOG IgG seropositivity | seropositivite aux IgG anti-MOG | 0.884634 | 0 |
| Scenario 3 | c.1521_1523delCTT CFTR variant | variant c.1521_1523delCTT CFTR | 0.961241 | 0 |
| Scenario 3 | ctDNA MRD positivity | ctDNA MRD positivity | 0.909852 | 0 |
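The per-scenario summary columns (Terms, With highlights, Exact score 1.0, Average trust_score) can be derived from per-item results like the example rows. The sketch below shows one plausible way to do that. It assumes that "With highlights" counts items with at least one highlight and that "Exact score 1.0" counts items whose trust_score equals 1.0. The summarize function and the tuple layout are hypothetical, and the input contains only the representative examples, not the full run.

```python
from collections import defaultdict
from statistics import mean

# (scenario, trust_score, highlights) for the representative examples only.
EXAMPLE_ROWS = [
    ("Scenario 1", 1.0, 1),
    ("Scenario 2", 0.956598, 5),
    ("Scenario 2", 0.974716, 3),
    ("Scenario 2", 1.0, 1),
    ("Scenario 2", 0.839952, 3),
    ("Scenario 3", 0.884634, 0),
    ("Scenario 3", 0.961241, 0),
    ("Scenario 3", 0.909852, 0),
]


def summarize(rows):
    """Group rows by scenario and compute the four summary columns."""
    grouped = defaultdict(list)
    for scenario, score, highlights in rows:
        grouped[scenario].append((score, highlights))

    summary = {}
    for scenario, items in grouped.items():
        summary[scenario] = {
            "terms": len(items),
            # Assumption: an item counts if it has at least one highlight.
            "with_highlights": sum(1 for _, h in items if h > 0),
            # Assumption: an item counts if its score is exactly 1.0.
            "exact_score_1": sum(1 for s, _ in items if s == 1.0),
            "average_trust_score": mean(s for s, _ in items),
        }
    return summary


if __name__ == "__main__":
    for scenario, stats in summarize(EXAMPLE_ROWS).items():
        print(scenario, stats)
```

Because the input is a subset, the averages differ from the full-run figures: the three scenario 3 examples average about 0.9186, whereas the reported average over all six terms is 0.925313.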

Key takeaways

  • Scenario 1 confirms the exact-match rule: when the translated term is found exactly in the terminology, trust_score becomes 1.0.

  • Scenario 2 confirms the partial-match behavior: sentence-like labels usually receive high scores when several translated spans match known terminology.

  • Scenario 3 confirms the fallback behavior on realistic, difficult clinical shorthand: when there are no terminology highlights, the score comes from the model confidence alone.

  • The full scenario 2 batch used 50 labels and is intentionally kept outside the rendered page to avoid making the documentation too long.