Scoring function

The API returns a trust_score for each translation. This score combines the model confidence with terminology evidence from the post-processing lookup.

Formula

The implementation uses three cases:

Exact full terminology match: return 1.0.
No terminology match: return 0.9C, so that unsupported translations are slightly penalized relative to terminology-supported translations.
Partial terminology match: return 0.6C + 0.4M, which combines the model confidence with the terminology match signal.

The weighting scheme is:

\[\begin{split}trust\_score = \begin{cases} 1.0, & \text{if the full translation exactly matches a terminology term} \\ 0.9C, & \text{if no terminology match is found} \\ 0.6C + 0.4M, & \text{if partial terminology matches are found} \end{cases}\end{split}\]

where:

\[M = 0.7R_{coverage} + 0.3R_{count}\]

and:

C is the model confidence used internally by the scoring function.
M is the terminology match signal.
R_coverage is the proportion of the translated text covered by matched terminology spans.
R_count = min(number_of_matches / 5, 1.0).

This means exact terminology matches are trusted fully, partial matches raise or lower the score according to coverage and match count, and translations with no terminology support keep the model confidence signal but are discounted by 10%.
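The three cases can be sketched in Python as follows. This is a minimal illustration of the formula as documented here, not the repository code: the function names, parameter names, and the idea of passing `exact_match`, `coverage`, and `match_count` as precomputed inputs are assumptions made for the example.

```python
def terminology_match_signal(coverage: float, match_count: int) -> float:
    """Return M = 0.7 * R_coverage + 0.3 * R_count."""
    # R_count saturates at 1.0 once five or more matches are found.
    r_count = min(match_count / 5, 1.0)
    return 0.7 * coverage + 0.3 * r_count


def trust_score(
    confidence: float,
    exact_match: bool = False,
    coverage: float = 0.0,
    match_count: int = 0,
) -> float:
    """Return the trust score for one translation (hypothetical signature)."""
    if exact_match:
        # The full translation exactly matches a terminology term.
        return 1.0
    if match_count == 0:
        # No terminology support: keep the model confidence, discounted by 10%.
        return 0.9 * confidence
    # Partial matches: blend model confidence with the match signal.
    return 0.6 * confidence + 0.4 * terminology_match_signal(coverage, match_count)
```

As a worked case with made-up inputs, C = 0.9, R_coverage = 0.8, and 3 matches give R_count = 0.6, M = 0.7 × 0.8 + 0.3 × 0.6 = 0.74, and a trust_score of 0.6 × 0.9 + 0.4 × 0.74 = 0.836.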

Partial highlights are only emitted when the matched span is an exact full terminology label and the label has enough lexical content. French connector words such as avec, de, à, et, and similar function words are ignored even if they appear as standalone rows in the terminology table.
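The highlight rule above could look roughly like the sketch below. It is illustrative only: the function name, the lowercase normalization, the short `FUNCTION_WORDS` set, and the reading of "enough lexical content" as "at least one word that is not a function word" are all assumptions, since the actual threshold and word list are not given here.

```python
from __future__ import annotations

# Assumed subset of French function words; the real list is longer.
FUNCTION_WORDS = {"avec", "de", "à", "et"}


def should_emit_highlight(span: str, terminology_labels: set[str]) -> bool:
    """Decide whether a matched span is emitted as a partial highlight."""
    normalized = span.strip().lower()
    # The span must be an exact full terminology label, not a fragment of one.
    if normalized not in terminology_labels:
        return False
    # Assumed criterion: at least one word that is not a function word.
    content_words = [w for w in normalized.split() if w not in FUNCTION_WORDS]
    return len(content_words) > 0
```

With a terminology set containing `insuffisance cardiaque` and a standalone `avec` row, this sketch emits a highlight for the former and ignores the latter.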

Remote API evaluation

The scenario inputs below were sent to https://anstranslation2.ddns.net/translate, which serves the deployed NLLB model /app/models/checkpoint-16539, to obtain translations, model confidence, and terminology highlights. The reported trust_score values were computed with the local repository formula above, including the no-match penalty. The API itself will apply that penalty only after this code is deployed.

Generated: 2026-06-10T07:35:54.325019+00:00.

Scenario	Labels tested	Labels with highlights	Labels scoring exactly 1.0	Average trust_score
Exact term with a terminology match	1	1	1	`1.000000`
Sentence/label with exact terms matched within terminology	50	50	1	`0.887910`
No terminology matches	6	0	0	`0.832782`

Scenario examples

The table shows representative examples only. The complete 50-label scenario 2 run is stored in scoring_function_remote_api_results.txt. The text report is written as UTF-8 with BOM so French accents display correctly in Windows editors, browsers, and terminals that rely on encoding detection.

Scenario	Input	Output	trust_score	Highlights
Scenario 1	`heart failure`	`insuffisance cardiaque`	`1`	1
Scenario 2	`patient with heart failure and respiratory distress`	`patient avec insuffisance cardiaque et détresse respiratoire`	`0.928598`	3
Scenario 2	`type 2 diabetes mellitus with hypertension`	`diabète sucré de type 2 avec hypertension`	`0.947399`	2
Scenario 2	`migraine with aura`	`migraine avec aura`	`1`	1
Scenario 2	`burn injury with skin infection`	`brulure traumatique avec infection cutanée`	`0.765285`	2
Scenario 3	`anti-MOG IgG seropositivity`	`séropositivité aux IgG anti-MOG`	`0.796171`	0
Scenario 3	`c.1521_1523delCTT CFTR variant`	`variant c.1521_1523delCTT CFTR`	`0.865117`	0
Scenario 3	`ctDNA MRD positivity`	`ctDNA MRD positivity`	`0.818867`	0
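The UTF-8 with BOM encoding mentioned for the text report corresponds to Python's standard `utf-8-sig` codec. The sketch below shows one way to write such a file. The `write_report` helper is hypothetical and is not taken from the repository.

```python
from pathlib import Path


def write_report(path: Path, text: str) -> None:
    """Write text as UTF-8 with a leading byte order mark (BOM)."""
    # The "utf-8-sig" codec prepends the bytes EF BB BF when writing,
    # which helps Windows tools detect the encoding of accented text.
    path.write_text(text, encoding="utf-8-sig")
```

Reading the file back with `encoding="utf-8-sig"` strips the BOM again, so accented output such as `détresse respiratoire` round-trips unchanged.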

Key takeaways

Scenario 1 confirms the exact-match rule: when the translated term is found exactly in the terminology, trust_score becomes 1.0.
Scenario 2 confirms the partial-match behavior: sentence-like labels usually receive high scores when several translated spans match known terminology.
Scenario 3 confirms the no-match behavior: without terminology highlights, the score is the model confidence multiplied by 0.9 (a 10% discount), which keeps these unsupported cases below the scenario 2 average in this run (0.832782 versus 0.887910).
The full scenario 2 batch used 50 labels and is intentionally kept outside the rendered page to avoid making the documentation too long.
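Because the no-match case is a fixed multiplier, the model confidence behind a Scenario 3 score can be estimated by dividing by 0.9. The sketch below applies this to the three representative Scenario 3 rows. The helper name is an assumption, and the results are approximate because the reported scores are rounded to six decimals.

```python
NO_MATCH_FACTOR = 0.9  # multiplier applied when no terminology match is found

# Reported trust_score values for the representative Scenario 3 examples.
scenario_3_scores = {
    "anti-MOG IgG seropositivity": 0.796171,
    "c.1521_1523delCTT CFTR variant": 0.865117,
    "ctDNA MRD positivity": 0.818867,
}


def implied_confidence(score: float) -> float:
    """Invert trust_score = 0.9 * C for a no-match translation."""
    return score / NO_MATCH_FACTOR


for label, score in scenario_3_scores.items():
    print(f"{label}: C is approximately {implied_confidence(score):.4f}")
```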