The function is called DeterminePageLanguage. It's in the file components/translate/core/language_detection/language_detection_util.cc.
Chrome first checks the HTML lang attribute and, if that's not present, falls back to the Content-Language HTTP header. Then it gets a prediction from CLD3.
The Compact Language Detector v3 (or CLD3) is a neural network model for language identification. The README states:
The inference code extracts character ngrams from the input text and computes the fraction of times each of them appears. For example, as shown in the figure below, if the input text is "banana", then one of the extracted trigrams is "ana" and the corresponding fraction is 2/4. The ngrams are hashed down to an id within a small range, and each id is represented by a dense embedding vector estimated during training.
The model averages the embeddings corresponding to each ngram type according to the fractions, and the averaged embeddings are concatenated to produce the embedding layer.
So essentially, they downloaded copies of a bunch of websites and paid someone to look at the text on those websites and say what language it's written in. Next they split the text into n-grams (groups of n characters) and trained a neural network to learn a mapping between n-gram distributions and languages.
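To make the README's "banana" example concrete, here is a minimal sketch of the fraction computation and the fraction-weighted averaging. This is not CLD3's actual code: the CRC32-based hash, the tiny embedding table, and the function names are all assumptions made up for illustration.

```python
import zlib
from collections import Counter

# Made-up 4-row, 2-dimensional embedding table (assumption). A real model
# would learn these values during training.
TOY_TABLE = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.25, 0.75]]


def ngram_fractions(text, n=3):
    """Map each character n-gram in `text` to its share of all n-grams."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def averaged_embedding(fractions, table):
    """Average the embedding rows, weighting each row by its n-gram fraction."""
    dim = len(table[0])
    out = [0.0] * dim
    for gram, frac in fractions.items():
        # Hypothetical hash: CRC32 modulo the table size stands in for
        # whatever hashing CLD3 really uses.
        row = table[zlib.crc32(gram.encode("utf-8")) % len(table)]
        for i in range(dim):
            out[i] += frac * row[i]
    return out


fractions = ngram_fractions("banana")
print(fractions)  # {'ban': 0.25, 'ana': 0.5, 'nan': 0.25}
print(averaged_embedding(fractions, TOY_TABLE))
```

"banana" has four trigrams (ban, ana, nan, ana), so "ana" gets 2/4, which matches the README. According to the README, the real model then concatenates several such averaged vectors to form its embedding layer.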
So now they have 2 variables:
language, which is set from either the HTML attribute or the header (recall that the HTML attribute takes precedence if both are present)
cld_language, which is a prediction based on the frequencies of groups of letters on the page
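The precedence rule for the first variable can be sketched in a couple of lines. This is only an illustration: the function name and the convention that an empty string means "not present" are my assumptions, and it ignores any cleanup Chrome does on the language codes.

```python
def declared_language(html_lang, content_language):
    """Return the page's declared language; the HTML lang attribute wins.

    Both arguments are plain strings, with "" meaning "not present"
    (an assumption made for this sketch).
    """
    return html_lang if html_lang else content_language


print(declared_language("fr", "en"))  # fr
print(declared_language("", "en"))    # en
```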
Then we hit this series of if-statements (I've edited out the part where they send analytics data about language mismatches):
if (language.empty()) {
  return cld_language;
}
if (cld_language == kUnknownLanguageCode) {
  return language;
}
if (CanCLDComplementSubCode(language, cld_language)) {
  return cld_language;
}
if (IsSameOrSimilarLanguages(language, cld_language)) {
  return language;
}
if (MaybeServerWrongConfiguration(language, cld_language)) {
  return cld_language;
}
// Content-Language value might be wrong because CLD says that this page is
// written in another language with confidence. In this case, Chrome doesn't
// rely on any of the language codes, and gives up suggesting a translation.
return kUnknownLanguageCode;
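If you want to trace a few cases by hand, here is a rough Python transliteration of that cascade. The three helper checks are passed in as parameters because their real implementations aren't shown above; the stub used below and the "und" value for the unknown code are assumptions for this sketch, not Chrome's code.

```python
UNKNOWN = "und"  # assumed value of kUnknownLanguageCode


def determine_page_language(language, cld_language,
                            can_complement_sub_code,
                            is_same_or_similar,
                            maybe_server_misconfigured):
    """Mirror the order of checks in the C++ snippet above."""
    if not language:
        return cld_language  # nothing declared: trust the model
    if cld_language == UNKNOWN:
        return language      # model has no opinion: trust the page
    if can_complement_sub_code(language, cld_language):
        return cld_language
    if is_same_or_similar(language, cld_language):
        return language
    if maybe_server_misconfigured(language, cld_language):
        return cld_language
    return UNKNOWN           # confident disagreement: give up


def never(declared, detected):
    """Stub predicate that always answers False (hypothetical)."""
    return False


print(determine_page_language("", "fr", never, never, never))       # fr
print(determine_page_language("en", UNKNOWN, never, never, never))  # en
print(determine_page_language("en", "ja", never, never, never))     # und
```

The last call is the interesting one: the page declares one language, the model detects another, none of the helper checks reconcile them, and the function returns the unknown code.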
CLD3 is small and runs locally. In fact, it's open source and they distribute a pre-trained model (although the code for training the model and the data they used aren't available). You can use it in your projects.
There are even official Python bindings:
pip install gcld3
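Once it's installed, usage looks roughly like the example in the gcld3 README. The detect_language wrapper and the import guard are my own additions, so that the snippet still loads on a machine without the package. I've left out expected output because it depends on the model.

```python
try:
    import gcld3  # provided by `pip install gcld3`
except ImportError:
    gcld3 = None  # lets this sketch load even without the package


def detect_language(text, min_num_bytes=0, max_num_bytes=1000):
    """Return (language code, is_reliable, probability) for `text`."""
    if gcld3 is None:
        raise RuntimeError("gcld3 is not installed; run `pip install gcld3`")
    detector = gcld3.NNetLanguageIdentifier(min_num_bytes=min_num_bytes,
                                            max_num_bytes=max_num_bytes)
    result = detector.FindLanguage(text=text)
    return result.language, result.is_reliable, result.probability


if gcld3 is not None:
    print(detect_language("This text is written in English."))
```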