12

I just opened a web page in Google Chrome, and it says "This page is in Japanese, would you like to translate it?".

Asking for a translation would presumably send the contents to Google, but how is the language identified in the first place? Is this done locally, in the browser? Or does this also send the page to Google? If so, should I not be asked for permission first? The page itself has no markup to indicate the language, and it is an internal intranet page, so I am not at all sure that Google should have access to its content.

Thilo

3 Answers

11

The Chrome browser can identify, or at least guess, the page language by looking at a number of on-page factors.

This can be done locally without any further internet connection or reporting to Google.

Translation of the content would definitely send the page content to Google servers for translation.

s01ipsist

10

The function is called DeterminePageLanguage. It's in the file components/translate/core/language_detection/language_detection_util.cc

Chrome first checks the HTML lang attribute; if it's not present, it falls back to the Content-Language HTTP header. It then gets a separate prediction from CLD3.
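As a rough illustration of that precedence (not Chrome's actual code), here is a minimal Python sketch. The names HtmlLangParser and declared_language are made up for this example, and it assumes the headers arrive as a plain dict:

```python
from html.parser import HTMLParser


class HtmlLangParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag, if any."""

    def __init__(self):
        super().__init__()
        self.html_lang = ""

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and not self.html_lang:
            self.html_lang = (dict(attrs).get("lang") or "").strip()


def declared_language(html_text, headers):
    """Return the lang attribute if set, else Content-Language, else ""."""
    parser = HtmlLangParser()
    parser.feed(html_text)
    if parser.html_lang:
        return parser.html_lang
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get("content-language", "").strip()


print(declared_language('<html lang="ja"></html>', {"Content-Language": "en"}))
print(declared_language("<html></html>", {"Content-Language": "en"}))
```

The first call takes the attribute and ignores the header; the second has no attribute and falls back to the header.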

The Compact Language Detector v3 (or CLD3) is a neural network model for language identification. The README states:

The inference code extracts character ngrams from the input text and computes the fraction of times each of them appears. For example, as shown in the figure below, if the input text is "banana", then one of the extracted trigrams is "ana" and the corresponding fraction is 2/4. The ngrams are hashed down to an id within a small range, and each id is represented by a dense embedding vector estimated during training.

The model averages the embeddings corresponding to each ngram type according to the fractions, and the averaged embeddings are concatenated to produce the embedding layer.
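To make the README's description concrete, here is a small Python sketch of the two steps for a single n-gram size. It is only an illustration: the CRC32 hash, the table size and the toy embedding values are my assumptions, not what CLD3 uses, and the real embeddings are learned during training.

```python
import zlib
from collections import Counter


def ngram_fractions(text, n):
    """Map each character n-gram in text to its share of all n-grams."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def averaged_embedding(fractions, table):
    """Weighted average of embedding rows, one row per hashed n-gram id."""
    dim = len(table[0])
    result = [0.0] * dim
    for gram, fraction in fractions.items():
        # Assumption: CRC32 modulo the table size stands in for CLD3's hash.
        row = table[zlib.crc32(gram.encode("utf-8")) % len(table)]
        for k in range(dim):
            result[k] += fraction * row[k]
    return result


# Toy table with 8 rows of 3 values each; purely illustrative numbers.
TOY_TABLE = [[((r + c) % 5) / 4 for c in range(3)] for r in range(8)]

fractions = ngram_fractions("banana", 3)
print(fractions)  # "ana" accounts for 2 of the 4 trigrams
print(averaged_embedding(fractions, TOY_TABLE))
```

In the real model this is repeated for several n-gram sizes and the resulting vectors are concatenated, as the README says.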

So essentially, they presumably collected text from a large number of websites, labelled with the language each one is written in. Next they split the text into n-grams (groups of n consecutive characters) and trained a neural network to learn a mapping between n-gram distributions and languages.

So now they have 2 variables:

  • language which is set from either the HTML or the header (recall that the HTML attribute takes precedence if both are present)
  • cld_language which is a prediction based on the frequencies of groups of letters on the page

Then we hit this series of if-statements (I've edited out the part where they send analytics data about language mismatches)

  if (language.empty()) {
    return cld_language;
  }

  if (cld_language == kUnknownLanguageCode) {
    return language;
  }

  if (CanCLDComplementSubCode(language, cld_language)) {
    return cld_language;
  }

  if (IsSameOrSimilarLanguages(language, cld_language)) {
    return language;
  }

  if (MaybeServerWrongConfiguration(language, cld_language)) {
    return cld_language;
  }

  // Content-Language value might be wrong because CLD says that this page is
  // written in another language with confidence. In this case, Chrome doesn't
  // rely on any of the language codes, and gives up suggesting a translation.
  return kUnknownLanguageCode;
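If you want to play with that cascade outside of Chromium, here is a Python sketch of the same order of checks. The three helper predicates are toy stand-ins that I wrote for illustration; they are not Chromium's implementations, and the "und" value for the unknown code is also an assumption.

```python
UNKNOWN = "und"  # assumption: stand-in for Chrome's kUnknownLanguageCode


def primary(code):
    """Primary subtag of a language code, e.g. "en" for "en-US"."""
    return code.split("-")[0].lower()


def can_complement(language, cld_language):
    # Toy rule: the detector adds a sub-code to a bare code ("zh" -> "zh-TW").
    return "-" not in language and cld_language.startswith(language + "-")


def same_or_similar(language, cld_language):
    # Toy rule: both codes share the same primary subtag.
    return primary(language) == primary(cld_language)


def maybe_misconfigured(language, cld_language):
    # Toy rule: a declared "en" is treated as a likely server default.
    return primary(language) == "en"


def determine_page_language(language, cld_language):
    """Same order of checks as the C++ snippet quoted above."""
    if not language:
        return cld_language
    if cld_language == UNKNOWN:
        return language
    if can_complement(language, cld_language):
        return cld_language
    if same_or_similar(language, cld_language):
        return language
    if maybe_misconfigured(language, cld_language):
        return cld_language
    # Declared and detected languages disagree: give up.
    return UNKNOWN


print(determine_page_language("", "ja"))    # no declaration, trust the detector
print(determine_page_language("fr", "de"))  # conflict, no suggestion
```

For the page in the question, which declares no language at all, the very first branch applies and the detector's guess is used.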

CLD3 is small and runs locally. In fact, it's open source and they distribute a pre-trained model (although the code for training the model and the data they used aren't available). You can use it in your own projects.

There are even official Python bindings:

pip install gcld3
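
Once it's installed, usage looks roughly like the sketch below, which follows the example in the gcld3 README. The detect_language wrapper and the guarded import are my own additions, so that the snippet still runs (and returns None) on a machine where the package isn't installed:

```python
def detect_language(text):
    """Return (language, is_reliable) from gcld3, or None if it isn't installed."""
    try:
        import gcld3
    except ImportError:
        return None
    identifier = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
    result = identifier.FindLanguage(text=text)
    return result.language, result.is_reliable


print(detect_language("これは日本語で書かれたページです。"))
```

Everything happens in-process; no text is sent anywhere.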
0

The primary way for a browser like Chrome to determine the language of a web page is by reading the lang attribute on the <html> tag.

It is declared on the root element of an HTML document:

<html lang="en">
...
</html>

For a page in French, it would be:

<html lang="fr">
...
</html>

According to the W3C document, Chrome, as a "user agent", uses this language declaration for a number of important functions (like translation, spell-checking, font selection, TTS, CSS etc.) to improve the user's experience.

If a page contains text in more than one language, the lang attribute can be used on specific HTML elements to declare the language for just that section. Chrome will then apply the rules above specifically to that part of the content:

<p>In Paris, they say <span lang="fr">"Bonjour!"</span>.</p>
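
To illustrate how a nested lang attribute scopes to its own element, here is a small Python sketch that pairs each piece of text with the language of its nearest enclosing element. LangTracker and text_runs are names invented for this example, it assumes well-formed markup without unclosed void elements such as a bare <br>, and it is not how Chrome does it internally:

```python
from html.parser import HTMLParser


class LangTracker(HTMLParser):
    """Pairs each text run with the lang of its nearest enclosing element."""

    def __init__(self):
        super().__init__()
        self.stack = [""]  # effective language for each open element
        self.runs = []     # (language, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # An element without its own lang inherits its parent's.
        self.stack.append(dict(attrs).get("lang") or self.stack[-1])

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.runs.append((self.stack[-1], data.strip()))


def text_runs(html_text):
    tracker = LangTracker()
    tracker.feed(html_text)
    tracker.close()
    return tracker.runs


page = '<html lang="en"><p>In Paris, they say <span lang="fr">"Bonjour!"</span>.</p></html>'
for lang, text in text_runs(page):
    print(lang, text)
```

The text inside the span is tagged fr, while the text before and after it stays en.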

The W3C article also clarifies the status of other methods:

  • Content-Language HTTP Header: The document states that this header is not meant to declare the language of the document itself, but rather the language of the intended audience. It advises against using it for this purpose and recommends always using the lang attribute instead.
  • <meta http-equiv="Content-Language">: Do not use it. Its use is now non-conforming in HTML5.
user159