Language Detection in Private AI

Private AI runs a language detection model on every de-identification request and provides the detected languages as part of the response.

Warning: The de-identification model may be selected independently of the language detection results. Refer to the accuracy parameter description to learn how to force a specific de-identification model and how to enable automatic model selection.

Here is an example of the response returned by the process/text endpoint, including the languages_detected field:

[
    {
        "processed_text": "... <processed user text>...",
        "entities": [... <list of entities found>...],
        "entities_present": true,
        "characters_processed": x,
        "languages_detected": {
            "en": 0.7678971290588379,
            "fr": 0.5934884324888
        }
    }
]
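
As a minimal sketch of how a client might read this field, the snippet below picks the highest-scoring entry from languages_detected. The helper name top_language is hypothetical and not part of the Private AI API; the sample scores are copied from the example response above.

```python
from typing import Dict, Optional, Tuple


def top_language(languages_detected: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Return the (language code, score) pair with the highest score.

    Hypothetical helper for illustration; returns None when the
    mapping is empty.
    """
    if not languages_detected:
        return None
    return max(languages_detected.items(), key=lambda pair: pair[1])


# Scores copied from the example response; other response fields omitted.
sample_response = [
    {"languages_detected": {"en": 0.7678971290588379, "fr": 0.5934884324888}}
]

for item in sample_response:
    print(top_language(item["languages_detected"]))
```

Note that more than one language can be reported for a single input, so a client that only reads the top entry discards the rest.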

Private AI uses the fastText language identification model to detect languages. This model is independent of the Private AI PII model. The list of languages supported by fastText can be found here.

Language detection always runs as part of text and file processing. Detecting a language does not mean that the appropriate PII model (English or Multilingual) was used to process the payload, or that PII detection supports that language. For example, sending a request such as:

{
    "text": ["Ich würde gerne mit John sprechen. Er wohnt in der Neuhauser Straße"],
    "entity_detection": {
        "accuracy": "high"
    }
}

will always use the English-only model to de-identify the input text. The response may still contain entities, but the multilingual model performs better on languages other than English and should be preferred. Here is the response to this payload:

[
    {
        "processed_text": "Ich würde gerne mit [NAME_GIVEN_1] sprechen. Er wohnt in der [LOCATION_ADDRESS_STREET_1]",
        "entities": [
            {
                "processed_text": "NAME_GIVEN_1",
                "text": "John",
                "location": {
                    "stt_idx": 20,
                    "end_idx": 24,
                    "stt_idx_processed": 20,
                    "end_idx_processed": 34
                },
                "best_label": "NAME_GIVEN",
                "labels": {
                    "NAME": 0.9956,
                    "NAME_GIVEN": 0.4324
                }
            },
            {
                "processed_text": "LOCATION_ADDRESS_STREET_1",
                "text": "Neuhauser Straße",
                "location": {
                    "stt_idx": 51,
                    "end_idx": 67,
                    "stt_idx_processed": 61,
                    "end_idx_processed": 88
                },
                "best_label": "LOCATION_ADDRESS_STREET",
                "labels": {
                    "LOCATION": 0.9966,
                    "LOCATION_ADDRESS_STREET": 0.3716
                }
            }
        ],
        "entities_present": true,
        "characters_processed": 67,
        "languages_detected": {
            "de": 0.999757707118988
        }
    }
]

Although entities were detected and the language was identified as de (German), the English PII model (accuracy high) was used to process the text.
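
Because the language detection result does not change which PII model runs, a client may want to spot this kind of mismatch itself. The following is a rough sketch, assuming the request was sent with an English-only accuracy setting: it lists the response items whose top detected language is not en. The function name flag_non_english and the min_score threshold are hypothetical and not part of the Private AI API.

```python
from typing import Any, Dict, List, Tuple


def flag_non_english(
    response_items: List[Dict[str, Any]],
    min_score: float = 0.5,  # hypothetical cut-off, tune for your data
) -> List[Tuple[int, str, float]]:
    """List (index, language code, score) for items whose top language is not 'en'.

    Items without a languages_detected entry are skipped.
    """
    flagged = []
    for index, item in enumerate(response_items):
        languages = item.get("languages_detected") or {}
        if not languages:
            continue
        code, score = max(languages.items(), key=lambda pair: pair[1])
        if code != "en" and score >= min_score:
            flagged.append((index, code, score))
    return flagged


# Language scores copied from the German example; other fields omitted.
response = [{"languages_detected": {"de": 0.999757707118988}}]

for index, code, score in flag_non_english(response):
    print(f"item {index}: detected '{code}' ({score:.3f}); consider the multilingual model")
```

Flagged items can then be re-sent with the multilingual model selected through the accuracy parameter, as described in its documentation.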