Detect, Parse, and Validate Entities in Text
info
To run the example code in this guide, please sign up for your free test API key here.
In addition to de-identification and redaction, Private AI also supports entity detection and validation. The analyze/text route described below is an essential tool for exploring and structuring your data, as well as for building statistics on it. In this guide, we demonstrate how to use the analyze/text endpoint, introduced in 4.1, to return the analysis results for detected entities, with examples of how these results can serve your own use cases.
Analyze entities in text (new in 4.1)
The analyze/text route returns a list of detected entities, along with the formatted text for each entity and a description of its subtypes. In this guide, we provide payloads for Private AI's analyze/text REST API route and document the associated responses. To better illustrate how this information can be used, we walk through a series of common use cases.
Validation and custom redaction of credit card numbers
Some numerical entities include a checksum in their values. The checksum confirms the entity's validity and minimizes the chance of error during transcription. This is the case for credit card numbers, which must satisfy the Luhn algorithm. The analyze/text route implements this algorithm on top of the NER model detection, providing an additional safeguard that the detected number is indeed a valid credit card number. Let's look at three specific examples containing credit card numbers.
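For reference, the Luhn check that these numbers are run through is simple to reproduce. Here is a minimal standalone sketch (for illustration only; the analyze/text route performs this validation for you):

def luhn_is_valid(number: str) -> bool:
    """Returns True if the digits in `number` satisfy the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Starting from the rightmost digit, double every second digit and
    # subtract 9 whenever the doubled value exceeds 9.
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_is_valid("4242424242424242"))     # True
print(luhn_is_valid("6578-7790-4346-2237"))  # False

The request below enables CREDIT_CARD detection on three example texts; the response follows.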
{
"text": [
"Okay, hang on just a second because I got to get it. Okay, it is 6578-7790-4346-2237. Expiration. 1224.",
"All right, I'm ready. 800 678-457-7896. Expiration is one. 224.",
"CC_type: Diners Club International RuPay Visa JCB Amex CCN: 30569309025904 4242424242424242 4222222222222 6172873484776530 378282246310005 CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70"
],
"locale": "en-US",
"entity_detection": {
"accuracy": "high",
"entity_types": [
{
"type": "ENABLE",
"value": ["CREDIT_CARD"]
}
]
}
}
[
{
"entities": [
{
"text": "6578-7790-4346-2237",
"location": {
"stt_idx": 65,
"end_idx": 84
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9022786834023215
},
"analysis_result": {
"formatted": "6578 7790 4346 2237",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "invalid"
}
]
}
}
],
"entities_present": true,
"characters_processed": 103,
"languages_detected": {
"en": 0.9202778935432434
}
},
{
"entities": [
{
"text": "800 678-457-7896",
"location": {
"stt_idx": 22,
"end_idx": 40
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9012922777069939
},
"analysis_result": {
"subtypes": [],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 65,
"languages_detected": {
"en": 0.8164065480232239
}
},
{
"entities": [
{
"text": "30569309025904",
"location": {
"stt_idx": 60,
"end_idx": 74
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "3056 9309 025 904",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
},
{
"text": "4242424242424242",
"location": {
"stt_idx": 75,
"end_idx": 91
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "4242 4242 4242 4242",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
},
{
"text": "4222222222222",
"location": {
"stt_idx": 92,
"end_idx": 105
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "4222 222 222 222",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
},
{
"text": "6172873484776530",
"location": {
"stt_idx": 106,
"end_idx": 122
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9088553956576756
},
"analysis_result": {
"formatted": "6172 8734 8477 6530",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "invalid"
}
]
}
},
{
"text": "378282246310005",
"location": {
"stt_idx": 123,
"end_idx": 138
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "3782 8224 6310 005",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
}
],
"entities_present": true,
"characters_processed": 208,
"languages_detected": {
"en": 0.24319741129875183
}
}
]
The above request contains two fields, text and entity_detection, that are shared by the analyze/text, ner/text, and process/text routes. The text field contains the text to analyze, and the entity_detection field contains the NER configuration (e.g., the list of entity types to detect). One last field in the request, locale, is unique to the analyze/text request. The locale field is used as a hint to help the analyzer parse dates and other locale-dependent entities. For example, setting locale to en-US forces the analyzer to interpret the date 12-10-2020 as December 10, 2020 instead of October 12, 2020. Several examples of the values these fields can take are provided below.
The full response above is a mouthful, so let's look at the first example's response in more detail.
{
"entities": [
{
"text": "6578-7790-4346-2237",
"location": {
"stt_idx": 65,
"end_idx": 84
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9022786834023215
},
"analysis_result": {
"formatted": "6578 7790 4346 2237",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "invalid"
}
]
}
}
],
"entities_present": true,
"characters_processed": 103,
"languages_detected": {
"en": 0.9202778935432434
}
}
The response contains three main parts:
- The entity information, including its text and its location. These fields are shared with other routes, including the ner/text and process/text routes, and have the same use.
- The formatted text of the entity. This field is unique to the analyze/text route and provides a "standard" format for the entity, which can facilitate post-processing logic on detected entities. The formats are described in the following table.
Entity Type | Format | Example |
---|---|---|
CREDIT_CARD | space-separated groups of 3 to 5 digits | 6578 7790 4346 2237 |
DATE | ISO-8601 | 2025-03-20T18:00:00+00:00 |
DOB | ISO-8601 | 2025-03-20 |
AGE | decimal numeral | 12 |
All other entity types | no formatting | - |
- A list of validation assertions on the entity, which is also unique to the analyze/text route. It contains a list of objects that are specific to the entity being detected. In this example, the provider is the Luhn algorithm that was run on the credit card number, and the result of the algorithm is provided in the status field. Currently, only credit card numbers have validation assertions, but more assertion providers will be added in the future.
The analysis result of this first example can be summed up as follows: the credit card number was successfully parsed, and the parsed result is placed in the formatted field. However, although the number matches the credit card number format, it fails the Luhn check, so it is not a valid credit card number. This could be the result of a transcription error, for example.
The information included in the analysis result enables custom redaction of entities. The following code shows an example of a custom redaction of credit card numbers.
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="http://localhost:8080")
text = [
"Okay, hang on just a second because I got to get it. Okay, it is 6578-7790-4346-2237. Expiration. 1224.",
"All right, I'm ready. 800 678-457-7896. Expiration is one. 224.",
"CC_type: Diners Club International RuPay Visa JCB Amex CCN: 30569309025904 4242424242424242 4222222222222 6172873484776530 378282246310005 CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["CREDIT_CARD"]}]}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_credit_card(entity) -> str:
"""Redacts credit card numbers"""
analysis_result = entity["analysis_result"]
for assertion in analysis_result["validation_assertions"]:
if assertion["provider"] == "luhn":
if assertion["status"] == "valid":
return f"[{'*' * 12}{analysis_result['formatted'][-4:]}]"
else:
return f"{analysis_result['formatted']} [INVALID]"
return f"{entity['text']}"
entity_processors = {"CREDIT_CARD": redact_credit_card}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The redact_credit_card function contains the necessary logic to redact credit card numbers as follows:
- If the credit card number is valid, hide it except for the last four characters (which may include spaces).
- If the credit card number is parsed correctly but fails the Luhn check, the number is invalid. In this case, don't hide the number; instead, add an INVALID tag after it. This makes invalid credit card numbers easier to spot in text for a later review.
- If the number fails to parse as a credit card number, do nothing. The code assumes it is not actually a credit card number.
The above code output looks like this:
Okay, hang on just a second because I got to get it. Okay, it is 6578 7790 4346 2237 [INVALID]. Expiration. 1224.
All right, I'm ready. 800 678-457-7896. Expiration is one. 224.
CC_type: Diners Club International RuPay Visa JCB Amex CCN: [************ 904] [************4242] [************ 222] 6172 8734 8477 6530 [INVALID] [************ 005] CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70
Notice how the credit card number on the first line was not redacted; an INVALID marker was added right after it instead. On the second line, the 800 678-457-7896 entity was not redacted, as expected: it is likely a phone number rather than a credit card number. Finally, the last line shows several valid credit card numbers and a single invalid one. The valid credit card numbers were masked except for their last characters, as expected.
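The same analysis results can also feed statistics rather than redaction. A small sketch, reusing resp from the code above, that counts the Luhn verdicts across all three texts:

from collections import Counter

status_counts = Counter(
    assertion["status"]
    for entities in resp.entities
    for entity in entities
    for assertion in entity.get("analysis_result", {}).get("validation_assertions", [])
)
print(status_counts)  # Counter({'valid': 4, 'invalid': 2}) for the texts above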
Date shifting and custom redaction of dates
Dates are a type of PII encountered in almost every dataset. Redaction is one way to ensure that sensitive dates do not create privacy issues, but fully redacting dates often reduces the utility of the redacted data. It is therefore often preferable to use obfuscation methods that preserve utility; two well-known techniques are date shifting and date bucketing, sketched below. Let's consider three examples containing dates.
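As a standalone illustration of the two techniques (independent of the API; the helper names are ours):

import random
from datetime import datetime, timedelta

def shift_date(d: datetime, max_weeks: int = 20) -> datetime:
    """Date shifting: move the date by a random offset."""
    return d + timedelta(weeks=random.randint(0, max_weeks))

def bucket_date(d: datetime) -> str:
    """Date bucketing: keep only the year and the quarter."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

d = datetime(2018, 7, 10)
print(shift_date(d).date())  # e.g. 2018-10-02 (the offset is random)
print(bucket_date(d))        # 2018-Q3

The request below enables the date-related entity types on three example texts; the response follows.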
{
"text": [
"$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (July 10 2018) https://t.co/eit53RUY4g",
"Short sale volume (not short interest) for $KBE on 2018-07-09 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%",
"$WLTW high OI range is 160 to 155 for option expiration 07/20/2018 #options https://t.co/BnVElKBKkJ"
],
"locale": "en-US",
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["DATE", "DOB", "DAY", "MONTH", "YEAR"]
}
]
}
}
[
{
"entities": [
{
"text": "July 10 2018",
"location": {
"stt_idx": 89,
"end_idx": 101
},
"best_label": "DATE",
"labels": {
"DATE": 0.9400081038475037,
"MONTH": 0.3111259341239929,
"DAY": 0.31207050879796344,
"YEAR": 0.29245950778325397
},
"analysis_result": {
"formatted": "2018-07-10T00:00:00",
"subtypes": [
{
"text": "10",
"formatted": "10",
"label": "DAY",
"location": {
"stt_idx": 94,
"end_idx": 96
}
},
{
"text": "July",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 89,
"end_idx": 93
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 97,
"end_idx": 101
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 126,
"languages_detected": {
"en": 0.6427053809165955
}
},
{
"entities": [
{
"text": "2018-07-09",
"location": {
"stt_idx": 51,
"end_idx": 61
},
"best_label": "DATE",
"labels": {
"DATE": 0.9267139077186585,
"YEAR": 0.17909334897994994,
"MONTH": 0.18299812078475952,
"DAY": 0.18503443002700806
},
"analysis_result": {
"formatted": "2018-07-09T00:00:00",
"subtypes": [
{
"text": "09",
"formatted": "9",
"label": "DAY",
"location": {
"stt_idx": 59,
"end_idx": 61
}
},
{
"text": "07",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 56,
"end_idx": 58
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 51,
"end_idx": 55
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 132,
"languages_detected": {
"en": 0.5451536178588867
}
},
{
"entities": [
{
"text": "07/20/2018",
"location": {
"stt_idx": 56,
"end_idx": 66
},
"best_label": "DATE",
"labels": {
"DATE": 0.9359936833381652,
"MONTH": 0.18900736570358276,
"DAY": 0.18550281524658202,
"YEAR": 0.18460171222686766
},
"analysis_result": {
"formatted": "2018-07-20T00:00:00",
"subtypes": [
{
"text": "20",
"formatted": "20",
"label": "DAY",
"location": {
"stt_idx": 59,
"end_idx": 61
}
},
{
"text": "07",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 56,
"end_idx": 58
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 62,
"end_idx": 66
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 99,
"languages_detected": {
"en": 0.7047932744026184
}
}
]
Let's look at one specific date entity in the above response.
{
"text": "July 10 2018",
"location": {
"stt_idx": 89,
"end_idx": 101
},
"best_label": "DATE",
"labels": {
"DATE": 0.9400081038475037,
"MONTH": 0.3111259341239929,
"DAY": 0.31207050879796344,
"YEAR": 0.29245950778325397
},
"analysis_result": {
"formatted": "2018-07-10T00:00:00",
"subtypes": [
{
"text": "10",
"formatted": "10",
"label": "DAY",
"location": {
"stt_idx": 94,
"end_idx": 96
}
},
{
"text": "July",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 89,
"end_idx": 93
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 97,
"end_idx": 101
}
}
],
"validation_assertions": []
}
}
Many pieces of information are accessible from the analysis_result
object. First, it is possible to access the formatted date "2018-07-10T00:00:00" from the field analysis_result.formatted
. If you plan to implement logic on the dates found in the text, it might be easier to access the formatted dates rather than the original, non-standard date formats (e.g., "July 10 2018").
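For instance, the ISO-8601 value parses directly with Python's standard library:

from datetime import datetime

parsed = datetime.fromisoformat("2018-07-10T00:00:00")
print(parsed.year, parsed.month, parsed.day)  # 2018 7 10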
Also, it is possible to directly access the day, month, and year of the date entity via the response fields in analysis_result.subtypes
. This information can be used to partially redact or to bucketize dates. The following code gives an example of how to redact the day and month from the original dates, but keep the year unchanged. This example uses helpers that have been made available in the Private AI python client.
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="http://localhost:8080")
text = [
"$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (July 10 2018) https://t.co/eit53RUY4g",
"Short sale volume (not short interest) for $KBE on 2018-07-09 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%",
"$WLTW high OI range is 160 to 155 for option expiration 07/20/2018 #options https://t.co/BnVElKBKkJ",
]
request = {
"text": text,
"locale": "en-US",
"entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["DATE", "DOB", "DAY", "MONTH", "YEAR"]}]},
}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_date(entity) -> str:
"""Redacts days and months from dates"""
offset = entity["location"]["stt_idx"]
text = entity["text"]
for subtype in entity["analysis_result"]["subtypes"]:
if subtype["label"] in ["DAY", "MONTH"] and "location" in subtype:
stt = subtype["location"]["stt_idx"] - offset
end = subtype["location"]["end_idx"] - offset
text = text[:stt] + "#" * (end - stt) + text[end:]
return text
entity_processors = {"DATE": redact_date, "DOB": redact_date}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of this request is provided below:
$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (#### ## 2018) https://t.co/eit53RUY4g
Short sale volume (not short interest) for $KBE on 2018-##-## is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%
$WLTW high OI range is 160 to 155 for option expiration ##/##/2018 #options https://t.co/BnVElKBKkJ
Notice how the dates have been partially redacted. A similar approach can be used to instead shift the dates. To do so, simply replace the date processor in the above code with this one:
import random
from datetime import datetime, timedelta

def redact_date(entity) -> str:
    """Shifts the date by a random number of weeks (0 to 20)"""
    random_week_offset = random.randint(0, 20)
    if "formatted" in entity["analysis_result"]:
        formatted_datetime = datetime.fromisoformat(entity["analysis_result"]["formatted"])
        return str((formatted_datetime + timedelta(weeks=random_week_offset)).date())
    else:
        return entity["text"]
This is an example output of this date processor.
$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (2018-07-17) https://t.co/eit53RUY4g
Short sale volume (not short interest) for $KBE on 2018-08-20 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%
$WLTW high OI range is 160 to 155 for option expiration 2018-09-28 #options https://t.co/BnVElKBKkJ
Notice how the dates are replaced with dates that have been shifted by a random number of weeks.
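Note that this processor draws an independent offset for each entity, so the intervals between dates in the same document are not preserved. If those intervals matter for your use case, one variant (our suggestion, not part of the client) is to fix the offset once per document:

import random
from datetime import datetime, timedelta

# One offset per document preserves the relative spacing between its dates.
document_week_offset = random.randint(0, 20)

def redact_date(entity) -> str:
    """Shifts every date in the document by the same number of weeks"""
    if "formatted" in entity["analysis_result"]:
        formatted_datetime = datetime.fromisoformat(entity["analysis_result"]["formatted"])
        return str((formatted_datetime + timedelta(weeks=document_week_offset)).date())
    return entity["text"]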
Age bucketing and custom redaction of numbers
Similar to dates, it is possible to analyze ages and other numerical entities to create custom redaction. Consider these two examples.
{
"text": [
"A 32-year old Black female German citizen living in Germany wants to travel to the United States for leisure.",
"West Point Public School division provides school-based preschool services for children from two through nine years of age who are children at risk and children with identified disabilities or delays."
],
"link_batch": false,
"locale": "en-US",
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["AGE"]
}
]
}
}
[
{
"entities": [
{
"text": "32",
"location": {
"stt_idx": 2,
"end_idx": 4
},
"best_label": "AGE",
"labels": {
"AGE": 0.9668179750442505
},
"analysis_result": {
"formatted": 32,
"subtypes": [],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 109,
"languages_detected": {
"en": 0.9611877202987671
}
},
{
"entities": [
{
"text": "two",
"location": {
"stt_idx": 93,
"end_idx": 96
},
"best_label": "AGE",
"labels": {
"AGE": 0.9462096095085144
},
"analysis_result": {
"formatted": 2,
"subtypes": [],
"validation_assertions": []
}
},
{
"text": "nine",
"location": {
"stt_idx": 105,
"end_idx": 109
},
"best_label": "AGE",
"labels": {
"AGE": 0.9411536455154419
},
"analysis_result": {
"formatted": 9,
"subtypes": [],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 200,
"languages_detected": {
"en": 0.9786704778671265
}
}
]
Using the Private AI python client, the above analyze/text response can be used to bucketize ages.
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="http://localhost:8080")
text = [
"A 32-year old Black female German citizen living in Germany wants to travel to the United States for leisure.",
"West Point Public School division provides school-based preschool services for children from two through nine years of age who are children at risk and children with identified disabilities or delays.",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["AGE"]}]}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_age(entity) -> str:
    """Rounds the age to the nearest multiple of ten"""
    if "formatted" in entity["analysis_result"]:
        age = entity["analysis_result"]["formatted"]
        # round(age * 10, -2) rounds to the nearest 100; dividing by 10
        # therefore rounds the age itself to the nearest multiple of ten.
        return str(int(round(age * 10, -2) / 10))
    else:
        return "#"
entity_processors = {"AGE": redact_age}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of this code shows that ages have been bucketed to the closest multiple of ten.
A 30-year old Black female German citizen living in Germany wants to travel to the United States for leisure.
West Point Public School division provides school-based preschool services for children from 0 through 10 years of age who are children at risk and children with identified disabilities or delays.
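If ranges read better than rounded values for your use case, the processor can instead emit a ten-year bucket label. A hypothetical variant (the range format is our choice, not part of the client):

def redact_age(entity) -> str:
    """Replaces the age with a ten-year range such as 30-39"""
    if "formatted" in entity["analysis_result"]:
        age = int(entity["analysis_result"]["formatted"])
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "#"

With this processor, the first example would read "A 30-39-year old ..." instead.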
Custom redaction of addresses
The GDPR and other privacy legislation impose strict requirements on the redaction of addresses. In the following scenario, we demonstrate how to partially redact an address by keeping only the less sensitive characters of a zip/postal code and removing all other address information (e.g., civic number, street name, and so on).
{
"text": [
"Please deliver this to 45, Clybaun Heights, Galway City, Ireland H91 AKK3",
"3255 M-A-D-D-A-M-S street, huntington, west virginia is his birthplace",
"My favorite city is San Francisco, California 94110, United States, 37.7749° N, 122.4194° W"
],
"locale": "en-US"
}
[
{
"entities": [
{
"text": "45, Clybaun Heights, Galway City, Ireland H91 AKK3",
"location": {
"stt_idx": 23,
"end_idx": 73
},
"best_label": "LOCATION_ADDRESS",
"labels": {
"LOCATION_ADDRESS_STREET": 0.3171827793121338,
"LOCATION": 0.9123516889179454,
"LOCATION_ADDRESS": 0.9221759648884044,
"LOCATION_CITY": 0.16148114204406738,
"LOCATION_COUNTRY": 0.05482322678846471,
"LOCATION_ZIP": 0.26978740271400004
},
"analysis_result": {
"subtypes": [
{
"text": "45, Clybaun Heights",
"label": "LOCATION_ADDRESS_STREET",
"location": {
"stt_idx": 23,
"end_idx": 42
}
},
{
"text": "Galway City",
"label": "LOCATION_CITY",
"location": {
"stt_idx": 44,
"end_idx": 55
}
},
{
"text": "Ireland",
"label": "LOCATION_COUNTRY",
"location": {
"stt_idx": 57,
"end_idx": 64
}
},
{
"text": "H91 AKK3",
"label": "LOCATION_ZIP",
"location": {
"stt_idx": 65,
"end_idx": 73
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 73,
"languages_detected": {
"en": 0.8342836499214172
}
},
{
"entities": [
{
"text": "3255 M-A-D-D-A-M-S street, huntington, west virginia",
"location": {
"stt_idx": 0,
"end_idx": 52
},
"best_label": "LOCATION_ADDRESS",
"labels": {
"LOCATION_ADDRESS_STREET": 0.6232224106788635,
"LOCATION_ADDRESS": 0.9109632035960322,
"LOCATION": 0.8909260371456975,
"LOCATION_CITY": 0.07817105106685472,
"LOCATION_STATE": 0.12203486328539641
},
"analysis_result": {
"subtypes": [
{
"text": "3255 M-A-D-D-A-M-S street",
"label": "LOCATION_ADDRESS_STREET",
"location": {
"stt_idx": 0,
"end_idx": 25
}
},
{
"text": "huntington",
"label": "LOCATION_CITY",
"location": {
"stt_idx": 27,
"end_idx": 37
}
},
{
"text": "west virginia",
"label": "LOCATION_STATE",
"location": {
"stt_idx": 39,
"end_idx": 52
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 70,
"languages_detected": {
"en": 0.9467829465866089
}
},
{
"entities": [
{
"text": "San Francisco, California 94110, United States, 37.7749\u00b0 N, 122.4194\u00b0 W",
"location": {
"stt_idx": 20,
"end_idx": 91
},
"best_label": "LOCATION",
"labels": {
"LOCATION_CITY": 0.080466923614343,
"LOCATION": 0.8993716637293497,
"LOCATION_ADDRESS": 0.200799106930693,
"LOCATION_STATE": 0.03926792989174525,
"LOCATION_ZIP": 0.12127648045619328,
"LOCATION_COUNTRY": 0.07723071426153183,
"LOCATION_COORDINATE": 0.4833615819613139
},
"analysis_result": {
"subtypes": [
{
"text": "San Francisco",
"label": "LOCATION_CITY",
"location": {
"stt_idx": 20,
"end_idx": 33
}
},
{
"text": "California",
"label": "LOCATION_STATE",
"location": {
"stt_idx": 35,
"end_idx": 45
}
},
{
"text": "94110",
"label": "LOCATION_ZIP",
"location": {
"stt_idx": 46,
"end_idx": 51
}
},
{
"text": "United States",
"label": "LOCATION_COUNTRY",
"location": {
"stt_idx": 53,
"end_idx": 66
}
},
{
"text": "37.7749\u00b0 N, 122.4194\u00b0 W",
"label": "LOCATION_COORDINATE",
"location": {
"stt_idx": 68,
"end_idx": 91
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 91,
"languages_detected": {
"en": 0.7658711075782776
}
}
]
The above request contains three examples with addresses, and the corresponding analyze/text response contains the result of the analysis. This response, along with the following code, can be used to mask street addresses in order to hide the most sensitive information.
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="http://localhost:8080")
text = [
"Please deliver this to 45, Clybaun Heights, Galway City, Ireland H91 AKK3",
"3255 M-A-D-D-A-M-S street, huntington, west virginia is his birthplace",
"My favorite city is San Francisco, California 94110, United States, 37.7749° N, 122.4194° W",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high"}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_address(entity) -> str:
"""Redacts address to hide the most sensitive info"""
analysis_result = entity["analysis_result"]
subtypes = sorted(analysis_result["subtypes"], key=lambda x: x["location"]["stt_idx"])
address_parts = []
for subtype in subtypes:
if subtype["label"] in ["LOCATION_COUNTRY", "LOCATION_STATE", "LOCATION_CITY"]:
address_parts.append(subtype["text"])
elif subtype["label"] in ["LOCATION_ZIP"]:
address_parts.append(subtype["text"][:3] + "#" * (len(subtype["text"]) - 3))
else:
address_parts.append(f"""[{subtype["label"]}]""")
return " ".join(address_parts)
entity_processors = {"LOCATION": redact_address, "LOCATION_ADDRESS": redact_address}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of the code above shows the redacted addresses. As you can see, only the first three characters of the postal code and zip code are kept, and street addresses, when present, are replaced with a marker. The last example shows that GPS coordinates are also redacted.
Please deliver this to [LOCATION_ADDRESS_STREET] Galway City Ireland H91#####
[LOCATION_ADDRESS_STREET] huntington west virginia is his birthplace
My favorite city is San Francisco California 941## United States [LOCATION_COORDINATE]
Coreference Resolution
Coreference resolution is the task of identifying different entity mentions in a given text that refer to the same real-world entity. The analyze/text
route supports coreference resolution through the optional relation_detection
field in the request. The relation_detection
field offers a configurable option for coreference resolution:
- coreference_resolution: Specifies the method for identifying coreferential entities:
  - heuristics: Uses rule-based methods
  - model_prediction: Uses machine learning models
  - combined: Uses both approaches
{
"text": [
"Nikola Jokić (Serbian Cyrillic: Никола Јокић, pronounced [nǐkola jôkitɕ] ⓘ; born February 19, 1995) is a Serbian professional basketball player who is a center for the Denver Nuggets of the National Basketball Association (NBA). Jokić was born in the city of Sombor in the northern part of Serbia. He grew up in a cramped two-bedroom apartment that housed him and his two brothers."
],
"entity_detection": {
"accuracy": "high"
},
"locale": "en-US",
"relation_detection": {
"coreference_resolution": "model_prediction"
}
}
[
{
"entities": [
{
"text": "Nikola Jokić",
"location": {
"stt_idx": 0,
"end_idx": 12
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.2300402671098709,
"NAME": 0.9172913134098053,
"NAME_FAMILY": 0.6867769062519073
},
"coreference_id": "NAME_1"
},
{
"text": "Serbian Cyrillic",
"location": {
"stt_idx": 14,
"end_idx": 30
},
"best_label": "LANGUAGE",
"labels": {
"LANGUAGE": 0.94222651720047
},
"coreference_id": "LANGUAGE_1"
},
{
"text": "Никола Јокић",
"location": {
"stt_idx": 32,
"end_idx": 44
},
"best_label": "NAME_GIVEN",
"labels": {
"NAME": 0.842899182013103,
"NAME_GIVEN": 0.6497380946363721,
"NAME_FAMILY": 0.07980045356920787
},
"coreference_id": "NAME_1"
},
{
"text": "nǐkola jôkitɕ",
"location": {
"stt_idx": 58,
"end_idx": 71
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.4644567847251892,
"NAME": 0.8982340276241303,
"NAME_FAMILY": 0.44664961099624634
},
"coreference_id": "NAME_1"
},
{
"text": "February 19, 1995",
"location": {
"stt_idx": 81,
"end_idx": 98
},
"best_label": "DOB",
"labels": {
"DOB": 0.9391335248947144
},
"analysis_result": {
"formatted": "1995-02-19T00:00:00",
"subtypes": [
{
"formatted": "19",
"label": "DAY"
},
{
"formatted": "2",
"label": "MONTH"
},
{
"formatted": "1995",
"label": "YEAR"
}
],
"validation_assertions": []
},
"coreference_id": "DOB_1"
},
{
"text": "Serbian",
"location": {
"stt_idx": 105,
"end_idx": 112
},
"best_label": "ORIGIN",
"labels": {
"ORIGIN": 0.9151841402053833
},
"coreference_id": "ORIGIN_1"
},
{
"text": "professional basketball player",
"location": {
"stt_idx": 113,
"end_idx": 143
},
"best_label": "OCCUPATION",
"labels": {
"OCCUPATION": 0.8843509753545126
},
"coreference_id": "OCCUPATION_1"
},
{
"text": "center",
"location": {
"stt_idx": 153,
"end_idx": 159
},
"best_label": "OCCUPATION",
"labels": {
"OCCUPATION": 0.8316260576248169
},
"coreference_id": "OCCUPATION_2"
},
{
"text": "Denver Nuggets",
"location": {
"stt_idx": 168,
"end_idx": 182
},
"best_label": "ORGANIZATION",
"labels": {
"LOCATION_CITY": 0.48198258876800537,
"ORGANIZATION": 0.9154168367385864,
"LOCATION": 0.4703272879123688
},
"coreference_id": "ORGANIZATION_1"
},
{
"text": "National Basketball Association",
"location": {
"stt_idx": 190,
"end_idx": 221
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.9143192768096924
},
"coreference_id": "ORGANIZATION_2"
},
{
"text": "NBA",
"location": {
"stt_idx": 223,
"end_idx": 226
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8653480410575867
},
"coreference_id": "ORGANIZATION_2"
},
{
"text": "Jokić",
"location": {
"stt_idx": 229,
"end_idx": 234
},
"best_label": "NAME_FAMILY",
"labels": {
"NAME_FAMILY": 0.9177489876747131,
"NAME": 0.9098437031110128
},
"coreference_id": "NAME_1"
},
{
"text": "Sombor",
"location": {
"stt_idx": 259,
"end_idx": 265
},
"best_label": "LOCATION_CITY",
"labels": {
"LOCATION_CITY": 0.925605853398641,
"LOCATION": 0.9114498297373453
},
"coreference_id": "LOCATION_CITY_1"
},
{
"text": "Serbia",
"location": {
"stt_idx": 290,
"end_idx": 296
},
"best_label": "LOCATION_COUNTRY",
"labels": {
"LOCATION_COUNTRY": 0.9711890816688538,
"LOCATION": 0.9073220491409302
},
"coreference_id": "LOCATION_COUNTRY_1"
}
],
"entities_present": true,
"characters_processed": 381,
"languages_detected": {
"en": 0.9837551116943359
}
}
]
The response includes a key element for each entity:
- coreference_id: A unique identifier added to each entity that groups coreferential entities under a common label. This behavior matches the /process/text endpoint when processed_text is set to MARKER and coreference resolution is applied. For example, "Nikola Jokić", "Никола Јокић", "nǐkola jôkitɕ", and "Jokić" all share the same coreference_id ("NAME_1"), indicating that they refer to the same person.
The following example demonstrates how to use the coreference information from the API to replace all mentions of a specific person with a fictive name, while leaving other entities in the text unchanged:
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="http://localhost:8080")
text = [
"Nikola Jokić is a basketball player. LeBron James is also a basketball player. Jokić and James played against each other. Jokić led his team with a triple-double performance. After the game, Nikola praised his teammates for their effort. Many fans consider Nikola Jokić one of the best centers in NBA history."
]
# Create request with coreference resolution enabled
request = {
"text": text,
"locale": "en-US",
"entity_detection": {"accuracy": "high"},
"relation_detection": {"coreference_resolution": "model_prediction"}
}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# Find the coreference_id for "Nikola Jokić"
target_name = "Nikola Jokić"
target_coref_id = next(
(entity["coreference_id"] for entity in resp.entities[0] if entity.get("text") == target_name and "coreference_id" in entity), None
)
if target_coref_id is None:
raise ValueError(f"Could not find coreference_id for {target_name}")
def replace_with_fictive_name(entity, target_coref_id=target_coref_id, fictive_name="John Doe"):
"""Replace all mentions of the target person with a fictive name."""
if "coreference_id" in entity and entity["coreference_id"] == target_coref_id:
return fictive_name
return entity["text"]
entity_processors = {
"NAME": replace_with_fictive_name,
"NAME_GIVEN": replace_with_fictive_name,
"NAME_FAMILY": replace_with_fictive_name
}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=lambda entity: entity["text"])
for example in deidentified_text:
print(example)
The output of running this code is:
John Doe is a basketball player. LeBron James is also a basketball player. John Doe and James played against each other. John Doe led his team with a triple-double performance. After the game, John Doe praised his teammates for their effort. Many fans consider John Doe one of the best centers in NBA history.
This example demonstrates targeted redaction by replacing all mentions of a specific person (Nikola Jokić) in the text, while preserving other names in the document. The coreference information allows consistent redaction across all forms of a name, including different scripts and variants, when using model_prediction
or combined
mode. In heuristics
mode, not all variants may be grouped under the same coreference identifier.
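To inspect the clusters directly, the same response can be grouped by coreference_id. A minimal sketch, reusing resp from the code above:

from collections import defaultdict

# Group the detected entities of the first text by their coreference_id.
clusters = defaultdict(list)
for entity in resp.entities[0]:
    clusters[entity.get("coreference_id", "UNGROUPED")].append(entity["text"])

for coref_id, mentions in clusters.items():
    print(coref_id, mentions)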