Detect, Parse, and Validate Entities in Text
info
To run the example code in this guide, please sign up for a free test API key.
In addition to de-identification and redaction, Private AI also supports entity detection and validation. The analyze/text route described below is an essential tool for exploring and structuring your data, as well as for computing statistics about it. In this guide, we demonstrate how to use the analyze/text endpoint, introduced in 4.1, to return the analysis results of the detected entities, with examples of how these results can be used to meet your own use cases.
Analyze entities in text (new in 4.1)
The analyze/text route returns a list of detected entities along with the formatted text for each entity and a description of its subtypes. In this guide, we provide payloads to Private AI's analyze/text REST API route and document the associated responses.
To better illustrate how this information can be used, we proceed by giving a series of common use cases.
Validation and custom redaction of credit card numbers
Some numerical entities integrate a checksum in their values. This checksum is used to confirm the entity's validity and to minimize the chance of error during transcription. This is the case for credit card numbers, which must satisfy the Luhn algorithm. The analyze/text route runs this check on top of the NER model's detection. This provides an additional safeguard by verifying that the detected number is indeed a valid credit card number. Let's look at three specific examples that include credit card numbers.
{
"text": [
"Okay, hang on just a second because I got to get it. Okay, it is 6578-7790-4346-2237. Expiration. 1224.",
"All right, I'm ready. 800 678-457-7896. Expiration is one. 224.",
"CC_type: Diners Club International RuPay Visa JCB Amex CCN: 30569309025904 4242424242424242 4222222222222 6172873484776530 378282246310005 CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70"
],
"locale": "en-US",
"entity_detection": {
"accuracy": "high",
"entity_types": [
{
"type": "ENABLE",
"value": ["CREDIT_CARD"]
}
]
}
}
[
{
"entities": [
{
"text": "6578-7790-4346-2237",
"location": {
"stt_idx": 65,
"end_idx": 84
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9022786834023215
},
"analysis_result": {
"formatted": "6578 7790 4346 2237",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "invalid"
}
]
}
}
],
"entities_present": true,
"characters_processed": 103,
"languages_detected": {
"en": 0.9202778935432434
}
},
{
"entities": [
{
"text": "800 678-457-7896",
"location": {
"stt_idx": 22,
"end_idx": 40
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9012922777069939
},
"analysis_result": {
"subtypes": [],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 65,
"languages_detected": {
"en": 0.8164065480232239
}
},
{
"entities": [
{
"text": "30569309025904",
"location": {
"stt_idx": 60,
"end_idx": 74
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "3056 9309 025 904",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
},
{
"text": "4242424242424242",
"location": {
"stt_idx": 75,
"end_idx": 91
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "4242 4242 4242 4242",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
},
{
"text": "4222222222222",
"location": {
"stt_idx": 92,
"end_idx": 105
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "4222 222 222 222",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
},
{
"text": "6172873484776530",
"location": {
"stt_idx": 106,
"end_idx": 122
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9088553956576756
},
"analysis_result": {
"formatted": "6172 8734 8477 6530",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "invalid"
}
]
}
},
{
"text": "378282246310005",
"location": {
"stt_idx": 123,
"end_idx": 138
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "3782 8224 6310 005",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
}
],
"entities_present": true,
"characters_processed": 208,
"languages_detected": {
"en": 0.24319741129875183
}
}
]
The above request contains two fields, text and entity_detection, that are shared by the analyze/text, ner/text, and process/text routes. The text field contains the text to analyze and the entity_detection field contains the NER configuration (e.g., the list of entities to detect). One last field in the request, locale, is unique to the analyze/text request. The locale field is used as a hint to the analyzer to help parse dates and other locale-dependent entities. For example, setting locale to en-US will force the analyzer to interpret the date 12-10-2020 as December 10, 2020 instead of October 12, 2020. Several examples of values these fields can take are provided below.
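As a minimal sketch of the request shape described above, the following helper assembles a body with the text, locale, and entity_detection fields. The function name build_analyze_payload and its parameters are illustrative assumptions, not part of the Private AI client; only the field names are taken from the request shown earlier.

```python
import json


def build_analyze_payload(texts, entity_types=None, locale="en-US", accuracy=None):
    """Assemble a request body with the text, locale and entity_detection fields.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    entity_detection = {}
    if accuracy is not None:
        entity_detection["accuracy"] = accuracy
    if entity_types:
        # Same shape as the request above: an ENABLE selector listing entity types.
        entity_detection["entity_types"] = [
            {"type": "ENABLE", "value": list(entity_types)}
        ]

    payload = {"text": list(texts), "locale": locale}
    if entity_detection:
        payload["entity_detection"] = entity_detection
    return payload


payload = build_analyze_payload(
    ["Okay, it is 6578-7790-4346-2237. Expiration. 1224."],
    entity_types=["CREDIT_CARD"],
    accuracy="high",
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be serialized and sent with any HTTP client. The URL and authentication details depend on your deployment and are deliberately left out of this sketch.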
The full response above is a mouthful, so let's look at the first example's response in more detail.
{
"entities": [
{
"text": "6578-7790-4346-2237",
"location": {
"stt_idx": 65,
"end_idx": 84
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9022786834023215
},
"analysis_result": {
"formatted": "6578 7790 4346 2237",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "invalid"
}
]
}
}
],
"entities_present": true,
"characters_processed": 103,
"languages_detected": {
"en": 0.9202778935432434
}
}
The response contains three main parts:
- The entity information, including its text and its location. These fields are shared with other routes, including the ner/text and process/text routes, and serve the same purpose.
- The formatted text of the entity. This field is unique to the analyze/text route and provides a "standard" format for the entity. This can facilitate the introduction of post-processing logic on detected entities. The formats are described in the following table.
Entity Type | Format | Example |
---|---|---|
CREDIT_CARD | space-separated groups of 3 to 5 digits | 6578 7790 4346 2237 |
DATE | ISO-8601 | 2025-03-20T18:00:00+00:00 |
DOB | ISO-8601 | 2025-03-20 |
AGE | decimal numeral | 12 |
All other entity types | no formatting | - |
- A list of validation assertions on the entity, which is also unique to the analyze/text route. It contains a list of objects that are specific to the entity being detected. In this example, the provider is the Luhn algorithm that was run on the credit card number, and the result of the algorithm is provided in the status field. Currently, only credit card numbers contain validation assertions, but more assertion providers will be added in the future.
The analysis result of this first example can be summed up as follows. The credit card number was successfully parsed and the parsed result is placed in the formatted field. However, although the number matches the credit card number format, it failed the Luhn check, so it is not a valid credit card number. This could be the result of a transcription error, for example.
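For readers who want to see why this number fails, here is a short standalone sketch of the Luhn check. It is a textbook implementation written for illustration, not the code the analyze/text route runs; the function name luhn_is_valid is an assumption.

```python
def luhn_is_valid(number: str) -> bool:
    """Return True if the digits in `number` satisfy the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False

    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


# Two of the formatted values from the response above.
for candidate in ["6578 7790 4346 2237", "4242 4242 4242 4242"]:
    print(candidate, luhn_is_valid(candidate))
```

For 6578 7790 4346 2237 the weighted digit sum is 86, which is not a multiple of 10, so the check fails, in line with the "invalid" status in the response.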
The information included in the analysis result allows the creation of custom redaction of entities, using the post-processing framework, as shown in this section.
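As one hedged illustration of such custom redaction, the sketch below masks only the card numbers whose Luhn assertion is "valid" and keeps their last four digits. It is plain Python over the response dictionaries and does not use the post-processing framework; the function name mask_valid_cards, the "#" mask character, and the sample text are assumptions.

```python
def mask_valid_cards(text, entities, keep_last=4):
    """Mask Luhn-valid card numbers in `text`, keeping the last `keep_last` digits."""
    # Replace from the end of the text so earlier offsets stay correct.
    ordered = sorted(entities, key=lambda e: e["location"]["stt_idx"], reverse=True)
    for entity in ordered:
        if entity["best_label"] != "CREDIT_CARD":
            continue
        assertions = entity.get("analysis_result", {}).get("validation_assertions", [])
        if not any(a["provider"] == "luhn" and a["status"] == "valid" for a in assertions):
            continue

        start = entity["location"]["stt_idx"]
        end = entity["location"]["end_idx"]
        original = text[start:end]
        total_digits = sum(ch.isdigit() for ch in original)
        seen = 0
        masked = []
        for ch in original:
            if ch.isdigit():
                seen += 1
                masked.append(ch if seen > total_digits - keep_last else "#")
            else:
                masked.append(ch)  # keep separators such as spaces or dashes
        text = text[:start] + "".join(masked) + text[end:]
    return text


# Hypothetical sample built from two numbers in the third example above.
sample_text = "CCN: 4242424242424242 6172873484776530"
sample_entities = [
    {
        "best_label": "CREDIT_CARD",
        "location": {"stt_idx": 5, "end_idx": 21},
        "analysis_result": {"validation_assertions": [{"provider": "luhn", "status": "valid"}]},
    },
    {
        "best_label": "CREDIT_CARD",
        "location": {"stt_idx": 22, "end_idx": 38},
        "analysis_result": {"validation_assertions": [{"provider": "luhn", "status": "invalid"}]},
    },
]
print(mask_valid_cards(sample_text, sample_entities))
```

Whether numbers that fail the check should be left untouched, as here, or redacted anyway is a policy decision that depends on your use case.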
Date shifting and custom redaction of dates
Dates are one type of PII that is encountered in almost every dataset. Redaction is one way to ensure that sensitive dates do not create privacy issues. However, fully redacting dates often reduces the utility of the redacted data. For dates, it is often preferable to use other obfuscation methods that preserve their utility. Two well-known techniques are date shifting and date bucketing. Let's consider three examples containing dates.
{
"text": [
"$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (July 10 2018) https://t.co/eit53RUY4g",
"Short sale volume (not short interest) for $KBE on 2018-07-09 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%",
"$WLTW high OI range is 160 to 155 for option expiration 07/20/2018 #options https://t.co/BnVElKBKkJ"
],
"locale": "en-US",
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["DATE", "DOB", "DAY", "MONTH", "YEAR"]
}
]
}
}
[
{
"entities": [
{
"text": "July 10 2018",
"location": {
"stt_idx": 89,
"end_idx": 101
},
"best_label": "DATE",
"labels": {
"DATE": 0.9400081038475037,
"MONTH": 0.3111259341239929,
"DAY": 0.31207050879796344,
"YEAR": 0.29245950778325397
},
"analysis_result": {
"formatted": "2018-07-10T00:00:00",
"subtypes": [
{
"text": "10",
"formatted": "10",
"label": "DAY",
"location": {
"stt_idx": 94,
"end_idx": 96
}
},
{
"text": "July",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 89,
"end_idx": 93
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 97,
"end_idx": 101
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 126,
"languages_detected": {
"en": 0.6427053809165955
}
},
{
"entities": [
{
"text": "2018-07-09",
"location": {
"stt_idx": 51,
"end_idx": 61
},
"best_label": "DATE",
"labels": {
"DATE": 0.9267139077186585,
"YEAR": 0.17909334897994994,
"MONTH": 0.18299812078475952,
"DAY": 0.18503443002700806
},
"analysis_result": {
"formatted": "2018-07-09T00:00:00",
"subtypes": [
{
"text": "09",
"formatted": "9",
"label": "DAY",
"location": {
"stt_idx": 59,
"end_idx": 61
}
},
{
"text": "07",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 56,
"end_idx": 58
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 51,
"end_idx": 55
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 132,
"languages_detected": {
"en": 0.5451536178588867
}
},
{
"entities": [
{
"text": "07/20/2018",
"location": {
"stt_idx": 56,
"end_idx": 66
},
"best_label": "DATE",
"labels": {
"DATE": 0.9359936833381652,
"MONTH": 0.18900736570358276,
"DAY": 0.18550281524658202,
"YEAR": 0.18460171222686766
},
"analysis_result": {
"formatted": "2018-07-20T00:00:00",
"subtypes": [
{
"text": "20",
"formatted": "20",
"label": "DAY",
"location": {
"stt_idx": 59,
"end_idx": 61
}
},
{
"text": "07",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 56,
"end_idx": 58
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 62,
"end_idx": 66
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 99,
"languages_detected": {
"en": 0.7047932744026184
}
}
]
Let's look at one specific date entity in the above response.
{
"text": "July 10 2018",
"location": {
"stt_idx": 89,
"end_idx": 101
},
"best_label": "DATE",
"labels": {
"DATE": 0.9400081038475037,
"MONTH": 0.3111259341239929,
"DAY": 0.31207050879796344,
"YEAR": 0.29245950778325397
},
"analysis_result": {
"formatted": "2018-07-10T00:00:00",
"subtypes": [
{
"text": "10",
"formatted": "10",
"label": "DAY",
"location": {
"stt_idx": 94,
"end_idx": 96
}
},
{
"text": "July",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 89,
"end_idx": 93
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 97,
"end_idx": 101
}
}
],
"validation_assertions": []
}
}
Many pieces of information are accessible from the analysis_result object. First, the formatted date "2018-07-10T00:00:00" is available in the analysis_result.formatted field. If you plan to implement logic on the dates found in the text, it might be easier to work with the formatted dates than with the original, non-standard date formats (e.g., "July 10 2018").
Also, the day, month, and year of the date entity are directly accessible via analysis_result.subtypes. This information can be used to partially redact or to bucketize dates.
An example of redacting the day and month but keeping the year is provided in the custom redaction of dates guide.
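The two ideas above, date shifting on the formatted value and keeping only the year from the subtypes, can be sketched in a few lines of standard-library Python. The helper names shift_date and keep_year_only, the 30-day offset, and the shortened sample text are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def shift_date(formatted: str, days: int) -> str:
    """Shift an ISO-8601 `formatted` value by a fixed number of days."""
    shifted = datetime.fromisoformat(formatted) + timedelta(days=days)
    return shifted.isoformat()


def keep_year_only(text: str, entity: dict) -> str:
    """Replace a date mention in `text` with the year from its YEAR subtype."""
    subtypes = entity["analysis_result"]["subtypes"]
    year = next((s["formatted"] for s in subtypes if s["label"] == "YEAR"), None)
    if year is None:
        return text  # nothing to keep, leave the text unchanged
    start = entity["location"]["stt_idx"]
    end = entity["location"]["end_idx"]
    return text[:start] + year + text[end:]


# Entity trimmed from the response above; offsets adapted to the short sample text.
entity = {
    "text": "July 10 2018",
    "location": {"stt_idx": 4, "end_idx": 16},
    "analysis_result": {
        "formatted": "2018-07-10T00:00:00",
        "subtypes": [
            {"text": "10", "formatted": "10", "label": "DAY"},
            {"text": "July", "formatted": "7", "label": "MONTH"},
            {"text": "2018", "formatted": "2018", "label": "YEAR"},
        ],
    },
}
print(shift_date(entity["analysis_result"]["formatted"], 30))
print(keep_year_only("Due July 10 2018.", entity))
```

In practice a date-shifting offset is usually chosen once per record or per individual so that intervals between dates are preserved.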
Age bucketing and custom redaction of numbers
Similar to dates, it is possible to analyze ages and other numerical entities to create custom redaction. Consider these two examples.
{
"text": [
"A 32-year old Black female German citizen living in Germany wants to travel to the United States for leisure.",
"West Point Public School division provides school-based preschool services for children from two through nine years of age who are children at risk and children with identified disabilities or delays."
],
"link_batch": false,
"locale": "en-US",
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["AGE"]
}
]
}
}
[
{
"entities": [
{
"text": "32",
"location": {
"stt_idx": 2,
"end_idx": 4
},
"best_label": "AGE",
"labels": {
"AGE": 0.9668179750442505
},
"analysis_result": {
"formatted": 32,
"subtypes": [],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 109,
"languages_detected": {
"en": 0.9611877202987671
}
},
{
"entities": [
{
"text": "two",
"location": {
"stt_idx": 93,
"end_idx": 96
},
"best_label": "AGE",
"labels": {
"AGE": 0.9462096095085144
},
"analysis_result": {
"formatted": 2,
"subtypes": [],
"validation_assertions": []
}
},
{
"text": "nine",
"location": {
"stt_idx": 105,
"end_idx": 109
},
"best_label": "AGE",
"labels": {
"AGE": 0.9411536455154419
},
"analysis_result": {
"formatted": 9,
"subtypes": [],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 200,
"languages_detected": {
"en": 0.9786704778671265
}
}
]
Using the Private AI Python client, one can use the above analyze/text response to bucketize ages, as shown here.
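Independently of the client, the bucketing step itself can be sketched in plain Python over the response dictionaries. The helper names bucket_age and bucketize_ages, the ten-year bucket width, and the shortened sample sentence are assumptions.

```python
def bucket_age(age: int, width: int = 10) -> str:
    """Map an age to a range label, e.g. 32 -> "30-39" for a width of 10."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def bucketize_ages(text, entities, width=10):
    """Replace each AGE mention in `text` with its bucket label."""
    # Replace from the end of the text so earlier offsets stay correct.
    ordered = sorted(entities, key=lambda e: e["location"]["stt_idx"], reverse=True)
    for entity in ordered:
        if entity["best_label"] != "AGE":
            continue
        formatted = entity.get("analysis_result", {}).get("formatted")
        if formatted is None:
            continue  # the age could not be parsed, leave it as detected
        start = entity["location"]["stt_idx"]
        end = entity["location"]["end_idx"]
        text = text[:start] + bucket_age(int(formatted), width) + text[end:]
    return text


# Entity trimmed from the first response above, applied to a shortened sentence.
age_entities = [
    {
        "text": "32",
        "location": {"stt_idx": 2, "end_idx": 4},
        "best_label": "AGE",
        "analysis_result": {"formatted": 32, "subtypes": [], "validation_assertions": []},
    }
]
print(bucketize_ages("A 32-year old traveller.", age_entities))
```

Working from the numeric formatted value means spelled-out ages such as "two" and "nine" are handled the same way as digits.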
Custom redaction of addresses
The GDPR and other privacy laws impose strict requirements regarding the redaction of addresses. In the following scenario, we demonstrate how to partially redact an address by leaving only the less sensitive characters of a zip/postal code and removing all other address information (e.g., civic number, street name, and so on).
{
"text": [
"Please deliver this to 45, Clybaun Heights, Galway City, Ireland H91 AKK3",
"3255 M-A-D-D-A-M-S street, huntington, west virginia is his birthplace",
"My favorite city is San Francisco, California 94110, United States, 37.7749° N, 122.4194° W"
],
"locale": "en-US"
}
[
{
"entities": [
{
"text": "45, Clybaun Heights, Galway City, Ireland H91 AKK3",
"location": {
"stt_idx": 23,
"end_idx": 73
},
"best_label": "LOCATION_ADDRESS",
"labels": {
"LOCATION_ADDRESS_STREET": 0.3171827793121338,
"LOCATION": 0.9123516889179454,
"LOCATION_ADDRESS": 0.9221759648884044,
"LOCATION_CITY": 0.16148114204406738,
"LOCATION_COUNTRY": 0.05482322678846471,
"LOCATION_ZIP": 0.26978740271400004
},
"analysis_result": {
"subtypes": [
{
"text": "45, Clybaun Heights",
"label": "LOCATION_ADDRESS_STREET",
"location": {
"stt_idx": 23,
"end_idx": 42
}
},
{
"text": "Galway City",
"label": "LOCATION_CITY",
"location": {
"stt_idx": 44,
"end_idx": 55
}
},
{
"text": "Ireland",
"label": "LOCATION_COUNTRY",
"location": {
"stt_idx": 57,
"end_idx": 64
}
},
{
"text": "H91 AKK3",
"label": "LOCATION_ZIP",
"location": {
"stt_idx": 65,
"end_idx": 73
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 73,
"languages_detected": {
"en": 0.8342836499214172
}
},
{
"entities": [
{
"text": "3255 M-A-D-D-A-M-S street, huntington, west virginia",
"location": {
"stt_idx": 0,
"end_idx": 52
},
"best_label": "LOCATION_ADDRESS",
"labels": {
"LOCATION_ADDRESS_STREET": 0.6232224106788635,
"LOCATION_ADDRESS": 0.9109632035960322,
"LOCATION": 0.8909260371456975,
"LOCATION_CITY": 0.07817105106685472,
"LOCATION_STATE": 0.12203486328539641
},
"analysis_result": {
"subtypes": [
{
"text": "3255 M-A-D-D-A-M-S street",
"label": "LOCATION_ADDRESS_STREET",
"location": {
"stt_idx": 0,
"end_idx": 25
}
},
{
"text": "huntington",
"label": "LOCATION_CITY",
"location": {
"stt_idx": 27,
"end_idx": 37
}
},
{
"text": "west virginia",
"label": "LOCATION_STATE",
"location": {
"stt_idx": 39,
"end_idx": 52
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 70,
"languages_detected": {
"en": 0.9467829465866089
}
},
{
"entities": [
{
"text": "San Francisco, California 94110, United States, 37.7749\u00b0 N, 122.4194\u00b0 W",
"location": {
"stt_idx": 20,
"end_idx": 91
},
"best_label": "LOCATION",
"labels": {
"LOCATION_CITY": 0.080466923614343,
"LOCATION": 0.8993716637293497,
"LOCATION_ADDRESS": 0.200799106930693,
"LOCATION_STATE": 0.03926792989174525,
"LOCATION_ZIP": 0.12127648045619328,
"LOCATION_COUNTRY": 0.07723071426153183,
"LOCATION_COORDINATE": 0.4833615819613139
},
"analysis_result": {
"subtypes": [
{
"text": "San Francisco",
"label": "LOCATION_CITY",
"location": {
"stt_idx": 20,
"end_idx": 33
}
},
{
"text": "California",
"label": "LOCATION_STATE",
"location": {
"stt_idx": 35,
"end_idx": 45
}
},
{
"text": "94110",
"label": "LOCATION_ZIP",
"location": {
"stt_idx": 46,
"end_idx": 51
}
},
{
"text": "United States",
"label": "LOCATION_COUNTRY",
"location": {
"stt_idx": 53,
"end_idx": 66
}
},
{
"text": "37.7749\u00b0 N, 122.4194\u00b0 W",
"label": "LOCATION_COORDINATE",
"location": {
"stt_idx": 68,
"end_idx": 91
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 91,
"languages_detected": {
"en": 0.7658711075782776
}
}
]
The above request contains three texts with addresses. The corresponding analyze/text response contains the result of the analysis. This response, along with the corresponding PAI client post-processing code, can be used to mask street addresses in order to hide the most sensitive information.
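To make the idea concrete without the client, here is a minimal sketch that walks the subtypes of one address entity, keeps the first characters of the zip/postal code, and replaces every other part with a marker. The function name mask_address, the three-character prefix, and the bracketed markers are assumptions; the text and offsets are copied from the first example above.

```python
def mask_address(text, entity, keep_zip_chars=3):
    """Keep a prefix of the zip/postal code and replace other address parts with markers."""
    subtypes = entity["analysis_result"]["subtypes"]
    # Replace from the end of the text so earlier offsets stay correct.
    for sub in sorted(subtypes, key=lambda s: s["location"]["stt_idx"], reverse=True):
        start = sub["location"]["stt_idx"]
        end = sub["location"]["end_idx"]
        if sub["label"] == "LOCATION_ZIP":
            original = text[start:end]
            replacement = original[:keep_zip_chars] + "*" * (len(original) - keep_zip_chars)
        else:
            replacement = f"[{sub['label']}]"
        text = text[:start] + replacement + text[end:]
    return text


address_text = "Please deliver this to 45, Clybaun Heights, Galway City, Ireland H91 AKK3"
address_entity = {
    "analysis_result": {
        "subtypes": [
            {"label": "LOCATION_ADDRESS_STREET", "location": {"stt_idx": 23, "end_idx": 42}},
            {"label": "LOCATION_CITY", "location": {"stt_idx": 44, "end_idx": 55}},
            {"label": "LOCATION_COUNTRY", "location": {"stt_idx": 57, "end_idx": 64}},
            {"label": "LOCATION_ZIP", "location": {"stt_idx": 65, "end_idx": 73}},
        ]
    }
}
print(mask_address(address_text, address_entity))
```

How many characters of a postal code are safe to keep varies by country and population density, so the prefix length should be treated as a parameter rather than a fixed rule.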
Relation detection
Relation detection refers to the broader natural language processing (NLP) capability of understanding how entities in a text are connected. While entity recognition tells us what the entities are (e.g., a person's name, a company, a location), relation detection tells us how those entities are related. Relation detection covers tasks like coreference resolution and relation extraction, both of which are supported, and together provide a deeper understanding of unstructured text.
The analyze/text route can be used to configure relation detection by using the optional relation_detection field in the request.
Coreference Resolution
Coreference resolution is the task of identifying different entity mentions in a given text that refer to the same real-world entity. The relation_detection field offers a configurable option for coreference resolution:
- coreference_resolution: Specifies the method for identifying coreferential entities:
  - heuristics: Uses rule-based methods
  - model_prediction: Uses machine learning models
  - combined: Uses both approaches
{
"text": [
"Nikola Jokić (Serbian Cyrillic: Никола Јокић, pronounced [nǐkola jôkitɕ] ⓘ; born February 19, 1995) is a Serbian professional basketball player who is a center for the Denver Nuggets of the National Basketball Association (NBA). Jokić was born in the city of Sombor in the northern part of Serbia. He grew up in a cramped two-bedroom apartment that housed him and his two brothers."
],
"entity_detection": {
"accuracy": "high"
},
"locale": "en-US",
"relation_detection": {
"coreference_resolution": "model_prediction"
}
}
[
{
"entities": [
{
"text": "Nikola Jokić",
"location": {
"stt_idx": 0,
"end_idx": 12
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.2300402671098709,
"NAME": 0.9172913134098053,
"NAME_FAMILY": 0.6867769062519073
},
"coreference_id": "56c15276-33da-4726-bc81-369074049222"
},
{
"text": "Serbian Cyrillic",
"location": {
"stt_idx": 14,
"end_idx": 30
},
"best_label": "LANGUAGE",
"labels": {
"LANGUAGE": 0.94222651720047
},
"coreference_id": "0d6296d4-c453-4c73-9415-5abc527a38e5"
},
{
"text": "Никола Јокић",
"location": {
"stt_idx": 32,
"end_idx": 44
},
"best_label": "NAME_GIVEN",
"labels": {
"NAME": 0.842899182013103,
"NAME_GIVEN": 0.6497380946363721,
"NAME_FAMILY": 0.07980045356920787
},
"coreference_id": "56c15276-33da-4726-bc81-369074049222"
},
{
"text": "nǐkola jôkitɕ",
"location": {
"stt_idx": 58,
"end_idx": 71
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.4644567847251892,
"NAME": 0.8982340276241303,
"NAME_FAMILY": 0.44664961099624634
},
"coreference_id": "56c15276-33da-4726-bc81-369074049222"
},
{
"text": "February 19, 1995",
"location": {
"stt_idx": 81,
"end_idx": 98
},
"best_label": "DOB",
"labels": {
"DOB": 0.9391335248947144
},
"analysis_result": {
"formatted": "1995-02-19T00:00:00",
"subtypes": [
{
"formatted": "19",
"label": "DAY"
},
{
"formatted": "2",
"label": "MONTH"
},
{
"formatted": "1995",
"label": "YEAR"
}
],
"validation_assertions": []
},
"coreference_id": "65e20278-31c8-4cfb-ad73-1e24db5fcd8e"
},
{
"text": "Serbian",
"location": {
"stt_idx": 105,
"end_idx": 112
},
"best_label": "ORIGIN",
"labels": {
"ORIGIN": 0.9151841402053833
},
"coreference_id": "43af91fe-7868-4469-a70b-c22cfcd917e2"
},
{
"text": "professional basketball player",
"location": {
"stt_idx": 113,
"end_idx": 143
},
"best_label": "OCCUPATION",
"labels": {
"OCCUPATION": 0.8843509753545126
},
"coreference_id": "e8fc3654-89ab-4cec-806a-ec97f13a9673"
},
{
"text": "center",
"location": {
"stt_idx": 153,
"end_idx": 159
},
"best_label": "OCCUPATION",
"labels": {
"OCCUPATION": 0.8316260576248169
},
"coreference_id": "8ca28112-9a34-492e-8b1f-4b9fc72c0b1f"
},
{
"text": "Denver Nuggets",
"location": {
"stt_idx": 168,
"end_idx": 182
},
"best_label": "ORGANIZATION",
"labels": {
"LOCATION_CITY": 0.48198258876800537,
"ORGANIZATION": 0.9154168367385864,
"LOCATION": 0.4703272879123688
},
"coreference_id": "de750d67-78eb-4606-8aec-6c2f697e9c50"
},
{
"text": "National Basketball Association",
"location": {
"stt_idx": 190,
"end_idx": 221
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.9143192768096924
},
"coreference_id": "b907f506-1492-40b2-915a-1c472fc1efe8"
},
{
"text": "NBA",
"location": {
"stt_idx": 223,
"end_idx": 226
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8653480410575867
},
"coreference_id": "b907f506-1492-40b2-915a-1c472fc1efe8"
},
{
"text": "Jokić",
"location": {
"stt_idx": 229,
"end_idx": 234
},
"best_label": "NAME_FAMILY",
"labels": {
"NAME_FAMILY": 0.9177489876747131,
"NAME": 0.9098437031110128
},
"coreference_id": "56c15276-33da-4726-bc81-369074049222"
},
{
"text": "Sombor",
"location": {
"stt_idx": 259,
"end_idx": 265
},
"best_label": "LOCATION_CITY",
"labels": {
"LOCATION_CITY": 0.925605853398641,
"LOCATION": 0.9114498297373453
},
"coreference_id": "4b204f6e-a2ed-4c45-a268-d5a20d477478"
},
{
"text": "Serbia",
"location": {
"stt_idx": 290,
"end_idx": 296
},
"best_label": "LOCATION_COUNTRY",
"labels": {
"LOCATION_COUNTRY": 0.9711890816688538,
"LOCATION": 0.9073220491409302
},
"coreference_id": "ef54dfa2-5bae-4081-8b9d-2dc0ddf8868c"
}
],
"entities_present": true,
"characters_processed": 381,
"languages_detected": {
"en": 0.9837551116943359
}
}
]
The response includes a key element for each entity:
- coreference_id: A unique identifier added to each entity that groups coreferential entities under a common label. This behavior matches the /process/text endpoint when processed_text is set to MARKER and coreference resolution is applied. For example, "Nikola Jokić", "Никола Јокић", "nǐkola jôkitɕ", and "Jokić" all share the same coreference_id ("56c15276-33da-4726-bc81-369074049222"), indicating that they refer to the same person.
For an example of how to use the coreference information from the API to replace all mentions of a person with their initials, see the custom redaction of coreferenced names example.
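A first step in any such post-processing is to group the mentions by coreference_id. The sketch below does this with the standard library only; the function name group_mentions is an assumption, and the entities are a trimmed subset of the response above.

```python
from collections import defaultdict


def group_mentions(entities):
    """Group entity texts by their coreference_id, preserving order of appearance."""
    clusters = defaultdict(list)
    for entity in entities:
        clusters[entity["coreference_id"]].append(entity["text"])
    return dict(clusters)


# Trimmed subset of the entities in the response above.
coref_entities = [
    {"text": "Nikola Jokić", "coreference_id": "56c15276-33da-4726-bc81-369074049222"},
    {"text": "Никола Јокић", "coreference_id": "56c15276-33da-4726-bc81-369074049222"},
    {"text": "National Basketball Association", "coreference_id": "b907f506-1492-40b2-915a-1c472fc1efe8"},
    {"text": "NBA", "coreference_id": "b907f506-1492-40b2-915a-1c472fc1efe8"},
    {"text": "Jokić", "coreference_id": "56c15276-33da-4726-bc81-369074049222"},
]
for cluster_id, mentions in group_mentions(coref_entities).items():
    print(cluster_id, mentions)
```

Each cluster can then be given a single replacement, such as a set of initials or a consistent pseudonym, that is applied to every mention in it.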
Relation Extraction
Relation extraction is the task of identifying meaningful relations between entities in text, such as person-to-person or person-to-location links. It helps unlock document-level understanding by connecting pieces of information and making it easier to de-identify related data.
Let's look at an example:
Nessa Jonsson was born and raised in Sweden. Her father, Erik, emigrated to the United States when she was a baby. He died in 1980.
In the text above, there are five entities:
- Nessa Jonsson (NAME)
- Sweden (LOCATION_COUNTRY)
- Erik (NAME_GIVEN)
- the United States (LOCATION_COUNTRY)
- 1980 (DATE_INTERVAL)
Relation extraction can be used to identify the semantic connections between those entities, such as:
- Nessa Jonsson was born in Sweden
- Nessa Jonsson is the daughter of Erik
- Erik is the father of Nessa Jonsson
- Erik lived in the United States
- Erik died in 1980
Relation extraction plays a key role in document understanding by uncovering how entities are connected, which enables systems to move toward structured, contextualized information. In domains like healthcare and finance, this unlocks the potential of unstructured text by identifying relationships like family connections, places of origin, or dates of birth.
Private AI and Relation Extraction (Beta)
Private AI's de-identification service offers the ability to use relation extraction on its analyze/text endpoint. Relation extraction is currently implemented on top of both the named entity recognition (NER) and the coreference resolution models. It is, therefore, limited to predicting relations between clusters of coreferenced entities.
Currently, the system supports a single generalized relation type, RELATED_TO, which is used to capture all of the supported semantic relations between a person and another entity:
- Kinship - a relation between two NAMEs (or other variants, e.g. NAME_GIVEN) indicating family or close personal relationships between individuals. These may include parent-child, siblings, spouses, etc. A kinship relation is always bi-directional.
- Place of birth - a relation between NAME and LOCATION entities, indicating the location where the person was born. This can refer to a city, state, country, or region.
- Citizenship - a relation between NAME and LOCATION or ORIGIN entities, indicating nationality or legal citizenship of the person.
- Origin - a relation between NAME and ORIGIN entities, indicating the country a person originally comes from, reflecting ancestry or cultural background rather than legal status or birthplace.
- Date of birth - a relation between NAME and DOB entities, indicating birthdate.
- Date of death - a relation between NAME and DATE or DATE_INTERVAL entities, indicating the date of death of a person.
For the example above, the system will extract the following relations:
- Nessa Jonsson → RELATED_TO → Sweden
- Nessa Jonsson → RELATED_TO → Erik
- Erik → RELATED_TO → Nessa Jonsson
The relation extraction feature can be enabled as part of the analyze/text endpoint by setting the field enable_relation_extraction to true.
- enable_relation_extraction: Controls whether relation extraction is performed during analysis.
  - true: Enables relation extraction
  - false (default): Disables relation extraction
relation extraction and coreference resolution
Relation extraction relies on coreference resolution to group mentions of people in the text. Make sure a non-null value is set for coreference_resolution before setting enable_relation_extraction to true.
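One way to avoid sending an inconsistent configuration is to check this dependency on the client side before building the request. The following is a sketch under stated assumptions: the helper build_relation_detection and its error messages are hypothetical, while the field names and the three method values come from this guide.

```python
COREFERENCE_METHODS = {"heuristics", "model_prediction", "combined"}


def build_relation_detection(coreference_resolution=None, enable_relation_extraction=False):
    """Build the relation_detection object, enforcing the dependency described above."""
    if coreference_resolution is not None and coreference_resolution not in COREFERENCE_METHODS:
        raise ValueError(f"unknown coreference method: {coreference_resolution!r}")
    if enable_relation_extraction and coreference_resolution is None:
        raise ValueError("relation extraction requires coreference_resolution to be set")

    config = {}
    if coreference_resolution is not None:
        config["coreference_resolution"] = coreference_resolution
    if enable_relation_extraction:
        config["enable_relation_extraction"] = True
    return config


print(build_relation_detection("model_prediction", enable_relation_extraction=True))
```

How the service itself responds to a request that enables relation extraction without coreference resolution is not covered here, so this check is only a local safeguard.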
Here is an example of how to enable relation extraction in your request using the analyze/text endpoint. Notice the enable_relation_extraction field within the relation_detection object.
{
"text": [
"Nessa Jonsson was born on March 17, 1995 in Sweden and currently resides there. Her sister, Erika, has a history of hypertension."
],
"entity_detection": {
"accuracy": "high"
},
"locale": "en-US",
"relation_detection": {
"coreference_resolution": "model_prediction",
"enable_relation_extraction": true
}
}
[
{
"entities": [
{
"text": "Nessa Jonsson",
"location": {
"stt_idx": 0,
"end_idx": 13
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.36283198595046995,
"NAME": 0.9023965716361999,
"NAME_FAMILY": 0.5529729723930359
},
"coreference_id": "7e688543-9aff-4b9a-a386-5277d2ee8954",
"relations": [
{
"coreference_id": "60108305-97e7-4019-9d02-0dc0549b27ea",
"label": "RELATED_TO"
},
{
"coreference_id": "c83c10bb-609c-4f5f-b2a1-9ba6ac367614",
"label": "RELATED_TO"
},
{
"coreference_id": "add43460-23ff-4100-b529-b17f8b9a71f4",
"label": "RELATED_TO"
}
]
},
{
"text": "March 17, 1995",
"location": {
"stt_idx": 26,
"end_idx": 40
},
"best_label": "DOB",
"labels": {
"DOB": 0.9576807171106339
},
"analysis_result": {
"formatted": "1995-03-17T00:00:00",
"subtypes": [
{
"formatted": "17",
"label": "DAY"
},
{
"formatted": "3",
"label": "MONTH"
},
{
"formatted": "1995",
"label": "YEAR"
}
],
"validation_assertions": []
},
"coreference_id": "add43460-23ff-4100-b529-b17f8b9a71f4",
"relations": []
},
{
"text": "Sweden",
"location": {
"stt_idx": 44,
"end_idx": 50
},
"best_label": "LOCATION_COUNTRY",
"labels": {
"LOCATION_COUNTRY": 0.9481291770935059,
"LOCATION": 0.9080604314804077
},
"coreference_id": "60108305-97e7-4019-9d02-0dc0549b27ea",
"relations": []
},
{
"text": "Erika",
"location": {
"stt_idx": 92,
"end_idx": 97
},
"best_label": "NAME_GIVEN",
"labels": {
"NAME_GIVEN": 0.9050805866718292,
"NAME": 0.8937010765075684
},
"coreference_id": "c83c10bb-609c-4f5f-b2a1-9ba6ac367614",
"relations": [
{
"coreference_id": "7e688543-9aff-4b9a-a386-5277d2ee8954",
"label": "RELATED_TO"
}
]
},
{
"text": "hypertension",
"location": {
"stt_idx": 116,
"end_idx": 128
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 0.9360405206680298
},
"coreference_id": "676cbf92-3eac-46dd-ac39-96d2368c09da",
"relations": []
}
],
"entities_present": true,
"characters_processed": 129,
"languages_detected": {
"en": 0.9928773641586304
}
}
]
With relation extraction enabled, each entity in the response now contains an additional field capturing its relations:
- relations: A list of extracted relations involving the entity. Each relation object includes:
  - coreference_id: The ID of the related entity, taken from the coreference_id field of another entity in the response.
  - label: The type of relation detected. Currently, only one relation is supported, the generic RELATED_TO relation.
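Because each relation points to a coreference_id rather than to text, reading the relations usually involves a lookup step. The sketch below resolves each relation to the first mention of the target cluster; the function name resolve_relations and the short ids ("c1", "c2", "c3") are assumptions that stand in for the UUIDs of a real response.

```python
def resolve_relations(entities):
    """Turn relations into (source text, label, target text) triples."""
    mentions_by_id = {}
    for entity in entities:
        mentions_by_id.setdefault(entity["coreference_id"], []).append(entity["text"])

    triples = []
    for entity in entities:
        for relation in entity.get("relations", []):
            targets = mentions_by_id.get(relation["coreference_id"], [])
            # Use the first mention of the target cluster as its display text.
            target_text = targets[0] if targets else None
            triples.append((entity["text"], relation["label"], target_text))
    return triples


# Trimmed entities modelled on the response above, with shortened hypothetical ids.
relation_entities = [
    {
        "text": "Nessa Jonsson",
        "coreference_id": "c1",
        "relations": [
            {"coreference_id": "c2", "label": "RELATED_TO"},
            {"coreference_id": "c3", "label": "RELATED_TO"},
        ],
    },
    {"text": "Sweden", "coreference_id": "c2", "relations": []},
    {
        "text": "Erika",
        "coreference_id": "c3",
        "relations": [{"coreference_id": "c1", "label": "RELATED_TO"}],
    },
]
for triple in resolve_relations(relation_entities):
    print(triple)
```

Since RELATED_TO is generic, the entity labels on both ends (for example NAME and DOB) are what indicate which kind of relation was likely detected.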
Limitations
The relation extraction model is provided as an experimental feature and is not intended for production use. It currently supports English text and is constrained to inputs of up to 1024 tokens. Any text beyond this limit will be ignored during processing. Relation predictions may be inaccurate or missed, particularly in complex contexts where related entities occur far apart within the text.