Detect, Parse, and Validate Entities in Text
info
To run the example code in this guide, please sign up for your free test API key here.
In addition to de-identification and redaction, Private AI also supports entity detection and validation. The analyze/text route described below is an essential tool for exploring and structuring your data, as well as for building statistics on it. In this guide, we demonstrate how to use the analyze/text endpoint, introduced in 4.1, to return the analysis results for detected entities, with examples of how these results can serve your own use cases.
Analyze entities in text (new in 4.1)
The analyze/text route returns a list of detected entities, along with the formatted text for each entity and a description of its subtypes. In this guide, we provide payloads for Private AI's analyze/text REST API route and document the associated responses. To better illustrate how this information can be used, we walk through a series of common use cases.
Validation and custom redaction of credit card numbers
Some numerical entities include a checksum in their values. The checksum confirms the entity's validity and minimizes the chance of error during transcription. This is the case for credit card numbers, which must satisfy the Luhn algorithm. The analyze/text route implements this algorithm on top of the NER model detection, providing an additional safeguard that the detected number is indeed a valid credit card number. Let's look at three specific examples containing credit card numbers.
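For reference, the Luhn check that these numbers are run through is simple to reproduce. Here is a minimal standalone sketch (for illustration only; the analyze/text route performs this validation for you):

def luhn_is_valid(number: str) -> bool:
    """Returns True if the digits in `number` satisfy the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    # Starting from the rightmost digit, double every second digit and
    # subtract 9 whenever the doubled value exceeds 9.
    for i, digit in enumerate(reversed(digits)):
        if i % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

print(luhn_is_valid("4242424242424242"))     # True
print(luhn_is_valid("6578-7790-4346-2237"))  # False

The request below enables CREDIT_CARD detection on three example texts; the response follows.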
{
"text": [
"Okay, hang on just a second because I got to get it. Okay, it is 6578-7790-4346-2237. Expiration. 1224.",
"All right, I'm ready. 800 678-457-7896. Expiration is one. 224.",
"CC_type: Diners Club International RuPay Visa JCB Amex CCN: 30569309025904 4242424242424242 4222222222222 6172873484776530 378282246310005 CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70"
],
"locale": "en-US",
"entity_detection": {
"accuracy": "high",
"entity_types": [
{
"type": "ENABLE",
"value": ["CREDIT_CARD"]
}
]
}
}
[
{
"entities": [
{
"text": "6578-7790-4346-2237",
"location": {
"stt_idx": 65,
"end_idx": 84
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9022786834023215
},
"analysis_result": {
"formatted": "6578 7790 4346 2237",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "invalid"
}
]
}
}
],
"entities_present": true,
"characters_processed": 103,
"languages_detected": {
"en": 0.9202778935432434
}
},
{
"entities": [
{
"text": "800 678-457-7896",
"location": {
"stt_idx": 22,
"end_idx": 40
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9012922777069939
},
"analysis_result": {
"subtypes": [],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 65,
"languages_detected": {
"en": 0.8164065480232239
}
},
{
"entities": [
{
"text": "30569309025904",
"location": {
"stt_idx": 60,
"end_idx": 74
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "3056 9309 025 904",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
},
{
"text": "4242424242424242",
"location": {
"stt_idx": 75,
"end_idx": 91
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "4242 4242 4242 4242",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
},
{
"text": "4222222222222",
"location": {
"stt_idx": 92,
"end_idx": 105
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "4222 222 222 222",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
},
{
"text": "6172873484776530",
"location": {
"stt_idx": 106,
"end_idx": 122
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9088553956576756
},
"analysis_result": {
"formatted": "6172 8734 8477 6530",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "invalid"
}
]
}
},
{
"text": "378282246310005",
"location": {
"stt_idx": 123,
"end_idx": 138
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 1.0
},
"analysis_result": {
"formatted": "3782 8224 6310 005",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "valid"
}
]
}
}
],
"entities_present": true,
"characters_processed": 208,
"languages_detected": {
"en": 0.24319741129875183
}
}
]
The above request contains two fields, text and entity_detection, that are shared by the analyze/text, ner/text, and process/text routes. The text field contains the text to analyze, and the entity_detection field contains the NER configuration (e.g., the list of entity types to detect). One last field in the request, locale, is unique to the analyze/text request. The locale field is used as a hint to help the analyzer parse dates and other locale-dependent entities. For example, setting locale to en-US forces the analyzer to interpret the date 12-10-2020 as December 10, 2020 instead of October 12, 2020. Several examples of the values these fields can take are provided below.
The full response above is a mouthful, so let's look at the first example's response in more detail.
{
"entities": [
{
"text": "6578-7790-4346-2237",
"location": {
"stt_idx": 65,
"end_idx": 84
},
"best_label": "CREDIT_CARD",
"labels": {
"CREDIT_CARD": 0.9022786834023215
},
"analysis_result": {
"formatted": "6578 7790 4346 2237",
"subtypes": [],
"validation_assertions": [
{
"provider": "luhn",
"status": "invalid"
}
]
}
}
],
"entities_present": true,
"characters_processed": 103,
"languages_detected": {
"en": 0.9202778935432434
}
}
The response contains three main parts:
- The entity information, including its text and its location. These fields are shared with other routes, including the ner/text and process/text routes, and have the same use.
- The formatted text of the entity. This field is unique to the analyze/text route and provides a "standard" format for the entity, which can facilitate post-processing logic on detected entities. The formats are described in the following table.
Entity Type | Format | Example |
---|---|---|
CREDIT_CARD | space-separated groups of 3 to 5 digits | 6578 7790 4346 2237 |
DATE | ISO-8601 | 2025-03-20T18:00:00+00:00 |
DOB | ISO-8601 | 2025-03-20 |
AGE | decimal numeral | 12 |
All other entity types | no formatting | - |
- A list of validation assertions on the entity, which is also unique to the analyze/text route. It contains a list of objects that are specific to the entity being detected. In this example, the provider is the Luhn algorithm that was run on the credit card number, and the result of the algorithm is provided in the status field. Currently, only credit card numbers have validation assertions, but more assertion providers will be added in the future.
The analysis result of this first example can be summed up as follows: the credit card number was successfully parsed, and the parsed result is placed in the formatted field. However, although the number matches the credit card number format, it fails the Luhn check, so it is not a valid credit card number. This could be the result of a transcription error, for example.
The information included in the analysis result enables custom redaction of entities. The following code shows an example of a custom redaction of credit card numbers.
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="http://localhost:8080")
text = [
"Okay, hang on just a second because I got to get it. Okay, it is 6578-7790-4346-2237. Expiration. 1224.",
"All right, I'm ready. 800 678-457-7896. Expiration is one. 224.",
"CC_type: Diners Club International RuPay Visa JCB Amex CCN: 30569309025904 4242424242424242 4222222222222 6172873484776530 378282246310005 CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["CREDIT_CARD"]}]}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_credit_card(entity) -> str:
"""Redacts credit card numbers"""
analysis_result = entity["analysis_result"]
for assertion in analysis_result["validation_assertions"]:
if assertion["provider"] == "luhn":
if assertion["status"] == "valid":
return f"[{'*' * 12}{analysis_result['formatted'][-4:]}]"
else:
return f"{analysis_result['formatted']} [INVALID]"
return f"{entity['text']}"
entity_processors = {"CREDIT_CARD": redact_credit_card}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The redact_credit_card function contains the necessary logic to redact credit card numbers as follows:
- If the credit card number is valid, hide it except for the last four characters (which may include spaces).
- If the credit card number is parsed correctly but fails the Luhn check, the number is invalid. In this case, don't hide the number; instead, add an INVALID tag after it. This makes invalid credit card numbers easier to spot in text for a later review.
- If the number fails to parse as a credit card number, do nothing. The code assumes it is not actually a credit card number.
The above code output looks like this:
Okay, hang on just a second because I got to get it. Okay, it is 6578 7790 4346 2237 [INVALID]. Expiration. 1224.
All right, I'm ready. 800 678-457-7896. Expiration is one. 224.
CC_type: Diners Club International RuPay Visa JCB Amex CCN: [************ 904] [************4242] [************ 222] 6172 8734 8477 6530 [INVALID] [************ 005] CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70
Notice how the credit card number on the first line was not redacted; an INVALID marker was added right after it instead. On the second line, the 800 678-457-7896 entity was not redacted, as expected: it is likely a phone number rather than a credit card number. Finally, the last line shows several valid credit card numbers and a single invalid one. The valid credit card numbers were masked except for their last characters, as expected.
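The same analysis results can also feed statistics rather than redaction. A small sketch, reusing resp from the code above, that counts the Luhn verdicts across all three texts:

from collections import Counter

status_counts = Counter(
    assertion["status"]
    for entities in resp.entities
    for entity in entities
    for assertion in entity.get("analysis_result", {}).get("validation_assertions", [])
)
print(status_counts)  # Counter({'valid': 4, 'invalid': 2}) for the texts above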
Date shifting and custom redaction of dates
Dates are a type of PII encountered in almost every dataset. Redaction is one way to ensure that sensitive dates do not create privacy issues, but fully redacting dates often reduces the utility of the redacted data. It is therefore often preferable to use obfuscation methods that preserve utility; two well-known techniques are date shifting and date bucketing, sketched below. Let's consider three examples containing dates.
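As a standalone illustration of the two techniques (independent of the API; the helper names are ours):

import random
from datetime import datetime, timedelta

def shift_date(d: datetime, max_weeks: int = 20) -> datetime:
    """Date shifting: move the date by a random offset."""
    return d + timedelta(weeks=random.randint(0, max_weeks))

def bucket_date(d: datetime) -> str:
    """Date bucketing: keep only the year and the quarter."""
    return f"{d.year}-Q{(d.month - 1) // 3 + 1}"

d = datetime(2018, 7, 10)
print(shift_date(d).date())  # e.g. 2018-10-02 (the offset is random)
print(bucket_date(d))        # 2018-Q3

The request below enables the date-related entity types on three example texts; the response follows.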
{
"text": [
"$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (July 10 2018) https://t.co/eit53RUY4g",
"Short sale volume (not short interest) for $KBE on 2018-07-09 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%",
"$WLTW high OI range is 160 to 155 for option expiration 07/20/2018 #options https://t.co/BnVElKBKkJ"
],
"locale": "en-US",
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["DATE", "DOB", "DAY", "MONTH", "YEAR"]
}
]
}
}
[
{
"entities": [
{
"text": "July 10 2018",
"location": {
"stt_idx": 89,
"end_idx": 101
},
"best_label": "DATE",
"labels": {
"DATE": 0.9400081038475037,
"MONTH": 0.3111259341239929,
"DAY": 0.31207050879796344,
"YEAR": 0.29245950778325397
},
"analysis_result": {
"formatted": "2018-07-10T00:00:00",
"subtypes": [
{
"text": "10",
"formatted": "10",
"label": "DAY",
"location": {
"stt_idx": 94,
"end_idx": 96
}
},
{
"text": "July",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 89,
"end_idx": 93
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 97,
"end_idx": 101
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 126,
"languages_detected": {
"en": 0.6427053809165955
}
},
{
"entities": [
{
"text": "2018-07-09",
"location": {
"stt_idx": 51,
"end_idx": 61
},
"best_label": "DATE",
"labels": {
"DATE": 0.9267139077186585,
"YEAR": 0.17909334897994994,
"MONTH": 0.18299812078475952,
"DAY": 0.18503443002700806
},
"analysis_result": {
"formatted": "2018-07-09T00:00:00",
"subtypes": [
{
"text": "09",
"formatted": "9",
"label": "DAY",
"location": {
"stt_idx": 59,
"end_idx": 61
}
},
{
"text": "07",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 56,
"end_idx": 58
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 51,
"end_idx": 55
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 132,
"languages_detected": {
"en": 0.5451536178588867
}
},
{
"entities": [
{
"text": "07/20/2018",
"location": {
"stt_idx": 56,
"end_idx": 66
},
"best_label": "DATE",
"labels": {
"DATE": 0.9359936833381652,
"MONTH": 0.18900736570358276,
"DAY": 0.18550281524658202,
"YEAR": 0.18460171222686766
},
"analysis_result": {
"formatted": "2018-07-20T00:00:00",
"subtypes": [
{
"text": "20",
"formatted": "20",
"label": "DAY",
"location": {
"stt_idx": 59,
"end_idx": 61
}
},
{
"text": "07",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 56,
"end_idx": 58
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 62,
"end_idx": 66
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 99,
"languages_detected": {
"en": 0.7047932744026184
}
}
]
Let's look at one specific date entity in the above response.
{
"text": "July 10 2018",
"location": {
"stt_idx": 89,
"end_idx": 101
},
"best_label": "DATE",
"labels": {
"DATE": 0.9400081038475037,
"MONTH": 0.3111259341239929,
"DAY": 0.31207050879796344,
"YEAR": 0.29245950778325397
},
"analysis_result": {
"formatted": "2018-07-10T00:00:00",
"subtypes": [
{
"text": "10",
"formatted": "10",
"label": "DAY",
"location": {
"stt_idx": 94,
"end_idx": 96
}
},
{
"text": "July",
"formatted": "7",
"label": "MONTH",
"location": {
"stt_idx": 89,
"end_idx": 93
}
},
{
"text": "2018",
"formatted": "2018",
"label": "YEAR",
"location": {
"stt_idx": 97,
"end_idx": 101
}
}
],
"validation_assertions": []
}
}
Many pieces of information are accessible from the analysis_result
object. First, it is possible to access the formatted date "2018-07-10T00:00:00" from the field analysis_result.formatted
. If you plan to implement logic on the dates found in the text, it might be easier to access the formatted dates rather than the original, non-standard date formats (e.g., "July 10 2018").
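For instance, the ISO-8601 value parses directly with Python's standard library:

from datetime import datetime

parsed = datetime.fromisoformat("2018-07-10T00:00:00")
print(parsed.year, parsed.month, parsed.day)  # 2018 7 10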
Also, it is possible to directly access the day, month, and year of the date entity via the response fields in analysis_result.subtypes
. This information can be used to partially redact or to bucketize dates. The following code gives an example of how to redact the day and month from the original dates, but keep the year unchanged. This example uses helpers that have been made available in the Private AI python client.
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="http://localhost:8080")
text = [
"$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (July 10 2018) https://t.co/eit53RUY4g",
"Short sale volume (not short interest) for $KBE on 2018-07-09 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%",
"$WLTW high OI range is 160 to 155 for option expiration 07/20/2018 #options https://t.co/BnVElKBKkJ",
]
request = {
"text": text,
"locale": "en-US",
"entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["DATE", "DOB", "DAY", "MONTH", "YEAR"]}]},
}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_date(entity) -> str:
"""Redacts days and months from dates"""
offset = entity["location"]["stt_idx"]
text = entity["text"]
for subtype in entity["analysis_result"]["subtypes"]:
if subtype["label"] in ["DAY", "MONTH"] and "location" in subtype:
stt = subtype["location"]["stt_idx"] - offset
end = subtype["location"]["end_idx"] - offset
text = text[:stt] + "#" * (end - stt) + text[end:]
return text
entity_processors = {"DATE": redact_date, "DOB": redact_date}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of this request is provided below:
$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (#### ## 2018) https://t.co/eit53RUY4g
Short sale volume (not short interest) for $KBE on 2018-##-## is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%
$WLTW high OI range is 160 to 155 for option expiration ##/##/2018 #options https://t.co/BnVElKBKkJ
Notice how the dates have been partially redacted. A similar approach can be used to instead shift the dates. To do so, simply replace the date processor in the above code with this one:
import random
from datetime import datetime, timedelta

def redact_date(entity) -> str:
    """Shifts the date by a random number of weeks (0 to 20)"""
    random_week_offset = random.randint(0, 20)
    if "formatted" in entity["analysis_result"]:
        formatted_datetime = datetime.fromisoformat(entity["analysis_result"]["formatted"])
        return str((formatted_datetime + timedelta(weeks=random_week_offset)).date())
    else:
        return entity["text"]
This is an example output of this date processor.
$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (2018-07-17) https://t.co/eit53RUY4g
Short sale volume (not short interest) for $KBE on 2018-08-20 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%
$WLTW high OI range is 160 to 155 for option expiration 2018-09-28 #options https://t.co/BnVElKBKkJ
Notice how the dates are replaced with dates that have been shifted by a random number of weeks.
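Note that this processor draws an independent offset for each entity, so the intervals between dates in the same document are not preserved. If those intervals matter for your use case, one variant (our suggestion, not part of the client) is to fix the offset once per document:

import random
from datetime import datetime, timedelta

# One offset per document preserves the relative spacing between its dates.
document_week_offset = random.randint(0, 20)

def redact_date(entity) -> str:
    """Shifts every date in the document by the same number of weeks"""
    if "formatted" in entity["analysis_result"]:
        formatted_datetime = datetime.fromisoformat(entity["analysis_result"]["formatted"])
        return str((formatted_datetime + timedelta(weeks=document_week_offset)).date())
    return entity["text"]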
Age bucketing and custom redaction of numbers
Similar to dates, it is possible to analyze ages and other numerical entities to create custom redaction. Consider these two examples.
{
"text": [
"A 32-year old Black female German citizen living in Germany wants to travel to the United States for leisure.",
"West Point Public School division provides school-based preschool services for children from two through nine years of age who are children at risk and children with identified disabilities or delays."
],
"link_batch": false,
"locale": "en-US",
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["AGE"]
}
]
}
}
[
{
"entities": [
{
"text": "32",
"location": {
"stt_idx": 2,
"end_idx": 4
},
"best_label": "AGE",
"labels": {
"AGE": 0.9668179750442505
},
"analysis_result": {
"formatted": 32,
"subtypes": [],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 109,
"languages_detected": {
"en": 0.9611877202987671
}
},
{
"entities": [
{
"text": "two",
"location": {
"stt_idx": 93,
"end_idx": 96
},
"best_label": "AGE",
"labels": {
"AGE": 0.9462096095085144
},
"analysis_result": {
"formatted": 2,
"subtypes": [],
"validation_assertions": []
}
},
{
"text": "nine",
"location": {
"stt_idx": 105,
"end_idx": 109
},
"best_label": "AGE",
"labels": {
"AGE": 0.9411536455154419
},
"analysis_result": {
"formatted": 9,
"subtypes": [],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 200,
"languages_detected": {
"en": 0.9786704778671265
}
}
]
Using the Private AI python client, the above analyze/text response can be used to bucketize ages.
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="http://localhost:8080")
text = [
"A 32-year old Black female German citizen living in Germany wants to travel to the United States for leisure.",
"West Point Public School division provides school-based preschool services for children from two through nine years of age who are children at risk and children with identified disabilities or delays.",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["AGE"]}]}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_age(entity) -> str:
    """Rounds the age to the nearest multiple of ten"""
    if "formatted" in entity["analysis_result"]:
        age = entity["analysis_result"]["formatted"]
        # round(age * 10, -2) rounds to the nearest 100; dividing by 10
        # therefore rounds the age itself to the nearest multiple of ten.
        return str(int(round(age * 10, -2) / 10))
    else:
        return "#"
entity_processors = {"AGE": redact_age}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of this code shows that ages have been bucketed to the closest multiple of ten.
A 30-year old Black female German citizen living in Germany wants to travel to the United States for leisure.
West Point Public School division provides school-based preschool services for children from 0 through 10 years of age who are children at risk and children with identified disabilities or delays.
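If ranges read better than rounded values for your use case, the processor can instead emit a ten-year bucket label. A hypothetical variant (the range format is our choice, not part of the client):

def redact_age(entity) -> str:
    """Replaces the age with a ten-year range such as 30-39"""
    if "formatted" in entity["analysis_result"]:
        age = int(entity["analysis_result"]["formatted"])
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "#"

With this processor, the first example would read "A 30-39-year old ..." instead.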
Custom redaction of addresses
The GDPR and other privacy legislation impose strict requirements on the redaction of addresses. In the following scenario, we demonstrate how to partially redact an address by keeping only the less sensitive characters of a zip/postal code and removing all other address information (e.g., civic number, street name, and so on).
{
"text": [
"Please deliver this to 45, Clybaun Heights, Galway City, Ireland H91 AKK3",
"3255 M-A-D-D-A-M-S street, huntington, west virginia is his birthplace",
"My favorite city is San Francisco, California 94110, United States, 37.7749° N, 122.4194° W"
],
"locale": "en-US"
}
[
{
"entities": [
{
"text": "45, Clybaun Heights, Galway City, Ireland H91 AKK3",
"location": {
"stt_idx": 23,
"end_idx": 73
},
"best_label": "LOCATION_ADDRESS",
"labels": {
"LOCATION_ADDRESS_STREET": 0.3171827793121338,
"LOCATION": 0.9123516889179454,
"LOCATION_ADDRESS": 0.9221759648884044,
"LOCATION_CITY": 0.16148114204406738,
"LOCATION_COUNTRY": 0.05482322678846471,
"LOCATION_ZIP": 0.26978740271400004
},
"analysis_result": {
"subtypes": [
{
"text": "45, Clybaun Heights",
"label": "LOCATION_ADDRESS_STREET",
"location": {
"stt_idx": 23,
"end_idx": 42
}
},
{
"text": "Galway City",
"label": "LOCATION_CITY",
"location": {
"stt_idx": 44,
"end_idx": 55
}
},
{
"text": "Ireland",
"label": "LOCATION_COUNTRY",
"location": {
"stt_idx": 57,
"end_idx": 64
}
},
{
"text": "H91 AKK3",
"label": "LOCATION_ZIP",
"location": {
"stt_idx": 65,
"end_idx": 73
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 73,
"languages_detected": {
"en": 0.8342836499214172
}
},
{
"entities": [
{
"text": "3255 M-A-D-D-A-M-S street, huntington, west virginia",
"location": {
"stt_idx": 0,
"end_idx": 52
},
"best_label": "LOCATION_ADDRESS",
"labels": {
"LOCATION_ADDRESS_STREET": 0.6232224106788635,
"LOCATION_ADDRESS": 0.9109632035960322,
"LOCATION": 0.8909260371456975,
"LOCATION_CITY": 0.07817105106685472,
"LOCATION_STATE": 0.12203486328539641
},
"analysis_result": {
"subtypes": [
{
"text": "3255 M-A-D-D-A-M-S street",
"label": "LOCATION_ADDRESS_STREET",
"location": {
"stt_idx": 0,
"end_idx": 25
}
},
{
"text": "huntington",
"label": "LOCATION_CITY",
"location": {
"stt_idx": 27,
"end_idx": 37
}
},
{
"text": "west virginia",
"label": "LOCATION_STATE",
"location": {
"stt_idx": 39,
"end_idx": 52
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 70,
"languages_detected": {
"en": 0.9467829465866089
}
},
{
"entities": [
{
"text": "San Francisco, California 94110, United States, 37.7749\u00b0 N, 122.4194\u00b0 W",
"location": {
"stt_idx": 20,
"end_idx": 91
},
"best_label": "LOCATION",
"labels": {
"LOCATION_CITY": 0.080466923614343,
"LOCATION": 0.8993716637293497,
"LOCATION_ADDRESS": 0.200799106930693,
"LOCATION_STATE": 0.03926792989174525,
"LOCATION_ZIP": 0.12127648045619328,
"LOCATION_COUNTRY": 0.07723071426153183,
"LOCATION_COORDINATE": 0.4833615819613139
},
"analysis_result": {
"subtypes": [
{
"text": "San Francisco",
"label": "LOCATION_CITY",
"location": {
"stt_idx": 20,
"end_idx": 33
}
},
{
"text": "California",
"label": "LOCATION_STATE",
"location": {
"stt_idx": 35,
"end_idx": 45
}
},
{
"text": "94110",
"label": "LOCATION_ZIP",
"location": {
"stt_idx": 46,
"end_idx": 51
}
},
{
"text": "United States",
"label": "LOCATION_COUNTRY",
"location": {
"stt_idx": 53,
"end_idx": 66
}
},
{
"text": "37.7749\u00b0 N, 122.4194\u00b0 W",
"label": "LOCATION_COORDINATE",
"location": {
"stt_idx": 68,
"end_idx": 91
}
}
],
"validation_assertions": []
}
}
],
"entities_present": true,
"characters_processed": 91,
"languages_detected": {
"en": 0.7658711075782776
}
}
]
The above request contains three examples with addresses, and the corresponding analyze/text response contains the result of the analysis. This response, along with the following code, can be used to mask street addresses in order to hide the most sensitive information.
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="http://localhost:8080")
text = [
"Please deliver this to 45, Clybaun Heights, Galway City, Ireland H91 AKK3",
"3255 M-A-D-D-A-M-S street, huntington, west virginia is his birthplace",
"My favorite city is San Francisco, California 94110, United States, 37.7749° N, 122.4194° W",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high"}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_address(entity) -> str:
"""Redacts address to hide the most sensitive info"""
analysis_result = entity["analysis_result"]
subtypes = sorted(analysis_result["subtypes"], key=lambda x: x["location"]["stt_idx"])
address_parts = []
for subtype in subtypes:
if subtype["label"] in ["LOCATION_COUNTRY", "LOCATION_STATE", "LOCATION_CITY"]:
address_parts.append(subtype["text"])
elif subtype["label"] in ["LOCATION_ZIP"]:
address_parts.append(subtype["text"][:3] + "#" * (len(subtype["text"]) - 3))
else:
address_parts.append(f"""[{subtype["label"]}]""")
return " ".join(address_parts)
entity_processors = {"LOCATION": redact_address, "LOCATION_ADDRESS": redact_address}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of the code above shows the redacted addresses. As you can see, only the first three characters of the postal code and zip code are kept, and street addresses, when present, are replaced with a marker. The last example shows that GPS coordinates are also redacted.
Please deliver this to [LOCATION_ADDRESS_STREET] Galway City Ireland H91#####
[LOCATION_ADDRESS_STREET] huntington west virginia is his birthplace
My favorite city is San Francisco California 941## United States [LOCATION_COORDINATE]
Coreference Resolution
Coreference resolution is the task of identifying different entity mentions in a given text that refer to the same real-world entity. The analyze/text
route supports coreference resolution through the optional relation_detection
field in the request. The relation_detection
field offers a configurable option for coreference resolution:
- coreference_resolution: Specifies the method for identifying coreferential entities:
  - heuristics: Uses rule-based methods
  - model_prediction: Uses machine learning models
  - combined: Uses both approaches
{
"text": [
"Nikola Jokić (Serbian Cyrillic: Никола Јокић, pronounced [nǐkola jôkitɕ] ⓘ; born February 19, 1995) is a Serbian professional basketball player who is a center for the Denver Nuggets of the National Basketball Association (NBA). Jokić was born in the city of Sombor in the northern part of Serbia. He grew up in a cramped two-bedroom apartment that housed him and his two brothers."
],
"entity_detection": {
"accuracy": "high"
},
"locale": "en-US",
"relation_detection": {
"coreference_resolution": "model_prediction"
}
}
[
{
"entities": [
{
"text": "Nikola Jokić",
"location": {
"stt_idx": 0,
"end_idx": 12
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.2300402671098709,
"NAME": 0.9172913134098053,
"NAME_FAMILY": 0.6867769062519073
},
"coreference_id": "NAME_1"
},
{
"text": "Serbian Cyrillic",
"location": {
"stt_idx": 14,
"end_idx": 30
},
"best_label": "LANGUAGE",
"labels": {
"LANGUAGE": 0.94222651720047
},
"coreference_id": "LANGUAGE_1"
},
{
"text": "Никола Јокић",
"location": {
"stt_idx": 32,
"end_idx": 44
},
"best_label": "NAME_GIVEN",
"labels": {
"NAME": 0.842899182013103,
"NAME_GIVEN": 0.6497380946363721,
"NAME_FAMILY": 0.07980045356920787
},
"coreference_id": "NAME_1"
},
{
"text": "nǐkola jôkitɕ",
"location": {
"stt_idx": 58,
"end_idx": 71
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.4644567847251892,
"NAME": 0.8982340276241303,
"NAME_FAMILY": 0.44664961099624634
},
"coreference_id": "NAME_1"
},
{
"text": "February 19, 1995",
"location": {
"stt_idx": 81,
"end_idx": 98
},
"best_label": "DOB",
"labels": {
"DOB": 0.9391335248947144
},
"analysis_result": {
"formatted": "1995-02-19T00:00:00",
"subtypes": [
{
"formatted": "19",
"label": "DAY"
},
{
"formatted": "2",
"label": "MONTH"
},
{
"formatted": "1995",
"label": "YEAR"
}
],
"validation_assertions": []
},
"coreference_id": "DOB_1"
},
{
"text": "Serbian",
"location": {
"stt_idx": 105,
"end_idx": 112
},
"best_label": "ORIGIN",
"labels": {
"ORIGIN": 0.9151841402053833
},
"coreference_id": "ORIGIN_1"
},
{
"text": "professional basketball player",
"location": {
"stt_idx": 113,
"end_idx": 143
},
"best_label": "OCCUPATION",
"labels": {
"OCCUPATION": 0.8843509753545126
},
"coreference_id": "OCCUPATION_1"
},
{
"text": "center",
"location": {
"stt_idx": 153,
"end_idx": 159
},
"best_label": "OCCUPATION",
"labels": {
"OCCUPATION": 0.8316260576248169
},
"coreference_id": "OCCUPATION_2"
},
{
"text": "Denver Nuggets",
"location": {
"stt_idx": 168,
"end_idx": 182
},
"best_label": "ORGANIZATION",
"labels": {
"LOCATION_CITY": 0.48198258876800537,
"ORGANIZATION": 0.9154168367385864,
"LOCATION": 0.4703272879123688
},
"coreference_id": "ORGANIZATION_1"
},
{
"text": "National Basketball Association",
"location": {
"stt_idx": 190,
"end_idx": 221
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.9143192768096924
},
"coreference_id": "ORGANIZATION_2"
},
{
"text": "NBA",
"location": {
"stt_idx": 223,
"end_idx": 226
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8653480410575867
},
"coreference_id": "ORGANIZATION_2"
},
{
"text": "Jokić",
"location": {
"stt_idx": 229,
"end_idx": 234
},
"best_label": "NAME_FAMILY",
"labels": {
"NAME_FAMILY": 0.9177489876747131,
"NAME": 0.9098437031110128
},
"coreference_id": "NAME_1"
},
{
"text": "Sombor",
"location": {
"stt_idx": 259,
"end_idx": 265
},
"best_label": "LOCATION_CITY",
"labels": {
"LOCATION_CITY": 0.925605853398641,
"LOCATION": 0.9114498297373453
},
"coreference_id": "LOCATION_CITY_1"
},
{
"text": "Serbia",
"location": {
"stt_idx": 290,
"end_idx": 296
},
"best_label": "LOCATION_COUNTRY",
"labels": {
"LOCATION_COUNTRY": 0.9711890816688538,
"LOCATION": 0.9073220491409302
},
"coreference_id": "LOCATION_COUNTRY_1"
}
],
"entities_present": true,
"characters_processed": 381,
"languages_detected": {
"en": 0.9837551116943359
}
}
]
The response includes a key element for each entity:
- coreference_id: A unique identifier added to each entity that groups coreferential entities under a common label. This behavior matches the /process/text endpoint when processed_text is set to MARKER and coreference resolution is applied. For example, "Nikola Jokić", "Никола Јокић", "nǐkola jôkitɕ", and "Jokić" all share the same coreference_id ("NAME_1"), indicating that they refer to the same person.
The following example demonstrates how to use the coreference information from the API to replace all mentions of a specific person with a fictive name, while leaving other entities in the text unchanged:
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="http://localhost:8080")
text = [
"Nikola Jokić is a basketball player. LeBron James is also a basketball player. Jokić and James played against each other. Jokić led his team with a triple-double performance. After the game, Nikola praised his teammates for their effort. Many fans consider Nikola Jokić one of the best centers in NBA history."
]
# Create request with coreference resolution enabled
request = {
"text": text,
"locale": "en-US",
"entity_detection": {"accuracy": "high"},
"relation_detection": {"coreference_resolution": "model_prediction"}
}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# Find the coreference_id for "Nikola Jokić"
target_name = "Nikola Jokić"
target_coref_id = next(
(entity["coreference_id"] for entity in resp.entities[0] if entity.get("text") == target_name and "coreference_id" in entity), None
)
if target_coref_id is None:
raise ValueError(f"Could not find coreference_id for {target_name}")
def replace_with_fictive_name(entity, target_coref_id=target_coref_id, fictive_name="John Doe"):
"""Replace all mentions of the target person with a fictive name."""
if "coreference_id" in entity and entity["coreference_id"] == target_coref_id:
return fictive_name
return entity["text"]
entity_processors = {
"NAME": replace_with_fictive_name,
"NAME_GIVEN": replace_with_fictive_name,
"NAME_FAMILY": replace_with_fictive_name
}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=lambda entity: entity["text"])
for example in deidentified_text:
print(example)
The output of running this code is:
John Doe is a basketball player. LeBron James is also a basketball player. John Doe and James played against each other. John Doe led his team with a triple-double performance. After the game, John Doe praised his teammates for their effort. Many fans consider John Doe one of the best centers in NBA history.
This example demonstrates targeted redaction by replacing all mentions of a specific person (Nikola Jokić) in the text, while preserving other names in the document. The coreference information allows consistent redaction across all forms of a name, including different scripts and variants, when using model_prediction
or combined
mode. In heuristics
mode, not all variants may be grouped under the same coreference identifier.
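To inspect the clusters directly, the same response can be grouped by coreference_id. A minimal sketch, reusing resp from the code above:

from collections import defaultdict

# Group the detected entities of the first text by their coreference_id.
clusters = defaultdict(list)
for entity in resp.entities[0]:
    clusters[entity.get("coreference_id", "UNGROUPED")].append(entity["text"])

for coref_id, mentions in clusters.items():
    print(coref_id, mentions)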