Customizing Redaction
info
In order to run the example code in this guide, please sign up for your free test API key here or run the container.
The Private AI APIs offer a lot of flexibility when it comes to creating redacted or de-identified content. This guide introduces a few techniques to modify and extend the existing capabilities of the Private AI APIs. It is organized into four parts:
- Part 1: Configuring a mask covers the basics of setting a mask to meet your preferences.
- Part 2: Configuring a marker covers the basics of setting the marker format to meet your preferences.
- Part 3: Using synthetic PII explains how to replace the original PII with synthetic values.
- Part 4: Custom redaction using the NER Text route presents an approach to create a fully customized redacted output using the NER route.
The techniques described in the first two sections apply to most of the Private AI APIs. In particular, they can be used to customize the redaction in the Process Text route, the File URI route and the File Base64 route. The configuration is done through the shared processed_text object, which is part of the API request. See the documentation of each route for details. The examples below use requests and responses from the Process Text route for simplicity.
When redacting or de-identifying text, a customizable string pattern is used to replace the detected PII in the text. Private AI supports these replacement options:
- MASK containing repeated characters up to the length of the replaced entity. This option provides a redacted text containing no information about the actual entities that were replaced.
- MARKER containing the type of the entity being replaced. Markers can also be configured to link different mentions of the same entities in the redacted text (i.e., a name that appears twice in the text will have the same unique replacement marker).
- SYNTHETIC text containing an AI-generated replacement for the original entity. This option provides a processed text that is very similar to the original input text except that sensitive PII has been replaced with fake values.
Configuring a mask
Masking is also known as hashing when the # character is used. Setting the mask option is as simple as setting the type MASK in the processed_text object.
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"processed_text": {
"type": "MASK"
}
}
This option will replace all names, organizations and other PII mentioned with the default mask character #. The redacted text will then look like:
"Hi! This is ###############################. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ###### Frequent Voyager points. I checked my account today. I had 5,000 points in ####. Now it’s just 3,500. I haven’t flown in ###### since #### because of #####. What happened? May I have your name and account number, ma’am? #############, #############."
[
{
"processed_text": "Hi! This is ###############################. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ###### Frequent Voyager points. I checked my account today. I had 5,000 points in ####. Now it’s just 3,500. I haven’t flown in ###### since #### because of #####. What happened? May I have your name and account number, ma’am? #############, #############.",
"entities": [
{
"processed_text": "###############################",
"text": "Icarus Airways Customer Service",
"location": {
"stt_idx": 12,
"end_idx": 43,
"stt_idx_processed": 12,
"end_idx_processed": 43
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8451
}
},
{
"processed_text": "######",
"text": "Icarus",
"location": {
"stt_idx": 134,
"end_idx": 140,
"stt_idx_processed": 134,
"end_idx_processed": 140
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.7969
}
},
{
"processed_text": "####",
"text": "2019",
"location": {
"stt_idx": 216,
"end_idx": 220,
"stt_idx_processed": 216,
"end_idx_processed": 220
},
"best_label": "DATE_INTERVAL",
"labels": {
"DATE_INTERVAL": 0.9384
}
},
{
"processed_text": "######",
"text": "Icarus",
"location": {
"stt_idx": 262,
"end_idx": 268,
"stt_idx_processed": 262,
"end_idx_processed": 268
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8285
}
},
{
"processed_text": "####",
"text": "2019",
"location": {
"stt_idx": 275,
"end_idx": 279,
"stt_idx_processed": 275,
"end_idx_processed": 279
},
"best_label": "DATE_INTERVAL",
"labels": {
"DATE_INTERVAL": 0.9393
}
},
{
"processed_text": "#####",
"text": "COVID",
"location": {
"stt_idx": 291,
"end_idx": 296,
"stt_idx_processed": 291,
"end_idx_processed": 296
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 0.9327
}
},
{
"processed_text": "#############",
"text": "Nessa Jonsson",
"location": {
"stt_idx": 361,
"end_idx": 374,
"stt_idx_processed": 361,
"end_idx_processed": 374
},
"best_label": "NAME",
"labels": {
"NAME": 0.903,
"NAME_GIVEN": 0.3583,
"NAME_FAMILY": 0.5411
}
},
{
"processed_text": "#############",
"text": "N-E-S-S-A J-O-N-S-S-O-N",
"location": {
"stt_idx": 376,
"end_idx": 399,
"stt_idx_processed": 376,
"end_idx_processed": 389
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.3708,
"NAME": 0.907,
"NAME_FAMILY": 0.5271
}
}
],
"entities_present": true,
"characters_processed": 400,
"languages_detected": {
"en": 0.9167966246604919
}
}
]
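The location object of each entity carries two pairs of offsets: stt_idx and end_idx index into the original text, while stt_idx_processed and end_idx_processed index into the redacted text. As a minimal sketch, assuming the end offsets are exclusive (which is consistent with the sample response above), the hypothetical helper `replacement_pairs` below slices both strings to pair each original span with what replaced it. The data is a shortened, hand-written fragment in the same shape as the response, not real API output.

```python
from typing import Dict, List, Tuple


def replacement_pairs(original: str, result: Dict) -> List[Tuple[str, str]]:
    """Pair each detected span in the original text with its replacement."""
    redacted = result["processed_text"]
    pairs = []
    for entity in result["entities"]:
        loc = entity["location"]
        # Offsets into the original text.
        before = original[loc["stt_idx"]:loc["end_idx"]]
        # Offsets into the redacted text.
        after = redacted[loc["stt_idx_processed"]:loc["end_idx_processed"]]
        pairs.append((before, after))
    return pairs


original = "Hi! This is Icarus Airways Customer Service."
result = {
    "processed_text": "Hi! This is " + "#" * 31 + ".",
    "entities": [
        {
            "text": "Icarus Airways Customer Service",
            "location": {
                "stt_idx": 12,
                "end_idx": 43,
                "stt_idx_processed": 12,
                "end_idx_processed": 43,
            },
        }
    ],
}

for before, after in replacement_pairs(original, result):
    print(f"{before!r} -> {after!r}")
```

Slicing with the processed offsets, rather than reading the entity's processed_text field, also works when the replacement in the text is wrapped in extra characters, as happens with markers.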
If you prefer a different masking character in your redacted text, you can specify it in the request.
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"processed_text": {
"type": "MASK",
"mask_character": "■"
}
}
The above request will redact using the provided mask character:
"Hi! This is ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ■■■■■■ Frequent Voyager points. I checked my account today. I had 5,000 points in ■■■■. Now it’s just 3,500. I haven’t flown in ■■■■■■ since ■■■■ because of ■■■■■. What happened? May I have your name and account number, ma’am? ■■■■■■■■■■■■■, ■■■■■■■■■■■■■."
[
{
"processed_text": "Hi! This is ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ■■■■■■ Frequent Voyager points. I checked my account today. I had 5,000 points in ■■■■. Now it’s just 3,500. I haven’t flown in ■■■■■■ since ■■■■ because of ■■■■■. What happened? May I have your name and account number, ma’am? ■■■■■■■■■■■■■, ■■■■■■■■■■■■■.",
"entities": [
{
"processed_text": "■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■",
"text": "Icarus Airways Customer Service",
"location": {
"stt_idx": 12,
"end_idx": 43,
"stt_idx_processed": 12,
"end_idx_processed": 43
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8451
}
},
{
"processed_text": "■■■■■■",
"text": "Icarus",
"location": {
"stt_idx": 134,
"end_idx": 140,
"stt_idx_processed": 134,
"end_idx_processed": 140
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.7969
}
},
{
"processed_text": "■■■■",
"text": "2019",
"location": {
"stt_idx": 216,
"end_idx": 220,
"stt_idx_processed": 216,
"end_idx_processed": 220
},
"best_label": "DATE_INTERVAL",
"labels": {
"DATE_INTERVAL": 0.9384
}
},
{
"processed_text": "■■■■■■",
"text": "Icarus",
"location": {
"stt_idx": 262,
"end_idx": 268,
"stt_idx_processed": 262,
"end_idx_processed": 268
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8285
}
},
{
"processed_text": "■■■■",
"text": "2019",
"location": {
"stt_idx": 275,
"end_idx": 279,
"stt_idx_processed": 275,
"end_idx_processed": 279
},
"best_label": "DATE_INTERVAL",
"labels": {
"DATE_INTERVAL": 0.9393
}
},
{
"processed_text": "■■■■■",
"text": "COVID",
"location": {
"stt_idx": 291,
"end_idx": 296,
"stt_idx_processed": 291,
"end_idx_processed": 296
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 0.9327
}
},
{
"processed_text": "■■■■■■■■■■■■■",
"text": "Nessa Jonsson",
"location": {
"stt_idx": 361,
"end_idx": 374,
"stt_idx_processed": 361,
"end_idx_processed": 374
},
"best_label": "NAME",
"labels": {
"NAME": 0.903,
"NAME_GIVEN": 0.3583,
"NAME_FAMILY": 0.5411
}
},
{
"processed_text": "■■■■■■■■■■■■■",
"text": "N-E-S-S-A J-O-N-S-S-O-N",
"location": {
"stt_idx": 376,
"end_idx": 399,
"stt_idx_processed": 376,
"end_idx_processed": 389
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.3708,
"NAME": 0.907,
"NAME_FAMILY": 0.5271
}
}
],
"entities_present": true,
"characters_processed": 400,
"languages_detected": {
"en": 0.9167966246604919
}
}
]
In both cases, the redacted text does not contain any information about the entities that were replaced besides their length.
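The two mask requests above differ only in the optional mask_character field. A small sketch of building either request body in Python follows; the helper name `build_mask_request` is hypothetical, and only the field names shown in the requests above are used.

```python
import json
from typing import Dict, List, Optional


def build_mask_request(texts: List[str], mask_character: Optional[str] = None) -> Dict:
    """Build a Process Text request body that asks for MASK redaction."""
    processed_text = {"type": "MASK"}
    if mask_character is not None:
        # Leaving the field out keeps the service default ("#").
        processed_text["mask_character"] = mask_character
    return {"text": texts, "processed_text": processed_text}


body = build_mask_request(["My name is Nessa Jonsson."], mask_character="■")
# ensure_ascii=False keeps the mask character readable instead of escaping it.
print(json.dumps(body, ensure_ascii=False, indent=2))
```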
Configuring a marker
The marker option allows the redacted text to include the entity type. This may improve the readability of the redacted text. Setting the default marker option is as simple as setting the mask option.
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"processed_text": {
"type": "MARKER"
}
}
With the default marker settings, the response will contain markers with the entity type and a unique number.
"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since [DATE_INTERVAL_1] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."
[
{
"processed_text": "Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since [DATE_INTERVAL_1] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
"entities": [
{
"processed_text": "ORGANIZATION_1",
"text": "Icarus Airways Customer Service",
"location": {
"stt_idx": 12,
"end_idx": 43,
"stt_idx_processed": 12,
"end_idx_processed": 28
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8451
}
},
{
"processed_text": "ORGANIZATION_2",
"text": "Icarus",
"location": {
"stt_idx": 134,
"end_idx": 140,
"stt_idx_processed": 119,
"end_idx_processed": 135
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.7969
}
},
{
"processed_text": "DATE_INTERVAL_1",
"text": "2019",
"location": {
"stt_idx": 216,
"end_idx": 220,
"stt_idx_processed": 211,
"end_idx_processed": 228
},
"best_label": "DATE_INTERVAL",
"labels": {
"DATE_INTERVAL": 0.9384
}
},
{
"processed_text": "ORGANIZATION_2",
"text": "Icarus",
"location": {
"stt_idx": 262,
"end_idx": 268,
"stt_idx_processed": 270,
"end_idx_processed": 286
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8285
}
},
{
"processed_text": "DATE_INTERVAL_1",
"text": "2019",
"location": {
"stt_idx": 275,
"end_idx": 279,
"stt_idx_processed": 293,
"end_idx_processed": 310
},
"best_label": "DATE_INTERVAL",
"labels": {
"DATE_INTERVAL": 0.9393
}
},
{
"processed_text": "CONDITION_1",
"text": "COVID",
"location": {
"stt_idx": 291,
"end_idx": 296,
"stt_idx_processed": 322,
"end_idx_processed": 335
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 0.9327
}
},
{
"processed_text": "NAME_1",
"text": "Nessa Jonsson",
"location": {
"stt_idx": 361,
"end_idx": 374,
"stt_idx_processed": 400,
"end_idx_processed": 408
},
"best_label": "NAME",
"labels": {
"NAME": 0.903,
"NAME_GIVEN": 0.3583,
"NAME_FAMILY": 0.5411
}
},
{
"processed_text": "NAME_1",
"text": "N-E-S-S-A J-O-N-S-S-O-N",
"location": {
"stt_idx": 376,
"end_idx": 399,
"stt_idx_processed": 410,
"end_idx_processed": 418
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.3708,
"NAME": 0.907,
"NAME_FAMILY": 0.5271
}
}
],
"entities_present": true,
"characters_processed": 400,
"languages_detected": {
"en": 0.9167966246604919
}
}
]
Notice how Nessa Jonsson and the spelled-out mention N-E-S-S-A J-O-N-S-S-O-N have been replaced with the same marker, NAME_1. Similarly, the two mentions of Icarus have been replaced with ORGANIZATION_2. When creating markers with the default settings, the de-identification service assigns a new marker index to each entity unless the entity was previously seen in the text. If an entity is mentioned more than once, the service will do its best to assign the same marker to every mention. Read more about keeping the relationship between entities in the coreference resolution section below.
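To see which mentions the service linked together, the entities of a response can be grouped by the marker that replaced them. The sketch below uses a hypothetical helper, `mentions_by_marker`, and a few entities copied from the response above, trimmed to the two fields it needs.

```python
from collections import defaultdict
from typing import Dict, List


def mentions_by_marker(entities: List[Dict]) -> Dict[str, List[str]]:
    """Group original mentions by the marker that replaced them."""
    groups = defaultdict(list)
    for entity in entities:
        groups[entity["processed_text"]].append(entity["text"])
    return dict(groups)


entities = [
    {"processed_text": "ORGANIZATION_1", "text": "Icarus Airways Customer Service"},
    {"processed_text": "ORGANIZATION_2", "text": "Icarus"},
    {"processed_text": "ORGANIZATION_2", "text": "Icarus"},
    {"processed_text": "NAME_1", "text": "Nessa Jonsson"},
    {"processed_text": "NAME_1", "text": "N-E-S-S-A J-O-N-S-S-O-N"},
]

for marker, mentions in mentions_by_marker(entities).items():
    print(marker, mentions)
```

Any marker that maps to more than one distinct mention, such as NAME_1 here, is a case where differently written mentions were resolved to the same entity.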
You can also create your own marker by providing a format (the pattern field) containing one of the marker keywords below (e.g., [ENTITY_TYPE]):
Marker keywords | Description |
---|---|
ENTITY_TYPE | Replace the entity with the type that best describes the entity (e.g., John -> NAME_GIVEN) |
ALL_ENTITY_TYPES | Replace the entity with all the labels that apply (e.g., John -> NAME_GIVEN,NAME) |
UNIQUE_NUMBERED_ENTITY_TYPE (default) | Replace the entity with the type that best describes the entity and append a number so that different entities have different markers (e.g., John -> NAME_GIVEN_1, Mary -> NAME_GIVEN_2) |
UNIQUE_HASHED_ENTITY_TYPE | Similar to UNIQUE_NUMBERED_ENTITY_TYPE, except that the service appends a hash value instead of a sequential integer to make the markers unique. |
Private AI will replace detected entities with the provided format, using the above keywords as a format specification. Here are some examples of custom markers with the configuration that generated them.
Replacing entities with a list of all applicable entity types:
"Hi! This is <ORGANIZATION>. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my <ORGANIZATION> Frequent Voyager points. I checked my account today. I had 5,000 points in <DATE_INTERVAL>. Now it’s just 3,500. I haven’t flown in <ORGANIZATION> since <DATE_INTERVAL> because of <CONDITION>. What happened? May I have your name and account number, ma’am? <NAME,NAME_FAMILY,NAME_GIVEN>, <NAME,NAME_FAMILY,NAME_GIVEN>."
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"processed_text": {
"type": "MARKER",
"pattern": "<ALL_ENTITY_TYPES>"
}
}
Replacing entities with unique hashed markers:
"Hi! This is -->ORGANIZATION_6SjIF<--. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my -->ORGANIZATION_5AP6v<-- Frequent Voyager points. I checked my account today. I had 5,000 points in -->DATE_INTERVAL_m8wzC<--. Now it’s just 3,500. I haven’t flown in -->ORGANIZATION_5AP6v<-- since -->DATE_INTERVAL_m8wzC<-- because of -->CONDITION_FCj0D<--. What happened? May I have your name and account number, ma’am? -->NAME_HBtjJ<--, -->NAME_HBtjJ<--."
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"processed_text": {
"type": "MARKER",
"pattern": "-->UNIQUE_HASHED_ENTITY_TYPE<--"
}
}
These are only a few examples of ways you can customize the markers in your de-identified text.
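To make the role of the keyword inside a pattern concrete, here is an illustrative local expansion of a pattern for a single entity. This is a sketch of the idea, not the service's implementation: the function name `render_marker`, the `suffix` argument and the alphabetical ordering of labels for ALL_ENTITY_TYPES are all assumptions, and the keyword names are the ones used in the request examples in this guide.

```python
import re
from typing import Dict

# A single regex pass picks the leftmost keyword at each position, so
# ENTITY_TYPE is never matched inside the longer keywords that end with it
# (which chained str.replace calls could get wrong).
_KEYWORDS = re.compile(
    r"UNIQUE_NUMBERED_ENTITY_TYPE|UNIQUE_HASHED_ENTITY_TYPE|ALL_ENTITY_TYPES|ENTITY_TYPE"
)


def render_marker(pattern: str, best_label: str, labels: Dict[str, float], suffix: str) -> str:
    """Expand the marker keyword in `pattern` for one entity (illustrative only)."""
    values = {
        "ENTITY_TYPE": best_label,
        # Assumption: labels are joined in alphabetical order.
        "ALL_ENTITY_TYPES": ",".join(sorted(labels)),
        # `suffix` stands in for the number or hash chosen by the service.
        "UNIQUE_NUMBERED_ENTITY_TYPE": f"{best_label}_{suffix}",
        "UNIQUE_HASHED_ENTITY_TYPE": f"{best_label}_{suffix}",
    }
    return _KEYWORDS.sub(lambda match: values[match.group(0)], pattern)


labels = {"NAME": 0.903, "NAME_GIVEN": 0.3583, "NAME_FAMILY": 0.5411}
print(render_marker("<ALL_ENTITY_TYPES>", "NAME", labels, "1"))
print(render_marker("-->UNIQUE_HASHED_ENTITY_TYPE<--", "NAME", labels, "HBtjJ"))
```

Everything in the pattern other than the keyword (brackets, arrows, and so on) is copied through unchanged, which is why the surrounding characters can be chosen freely.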
What is coreference resolution
Coreference resolution is a natural language processing (NLP) task that consists of locating and associating different mentions of real-world entities in unstructured text.
Let's look at an example.
Hi! I'm Michael but call me Mike please.
In the short text above, the same individual is mentioned twice: once as Michael and once as Mike. Because these two mentions refer to the same person, we say that they are coreferential. Coreference resolution does not only address the problem of associating different mentions of an individual, but can also be used to identify mentions of organizations, locations, and so on. The coreference resolution task is often extended to also include pronouns (e.g., "he" and "she") and other nominal forms (e.g., "the PM of Canada").
Coreference resolution has proven to be very useful in many other NLP tasks, including question answering, sentiment analysis, and document summarization, to name a few. If you are considering using redacted text as input to an existing model (e.g., an LLM) or to train a model for another NLP task, you should consider how coreference resolution might enhance those tasks.
Private AI and coreference resolution (new in 4.0)
Private AI's de-identification service offers the ability to use coreference resolution on its process/text endpoint. The current implementation of coreference resolution is built on top of the named entity recognition (NER) model. It is, therefore, limited to returning coreference between entities only.
Private AI offers three different methods of performing coreference resolution. Whether you need to feed the redacted text to an ML model or simply make it easier to identify the different mentions of entities, you can benefit from Private AI's coreference resolution support.
Here is an example of how to set coreference resolution in your de-identification request using the process/text endpoint. Notice the coreference_resolution field, which is part of the processed_text object.
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"processed_text": {
"type": "MARKER",
"pattern": "[UNIQUE_NUMBERED_ENTITY_TYPE]",
"coreference_resolution": "model_prediction"
}
}
The coreference_resolution field can take one of three values: heuristics, model_prediction or combined. Note that coreference resolution is enabled whenever a unique marker is used (i.e., UNIQUE_NUMBERED_ENTITY_TYPE or UNIQUE_HASHED_ENTITY_TYPE). By default, the heuristics mode will be enabled.
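Since the field accepts only three values, it can be convenient to validate the mode before sending a request. A minimal sketch, with a hypothetical helper name `build_marker_request` and the field names taken from the request above:

```python
from typing import Dict, List

COREFERENCE_MODES = ("heuristics", "model_prediction", "combined")


def build_marker_request(
    texts: List[str],
    coreference_resolution: str = "heuristics",
    pattern: str = "[UNIQUE_NUMBERED_ENTITY_TYPE]",
) -> Dict:
    """Build a MARKER request body with an explicit coreference mode."""
    if coreference_resolution not in COREFERENCE_MODES:
        raise ValueError(
            f"coreference_resolution must be one of {COREFERENCE_MODES}, "
            f"got {coreference_resolution!r}"
        )
    return {
        "text": texts,
        "processed_text": {
            "type": "MARKER",
            "pattern": pattern,
            "coreference_resolution": coreference_resolution,
        },
    }


body = build_marker_request(["Hi! I'm Michael but call me Mike please."], "combined")
print(body["processed_text"])
```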
The following sections describe each of these options.
Heuristics
This method of coreference resolution is based solely on string matching. It is therefore only capable of linking entities that are mentioned in the same way. For example, the entities Mary and mary will be linked together because the strings match, except for a small difference in casing.
However, the entities John A. Smith and Mr Smith will not be linked together in this mode even if they refer to the same person. As a consequence, the two entities will be assigned different unique markers (e.g., NAME_1 and NAME_2) in the redacted text.
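The string-matching idea can be illustrated locally with a few lines of Python. This is only a sketch of the principle, assuming case-insensitive exact matching; it is not the service's implementation, whose normalization rules are not described in this guide. The function name `number_mentions` is hypothetical.

```python
from typing import Dict, List, Tuple


def number_mentions(mentions: List[Tuple[str, str]]) -> List[str]:
    """Assign LABEL_N markers, reusing N when the text matches ignoring case."""
    seen: Dict[Tuple[str, str], int] = {}
    counters: Dict[str, int] = {}
    markers = []
    for label, text in mentions:
        key = (label, text.casefold())
        if key not in seen:
            # First time this string is seen for this label: take the next index.
            counters[label] = counters.get(label, 0) + 1
            seen[key] = counters[label]
        markers.append(f"{label}_{seen[key]}")
    return markers


print(number_mentions([
    ("NAME", "Mary"),
    ("NAME", "mary"),
    ("NAME", "John A. Smith"),
    ("NAME", "Mr Smith"),
]))
```

Mary and mary share a marker, while John A. Smith and Mr Smith receive different ones because nothing in the strings alone ties them together.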
Here is the output of the example above using the heuristics mode:
"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since [DATE_INTERVAL_1] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"processed_text": {
"type": "MARKER",
"pattern": "[UNIQUE_NUMBERED_ENTITY_TYPE]",
"coreference_resolution": "heuristics"
}
}
Notice how only two of the three mentions of the Icarus organization were linked together (i.e., using the same ORGANIZATION_2 marker). The first mention is too different from the two other mentions for it to be resolved using heuristics. However, the two mentions of the person, Nessa Jonsson and the spelled-out mention N-E-S-S-A J-O-N-S-S-O-N, were correctly linked despite the differences between the strings.
While the heuristics mode has its limitations, it is great when a more predictable output is required (e.g., all exact mentions of an entity will be linked together in a text, no matter how long or difficult the text is). This option is also the fastest one and all entity types are supported.
The heuristics mode is currently the default one in the process/text endpoint.
Model prediction (new in 4.0)
The model_prediction option was introduced by Private AI to work around some of the limitations of the heuristics mode of resolution. This option uses a neural network model to resolve coreferences. It is capable of resolving mentions that have different spellings or even mentions containing typos.
"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_1] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_1] since [DATE_INTERVAL_2] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_2]."
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"processed_text": {
"type": "MARKER",
"pattern": "[UNIQUE_NUMBERED_ENTITY_TYPE]",
"coreference_resolution": "model_prediction"
}
}
In the above request, only the value of the coreference_resolution field has changed. As you can see, the model_prediction mode is capable of linking Icarus Airways Customer Service with Icarus, leading to a redacted text in which all mentions of the Icarus organization are replaced with the same marker (i.e., ORGANIZATION_1). However, the model was unable to link the person's name Nessa Jonsson with the spelled-out form.
The model_prediction option is great if you are dealing with entity mentions that may contain variations or typos. The model_prediction option currently only resolves NAME and ORGANIZATION entities in English text. Note also that this option is much slower than the heuristics one. It is not recommended for text samples that contain more than a few hundred words.
Combined (new in 4.0)
As its name suggests, this option combines the two other coreference resolution modes.
"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_1] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_1] since [DATE_INTERVAL_1] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"processed_text": {
"type": "MARKER",
"pattern": "[UNIQUE_NUMBERED_ENTITY_TYPE]",
"coreference_resolution": "combined"
}
}
By leveraging the two approaches for finding coreferential mentions, the combined mode is able to resolve all coreferential mentions in the above example. If you plan to process only English text and want the best coverage to identify and resolve coreferential mentions of all types, then this mode is for you!
Note that the combined mode suffers from the same limitations as the model_prediction mode. It is much slower than the heuristics mode alone and it is not recommended when processing large volumes of text (e.g., several thousand words).
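The guidance in the last three sections can be condensed into a small selection rule. The sketch below is one possible reading, not an official recommendation: the function name `choose_coreference_mode` and the 300-word threshold (a stand-in for "a few hundred words") are assumptions, and the language code follows the keys of languages_detected in the responses.

```python
def choose_coreference_mode(text: str, language: str, max_words: int = 300) -> str:
    """Pick a coreference mode from the text length and language."""
    word_count = len(text.split())
    # The model-based modes are described as English-only and slow on long
    # inputs, so fall back to heuristics in either case.
    if language != "en" or word_count > max_words:
        return "heuristics"
    return "combined"


print(choose_coreference_mode("Hi! I'm Michael but call me Mike please.", "en"))
print(choose_coreference_mode("Je suis ton periple avec beaucoup d'envie.", "fr"))
```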
Coreference resolution across multiple requests
Note that coreference resolution is only performed within a single request. Identical entities across different requests will usually be assigned different markers. If you are processing related text fragments, you may consider passing them as a batch in a single request and setting link_batch to True. This will allow the de-identification service to link entities across these fragments.
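A batched request of this kind could be assembled as follows. This is a sketch under an assumption: this guide does not show where link_batch sits in the request, so it is placed at the top level of the body here, and the route documentation should be checked for the exact location. The helper name `build_linked_batch_request` is hypothetical.

```python
import json
from typing import Dict, List


def build_linked_batch_request(fragments: List[str]) -> Dict:
    """Build one request carrying several related fragments to be linked."""
    return {
        "text": list(fragments),
        # Assumption: link_batch is a top-level field of the request body.
        "link_batch": True,
        "processed_text": {
            "type": "MARKER",
            "pattern": "[UNIQUE_NUMBERED_ENTITY_TYPE]",
        },
    }


body = build_linked_batch_request([
    "May I have your name, ma'am?",
    "Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N.",
])
# Python's True is serialized as the JSON literal true.
print(json.dumps(body, indent=2))
```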
Synthetic PII
You may choose to replace the entities in your text with fake or synthetic entities instead of markers and masks. There are a few reasons to do so. For example, if you train an AI model on your data, synthetic replacements might provide a more realistic input to train your model.
Generating synthetic PII is done by setting processed_text.type to SYNTHETIC.
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"processed_text": {
"type": "SYNTHETIC"
}
}
The synthetic text output will be similar to:
"Hi! This is United Federal Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my United Federal Customer Service Frequent Voyager points. I checked my account today. I had 5,000 points in 2012. Now it’s just 3,500. I haven’t flown in United Federal Customer Service since 2012 because of COVID. What happened? May I have your name and account number, ma’am? Maria Carlotta, Maria Carlotta."
[
{
"processed_text": "Hi! This is United Federal Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my United Federal Customer Service Frequent Voyager points. I checked my account today. I had 5,000 points in 2012. Now it’s just 3,500. I haven’t flown in United Federal Customer Service since 2012 because of COVID. What happened? May I have your name and account number, ma’am? Maria Carlotta, Maria Carlotta.",
"entities": [
{
"processed_text": "United Federal Customer Service",
"text": "Icarus Airways Customer Service",
"location": {
"stt_idx": 12,
"end_idx": 43,
"stt_idx_processed": 12,
"end_idx_processed": 43
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8451
}
},
{
"processed_text": "United Federal Customer Service",
"text": "Icarus",
"location": {
"stt_idx": 134,
"end_idx": 140,
"stt_idx_processed": 134,
"end_idx_processed": 165
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.7969
}
},
{
"processed_text": "2012",
"text": "2019",
"location": {
"stt_idx": 216,
"end_idx": 220,
"stt_idx_processed": 241,
"end_idx_processed": 245
},
"best_label": "DATE_INTERVAL",
"labels": {
"DATE_INTERVAL": 0.9384
}
},
{
"processed_text": "United Federal Customer Service",
"text": "Icarus",
"location": {
"stt_idx": 262,
"end_idx": 268,
"stt_idx_processed": 287,
"end_idx_processed": 318
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8285
}
},
{
"processed_text": "2012",
"text": "2019",
"location": {
"stt_idx": 275,
"end_idx": 279,
"stt_idx_processed": 325,
"end_idx_processed": 329
},
"best_label": "DATE_INTERVAL",
"labels": {
"DATE_INTERVAL": 0.9393
}
},
{
"processed_text": "COVID",
"text": "COVID",
"location": {
"stt_idx": 291,
"end_idx": 296,
"stt_idx_processed": 341,
"end_idx_processed": 346
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 0.9327
}
},
{
"processed_text": "Maria Carlotta",
"text": "Nessa Jonsson",
"location": {
"stt_idx": 361,
"end_idx": 374,
"stt_idx_processed": 411,
"end_idx_processed": 425
},
"best_label": "NAME",
"labels": {
"NAME": 0.903,
"NAME_GIVEN": 0.3583,
"NAME_FAMILY": 0.5411
}
},
{
"processed_text": "Maria Carlotta",
"text": "N-E-S-S-A J-O-N-S-S-O-N",
"location": {
"stt_idx": 376,
"end_idx": 399,
"stt_idx_processed": 427,
"end_idx_processed": 441
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.3708,
"NAME": 0.907,
"NAME_FAMILY": 0.5271
}
}
],
"entities_present": true,
"characters_processed": 400,
"languages_detected": {
"en": 0.9167966246604919
}
}
]
Note how the PII has been replaced with similar-looking fake entities. Keep in mind that synthetic data generation is non-deterministic, so the same request may return a different response each time.
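A generated value can also coincide with the original: in the sample response above, COVID is replaced with COVID. A minimal sketch for flagging such cases follows; the helper name `unchanged_entities` is hypothetical, and the entities are copied from the response above and trimmed to the fields used.

```python
from typing import Dict, List


def unchanged_entities(entities: List[Dict]) -> List[str]:
    """Return the mentions whose synthetic replacement equals the original."""
    return [e["text"] for e in entities if e["processed_text"] == e["text"]]


entities = [
    {"processed_text": "United Federal Customer Service", "text": "Icarus"},
    {"processed_text": "2012", "text": "2019"},
    {"processed_text": "COVID", "text": "COVID"},
    {"processed_text": "Maria Carlotta", "text": "Nessa Jonsson"},
]

print(unchanged_entities(entities))
```

What to do with flagged entities (for example, retrying the request or falling back to a marker) depends on your use case.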
You can optionally configure the language in which the text is generated using the synthetic_entity_accuracy field. For English generation, set this parameter to standard for best results. For other languages, set it to standard_multilingual and the synthetic model will attempt to predict entities matching the input text language. The default accuracy is standard_automatic, which will determine the appropriate model (i.e., standard or standard_multilingual) from the input language.
{
"text": [
"Publié le 03/01/2017 de la baie de Vaitupa, Polynésie française, GPS 17 34.06 S 149 37.1 W\n Nous nous sommes probablement rencontrés chez Yan Labrosse … il ya longtemps. Je suis ton periple avec … beaucoup d’envie!! Yan m’a dit que tu comptais rejoindre l’indonésie."
],
"processed_text": {
"type": "SYNTHETIC",
"synthetic_entity_accuracy": "standard_multilingual"
}
}
The response shows how the entities were replaced with French-language values, including the location and the country.
"Publié le 31/08/2014 de la gare de Memphis, Tennessee américain, GPS 20 9sn.10 N ec 15.5 S\n Nous nous sommes probablement rencontrés chez Max Fontaine … il ya longtemps. Je suis ton periple avec … beaucoup d’envie!! Ben m’a dit que tu comptais rejoindre l’Argentine."
[
{
"processed_text": "Publié le 31/08/2014 de la gare de Memphis, Tennessee américain, GPS 20 9sn.10 N ec 15.5 S\n Nous nous sommes probablement rencontrés chez Max Fontaine … il ya longtemps. Je suis ton periple avec … beaucoup d’envie!! Ben m’a dit que tu comptais rejoindre l’Argentine.",
"entities": [
{
"processed_text": "31/08/2014",
"text": "03/01/2017",
"location": {
"stt_idx": 10,
"end_idx": 20,
"stt_idx_processed": 10,
"end_idx_processed": 20
},
"best_label": "DATE",
"labels": {
"DATE": 0.9961
}
},
{
"processed_text": "gare de Memphis, Tennessee américain",
"text": "baie de Vaitupa, Polynésie française",
"location": {
"stt_idx": 27,
"end_idx": 63,
"stt_idx_processed": 27,
"end_idx_processed": 63
},
"best_label": "LOCATION",
"labels": {
"LOCATION": 0.8305,
"LOCATION_CITY": 0.0939,
"LOCATION_STATE": 0.0878,
"ORIGIN": 0.0745
}
},
{
"processed_text": "20 9sn.10 N ec 15.5 S",
"text": "17 34.06 S 149 37.1 W",
"location": {
"stt_idx": 69,
"end_idx": 90,
"stt_idx_processed": 69,
"end_idx_processed": 92
},
"best_label": "LOCATION_COORDINATE",
"labels": {
"LOCATION_COORDINATE": 0.989,
"LOCATION": 0.9506
}
},
{
"processed_text": "Max Fontaine",
"text": "Yan Labrosse",
"location": {
"stt_idx": 138,
"end_idx": 150,
"stt_idx_processed": 140,
"end_idx_processed": 152
},
"best_label": "NAME",
"labels": {
"NAME": 0.9953,
"NAME_GIVEN": 0.2486,
"NAME_FAMILY": 0.7461
}
},
{
"processed_text": "Ben",
"text": "Yan",
"location": {
"stt_idx": 216,
"end_idx": 219,
"stt_idx_processed": 218,
"end_idx_processed": 221
},
"best_label": "NAME_GIVEN",
"labels": {
"NAME": 0.9941,
"NAME_GIVEN": 0.9915
}
},
{
"processed_text": "l’Argentine",
"text": "l’indonésie",
"location": {
"stt_idx": 254,
"end_idx": 265,
"stt_idx_processed": 256,
"end_idx_processed": 267
},
"best_label": "LOCATION_COUNTRY",
"labels": {
"LOCATION": 0.9858,
"LOCATION_COUNTRY": 0.9682
}
}
],
"entities_present": true,
"characters_processed": 266,
"languages_detected": {
"fr": 0.9757143259048462
}
}
]
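To inspect which original value was replaced by which synthetic value, the entities list of the response can be walked directly. The sketch below assumes the response shape shown above (text, processed_text and best_label on each entity); replacement_pairs is a hypothetical helper, and sample_response is a trimmed-down copy of the response keeping only the fields used here.

```python
def replacement_pairs(response):
    """Yield (label, original, synthetic) for every entity in a Process Text response."""
    for item in response:
        for entity in item["entities"]:
            yield entity["best_label"], entity["text"], entity["processed_text"]


# Trimmed-down copy of the response above, keeping only the fields used here.
sample_response = [
    {
        "entities": [
            {"processed_text": "Max Fontaine", "text": "Yan Labrosse", "best_label": "NAME"},
            {"processed_text": "Ben", "text": "Yan", "best_label": "NAME_GIVEN"},
        ],
    }
]

pairs = list(replacement_pairs(sample_response))
for label, original, synthetic in pairs:
    print(f"{label}: {original!r} -> {synthetic!r}")
```

In the response above, the two mentions of Yan received different replacements (Max Fontaine and Ben), which a listing like this makes easy to spot.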
See the Process Text route documentation for additional configuration options for synthetic data generation.
Custom redaction using the NER Text route
As we have seen above, the Process Text route offers a lot of flexibility in how text and files are redacted.
If you have a specific use case that is not completely covered by the API, you can create your own custom redaction function. This section shows how the NER Text route, introduced in 3.9, can be used to create a custom redaction function with more "fine-grained" labels.
Process Text route redaction
Let's say that you want to redact this fragment of text:
"ERIC G. BADORREK was born in 1960 and registered to vote on 10 February 2012, giving the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. BADORREK is registered to vote in the Republican Party. Voter ID number: 100917654"
Using the Process Text route, the redacted content will look like:
"[NAME] was born in [DOB] and registered to vote on [DATE], giving the address [LOCATION_ADDRESS]. [NAME_FAMILY] is registered to vote in the [ORGANIZATION]. Voter ID number: [ACCOUNT_NUMBER]"
Notice how all the parts of the name ERIC G. BADORREK, including the first name, initial and last name, were combined into a single NAME marker. This grouping of words into a single marker is even more apparent on the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. which is redacted as a single LOCATION_ADDRESS label. This makes the redacted content more readable, but it hides some information that may be useful for your use case. For example, you might want to know whether the provided address contained a zip code or a country, which is impossible to determine from the current redacted output.
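The loss of detail can be checked programmatically by listing the markers present in the redacted output. This is a small sketch that assumes markers are upper-case labels wrapped in square brackets, as in the example above.

```python
import re

redacted = (
    "[NAME] was born in [DOB] and registered to vote on [DATE], giving the "
    "address [LOCATION_ADDRESS]. [NAME_FAMILY] is registered to vote in the "
    "[ORGANIZATION]. Voter ID number: [ACCOUNT_NUMBER]"
)

# Markers are assumed to be upper-case labels wrapped in square brackets.
markers = re.findall(r"\[([A-Z_]+)\]", redacted)
print(markers)

# The only location marker is the coarse LOCATION_ADDRESS, so nothing in the
# output says whether the address included a country.
print("LOCATION_COUNTRY" in markers)
```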
Using the NER Text route to create your own redacted content
Unlike the Process Text route, the NER Text route does not provide a redacted output. However, the entities it returns can be used to create one. Let's see how.
Consider the following code, which processes the same sample text, this time with the NER Text route.
import requests
from itertools import groupby

text = "ERIC G. BADORREK was born in 1960 and registered to vote on 10 February 2012, giving the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. BADORREK is registered to vote in the Republican Party. Voter ID number: 100917654"
request = {
    "text": [text]
}
# TODO: update this URL to point to your local instance of Private AI or to one of the Private AI cloud APIs.
resp = requests.post("http://localhost:8080/ner/text", json=request).json()
# Sort the entities by start index, with the longest spans first when they overlap
entities = sorted(resp[0]["entities"], key=lambda e: (e["location"]["stt_idx"], -e["location"]["end_idx"], len(e["label"])))

class NotComparable(str):
    """Wraps a character so that two instances never compare equal, even when they hold the same value (e.g., "e")"""
    def __eq__(self, other):
        return self is other
    def __ne__(self, other):
        return self is not other
    __hash__ = str.__hash__

# Start from one chunk per character, then overwrite every character covered by an entity with its marker
redacted_chunks = [NotComparable(c) for c in text]
for entity in entities:
    start = entity["location"]["stt_idx"]
    end = entity["location"]["end_idx"]
    redacted_chunks[start:end] = [f"""[{entity["label"]}]"""] * (end - start)
# Collapse each run of identical markers into a single marker
print("".join(key for key, _ in groupby(redacted_chunks)))
We first make a request to the NER Text route endpoint, passing the text to analyse. Then we extract and sort the entities from the response.
# TODO: update this URL to point to your local instance of Private AI or to one of the Private AI cloud APIs.
resp = requests.post("http://localhost:8080/ner/text", json=request).json()
# sort the entities so that entities with longest spans are first
entities = sorted(resp[0]["entities"], key=lambda e: (e["location"]["stt_idx"], -e["location"]["end_idx"], len(e["label"])))
We are going to use these entities to create a redacted text containing more details about the original text (e.g., whether an address contained a COUNTRY). Because we want to show "fine-grained" entities (i.e., those with smaller spans) over "coarser" entities, we sort the overlapping entities from the longest to the shortest. The redacted text is then constructed by iterating over the list of sorted entities, so that shorter spans overwrite the longer spans they overlap.
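The effect of the sort key can be seen on a handful of entities. The sketch below uses hypothetical entities shaped like the ones in the code above (a label plus a location with stt_idx and end_idx), covering the name ERIC G. BADORREK.

```python
# Hypothetical entities, shaped like the ones used in this guide.
entities = [
    {"label": "NAME_FAMILY", "location": {"stt_idx": 8, "end_idx": 16}},
    {"label": "NAME_GIVEN", "location": {"stt_idx": 0, "end_idx": 4}},
    {"label": "NAME", "location": {"stt_idx": 0, "end_idx": 16}},
]


def sort_key(e):
    # Earlier start first; for equal starts, the longer span first;
    # for identical spans, the shorter label first.
    return (e["location"]["stt_idx"], -e["location"]["end_idx"], len(e["label"]))


ordered = [e["label"] for e in sorted(entities, key=sort_key)]
print(ordered)  # → ['NAME', 'NAME_GIVEN', 'NAME_FAMILY']
```

The coarse NAME entity comes first, so the finer NAME_GIVEN and NAME_FAMILY spans are applied afterwards and overwrite the characters they cover.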
To do so, it is easier to turn the input text into a list of characters. This allows us to more easily replace the sensitive content (i.e., the characters covered by an entity span) with a redaction marker. The following code converts the input text to a list of characters and then replaces each character that is part of an entity with the entity label. A small utility, NotComparable, is created to ensure that two instances never compare equal, even when they hold the same character (i.e., NotComparable("e") != NotComparable("e")). This will be useful when outputting the redacted text.
class NotComparable(str):
    """Wraps a character so that two instances never compare equal, even when they hold the same value (e.g., "e")"""
    def __eq__(self, other):
        return self is other
    def __ne__(self, other):
        return self is not other
    __hash__ = str.__hash__

redacted_chunks = [NotComparable(c) for c in text]
for entity in entities:
    start = entity["location"]["stt_idx"]
    end = entity["location"]["end_idx"]
    redacted_chunks[start:end] = [f"""[{entity["label"]}]"""] * (end - start)
The last step is simply to join all the characters of the original text and the redaction markers into the redacted content. Since a marker is repeated for each character of the entity it replaces, we use the groupby function to output it only once. This is where the NotComparable utility plays its role, by preventing consecutive identical characters (e.g., the two R in BADORREK) from being grouped together.
print("".join(key for key, _ in groupby(redacted_chunks)))
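The difference made by the wrapper can be seen in isolation. The sketch below uses a hypothetical IdentityChar class that compares by identity; it illustrates the idea and is not necessarily how the utility has to be written.

```python
from itertools import groupby


class IdentityChar(str):
    """A character that only compares equal to itself (hypothetical helper)."""

    def __eq__(self, other):
        return self is other

    def __ne__(self, other):
        return self is not other

    __hash__ = str.__hash__


# Plain characters: groupby merges the doubled letters.
collapsed = "".join(key for key, _ in groupby("address"))

# Wrapped characters: every character forms its own group.
preserved = "".join(key for key, _ in groupby(IdentityChar(c) for c in "address"))

# Repeated plain-string markers still collapse into one, as intended.
chunks = ["[NAME]"] * 4 + [IdentityChar(c) for c in " is registered"]
redacted = "".join(key for key, _ in groupby(chunks))

print(collapsed)
print(preserved)
print(redacted)
```

Only the markers are plain strings, so only they are merged, while the remaining text is emitted character by character.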
The result is a redacted text with all the necessary details.
"[NAME_GIVEN][NAME][NAME_FAMILY] was born in [DOB] and registered to vote on [DATE], giving the address [LOCATION_ADDRESS_STREET][LOCATION_ADDRESS][LOCATION_CITY][LOCATION_ADDRESS][LOCATION_STATE][LOCATION_ADDRESS][LOCATION_COUNTRY]. [NAME_FAMILY] is registered to vote in the [POLITICAL_AFFILIATION]. Voter ID number: [ACCOUNT_NUMBER]"
See how the name ERIC G. BADORREK has been replaced with [NAME_GIVEN][NAME][NAME_FAMILY] instead of a single NAME marker, and how the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. was redacted with much more detail: [LOCATION_ADDRESS_STREET][LOCATION_ADDRESS][LOCATION_CITY][LOCATION_ADDRESS][LOCATION_STATE][LOCATION_ADDRESS][LOCATION_COUNTRY]. From the above redacted text, it becomes clear that the original address contained a city, a state and a country but no zip code.
A parting note about privacy
You may wonder if the redacted results achieved in this section could have been obtained in a simpler way by disabling the NAME, LOCATION and LOCATION_ADDRESS entity types when making the request. While disabling entity types has its uses, the technique described above has the advantage of lowering the chances of leaking sensitive data.
Consider for example the words Sussex County, which are part of the provided address. These words are part of the LOCATION_ADDRESS entity but not part of any of its sub-entities. As a result, they would be left unredacted if both LOCATION and LOCATION_ADDRESS were disabled. The same applies to many other entities, like Mount Everest, which is a LOCATION but does not match any other location sub-entity. By disabling the LOCATION label, we leave these entities unredacted. This might not be desirable for some use cases.
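The leak can be reproduced with a small simulation. The sketch below uses hypothetical entity spans over the sample address (a real NER Text response may split the address differently) and a simplified redact helper that skips the disabled labels.

```python
address = "35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A."


def span(label, fragment):
    """Build a hypothetical entity covering the first occurrence of fragment."""
    start = address.index(fragment)
    return {"label": label, "stt_idx": start, "end_idx": start + len(fragment)}


# Illustrative spans only, chosen to mirror the redacted output shown earlier.
entities = [
    span("LOCATION_ADDRESS", address),
    span("LOCATION_ADDRESS_STREET", "35933 COLLINS LN, FENWICK WEST"),
    span("LOCATION_CITY", "SELBYVILLE"),
    span("LOCATION_STATE", "Delaware"),
    span("LOCATION_COUNTRY", "U.S.A."),
]


def redact(text, entities, disabled=()):
    """Replace each enabled entity span with its marker, coarse spans first."""
    chunks = list(text)
    for e in sorted(entities, key=lambda e: (e["stt_idx"], -e["end_idx"])):
        if e["label"] in disabled:
            continue
        marker = f"[{e['label']}]"
        chunks[e["stt_idx"]:e["end_idx"]] = [marker] * (e["end_idx"] - e["stt_idx"])
    out, previous = [], None
    for chunk in chunks:
        # Markers are longer than one character; emit each run of them once.
        if not (len(chunk) > 1 and chunk == previous):
            out.append(chunk)
        previous = chunk
    return "".join(out)


full = redact(address, entities)
leaky = redact(address, entities, disabled={"LOCATION", "LOCATION_ADDRESS"})
print(full)
print(leaky)
```

With the coarse labels disabled, the words Sussex County appear in clear text, whereas keeping LOCATION_ADDRESS as a fallback covers them with a marker.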