Customizing Redaction

Info: Connect with one of our privacy experts to run this code.

The Private AI APIs offer a lot of flexibility when it comes to creating redacted or de-identified content. This guide introduces a few techniques to modify and extend the existing capabilities of the Private AI APIs. It is organized into four parts:

Part 1: Configuring a mask covers the basics of setting a mask to meet your preferences.
Part 2: Configuring a marker covers the basics of setting the marker format to meet your preferences.
Part 3: Using synthetic PII explains how to replace the original PII with synthetic values.
Part 4: Custom redaction using the NER Text route presents an approach to create a fully customized redacted output using the NER route.

The techniques described in the first two sections apply to most of the Private AI APIs. In particular, they can be used to customize the redaction in the Process Text route, the File URI route and the File Base64 route. Configuration is done through the shared processed_text object, which is part of the API request. See each route's documentation for details. For simplicity, the examples below use requests and responses from the Process Text route.

When redacting or de-identifying text, a customizable string pattern is used to replace the detected PII in the text. Private AI supports these replacement options:

MASK containing repeated characters up to the length of the replaced entity. This option provides a redacted text containing no information about the actual entities that were replaced.
MARKER containing the type of the entity being replaced. Markers can also be configured to link different mentions of the same entity in the redacted text (i.e., a name that appears twice in the text will have the same unique replacement marker).
SYNTHETIC text containing an AI-generated replacement for the original entity. This option provides a processed text that is very similar to the original input text except that sensitive PII has been replaced with fake values.
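
As a minimal sketch of how these three options map onto a request, the Python snippet below builds a request body whose processed_text.type is MASK, MARKER or SYNTHETIC, and wraps it in a POST request object without sending it. The helper names, the localhost address and the port are assumptions made for illustration; only the text and processed_text fields and the process/text path come from this guide. Authentication headers, if your deployment needs them, are left out.

```python
import json
import urllib.request

# Replacement options described in this guide.
REPLACEMENT_TYPES = {"MASK", "MARKER", "SYNTHETIC"}

# Hypothetical address of a Private AI deployment; adjust to your own setup.
PROCESS_TEXT_URL = "http://localhost:8080/process/text"


def build_payload(texts, replacement_type, **options):
    """Build a request body containing the shared processed_text object."""
    if replacement_type not in REPLACEMENT_TYPES:
        raise ValueError(f"unknown replacement type: {replacement_type!r}")
    processed_text = {"type": replacement_type}
    processed_text.update(options)  # e.g. mask_character or pattern
    return {"text": list(texts), "processed_text": processed_text}


def make_request(payload, url=PROCESS_TEXT_URL):
    """Wrap the payload in a POST request object (nothing is sent here)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = build_payload(["My name is Nessa Jonsson."], "MASK", mask_character="*")
request = make_request(payload)
print(json.dumps(payload, indent=2))
```

Passing the request object to urllib.request.urlopen would send it; that step is omitted here so the sketch runs without a live service.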

Configuring a mask

Masking is also known as hashing when the # character is used. Setting the mask option is as simple as setting the type MASK in the processed_text object.

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MASK"
    }
 }

This option will replace all names, organizations and other PII mentioned with the default mask character #. The redacted text will then look like:

Mask Default (redacted text):

"Hi! This is ###############################. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ###### Frequent Voyager points. I checked my account today. I had 5,000 points in ####. Now it’s just 3,500. I haven’t flown in ###### since #### because of #####. What happened? May I have your name and account number, ma’am? #############, #############."

Mask Default (full response):

[
  {
    "processed_text": "Hi! This is ###############################. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ###### Frequent Voyager points. I checked my account today. I had 5,000 points in ####. Now it’s just 3,500. I haven’t flown in ###### since #### because of #####. What happened? May I have your name and account number, ma’am? #############, #############.",
    "entities": [
      {
        "processed_text": "###############################",
        "text": "Icarus Airways Customer Service",
        "location": {
          "stt_idx": 12,
          "end_idx": 43,
          "stt_idx_processed": 12,
          "end_idx_processed": 43
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8451
        }
      },
      {
        "processed_text": "######",
        "text": "Icarus",
        "location": {
          "stt_idx": 134,
          "end_idx": 140,
          "stt_idx_processed": 134,
          "end_idx_processed": 140
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.7969
        }
      },
      {
        "processed_text": "####",
        "text": "2019",
        "location": {
          "stt_idx": 216,
          "end_idx": 220,
          "stt_idx_processed": 216,
          "end_idx_processed": 220
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9384
        }
      },
      {
        "processed_text": "######",
        "text": "Icarus",
        "location": {
          "stt_idx": 262,
          "end_idx": 268,
          "stt_idx_processed": 262,
          "end_idx_processed": 268
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8285
        }
      },
      {
        "processed_text": "####",
        "text": "2019",
        "location": {
          "stt_idx": 275,
          "end_idx": 279,
          "stt_idx_processed": 275,
          "end_idx_processed": 279
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9393
        }
      },
      {
        "processed_text": "#####",
        "text": "COVID",
        "location": {
          "stt_idx": 291,
          "end_idx": 296,
          "stt_idx_processed": 291,
          "end_idx_processed": 296
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9327
        }
      },
      {
        "processed_text": "#############",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 361,
          "end_idx_processed": 374
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.903,
          "NAME_GIVEN": 0.3583,
          "NAME_FAMILY": 0.5411
        }
      },
      {
        "processed_text": "#############",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 376,
          "end_idx_processed": 389
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3708,
          "NAME": 0.907,
          "NAME_FAMILY": 0.5271
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]

If you prefer a different masking character in your redacted text, you can specify it in the request.

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MASK",
        "mask_character": "■"
    }
 }

The above request will redact using the provided mask character:

Custom Mask (redacted text):

"Hi! This is ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ■■■■■■ Frequent Voyager points. I checked my account today. I had 5,000 points in ■■■■. Now it’s just 3,500. I haven’t flown in ■■■■■■ since ■■■■ because of ■■■■■. What happened? May I have your name and account number, ma’am? ■■■■■■■■■■■■■, ■■■■■■■■■■■■■."

Custom Mask (full response):

[
  {
    "processed_text": "Hi! This is ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ■■■■■■ Frequent Voyager points. I checked my account today. I had 5,000 points in ■■■■. Now it’s just 3,500. I haven’t flown in ■■■■■■ since ■■■■ because of ■■■■■. What happened? May I have your name and account number, ma’am? ■■■■■■■■■■■■■, ■■■■■■■■■■■■■.",
    "entities": [
      {
        "processed_text": "■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■",
        "text": "Icarus Airways Customer Service",
        "location": {
          "stt_idx": 12,
          "end_idx": 43,
          "stt_idx_processed": 12,
          "end_idx_processed": 43
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8451
        }
      },
      {
        "processed_text": "■■■■■■",
        "text": "Icarus",
        "location": {
          "stt_idx": 134,
          "end_idx": 140,
          "stt_idx_processed": 134,
          "end_idx_processed": 140
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.7969
        }
      },
      {
        "processed_text": "■■■■",
        "text": "2019",
        "location": {
          "stt_idx": 216,
          "end_idx": 220,
          "stt_idx_processed": 216,
          "end_idx_processed": 220
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9384
        }
      },
      {
        "processed_text": "■■■■■■",
        "text": "Icarus",
        "location": {
          "stt_idx": 262,
          "end_idx": 268,
          "stt_idx_processed": 262,
          "end_idx_processed": 268
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8285
        }
      },
      {
        "processed_text": "■■■■",
        "text": "2019",
        "location": {
          "stt_idx": 275,
          "end_idx": 279,
          "stt_idx_processed": 275,
          "end_idx_processed": 279
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9393
        }
      },
      {
        "processed_text": "■■■■■",
        "text": "COVID",
        "location": {
          "stt_idx": 291,
          "end_idx": 296,
          "stt_idx_processed": 291,
          "end_idx_processed": 296
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9327
        }
      },
      {
        "processed_text": "■■■■■■■■■■■■■",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 361,
          "end_idx_processed": 374
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.903,
          "NAME_GIVEN": 0.3583,
          "NAME_FAMILY": 0.5411
        }
      },
      {
        "processed_text": "■■■■■■■■■■■■■",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 376,
          "end_idx_processed": 389
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3708,
          "NAME": 0.907,
          "NAME_FAMILY": 0.5271
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]

In both cases, the redacted text contains no information about the replaced entities besides their length.
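
The location fields in the full response can be used to line up each entity with its replacement. The sketch below assumes that the offsets are half-open character ranges, as the sample values above suggest, and slices a shortened copy of the example by those offsets. The spans_for helper is hypothetical, and the response is abbreviated to its first entity.

```python
MASK = "#" * 31  # same length as "Icarus Airways Customer Service"

original = "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you?"
response = {
    "processed_text": f"Hi! This is {MASK}. How may I be of assistance to you?",
    "entities": [
        {
            "processed_text": MASK,
            "text": "Icarus Airways Customer Service",
            "location": {
                "stt_idx": 12,
                "end_idx": 43,
                "stt_idx_processed": 12,
                "end_idx_processed": 43,
            },
            "best_label": "ORGANIZATION",
        }
    ],
}


def spans_for(original_text, result):
    """Yield (label, original span, replacement span) for each entity."""
    processed = result["processed_text"]
    for entity in result["entities"]:
        loc = entity["location"]
        source = original_text[loc["stt_idx"]:loc["end_idx"]]
        replaced = processed[loc["stt_idx_processed"]:loc["end_idx_processed"]]
        yield entity["best_label"], source, replaced


for label, source, replaced in spans_for(original, response):
    print(f"{label}: {source!r} -> {replaced!r}")
```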

Configuring a marker

The marker option allows the redacted text to include the entity type. This may improve the readability of the redacted text. Setting the default marker option is as simple as setting the mask option.

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MARKER"
    }
 }

With the default marker settings, the response will contain markers with the entity type and a unique number.

Marker Default (redacted text):

"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since [DATE_INTERVAL_1] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."

Marker Default (full response):

[
 {
   "processed_text": "Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since [DATE_INTERVAL_1] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
   "entities": [
     {
       "processed_text": "ORGANIZATION_1",
       "text": "Icarus Airways Customer Service",
       "location": {
         "stt_idx": 12,
         "end_idx": 43,
         "stt_idx_processed": 12,
         "end_idx_processed": 28
       },
       "best_label": "ORGANIZATION",
       "labels": {
         "ORGANIZATION": 0.8451
       }
     },
     {
       "processed_text": "ORGANIZATION_2",
       "text": "Icarus",
       "location": {
         "stt_idx": 134,
         "end_idx": 140,
         "stt_idx_processed": 119,
         "end_idx_processed": 135
       },
       "best_label": "ORGANIZATION",
       "labels": {
         "ORGANIZATION": 0.7969
       }
     },
     {
       "processed_text": "DATE_INTERVAL_1",
       "text": "2019",
       "location": {
         "stt_idx": 216,
         "end_idx": 220,
         "stt_idx_processed": 211,
         "end_idx_processed": 228
       },
       "best_label": "DATE_INTERVAL",
       "labels": {
         "DATE_INTERVAL": 0.9384
       }
     },
     {
       "processed_text": "ORGANIZATION_2",
       "text": "Icarus",
       "location": {
         "stt_idx": 262,
         "end_idx": 268,
         "stt_idx_processed": 270,
         "end_idx_processed": 286
       },
       "best_label": "ORGANIZATION",
       "labels": {
         "ORGANIZATION": 0.8285
       }
     },
     {
       "processed_text": "DATE_INTERVAL_1",
       "text": "2019",
       "location": {
         "stt_idx": 275,
         "end_idx": 279,
         "stt_idx_processed": 293,
         "end_idx_processed": 310
       },
       "best_label": "DATE_INTERVAL",
       "labels": {
         "DATE_INTERVAL": 0.9393
       }
     },
     {
       "processed_text": "CONDITION_1",
       "text": "COVID",
       "location": {
         "stt_idx": 291,
         "end_idx": 296,
         "stt_idx_processed": 322,
         "end_idx_processed": 335
       },
       "best_label": "CONDITION",
       "labels": {
         "CONDITION": 0.9327
       }
     },
     {
       "processed_text": "NAME_1",
       "text": "Nessa Jonsson",
       "location": {
         "stt_idx": 361,
         "end_idx": 374,
         "stt_idx_processed": 400,
         "end_idx_processed": 408
       },
       "best_label": "NAME",
       "labels": {
         "NAME": 0.903,
         "NAME_GIVEN": 0.3583,
         "NAME_FAMILY": 0.5411
       }
     },
     {
       "processed_text": "NAME_1",
       "text": "N-E-S-S-A J-O-N-S-S-O-N",
       "location": {
         "stt_idx": 376,
         "end_idx": 399,
         "stt_idx_processed": 410,
         "end_idx_processed": 418
       },
       "best_label": "NAME",
       "labels": {
         "NAME_GIVEN": 0.3708,
         "NAME": 0.907,
         "NAME_FAMILY": 0.5271
       }
     }
   ],
   "entities_present": true,
   "characters_processed": 400,
   "languages_detected": {
     "en": 0.9167966246604919
   }
 }
]

Notice how Nessa Jonsson and the spelled-out mention N-E-S-S-A J-O-N-S-S-O-N have been replaced with the same marker, NAME_1. Similarly, two mentions of Icarus have been replaced with ORGANIZATION_2. With the default settings, the de-identification service assigns a new marker index to each entity unless that entity has already appeared in the text. If an entity is mentioned more than once, the service does its best to assign the same marker to every mention. Read more about preserving the relationship between entities in the coreference resolution section below.
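
Because linked mentions share a marker, the redacted text alone is enough to count how often each entity is mentioned. The following sketch assumes the default [LABEL_N] marker shape shown above; the regular expression and the count_markers helper are illustrative and not part of the API.

```python
import re
from collections import Counter

# Matches default-style markers such as [NAME_1] or [DATE_INTERVAL_2].
MARKER_RE = re.compile(r"\[([A-Z_]+)_(\d+)\]")


def count_markers(redacted_text):
    """Count mentions per marker, e.g. {"NAME_1": 2}."""
    found = MARKER_RE.findall(redacted_text)
    return Counter(f"{label}_{index}" for label, index in found)


redacted = (
    "Hi! This is [ORGANIZATION_1]. I'd like to complain about my "
    "[ORGANIZATION_2] points. I haven't flown in [ORGANIZATION_2] since "
    "[DATE_INTERVAL_1]. [NAME_1], [NAME_1]."
)
counts = count_markers(redacted)
for marker, n in sorted(counts.items()):
    print(marker, n)
```

Markers that appear more than once (here ORGANIZATION_2 and NAME_1) are the ones the service linked together.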

You can also create your own marker by providing a format containing one of the marker keywords below (e.g., [ENTITY_TYPE]):

Marker keywords	Description
`ENTITY_TYPE`	Replace the entity with the type that best describes the entity (e.g., John -> `NAME_GIVEN`)
`ALL_ENTITY_TYPES`	Replace the entity with all the labels that apply (e.g., John -> `NAME_GIVEN,NAME`)
`UNIQUE_NUMBERED_ENTITY_TYPE` (default)	Replace the entity with the type that best describes the entity and append a number so that different entities have different markers (e.g., John -> `NAME_GIVEN_1`, Mary -> `NAME_GIVEN_2`)
`UNIQUE_HASHED_ENTITY_TYPE`	Similar to `UNIQUE_NUMBERED_ENTITY_TYPE`, except that the service appends a hash value instead of a sequential integer to make the markers unique.

Private AI will replace detected entities with the provided format, using the above keyword as a format specification. Here are some examples of custom markers with the configuration that generated them.

Replacing entities with a list of all applicable entity types:

ALL_ENTITY_TYPES redacted text:

"Hi! This is <ORGANIZATION>. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my <ORGANIZATION> Frequent Voyager points. I checked my account today. I had 5,000 points in <DATE_INTERVAL>. Now it’s just 3,500. I haven’t flown in <ORGANIZATION> since <DATE_INTERVAL> because of <CONDITION>. What happened? May I have your name and account number, ma’am? <NAME,NAME_FAMILY,NAME_GIVEN>, <NAME,NAME_FAMILY,NAME_GIVEN>."

ALL_ENTITY_TYPES request:

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MARKER",
        "pattern": "<ALL_ENTITY_TYPES>"
    }
 }

Replacing entities with unique hashed markers:

UNIQUE_HASHED_ENTITY_TYPE redacted text:

"Hi! This is -->ORGANIZATION_6SjIF<--. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my -->ORGANIZATION_5AP6v<-- Frequent Voyager points. I checked my account today. I had 5,000 points in -->DATE_INTERVAL_m8wzC<--. Now it’s just 3,500. I haven’t flown in -->ORGANIZATION_5AP6v<-- since -->DATE_INTERVAL_m8wzC<-- because of -->CONDITION_FCj0D<--. What happened? May I have your name and account number, ma’am? -->NAME_HBtjJ<--, -->NAME_HBtjJ<--."

UNIQUE_HASHED_ENTITY_TYPE request:

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MARKER",
        "pattern": "-->UNIQUE_HASHED_ENTITY_TYPE<--"
    }
 }

These are only a few examples of ways you can customize the markers in your de-identified text.
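
One way to picture the format specification is as a plain keyword substitution. The sketch below is an illustrative local approximation, not Private AI's implementation: render_marker is a hypothetical helper that swaps the keyword in a pattern for a value derived from the entity's labels. The label ordering used for ALL_ENTITY_TYPES is an assumption, and the hashed variant is not modelled because the hash format is decided by the service.

```python
def render_marker(pattern, best_label, labels, index):
    """Substitute the marker keyword found in pattern with its value."""
    values = {
        # Longer keywords come first so ENTITY_TYPE does not match inside them.
        "UNIQUE_NUMBERED_ENTITY_TYPE": f"{best_label}_{index}",
        "ALL_ENTITY_TYPES": ",".join(sorted(labels)),  # ordering is assumed
        "ENTITY_TYPE": best_label,
    }
    for keyword, value in values.items():
        if keyword in pattern:
            return pattern.replace(keyword, value)
    raise ValueError(f"no marker keyword in pattern: {pattern!r}")


name_labels = ["NAME", "NAME_GIVEN", "NAME_FAMILY"]
print(render_marker("<ALL_ENTITY_TYPES>", "NAME", name_labels, 1))
print(render_marker("[UNIQUE_NUMBERED_ENTITY_TYPE]", "ORGANIZATION", ["ORGANIZATION"], 2))
print(render_marker("-->ENTITY_TYPE<--", "CONDITION", ["CONDITION"], 1))
```

Everything outside the keyword, such as the brackets or arrows, is copied through unchanged, which is why the examples above can wrap markers in any delimiters.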

What is coreference resolution

Coreference resolution is a natural language processing (NLP) task that consists of locating and associating different mentions of real-world entities in unstructured text.

Let's look at an example.

Hi! I'm Michael but call me Mike please.

In the short text above, the same individual is mentioned twice: once as Michael and once as Mike. Because these two mentions refer to the same person, we say that they are coreferential. Coreference resolution does not only address the problem of associating different mentions of an individual, but can also be used to identify mentions of organizations, locations, and so on. Traditionally, the coreference resolution task is often extended to also include pronouns (e.g., "he" and "she") and other nominal forms (e.g., "the PM of Canada").

Coreference resolution has proven to be very useful in many other NLP tasks, including question answering, sentiment analysis, and document summarization, to name a few. If you are considering using redacted text as input to an existing model (e.g., an LLM) or to train a model for another NLP task, you should consider how coreference resolution might enhance those tasks.

Private AI and coreference resolution (new in 4.0)

Private AI's de-identification service can apply coreference resolution on its process/text and analyze/text endpoints. The current implementation is built on top of the named entity recognition (NER) model and is therefore limited to returning coreferences between entities only.

Private AI offers three different methods of performing coreference resolution. Whether you need to feed the redacted text to an ML model or simply make it easier to identify the different mentions of entities, you can benefit from Private AI's coreference resolution support.

Method Name	Description	Speed	Limitations
Heuristics	Uses rule-based methods for linking entities based on string matching.	Fast	Mostly links exact matches and a few minor variations (e.g., difference in casing). It may miss more complex variations and typos.
Model Prediction	Uses a neural network model to resolve coreferences, allowing for variations.	Slower	Currently only supports `NAME` and `ORGANIZATION` entities in English. This method is much slower than the heuristics one.
Combined	Combines both heuristics and model prediction for better coverage.	Slowest	Supports all entities with the heuristics method but resolves more complex cases for `NAME` and `ORGANIZATION` with the model prediction method. Slower than heuristics.

For details on using coreference resolution with the analyze/text endpoint, refer to the analyze-text documentation.

Here is an example of how to set coreference resolution in your de-identification request using the process/text endpoint. Notice the coreference_resolution field, which is part of the processed_text object.

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MARKER",
        "pattern": "[UNIQUE_NUMBERED_ENTITY_TYPE]",
        "coreference_resolution": "model_prediction"
    }
 }

The coreference_resolution field can take one of three values: heuristics, model_prediction or combined. Note that coreference resolution is enabled whenever a unique marker is used (i.e., UNIQUE_NUMBERED_ENTITY_TYPE or UNIQUE_HASHED_ENTITY_TYPE). By default, the heuristics mode will be enabled.

The following sections describe each of these options.

Heuristics

This method of coreference resolution is based solely on string matching. It is therefore only capable of linking entities that are mentioned in the same way. For example, the entities Mary and mary will be linked together because the strings match, except for a small difference in casing.

However, the entities John A. Smith and Mr Smith will not be linked together in this mode even if they are referring to the same person. As a consequence, the two entities will be assigned different unique markers (e.g., NAME_1 and NAME_2) in the redacted text.

Here is the output of the example above using the heuristics mode:

heuristics response:

"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since [DATE_INTERVAL_1] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."

heuristics request:

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MARKER",
        "pattern": "[UNIQUE_NUMBERED_ENTITY_TYPE]",
        "coreference_resolution": "heuristics"
    }
 }

Notice how only two of the three mentions of the Icarus organization were linked together (i.e., using the same ORGANIZATION_2 marker). The first mention is too different from the two other mentions for it to be resolved using heuristics. However, the two mentions of the person: Nessa Jonsson and the spelled-out mention N-E-S-S-A J-O-N-S-S-O-N were correctly linked despite the differences between the strings.

While the heuristics mode has its limitations, it is great when a more predictable output is required (e.g., all exact mentions of an entity need to be linked together in a text, no matter how long or difficult the text is). This option is also the fastest, and it supports all entity types, not only NAME and ORGANIZATION.

The heuristics mode is currently the default one in the process/text endpoint.
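
To make the string-matching idea concrete, here is a toy approximation that links mentions whose text is identical after case folding and numbers them per label. It is a deliberately simplified stand-in and not the service's actual heuristics, which link a few more variations than this; assign_markers is a hypothetical name.

```python
def assign_markers(mentions):
    """Map (label, text) mentions to numbered markers, ignoring casing."""
    seen = {}      # (label, case-folded text) -> marker
    counters = {}  # label -> last index used
    markers = []
    for label, text in mentions:
        key = (label, text.casefold())
        if key not in seen:
            counters[label] = counters.get(label, 0) + 1
            seen[key] = f"{label}_{counters[label]}"
        markers.append(seen[key])
    return markers


mentions = [
    ("NAME", "Mary"),
    ("NAME", "John A. Smith"),
    ("NAME", "mary"),
    ("NAME", "Mr Smith"),
]
print(assign_markers(mentions))
# ['NAME_1', 'NAME_2', 'NAME_1', 'NAME_3']
```

Mary and mary share NAME_1, while John A. Smith and Mr Smith end up with different markers, mirroring the limitation described in this section.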

Model prediction (new in 4.0)

The model_prediction option was introduced by Private AI to work around some of the limitations of the heuristics mode of resolution. This option uses a neural network model to resolve coreferences. It is capable of resolving mentions that have different spellings or even mentions containing typos.

model_prediction response:

"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_1] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_1] since [DATE_INTERVAL_2] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_2]."

model_prediction request:

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MARKER",
        "pattern": "[UNIQUE_NUMBERED_ENTITY_TYPE]",
        "coreference_resolution": "model_prediction"
    }
 }

In the above request, only the value of the coreference_resolution field has changed. As you can see, the model_prediction mode is capable of linking Icarus Airways Customer Service with Icarus, leading to a redacted text in which all mentions of the Icarus organization are replaced with the same marker (i.e., ORGANIZATION_1). However, the model was unable to link the person's name Nessa Jonsson with the spelled-out form.

The model_prediction option is great if you are dealing with entity mentions that may contain variation or typos. The model_prediction option currently only resolves NAME and ORGANIZATION entities in English text. Note also that this option is much slower than the heuristics one. It is not recommended for text samples that contain more than a few hundred words.

Combined (new in 4.0)

As its name suggests, this option combines the two other coreference resolution modes.

combined response:

"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_1] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_1] since [DATE_INTERVAL_1] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."

combined request:

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MARKER",
        "pattern": "[UNIQUE_NUMBERED_ENTITY_TYPE]",
        "coreference_resolution": "combined"
    }
 }

By leveraging the two approaches for finding coreferential mentions, the combined mode is able to resolve all coreferential mentions in the above example. If you plan to process only English text and want the best coverage to identify and resolve coreferential mentions of all types, then this mode is for you!

Note that the combined mode suffers from the same limitations as the model_prediction mode. It is much slower than the heuristics mode alone and is not recommended when processing large volumes of text (e.g., several thousand words).
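
The trade-offs between the three modes can be folded into a small selection helper. The sketch below is only one possible policy: the function name and the word-count threshold are assumptions chosen for illustration, since this guide gives no exact limit beyond "a few hundred words".

```python
def choose_coreference_mode(text, is_english, max_words_for_model=300):
    """Pick a coreference_resolution value from the text's language and size."""
    word_count = len(text.split())
    # The model-based modes only resolve NAME and ORGANIZATION in English
    # text and are slow, so fall back to heuristics in every other case.
    if not is_english or word_count > max_words_for_model:
        return "heuristics"
    return "combined"


short_text = "Hi! I'm Michael but call me Mike please."
print(choose_coreference_mode(short_text, is_english=True))       # combined
print(choose_coreference_mode(short_text, is_english=False))      # heuristics
print(choose_coreference_mode("word " * 1000, is_english=True))   # heuristics
```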

Coreference resolution across multiple requests

Note that coreference resolution is only performed within a single request. Identical entities across different requests will usually be assigned different markers. If you are processing related text fragments, consider passing them as a batch in a single request and setting link_batch to true. This allows the de-identification service to link entities across these fragments.
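
A batched request might look like the sketch below. It assumes that link_batch is a top-level boolean in the request body, next to text and processed_text; confirm the exact placement in the route documentation before relying on it. The two fragments are made-up sample text.

```python
import json

payload = {
    "text": [
        "May I have your name, ma'am? Nessa Jonsson.",
        "Thank you. I have found the account for Nessa Jonsson.",
    ],
    "link_batch": True,  # assumed top-level flag; serialized as JSON true
    "processed_text": {
        "type": "MARKER",
        "pattern": "[UNIQUE_NUMBERED_ENTITY_TYPE]",
    },
}

body = json.dumps(payload, indent=2)
print(body)
```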

Synthetic PII

You may choose to replace the entities in your text with fake or synthetic entities instead of markers and masks. There are a few reasons to do so. For example, if you train an AI model on your data, synthetic replacements might provide a more realistic input to train your model.

Generating synthetic PII is done by setting processed_text.type to SYNTHETIC.

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "SYNTHETIC"
    }
 }

The synthetic text output will be similar to:

Synthetic Text:

"Hi! This is United Federal Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my United Federal Customer Service Frequent Voyager points. I checked my account today. I had 5,000 points in 2012. Now it’s just 3,500. I haven’t flown in United Federal Customer Service since 2012 because of COVID. What happened? May I have your name and account number, ma’am? Maria Carlotta, Maria Carlotta."

Synthetic Response

[
  {
    "processed_text": "Hi! This is United Federal Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my United Federal Customer Service Frequent Voyager points. I checked my account today. I had 5,000 points in 2012. Now it’s just 3,500. I haven’t flown in United Federal Customer Service since 2012 because of COVID. What happened? May I have your name and account number, ma’am? Maria Carlotta, Maria Carlotta.",
    "entities": [
      {
        "processed_text": "United Federal Customer Service",
        "text": "Icarus Airways Customer Service",
        "location": {
          "stt_idx": 12,
          "end_idx": 43,
          "stt_idx_processed": 12,
          "end_idx_processed": 43
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8451
        }
      },
      {
        "processed_text": "United Federal Customer Service",
        "text": "Icarus",
        "location": {
          "stt_idx": 134,
          "end_idx": 140,
          "stt_idx_processed": 134,
          "end_idx_processed": 165
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.7969
        }
      },
      {
        "processed_text": "2012",
        "text": "2019",
        "location": {
          "stt_idx": 216,
          "end_idx": 220,
          "stt_idx_processed": 241,
          "end_idx_processed": 245
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9384
        }
      },
      {
        "processed_text": "United Federal Customer Service",
        "text": "Icarus",
        "location": {
          "stt_idx": 262,
          "end_idx": 268,
          "stt_idx_processed": 287,
          "end_idx_processed": 318
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8285
        }
      },
      {
        "processed_text": "2012",
        "text": "2019",
        "location": {
          "stt_idx": 275,
          "end_idx": 279,
          "stt_idx_processed": 325,
          "end_idx_processed": 329
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9393
        }
      },
      {
        "processed_text": "COVID",
        "text": "COVID",
        "location": {
          "stt_idx": 291,
          "end_idx": 296,
          "stt_idx_processed": 341,
          "end_idx_processed": 346
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9327
        }
      },
      {
        "processed_text": "Maria Carlotta",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 411,
          "end_idx_processed": 425
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.903,
          "NAME_GIVEN": 0.3583,
          "NAME_FAMILY": 0.5411
        }
      },
      {
        "processed_text": "Maria Carlotta",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 427,
          "end_idx_processed": 441
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3708,
          "NAME": 0.907,
          "NAME_FAMILY": 0.5271
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]

Note how the PII has been replaced with similar-looking fake entities. Because synthetic data generation is non-deterministic, each request may return a different response.
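
To work with such a response programmatically, a minimal sketch like the one below can list which original value was replaced by which synthetic value. It uses an abbreviated excerpt of the response above (first sentence and first entity only). The helper name synthetic_replacements is made up for this example, and the sketch assumes that the stt_idx_processed and end_idx_processed offsets can be used directly as Python string indices into processed_text.

```python
# Abbreviated excerpt of the synthetic response shown above (first entity only).
response = [
    {
        "processed_text": "Hi! This is United Federal Customer Service.",
        "entities": [
            {
                "processed_text": "United Federal Customer Service",
                "text": "Icarus Airways Customer Service",
                "location": {
                    "stt_idx": 12,
                    "end_idx": 43,
                    "stt_idx_processed": 12,
                    "end_idx_processed": 43,
                },
                "best_label": "ORGANIZATION",
            }
        ],
    }
]


def synthetic_replacements(item):
    """Return (label, original, synthetic) triples for one response item."""
    triples = []
    for entity in item["entities"]:
        loc = entity["location"]
        # Assumption: processed offsets index directly into processed_text.
        span = item["processed_text"][loc["stt_idx_processed"]:loc["end_idx_processed"]]
        if span != entity["processed_text"]:
            raise ValueError(f"offset mismatch for {entity['text']!r}")
        triples.append((entity["best_label"], entity["text"], entity["processed_text"]))
    return triples


for label, original, synthetic in synthetic_replacements(response[0]):
    print(f"{label}: {original!r} -> {synthetic!r}")
```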

You can optionally configure the language in which the synthetic text is generated using the synthetic_entity_accuracy field. For English, set this parameter to standard for best results. For other languages, set it to standard_multilingual and the synthetic model will attempt to predict entities matching the input text language. The default is standard_automatic, which selects the appropriate model (i.e., standard or standard_multilingual) based on the input language.

{
    "text": [
       "Publié le 03/01/2017 de la baie de Vaitupa, Polynésie française, GPS 17 34.06 S 149 37.1 W\n Nous nous sommes probablement rencontrés chez Yan Labrosse … il ya longtemps. Je suis ton periple avec … beaucoup d’envie!! Yan m’a dit que tu comptais rejoindre l’indonésie."
    ],
    "processed_text": {
        "type": "SYNTHETIC",
        "synthetic_entity_accuracy": "standard_multilingual"
    }
 }

The response shows how the entities were replaced with locations and a country name written in French.

Multilingual Synthetic (text only)

 "Publié le 31/08/2014 de la gare de Memphis, Tennessee américain, GPS 20 9sn.10 N ec   15.5 S\n Nous nous sommes probablement rencontrés chez Max Fontaine … il ya longtemps. Je suis ton periple avec … beaucoup d’envie!! Ben m’a dit que tu comptais rejoindre l’Argentine."

Multilingual Synthetic (full response)

[
  {
    "processed_text": "Publié le 31/08/2014 de la gare de Memphis, Tennessee américain, GPS 20 9sn.10 N ec   15.5 S\n Nous nous sommes probablement rencontrés chez Max Fontaine … il ya longtemps. Je suis ton periple avec … beaucoup d’envie!! Ben m’a dit que tu comptais rejoindre l’Argentine.",
    "entities": [
      {
        "processed_text": "31/08/2014",
        "text": "03/01/2017",
        "location": {
          "stt_idx": 10,
          "end_idx": 20,
          "stt_idx_processed": 10,
          "end_idx_processed": 20
        },
        "best_label": "DATE",
        "labels": {
          "DATE": 0.9961
        }
      },
      {
        "processed_text": "gare de Memphis, Tennessee américain",
        "text": "baie de Vaitupa, Polynésie française",
        "location": {
          "stt_idx": 27,
          "end_idx": 63,
          "stt_idx_processed": 27,
          "end_idx_processed": 63
        },
        "best_label": "LOCATION",
        "labels": {
          "LOCATION": 0.8305,
          "LOCATION_CITY": 0.0939,
          "LOCATION_STATE": 0.0878,
          "ORIGIN": 0.0745
        }
      },
      {
        "processed_text": "20 9sn.10 N ec   15.5 S",
        "text": "17 34.06 S 149 37.1 W",
        "location": {
          "stt_idx": 69,
          "end_idx": 90,
          "stt_idx_processed": 69,
          "end_idx_processed": 92
        },
        "best_label": "LOCATION_COORDINATE",
        "labels": {
          "LOCATION_COORDINATE": 0.989,
          "LOCATION": 0.9506
        }
      },
      {
        "processed_text": "Max Fontaine",
        "text": "Yan Labrosse",
        "location": {
          "stt_idx": 138,
          "end_idx": 150,
          "stt_idx_processed": 140,
          "end_idx_processed": 152
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.9953,
          "NAME_GIVEN": 0.2486,
          "NAME_FAMILY": 0.7461
        }
      },
      {
        "processed_text": "Ben",
        "text": "Yan",
        "location": {
          "stt_idx": 216,
          "end_idx": 219,
          "stt_idx_processed": 218,
          "end_idx_processed": 221
        },
        "best_label": "NAME_GIVEN",
        "labels": {
          "NAME": 0.9941,
          "NAME_GIVEN": 0.9915
        }
      },
      {
        "processed_text": "l’Argentine",
        "text": "l’indonésie",
        "location": {
          "stt_idx": 254,
          "end_idx": 265,
          "stt_idx_processed": 256,
          "end_idx_processed": 267
        },
        "best_label": "LOCATION_COUNTRY",
        "labels": {
          "LOCATION": 0.9858,
          "LOCATION_COUNTRY": 0.9682
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 266,
    "languages_detected": {
      "fr": 0.9757143259048462
    }
  }
]

See the Process Text route documentation for additional configuration options for synthetic data generation.
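
The two synthetic requests shown in this section differ only in the synthetic_entity_accuracy value, so they could be assembled with a small helper along these lines. This is only a sketch: the helper name build_synthetic_request is made up, and it accepts just the three values named in this guide, whereas the route may support others.

```python
# The three accuracy values named in this guide; the route may accept more.
ACCURACY_MODES = {"standard", "standard_multilingual", "standard_automatic"}


def build_synthetic_request(texts, accuracy="standard_automatic"):
    """Build a request body asking for synthetic replacements."""
    if accuracy not in ACCURACY_MODES:
        raise ValueError(f"unknown synthetic_entity_accuracy: {accuracy!r}")
    return {
        "text": list(texts),
        "processed_text": {
            "type": "SYNTHETIC",
            "synthetic_entity_accuracy": accuracy,
        },
    }


french_request = build_synthetic_request(
    ["Yan m’a dit que tu comptais rejoindre l’indonésie."],
    accuracy="standard_multilingual",
)
print(french_request["processed_text"])
```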

Custom redaction using the NER Text route

As we have seen above, the Process Text route offers a lot of flexibility in how text and files are redacted.

If you have a specific use case that is not completely covered by the API, you can create your own custom redaction function. This section shows how the NER Text route, introduced in 3.9, can be used to create a custom redaction function with more "fine-grained" labels.

Process Text route redaction

Let's say that you want to redact this fragment of text:

"ERIC G. BADORREK was born in 1960 and registered to vote on 10 February 2012, giving the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. BADORREK is registered to vote in the Republican Party. Voter ID number: 100917654"

Using the Process Text route, the redacted content will look like:

"[NAME] was born in [DOB] and registered to vote on [DATE], giving the address [LOCATION_ADDRESS]. [NAME_FAMILY] is registered to vote in the [ORGANIZATION]. Voter ID number: [ACCOUNT_NUMBER]"

Notice how all the parts of the name ERIC G. BADORREK, including the first name, initial and last name, were combined into a single NAME marker. This grouping of words into a single marker is even more apparent with the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A., which is redacted as a single LOCATION_ADDRESS marker. This makes the redacted content more readable, but it hides some information that may be useful for your use case. For example, you might want to know whether the provided address contained a zip code or a country, which is impossible to determine from the current redacted output.

Using the NER Text route to create your own redacted content

Unlike the Process Text route, the NER Text route does not provide a redacted output. However, the entities it returns can be used to create one. Let's see how.

Consider the following code, which processes the same sample text, this time with the NER Text route.

import requests
from itertools import groupby

text = "ERIC G. BADORREK was born in 1960 and registered to vote on 10 February 2012, giving the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. BADORREK is registered to vote in the Republican Party. Voter ID number: 100917654"

request = {
    "text": [text]
}

# TODO: update this URL to point to your local Private AI instance or to a Private AI cloud API endpoint.
resp = requests.post("http://localhost:8080/ner/text", json=request).json()

# Sort by start index, then longest span first, so that finer-grained entities are applied last.
entities = sorted(resp[0]["entities"], key=lambda e: (e["location"]["stt_idx"], -e["location"]["end_idx"], len(e["label"])))

class NotComparable(str):
    """A string that never compares equal with ==, not even to an identical string (e.g., "e")."""
    def __init__(self, value: str):
        self.value = value

    def __eq__(self, other) -> bool:
        return False


redacted_chunks = [NotComparable(c) for c in text]

for entity in entities:
    start = entity["location"]["stt_idx"]
    end = entity["location"]["end_idx"]
    redacted_chunks[start:end] = [f"""[{entity["label"]}]"""] * (end - start)

print("".join(key for key, _ in groupby(redacted_chunks)))

We first make a request to the NER Text route endpoint, passing the text to analyse. Then we extract and sort the entities from the response.

# TODO: update this URL to point to your local Private AI instance or to a Private AI cloud API endpoint.
resp = requests.post("http://localhost:8080/ner/text", json=request).json()

# Sort by start index, then longest span first, so that finer-grained entities are applied last.
entities = sorted(resp[0]["entities"], key=lambda e: (e["location"]["stt_idx"], -e["location"]["end_idx"], len(e["label"])))

We are going to use these entities to create a redacted text containing more details about the original text (e.g., whether an address contained a COUNTRY). Because we want "fine-grained" entities (i.e., the ones with smaller spans) to take precedence over "coarser" entities, we sort overlapping entities from the longest to the shortest, so that shorter spans are applied last and overwrite the longer ones. The redacted text is then constructed by iterating over the list of sorted entities.
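
The effect of the sort key can be seen on a small, made-up list of entities. The spans below are illustrative assumptions for the prefix "ERIC G. BADORREK", not actual API output; only the label, stt_idx and end_idx fields used by the sort key are included.

```python
# Hypothetical NER entities for the prefix "ERIC G. BADORREK"; the spans are
# illustrative assumptions, listed here in arbitrary order.
entities = [
    {"label": "NAME_FAMILY", "location": {"stt_idx": 8, "end_idx": 16}},
    {"label": "NAME_GIVEN", "location": {"stt_idx": 0, "end_idx": 4}},
    {"label": "NAME", "location": {"stt_idx": 0, "end_idx": 16}},
]


def sort_key(entity):
    """Earliest start first; for equal starts, the longest span first."""
    loc = entity["location"]
    return (loc["stt_idx"], -loc["end_idx"], len(entity["label"]))


ordered = [e["label"] for e in sorted(entities, key=sort_key)]
print(ordered)  # ['NAME', 'NAME_GIVEN', 'NAME_FAMILY']
```

The coarse NAME entity comes first, so the NAME_GIVEN and NAME_FAMILY markers written afterwards overwrite the parts of it that they cover.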

When building the redacted text, it is easier to first turn the input text into a list of characters. This allows us to more easily replace the sensitive content (i.e., the characters covered by an entity span) with a redaction marker. The following code converts the input text to a list of characters and then replaces each character that is part of an entity with the entity label. A small utility, NotComparable, ensures that identical characters never compare equal (i.e., NotComparable("e") == NotComparable("e") evaluates to False). This will be useful when outputting the redacted text.

class NotComparable(str):
    """A string that never compares equal with ==, not even to an identical string (e.g., "e")."""
    def __init__(self, value: str):
        self.value = value

    def __eq__(self, other) -> bool:
        return False

redacted_chunks = [NotComparable(c) for c in text]

for entity in entities:
    start = entity["location"]["stt_idx"]
    end = entity["location"]["end_idx"]
    redacted_chunks[start:end] = [f"""[{entity["label"]}]"""] * (end - start)

The last step is simply to join the characters of the original text and the redaction markers into the redacted content. Since a marker is repeated for every character of the entity it replaces, we use the groupby function to output it only once. This is where the NotComparable utility plays its role, by preventing consecutive identical characters (e.g., the two Rs in BADORREK) from being grouped together.

print("".join(key for key, _ in groupby(redacted_chunks)))
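
To see why the wrapper matters at this step, the following standalone sketch compares groupby on plain characters, on wrapped characters, and on a repeated marker. It uses a reduced version of NotComparable (without the value attribute) purely for illustration.

```python
from itertools import groupby


class NotComparable(str):
    """Reduced version of the utility above: == always returns False."""

    def __eq__(self, other) -> bool:
        return False


word = "BADORREK"

# Plain characters: the two consecutive "R" characters are grouped into one.
plain = "".join(key for key, _ in groupby(word))

# Wrapped characters: every character forms its own group.
protected = "".join(key for key, _ in groupby([NotComparable(c) for c in word]))

# A repeated marker is an ordinary string, so it collapses to one occurrence.
marker = "".join(key for key, _ in groupby(["[NAME_FAMILY]"] * len(word)))

print(plain)      # BADOREK
print(protected)  # BADORREK
print(marker)     # [NAME_FAMILY]
```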

Running the complete script produces a redacted text with all the necessary details.

"[NAME_GIVEN][NAME][NAME_FAMILY] was born in [DOB] and registered to vote on [DATE], giving the address [LOCATION_ADDRESS_STREET][LOCATION_ADDRESS][LOCATION_CITY][LOCATION_ADDRESS][LOCATION_STATE][LOCATION_ADDRESS][LOCATION_COUNTRY]. [NAME_FAMILY] is registered to vote in the [POLITICAL_AFFILIATION]. Voter ID number: [ACCOUNT_NUMBER]"

See how the name ERIC G. BADORREK has been replaced with [NAME_GIVEN][NAME][NAME_FAMILY] instead of a single NAME marker, and how the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. was redacted in much more detail as [LOCATION_ADDRESS_STREET][LOCATION_ADDRESS][LOCATION_CITY][LOCATION_ADDRESS][LOCATION_STATE][LOCATION_ADDRESS][LOCATION_COUNTRY]. From the above redacted text, it becomes clear that the original address contained a city, a state and a country but no zip code.
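
This kind of check can also be automated. The sketch below extracts the markers from the redacted text above with a regular expression and tests for specific labels. The label name LOCATION_ZIP is an assumption made for this example; use whichever zip code label your Private AI version reports.

```python
import re

# The redacted text produced by the custom redaction above.
redacted = (
    "[NAME_GIVEN][NAME][NAME_FAMILY] was born in [DOB] and registered to vote on [DATE], "
    "giving the address [LOCATION_ADDRESS_STREET][LOCATION_ADDRESS][LOCATION_CITY]"
    "[LOCATION_ADDRESS][LOCATION_STATE][LOCATION_ADDRESS][LOCATION_COUNTRY]. "
    "[NAME_FAMILY] is registered to vote in the [POLITICAL_AFFILIATION]. "
    "Voter ID number: [ACCOUNT_NUMBER]"
)

# Collect every marker of the form [LABEL], in order of appearance.
markers = re.findall(r"\[([A-Z_]+)\]", redacted)
labels = set(markers)

has_country = "LOCATION_COUNTRY" in labels
# Assumption: LOCATION_ZIP is the label used for zip codes.
has_zip = "LOCATION_ZIP" in labels

print(f"country present: {has_country}, zip code present: {has_zip}")
```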

A parting note about privacy

You may wonder whether the redacted results achieved in this section could have been obtained more simply by disabling the NAME, LOCATION and LOCATION_ADDRESS entity types when making the request. While disabling entity types has its uses, the technique described above has the advantage of lowering the chances of leaking sensitive data.

Consider for example the words Sussex County, which are part of the provided address. These words are part of the LOCATION_ADDRESS entity but not of any of its sub-entities. As a result, they would be left unredacted if both LOCATION and LOCATION_ADDRESS were disabled. The same applies to many other entities, like Mount Everest, which is a LOCATION but does not match any location sub-entity. By disabling the LOCATION label, we leave these entities unredacted, which might not be desirable for some use cases.
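
The difference between the two approaches can be sketched on a shortened address. The entity spans below are hypothetical stand-ins for what the NER Text route might return, and dropping labels from the list only simulates the effect of disabling those entity types in the request.

```python
from itertools import groupby


class NotComparable(str):
    """A character that never compares equal with ==, so groupby keeps it."""

    def __eq__(self, other) -> bool:
        return False


def entity(text, label, fragment):
    """Build a hypothetical entity covering the first occurrence of fragment."""
    start = text.index(fragment)
    return {"label": label, "location": {"stt_idx": start, "end_idx": start + len(fragment)}}


def redact(text, entities, disabled=()):
    """Apply the custom redaction, ignoring entities whose label is disabled."""
    kept = [e for e in entities if e["label"] not in disabled]
    kept.sort(key=lambda e: (e["location"]["stt_idx"], -e["location"]["end_idx"], len(e["label"])))
    chunks = [NotComparable(c) for c in text]
    for e in kept:
        start, end = e["location"]["stt_idx"], e["location"]["end_idx"]
        chunks[start:end] = [f"[{e['label']}]"] * (end - start)
    return "".join(key for key, _ in groupby(chunks))


text = "SELBYVILLE, Sussex County, Delaware"
entities = [
    entity(text, "LOCATION_ADDRESS", text),
    entity(text, "LOCATION_CITY", "SELBYVILLE"),
    entity(text, "LOCATION_STATE", "Delaware"),
]

full = redact(text, entities)
partial = redact(text, entities, disabled={"LOCATION", "LOCATION_ADDRESS"})

print(full)     # [LOCATION_CITY][LOCATION_ADDRESS][LOCATION_STATE]
print(partial)  # [LOCATION_CITY], Sussex County, [LOCATION_STATE]
```

With the coarse label kept, Sussex County stays hidden behind the LOCATION_ADDRESS marker; with it dropped, the county name appears in the output.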