Customizing Redaction

Info: To run the example code in this guide, sign up for a free test API key or run the container.

The Private AI APIs offer a lot of flexibility when it comes to creating redacted or de-identified content. This guide introduces a few techniques to modify and extend the existing capabilities of the Private AI APIs. It is organized into four parts:

The techniques described in the first two sections apply to most of the Private AI APIs. In particular, they can be used to customize the redaction in the Process Text route, the File URI route and the File Base64 route. The configuration is done through the shared processed_text object, which is part of the API request. See the documentation of each route for details. For simplicity, the following descriptions use example requests and responses from the Process Text route.
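
As a minimal sketch of how the example requests in this guide might be sent from Python, the helper below builds a request body with a processed_text object and posts it. The URL http://localhost:8080/v3/process/text is an assumption modelled on the NER Text URL used later in this guide, and the function names are hypothetical; adjust the URL (and add whatever authentication your deployment requires) to match your setup.

```python
import json
import urllib.request

# Assumed endpoint for a local container; this guide does not confirm it.
PROCESS_TEXT_URL = "http://localhost:8080/v3/process/text"


def build_request(texts, processed_text=None):
    """Build a request body: a list of texts plus an optional processed_text object."""
    body = {"text": list(texts)}
    if processed_text is not None:
        body["processed_text"] = dict(processed_text)
    return body


def send_request(body, url=PROCESS_TEXT_URL):
    """POST the body as JSON and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


body = build_request(["Hello, my name is Nessa Jonsson."], {"type": "MASK"})
print(json.dumps(body, indent=2))

# To actually call the service (requires a running instance):
# response = send_request(body)
# print(response[0]["processed_text"])
```

Only the processed_text object changes between the examples that follow, so the same helper can be reused for masks, markers and synthetic replacements.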

When redacting or de-identifying text, a customizable string pattern is used to replace the detected PII in the text. Private AI supports these replacement options:

  • MASK containing repeated characters up to the length of the replaced entity. This option provides a redacted text containing no information about the actual entities that were replaced.
  • MARKER containing the type of the entity being replaced. Markers can also be configured to link different mentions of the same entity in the redacted text (i.e. a name that appears twice in the text will have the same unique replacement marker).
  • SYNTHETIC text containing an AI generated replacement for the original entity. This option provides a processed text that is very similar to the original input text except that sensitive PII has been replaced with fake values.

Configuring a mask

Masking is also known as hashing when the # character is used. Setting the mask option is as simple as setting the type MASK in the processed_text object.

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MASK"
    }
 }

This option will replace all names, organizations and other PII mentioned with the default mask character #. The redacted text will then look like:

Mask Default (redacted text):
"Hi! This is ###############################. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ###### Frequent Voyager points. I checked my account today. I had 5,000 points in ####. Now it’s just 3,500. I haven’t flown in ###### since #### because of #####. What happened? May I have your name and account number, ma’am? #############, #############."
Mask Default (full response):
[
  {
    "processed_text": "Hi! This is ###############################. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ###### Frequent Voyager points. I checked my account today. I had 5,000 points in ####. Now it’s just 3,500. I haven’t flown in ###### since #### because of #####. What happened? May I have your name and account number, ma’am? #############, #############.",
    "entities": [
      {
        "processed_text": "###############################",
        "text": "Icarus Airways Customer Service",
        "location": {
          "stt_idx": 12,
          "end_idx": 43,
          "stt_idx_processed": 12,
          "end_idx_processed": 43
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8451
        }
      },
      {
        "processed_text": "######",
        "text": "Icarus",
        "location": {
          "stt_idx": 134,
          "end_idx": 140,
          "stt_idx_processed": 134,
          "end_idx_processed": 140
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.7969
        }
      },
      {
        "processed_text": "####",
        "text": "2019",
        "location": {
          "stt_idx": 216,
          "end_idx": 220,
          "stt_idx_processed": 216,
          "end_idx_processed": 220
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9384
        }
      },
      {
        "processed_text": "######",
        "text": "Icarus",
        "location": {
          "stt_idx": 262,
          "end_idx": 268,
          "stt_idx_processed": 262,
          "end_idx_processed": 268
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8285
        }
      },
      {
        "processed_text": "####",
        "text": "2019",
        "location": {
          "stt_idx": 275,
          "end_idx": 279,
          "stt_idx_processed": 275,
          "end_idx_processed": 279
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9393
        }
      },
      {
        "processed_text": "#####",
        "text": "COVID",
        "location": {
          "stt_idx": 291,
          "end_idx": 296,
          "stt_idx_processed": 291,
          "end_idx_processed": 296
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9327
        }
      },
      {
        "processed_text": "#############",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 361,
          "end_idx_processed": 374
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.903,
          "NAME_GIVEN": 0.3583,
          "NAME_FAMILY": 0.5411
        }
      },
      {
        "processed_text": "#############",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 376,
          "end_idx_processed": 389
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3708,
          "NAME": 0.907,
          "NAME_FAMILY": 0.5271
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]

If you prefer a different masking character in your redacted text, you can specify it in the request.

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MASK",
        "mask_character": "■"
    }
 }

The above request will redact using the provided mask character:

Custom Mask (redacted text):
"Hi! This is ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ■■■■■■ Frequent Voyager points. I checked my account today. I had 5,000 points in ■■■■. Now it’s just 3,500. I haven’t flown in ■■■■■■ since ■■■■ because of ■■■■■. What happened? May I have your name and account number, ma’am? ■■■■■■■■■■■■■, ■■■■■■■■■■■■■."
Custom Mask (full response):
[
  {
    "processed_text": "Hi! This is ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my ■■■■■■ Frequent Voyager points. I checked my account today. I had 5,000 points in ■■■■. Now it’s just 3,500. I haven’t flown in ■■■■■■ since ■■■■ because of ■■■■■. What happened? May I have your name and account number, ma’am? ■■■■■■■■■■■■■, ■■■■■■■■■■■■■.",
    "entities": [
      {
        "processed_text": "■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■",
        "text": "Icarus Airways Customer Service",
        "location": {
          "stt_idx": 12,
          "end_idx": 43,
          "stt_idx_processed": 12,
          "end_idx_processed": 43
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8451
        }
      },
      {
        "processed_text": "■■■■■■",
        "text": "Icarus",
        "location": {
          "stt_idx": 134,
          "end_idx": 140,
          "stt_idx_processed": 134,
          "end_idx_processed": 140
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.7969
        }
      },
      {
        "processed_text": "■■■■",
        "text": "2019",
        "location": {
          "stt_idx": 216,
          "end_idx": 220,
          "stt_idx_processed": 216,
          "end_idx_processed": 220
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9384
        }
      },
      {
        "processed_text": "■■■■■■",
        "text": "Icarus",
        "location": {
          "stt_idx": 262,
          "end_idx": 268,
          "stt_idx_processed": 262,
          "end_idx_processed": 268
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8285
        }
      },
      {
        "processed_text": "■■■■",
        "text": "2019",
        "location": {
          "stt_idx": 275,
          "end_idx": 279,
          "stt_idx_processed": 275,
          "end_idx_processed": 279
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9393
        }
      },
      {
        "processed_text": "■■■■■",
        "text": "COVID",
        "location": {
          "stt_idx": 291,
          "end_idx": 296,
          "stt_idx_processed": 291,
          "end_idx_processed": 296
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9327
        }
      },
      {
        "processed_text": "■■■■■■■■■■■■■",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 361,
          "end_idx_processed": 374
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.903,
          "NAME_GIVEN": 0.3583,
          "NAME_FAMILY": 0.5411
        }
      },
      {
        "processed_text": "■■■■■■■■■■■■■",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 376,
          "end_idx_processed": 389
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3708,
          "NAME": 0.907,
          "NAME_FAMILY": 0.5271
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]

In both cases, the redacted text does not contain any information about the replaced entities besides their length.
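
The location offsets in the full responses can be used to map each entity back to both the original and the processed text. The sketch below uses the first entity of the sample response together with an abbreviated input. It assumes the offsets are half-open character indices, which is consistent with the sample values; span_texts is a hypothetical helper.

```python
def span_texts(original, processed, entity):
    """Return the entity's slice of the original text and of the processed text."""
    loc = entity["location"]
    return (
        original[loc["stt_idx"]:loc["end_idx"]],
        processed[loc["stt_idx_processed"]:loc["end_idx_processed"]],
    )


# First sentence of the sample input and its masked counterpart.
original = "Hi! This is Icarus Airways Customer Service."
processed = "Hi! This is " + "#" * 31 + "."

# First entity from the sample response (labels omitted for brevity).
entity = {
    "text": "Icarus Airways Customer Service",
    "location": {
        "stt_idx": 12,
        "end_idx": 43,
        "stt_idx_processed": 12,
        "end_idx_processed": 43,
    },
}

original_span, processed_span = span_texts(original, processed, entity)
print(original_span)
print(processed_span)
```

With markers or synthetic replacements the processed offsets differ from the original ones, because the replacement usually has a different length than the entity it replaces.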

Configuring a marker

The marker option allows the redacted text to include the entity type. This may improve the readability of the redacted text. Setting the default marker option is as simple as setting the mask option.

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MARKER"
    }
 }

With the default marker settings, the response will contain markers with the entity type and a unique number.

Marker Default (redacted text):
"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since [DATE_INTERVAL_1] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."
Marker Default (full response):
[
 {
   "processed_text": "Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in [DATE_INTERVAL_1]. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since [DATE_INTERVAL_1] because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
   "entities": [
     {
       "processed_text": "ORGANIZATION_1",
       "text": "Icarus Airways Customer Service",
       "location": {
         "stt_idx": 12,
         "end_idx": 43,
         "stt_idx_processed": 12,
         "end_idx_processed": 28
       },
       "best_label": "ORGANIZATION",
       "labels": {
         "ORGANIZATION": 0.8451
       }
     },
     {
       "processed_text": "ORGANIZATION_2",
       "text": "Icarus",
       "location": {
         "stt_idx": 134,
         "end_idx": 140,
         "stt_idx_processed": 119,
         "end_idx_processed": 135
       },
       "best_label": "ORGANIZATION",
       "labels": {
         "ORGANIZATION": 0.7969
       }
     },
     {
       "processed_text": "DATE_INTERVAL_1",
       "text": "2019",
       "location": {
         "stt_idx": 216,
         "end_idx": 220,
         "stt_idx_processed": 211,
         "end_idx_processed": 228
       },
       "best_label": "DATE_INTERVAL",
       "labels": {
         "DATE_INTERVAL": 0.9384
       }
     },
     {
       "processed_text": "ORGANIZATION_2",
       "text": "Icarus",
       "location": {
         "stt_idx": 262,
         "end_idx": 268,
         "stt_idx_processed": 270,
         "end_idx_processed": 286
       },
       "best_label": "ORGANIZATION",
       "labels": {
         "ORGANIZATION": 0.8285
       }
     },
     {
       "processed_text": "DATE_INTERVAL_1",
       "text": "2019",
       "location": {
         "stt_idx": 275,
         "end_idx": 279,
         "stt_idx_processed": 293,
         "end_idx_processed": 310
       },
       "best_label": "DATE_INTERVAL",
       "labels": {
         "DATE_INTERVAL": 0.9393
       }
     },
     {
       "processed_text": "CONDITION_1",
       "text": "COVID",
       "location": {
         "stt_idx": 291,
         "end_idx": 296,
         "stt_idx_processed": 322,
         "end_idx_processed": 335
       },
       "best_label": "CONDITION",
       "labels": {
         "CONDITION": 0.9327
       }
     },
     {
       "processed_text": "NAME_1",
       "text": "Nessa Jonsson",
       "location": {
         "stt_idx": 361,
         "end_idx": 374,
         "stt_idx_processed": 400,
         "end_idx_processed": 408
       },
       "best_label": "NAME",
       "labels": {
         "NAME": 0.903,
         "NAME_GIVEN": 0.3583,
         "NAME_FAMILY": 0.5411
       }
     },
     {
       "processed_text": "NAME_1",
       "text": "N-E-S-S-A J-O-N-S-S-O-N",
       "location": {
         "stt_idx": 376,
         "end_idx": 399,
         "stt_idx_processed": 410,
         "end_idx_processed": 418
       },
       "best_label": "NAME",
       "labels": {
         "NAME_GIVEN": 0.3708,
         "NAME": 0.907,
         "NAME_FAMILY": 0.5271
       }
     }
   ],
   "entities_present": true,
   "characters_processed": 400,
   "languages_detected": {
     "en": 0.9167966246604919
   }
 }
]

Notice how Nessa Jonsson and the spelled-out mention N-E-S-S-A J-O-N-S-S-O-N have been replaced with the same marker, NAME_1. Similarly, the two mentions of Icarus have been replaced with ORGANIZATION_2. With the default settings, the de-identification service uses a new unique marker index unless the entity was previously seen in the text. If the entity is repeated in the text, the service will do its best to assign the same marker to every mention. This makes it easier to identify the different mentions of a person's name, an organization's name and so on.

A note about linking entities

Entity linking is currently based on string matching. It can therefore link entities that are mentioned in the same way. For example, the entities Mary and mary will be linked together because they match except for a small difference in casing.

However, the entities John A. Smith and John Smith will not be linked together, even if they refer to the same person. As a consequence, the two entities will have different unique markers (e.g. NAME_1 and NAME_2). Improved linking is being actively worked on and will be released soon.
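
To make the linking behaviour concrete, here is a simplified local sketch. It is not the service's actual algorithm: it assumes exact string matching apart from casing, and assign_unique_markers is a hypothetical helper that takes mentions in order of appearance.

```python
from collections import defaultdict


def assign_unique_markers(entities):
    """Assign TYPE_N markers, reusing a marker when the same text was seen before.

    entities: list of (text, entity_type) pairs in order of appearance.
    Matching is exact apart from casing, which is a simplification.
    """
    seen = {}                    # (type, case-folded text) -> marker
    counters = defaultdict(int)  # type -> last index used
    markers = []
    for text, entity_type in entities:
        key = (entity_type, text.casefold())
        if key not in seen:
            counters[entity_type] += 1
            seen[key] = f"{entity_type}_{counters[entity_type]}"
        markers.append(seen[key])
    return markers


mentions = [
    ("Mary", "NAME"),
    ("John A. Smith", "NAME"),
    ("mary", "NAME"),
    ("John Smith", "NAME"),
]
print(assign_unique_markers(mentions))
# ['NAME_1', 'NAME_2', 'NAME_1', 'NAME_3']
```

Mary and mary share a marker, while the two spellings of John Smith receive different ones, mirroring the limitation described above.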

Note that the linking of the entities is only performed within a request. Identical entities across different requests will usually be assigned different markers. If you are processing related text fragments, you may consider passing them as a batch in a single request and setting link_batch to True. This will allow the de-identification service to link entities across these fragments.
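
A request that batches related fragments might be built as sketched below. The placement of link_batch as a top-level boolean in the request body is an assumption; check the Process Text route documentation for the exact location. The helper name is hypothetical.

```python
import json


def build_linked_batch_request(fragments, processed_text=None):
    """Build one request body carrying several related text fragments."""
    return {
        "text": list(fragments),
        # Assumption: link_batch is a top-level boolean of the request body.
        "link_batch": True,
        "processed_text": dict(processed_text or {"type": "MARKER"}),
    }


fragments = [
    "My name is Nessa Jonsson.",
    "Nessa Jonsson called again about her points.",
]
body = build_linked_batch_request(fragments)
print(json.dumps(body, indent=2))  # Python's True is serialized as JSON true
```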

You can create your own marker by providing a format containing one of the marker keywords below (e.g. [ENTITY_TYPE]):

Marker keywords and their descriptions:

  • ENTITY_TYPE: Replace the entity with the type that best describes the entity (e.g. John -> NAME_GIVEN).
  • ALL_ENTITY_TYPES: Replace the entity with all the labels that apply (e.g. John -> NAME_GIVEN,NAME).
  • UNIQUE_ENTITY_TYPE (default): Replace the entity with the type that best describes the entity and append a number so that different entities have different markers (e.g. John -> NAME_GIVEN_1, Mary -> NAME_GIVEN_2).
  • UNIQUE_HASHED_ENTITY_TYPE: Similar to UNIQUE_ENTITY_TYPE, except that the service appends a hash value instead of a sequential integer to make the markers unique.

Private AI replaces each detected entity with the provided format, substituting the keyword it contains. Here are some redacted examples with the configurations that generated them.

Replacing entities with all applicable entity types:

ALL_ENTITY_TYPES redacted text:
"Hi! This is <ORGANIZATION>. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my <ORGANIZATION> Frequent Voyager points. I checked my account today. I had 5,000 points in <DATE_INTERVAL>. Now it’s just 3,500. I haven’t flown in <ORGANIZATION> since <DATE_INTERVAL> because of <CONDITION>. What happened? May I have your name and account number, ma’am? <NAME,NAME_FAMILY,NAME_GIVEN>, <NAME,NAME_FAMILY,NAME_GIVEN>."
ALL_ENTITY_TYPES request:
{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MARKER",
        "pattern": "<ALL_ENTITY_TYPES>"
    }
 }

Replacing with unique hashed markers:

UNIQUE_HASHED_ENTITY_TYPE redacted text:
"Hi! This is -->ORGANIZATION_6SjIF<--. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my -->ORGANIZATION_5AP6v<-- Frequent Voyager points. I checked my account today. I had 5,000 points in -->DATE_INTERVAL_m8wzC<--. Now it’s just 3,500. I haven’t flown in -->ORGANIZATION_5AP6v<-- since -->DATE_INTERVAL_m8wzC<-- because of -->CONDITION_FCj0D<--. What happened? May I have your name and account number, ma’am? -->NAME_HBtjJ<--, -->NAME_HBtjJ<--."
UNIQUE_HASHED_ENTITY_TYPE request:
{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "MARKER",
        "pattern": "-->UNIQUE_HASHED_ENTITY_TYPE<--"
    }
 }
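
The way a marker pattern expands can be illustrated with a small local sketch. This only imitates the documented keywords for a single entity and is not the service's implementation; in particular, the short SHA-256 digest used for the hashed variant is an arbitrary stand-in, and render_marker is a hypothetical helper.

```python
import hashlib
import re

KEYWORDS = (
    "UNIQUE_HASHED_ENTITY_TYPE",
    "UNIQUE_ENTITY_TYPE",
    "ALL_ENTITY_TYPES",
    "ENTITY_TYPE",
)
# One regex pass over the pattern: matching is leftmost, so a longer keyword
# is consumed whole before the ENTITY_TYPE inside it can match.
KEYWORD_RE = re.compile("|".join(KEYWORDS))


def render_marker(pattern, best_label, labels, index, text):
    """Expand the marker keyword in pattern for one entity (illustrative only)."""
    # Arbitrary short digest; the service's real hash scheme is not described here.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:5]
    values = {
        "ENTITY_TYPE": best_label,
        "ALL_ENTITY_TYPES": ",".join(labels),
        "UNIQUE_ENTITY_TYPE": f"{best_label}_{index}",
        "UNIQUE_HASHED_ENTITY_TYPE": f"{best_label}_{digest}",
    }
    return KEYWORD_RE.sub(lambda match: values[match.group(0)], pattern)


labels = ["NAME_GIVEN", "NAME"]
print(render_marker("[ENTITY_TYPE]", "NAME_GIVEN", labels, 1, "John"))
print(render_marker("<ALL_ENTITY_TYPES>", "NAME_GIVEN", labels, 1, "John"))
print(render_marker("[UNIQUE_ENTITY_TYPE]", "NAME_GIVEN", labels, 1, "John"))
print(render_marker("-->UNIQUE_HASHED_ENTITY_TYPE<--", "NAME_GIVEN", labels, 1, "John"))
```

Everything in the pattern other than the keyword, such as the brackets or arrows, is kept verbatim around the expanded value.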

Synthetic PII

You may choose to replace the entities in your text with fake or synthetic entities instead of markers and masks. There are a few reasons to do so. For example, if you train an AI model on your data, synthetic replacements might provide a more realistic input to train your model.

Generating synthetic PII is done by setting processed_text.type to SYNTHETIC.

{
    "text": [
       "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
    ],
    "processed_text": {
        "type": "SYNTHETIC"
    }
 }

The synthetic text output will be similar to:

Synthetic Text:
"Hi! This is United Federal Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my United Federal Customer Service Frequent Voyager points. I checked my account today. I had 5,000 points in 2012. Now it’s just 3,500. I haven’t flown in United Federal Customer Service since 2012 because of COVID. What happened? May I have your name and account number, ma’am? Maria Carlotta, Maria Carlotta."
Synthetic Response:
[
  {
    "processed_text": "Hi! This is United Federal Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my United Federal Customer Service Frequent Voyager points. I checked my account today. I had 5,000 points in 2012. Now it’s just 3,500. I haven’t flown in United Federal Customer Service since 2012 because of COVID. What happened? May I have your name and account number, ma’am? Maria Carlotta, Maria Carlotta.",
    "entities": [
      {
        "processed_text": "United Federal Customer Service",
        "text": "Icarus Airways Customer Service",
        "location": {
          "stt_idx": 12,
          "end_idx": 43,
          "stt_idx_processed": 12,
          "end_idx_processed": 43
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8451
        }
      },
      {
        "processed_text": "United Federal Customer Service",
        "text": "Icarus",
        "location": {
          "stt_idx": 134,
          "end_idx": 140,
          "stt_idx_processed": 134,
          "end_idx_processed": 165
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.7969
        }
      },
      {
        "processed_text": "2012",
        "text": "2019",
        "location": {
          "stt_idx": 216,
          "end_idx": 220,
          "stt_idx_processed": 241,
          "end_idx_processed": 245
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9384
        }
      },
      {
        "processed_text": "United Federal Customer Service",
        "text": "Icarus",
        "location": {
          "stt_idx": 262,
          "end_idx": 268,
          "stt_idx_processed": 287,
          "end_idx_processed": 318
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8285
        }
      },
      {
        "processed_text": "2012",
        "text": "2019",
        "location": {
          "stt_idx": 275,
          "end_idx": 279,
          "stt_idx_processed": 325,
          "end_idx_processed": 329
        },
        "best_label": "DATE_INTERVAL",
        "labels": {
          "DATE_INTERVAL": 0.9393
        }
      },
      {
        "processed_text": "COVID",
        "text": "COVID",
        "location": {
          "stt_idx": 291,
          "end_idx": 296,
          "stt_idx_processed": 341,
          "end_idx_processed": 346
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9327
        }
      },
      {
        "processed_text": "Maria Carlotta",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 411,
          "end_idx_processed": 425
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.903,
          "NAME_GIVEN": 0.3583,
          "NAME_FAMILY": 0.5411
        }
      },
      {
        "processed_text": "Maria Carlotta",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 427,
          "end_idx_processed": 441
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3708,
          "NAME": 0.907,
          "NAME_FAMILY": 0.5271
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]

Note how the PII has been replaced with similar-looking fake entities. Also note that synthetic data generation is non-deterministic, so the same request may produce a different response each time.

You can optionally configure the language in which the text is generated using the synthetic_entity_accuracy field. For English generation, set this parameter to standard for best results. For other languages, set it to standard_multilingual and the synthetic model will attempt to predict entities matching the input text language. The default accuracy is standard_automatic which will determine the appropriate model (i.e. standard or standard_multilingual) from the input language.

{
    "text": [
       "Publié le 03/01/2017 de la baie de Vaitupa, Polynésie française, GPS 17 34.06 S 149 37.1 W\n Nous nous sommes probablement rencontrés chez Yan Labrosse … il ya longtemps. Je suis ton periple avec … beaucoup d’envie!! Yan m’a dit que tu comptais rejoindre l’indonésie."
    ],
    "processed_text": {
        "type": "SYNTHETIC",
        "synthetic_entity_accuracy": "standard_multilingual"
    }
 }

The response shows how the entities were replaced with synthetic values phrased in French, including a location and a country.

Multilingual Synthetic (text only):
 "Publié le 31/08/2014 de la gare de Memphis, Tennessee américain, GPS 20 9sn.10 N ec   15.5 S\n Nous nous sommes probablement rencontrés chez Max Fontaine … il ya longtemps. Je suis ton periple avec … beaucoup d’envie!! Ben m’a dit que tu comptais rejoindre l’Argentine."
Multilingual Synthetic (full response):
[
  {
    "processed_text": "Publié le 31/08/2014 de la gare de Memphis, Tennessee américain, GPS 20 9sn.10 N ec   15.5 S\n Nous nous sommes probablement rencontrés chez Max Fontaine … il ya longtemps. Je suis ton periple avec … beaucoup d’envie!! Ben m’a dit que tu comptais rejoindre l’Argentine.",
    "entities": [
      {
        "processed_text": "31/08/2014",
        "text": "03/01/2017",
        "location": {
          "stt_idx": 10,
          "end_idx": 20,
          "stt_idx_processed": 10,
          "end_idx_processed": 20
        },
        "best_label": "DATE",
        "labels": {
          "DATE": 0.9961
        }
      },
      {
        "processed_text": "gare de Memphis, Tennessee américain",
        "text": "baie de Vaitupa, Polynésie française",
        "location": {
          "stt_idx": 27,
          "end_idx": 63,
          "stt_idx_processed": 27,
          "end_idx_processed": 63
        },
        "best_label": "LOCATION",
        "labels": {
          "LOCATION": 0.8305,
          "LOCATION_CITY": 0.0939,
          "LOCATION_STATE": 0.0878,
          "ORIGIN": 0.0745
        }
      },
      {
        "processed_text": "20 9sn.10 N ec   15.5 S",
        "text": "17 34.06 S 149 37.1 W",
        "location": {
          "stt_idx": 69,
          "end_idx": 90,
          "stt_idx_processed": 69,
          "end_idx_processed": 92
        },
        "best_label": "LOCATION_COORDINATE",
        "labels": {
          "LOCATION_COORDINATE": 0.989,
          "LOCATION": 0.9506
        }
      },
      {
        "processed_text": "Max Fontaine",
        "text": "Yan Labrosse",
        "location": {
          "stt_idx": 138,
          "end_idx": 150,
          "stt_idx_processed": 140,
          "end_idx_processed": 152
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.9953,
          "NAME_GIVEN": 0.2486,
          "NAME_FAMILY": 0.7461
        }
      },
      {
        "processed_text": "Ben",
        "text": "Yan",
        "location": {
          "stt_idx": 216,
          "end_idx": 219,
          "stt_idx_processed": 218,
          "end_idx_processed": 221
        },
        "best_label": "NAME_GIVEN",
        "labels": {
          "NAME": 0.9941,
          "NAME_GIVEN": 0.9915
        }
      },
      {
        "processed_text": "l’Argentine",
        "text": "l’indonésie",
        "location": {
          "stt_idx": 254,
          "end_idx": 265,
          "stt_idx_processed": 256,
          "end_idx_processed": 267
        },
        "best_label": "LOCATION_COUNTRY",
        "labels": {
          "LOCATION": 0.9858,
          "LOCATION_COUNTRY": 0.9682
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 266,
    "languages_detected": {
      "fr": 0.9757143259048462
    }
  }
]

See the Process Text route documentation for additional configuration options for synthetic data generation.
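
The choice of synthetic_entity_accuracy described above can be captured in a small helper. This is a sketch: the three values come from this guide, while the helper names and the rule of treating any language code starting with "en" as English are assumptions.

```python
def choose_synthetic_accuracy(language=None):
    """Pick a synthetic_entity_accuracy value for a known or unknown input language."""
    if language is None:
        # Unknown language: let the service pick the model itself.
        return "standard_automatic"
    if language.lower().startswith("en"):
        return "standard"
    return "standard_multilingual"


def synthetic_processed_text(language=None):
    """Build the processed_text object for a SYNTHETIC request."""
    return {
        "type": "SYNTHETIC",
        "synthetic_entity_accuracy": choose_synthetic_accuracy(language),
    }


print(synthetic_processed_text("fr"))
print(synthetic_processed_text("en"))
print(synthetic_processed_text())
```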

Custom redaction using the NER Text route

As we have seen above, the Process Text route offers a lot of flexibility in how text and files are redacted.

If you have a specific use case that is not completely covered by the API, you can create your own custom redaction function. This section shows how the NER Text route, introduced in 3.9, can be used to create a custom redaction function with more "fine-grained" labels.

Process Text route redaction

Let's say that you want to redact this fragment of text:

"ERIC G. BADORREK was born in 1960 and registered to vote on 10 February 2012, giving the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. BADORREK is registered to vote in the Republican Party. Voter ID number: 100917654"

Using the Process Text route, the redacted content will look like:

"[NAME] was born in [DOB] and registered to vote on [DATE], giving the address [LOCATION_ADDRESS]. [NAME_FAMILY] is registered to vote in the [ORGANIZATION]. Voter ID number: [ACCOUNT_NUMBER]"

Notice how all the parts of the name ERIC G. BADORREK, including the first name, initial and last name, were combined into a single NAME marker. This grouping of words into a single marker is even more apparent for the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A., which is redacted as a single LOCATION_ADDRESS label. This certainly makes the redacted content more readable, but it hides some information that may be useful for your use case. For example, you might want to know whether the provided address contained a zip code or a country, which is impossible to determine from the current redacted output.

Using the NER Text route to create your own redacted content

Unlike the Process Text route, the NER Text route does not provide a redacted output. However, the entities it returns can be used to create one. Let's see how.

Consider this piece of code, which processes the same sample text, this time with the NER Text route.

import requests
from itertools import groupby

text = "ERIC G. BADORREK was born in 1960 and registered to vote on 10 February 2012, giving the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. BADORREK is registered to vote in the Republican Party. Voter ID number: 100917654"

request = {
    "text": [text]
}

# TODO: update this URL to point to your local Private AI instance or to one of the Private AI cloud APIs.
resp = requests.post("http://localhost:8080/v3/ner/text", json=request).json()

# sort the entities so that entities with longest spans are first
entities = sorted(resp[0]["entities"], key=lambda e: (e["location"]["stt_idx"], -e["location"]["end_idx"], len(e["label"])))

class NotComparable(str):
    """A str whose instances only compare equal to themselves, so groupby never merges two of them."""
    def __eq__(self, other):
        return self is other

    def __ne__(self, other):
        return self is not other

    __hash__ = str.__hash__

redacted_chunks = [NotComparable(c) for c in text]

for entity in entities:
    start = entity["location"]["stt_idx"]
    end = entity["location"]["end_idx"]
    redacted_chunks[start:end] = [f"""[{entity["label"]}]"""] * (end - start)

print("".join(key for key, _ in groupby(redacted_chunks)))

We first make a request to the NER Text route endpoint, passing the text to analyse. Then we extract and sort the entities from the response.

# TODO: update this URL to point to your local Private AI instance or to one of the Private AI cloud APIs.
resp = requests.post("http://localhost:8080/v3/ner/text", json=request).json()

# sort the entities so that entities with longest spans are first
entities = sorted(resp[0]["entities"], key=lambda e: (e["location"]["stt_idx"], -e["location"]["end_idx"], len(e["label"])))
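
To see what this sort key does without calling the API, here is a minimal offline sketch. The entity list is hand-written for illustration (it is not real API output), and the name span_sort_key is only a helper introduced for this example.

```python
# Hand-written entities for illustration only; offsets refer to `text` below.
text = "ERIC G. BADORREK was born in 1960"
entities = [
    {"label": "NAME_FAMILY", "location": {"stt_idx": 8, "end_idx": 16}},
    {"label": "NAME_GIVEN", "location": {"stt_idx": 0, "end_idx": 4}},
    {"label": "DOB", "location": {"stt_idx": 29, "end_idx": 33}},
    {"label": "NAME", "location": {"stt_idx": 0, "end_idx": 16}},
]


def span_sort_key(entity):
    """Earliest start first; for equal starts, the longest span first."""
    location = entity["location"]
    return (location["stt_idx"], -location["end_idx"], len(entity["label"]))


order_labels = [e["label"] for e in sorted(entities, key=span_sort_key)]
print(order_labels)
# → ['NAME', 'NAME_GIVEN', 'NAME_FAMILY', 'DOB']
```

The coarse NAME entity comes before the finer NAME_GIVEN and NAME_FAMILY entities it overlaps, so the finer labels are applied later and take precedence.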

We are going to use these entities to create a redacted text that retains more detail about the original text (e.g. whether an address contained a COUNTRY). Because we want "fine-grained" entities (i.e. the ones with smaller spans) to take precedence over "coarser" entities, we sort overlapping entities from the longest to the shortest, so that the shorter spans are applied last. The following code constructs the redacted text by iterating over the list of sorted entities.

While doing so, it is easier to turn the input text into a list of characters. This allows us to more easily replace the sensitive content (i.e. the characters covered by an entity span) with a redaction marker. The following code converts the input text to a list of characters and then replaces each character that is part of an entity with the entity label. A small utility, NotComparable, is created to ensure that identical strings do not compare equal (i.e. NotComparable("e") != NotComparable("e")). This will be useful when outputting the redacted text.

class NotComparable(str):
    """A str whose instances only compare equal to themselves, so groupby never merges two of them."""
    def __eq__(self, other):
        return self is other

    def __ne__(self, other):
        return self is not other

    __hash__ = str.__hash__

redacted_chunks = [NotComparable(c) for c in text]

for entity in entities:
    start = entity["location"]["stt_idx"]
    end = entity["location"]["end_idx"]
    redacted_chunks[start:end] = [f"""[{entity["label"]}]"""] * (end - start)

The last step is simply to join the characters of the original text and the redaction markers into the redacted content. Since a marker is repeated for every character of the entity it replaced, we use the groupby function to output it only once. This is where the NotComparable utility plays its role: it prevents consecutive identical characters in the unredacted text (e.g. the two d's in address) from being grouped together.

print("".join(key for key, _ in groupby(redacted_chunks)))

The result is a redacted text with all the necessary details.

"[NAME_GIVEN][NAME][NAME_FAMILY] was born in [DOB] and registered to vote on [DATE], giving the address [LOCATION_ADDRESS_STREET][LOCATION_ADDRESS][LOCATION_CITY][LOCATION_ADDRESS][LOCATION_STATE][LOCATION_ADDRESS][LOCATION_COUNTRY]. [NAME_FAMILY] is registered to vote in the [POLITICAL_AFFILIATION]. Voter ID number: [ACCOUNT_NUMBER]"

See how the name ERIC G. BADORREK has been replaced with [NAME_GIVEN][NAME][NAME_FAMILY] instead of a single NAME marker, and how the address 35933 COLLINS LN, FENWICK WEST, SELBYVILLE, Sussex County, Delaware, U.S.A. was redacted in much more detail: [LOCATION_ADDRESS_STREET][LOCATION_ADDRESS][LOCATION_CITY][LOCATION_ADDRESS][LOCATION_STATE][LOCATION_ADDRESS][LOCATION_COUNTRY]. From the above redacted text, it becomes clear that the original address contained a city, a state and a country but no zip code.
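
If you want to experiment with this technique without a running container, the sketch below wraps the same idea in a function and feeds it a hand-written entity list. It is a variant, not the code shown above: instead of the NotComparable helper it keeps a parallel list of labels, one per character. The shortened sample text, the span helper, the function name redact_fine_grained and the entity list are all assumptions made for illustration; only the label and location fields used earlier in this section are relied upon.

```python
from itertools import groupby

text = "ERIC G. BADORREK gave the address 35933 COLLINS LN, Sussex County, U.S.A."


def span(label, fragment):
    """Hypothetical helper: build an entity dict shaped like the ones used above."""
    start = text.index(fragment)
    return {"label": label, "location": {"stt_idx": start, "end_idx": start + len(fragment)}}


# Hand-written stand-in for resp[0]["entities"]; this is not real API output.
entities = [
    span("NAME", "ERIC G. BADORREK"),
    span("NAME_GIVEN", "ERIC"),
    span("NAME_FAMILY", "BADORREK"),
    span("LOCATION_ADDRESS", "35933 COLLINS LN, Sussex County, U.S.A."),
    span("LOCATION_ADDRESS_STREET", "35933 COLLINS LN"),
    span("LOCATION_COUNTRY", "U.S.A."),
]


def redact_fine_grained(text, entities):
    """Replace every entity span with its label; the finest label wins on overlaps."""
    ordered = sorted(
        entities,
        key=lambda e: (e["location"]["stt_idx"], -e["location"]["end_idx"], len(e["label"])),
    )
    labels = [None] * len(text)  # one slot per character; None means "keep as-is"
    for entity in ordered:
        start, end = entity["location"]["stt_idx"], entity["location"]["end_idx"]
        labels[start:end] = [entity["label"]] * (end - start)

    pieces = []
    # Group consecutive characters that share the same label (or no label).
    for label, run in groupby(zip(labels, text), key=lambda pair: pair[0]):
        chars = "".join(ch for _, ch in run)
        pieces.append(chars if label is None else f"[{label}]")
    return "".join(pieces)


redacted = redact_fine_grained(text, entities)
print(redacted)
# → [NAME_GIVEN][NAME][NAME_FAMILY] gave the address [LOCATION_ADDRESS_STREET][LOCATION_ADDRESS][LOCATION_COUNTRY]
```

Because unredacted runs are joined as whole substrings, repeated letters such as the double d in address are preserved without any special string class. As in the original approach, two adjacent entities that carry the same label collapse into a single marker.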

A parting note about privacy

You may wonder whether the redacted results achieved in this section could have been obtained more simply by disabling the NAME, LOCATION and LOCATION_ADDRESS entity types when making the request. While disabling entity types has its uses, the technique described above has the advantage of lowering the chance of leaking sensitive data.

Consider, for example, the words Sussex County, which are part of the provided address. These words are part of the LOCATION_ADDRESS entity but not part of any of its sub-entities. As a result, they would be left unredacted if both LOCATION and LOCATION_ADDRESS were disabled. The same applies to many other entities, such as Mount Everest, which is a LOCATION but does not match any location sub-entity. By disabling the LOCATION label, we leave these entities unredacted, which might not be desirable for some use cases.
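
The difference can be illustrated offline with a small sketch. It assumes that disabling an entity type behaves like removing entities with that label from the response, which is simulated here by filtering a hand-written entity list. The sample text, the span and overlay helpers and the offsets are hypothetical and are not real API output.

```python
text = "Address: 35933 COLLINS LN, Sussex County, Delaware"


def span(label, fragment):
    """Hypothetical helper: build an entity dict with the fields used in this section."""
    start = text.index(fragment)
    return {"label": label, "location": {"stt_idx": start, "end_idx": start + len(fragment)}}


# Hand-written entities for illustration only.
entities = [
    span("LOCATION_ADDRESS", "35933 COLLINS LN, Sussex County, Delaware"),
    span("LOCATION_ADDRESS_STREET", "35933 COLLINS LN"),
    span("LOCATION_STATE", "Delaware"),
]


def overlay(text, entities):
    """Replace entity spans with their labels; finer spans overwrite coarser ones."""
    labels = [None] * len(text)
    ordered = sorted(
        entities,
        key=lambda e: (e["location"]["stt_idx"], -e["location"]["end_idx"]),
    )
    for entity in ordered:
        start, end = entity["location"]["stt_idx"], entity["location"]["end_idx"]
        labels[start:end] = [entity["label"]] * (end - start)

    out, previous = [], None
    for ch, label in zip(text, labels):
        if label is None:
            out.append(ch)              # character not covered by any entity
        elif label != previous:
            out.append(f"[{label}]")    # emit each marker once per run
        previous = label
    return "".join(out)


# Simulates disabling the coarse entity types: only sub-entities remain.
disabled = {"LOCATION", "LOCATION_ADDRESS"}
leaky = overlay(text, [e for e in entities if e["label"] not in disabled])

# Keeps the coarse entities and lets the finer ones overwrite them.
safe = overlay(text, entities)

print(leaky)  # → Address: [LOCATION_ADDRESS_STREET], Sussex County, [LOCATION_STATE]
print(safe)   # → Address: [LOCATION_ADDRESS_STREET][LOCATION_ADDRESS][LOCATION_STATE]
```

In the filtered version, Sussex County survives in clear text because no sub-entity covers it, whereas the overlay version still hides it behind the coarser LOCATION_ADDRESS marker.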

© Copyright 2024 Private AI.