Customizing Detection

Info: In order to run the example code in this guide, please sign up for your free test API key here or run the container.

Each use case is different and you may sometimes need to adjust the types of entities that Private AI detects to address your specific requirements. This guide introduces a few techniques to modify and extend the detection engine to get the most from your data and is organized into two parts:

Part 1: Enabling and Disabling Entity Types covers enabling and disabling the types of entities that are detected.
Part 2: Filters covers allow and block filters, including how to use regular expressions.

Enabling and Disabling Entity Types

Private AI detects over 50 unique entity types, covering personal, credit card and medical information. By default, all non-beta supported types are detected, but this can easily be customized via entity selectors. If you need to comply with existing legislation such as GDPR or HIPAA, you may want to de-identify only the entities covered by that regulation. This can easily be done with preset entity groups. Alternatively, you may prefer to detect your own set of entities, which can also be done using entity selectors.

Info: An example Python script showing how to use entity selectors with Private AI's Python client can be found here.

Configuring Entity Selectors

Entity selectors let you enable or disable entity types as part of your API request. You can, for example, enable NAME and ORGANIZATION while ignoring all other entity types by using the ENABLE selector as shown in this request:

{
   "text": [
      "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
   ],
   "entity_detection": {
      "entity_types": [
         {
         "type": "ENABLE",
         "value": ["ORGANIZATION", "NAME"]
         }
      ]
   }
}

As expected, the name and organization mentions are redacted while other entities like dates (e.g. 2019) and conditions (e.g. COVID) are left untouched in the de-identified text:

Redacted text:

"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of COVID. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."

Full response:

[
  {
    "processed_text": "Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of COVID. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
    "entities": [
      {
        "processed_text": "ORGANIZATION_1",
        "text": "Icarus Airways Customer Service",
        "location": {
          "stt_idx": 12,
          "end_idx": 43,
          "stt_idx_processed": 12,
          "end_idx_processed": 28
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8451
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Icarus",
        "location": {
          "stt_idx": 134,
          "end_idx": 140,
          "stt_idx_processed": 119,
          "end_idx_processed": 135
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.6382
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Icarus",
        "location": {
          "stt_idx": 262,
          "end_idx": 268,
          "stt_idx_processed": 257,
          "end_idx_processed": 273
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.6274
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 366,
          "end_idx_processed": 374
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.8899
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 376,
          "end_idx_processed": 384
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.8863
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]

It is sometimes simpler to specify the entities to disable. This can be done using a DISABLE selector:

{
   "text": [
      "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
   ],
   "entity_detection": {
      "entity_types": [
         {
         "type": "DISABLE",
         "value": ["DATE", "DATE_INTERVAL"]
         }
      ]
   }
}

The above request will redact all entity types except dates and date intervals as shown by the corresponding output:

Redacted text:

"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."

Full response:

[
  {
    "processed_text": "Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
    "entities": [
      {
        "processed_text": "ORGANIZATION_1",
        "text": "Icarus Airways Customer Service",
        "location": {
          "stt_idx": 12,
          "end_idx": 43,
          "stt_idx_processed": 12,
          "end_idx_processed": 28
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8451
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Icarus",
        "location": {
          "stt_idx": 134,
          "end_idx": 140,
          "stt_idx_processed": 119,
          "end_idx_processed": 135
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.6382
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Icarus",
        "location": {
          "stt_idx": 262,
          "end_idx": 268,
          "stt_idx_processed": 257,
          "end_idx_processed": 273
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.6274
        }
      },
      {
        "processed_text": "CONDITION_1",
        "text": "COVID",
        "location": {
          "stt_idx": 291,
          "end_idx": 296,
          "stt_idx_processed": 296,
          "end_idx_processed": 309
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9187
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 374,
          "end_idx_processed": 382
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3601,
          "NAME": 0.8899,
          "NAME_FAMILY": 0.5475
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 384,
          "end_idx_processed": 392
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3714,
          "NAME": 0.8863,
          "NAME_FAMILY": 0.53
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]
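
The request bodies shown so far can also be assembled in code. The following is a minimal sketch in Python that only builds the JSON payload. The helper name `build_request` is hypothetical and is not part of Private AI's Python client, and sending the payload to the service is deliberately left out.

```python
import json


def build_request(text, enable=None, disable=None):
    """Build a request body with optional ENABLE and DISABLE selectors.

    Illustrative helper only; it is not part of any Private AI library.
    """
    selectors = []
    if enable:
        selectors.append({"type": "ENABLE", "value": list(enable)})
    if disable:
        selectors.append({"type": "DISABLE", "value": list(disable)})

    body = {"text": [text]}
    # Only add the entity_detection object when a selector was requested.
    if selectors:
        body["entity_detection"] = {"entity_types": selectors}
    return body


payload = build_request(
    "May I have your name and account number? Nessa Jonsson.",
    disable=["DATE", "DATE_INTERVAL"],
)
print(json.dumps(payload, indent=2))
```

Keeping the selectors as plain lists makes it straightforward to store them in configuration and reuse them across requests.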

Preset Entity Groups

If you need to comply with specific legislation such as HIPAA, the de-identification service makes it easy for you. You can simply choose from the list of preset entity groups: ['GDPR', 'GDPR_SENSITIVE', 'HIPAA', 'CPRA', 'QUEBEC_PRIVACY_ACT', 'APPI', 'APPI_SENSITIVE', 'PCI', 'PHI']. For details of what each group contains, please consult our entities page.

Here is an example of how to enable all entities covered by the GDPR legislation:

{
   "text": [
      "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
   ],
   "entity_detection": {
      "entity_types": [
         {
         "type": "ENABLE",
         "value": ["GDPR"]
         }
      ]
   }
}

In the response below, GDPR entities like NAME and CONDITION are redacted while ORGANIZATION mentions are not, since they are not part of the GDPR group:

Redacted text:

 "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."

Full response:

[
  {
    "processed_text": "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
    "entities": [
      {
        "processed_text": "CONDITION_1",
        "text": "COVID",
        "location": {
          "stt_idx": 291,
          "end_idx": 296,
          "stt_idx_processed": 291,
          "end_idx_processed": 304
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9187
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "Nessa Jonsson",
        "location": {
          "stt_idx": 361,
          "end_idx": 374,
          "stt_idx_processed": 369,
          "end_idx_processed": 377
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3601,
          "NAME": 0.8899,
          "NAME_FAMILY": 0.5475
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "N-E-S-S-A J-O-N-S-S-O-N",
        "location": {
          "stt_idx": 376,
          "end_idx": 399,
          "stt_idx_processed": 379,
          "end_idx_processed": 387
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.3714,
          "NAME": 0.8863,
          "NAME_FAMILY": 0.53
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 400,
    "languages_detected": {
      "en": 0.9167966246604919
    }
  }
]

Security Considerations

There are several reasons that may motivate the use of selectors to limit the set of entities:

You are working with data from a specific domain which may not contain some of the supported entities. For example, medical data is unlikely to contain PCI information like credit card numbers. Disabling these entities will prevent potential false positives.
Some entities, while present in your data, may not be regarded as sensitive in your use case. For example, your data may contain generic URLs and filenames that can't be used to identify individuals.

It is, however, important to understand that disabling entities may increase the risk that sensitive information is leaked. When selecting the list of entities to redact, we encourage you to take extra time and care to think through all the possible implications. A good practice is to have an expert validate that the redacted content is free of PII or other sensitive information. Following this validation, the list may be adjusted according to the findings.

Advanced Topics

This section presents advanced techniques to help you get the most out of the Private AI service.

Combining Selectors

It is possible to combine selectors to create the desired subset of entities. For example, you may need to comply with both GDPR and HIPAA. This can be done by listing these two groups in an ENABLE selector.

{
   "text": [
      "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
   ],
   "entity_detection": {
      "entity_types": [
         {
         "type": "ENABLE",
         "value": ["GDPR", "HIPAA"]
         }
      ]
   }
}

You can also mix entity groups and individual entity types in the same selector.

{
   "text": [
      "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
   ],
   "entity_detection": {
      "entity_types": [
         {
         "type": "ENABLE",
         "value": ["GDPR", "ORGANIZATION"]
         }
      ]
   }
}

The above request will redact all GDPR entities as well as ORGANIZATION.

It is also possible to combine ENABLE and DISABLE selectors.

{
   "text": [
      "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
   ],
   "entity_detection": {
      "entity_types": [
         {
         "type": "ENABLE",
         "value": ["GDPR"]
         },
         {
            "type": "DISABLE",
            "value": ["DATE", "DATE_INTERVAL"]
         }
      ]
   }
}

This request enables all GDPR entities except DATE and DATE_INTERVAL.

Selector Precedence

When combining several selectors, the list of enabled entities is first computed by expanding all entity types and groups in the ENABLE selectors. The entity types and groups listed in DISABLE selectors are then expanded and removed from that list to form the final list of entities.

When no ENABLE selectors are specified, all supported entities are assumed to be enabled. In this case, DISABLE selectors remove entities from the list of all supported entities.
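
The precedence rules can be sketched in a few lines of Python. This is only an illustration of the order of operations described above, not the service's implementation. The `ALL_ENTITIES` set and the `GROUPS` mapping are small made-up samples and do not reflect the real contents of any entity group.

```python
# Illustrative data only: these are NOT the real group definitions.
ALL_ENTITIES = {"NAME", "ORGANIZATION", "CONDITION", "DATE", "DATE_INTERVAL", "CREDIT_CARD"}
GROUPS = {
    "GDPR": {"NAME", "CONDITION", "DATE", "DATE_INTERVAL"},
    "PCI": {"CREDIT_CARD"},
}


def expand(values):
    """Expand group names into their member entity types."""
    expanded = set()
    for value in values:
        expanded |= GROUPS.get(value, {value})
    return expanded


def resolve(selectors):
    """Expand ENABLE selectors first, then remove everything listed in DISABLE."""
    enabled, disabled = set(), set()
    has_enable = False
    for selector in selectors:
        if selector["type"] == "ENABLE":
            has_enable = True
            enabled |= expand(selector["value"])
        elif selector["type"] == "DISABLE":
            disabled |= expand(selector["value"])
    # Without any ENABLE selector, start from every supported entity.
    if not has_enable:
        enabled = set(ALL_ENTITIES)
    return enabled - disabled


selectors = [
    {"type": "ENABLE", "value": ["GDPR"]},
    {"type": "DISABLE", "value": ["DATE", "DATE_INTERVAL"]},
]
print(sorted(resolve(selectors)))  # ['CONDITION', 'NAME']
```

Because DISABLE is applied last, the order in which selectors appear in the request does not change the result in this sketch.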

Selective Redaction

If you want to redact only one or two of the 50+ entity types that we support, you can use the ENABLE selector. For example, to redact only NAME_FAMILY and LOCATION_ADDRESS_STREET in the text below, use an ENABLE selector and specify these two entities:

{
   "text": [
      "Hello there! I am Alice McGee, residing at 325 Sophia St, Port Coquitlam. You can reach me at 235 123-9876 or via email at alicemcgee@gmail.com. I work at SFU."
   ], 
   "entity_detection": {
      "entity_types": [
         {
         "type": "ENABLE",
         "value": ["NAME_FAMILY", "LOCATION_ADDRESS_STREET"]
         }
      ]
   }
}

The above request will keep NAME_GIVEN, LOCATION_CITY, PHONE_NUMBER, EMAIL_ADDRESS and ORGANIZATION visible and redact only NAME_FAMILY and LOCATION_ADDRESS_STREET.

In other cases, you may want to redact only PCI entities and leave all other entities, such as ACCOUNT_NUMBER in the following example, unredacted:

{
   "text": [
      "His account number at WaveNow Digital is 6787655, and it is connected to his bank account at First National Bank, 987654321. But he also has a credit card on file, ending with 7876, expires on 10/25, and his billing address is 456 41st Avenue, Lower Valley."
   ], 
   "entity_detection": {
      "entity_types": [
         {
         "type": "ENABLE",
         "value": ["PCI"]
         }
      ],
      "enable_non_max_suppression": true
   }
}

In this example, BANK_ACCOUNT, CREDIT_CARD, CREDIT_CARD_EXPIRATION and CVV will be redacted, while ACCOUNT_NUMBER and LOCATION_ADDRESS will be visible.

Understanding Multi-Label Predictions

At the core of the Private AI service lies a model that performs named entity recognition (NER). In essence, NER models seek to classify each word in a text into a fixed set of classes: the entity types. NER is often framed as a multi-class multi-label problem since a single word can have several labels. For example, in Simon Fraser University the word Simon is both part of a name (i.e. Simon Fraser) and an organization (i.e. Simon Fraser University).

To create the de-identified or redacted output, the service must select, among the predicted labels, the one that best represents the entity. To do so, it will often prefer longer entities over shorter ones. For example, the service will redact Simon Fraser University as an ORGANIZATION:

I study at Simon Fraser University -> I study at [ORGANIZATION]

instead of a combination of a NAME and an ORGANIZATION:

I study at Simon Fraser University -> I study at [NAME] [ORGANIZATION]

This leads to a much more natural output for the users.

Disabling entities has a direct impact on that behaviour. Let's suppose that ORGANIZATION has been disabled when redacting the above text.

{
   "text": [
      "I study at Simon Fraser University"
   ],
   "entity_detection": {
      "entity_types": [
         {
         "type": "DISABLE",
         "value": ["ORGANIZATION"]
         }
      ]
   }
}

The resulting response might be surprising at first.

Redacted text:

"I study at [NAME_1] University"

Full response:

[
  {
    "processed_text": "I study at [NAME_1] University",
    "entities": [
      {
        "processed_text": "NAME_1",
        "text": "Simon Fraser",
        "location": {
          "stt_idx": 11,
          "end_idx": 23,
          "stt_idx_processed": 11,
          "end_idx_processed": 19
        },
        "best_label": "NAME",
        "labels": {
          "NAME_GIVEN": 0.2466,
          "NAME": 0.4709,
          "NAME_FAMILY": 0.2037
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 34,
    "languages_detected": {
      "en": 0.8937596678733826
    }
  }
]

Although ORGANIZATION was disabled, part of Simon Fraser University was redacted. This is because Simon Fraser is both part of an organization name and a person's name. Given that ORGANIZATION was disabled, the de-identification service picked the second-best label for these words, which is NAME.

Depending on your use case, you may prefer to keep the full organization name in the output. This can be done with the enable_non_max_suppression flag.

{
   "text": [
      "I study at Simon Fraser University"
   ],
   "entity_detection": {
      "entity_types": [
         {
         "type": "DISABLE",
         "value": ["ORGANIZATION"]
         }
      ],
      "enable_non_max_suppression": true
   }
}

When the enable_non_max_suppression flag is set to true, the service will ignore labels with lower likelihoods (i.e. NAME in the above example) therefore preventing the redaction of the ORGANIZATION as shown below.

Redacted text:

I study at Simon Fraser University

Full response:

[
  {
    "processed_text": "I study at Simon Fraser University",
    "entities": [],
    "entities_present": false,
    "characters_processed": 34,
    "languages_detected": {
      "en": 0.8937596678733826
    }
  }
]

Note that, as with disabled entities, the enable_non_max_suppression flag should be used cautiously. Setting this flag to true may increase the chance of leaking sensitive information.
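
The difference between the two behaviours can be pictured with a toy model. The sketch below is not the service's actual algorithm; it only mirrors the description in this section. The function `pick_label` is hypothetical, and the ORGANIZATION score is made up for illustration, since only the NAME score appears in the responses above.

```python
def pick_label(labels, disabled, non_max_suppression=False):
    """Toy model of label selection for one detected span.

    `labels` maps entity types to scores. Returns the label used for
    redaction, or None if the span is left in the text.
    """
    if not labels:
        return None
    best_overall = max(labels, key=labels.get)
    if non_max_suppression:
        # Only the top-scoring label counts; if it is disabled, nothing is redacted.
        return None if best_overall in disabled else best_overall
    # Otherwise fall back to the best label that is still enabled.
    enabled = {label: score for label, score in labels.items() if label not in disabled}
    return max(enabled, key=enabled.get) if enabled else None


# Hypothetical scores for the words "Simon Fraser".
scores = {"ORGANIZATION": 0.9, "NAME": 0.47}
print(pick_label(scores, disabled={"ORGANIZATION"}))                            # NAME
print(pick_label(scores, disabled={"ORGANIZATION"}, non_max_suppression=True))  # None
```

In this model, the flag trades recall for precision: spans whose strongest label is disabled are never redacted under a weaker label.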

Understanding Hierarchical Types

Some of the supported entities in the de-identification service are structured into hierarchies. The labels NAME and LOCATION are good examples.

The NAME hierarchy includes NAME_GIVEN and NAME_FAMILY but also NAME_MEDICAL_PROFESSIONAL, while the LOCATION hierarchy contains LOCATION_COUNTRY, LOCATION_STATE, LOCATION_CITY and so on. Entities forming hierarchies are easily identifiable as they share the same prefix (i.e. NAME or LOCATION in the above examples) followed by an underscore _.

When creating the redacted text, the de-identification service will prefer to use the most specific label in a hierarchy instead of the root label. For example, I live in Canada will be redacted as I live in [LOCATION_COUNTRY] instead of the more generic I live in [LOCATION]. This behaviour improves the usability of the data. If you are not interested in this level of granularity, you can leverage the fact that labels in a hierarchy share the same prefix: in a post-processing step, replace tokens like NAME_GIVEN and NAME_FAMILY with the root label NAME.
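
Such a post-processing step could look like the following Python sketch. It assumes that redaction markers are written in square brackets with an optional numeric suffix, as in the examples in this guide. The function name `collapse_to_root` and the choice of root labels are illustrative.

```python
import re

# Root labels whose sub-types should be collapsed (taken from the examples above).
ROOTS = ("NAME", "LOCATION")

# Matches markers such as [NAME_GIVEN_1] or [LOCATION_COUNTRY]; the optional
# trailing group keeps the numeric suffix when one is present.
_MARKER = re.compile(r"\[(" + "|".join(ROOTS) + r")(?:_[A-Z]+)+(_\d+)?\]")


def collapse_to_root(text):
    """Replace hierarchical markers with their root label."""
    return _MARKER.sub(lambda m: f"[{m.group(1)}{m.group(2) or ''}]", text)


print(collapse_to_root("I live in [LOCATION_COUNTRY]"))
# I live in [LOCATION]
print(collapse_to_root("[NAME_GIVEN_1] [NAME_FAMILY_1] works at [ORGANIZATION_1]"))
# [NAME_1] [NAME_1] works at [ORGANIZATION_1]
```

Note that collapsing can make markers that were distinct identical, as the second example shows, so it is only suitable when that distinction is not needed downstream.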

Filters

Note: This guide assumes that you have a working knowledge of regular expressions. Private AI uses the Python regular expression syntax. You can find more details in the Python re module documentation.

Sometimes referred to as whitelists and blacklists, filters are specifically designed to allow entities (i.e. leave them in the text) or block entities (i.e. redact them from the text) when the entity text follows an expected format. Filters are built using regular expressions.

Info: An example Python script showing how to use filters with Private AI's Python client can be found here.

Allow Filter

How can you redact regular phone numbers while keeping companies' toll-free numbers in the clear? How would you prevent document ID numbers from being detected as sensitive numbers? These are two examples of use cases that can be addressed with Allow filters.

Allow filters instruct the detection engine to ignore entities when the entity text matches a specific pattern. To create an Allow filter, first write a regular expression pattern, then add it to the filter list in the entity_detection object of your REST request. Let's look at a couple of examples.

Allow List

It is possible to feed lists of terms as well as regex patterns to filters. For example, if you want to prevent the detection engine from removing country names you can whitelist them with:

"entity_detection": {
      "filter": [
        {
          "type": "ALLOW",
          "pattern": "Canada|Brazil|Italy"
        }
      ]
    }
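
When the list of terms is long or comes from a file, the pattern can be generated instead of written by hand. The sketch below is one possible approach in Python. The helper `allow_list_filter` is hypothetical, and it escapes each term with `re.escape` so that characters such as `.` are matched literally.

```python
import re


def allow_list_filter(terms):
    """Build an ALLOW filter whose pattern matches any of the given terms literally."""
    # re.escape protects terms containing regex metacharacters, e.g. "U.S.".
    pattern = "|".join(re.escape(term) for term in terms)
    return {"type": "ALLOW", "pattern": pattern}


entity_detection = {"filter": [allow_list_filter(["Canada", "Brazil", "Italy"])]}
print(entity_detection["filter"][0]["pattern"])  # Canada|Brazil|Italy
```

The resulting `entity_detection` dictionary has the same shape as the fragment above and can be placed in the request body as-is.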

Allowing toll-free numbers

This is an example of a process/text request containing an Allow filter. When run, this request will detect and redact phone numbers unless they follow the specific format for toll-free numbers:

{
    "text": [
      "Call me at 438-555-7343 or at work at 1-800-555-1423"
    ],
    "entity_detection": {
      "filter": [
        {
          "type": "ALLOW",
          "pattern": "(1-)?(800|888|877|866|855|844|833)-\\d{3}-\\d{4}$"
        }
      ]
    }
}

This gives the following:

Redacted text:

Call me at [PHONE_NUMBER_1] or at work at 1-800-555-1423

Full response:

[
  {
    "processed_text": "Call me at [PHONE_NUMBER_1] or at work at 1-800-555-1423",
    "entities": [
      {
        "processed_text": "PHONE_NUMBER_1",
        "text": "438-555-7343",
        "location": {
          "stt_idx": 11,
          "end_idx": 23,
          "stt_idx_processed": 11,
          "end_idx_processed": 27
        },
        "best_label": "PHONE_NUMBER",
        "labels": {
          "PHONE_NUMBER": 0.9093
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 52,
    "languages_detected": {
      "en": 0.695495069026947
    }
  }
]

As expected, the first phone number is redacted while the second toll-free number is left in the processed text.

Escaping Regular Expressions

Many regular expressions contain backslashes \ and other special characters. It is important to note that the backslash \ is the escape character in JSON strings. As such, each backslash in a string must be escaped as \\ to retain its original meaning. As demonstrated in the example above, the regular expression r"(1-)?(800|888|877|866|855|844|833)-\d{3}-\d{4}$" was escaped to "(1-)?(800|888|877|866|855|844|833)-\\d{3}-\\d{4}$" in the JSON request body.
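
If the request body is built in code rather than typed by hand, a JSON library takes care of this escaping. The following Python sketch uses the standard `json` module to show the round trip; the request is only serialized here, not sent.

```python
import json

# Raw string: the regex exactly as it would be written for Python's re module.
pattern = r"(1-)?(800|888|877|866|855|844|833)-\d{3}-\d{4}$"

body = {
    "text": ["Call me at 438-555-7343 or at work at 1-800-555-1423"],
    "entity_detection": {"filter": [{"type": "ALLOW", "pattern": pattern}]},
}

encoded = json.dumps(body)
# In the serialized JSON each backslash is doubled automatically.
print(r"\\d{3}" in encoded)  # True
# Decoding restores the original pattern.
print(json.loads(encoded)["entity_detection"]["filter"][0]["pattern"] == pattern)  # True
```

Manual escaping is therefore only needed when the JSON is written directly, for example in a file or a command-line call.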

Allowing IDs

Let's look at a different example. Suppose that you are de-identifying contracts of the form:

CCT-2022-09-12321: Contract between John Doe and Acme Corp.

THIS AGREEMENT is made ...

It might be difficult for the detection engine to determine if the ID CCT-2022-09-12321 in the document header is sensitive. The sensitivity may depend for example on other information being publicly available. In this case, the detection engine will flag the IDs. However, if you know that these numbers are not sensitive you may prefer to instruct the detection engine to allow such entities:

{
    "text": [
      "CCT-2022-09-12321: Contract between John Doe and Acme Corp.\n\nTHIS AGREEMENT is made ..."
    ],
    "entity_detection": {
      "filter": [
        {
          "type": "ALLOW",
          "pattern": "CCT-\\d{4}-\\d{2}-\\d+"
        }
      ]
    }
}

This is the process/text response to the above request:

Redacted text:

CCT-2022-09-12321: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...

Full response:

[
  {
    "processed_text": "CCT-2022-09-12321: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...",
    "entities": [
      {
        "processed_text": "NAME_1",
        "text": "John Doe",
        "location": {
          "stt_idx": 36,
          "end_idx": 44,
          "stt_idx_processed": 36,
          "end_idx_processed": 44
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.9287,
          "NAME_GIVEN": 0.3926,
          "NAME_FAMILY": 0.2851
        }
      },
      {
        "processed_text": "ORGANIZATION_1",
        "text": "Acme Corp",
        "location": {
          "stt_idx": 49,
          "end_idx": 58,
          "stt_idx_processed": 49,
          "end_idx_processed": 65
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.885
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 89,
    "languages_detected": {
      "en": 0.9532792568206787
    }
  }
]

Without the Allow filter above, the ID CCT-2022-09-12321 would be redacted as NUMERICAL_PII. With the filter in place, it is left in the text as expected.

Block Filter

Let's say that you want to detect some codes or IDs that share a common format in your data. You can rely on the de-identification service to perform the redaction for you, but it may sometimes be preferable to create your own detection logic and provide a specific label for these entities. This is exactly what block filters are for.

Block List

Similar to the allow list or whitelist, you can create a block list or blacklist to ensure that some common keywords are always detected and removed like so:

"entity_detection": {
      "filter": [
        {
          "type": "BLOCK",
          "pattern": "Android|iPhone|Pixel",
          "entity_type": "CELL_TYPE"
        }
      ]
    }

Blocking IDs

Let's return to the contract example above. With the help of block filters, you can redact the contract ID as CONTRACT_ID in that document:

{
    "text": [
      "CCT-2022-09-12321: Contract between John Doe and Acme Corp.\n\nTHIS AGREEMENT is made ..."
    ],
    "entity_detection": {
      "filter": [
        {
          "type": "BLOCK",
          "pattern": "CCT-\\d{4}-\\d{2}-\\d+",
          "entity_type": "CONTRACT_ID"
        }
      ]
    }
}

This is the process/text response to the above request:

Redacted text:

[CONTRACT_ID_1]: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...

Full response:

[
  {
    "processed_text": "[CONTRACT_ID_1]: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...",
    "entities": [
      {
        "processed_text": "CONTRACT_ID_1",
        "text": "CCT-2022-09-12321",
        "location": {
          "stt_idx": 0,
          "end_idx": 17,
          "stt_idx_processed": 0,
          "end_idx_processed": 15
        },
        "best_label": "CONTRACT_ID",
        "labels": {
          "CONTRACT_ID": 1
        }
      },
      {
        "processed_text": "NAME_1",
        "text": "John Doe",
        "location": {
          "stt_idx": 36,
          "end_idx": 44,
          "stt_idx_processed": 34,
          "end_idx_processed": 42
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.9203,
          "NAME_GIVEN": 0.3381,
          "NAME_FAMILY": 0.1802
        }
      },
      {
        "processed_text": "ORGANIZATION_1",
        "text": "Acme Corp",
        "location": {
          "stt_idx": 49,
          "end_idx": 58,
          "stt_idx_processed": 47,
          "end_idx_processed": 63
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.7899
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 87,
    "languages_detected": {
      "en": 0.9520388245582581
    }
  }
]

As expected, the contract ID in the text has been redacted with our own custom marker. Here is another example with a more complex pattern to match.

Augmenting an Existing Entity Type

Attention: This is provided as an example and not as a complete solution to redact all ICD numbers.

In this example, we are detecting ICD-10 numbers and adding these entities to the existing CONDITION entity type:

{
  "text": [
    "ICD-10 References\nJ18.9 | Pneumonia\nE11.52 | Type 2 diabetes mellitus with certain circulatory complications"
  ],
  "entity_detection": {
    "filter": [
      {
        "type": "BLOCK",
        "pattern": "(?i)([a-t]|[v-z])\\d[a-z0-9](\\.[a-z0-9]{1,4})?",
        "entity_type": "CONDITION"
      }
    ]
  }
}
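
Before sending a block filter, it can help to try the pattern locally. The sketch below runs the same regular expression through Python's `re` module on the sample text. It assumes that the filter's regex engine treats this pattern the way `re` does, which this guide does not state.

```python
import re

# Pattern from the request above, with the JSON escaping removed
# (each "\\" in the JSON body is a single backslash in the regex).
ICD10_PATTERN = r"(?i)([a-t]|[v-z])\d[a-z0-9](\.[a-z0-9]{1,4})?"

text = (
    "ICD-10 References\n"
    "J18.9 | Pneumonia\n"
    "E11.52 | Type 2 diabetes mellitus with certain circulatory complications"
)

# Collect each match together with its start and end character offsets.
matches = [(m.group(0), m.start(), m.end()) for m in re.finditer(ICD10_PATTERN, text)]

for value, start, end in matches:
    print(f"{value!r} at [{start}, {end})")
# 'J18.9' at [18, 23)
# 'E11.52' at [36, 42)
```

Only the two codes match locally; the condition names are not covered by the pattern and are left to the detection engine.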

This is the process/text response to the above request:

Redacted text:

ICD-10 References\n[CONDITION_1] | [CONDITION_2]\n[CONDITION_3] | [CONDITION_4] with certain circulatory complications

Full response:

[
  {
    "processed_text": "ICD-10 References\n[CONDITION_1] | [CONDITION_2]\n[CONDITION_3] | [CONDITION_4] with certain circulatory complications",
    "entities": [
      {
        "processed_text": "CONDITION_1",
        "text": "J18.9",
        "location": {
          "stt_idx": 18,
          "end_idx": 23,
          "stt_idx_processed": 18,
          "end_idx_processed": 31
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 1
        }
      },
      {
        "processed_text": "CONDITION_2",
        "text": "Pneumonia",
        "location": {
          "stt_idx": 26,
          "end_idx": 35,
          "stt_idx_processed": 34,
          "end_idx_processed": 47
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.8982
        }
      },
      {
        "processed_text": "CONDITION_3",
        "text": "E11.52",
        "location": {
          "stt_idx": 36,
          "end_idx": 42,
          "stt_idx_processed": 48,
          "end_idx_processed": 61
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 1
        }
      },
      {
        "processed_text": "CONDITION_4",
        "text": "Type 2 diabetes mellitus",
        "location": {
          "stt_idx": 45,
          "end_idx": 69,
          "stt_idx_processed": 64,
          "end_idx_processed": 77
        },
        "best_label": "CONDITION",
        "labels": {
          "CONDITION": 0.9196
        }
      }
    ],
    "entities_present": true,
    "characters_processed": 108,
    "languages_detected": {
      "en": 0.5311049818992615
    }
  }
]

You can see that the results from the block filter and the detection engine have been combined to create a more comprehensive CONDITION entity type.
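
The location fields in the full response can be used to map each entity back to the input. The following is a minimal sketch using the offsets copied from the response above. It assumes they are zero-based character offsets with an exclusive end, which is consistent with the values shown; the tuple layout is only an illustration, not part of the API.

```python
text = (
    "ICD-10 References\n"
    "J18.9 | Pneumonia\n"
    "E11.52 | Type 2 diabetes mellitus with certain circulatory complications"
)
processed_text = (
    "ICD-10 References\n"
    "[CONDITION_1] | [CONDITION_2]\n"
    "[CONDITION_3] | [CONDITION_4] with certain circulatory complications"
)

# (text, stt_idx, end_idx, stt_idx_processed, end_idx_processed),
# copied from the "entities" list in the response above.
entities = [
    ("J18.9", 18, 23, 18, 31),
    ("Pneumonia", 26, 35, 34, 47),
    ("E11.52", 36, 42, 48, 61),
    ("Type 2 diabetes mellitus", 45, 69, 64, 77),
]

# Slice the input with the original offsets and the output with the processed offsets.
originals = [text[stt:end] for _, stt, end, _, _ in entities]
markers = [processed_text[stt_p:end_p] for _, _, _, stt_p, end_p in entities]

for original, marker in zip(originals, markers):
    print(f"{original!r} -> {marker}")
```

Each slice of the input reproduces the entity's text, and each slice of the processed text reproduces the bracketed marker that replaced it.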

Allow Text Filter (new in 3.7)

Allow Text filters are similar to Allow filters, but instead of allowing individual entities, they mark sections of your document as safe, so that no entities are detected in those sections and nothing in them is redacted or de-identified.

Let's consider a simple example.

Allowing a section of a document

Suppose that you have a document which contains a References section with public information only:

Conclusion
A section with sensitive information like name (e.g. John Doe) and organization (e.g. Acme Corp).

References
Berfin Akta¸s, Veronika Solopova, Annalena Kohnert, and Manfred Stede. 2020. Adapting Coreference Resolution to Twitter Conversations. In Findings of EMNLP.
Rahul Aralikatte, Heather Lent, Ana Valeria Gonzalez, Daniel Herschcovich, Chen Qiu, Anders Sandholm, Michael Ringaard, and Anders Søgaard. 2019. Rewarding Coreference Resolvers for Being Consistent with World Knowledge. In EMNLP-IJCNLP.

By default, this document would be redacted as:

Conclusion
A section with sensitive information like name (e.g. [NAME_1]) and organization (e.g. [ORGANIZATION_1]).

References
[NAME_2], [NAME_3], [NAME_4], and [NAME_5]. [DATE_INTERVAL_1]. Adapting Coreference Resolution to Twitter Conversations. In Findings of EMNLP.
[NAME_6], [NAME_7], [NAME_8], [NAME_9], [NAME_10], [NAME_11], [NAME_12], and [NAME_13]. [DATE_INTERVAL_2]. Rewarding Coreference Resolvers for Being Consistent with World Knowledge. In EMNLP-IJCNLP.

However, you may prefer not to de-identify the References section, since it is not sensitive. This can be done with an Allow Text filter (only the filter is shown in the request, for readability):

{
  "text": [ "..." ],
  "entity_detection": {
    "filter": [
      {
        "type": "ALLOW_TEXT",
        "pattern": "References\\s+([\\S\\s]+)"
      }
    ]
  }
}

This results in the following processed text:

Conclusion
A section with sensitive information like name (e.g. [NAME_1]) and organization (e.g. [ORGANIZATION_1]).

References
Berfin Akta¸s, Veronika Solopova, Annalena Kohnert, and Manfred Stede. 2020. Adapting Coreference Resolution to Twitter Conversations. In Findings of EMNLP.
Rahul Aralikatte, Heather Lent, Ana Valeria Gonzalez, Daniel Herschcovich, Chen Qiu, Anders Sandholm, Michael Ringaard, and Anders Søgaard. 2019. Rewarding Coreference Resolvers for Being Consistent with World Knowledge. In EMNLP-IJCNLP.

where the References section was not de-identified.
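
Regular expressions in a JSON body need every backslash doubled, which is easy to get wrong by hand. One way to avoid that, sketched below, is to write the pattern as a Python raw string and let `json.dumps` do the escaping. The field names are the ones shown in the request above; the sketch only builds and prints the body and does not send it anywhere.

```python
import json

# Raw string: the regex exactly as the filter should interpret it.
pattern = r"References\s+([\S\s]+)"

payload = {
    "text": ["..."],  # placeholder, as in the request above
    "entity_detection": {
        "filter": [
            {"type": "ALLOW_TEXT", "pattern": pattern},
        ]
    },
}

# json.dumps doubles each backslash, so the pattern appears in the
# serialized body as "References\\s+([\\S\\s]+)".
body = json.dumps(payload, indent=2)
print(body)
```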

Allow Text filters also support capturing groups in regular expressions.

Using capturing groups

Capturing groups are a very useful feature of regular expressions. By adding capturing groups to your regular expression, you can effectively dissect a matched text into the sections of interest.

Consider this document including an audit trail with the editor name and the date of the changes:

[Part 1] [John Doe: Fri Mar 10 16:09:20 GMT 2023]
[Conclusion] [John Hancock: March 14, 2023]

Let's say you want to de-identify the author names but keep the dates of the audit trail in your processed text. One approach is to use Allow filters. However, it can be difficult to write a regular expression that covers all possible date formats. Moreover, every date entity in the document would be allowed, not only those in the audit trail. This is where Allow Text filters and capturing groups become useful.

The following request contains an Allow Text filter for the audit trail above:

{
    "text": ["[Part 1] [John Doe: Fri Mar 10 16:09:20 GMT 2023]\n[Conclusion] [John Hancock: March 14, 2023]"],
    "entity_detection": {
        "filter": [{
            "type": "ALLOW_TEXT",
            "pattern": "\\[[^:]*:([^\\]]*)\\]"
        }]
    }
}

Notice the capturing group ([^\]]*) in the second part of the pattern. This group selects the date, that is, the text between the colon : and the closing square bracket ]. It tells the Allow Text filter that only this part of the match should be allowed. The request produces this processed text:

[Part 1] [[NAME_1]: Fri Mar 10 16:09:20 GMT 2023]
[Conclusion] [[NAME_2]: March 14, 2023]

where names are masked but dates are shown.

When groups are present, Allow Text filters will only allow the text matching the groups. This provides the flexibility you need to allow the section of text you want.
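
To preview which parts of a document a pattern with capturing groups would select, the groups can be inspected locally. This is a rough sketch that assumes Python's `re` module handles the pattern the same way as the filter's regex engine; it only shows what the group captures and does not call the API.

```python
import re

# Pattern from the request above, with the JSON escaping removed.
AUDIT_PATTERN = r"\[[^:]*:([^\]]*)\]"

text = (
    "[Part 1] [John Doe: Fri Mar 10 16:09:20 GMT 2023]\n"
    "[Conclusion] [John Hancock: March 14, 2023]"
)

# group(1) is the capturing group: the text after the first colon of each
# match, up to the closing square bracket.
allowed = [m.group(1) for m in re.finditer(AUDIT_PATTERN, text)]

for span in allowed:
    print(repr(span))
# ' Fri Mar 10 16:09:20 GMT 2023'
# ' March 14, 2023'
```

The captured spans contain the dates (with the space that follows the colon) and none of the names, which matches the processed text shown above.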

A word of caution

The regular expression pattern in a filter can be as complex as it needs to be in order to capture the specific text of interest. However, be careful not to write patterns that are too generic: an overly broad Block pattern can de-identify sections of your document unnecessarily and, worse, an overly broad Allow or Allow Text pattern can leave sensitive information unredacted.
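
One way to catch an overly generic pattern is to run it against text it should not match before using it in a filter. The sketch below does this for the ICD-10 pattern from this guide. The `bounded` variant and the sample sentence are hypothetical illustrations, and Python's `re` is assumed to behave like the filter's regex engine.

```python
import re


def find_matches(pattern: str, text: str) -> list[str]:
    """Return every substring of text that the pattern matches."""
    return [m.group(0) for m in re.finditer(pattern, text)]


# Pattern from the ICD-10 example in this guide.
broad = r"(?i)([a-t]|[v-z])\d[a-z0-9](\.[a-z0-9]{1,4})?"
# Hypothetical tighter variant: the same pattern anchored on word boundaries.
bounded = r"(?i)\b([a-t]|[v-z])\d[a-z0-9](\.[a-z0-9]{1,4})?\b"

# Sample text that contains no ICD-10 codes.
unrelated = "Take vitamin B12 daily; see invoice INV2023."

broad_hits = find_matches(broad, unrelated)
bounded_hits = find_matches(bounded, unrelated)

print(broad_hits)    # ['B12', 'V20']
print(bounded_hits)  # ['B12']
```

The broad pattern flags both `B12` and the `V20` inside an invoice number. Word boundaries remove the second hit but not the first, so a tighter pattern reduces false matches without replacing the need to test against representative data.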