Customizing Detection
info
In order to run the example code in this guide, please sign up for your free test API key here or run the container.
Each use case is different and you may sometimes need to adjust the types of entities that Private AI detects to address your specific requirements. This guide introduces a few techniques to modify and extend the detection engine to get the most from your data and is organized into two parts:
- Part 1: Enabling and Disabling Entity Types covers enabling and disabling the types of entities that are detected.
- Part 2: Filters covers allow & block list functionality and including regexes.
The techniques described in the following sections apply to most of the Private AI APIs. In particular, they can be used to customize detection in the NER route, the Process Text route, the File URI route and the File Base64 route. See the documentation for each route for details. For simplicity, the examples below use requests and responses from the Process Text route.
Enabling and Disabling Entity Types
Private AI detects over 50 unique entity types, ranging from personal information to credit card and medical information. By default, all non-beta supported types are detected, but this can easily be customized via entity selectors. If you need to comply with existing legislation such as GDPR or HIPAA, you may want to de-identify only the entities covered by that regulation. This can easily be done with preset entity groups. Alternatively, you may prefer to detect your own set of entities, which can also be done using entity selectors.
info
An example Python script showing how to use entity selectors with Private AI's Python client can be found here.
Configuring Entity Selectors
Entity selectors let you enable or disable entity types as part of your API request. You can, for example, enable NAME and ORGANIZATION while ignoring all other entity types by using the ENABLE selector as shown in this request:
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["ORGANIZATION", "NAME"]
}
]
}
}
As expected, the name and organization mentions are redacted while other entities like dates (e.g. 2019) and conditions (e.g. COVID) are left untouched in the de-identified text:
"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of COVID. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."
[
{
"processed_text": "Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of COVID. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
"entities": [
{
"processed_text": "ORGANIZATION_1",
"text": "Icarus Airways Customer Service",
"location": {
"stt_idx": 12,
"end_idx": 43,
"stt_idx_processed": 12,
"end_idx_processed": 28
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8451
}
},
{
"processed_text": "ORGANIZATION_2",
"text": "Icarus",
"location": {
"stt_idx": 134,
"end_idx": 140,
"stt_idx_processed": 119,
"end_idx_processed": 135
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.6382
}
},
{
"processed_text": "ORGANIZATION_2",
"text": "Icarus",
"location": {
"stt_idx": 262,
"end_idx": 268,
"stt_idx_processed": 257,
"end_idx_processed": 273
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.6274
}
},
{
"processed_text": "NAME_1",
"text": "Nessa Jonsson",
"location": {
"stt_idx": 361,
"end_idx": 374,
"stt_idx_processed": 366,
"end_idx_processed": 374
},
"best_label": "NAME",
"labels": {
"NAME": 0.8899
}
},
{
"processed_text": "NAME_1",
"text": "N-E-S-S-A J-O-N-S-S-O-N",
"location": {
"stt_idx": 376,
"end_idx": 399,
"stt_idx_processed": 376,
"end_idx_processed": 384
},
"best_label": "NAME",
"labels": {
"NAME": 0.8863
}
}
],
"entities_present": true,
"characters_processed": 400,
"languages_detected": {
"en": 0.9167966246604919
}
}
]
It is sometimes simpler to specify the entities to disable. This can be done using a DISABLE selector:
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"entity_detection": {
"entity_types": [
{
"type": "DISABLE",
"value": ["DATE", "DATE_INTERVAL"]
}
]
}
}
The above request will redact all entity types except dates and date intervals as shown by the corresponding output:
"Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."
[
{
"processed_text": "Hi! This is [ORGANIZATION_1]. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my [ORGANIZATION_2] Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in [ORGANIZATION_2] since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
"entities": [
{
"processed_text": "ORGANIZATION_1",
"text": "Icarus Airways Customer Service",
"location": {
"stt_idx": 12,
"end_idx": 43,
"stt_idx_processed": 12,
"end_idx_processed": 28
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.8451
}
},
{
"processed_text": "ORGANIZATION_2",
"text": "Icarus",
"location": {
"stt_idx": 134,
"end_idx": 140,
"stt_idx_processed": 119,
"end_idx_processed": 135
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.6382
}
},
{
"processed_text": "ORGANIZATION_2",
"text": "Icarus",
"location": {
"stt_idx": 262,
"end_idx": 268,
"stt_idx_processed": 257,
"end_idx_processed": 273
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.6274
}
},
{
"processed_text": "CONDITION_1",
"text": "COVID",
"location": {
"stt_idx": 291,
"end_idx": 296,
"stt_idx_processed": 296,
"end_idx_processed": 309
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 0.9187
}
},
{
"processed_text": "NAME_1",
"text": "Nessa Jonsson",
"location": {
"stt_idx": 361,
"end_idx": 374,
"stt_idx_processed": 374,
"end_idx_processed": 382
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.3601,
"NAME": 0.8899,
"NAME_FAMILY": 0.5475
}
},
{
"processed_text": "NAME_1",
"text": "N-E-S-S-A J-O-N-S-S-O-N",
"location": {
"stt_idx": 376,
"end_idx": 399,
"stt_idx_processed": 384,
"end_idx_processed": 392
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.3714,
"NAME": 0.8863,
"NAME_FAMILY": 0.53
}
}
],
"entities_present": true,
"characters_processed": 400,
"languages_detected": {
"en": 0.9167966246604919
}
}
]
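The two requests above differ only in their selector list, so the request body can be generated by a small helper. The following is a minimal sketch, not part of the Private AI client: the function name build_request and its parameters are hypothetical, and it only builds the JSON body shown in this guide. Sending the body to the Process Text route is left to the HTTP client of your choice.

```python
import json


def build_request(text, enable=None, disable=None):
    """Build a Process Text request body with optional entity selectors.

    `enable` and `disable` are lists of entity types or preset group names.
    """
    selectors = []
    if enable:
        selectors.append({"type": "ENABLE", "value": list(enable)})
    if disable:
        selectors.append({"type": "DISABLE", "value": list(disable)})

    body = {"text": [text]}
    # Only add entity_detection when at least one selector was requested
    if selectors:
        body["entity_detection"] = {"entity_types": selectors}
    return body


enable_only = build_request("Hi! This is Icarus Airways.", enable=["ORGANIZATION", "NAME"])
disable_only = build_request("Hi! This is Icarus Airways.", disable=["DATE", "DATE_INTERVAL"])

# Serialize to JSON before sending it as the request body
print(json.dumps(disable_only, indent=2))
```

Keeping the selector lists in one place makes it easier to review exactly which entity types a given pipeline enables or disables.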
Preset Entity Groups
If you need to comply with specific legislation such as HIPAA, the de-identification service makes it easy. Simply choose from the list of preset entity groups: ['GDPR', 'GDPR_SENSITIVE', 'HIPAA_SAFE_HARBOR', 'CPRA', 'QUEBEC_PRIVACY_ACT', 'APPI', 'APPI_SENSITIVE', 'PCI', 'HEALTH_INFORMATION']. For details of what is contained in each group, please consult our entities page.
This is an example of how to enable all entities covered by the GDPR legislation:
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["GDPR"]
}
]
}
}
In the response below, GDPR entities like NAME and CONDITION are redacted while ORGANIZATION mentions are not, since they are not part of GDPR:
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1]."
[
{
"processed_text": "Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of [CONDITION_1]. What happened? May I have your name and account number, ma’am? [NAME_1], [NAME_1].",
"entities": [
{
"processed_text": "CONDITION_1",
"text": "COVID",
"location": {
"stt_idx": 291,
"end_idx": 296,
"stt_idx_processed": 291,
"end_idx_processed": 304
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 0.9187
}
},
{
"processed_text": "NAME_1",
"text": "Nessa Jonsson",
"location": {
"stt_idx": 361,
"end_idx": 374,
"stt_idx_processed": 369,
"end_idx_processed": 377
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.3601,
"NAME": 0.8899,
"NAME_FAMILY": 0.5475
}
},
{
"processed_text": "NAME_1",
"text": "N-E-S-S-A J-O-N-S-S-O-N",
"location": {
"stt_idx": 376,
"end_idx": 399,
"stt_idx_processed": 379,
"end_idx_processed": 387
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.3714,
"NAME": 0.8863,
"NAME_FAMILY": 0.53
}
}
],
"entities_present": true,
"characters_processed": 400,
"languages_detected": {
"en": 0.9167966246604919
}
}
]
Security Considerations
There are several reasons that may motivate the use of selectors to limit the set of entities:
- You are working with data from a specific domain which may not contain some of the supported entities. For example, medical data are unlikely to contain PCI information like credit card numbers. Disabling these entities will prevent potential false positives.
- Some entities, while present in your data, may not be regarded as sensitive in your use case. For example, your data may contain generic URLs and filenames that can't be used to identify individuals.
It is, however, important to understand that disabling entities may increase the risk that sensitive information is leaked. When selecting the list of entities to redact, we encourage you to take extra time and care to think through all the possible implications. A good practice is to have an expert validate that the redacted content is free of PII or other sensitive information. Following this validation, the list may be adjusted according to the findings.
Advanced Topics
This section presents advanced techniques to help you get the most out of the Private AI service.
Combining Selectors
It is possible to combine selectors to create the desired subset of entities. For example, you may need to comply with both GDPR and HIPAA. This can be done by listing these two groups in an ENABLE selector:
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["GDPR", "HIPAA_SAFE_HARBOR"]
}
]
}
}
You can also mix groups and individual entity types in the same selector:
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["GDPR", "ORGANIZATION"]
}
]
}
}
The above request will redact all GDPR entities as well as ORGANIZATION.
It is also possible to combine ENABLE and DISABLE selectors:
{
"text": [
"Hi! This is Icarus Airways Customer Service. How may I be of assistance to you? Hello! I’d like to complain about a discrepancy in my Icarus Frequent Voyager points. I checked my account today. I had 5,000 points in 2019. Now it’s just 3,500. I haven’t flown in Icarus since 2019 because of COVID. What happened? May I have your name and account number, ma’am? Nessa Jonsson, N-E-S-S-A J-O-N-S-S-O-N."
],
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["GDPR"]
},
{
"type": "DISABLE",
"value": ["DATE", "DATE_INTERVAL"]
}
]
}
}
This request enables all GDPR entities except DATE and DATE_INTERVAL.
Selector Precedence
When combining several selectors, the list of enabled entities is first computed by expanding all entity types and groups in the ENABLE selectors. The entity types and groups listed in DISABLE selectors are then expanded and removed from that list to form the final list of entities.
When no ENABLE selectors are specified, all supported entities are assumed to be enabled. In this case, DISABLE selectors remove entries from the list of all supported entities.
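The precedence rule can be sketched as plain set arithmetic. This is an illustration of the rule as described above, not the service's implementation. The group names GROUP_A and GROUP_B and their members are toy values made up for this example; the real membership of each preset group is listed on the entities page.

```python
# Toy definitions for illustration only (hypothetical groups and membership)
TOY_GROUPS = {
    "GROUP_A": {"NAME", "DATE", "DATE_INTERVAL", "CONDITION"},
    "GROUP_B": {"CREDIT_CARD", "CVV"},
}
TOY_ALL_ENTITIES = {
    "NAME", "DATE", "DATE_INTERVAL", "CONDITION",
    "CREDIT_CARD", "CVV", "ORGANIZATION",
}


def expand(values, groups):
    """Replace each group name by its members; keep plain entity types as is."""
    expanded = set()
    for value in values:
        expanded |= groups.get(value, {value})
    return expanded


def resolve(selectors, groups=TOY_GROUPS, all_entities=TOY_ALL_ENTITIES):
    """Return the final set of entity types for a list of selectors."""
    enabled, disabled = set(), set()
    has_enable = False
    for selector in selectors:
        if selector["type"] == "ENABLE":
            has_enable = True
            enabled |= expand(selector["value"], groups)
        elif selector["type"] == "DISABLE":
            disabled |= expand(selector["value"], groups)
    # Without any ENABLE selector, start from every supported entity
    if not has_enable:
        enabled = set(all_entities)
    # DISABLE is applied last, so it wins over ENABLE
    return enabled - disabled


final = resolve([
    {"type": "ENABLE", "value": ["GROUP_A"]},
    {"type": "DISABLE", "value": ["DATE", "DATE_INTERVAL"]},
])
print(sorted(final))
```

Because DISABLE is subtracted last, the order in which selectors appear in the request does not change the result in this sketch.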
Selective redaction
If you want to redact only one or two of the 50+ entity types that we support, you can use the ENABLE selector. For example, to redact only NAME_FAMILY and LOCATION_ADDRESS_STREET in the text below, use an ENABLE selector that lists these two entities:
{
"text": [
"Hello there! I am Alice McGee, residing at 325 Sophia St, Port Coquitlam. You can reach me at 235 123-9876 or via email at alicemcgee@gmail.com. I work at SFU."
],
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["NAME_FAMILY", "LOCATION_ADDRESS_STREET"]
}
]
}
}
The above request will keep NAME_GIVEN, LOCATION_CITY, PHONE_NUMBER, EMAIL_ADDRESS and ORGANIZATION visible and redact only NAME_FAMILY and LOCATION_ADDRESS_STREET.
In other cases, you may want to redact only PCI entities and keep all other entities unredacted, such as ACCOUNT_NUMBER in the following example:
{
"text": [
"His account number at WaveNow Digital is 6787655, and it is connected to his bank account at First National Bank, 987654321. But he also has a credit card on file, ending with 7876, expires on 10/25, and his billing address is 456 41st Avenue, Lower Valley."
],
"entity_detection": {
"entity_types": [
{
"type": "ENABLE",
"value": ["PCI"]
}
],
"enable_non_max_suppression": true
}
}
In this example, BANK_ACCOUNT, CREDIT_CARD, CREDIT_CARD_EXPIRATION and CVV will be redacted, while ACCOUNT_NUMBER and LOCATION_ADDRESS will be visible.
Understanding Multi-Label Predictions
At the core of the Private AI service lies a model that performs named entity recognition (NER). In essence, NER models seek to classify each word in a text into a fixed set of classes: the entity types. NER is often framed as a multi-class multi-label problem since a single word can have several labels. For example, in Simon Fraser University the word Simon is both part of a name (i.e. Simon Fraser) and an organization (i.e. Simon Fraser University).
To create the de-identified or redacted output, the service must select, among the predicted labels, the one that best represents the entity. To do so, it will often prefer longer entities over shorter ones. For example, the service will redact Simon Fraser University as an ORGANIZATION:
I study at Simon Fraser University -> I study at [ORGANIZATION]
instead of a combination of a NAME and an ORGANIZATION:
I study at Simon Fraser University -> I study at [NAME] [ORGANIZATION]
This leads to a much more natural output for the users.
Disabling entities has a direct impact on that behaviour. Suppose that ORGANIZATION has been disabled when redacting the above text:
{
"text": [
"I study at Simon Fraser University"
],
"entity_detection": {
"entity_types": [
{
"type": "DISABLE",
"value": ["ORGANIZATION"]
}
]
}
}
The resulting response might be surprising at first.
"I study at [NAME_1] University"
[
{
"processed_text": "I study at [NAME_1] University",
"entities": [
{
"processed_text": "NAME_1",
"text": "Simon Fraser",
"location": {
"stt_idx": 11,
"end_idx": 23,
"stt_idx_processed": 11,
"end_idx_processed": 19
},
"best_label": "NAME",
"labels": {
"NAME_GIVEN": 0.2466,
"NAME": 0.4709,
"NAME_FAMILY": 0.2037
}
}
],
"entities_present": true,
"characters_processed": 34,
"languages_detected": {
"en": 0.8937596678733826
}
}
]
Although ORGANIZATION was disabled, part of Simon Fraser University was redacted. This is because Simon Fraser is both part of an organization name and a person name. Given that ORGANIZATION was disabled, the de-identification service picked the second-best label for these words, which is NAME.
Depending on your use case, you may prefer to keep the full organization name in the output. This can be done with the enable_non_max_suppression flag:
{
"text": [
"I study at Simon Fraser University"
],
"entity_detection": {
"entity_types": [
{
"type": "DISABLE",
"value": ["ORGANIZATION"]
}
],
"enable_non_max_suppression": true
}
}
When the enable_non_max_suppression flag is set to true, the service ignores labels with lower likelihoods (i.e. NAME in the above example), thereby preventing the redaction of the ORGANIZATION, as shown below.
I study at Simon Fraser University
[
{
"processed_text": "I study at Simon Fraser University",
"entities": [],
"entities_present": false,
"characters_processed": 34,
"languages_detected": {
"en": 0.8937596678733826
}
}
]
Note that, as with disabled entities, the enable_non_max_suppression flag should be used cautiously. Setting this flag to true may increase the chance of leaking sensitive information.
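The effect of the flag can be pictured with a small label-selection function. This is a deliberately simplified sketch of the behaviour described in this section, assuming one set of label scores per span. The actual service works on tokens and also weighs span length, and the scores used here are hypothetical.

```python
def pick_label(labels, enabled, non_max_suppression=False):
    """Return the label used for redaction, or None to leave the text as is.

    labels:  mapping of label name to likelihood for one span
    enabled: set of entity types left after applying the selectors
    """
    if not labels:
        return None
    if non_max_suppression:
        # Only the most likely label is considered; lower ones are ignored
        best = max(labels, key=labels.get)
        return best if best in enabled else None
    # Otherwise fall back to the most likely label that is still enabled
    candidates = {name: score for name, score in labels.items() if name in enabled}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical scores for the span "Simon Fraser"
scores = {"ORGANIZATION": 0.80, "NAME": 0.47}
enabled = {"NAME"}  # ORGANIZATION has been disabled

print(pick_label(scores, enabled))                            # falls back to NAME
print(pick_label(scores, enabled, non_max_suppression=True))  # nothing is redacted
```

The second call shows why the flag can leak information: once the top label is disabled, the span is left in the clear even though a lower-ranked label would have redacted it.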
Understanding Hierarchical Types
Some of the supported entities in the de-identification service are structured into hierarchies. The labels NAME and LOCATION are good examples.
The NAME hierarchy includes NAME_GIVEN and NAME_FAMILY but also NAME_MEDICAL_PROFESSIONAL, while the LOCATION hierarchy contains LOCATION_COUNTRY, LOCATION_STATE, LOCATION_CITY and so on. Entities forming hierarchies are easily identifiable as they share the same prefix (i.e. NAME or LOCATION in the above examples) followed by an underscore _.
When creating the redacted text, the de-identification service prefers the most specific label in a hierarchy over the root label. For example, I live in Canada will be redacted as I live in [LOCATION_COUNTRY] instead of the more generic I live in [LOCATION]. This behaviour improves the usability of the data. If you do not need this level of granularity, you can leverage the fact that labels in a hierarchy share the same prefix: as a post-processing step, replace tokens like NAME_GIVEN and NAME_FAMILY with the root label NAME.
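The post-processing step could look like the sketch below. It assumes redaction markers of the form [LABEL] or [LABEL_N], as seen in the responses in this guide, and only collapses the NAME and LOCATION hierarchies. The function name collapse_to_root is hypothetical.

```python
import re

# Root labels whose sub-types should be collapsed
ROOT_LABELS = ("NAME", "LOCATION")

# Matches e.g. [NAME_GIVEN_1] or [LOCATION_COUNTRY], but not [NAME_1]
_MARKER = re.compile(
    r"\[(" + "|".join(ROOT_LABELS) + r")(?:_[A-Z]+)+(_\d+)?\]"
)


def collapse_to_root(processed_text):
    """Replace hierarchical markers by their root label, keeping the counter."""
    def _root(match):
        counter = match.group(2) or ""
        return "[" + match.group(1) + counter + "]"
    return _MARKER.sub(_root, processed_text)


print(collapse_to_root("I live in [LOCATION_COUNTRY]"))
# I live in [LOCATION]
print(collapse_to_root("[NAME_GIVEN_1] [NAME_FAMILY_1] works at [ORGANIZATION_1]"))
# [NAME_1] [NAME_1] works at [ORGANIZATION_1]
```

Note that counters from different sub-types can collide after collapsing: in the second example, the given name and the family name both become [NAME_1]. If distinct markers matter for your use case, renumber them in the same pass.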
Filters
note
This guide assumes that you have a working knowledge of regular expressions. Private AI uses the Python regular expression syntax. You can find more details in the Python re module documentation.
Sometimes referred to as whitelists and blacklists, filters are specifically designed to allow entities (i.e. leave them in the text) or block entities (i.e. redact them from the text) when the entity text follows an expected format. Filters are built using regular expressions.
info
An example Python script showing how to use filters with Private AI's Python client can be found here.
Allow Filter
How can you redact regular phone numbers while keeping companies' toll-free numbers in the clear? How would you prevent document ID numbers from being detected as sensitive numbers? These are two examples of use cases that can be addressed with Allow filters.
Allow filters instruct the detection engine to ignore entities when the entity text matches a specific pattern. To create an Allow filter, first write a regular expression pattern, then add it to the filter list in the entity_detection object of your REST request. Let's look at a couple of examples.
Allow List
It is possible to feed lists of terms as well as regex patterns to filters. For example, if you want to prevent the detection engine from removing country names you can whitelist them with:
"entity_detection": {
"filter": [
{
"type": "ALLOW",
"pattern": "Canada|Brazil|Italy"
}
]
}
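When the list of terms comes from a file or a database, the pattern can be generated rather than written by hand. The sketch below is one way to do it under the assumption that terms should be matched literally. The helper name terms_to_pattern is hypothetical; re.escape is used so that characters such as . or ( in a term are not interpreted as regex syntax.

```python
import re


def terms_to_pattern(terms):
    """Join literal terms into a single alternation pattern."""
    # Drop duplicates while keeping the input order, then put longer terms
    # first so that a short term cannot shadow a longer one sharing its prefix
    unique = list(dict.fromkeys(terms))
    ordered = sorted(unique, key=len, reverse=True)
    return "|".join(re.escape(term) for term in ordered)


allow_filter = {
    "type": "ALLOW",
    "pattern": terms_to_pattern(["Canada", "Brazil", "Italy"]),
}
print(allow_filter["pattern"])
# Canada|Brazil|Italy
```

The resulting dictionary can be placed in the filter list of the entity_detection object as in the fragment above.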
Allowing toll-free numbers
This is an example of a process/text request containing an Allow filter. When run, this request will detect and redact phone numbers unless they follow the specific format for toll-free numbers:
{
"text": [
"Call me at 438-555-7343 or at work at 1-800-555-1423"
],
"entity_detection": {
"filter": [
{
"type": "ALLOW",
"pattern": "(1-)?(800|888|877|866|855|844|833)-\\d{3}-\\d{4}$"
}
]
}
}
This gives the following response:
Call me at [PHONE_NUMBER_1] or at work at 1-800-555-1423
[
{
"processed_text": "Call me at [PHONE_NUMBER_1] or at work at 1-800-555-1423",
"entities": [
{
"processed_text": "PHONE_NUMBER_1",
"text": "438-555-7343",
"location": {
"stt_idx": 11,
"end_idx": 23,
"stt_idx_processed": 11,
"end_idx_processed": 27
},
"best_label": "PHONE_NUMBER",
"labels": {
"PHONE_NUMBER": 0.9093
}
}
],
"entities_present": true,
"characters_processed": 52,
"languages_detected": {
"en": 0.695495069026947
}
}
]
As expected, the first phone number is redacted while the second toll-free number is left in the processed text.
Escaping Regular Expressions
Many regular expressions contain backslashes \ and other special characters. It is important to note that the backslash \ is a reserved character in JSON. As such, backslashes in strings must be escaped as \\ to retain their original meaning. As demonstrated in the example above, the regular expression r"(1-)?(800|888|877|866|855|844|833)-\d{3}-\d{4}$" was escaped to "(1-)?(800|888|877|866|855|844|833)-\\d{3}-\\d{4}$" in the JSON request body.
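If the request body is produced by a JSON serializer, the escaping is handled for you. The short Python sketch below writes the pattern once as a raw string, checks it locally with the re module, and lets json.dumps double the backslashes. Checking with re.search is only an approximation of how the service applies the pattern to detected entities.

```python
import json
import re

# Raw string: backslashes are written exactly as in the regular expression
TOLL_FREE = r"(1-)?(800|888|877|866|855|844|833)-\d{3}-\d{4}$"

# Quick local sanity check before sending anything
assert re.search(TOLL_FREE, "1-800-555-1423")
assert not re.search(TOLL_FREE, "438-555-7343")

body = json.dumps({
    "text": ["Call me at 438-555-7343 or at work at 1-800-555-1423"],
    "entity_detection": {
        "filter": [{"type": "ALLOW", "pattern": TOLL_FREE}]
    },
})

# The serialized body contains \\d, which decodes back to \d
print(body)
```

Escaping by hand is only needed when the JSON is typed directly, for example in a curl command or an API console.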
Allowing IDs
Let's look at a different example. Suppose that you are de-identifying contracts of the form:
CCT-2022-09-12321: Contract between John Doe and Acme Corp.
THIS AGREEMENT is made ...
It might be difficult for the detection engine to determine if the ID CCT-2022-09-12321 in the document header is sensitive. The sensitivity may depend, for example, on other information being publicly available. In this case, the detection engine will flag the IDs. However, if you know that these numbers are not sensitive, you may prefer to instruct the detection engine to allow such entities:
{
"text": [
"CCT-2022-09-12321: Contract between John Doe and Acme Corp.\n\nTHIS AGREEMENT is made ..."
],
"entity_detection": {
"filter": [
{
"type": "ALLOW",
"pattern": "CCT-\\d{4}-\\d{2}-\\d+"
}
]
}
}
This is the process/text response to the above request:
CCT-2022-09-12321: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...
[
{
"processed_text": "CCT-2022-09-12321: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...",
"entities": [
{
"processed_text": "NAME_1",
"text": "John Doe",
"location": {
"stt_idx": 36,
"end_idx": 44,
"stt_idx_processed": 36,
"end_idx_processed": 44
},
"best_label": "NAME",
"labels": {
"NAME": 0.9287,
"NAME_GIVEN": 0.3926,
"NAME_FAMILY": 0.2851
}
},
{
"processed_text": "ORGANIZATION_1",
"text": "Acme Corp",
"location": {
"stt_idx": 49,
"end_idx": 58,
"stt_idx_processed": 49,
"end_idx_processed": 65
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.885
}
}
],
"entities_present": true,
"characters_processed": 89,
"languages_detected": {
"en": 0.9532792568206787
}
}
]
Without the Allow filter above, the ID CCT-2022-09-12321 would be redacted as NUMERICAL_PII. With the filter in place, it is left untouched, as expected.
Block Filter
Let's say that you want to detect codes or IDs that share a common format in your data. You can rely on the de-identification service to perform the redaction for you, but it may sometimes be preferable to create your own detection logic and provide a specific label for these entities. This is exactly what block filters are for.
Block List
Similar to the allow list or whitelist, you can create a block list or blacklist to ensure that some common keywords are always detected and removed like so:
"entity_detection": {
"filter": [
{
"type": "BLOCK",
"pattern": "Android|iPhone|Pixel",
"entity_type": "CELL_TYPE"
}
]
}
Blocking IDs
Let's revisit the contract example above. With the help of block filters, you can redact the contract ID as CONTRACT_ID:
{
"text": [
"CCT-2022-09-12321: Contract between John Doe and Acme Corp.\n\nTHIS AGREEMENT is made ..."
],
"entity_detection": {
"filter": [
{
"type": "BLOCK",
"pattern": "CCT-\\d{4}-\\d{2}-\\d+",
"entity_type": "CONTRACT_ID"
}
]
}
}
This is the process/text response to the above request:
[CONTRACT_ID_1]: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...
[
{
"processed_text": "[CONTRACT_ID_1]: Contract between [NAME_1] and [ORGANIZATION_1].\n\nTHIS AGREEMENT is made ...",
"entities": [
{
"processed_text": "CONTRACT_ID_1",
"text": "CCT-2022-09-12321",
"location": {
"stt_idx": 0,
"end_idx": 17,
"stt_idx_processed": 0,
"end_idx_processed": 15
},
"best_label": "CONTRACT_ID",
"labels": {
"CONTRACT_ID": 1
}
},
{
"processed_text": "NAME_1",
"text": "John Doe",
"location": {
"stt_idx": 36,
"end_idx": 44,
"stt_idx_processed": 34,
"end_idx_processed": 42
},
"best_label": "NAME",
"labels": {
"NAME": 0.9203,
"NAME_GIVEN": 0.3381,
"NAME_FAMILY": 0.1802
}
},
{
"processed_text": "ORGANIZATION_1",
"text": "Acme Corp",
"location": {
"stt_idx": 49,
"end_idx": 58,
"stt_idx_processed": 47,
"end_idx_processed": 63
},
"best_label": "ORGANIZATION",
"labels": {
"ORGANIZATION": 0.7899
}
}
],
"entities_present": true,
"characters_processed": 87,
"languages_detected": {
"en": 0.9520388245582581
}
}
]
As expected, the contract ID in the text has been redacted with our own custom marker. Here is another example with a more complex pattern to match.
Augmenting existing entity type
attention
This is provided as an example and not as a complete solution to redact all ICD numbers.
In this example, we are detecting ICD-10 numbers and adding these entities to the existing CONDITION entity type:
{
"text": [
"ICD-10 References\nJ18.9 | Pneumonia\nE11.52 | Type 2 diabetes mellitus with certain circulatory complications"
],
"entity_detection": {
"filter": [
{
"type": "BLOCK",
"pattern": "(?i)([a-t]|[v-z])\\d[a-z0-9](\\.[a-z0-9]{1,4})?",
"entity_type": "CONDITION"
}
]
}
}
This is the process/text response to the above request:
ICD-10 References\n[CONDITION_1] | [CONDITION_2]\n[CONDITION_3] | [CONDITION_4] with certain circulatory complications
[
{
"processed_text": "ICD-10 References\n[CONDITION_1] | [CONDITION_2]\n[CONDITION_3] | [CONDITION_4] with certain circulatory complications",
"entities": [
{
"processed_text": "CONDITION_1",
"text": "J18.9",
"location": {
"stt_idx": 18,
"end_idx": 23,
"stt_idx_processed": 18,
"end_idx_processed": 31
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 1
}
},
{
"processed_text": "CONDITION_2",
"text": "Pneumonia",
"location": {
"stt_idx": 26,
"end_idx": 35,
"stt_idx_processed": 34,
"end_idx_processed": 47
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 0.8982
}
},
{
"processed_text": "CONDITION_3",
"text": "E11.52",
"location": {
"stt_idx": 36,
"end_idx": 42,
"stt_idx_processed": 48,
"end_idx_processed": 61
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 1
}
},
{
"processed_text": "CONDITION_4",
"text": "Type 2 diabetes mellitus",
"location": {
"stt_idx": 45,
"end_idx": 69,
"stt_idx_processed": 64,
"end_idx_processed": 77
},
"best_label": "CONDITION",
"labels": {
"CONDITION": 0.9196
}
}
],
"entities_present": true,
"characters_processed": 108,
"languages_detected": {
"en": 0.5311049818992615
}
}
]
You can see that the results from the block filter and the detection engine have been combined to create a more comprehensive CONDITION entity type.
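Before adding a block pattern to a request, it can help to run it locally against sample text to see what it matches and what it matches by accident. The sketch below uses Python's re module with the ICD-10 pattern from the request above. Applying the pattern with re.finditer over the whole text is an assumption made for this check and may not mirror exactly how the service applies block filters.

```python
import re

ICD10 = r"(?i)([a-t]|[v-z])\d[a-z0-9](\.[a-z0-9]{1,4})?"


def find_matches(pattern, text):
    """Return every substring of `text` matched by `pattern`."""
    return [match.group(0) for match in re.finditer(pattern, text)]


sample = (
    "ICD-10 References\n"
    "J18.9 | Pneumonia\n"
    "E11.52 | Type 2 diabetes mellitus with certain circulatory complications"
)
print(find_matches(ICD10, sample))
# ['J18.9', 'E11.52']

# The same pattern also matches strings that are not ICD-10 codes
print(find_matches(ICD10, "Meet me at gate B22"))
# ['B22']
```

The second call illustrates why the pattern is presented as an example rather than a complete solution: any letter followed by two alphanumeric characters starting with a digit qualifies, so unrelated codes would be labelled as CONDITION too.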
Allow Text Filter (new in 3.7)
Allow text filters are similar to Allow filters but instead of allowing individual entities, they "mark" sections of your document as safe so that no entities are detected and nothing is redacted or de-identified.
Let's consider a simple example.
Allowing a section of a document
Suppose that you have a document which contains a References section with public information only:
Conclusion
A section with sensitive information like name (e.g. John Doe) and organization (e.g. Acme Corp).
References
Berfin Aktaş, Veronika Solopova, Annalena Kohnert, and Manfred Stede. 2020. Adapting Coreference Resolution to Twitter Conversations. In Findings of EMNLP.
Rahul Aralikatte, Heather Lent, Ana Valeria Gonzalez, Daniel Herschcovich, Chen Qiu, Anders Sandholm, Michael Ringaard, and Anders Søgaard. 2019. Rewarding Coreference Resolvers for Being Consistent with World Knowledge. In EMNLP-IJCNLP.
By default, this document would be redacted as:
Conclusion
A section with sensitive information like name (e.g. [NAME_1]) and organization (e.g. [ORGANIZATION_1]).
References
[NAME_2], [NAME_3], [NAME_4],and [NAME_5]. [DATE_INTERVAL_1]. Adapting Coreference Resolution to Twitter Conversations. In Findings of EMNLP.
[NAME_6], [NAME_7], [NAME_8],[NAME_9], [NAME_10], [NAME_11],[NAME_12], and [NAME_13]. [DATE_INTERVAL_2]. Rewarding Coreference Resolvers for Being Consistent with World Knowledge. In EMNLP-IJCNLP.
But you may prefer to not de-identify the References section since it is not sensitive. This could be done with the Allow Text filter (keeping only the filter in the request for readability):
{
"text": [ "..." ],
"entity_detection": {
"filter": [
{
"type": "ALLOW_TEXT",
"pattern": "References\\s+([\\S\\s]+)"
}
]
}
}
Which would result in this processed text:
"Conclusion
A section with sensitive information like name (e.g. [NAME_1]) and organization (e.g. [ORGANIZATION_1]).
References
Berfin Aktaş, Veronika Solopova, Annalena Kohnert, and Manfred Stede. 2020. Adapting Coreference Resolution to Twitter Conversations. In Findings of EMNLP.
Rahul Aralikatte, Heather Lent, Ana Valeria Gonzalez, Daniel Herschcovich, Chen Qiu, Anders Sandholm, Michael Ringaard, and Anders Søgaard. 2019. Rewarding Coreference Resolvers for Being Consistent with World Knowledge. In EMNLP-IJCNLP."
where the References section was not de-identified.
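To preview which part of a document such a pattern would mark as safe, you can run it locally first. The sketch below uses Python's re module on two short, made-up documents. Using re.search for this preview is an assumption about how the pattern is applied; the point is only to show what the pattern captures.

```python
import re

REFERENCES = r"References\s+([\S\s]+)"


def allowed_section(pattern, text):
    """Return the text captured by the first group, or None if no match."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Made-up document with a trailing References section
doc = (
    "Conclusion\n"
    "John Doe works at Acme Corp.\n"
    "References\n"
    "Entry one. 2020.\n"
    "Entry two. 2019."
)
print(allowed_section(REFERENCES, doc))

# Same pattern on a document where the word appears in running text
risky = "References are listed at the end. Contact John Doe for details."
print(allowed_section(REFERENCES, risky))
```

In the second document, the pattern captures everything after the first occurrence of the word References, including the name. Anchoring the pattern, for example by requiring the heading to sit on its own line, reduces that risk.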
Allow Text filters also support capturing groups in regular expressions.
Using capturing groups
Capturing groups are a very useful feature of regular expressions. By adding capturing groups to your regular expression, you can effectively dissect a matched text into the sections of interest.
Consider this document including an audit trail with the editor name and the date of the changes:
[Part 1] [John Doe: Fri Mar 10 16:09:20 GMT 2023]
[Conclusion] [John Hancock: March 14, 2023]
Let's say you want to de-identify the author name but keep the dates of the audit trail in your processed text. One approach is to use Allow filters. However, it might be difficult to create a proper regular expression to allow all possible date formats. Moreover, all date entities would be allowed and not only those in the audit trail. This is where Allow Text filters and capturing groups become useful.
The following request contains an Allow Text filter for the audit trail above:
{
"text": ["[Part 1] [John Doe: Fri Mar 10 16:09:20 GMT 2023]\n[Conclusion] [John Hancock: March 14, 2023]"],
"entity_detection": {
"filter": [{
"type": "ALLOW_TEXT",
"pattern": "\\[[^:]*:([^\\]]*)\\]"
}]
}
}
Notice the capturing group ([^\]]*) in the second part of the pattern. This group selects the date, that is, the section of text from the colon : up to the closing square bracket ]. This informs the Allow Text filter that only this section has to be allowed. This produces the following processed text:
[Part 1] [[NAME_1]: Fri Mar 10 16:09:20 GMT 2023]
[Conclusion] [[NAME_2]: March 14, 2023]
where names are masked but dates are shown.
When groups are present, Allow Text filters will only allow the text matching the groups. This provides the flexibility you need to allow the section of text you want.
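The text selected by the capturing group can be inspected locally with Python's re module. The sketch below runs the audit trail pattern over the example document and lists what group 1 captures for each match. It illustrates the regular expression only and does not call the service.

```python
import re

AUDIT_TRAIL = r"\[[^:]*:([^\]]*)\]"

doc = (
    "[Part 1] [John Doe: Fri Mar 10 16:09:20 GMT 2023]\n"
    "[Conclusion] [John Hancock: March 14, 2023]"
)

# Group 1 is the part that the Allow Text filter would leave untouched
allowed = [match.group(1) for match in re.finditer(AUDIT_TRAIL, doc)]
print(allowed)
# [' Fri Mar 10 16:09:20 GMT 2023', ' March 14, 2023']
```

The editor names fall outside the group, so they remain subject to detection, while the dates inside the group are allowed whatever their format.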
A word of caution
The regular expression pattern in filters can be as complex as it needs to be in order to capture the specific text of interest. However, be careful not to create filter patterns that are too generic: an overly broad Block pattern may de-identify sections of your document unnecessarily or, worse, an overly broad Allow pattern may leave sensitive information unredacted.