Python Client
This document provides information about how to use Private AI's Python client to interact with the container or cloud API. In addition to this guide, you might find the GitHub repository helpful; it contains further examples and usage options.
Installation
The Python client is available on pypi.org and can be installed with pip:
pip install privateai_client
Quick Start
from privateai_client import PAIClient
from privateai_client import request_objects
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text_request = request_objects.process_text_obj(text=["My sample name is John Smith"])
response = client.process_text(text_request)
print(text_request.text)
print(response.processed_text)
Output:
['My sample name is John Smith']
['My sample name is [NAME_1]']
Working with the Client
Initializing the Client for self-hosted container
The PAI client requires a scheme, host, and optional port to initialize. Alternatively, a full URL can be used. Once created, the connection can be tested with the client's ping function:
from privateai_client import PAIClient
scheme = 'http'
host = 'localhost'
port= '8080'
client = PAIClient(scheme, host, port)
client.ping()
url = "http://localhost:8080"
client = PAIClient(url=url)
client.ping()
Output:
True
True
Note: The container is hosted with your provisioned application license and does not manage authentication to the API or authorization of API requests. Access to the container is at the discretion of the user. For recommendations on how to deploy in an enterprise context including authorized use, please contact us.
Initializing the Client for our cloud-API offering
To access the cloud API, you need to authenticate with your API key. You can get one from the customer portal.
from privateai_client import PAIClient
# Adding credentials on initialization
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
# Adding credentials after initialization
client = PAIClient(url="https://api.private-ai.com/community/v4/")
client.ping()
client.add_api_key('<YOUR API KEY>')
client.ping()
Output:
The request returned with a 401 Unauthorized
True
Making Requests
Once initialized, the client can be used to make any request listed in the API documentation.
Available requests:
Client Function | Endpoint |
---|---|
get_version() | / |
ping() | /healthz |
get_metrics() | /metrics |
get_diagnostics() | /diagnostics |
ner_text() | /ner/text |
process_text() | /process/text |
analyze_text() | /analyze/text |
process_files_uri() | /process/files/uri |
process_files_base64() | /process/files/base64 |
bleep() | /bleep |
Requests can be made using dictionaries:
sample_text = ["This is John Smith's sample dictionary request"]
text_dict_request = {"text": sample_text}
response = client.process_text(text_dict_request)
print(response.processed_text)
Output:
["This is [NAME_1]'s sample dictionary request"]
or using built-in request objects:
from privateai_client import request_objects
sample_text = "This is John Smith's sample process text object request"
text_request_object = request_objects.process_text_obj(text=[sample_text])
response = client.process_text(text_request_object)
print(response.processed_text)
Output:
["This is [NAME_1]'s sample process text object request"]
Request Objects
Request objects are a simple way of creating request bodies without the tediousness of writing dictionaries. Every POST request (as listed in the Private AI API documentation) has its own request object.
from privateai_client import request_objects
sample_obj = request_objects.file_uri_obj(uri='path/to/file.jpg')
sample_obj.uri
Output:
'path/to/file.jpg'
Additionally, there are request objects for each nested dictionary of a request:
from privateai_client import request_objects
sample_text = "This is John Smith's sample process text object request where names won't be removed"
# sub-dictionary of entity_detection
sample_entity_type_selector = request_objects.entity_type_selector_obj(type="DISABLE", value=['NAME', 'NAME_GIVEN', 'NAME_FAMILY'])
# sub-dictionary of a process text request
sample_entity_detection = request_objects.entity_detection_obj(entity_types=[sample_entity_type_selector])
# request object created using the sub-dictionaries
sample_request = request_objects.process_text_obj(text=[sample_text], entity_detection=sample_entity_detection)
response = client.process_text(sample_request)
print(response.processed_text)
Output:
["This is John Smith's sample process text object request where names won't be removed"]
Building Request Objects
Request objects can be initialized by passing in all the values required for the request as arguments, or from a dictionary using the object's fromdict() function:
# Passing arguments
sample_data = "JVBERi0xLjQKJdPr6eEKMSAwIG9iago8PC9UaXRsZSAoc2FtcGxlKQovUHJvZHVj..."
sample_content_type = "application/pdf"
sample_file_obj = request_objects.file_obj(data=sample_data, content_type=sample_content_type)
# Passing a dictionary using .fromdict()
sample_dict = {"data": "JVBERi0xLjQKJdPr6eEKMSAwIG9iago8PC9UaXRsZSAoc2FtcGxlKQovUHJvZHVj...",
"content_type": "application/pdf"}
sample_file_obj2 = request_objects.file_obj.fromdict(sample_dict)
Request objects can also be converted back to dictionaries using the request object's to_dict() function:
from privateai_client import request_objects
sample_text = "Sample text."
# Create the nested request objects
sample_entity_type_selector = request_objects.entity_type_selector_obj(type="DISABLE", value=['HIPAA_SAFE_HARBOR'])
sample_entity_detection = request_objects.entity_detection_obj(entity_types=[sample_entity_type_selector])
# Create the request object
sample_request = request_objects.process_text_obj(text=[sample_text], entity_detection=sample_entity_detection)
# All nested request objects are also formatted
print(sample_request.to_dict())
Output:
{
'text': ['Sample text.'],
'link_batch': False,
'entity_detection': {'accuracy': 'high', 'entity_types': [{'type': 'DISABLE', 'value': ['HIPAA_SAFE_HARBOR']}], 'filter': [], 'return_entity': True},
'processed_text': {'type': 'MARKER', 'pattern': '[UNIQUE_NUMBERED_ENTITY_TYPE]'}
}
Sample Use
Processing a directory of files with URI route
from privateai_client import PAIClient
from privateai_client.objects import request_objects
import os
import logging
file_dir = "/path/to/file/directory"
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
for file_name in os.listdir(file_dir):
    filepath = os.path.join(file_dir, file_name)
    if not os.path.isfile(filepath):
        continue
    req_obj = request_objects.file_uri_obj(uri=filepath)
    # NOTE: this method of file processing requires the container to have the input and output directories mounted
    resp = client.process_files_uri(req_obj)
    if not resp.ok:
        logging.error(f"response for file {file_name} returned with {resp.status_code}")
Processing a file with Base64 route
from privateai_client import PAIClient
from privateai_client.objects import request_objects
import base64
import os
import logging
file_dir = "/path/to/your/file"
file_name = 'sample_file.pdf'
filepath = os.path.join(file_dir,file_name)
file_type = "type/of_file"  # e.g. application/pdf
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
# Read from file
with open(filepath, "rb") as b64_file:
    file_data = base64.b64encode(b64_file.read())
    file_data = file_data.decode("ascii")
# Make the request
file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)
if not resp.ok:
    logging.error(f"response for file {file_name} returned with {resp.status_code}")
# Write to file
with open(os.path.join(file_dir, f"redacted-{file_name}"), 'wb') as redacted_file:
    processed_file = resp.processed_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)
Bleep an audio file
from privateai_client import PAIClient
from privateai_client.objects import request_objects
import base64
import os
import logging
file_dir = "/path/to/your/file"
file_name = "sample_audio.mp3"
filepath = os.path.join(file_dir, file_name)
file_type = "audio/mp3"  # e.g. audio/mp3 or audio/wav
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
with open(filepath, "rb") as b64_file:
    file_data = base64.b64encode(b64_file.read())
    file_data = file_data.decode("ascii")
file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
timestamp = request_objects.timestamp_obj(start=1.12, end=2.14)
request_obj = request_objects.bleep_obj(file=file_obj, timestamps=[timestamp])
resp = client.bleep(request_object=request_obj)
if not resp.ok:
    logging.error(f"response for file {file_name} returned with {resp.status_code}")
with open(os.path.join(file_dir, f"redacted-{file_name}"), 'wb') as redacted_file:
    processed_file = resp.bleeped_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)
Analyze Text Post-Processing
The analyze/text
route returns rich, structured detections you can post-process with the Private AI Python client. It is a route specifically developed for text understanding.
For more details on its capabilities, refer to the analyze/text documentation.
In this section, we describe how the Python client can be used to post-process the analyze/text response.
The Python client provides utilities to iterate through detected entities and apply transformation rules, such as masking, pseudonymizing, validating, or normalizing values.
The following example introduces the required pieces for post-processing, which we describe in detail.
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
client = PAIClient(
url="https://api.private-ai.com/community/v4/", api_key="<YOUR-API-KEY>"
)
text = [
"Jenna is a 32 year old female diagnosed with asthma."
]
request = {
"text": text,
"locale": "en-US",
"entity_detection": {
"accuracy": "high",
"entity_types": [{"type": "ENABLE", "value": ["AGE", "NAME"]}],
},
}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
class AgeBucketEntityProcessor:
    def __init__(self, bucket_size: int = 5):
        self.bucket_size = bucket_size

    def __call__(self, entity: dict) -> str:
        age = entity["analysis_result"].get("formatted")
        if not age:
            return "[%-%]"
        start = (age // self.bucket_size) * self.bucket_size
        end = start + self.bucket_size
        return f"[{start}-{end}]"
entity_processors = {"AGE": AgeBucketEntityProcessor(bucket_size=10)}
deidentified_texts = deidentify_text(
text,
resp,
entity_processors=entity_processors,
default_processor=MarkerEntityProcessor(),
)
for t in deidentified_texts:
print(t)
The output of this code replaces the age with the corresponding range.
[NAME_1] is a [30-40] year old female diagnosed with asthma.
At the core of this workflow is the deidentify_text
function which allows for entity replacements by invoking various entity processors. Each processor defines the exact behavior for a given entity type, making it easy to implement custom redaction tailored to your use case.
The function deidentify_text(...)
takes the original texts plus the analyze/text
response, walks through every detected entity in left-to-right order, and replaces each entity span using the appropriate processor. It also automatically adjusts the character offsets of the entity locations after their replacements.
from typing import Callable
from privateai_client.components import AnalyzeTextResponse
EntityProcessor = Callable[[dict], str]
def deidentify_text(
    text: list[str],
    response: AnalyzeTextResponse,
    entity_processors: dict[str, EntityProcessor],
    default_processor: EntityProcessor,
) -> list[str]:
    ...
- text - The original list of text messages that were passed into PAIClient.analyze_text().
- response - The structured response returned by the analyze_text call.
- entity_processors - Mapping of entity type to entity processor, e.g. {"DATE": redact_date, "CREDIT_CARD": redact_credit_card}. Each processor is a callable that accepts an entity dictionary and returns the replacement string for that entity. A processor is invoked when the entity's best_label matches a key in this dictionary.
- default_processor - A fallback processor applied to all entity types not explicitly listed in entity_processors. This ensures every entity is handled, even if you only configure custom processors for a subset of the enabled entities.
The return value is a list of de-identified text strings.
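To make the offset adjustment concrete: a replacement rarely has the same length as the span it covers, so every later entity location must shift by the difference. The sketch below illustrates that bookkeeping in isolation; it is not the client's implementation, and the span format is invented for the example.

```python
def replace_spans(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, replacement) spans left to right, shifting offsets as lengths change."""
    shift = 0
    for start, end, replacement in spans:
        # Offsets were computed against the original text, so adjust by the running shift.
        start, end = start + shift, end + shift
        text = text[:start] + replacement + text[end:]
        shift += len(replacement) - (end - start)
    return text

print(replace_spans("Jenna is 32 years old.", [(0, 5, "[NAME_1]"), (9, 11, "[30-40]")]))
# [NAME_1] is [30-40] years old.
```

deidentify_text performs this adjustment for you, which is why processors only need to return the replacement string.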
Entity Processors
The processors are callables (Callable[[dict], str]) that take a detected entity dictionary and return the replacement text for that span. A processor can be as simple as a function, or a class that implements the __call__ method. In the example above, we created the AgeBucketEntityProcessor, which puts the AGE entity into a bucket.
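For instance, a plain function that replaces every span with its label in brackets is already a valid processor. This is a sketch; the best_label key mirrors the entity dictionaries used throughout this section, and the sample entity is invented for illustration.

```python
def label_processor(entity: dict) -> str:
    """Replace the detected span with its label in brackets."""
    return f"[{entity['best_label']}]"

# A hypothetical entity dictionary, shaped like the ones in this section:
entity = {"text": "John Smith", "best_label": "NAME"}
print(label_processor(entity))  # [NAME]
```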
The potential use cases are broad. A few common examples include:
- Hide all but the last 4 digits in a CREDIT_CARD number;
- Keep only the year in a DATE entity;
- Shift all dates by an offset in a DATE entity;
- Replace names with initials only;
- Preserve the email domain and mask the username in an EMAIL_ADDRESS entity;
- Leave only the less sensitive characters in a LOCATION_ZIP code;
- Redact entities based on fuzzy similarity to a list of identifiable terms.
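As a sketch of one of these, a processor could mask the username of an EMAIL_ADDRESS while preserving the domain. It assumes the entity's text field holds the raw address; the fallback marker is an arbitrary choice.

```python
def mask_email_username(entity: dict) -> str:
    """Mask the local part of an email address, keeping the domain."""
    address = entity["text"]
    if "@" not in address:
        return "[EMAIL_ADDRESS]"  # fall back when the span is not a parseable address
    username, domain = address.rsplit("@", 1)
    return "*" * len(username) + "@" + domain

print(mask_email_username({"text": "jane.doe@example.com"}))  # ********@example.com
```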
Built-in processors
In addition to writing your own processors, the client ships with three built-in entity processors, with more planned in future releases:
- MaskEntityProcessor and MarkerEntityProcessor - intended to be used for default processing.
- FuzzyMatchEntityProcessor - a configurable processor that matches entities against a list of known words using Damerau–Levenshtein distance. It can automatically catch misspellings or near-duplicates, and can be set to allow or block specific entities while doing the opposite for all others of the same type. A complete example is provided below.
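To give a feel for what a small distance threshold tolerates, the optimal-string-alignment variant of Damerau–Levenshtein distance (single-character edits plus adjacent transpositions) can be sketched as follows. This illustrates the metric itself, not FuzzyMatchEntityProcessor's actual code.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions, substitutions, and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("private ia", "private ai"))  # 1: one transposition
print(osa_distance("privatai", "private ai"))    # 2: two insertions
```

With a threshold of 2 and casing ignored, both misspellings above land within range of "Private AI", which is why such variants get masked in the fuzzy-matching example later in this document.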
The sections below showcase how some of these can be implemented in more detail.
Custom redaction of credit card numbers
# This code assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text = [
"Okay, hang on just a second because I got to get it. Okay, it is 6578-7790-4346-2237. Expiration. 1224.",
"All right, I'm ready. 800 678-457-7896. Expiration is one. 224.",
"CC_type: Diners Club International RuPay Visa JCB Amex CCN: 30569309025904 4242424242424242 4222222222222 6172873484776530 378282246310005 CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["CREDIT_CARD"]}]}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_credit_card(entity) -> str:
    """Redacts credit card numbers"""
    analysis_result = entity["analysis_result"]
    for assertion in analysis_result["validation_assertions"]:
        if assertion["provider"] == "luhn":
            if assertion["status"] == "valid":
                return f"[{'*' * 12}{analysis_result['formatted'][-4:]}]"
            else:
                return f"{analysis_result['formatted']} [INVALID]"
    return f"{entity['text']}"
entity_processors = {"CREDIT_CARD": redact_credit_card}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The redact_credit_card
function contains the necessary logic to redact credit card numbers as follows:
- If the credit card number is valid, hide all but the last four characters (which may include spaces).
- If the number parses correctly but fails the Luhn check, it is invalid. In this case, don't hide the number; add an INVALID tag after it instead. This makes invalid credit card numbers easier to spot in text for later review.
- If the number fails to parse as a credit card number, do nothing; the code assumes it is not a credit card number.
The above code output looks like this:
Okay, hang on just a second because I got to get it. Okay, it is 6578 7790 4346 2237 [INVALID]. Expiration. 1224.
All right, I'm ready. 800 678-457-7896. Expiration is one. 224.
CC_type: Diners Club International RuPay Visa JCB Amex CCN: [************ 904] [************4242] [************ 222] 6172 8734 8477 6530 [INVALID] [************ 005] CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70
Notice how the credit card number in the first example was not redacted; an INVALID marker was added right after it instead. On the second line, the 800 678-457-7896 entity was left unredacted as expected, since it is likely a phone number rather than a credit card number. Finally, the last line shows several valid credit card numbers and a single invalid one. The valid credit card numbers were masked except for their last characters, as expected.
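For reference, the Luhn check that drives the valid/invalid decision above can be sketched in a few lines. This is the standard textbook algorithm, shown here for illustration; it is independent of the API's luhn validation provider.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:   # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9   # equivalent to summing the two digits of the product
        total += d
    return total % 10 == 0

print(luhn_valid("4242 4242 4242 4242"))  # True: a well-known valid test number
print(luhn_valid("6578-7790-4346-2237"))  # False: the invalid number from the first example
```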
Custom redaction of dates
# This code assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text = [
"$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (July 10 2018) https://t.co/eit53RUY4g",
"Short sale volume (not short interest) for $KBE on 2018-07-09 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%",
"$WLTW high OI range is 160 to 155 for option expiration 07/20/2018 #options https://t.co/BnVElKBKkJ",
]
request = {
"text": text,
"locale": "en-US",
"entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["DATE", "DOB", "DAY", "MONTH", "YEAR"]}]},
}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_date(entity) -> str:
    """Redacts days and months from dates"""
    offset = entity["location"]["stt_idx"]
    text = entity["text"]
    for subtype in entity["analysis_result"]["subtypes"]:
        if subtype["label"] in ["DAY", "MONTH"] and "location" in subtype:
            stt = subtype["location"]["stt_idx"] - offset
            end = subtype["location"]["end_idx"] - offset
            text = text[:stt] + "#" * (end - stt) + text[end:]
    return text
entity_processors = {"DATE": redact_date, "DOB": redact_date}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of this request is provided below:
$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (#### ## 2018) https://t.co/eit53RUY4g
Short sale volume (not short interest) for $KBE on 2018-##-## is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%
$WLTW high OI range is 160 to 155 for option expiration ##/##/2018 #options https://t.co/BnVElKBKkJ
Notice how the dates have been partially redacted. A similar approach can be used to instead shift the dates. To do so, simply replace the date processor in the above code with this one:
import random
from datetime import datetime, timedelta

def redact_date(entity) -> str:
    """Shifts the date by a random number of weeks (0 to 20 weeks)"""
    random_week_offset = random.randint(0, 20)
    if "formatted" in entity["analysis_result"]:
        formatted_datetime = datetime.fromisoformat(entity["analysis_result"]["formatted"])
        return str((formatted_datetime + timedelta(weeks=random_week_offset)).date())
    else:
        return entity["text"]
This is an example output of this date processor (since the shift is random, your output will differ):
$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (2018-07-17) https://t.co/eit53RUY4g
Short sale volume (not short interest) for $KBE on 2018-08-20 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%
$WLTW high OI range is 160 to 155 for option expiration 2018-09-28 #options https://t.co/BnVElKBKkJ
Notice how the dates are replaced with dates that have been shifted by a random number of weeks.
Custom redaction of ages
# This code assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text = [
"A 32-year old Black female German citizen living in Germany wants to travel to the United States for leisure.",
"West Point Public School division provides school-based preschool services for children from two through nine years of age who are children at risk and children with identified disabilities or delays.",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["AGE"]}]}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_age(entity) -> str:
    """Round the age to the closest multiple of ten"""
    if "formatted" in entity["analysis_result"]:
        age = entity["analysis_result"]["formatted"]
        return str(int(round(age * 10, -2) / 10))
    else:
        return "#"
entity_processors = {"AGE": redact_age}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of this code shows that ages have been bucketed to the closest multiple of ten.
A 30-year old Black female German citizen living in Germany wants to travel to the United States for leisure.
West Point Public School division provides school-based preschool services for children from 0 through 10 years of age who are children at risk and children with identified disabilities or delays.
Custom redaction of locations
# This code assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text = [
"Please deliver this to 45, Clybaun Heights, Galway City, Ireland H91 AKK3",
"3255 M-A-D-D-A-M-S street, huntington, west virginia is his birthplace",
"My favorite city is San Francisco, California 94110, United States, 37.7749° N, 122.4194° W",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high"}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_address(entity) -> str:
    """Redacts an address to hide the most sensitive info"""
    analysis_result = entity["analysis_result"]
    subtypes = sorted(analysis_result["subtypes"], key=lambda x: x["location"]["stt_idx"])
    address_parts = []
    for subtype in subtypes:
        if subtype["label"] in ["LOCATION_COUNTRY", "LOCATION_STATE", "LOCATION_CITY"]:
            address_parts.append(subtype["text"])
        elif subtype["label"] in ["LOCATION_ZIP"]:
            address_parts.append(subtype["text"][:3] + "#" * (len(subtype["text"]) - 3))
        else:
            address_parts.append(f"""[{subtype["label"]}]""")
    return " ".join(address_parts)
entity_processors = {"LOCATION": redact_address, "LOCATION_ADDRESS": redact_address}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of the code above shows the redacted addresses. Only the first 3 characters of the postal code and zip code are kept, and street addresses, when present, are replaced with a marker. The last example shows that GPS coordinates are also redacted.
Please deliver this to [LOCATION_ADDRESS_STREET] Galway City Ireland H91#####
[LOCATION_ADDRESS_STREET] huntington west virginia is his birthplace
My favorite city is San Francisco California 941## United States [LOCATION_COORDINATE]
Custom redaction of coreferenced names
# This code assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text = [
"Nikola Jokić is a basketball player. LeBron James is also a basketball player. "
"Jokić and James played against each other. Jokić led his team with a triple-double performance. "
"After the game, Nikola praised his teammates for their effort. "
"Many fans consider Nikola Jokić one of the best centers in NBA history."
]
request = {
"text": text,
"locale": "en-US",
"entity_detection": {"accuracy": "high"},
"relation_detection": {"coreference_resolution": "model_prediction"},
}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
coref_to_initials: dict[str, str] = {}

def replace_with_initials(entity: dict) -> str:
    """Replace any detected person with initials in the style A.B."""
    coref_id = entity.get("coreference_id")
    original_text = entity["text"]
    if not coref_id:
        return original_text
    if coref_id in coref_to_initials:
        return coref_to_initials[coref_id]
    parts = original_text.split()
    initials = "".join(p[0].upper() + "." for p in parts if p)
    coref_to_initials[coref_id] = initials
    return initials
entity_processors = {
"NAME": replace_with_initials,
"NAME_GIVEN": replace_with_initials,
"NAME_FAMILY": replace_with_initials,
}
deidentified_text = deidentify_text(
text,
resp,
entity_processors=entity_processors,
default_processor=lambda entity: entity["text"]
)
for example in deidentified_text:
print(example)
The output of running this code replaces names with the corresponding initials of the people mentioned in the text.
N.J. is a basketball player. L.J. is also a basketball player. N.J. and L.J. played against each other. N.J. led his team with a triple-double performance. After the game, N.J. praised his teammates for their effort. Many fans consider N.J. one of the best centers in NBA history.
In the following example, we explore the capabilities of the built-in FuzzyMatchEntityProcessor
in more depth.
Fuzzy matching against list of known words
# This code assumes that you have installed the Private AI python client.
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import (
MaskEntityProcessor,
FuzzyMatchEntityProcessor,
)
client = PAIClient(
url="https://api.private-ai.com/community/v4/",
api_key="<YOUR-API-KEY>",
)
text = [
"Private IA released a new API.",
"Our partners include ExampleSoft, OpenAI, and PAAI.",
"The conference in Toronto featured Google and PrivatAI on stage.",
]
request = {
"text": text,
"locale": "en",
"entity_detection": {
"accuracy": "high",
"entity_types": [{"type": "ENABLE", "value": ["ORGANIZATION"]}],
},
}
request_object = AnalyzeTextRequest.fromdict(request)
analyze_text_rsp = client.analyze_text(request_object)
default_mask_processor = MaskEntityProcessor()
fuzzy_processor = FuzzyMatchEntityProcessor(
known_words_list=["Private AI", "PAI"],
threshold=2,
strategy="ALLOW",
process_type="MASK",
ignore_casing=True,
)
text_out = deidentify_text(
text=text,
response=analyze_text_rsp,
entity_processors={"ORGANIZATION": fuzzy_processor},
default_processor=default_mask_processor,
)
for t in text_out:
print(t)
The output of running this code is:
########## released a new API.
Our partners include ExampleSoft, OpenAI, and ####.
The conference in Toronto featured Google and ######## on stage.
This example contains intentional misspellings to demonstrate fuzzy matching. All variants of "Private AI" and "PAI" are consistently redacted with masked text. The other company names remain unchanged, since they do not appear in the known-words list that we intend to mask.