Python Client
This document provides information about how to use Private AI's Python client to interact with the container or cloud API. In addition to this guide, you might find the GitHub repository helpful; it contains further examples and usage options.
Installation
The Python client is available on pypi.org and can be installed with pip:
pip install privateai_client
Quick Start
from privateai_client import PAIClient
from privateai_client import request_objects
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text_request = request_objects.process_text_obj(text=["My sample name is John Smith"])
response = client.process_text(text_request)
print(text_request.text)
print(response.processed_text)
Output:
['My sample name is John Smith']
['My sample name is [NAME_1]']
Working with the Client
Initializing the Client for self-hosted container
The PAI client requires a scheme, host, and optional port to initialize. Alternatively, a full URL can be used. Once created, the connection can be tested with the client's ping function:
from privateai_client import PAIClient
scheme = 'http'
host = 'localhost'
port= '8080'
client = PAIClient(scheme, host, port)
client.ping()
url = "http://localhost:8080"
client = PAIClient(url=url)
client.ping()
Output:
True
True
Note: The container is hosted with your provisioned application license and does not manage authentication to the API or authorization of API requests. Access to the container is at the discretion of the user. For recommendations on how to deploy in an enterprise context including authorized use, please contact us.
Initializing the Client for our cloud-API offering
To access the cloud API, you need to authenticate with your API key. You can get one from the customer portal.
from privateai_client import PAIClient
# Adding credentials on initialization
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
# Adding credentials after initialization
client = PAIClient(url="https://api.private-ai.com/community/v4/")
client.ping()
client.add_api_key('<YOUR API KEY>')
client.ping()
Output:
The request returned with a 401 Unauthorized
True
Making Requests
Once initialized, the client can be used to make any request listed in the API documentation.
Available requests:
Client Function | Endpoint |
---|---|
get_version() | / |
ping() | /healthz |
get_metrics() | /metrics |
get_diagnostics() | /diagnostics |
ner_text() | /ner/text |
process_text() | /process/text |
analyze_text() | /analyze/text |
process_files_uri() | /process/files/uri |
process_files_base64() | /process/files/base64 |
bleep() | /bleep |
Requests can be made using dictionaries:
sample_text = ["This is John Smith's sample dictionary request"]
text_dict_request = {"text": sample_text}
response = client.process_text(text_dict_request)
print(response.processed_text)
Output:
["This is [NAME_1]'s sample dictionary request"]
or using built-in request objects:
from privateai_client import request_objects
sample_text = "This is John Smith's sample process text object request"
text_request_object = request_objects.process_text_obj(text=[sample_text])
response = client.process_text(text_request_object)
print(response.processed_text)
Output:
["This is [NAME_1]'s sample process text object request"]
Request Objects
Request objects are a simple way of creating request bodies without the tediousness of writing dictionaries. Every POST request (as listed in the Private AI API documentation) has its own request object.
from privateai_client import request_objects
sample_obj = request_objects.file_uri_obj(uri='path/to/file.jpg')
sample_obj.uri
Output:
'path/to/file.jpg'
Additionally, there are request objects for each nested dictionary of a request:
from privateai_client import request_objects
sample_text = "This is John Smith's sample process text object request where names won't be removed"
# sub-dictionary of entity_detection
sample_entity_type_selector = request_objects.entity_type_selector_obj(type="DISABLE", value=['NAME', 'NAME_GIVEN', 'NAME_FAMILY'])
# sub-dictionary of a process text request
sample_entity_detection = request_objects.entity_detection_obj(entity_types=[sample_entity_type_selector])
# request object created using the sub-dictionaries
sample_request = request_objects.process_text_obj(text=[sample_text], entity_detection=sample_entity_detection)
response = client.process_text(sample_request)
print(response.processed_text)
Output:
["This is John Smith's sample process text object request where names won't be removed"]
Building Request Objects
Request objects can be initialized by passing in all the values required for the request as arguments, or from a dictionary using the object's fromdict() function:
# Passing arguments
sample_data = "JVBERi0xLjQKJdPr6eEKMSAwIG9iago8PC9UaXRsZSAoc2FtcGxlKQovUHJvZHVj..."
sample_content_type = "application/pdf"
sample_file_obj = request_objects.file_obj(data=sample_data, content_type=sample_content_type)
# Passing a dictionary using .fromdict()
sample_dict = {"data": "JVBERi0xLjQKJdPr6eEKMSAwIG9iago8PC9UaXRsZSAoc2FtcGxlKQovUHJvZHVj...",
"content_type": "application/pdf"}
sample_file_obj2 = request_objects.file_obj.fromdict(sample_dict)
Request objects can also be converted back to dictionaries using the request object's to_dict() function:
from privateai_client import request_objects
sample_text = "Sample text."
# Create the nested request objects
sample_entity_type_selector = request_objects.entity_type_selector_obj(type="DISABLE", value=['HIPAA_SAFE_HARBOR'])
sample_entity_detection = request_objects.entity_detection_obj(entity_types=[sample_entity_type_selector])
# Create the request object
sample_request = request_objects.process_text_obj(text=[sample_text], entity_detection=sample_entity_detection)
# All nested request objects are also formatted
print(sample_request.to_dict())
Output:
{
'text': ['Sample text.'],
'link_batch': False,
'entity_detection': {'accuracy': 'high', 'entity_types': [{'type': 'DISABLE', 'value': ['HIPAA_SAFE_HARBOR']}], 'filter': [], 'return_entity': True},
'processed_text': {'type': 'MARKER', 'pattern': '[UNIQUE_NUMBERED_ENTITY_TYPE]'}
}
Sample Use
Processing a directory of files with URI route
from privateai_client import PAIClient
from privateai_client.objects import request_objects
import os
import logging
file_dir = "/path/to/file/directory"
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
for file_name in os.listdir(file_dir):
    filepath = os.path.join(file_dir, file_name)
    if not os.path.isfile(filepath):
        continue
    req_obj = request_objects.file_uri_obj(uri=filepath)
    # NOTE: this method of file processing requires the container to have the input and output directories mounted
    resp = client.process_files_uri(req_obj)
    if not resp.ok:
        logging.error(f"response for file {file_name} returned with {resp.status_code}")
Processing a file with Base64 route
from privateai_client import PAIClient
from privateai_client.objects import request_objects
import base64
import os
import logging
file_dir = "/path/to/your/file"
file_name = 'sample_file.pdf'
filepath = os.path.join(file_dir,file_name)
file_type = "type/of_file"  # e.g. application/pdf
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
# Read from file
with open(filepath, "rb") as b64_file:
    file_data = base64.b64encode(b64_file.read())
    file_data = file_data.decode("ascii")
# Make the request
file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)
if not resp.ok:
    logging.error(f"response for file {file_name} returned with {resp.status_code}")
# Write to file
with open(os.path.join(file_dir, f"redacted-{file_name}"), 'wb') as redacted_file:
    processed_file = resp.processed_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)
Bleep an audio file
from privateai_client import PAIClient
from privateai_client.objects import request_objects
import base64
import os
import logging
file_dir = "/path/to/your/file"
file_name = "sample_audio.mp3"
filepath = os.path.join(file_dir, file_name)
file_type = "audio/mp3"  # e.g. audio/mp3 or audio/wav
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
with open(filepath, "rb") as b64_file:
    file_data = base64.b64encode(b64_file.read())
    file_data = file_data.decode("ascii")
file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
timestamp = request_objects.timestamp_obj(start=1.12, end=2.14)
request_obj = request_objects.bleep_obj(file=file_obj, timestamps=[timestamp])
resp = client.bleep(request_object=request_obj)
if not resp.ok:
    logging.error(f"response for file {file_name} returned with {resp.status_code}")
with open(os.path.join(file_dir, f"redacted-{file_name}"), 'wb') as redacted_file:
    processed_file = resp.bleeped_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)
Analyze Text Post-Processing
The analyze/text
route returns rich, structured detections you can post-process with the Private AI Python client. It is a route specifically developed for text understanding.
For more details on its capabilities, refer to the analyze/text documentation.
In this section, we describe how the Python client can be used to post-process the analyze/text response.
The Python client provides utilities to iterate through detected entities and apply transformation rules, such as masking, pseudonymizing, validating, or normalizing values.
The following example introduces the required pieces for post-processing, which we describe in detail.
# This code assumes that you have the Private AI deidentification service running locally on port 8080.
# It also assumes that you have installed the Private AI python client.
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
client = PAIClient(
url="https://api.private-ai.com/community/v4/", api_key="<YOUR-API-KEY>"
)
text = [
"Jenna is a 32 year old female diagnosed with asthma."
]
request = {
"text": text,
"locale": "en-US",
"entity_detection": {
"accuracy": "high",
"entity_types": [{"type": "ENABLE", "value": ["AGE", "NAME"]}],
},
}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
class AgeBucketEntityProcessor:
    def __init__(self, bucket_size: int = 5):
        self.bucket_size = bucket_size

    def __call__(self, entity: dict) -> str:
        age = entity["analysis_result"].get("formatted")
        if not age:
            return "[%-%]"
        start = (age // self.bucket_size) * self.bucket_size
        end = start + self.bucket_size
        return f"[{start}-{end}]"
entity_processors = {"AGE": AgeBucketEntityProcessor(bucket_size=10)}
deidentified_texts = deidentify_text(
text,
resp,
entity_processors=entity_processors,
default_processor=MarkerEntityProcessor(),
)
for t in deidentified_texts:
print(t)
The output of this code replaces the age with the corresponding range.
[NAME_1] is a [30-40] year old female diagnosed with asthma.
At the core of this workflow is the deidentify_text
function which allows for entity replacements by invoking various entity processors. Each processor defines the exact behavior for a given entity type, making it easy to implement custom redaction tailored to your use case.
The function deidentify_text(...)
takes the original texts plus the analyze/text
response, walks through every detected entity in left-to-right order, and replaces each entity span using the appropriate processor. It also automatically adjusts the character offsets of the entity locations after their replacements.
from typing import Callable
from privateai_client.components import AnalyzeTextResponse
EntityProcessor = Callable[[dict], str]
def deidentify_text(
    text: list[str],
    response: AnalyzeTextResponse,
    entity_processors: dict[str, EntityProcessor],
    default_processor: EntityProcessor,
) -> list[str]:
    ...
- text - The original list of text messages that were passed into PAIClient.analyze_text().
- response - The structured response returned by the analyze_text call.
- entity_processors - Mapping of entity type to entity processor, e.g. {"DATE": redact_date, "CREDIT_CARD": redact_credit_card}. Each processor is a callable that accepts an entity dictionary and returns the replacement string for that entity. A processor is invoked when the entity's best_label matches a key in this dictionary.
- default_processor - A fallback processor applied to all entity types not explicitly listed in entity_processors. This ensures every entity is handled, even if you only configure custom processors for a subset of the enabled entities.
The return value is a list of de-identified text strings.
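To make the offset adjustment concrete: a replacement rarely has the same length as the span it covers, so every later entity location must shift by the difference. The sketch below illustrates that bookkeeping in isolation; it is not the client's implementation, and the span format is invented for the example.

```python
def replace_spans(text: str, spans: list[tuple[int, int, str]]) -> str:
    """Apply (start, end, replacement) spans left to right, shifting offsets as lengths change."""
    shift = 0
    for start, end, replacement in spans:
        # Offsets were computed against the original text, so adjust by the running shift.
        start, end = start + shift, end + shift
        text = text[:start] + replacement + text[end:]
        shift += len(replacement) - (end - start)
    return text

print(replace_spans("Jenna is 32 years old.", [(0, 5, "[NAME_1]"), (9, 11, "[30-40]")]))
# [NAME_1] is [30-40] years old.
```

deidentify_text performs this adjustment for you, which is why processors only need to return the replacement string.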
Entity Processors
The processors are callables (Callable[[dict], str]) that take a detected entity dictionary and return the replacement text for that span. A processor can be as simple as a function, or a class that implements the __call__ method. In the example above, we created the AgeBucketEntityProcessor, which puts the AGE entity into a bucket.
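For instance, a plain function that replaces every span with its label in brackets is already a valid processor. This is a sketch; the best_label key mirrors the entity dictionaries used throughout this section, and the sample entity is invented for illustration.

```python
def label_processor(entity: dict) -> str:
    """Replace the detected span with its label in brackets."""
    return f"[{entity['best_label']}]"

# A hypothetical entity dictionary, shaped like the ones in this section:
entity = {"text": "John Smith", "best_label": "NAME"}
print(label_processor(entity))  # [NAME]
```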
The potential use cases are broad. A few common examples include:
- Hide all but the last 4 digits in a CREDIT_CARD number;
- Keep only the year in a DATE entity;
- Shift all dates by an offset in a DATE entity;
- Replace names with initials only;
- Preserve the email domain and mask the username in an EMAIL_ADDRESS entity;
- Leave only the less sensitive characters in a LOCATION_ZIP code;
- Redact entities based on fuzzy similarity to a list of identifiable terms.
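As a sketch of one of these, a processor could mask the username of an EMAIL_ADDRESS while preserving the domain. It assumes the entity's text field holds the raw address; the fallback marker is an arbitrary choice.

```python
def mask_email_username(entity: dict) -> str:
    """Mask the local part of an email address, keeping the domain."""
    address = entity["text"]
    if "@" not in address:
        return "[EMAIL_ADDRESS]"  # fall back when the span is not a parseable address
    username, domain = address.rsplit("@", 1)
    return "*" * len(username) + "@" + domain

print(mask_email_username({"text": "jane.doe@example.com"}))  # ********@example.com
```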
Built-in processors
In addition to writing your own processors, the client ships with three built-in entity processors, with more planned in future releases:
- MaskEntityProcessor and MarkerEntityProcessor - intended to be used for default processing.
- FuzzyMatchEntityProcessor - a configurable processor that matches entities against a list of known words using Damerau–Levenshtein distance. It can automatically catch misspellings or near-duplicates, and can be set to allow or block specific entities while doing the opposite for all others of the same type. A complete example is provided below.
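To give a feel for what a small distance threshold tolerates, the optimal-string-alignment variant of Damerau–Levenshtein distance (single-character edits plus adjacent transpositions) can be sketched as follows. This illustrates the metric itself, not FuzzyMatchEntityProcessor's actual code.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insertions, deletions, substitutions, and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(osa_distance("private ia", "private ai"))  # 1: one transposition
print(osa_distance("privatai", "private ai"))    # 2: two insertions
```

With a threshold of 2 and casing ignored, both misspellings above land within range of "Private AI", which is why such variants get masked in the fuzzy-matching example later in this document.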
The sections below showcase how some of these can be implemented in more detail.
Custom redaction of credit card numbers
# This code assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text = [
"Okay, hang on just a second because I got to get it. Okay, it is 6578-7790-4346-2237. Expiration. 1224.",
"All right, I'm ready. 800 678-457-7896. Expiration is one. 224.",
"CC_type: Diners Club International RuPay Visa JCB Amex CCN: 30569309025904 4242424242424242 4222222222222 6172873484776530 378282246310005 CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["CREDIT_CARD"]}]}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_credit_card(entity) -> str:
    """Redacts credit card numbers"""
    analysis_result = entity["analysis_result"]
    for assertion in analysis_result["validation_assertions"]:
        if assertion["provider"] == "luhn":
            if assertion["status"] == "valid":
                return f"[{'*' * 12}{analysis_result['formatted'][-4:]}]"
            else:
                return f"{analysis_result['formatted']} [INVALID]"
    return f"{entity['text']}"
entity_processors = {"CREDIT_CARD": redact_credit_card}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The redact_credit_card
function contains the necessary logic to redact credit card numbers as follows:
- If the credit card number is valid, hide all but the last four characters (which may include spaces).
- If the number parses correctly but fails the Luhn check, it is invalid. In this case, don't hide the number; add an INVALID tag after it instead. This makes invalid credit card numbers easier to spot in text for later review.
- If the number fails to parse as a credit card number, do nothing; the code assumes it is not a credit card number.
The above code output looks like this:
Okay, hang on just a second because I got to get it. Okay, it is 6578 7790 4346 2237 [INVALID]. Expiration. 1224.
All right, I'm ready. 800 678-457-7896. Expiration is one. 224.
CC_type: Diners Club International RuPay Visa JCB Amex CCN: [************ 904] [************4242] [************ 222] 6172 8734 8477 6530 [INVALID] [************ 005] CC_CVC: 480 902 182 765 143 CC_Expiredate: 5/28 6/67 12/67 11/29 9/70
Notice how the credit card number in the first example was not redacted; an INVALID marker was added right after it instead. On the second line, the 800 678-457-7896 entity was left unredacted as expected, since it is likely a phone number rather than a credit card number. Finally, the last line shows several valid credit card numbers and a single invalid one. The valid credit card numbers were masked except for their last characters, as expected.
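For reference, the Luhn check that drives the valid/invalid decision above can be sketched in a few lines. This is the standard textbook algorithm, shown here for illustration; it is independent of the API's luhn validation provider.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:   # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9   # equivalent to summing the two digits of the product
        total += d
    return total % 10 == 0

print(luhn_valid("4242 4242 4242 4242"))  # True: a well-known valid test number
print(luhn_valid("6578-7790-4346-2237"))  # False: the invalid number from the first example
```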
Custom redaction of dates
# This code assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text = [
"$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (July 10 2018) https://t.co/eit53RUY4g",
"Short sale volume (not short interest) for $KBE on 2018-07-09 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%",
"$WLTW high OI range is 160 to 155 for option expiration 07/20/2018 #options https://t.co/BnVElKBKkJ",
]
request = {
"text": text,
"locale": "en-US",
"entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["DATE", "DOB", "DAY", "MONTH", "YEAR"]}]},
}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_date(entity) -> str:
    """Redacts days and months from dates"""
    offset = entity["location"]["stt_idx"]
    text = entity["text"]
    for subtype in entity["analysis_result"]["subtypes"]:
        if subtype["label"] in ["DAY", "MONTH"] and "location" in subtype:
            stt = subtype["location"]["stt_idx"] - offset
            end = subtype["location"]["end_idx"] - offset
            text = text[:stt] + "#" * (end - stt) + text[end:]
    return text
entity_processors = {"DATE": redact_date, "DOB": redact_date}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of this request is provided below:
$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (#### ## 2018) https://t.co/eit53RUY4g
Short sale volume (not short interest) for $KBE on 2018-##-## is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%
$WLTW high OI range is 160 to 155 for option expiration ##/##/2018 #options https://t.co/BnVElKBKkJ
Notice how the dates have been partially redacted. A similar approach can be used to instead shift the dates. To do so, simply replace the date processor in the above code with this one:
import random
from datetime import datetime, timedelta

def redact_date(entity) -> str:
    """Shifts the date by a random number of weeks (0 to 20 weeks)"""
    random_week_offset = random.randint(0, 20)
    if "formatted" in entity["analysis_result"]:
        formatted_datetime = datetime.fromisoformat(entity["analysis_result"]["formatted"])
        return str((formatted_datetime + timedelta(weeks=random_week_offset)).date())
    else:
        return entity["text"]
This is an example output of this date processor (since the shift is random, your output will differ):
$MDT $MRK $QRVO $TSS & 5 more stock picks for LONG swings: https://t.co/CbkieXxqoR (2018-07-17) https://t.co/eit53RUY4g
Short sale volume (not short interest) for $KBE on 2018-08-20 is 42%. https://t.co/7pWbgjJ8Ag $FOXA 38% $TVIX 34% $LITE 54% $HIG 60%
$WLTW high OI range is 160 to 155 for option expiration 2018-09-28 #options https://t.co/BnVElKBKkJ
Notice how the dates are replaced with dates that have been shifted by a random number of weeks.
Custom redaction of ages
# This code assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text = [
"A 32-year old Black female German citizen living in Germany wants to travel to the United States for leisure.",
"West Point Public School division provides school-based preschool services for children from two through nine years of age who are children at risk and children with identified disabilities or delays.",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high", "entity_types": [{"type": "ENABLE", "value": ["AGE"]}]}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_age(entity) -> str:
    """Round the age to the closest multiple of ten"""
    if "formatted" in entity["analysis_result"]:
        age = entity["analysis_result"]["formatted"]
        return str(int(round(age * 10, -2) / 10))
    else:
        return "#"
entity_processors = {"AGE": redact_age}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of this code shows that ages have been bucketed to the closest multiple of ten.
A 30-year old Black female German citizen living in Germany wants to travel to the United States for leisure.
West Point Public School division provides school-based preschool services for children from 0 through 10 years of age who are children at risk and children with identified disabilities or delays.
Custom redaction of locations
# This code assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import MarkerEntityProcessor
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text = [
"Please deliver this to 45, Clybaun Heights, Galway City, Ireland H91 AKK3",
"3255 M-A-D-D-A-M-S street, huntington, west virginia is his birthplace",
"My favorite city is San Francisco, California 94110, United States, 37.7749° N, 122.4194° W",
]
request = {"text": text, "locale": "en-US", "entity_detection": {"accuracy": "high"}}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
def redact_address(entity) -> str:
    """Redacts an address to hide the most sensitive info"""
    analysis_result = entity["analysis_result"]
    subtypes = sorted(analysis_result["subtypes"], key=lambda x: x["location"]["stt_idx"])
    address_parts = []
    for subtype in subtypes:
        if subtype["label"] in ["LOCATION_COUNTRY", "LOCATION_STATE", "LOCATION_CITY"]:
            address_parts.append(subtype["text"])
        elif subtype["label"] in ["LOCATION_ZIP"]:
            address_parts.append(subtype["text"][:3] + "#" * (len(subtype["text"]) - 3))
        else:
            address_parts.append(f"""[{subtype["label"]}]""")
    return " ".join(address_parts)
entity_processors = {"LOCATION": redact_address, "LOCATION_ADDRESS": redact_address}
deidentified_text = deidentify_text(text, resp, entity_processors=entity_processors, default_processor=MarkerEntityProcessor())
for example in deidentified_text:
print(example)
The output of the code above shows the redacted addresses. Only the first 3 characters of the postal code and zip code are kept, and street addresses, when present, are replaced with a marker. The last example shows that GPS coordinates are also redacted.
Please deliver this to [LOCATION_ADDRESS_STREET] Galway City Ireland H91#####
[LOCATION_ADDRESS_STREET] huntington west virginia is his birthplace
My favorite city is San Francisco California 941## United States [LOCATION_COORDINATE]
Custom redaction of coreferenced names
# This code assumes that you have installed the Private AI python client.
from privateai_client.post_processing import deidentify_text
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key='<YOUR API KEY>')
text = [
"Nikola Jokić is a basketball player. LeBron James is also a basketball player. "
"Jokić and James played against each other. Jokić led his team with a triple-double performance. "
"After the game, Nikola praised his teammates for their effort. "
"Many fans consider Nikola Jokić one of the best centers in NBA history."
]
request = {
"text": text,
"locale": "en-US",
"entity_detection": {"accuracy": "high"},
"relation_detection": {"coreference_resolution": "model_prediction"},
}
text_request = AnalyzeTextRequest.fromdict(request)
resp = client.analyze_text(text_request)
# THIS IS THE CUSTOM LOGIC TO IMPLEMENT
coref_to_initials: dict[str, str] = {}

def replace_with_initials(entity: dict) -> str:
    """Replace any detected person with initials in the style A.B."""
    coref_id = entity.get("coreference_id")
    original_text = entity["text"]
    if not coref_id:
        return original_text
    if coref_id in coref_to_initials:
        return coref_to_initials[coref_id]
    parts = original_text.split()
    initials = "".join(p[0].upper() + "." for p in parts if p)
    coref_to_initials[coref_id] = initials
    return initials
entity_processors = {
"NAME": replace_with_initials,
"NAME_GIVEN": replace_with_initials,
"NAME_FAMILY": replace_with_initials,
}
deidentified_text = deidentify_text(
text,
resp,
entity_processors=entity_processors,
default_processor=lambda entity: entity["text"]
)
for example in deidentified_text:
print(example)
The output of running this code replaces names with the corresponding initials of the people mentioned in the text.
N.J. is a basketball player. L.J. is also a basketball player. N.J. and L.J. played against each other. N.J. led his team with a triple-double performance. After the game, N.J. praised his teammates for their effort. Many fans consider N.J. one of the best centers in NBA history.
In the following example, we explore the capabilities of the built-in FuzzyMatchEntityProcessor
in more depth.
Fuzzy matching against list of known words
# This code assumes that you have installed the Private AI python client.
from privateai_client import PAIClient
from privateai_client.components import AnalyzeTextRequest
from privateai_client.post_processing import deidentify_text
from privateai_client.post_processing.processors import (
MaskEntityProcessor,
FuzzyMatchEntityProcessor,
)
client = PAIClient(
url="https://api.private-ai.com/community/v4/",
api_key="<YOUR-API-KEY>",
)
text = [
"Private IA released a new API.",
"Our partners include ExampleSoft, OpenAI, and PAAI.",
"The conference in Toronto featured Google and PrivatAI on stage.",
]
request = {
"text": text,
"locale": "en",
"entity_detection": {
"accuracy": "high",
"entity_types": [{"type": "ENABLE", "value": ["ORGANIZATION"]}],
},
}
request_object = AnalyzeTextRequest.fromdict(request)
analyze_text_rsp = client.analyze_text(request_object)
default_mask_processor = MaskEntityProcessor()
fuzzy_processor = FuzzyMatchEntityProcessor(
known_words_list=["Private AI", "PAI"],
threshold=2,
strategy="ALLOW",
process_type="MASK",
ignore_casing=True,
)
text_out = deidentify_text(
text=text,
response=analyze_text_rsp,
entity_processors={"ORGANIZATION": fuzzy_processor},
default_processor=default_mask_processor,
)
for t in text_out:
print(t)
The output of running this code is:
########## released a new API.
Our partners include ExampleSoft, OpenAI, and ####.
The conference in Toronto featured Google and ######## on stage.
This example contains intentional misspellings to demonstrate fuzzy matching. All variants of "Private AI" and "PAI" are consistently redacted with masked text. The other company names remain unchanged, since they do not appear in the known-words list that we intend to mask.