Processing Files

info

In order to run the example code in this guide, please sign up for your free test API key here. Note that the Community API key can only process files via the base64 endpoint.

Private AI can scan many different file types for PII and create de-identified or redacted copies. Private AI’s supported entity types work across every file type, and localized variants of PII (Personally Identifiable Information), PHI (Protected Health Information), and PCI (Payment Card Industry) entities are detected. The Supported Languages and Supported Entity Types pages provide a more detailed look.

How Does It Work?

Private AI supports file processing through two endpoints, one for base64-encoded files and one for URIs: /v3/process/files/base64 and /v3/process/files/uri.

Base64

The base64 endpoint is the recommended way to process files for most users, as there is no need to mount a volume into the container and ensure that permissions are set correctly. To use the base64 endpoint, you first read the file into memory, encode its content with base64, and send the result to the endpoint. You also need to pass the MIME type of the file as a hint to the file processing pipeline. The Supported File Types page details extensions and MIME types for both endpoints.

URI

Available on the container only, the uri endpoint is suitable for larger data volumes and has the following advantages over the base64 endpoint:

  • No overhead of base64 encoding.
  • No need to first read the file into memory.
  • The processed file is saved automatically by the container.

API calls are made by pointing to a file on a mounted drive. The redacted contents are automatically saved at a user-specified location, with the .redacted suffix inserted before the extension of the original name. For example, given the file /some/path/my-doc.pdf, the uri endpoint will read it, redact it, and create a file named my-doc.redacted.pdf with the redacted contents at the location specified by the user. When using the uri endpoint, the file extension is used to determine the file type.
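The naming rule described above can be sketched as a small helper. This is an illustration only: expected_redacted_path is a hypothetical name, and its handling of names with several dots (such as archive.tar.gz) is an assumption, since the text only gives a single-extension example.

```python
from pathlib import PurePosixPath


def expected_redacted_path(input_path: str, output_dir: str) -> PurePosixPath:
    """Predict where the redacted copy should land (hypothetical helper).

    Assumes the ".redacted" suffix goes between the file stem and the
    last extension, as in my-doc.pdf -> my-doc.redacted.pdf.
    """
    src = PurePosixPath(input_path)
    return PurePosixPath(output_dir) / f"{src.stem}.redacted{src.suffix}"
```

For instance, calling expected_redacted_path("/some/path/my-doc.pdf", "/files/output") yields /files/output/my-doc.redacted.pdf, which can be handy when a script needs to pick up the output after the request completes.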

Attention

Passing a file with no extension or with the wrong extension to the uri endpoint may lead to unexpected behavior.
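One way to guard against this is a quick client-side check before calling the uri endpoint. The sketch below covers PDF files only and relies on the fact that PDF files begin with the bytes %PDF-. The function name looks_like_pdf is hypothetical, other formats would need their own checks, and nothing here reflects how the container itself detects file types.

```python
from pathlib import Path

# Every PDF file starts with this marker.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path: str) -> bool:
    """Return True only if the file has a .pdf extension AND a PDF header."""
    p = Path(path)
    if p.suffix.lower() != ".pdf":
        return False  # missing or different extension
    with p.open("rb") as f:
        return f.read(len(PDF_MAGIC)) == PDF_MAGIC
```

A script could skip or log any file for which this returns False instead of sending it for processing.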

Diving Deeper

Processing files with the base64 endpoint

When using the /v3/process/files/base64 endpoint, there is no need to mount a folder into the container.

docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>

The file is first read into memory and base64-encoded, then the encoded contents are passed to the file processing endpoint. On Linux, the encoding can be done with the base64 shell command. Assuming you have the file sample.pdf saved in the current folder:

cURL

echo '{"file": {"data": "'$(base64 -w 0 sample.pdf)'", "content_type": "application/pdf"}}' \
| curl --request POST --url 'http://localhost:8080/v3/process/files/base64' \
-H 'Content-Type: application/json'  -d @-

Python

import base64
import requests

filename_in = "sample.pdf"
filename_out = "sample.redacted.pdf"

# Read the file and do base64 encoding
with open(filename_in, "rb") as f:
    b64_file_content = base64.b64encode(f.read())
    b64_file_content = b64_file_content.decode("utf-8")

# Make the request, fail early on HTTP errors, and load the results as JSON
r = requests.post(url="http://localhost:8080/v3/process/files/base64",
                  json={"file": {"data": b64_file_content, "content_type": "application/pdf"}})
r.raise_for_status()
results = r.json()

# Decode and write the file to disk
with open(filename_out, "wb") as f:
    f.write(base64.b64decode(results["processed_file"]))

Python Client

from privateai_client import PAIClient
from privateai_client.objects import request_objects
import base64
import os

filepath = "sample.pdf"
file_type = "application/pdf"
client = PAIClient(url="http://localhost:8080")

# Read from file
with open(filepath, "rb") as b64_file:
    file_data = base64.b64encode(b64_file.read())
    file_data = file_data.decode("ascii")

# Make the request
file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)

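# Write the redacted file next to the input, prefixed with "redacted-"
file_dir, file_name = os.path.split(filepath)
with open(os.path.join(file_dir, f"redacted-{file_name}"), "wb") as redacted_file: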
    processed_file = resp.processed_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)

Each of these requests redacts the file contents and returns the redacted document as a base64-encoded string.

info

An example Python script showing how to process files with Private AI's Python client using the base64 route can be found here.

Attention

It is important that the proper MIME type is provided with the base64-encoded string. Failing to pass the proper MIME type may lead to unexpected behavior. Check out the Supported File Types page for proper MIME types.
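As a rough sketch of how a script might pair each file with a MIME type, the helper below guesses the type from the file name with Python's standard mimetypes module and builds the same request body used in the examples above. The name build_base64_payload is hypothetical, and the guessed type is an assumption that should still be checked against the Supported File Types page.

```python
import base64
import mimetypes
from pathlib import Path


def build_base64_payload(path: str) -> dict:
    """Build a request body for /v3/process/files/base64 (hypothetical helper)."""
    # Guess the MIME type from the file extension; this is only a hint and
    # may not match the types the service expects.
    content_type, _ = mimetypes.guess_type(path)
    if content_type is None:
        raise ValueError(f"Cannot infer a MIME type for {path!r}; pass it explicitly.")
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {"file": {"data": data, "content_type": content_type}}
```

The returned dictionary can be passed as the json argument of requests.post, in place of the hand-written body shown earlier. Raising an error when no type can be inferred avoids sending a request without the MIME type hint.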

Check out the API reference for more details on the base64 endpoint.

Processing files with the uri endpoint

To process files with the /v3/process/files/uri endpoint, you must mount a volume when starting the container.

In addition, the service requires access to a folder where the redacted files will be stored. This folder is set with the PAI_OUTPUT_FILE_DIR environment variable, which must point to a folder that is mounted into the container or to an existing subfolder of a mounted folder.

docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-v <full path to files>:<path in container> \
-e PAI_OUTPUT_FILE_DIR=<path to mounted folder in container> \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>

The following example mounts the files folder from the admin user's home folder at /files in the container.

docker run --rm -v /home/admin/license.json:/app/license/license.json \
-v /home/admin/files:/files \
-e PAI_OUTPUT_FILE_DIR=/files/output \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:3.1.0-cpu

Common Pitfall

Mounting a folder over an existing OS or app folder in the container may lead to unexpected behavior.

info

An example Python script showing how to process files with Private AI's Python client using the URI route can be found here.

Once the container is running with the above command, you can redact files with:

import requests

PATH_TO_PDF_FILE = "/files/test.pdf"

response = requests.post(
    "http://localhost:8080/v3/process/files/uri",
    json={
        "uri": PATH_TO_PDF_FILE
    }
)

response.raise_for_status()
print(response.json())

Upon successful completion, the above request will save the redacted file as /home/admin/files/output/test.redacted.pdf on the host, which is /files/output/test.redacted.pdf inside the container.
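Because the uri endpoint expects the path as seen inside the container, while scripts usually know the host path, a small translation step can help. The sketch below is a hypothetical helper (to_container_path is not part of any Private AI client) that assumes a single mount such as /home/admin/files:/files from the command above.

```python
from pathlib import PurePosixPath


def to_container_path(host_path: str, host_mount: str, container_mount: str) -> str:
    """Map a host path under host_mount to its location under container_mount.

    Raises ValueError if host_path is not inside host_mount, i.e. the file
    would not be visible to the container.
    """
    relative = PurePosixPath(host_path).relative_to(host_mount)
    return str(PurePosixPath(container_mount) / relative)
```

With the example mount, to_container_path("/home/admin/files/test.pdf", "/home/admin/files", "/files") gives /files/test.pdf, which is the value to send in the uri field.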

A note on permissions

Files created by the container have the owner and permissions of the user running the Docker service, which is commonly root in default installations. You can change the user running the container with the --user option of docker run.

The following command runs the same container as the current user.

docker run --rm -v /home/admin/license.json:/app/license/license.json \
-e PAI_OUTPUT_FILE_DIR=/files/output \
-v /home/admin/files:/files \
--user $(id -u):$(id -g) \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:3.1.0-cpu
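
To confirm that redacted files ended up with the expected owner, a script on the host can compare the file's owner with the current user. This is a minimal sketch for Unix-like hosts only; owned_by_current_user is a hypothetical name.

```python
import os


def owned_by_current_user(path: str) -> bool:
    """Return True if the file's owner UID matches the current process UID (Unix only)."""
    return os.stat(path).st_uid == os.getuid()
```

Running this against a file in the output folder, such as one under /home/admin/files/output, should return True when the container was started with the --user option shown above, and False when the file was written as root.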

Check out the API reference for more details on the uri endpoint.

© Copyright 2024 Private AI.