Processing Files

info

Connect with one of our privacy experts to run this code.

Private AI can scan many different file types for PII and create de-identified or redacted copies. Its supported entity types work across every file type, detecting localized variants of PII (Personally Identifiable Information), PHI (Protected Health Information), and PCI (Payment Card Industry) entities. The Supported Languages and Supported Entity Types pages provide a more detailed look.

How Does It Work?

Private AI provides file processing through two unified endpoints, one that accepts base64-encoded files and one that accepts URIs: /process/files/base64 and /process/files/uri.

Base64

The base64 endpoint is the recommended way to process files for most users, as there is no need to mount a volume into the container and ensure that permissions are set correctly. To use the base64 endpoint, read the file into memory, encode its contents with base64, and send the result to the endpoint. You also need to pass the MIME type of the file as a hint to the file processing pipeline. The Supported File Types page details extensions and MIME types for both endpoints.

URI

Available on the container only, the uri endpoint is suitable for larger data volumes and has the following advantages over the base64 endpoint:

No overhead of base64 encoding.
No need to first read the file in memory.
The processed file is saved automatically by the container.

API calls are made by pointing to a file on a mounted drive. The redacted contents are saved automatically at a user-specified location, with the .redacted suffix added to the original name. For example, given /some/path/my-doc.pdf, the uri endpoint reads the file, redacts it, and creates my-doc.redacted.pdf at the location specified by the user. The uri endpoint uses the file extension to determine the file type.
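The naming rule above can be sketched as a small helper that predicts where the redacted copy should appear. This is only an illustration: the function name expected_redacted_path is hypothetical, and it assumes the .redacted suffix is placed between the base name and the final extension, as in the my-doc.pdf example.

```python
from pathlib import PurePosixPath


def expected_redacted_path(input_uri: str, output_dir: str) -> PurePosixPath:
    """Guess where the container writes the redacted copy of input_uri."""
    source = PurePosixPath(input_uri)
    if not source.suffix:
        # The uri endpoint relies on the extension to detect the file type.
        raise ValueError(f"{input_uri!r} has no file extension")
    # Assumption: ".redacted" goes between the base name and the final extension.
    return PurePosixPath(output_dir) / f"{source.stem}.redacted{source.suffix}"


print(expected_redacted_path("/some/path/my-doc.pdf", "/home/admin/output"))
# prints /home/admin/output/my-doc.redacted.pdf
```

Rejecting paths without an extension up front avoids sending the container a request whose file type it cannot determine.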

Attention

Passing a file with no extension or with the wrong extension to the uri endpoint may lead to unexpected behavior.

Diving Deeper

Processing files with the `base64` endpoint

When using the /process/files/base64 endpoint, there is no need to mount a folder into the container.


docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>

The file is first read into memory and its contents encoded; the encoded contents are then passed to the file processing endpoint. On Linux, this can be done with the base64 shell command. The examples below assume that the file sample.pdf is saved in the current folder:

Request Body

{
    "file": {
        "data": "<base64-encoded contents of sample.pdf>",
        "content_type": "application/pdf"
    }
}

cURL

echo '{"file": {"data": "'$(base64 -w 0 sample.pdf)'", "content_type": "application/pdf"}}' \
| curl --request POST --url 'http://localhost:8080/process/files/base64' \
-H 'Content-Type: application/json' -d @- | jq -r .processed_file | base64 -d > 'sample.redacted.pdf'

Python

import base64
import requests

# Specify the input and output file paths
filename_in = "sample.pdf"
filename_out = "sample.redacted.pdf"

# Read the file and do base64 encoding
with open(filename_in, "rb") as f:
    b64_file_content = base64.b64encode(f.read())
    b64_file_content = b64_file_content.decode("utf-8")

# Make the request, fail on HTTP errors, and load the results as JSON
r = requests.post(url="http://localhost:8080/process/files/base64",
                  json={"file": {"data": b64_file_content, "content_type": "application/pdf"}})
r.raise_for_status()
results = r.json()

# Decode and write the file to disk
with open(filename_out, "wb") as f:
    f.write(base64.b64decode(results["processed_file"]))

Python Client

from privateai_client import PAIClient
from privateai_client.objects import request_objects
import base64

# Specify the input and output file paths
filename_in = "sample.pdf"
filename_out = "sample.redacted.pdf"

file_type = "application/pdf"
client = PAIClient(url="http://localhost:8080/")

# Read from file
with open(filename_in, "rb") as b64_file:
    file_data = base64.b64encode(b64_file.read())
    file_data = file_data.decode("ascii")

# Make the request
file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)

# Write to file
with open(filename_out, 'wb') as redacted_file:
    processed_file = resp.processed_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)

Each of the examples above redacts the file contents. The response returns the redacted document as a base64-encoded string in the processed_file field.
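Decoding the response can be made a little more defensive. The sketch below assumes the parsed JSON response is a dictionary with the processed_file field used in the examples above. The helper name save_processed_file is hypothetical and is not part of the Private AI client.

```python
import base64
import binascii
from pathlib import Path


def save_processed_file(results: dict, out_path: str) -> int:
    """Decode results["processed_file"] and write it to out_path.

    Returns the number of bytes written.
    """
    encoded = results.get("processed_file")
    if not encoded:
        raise KeyError("response has no 'processed_file' field")
    try:
        # validate=True rejects characters outside the base64 alphabet
        data = base64.b64decode(encoded, validate=True)
    except binascii.Error as exc:
        raise ValueError("'processed_file' is not valid base64") from exc
    return Path(out_path).write_bytes(data)
```

With the requests example, this would be called as save_processed_file(r.json(), "sample.redacted.pdf") once the request has succeeded.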

info

An example Python script showing how to process files with Private AI's Python client using the base64 route can be found here.

Attention

It is important that the proper MIME type is provided with the base64-encoded string. Failing to pass the proper MIME type may lead to unexpected behavior. Check out the Supported File Types page for proper MIME types.
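One way to reduce the risk of sending the wrong MIME type is to derive it from the file extension and fail loudly when it cannot be determined. The following is a minimal sketch using Python's standard mimetypes module. The helper name build_base64_payload is hypothetical, and the guess is based on the extension only, so it should still be checked against the Supported File Types page.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Optional


def build_base64_payload(path: str, content_type: Optional[str] = None) -> dict:
    """Build the JSON body for /process/files/base64 from a local file."""
    if content_type is None:
        # Extension-based guess only; the file contents are not inspected.
        content_type, _ = mimetypes.guess_type(path)
    if content_type is None:
        raise ValueError(
            f"cannot determine MIME type for {path!r}; pass content_type explicitly"
        )
    data = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {"file": {"data": data, "content_type": content_type}}
```

The returned dictionary has the same shape as the request body shown earlier and can be passed as the json argument of requests.post.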

Check out the API reference for more details on the base64 endpoint.

Processing files with the `uri` endpoint

To process files with the /process/files/uri endpoint, you must mount a volume containing the input files when starting the container.

In addition, the service requires access to a folder where the redacted files will be stored. This folder is set with the PAI_OUTPUT_FILE_DIR environment variable, which must point to a folder that is mounted into the container as the output folder.


docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-v <full path to files>:<path in container> \
-v <full path to output>:<path in container> \
-e PAI_OUTPUT_FILE_DIR=<path to mounted folder in container> \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>

The following example mounts the files folder in the admin home folder as the input location and the output folder as the output location.


docker run --rm -v /home/admin/license.json:/app/license/license.json \
-v /home/admin/files:/home/admin/files \
-v /home/admin/output:/home/admin/output \
-e PAI_OUTPUT_FILE_DIR=/home/admin/output \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:3.1.0-cpu

Common Pitfall

Mounting a folder over an existing OS or application folder in the container may lead to unexpected behavior.

info

An example Python script showing how to process files with Private AI's Python client using the URI route can be found here.

Once the container is running with the above command, you can redact files with:

Request Body

{
    "uri": "/home/admin/files/sample.pdf"
}

cURL

echo '{"uri": "/home/admin/files/sample.pdf"}' \
| curl --request POST --url 'http://localhost:8080/process/files/uri' \
-H 'Content-Type: application/json' -d @- | jq -r .processed_file | base64 -d > 'sample.redacted.pdf'

Python

import requests

PATH_TO_PDF_FILE = "/home/admin/files/sample.pdf"

# Ask the container to redact the file at the mounted path
response = requests.post(
    "http://localhost:8080/process/files/uri",
    json={
        "uri": PATH_TO_PDF_FILE
    }
)

# Fail on HTTP errors, then show the JSON response
response.raise_for_status()
print(response.json())

Python Client

from privateai_client import PAIClient
from privateai_client.objects import request_objects

client = PAIClient(url="http://localhost:8080")
filepath = "/home/admin/files/sample.pdf"
req_obj = request_objects.file_uri_obj(uri=filepath)
resp = client.process_files_uri(req_obj)

Upon successful completion, the above request saves the redacted file under /home/admin/output/sample.redacted.pdf, in the folder set by PAI_OUTPUT_FILE_DIR.

A note on permissions

Files created by the container will have the owner and permissions of the user running the Docker service. This is commonly root in default installations. However, you can change the user running the container with the docker run --user option.

This command will run the same container with the current user.


docker run --rm -v /home/admin/license.json:/app/license/license.json \
-e PAI_OUTPUT_FILE_DIR=/home/admin/output \
-v /home/admin/files:/home/admin/files \
-v /home/admin/output:/home/admin/output \
--user $(id -u):$(id -g) \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:3.1.0-cpu
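To confirm which user owns a redacted file on the host, the file's metadata can be inspected after processing. The sketch below uses only the Python standard library and assumes a Unix-like host, since os.getuid is not available on Windows. The helper name describe_ownership is hypothetical.

```python
import os
import stat
from pathlib import Path


def describe_ownership(path: str) -> dict:
    """Report the owner, group, and permission bits of a file."""
    info = Path(path).stat()
    return {
        "uid": info.st_uid,
        "gid": info.st_gid,
        # Human-readable mode string such as "-rw-r--r--"
        "mode": stat.filemode(info.st_mode),
        "owned_by_current_user": info.st_uid == os.getuid(),
    }
```

If owned_by_current_user is False for a file in the output folder, the container was most likely started without the --user option.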

Check out the API reference for more details on the uri endpoint.
