Processing Files

[New in version 3.0]

Attention

To run the example code in this guide, please sign up for a free test API key. Note that the demo API key can only process files through the base64 endpoint.

Private AI file processing automatically extracts text from files, documents and images, redacts it, and returns a redacted document in the same format as the original file.

How Does It Work?

Private AI supports file redaction through two endpoints, one that works with URIs and one that works with base64-encoded files: /v3/process/files/uri and /v3/process/files/base64.

The uri endpoint lets you point to a file on a mounted drive and redact it without first reading the file into memory or worrying about saving the redacted content. The redacted contents are automatically saved at a user-specified location, with the .redacted suffix added to the original name. For example, given the file /some/path/my-doc.pdf, the uri endpoint redacts it and creates a file named my-doc.redacted.pdf, containing the redacted contents, at the location specified by the user. When using the uri endpoint, the file extension is used to determine the file type.
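The naming rule described above can be sketched in a few lines of Python. This is only an illustration, assuming the .redacted suffix is inserted before the final extension as in the my-doc.pdf example; redacted_path is a hypothetical helper (not part of the Private AI API) and /files/output is an example output folder.

```python
from pathlib import PurePosixPath


def redacted_path(source: str, output_dir: str) -> PurePosixPath:
    """Predict where the redacted copy of `source` is expected to be saved.

    Hypothetical helper: mirrors the naming rule from the text, it does not
    call the service.
    """
    src = PurePosixPath(source)
    # my-doc.pdf -> my-doc.redacted.pdf (".redacted" goes before the extension)
    return PurePosixPath(output_dir) / f"{src.stem}.redacted{src.suffix}"


print(redacted_path("/some/path/my-doc.pdf", "/files/output"))
# prints /files/output/my-doc.redacted.pdf
```

A helper like this can be handy for locating the output file after a request completes, but the service itself remains the source of truth for the final name.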

Common Pitfall

Passing a file with no extension or with the wrong extension to the uri endpoint may lead to unexpected behavior.

The base64 endpoint, on the other hand, is ideal if you want to save the redacted file yourself. To use it, you first need to read the file into memory, encode its contents with base64, and send the encoded string to the endpoint. You also need to pass the MIME type of the file as a hint to the file processing pipeline.

Check out the Supported File Types page for extensions and MIME types for both endpoints.
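The three steps for the base64 endpoint (read, encode, attach a MIME type) could look like the following Python sketch. The payload layout ("file", "data", "content_type") follows the request shown later in this guide; build_base64_payload is a hypothetical helper, and guessing the MIME type from the file extension with the standard mimetypes module is an assumption made for convenience, so the result should be checked against the Supported File Types page.

```python
import base64
import mimetypes
from pathlib import Path


def build_base64_payload(path: str) -> dict:
    """Build a JSON-serializable request body for the base64 endpoint."""
    # Assumption: the extension is a good enough hint for the MIME type.
    content_type, _ = mimetypes.guess_type(path)
    if content_type is None:
        raise ValueError(f"Cannot infer a MIME type for {path!r}")
    # Read the whole file into memory and encode it as an ASCII string.
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {"file": {"data": encoded, "content_type": content_type}}
```

The returned dictionary can then be sent as the JSON body of a POST request to /v3/process/files/base64, for example with requests.post(url, json=payload).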

Diving Deeper

Processing files with the uri endpoint

To process files with the /v3/process/files/uri endpoint you are required to mount a volume when starting the container.

docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-v <full path to files>:<path in container> \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>

In addition, the service requires access to a folder where the redacted files will be stored. This folder is set with the PAI_OUTPUT_FILE_DIR environment variable, which must point to a folder that is mounted into the container or to an existing subfolder of a mounted folder.
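Before starting the container, the relationship between the output folder and the mounted folder can be sanity-checked with a short Python sketch. is_inside_mount is a hypothetical helper, and the check is purely lexical: it compares container paths as strings and does not verify that the subfolder actually exists.

```python
from pathlib import PurePosixPath


def is_inside_mount(output_dir: str, mount_point: str) -> bool:
    """Return True if output_dir is the mount point or a folder below it."""
    out = PurePosixPath(output_dir)
    mount = PurePosixPath(mount_point)
    # `parents` lists every ancestor of `out`, e.g. /files and / for /files/output
    return out == mount or mount in out.parents


print(is_inside_mount("/files/output", "/files"))  # prints True
print(is_inside_mount("/tmp/output", "/files"))    # prints False
```

With a mount such as -v /home/admin/files:/files, a value like /files/output passes this check, provided the output subfolder has been created on the host beforehand.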

docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-v <full path to files>:<path in container> \
-e PAI_OUTPUT_FILE_DIR=<path to mounted folder in container> \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>

The following example command mounts a files folder located in the admin user's home folder.

docker run --rm -v /home/admin/license.json:/app/license/license.json \
-v /home/admin/files:/files \
-e PAI_OUTPUT_FILE_DIR=/files/output \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:3.1.0-cpu

Common Pitfall

Mounting a folder over an existing OS or application folder in the container may lead to unexpected behavior.

Once the container is running with the above command, you can redact files with:

import requests

# Path to the file as seen from inside the container
# (/home/admin/files on the host is mounted at /files).
# Note: the "uri" field name is an assumption; confirm the exact
# request schema in the API reference for the uri endpoint.
response = requests.post(
    "http://localhost:8080/v3/process/files/uri",
    json={"uri": "/files/sample.pdf"},
)

# Raise an exception on 4xx/5xx responses, then show the result.
response.raise_for_status()
print(response.json())

Upon successful completion, the above request saves the redacted file under /home/admin/files/output/sample.redacted.pdf.

A note on permissions

Files created by the container will have the owner and permissions of the user running the Docker service. In default installations this is commonly root. However, you can change the user running the container with the docker run --user option.

The following command runs the same container as the current user.

docker run --rm -v /home/admin/license.json:/app/license/license.json \
-e PAI_OUTPUT_FILE_DIR=/files/output \
-v /home/admin/files:/files \
--user $(id -u):$(id -g) \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:3.1.0-cpu
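To confirm which user owns a redacted file after it has been written, something like the following Python sketch can be run on the host. describe_owner is a hypothetical helper that relies only on the standard library, and it assumes a Unix-like host (os.getuid is not available on Windows).

```python
import os
import stat


def describe_owner(path: str) -> dict:
    """Report the owner, group and permission bits of a file on the host."""
    info = os.stat(path)
    return {
        "uid": info.st_uid,
        "gid": info.st_gid,
        # Human-readable mode string such as "-rw-r--r--"
        "mode": stat.filemode(info.st_mode),
        # True when the file belongs to the user running this script
        "owned_by_me": info.st_uid == os.getuid(),
    }
```

For instance, calling describe_owner on a file in /home/admin/files/output should report owned_by_me as True when the container was started with the --user option for the current user, and typically False when it ran as root.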

Check out the API reference for more details on the uri endpoint.

Processing files with the base64 endpoint

When using the /v3/process/files/base64 endpoint, there is no need to mount a folder into the container.

docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>

The file is first read into memory and its contents are base64-encoded; the encoded contents are then passed to the file processing endpoint. On Linux, this can be done with the base64 shell command. Assuming you have the file sample.pdf saved in the current folder:

echo '{"file": {"data": "'$(base64 -w 0 sample.pdf)'", "content_type": "application/pdf"}}' \
| curl --request POST --url 'http://localhost:8080/v3/process/files/base64' \
-H 'Content-Type: application/json' -d @-

This command will redact the file contents and return the redacted document as a base64-encoded string.
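Since the base64 endpoint leaves saving to you, the returned string has to be decoded and written to disk on the client side. The following Python sketch shows that step only. save_base64_document is a hypothetical helper, and it deliberately takes the encoded string as an argument rather than assuming the name of the response field that carries it; look that field up in the API reference.

```python
import base64
from pathlib import Path


def save_base64_document(encoded: str, destination: str) -> int:
    """Decode a base64 string and write the bytes to `destination`.

    Returns the number of bytes written.
    """
    data = base64.b64decode(encoded)
    return Path(destination).write_bytes(data)
```

A typical call would pass the redacted document string taken from the JSON response together with a target name such as sample.redacted.pdf.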

Common Pitfall

It is important that the proper MIME type is provided with the base64-encoded string. Failing to pass the proper MIME type may lead to unexpected behavior. Check out the Supported File Types page for proper MIME types.

Check out the API reference for more details on the base64 endpoint.

© Copyright 2022, 2023 Private AI.