Processing Files
info
To run the example code in this guide, please sign up for your free test API key here. Note that the Community API key can only process files via the base64 endpoint.
Private AI supports scanning many file types for PII and creating de-identified or redacted copies. Private AI’s supported entity types work across every file type, and localized variants of PII (Personally Identifiable Information), PHI (Protected Health Information), and PCI (Payment Card Industry) entities are detected. Our Supported Languages and Supported Entity Types pages provide a more detailed look.
How Does It Work?
File processing in Private AI is exposed through two unified endpoints, which work with either base64-encoded files or URIs: /process/files/base64 and /process/files/uri.
Base64
The base64 endpoint is the recommended way to process files for most users, as there is no need to mount a volume into the container and ensure that permissions are set correctly. To use it, first read the file into memory, encode its contents with base64, and send the result to the base64 endpoint. You also need to pass the MIME type of the file as a hint to the file processing pipeline. The Supported File Types page details extensions and MIME types for both endpoints.
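The steps above (read, encode, attach a MIME type) can be sketched in Python as follows. This is a minimal illustration, not part of the Private AI client: build_base64_payload is a hypothetical helper name, and inferring the MIME type from the file name with the standard-library mimetypes module is an assumption. Always confirm the resulting type against the Supported File Types page.

```python
import base64
import mimetypes
from pathlib import Path


def build_base64_payload(path):
    """Build the JSON body for /process/files/base64 (hypothetical helper)."""
    file_path = Path(path)
    # Assumption: the MIME type can be inferred from the file extension.
    content_type, _ = mimetypes.guess_type(file_path.name)
    if content_type is None:
        raise ValueError(f"Cannot infer a MIME type for {file_path.name!r}")
    # Read the whole file into memory and base64-encode it.
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return {"file": {"data": encoded, "content_type": content_type}}
```

Failing early when no MIME type can be inferred avoids sending a request whose content type the pipeline cannot use as a hint.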
URI
Available on the container only, the uri endpoint is suitable for larger data volumes and has the following advantages over the base64 endpoint:
- No overhead of base64 encoding.
- No need to first read the file in memory.
- The processed file is saved automatically by the container.
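To put the first point in numbers: base64 represents every 3 bytes of input as 4 ASCII characters, so an encoded payload is roughly one third larger than the original file. The snippet below is a small illustration using only the Python standard library; the 3 MB size is an arbitrary example value.

```python
import base64

raw = bytes(3_000_000)  # 3 MB of placeholder data (arbitrary example size)
encoded = base64.b64encode(raw)

# base64 maps every 3 input bytes to 4 output characters.
overhead = len(encoded) / len(raw) - 1
print(f"raw: {len(raw)} bytes, encoded: {len(encoded)} bytes, overhead: {overhead:.0%}")
```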
API calls are made by pointing to a file on a mounted drive. The redacted contents are automatically saved at a user-specified location, with the .redacted suffix added to the original name. For example, given the file /some/path/my-doc.pdf, the uri endpoint redacts it and creates a file named my-doc.redacted.pdf with the redacted contents at the location specified by the user. When using the uri endpoint, the file extension is used to determine the file type.
Attention
Passing a file with no extension or with the wrong extension to the uri endpoint may lead to unexpected behavior.
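The naming rule described above can be mirrored on the client side, for example to predict where the redacted copy will appear or to reject files without an extension before calling the endpoint. This is a minimal sketch, assuming the .redacted suffix is always inserted immediately before the final extension, as in the my-doc.pdf example. expected_redacted_path is a hypothetical helper, not part of the Private AI API, and /home/admin/output is only an example output location.

```python
from pathlib import PurePosixPath


def expected_redacted_path(uri, output_dir):
    """Predict the container path of the redacted file (hypothetical helper)."""
    source = PurePosixPath(uri)
    if not source.suffix:
        # The uri endpoint relies on the extension to determine the file type.
        raise ValueError(f"{uri!r} has no file extension")
    # my-doc.pdf -> my-doc.redacted.pdf
    return PurePosixPath(output_dir) / f"{source.stem}.redacted{source.suffix}"


print(expected_redacted_path("/some/path/my-doc.pdf", "/home/admin/output"))
```

The check only catches a missing extension; it cannot tell whether an extension that is present matches the actual file contents.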
Diving Deeper
Processing files with the base64 endpoint
When using the /process/files/base64 endpoint, there is no need to mount a folder into the container.
docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>
The file is first read into memory and its contents are encoded; the encoded contents are then passed to the file processing endpoint. On Linux, this can be done with the base64 shell command. Assuming you have the file sample.pdf saved in the current folder:
{
  "file": {
    "data": "'$(base64 -w 0 sample.pdf)'",
    "content_type": "application/pdf"
  }
}
echo '{"file": {"data": "'$(base64 -w 0 sample.pdf)'", "content_type": "application/pdf"}}' \
| curl --request POST --url 'http://localhost:8080/process/files/base64' \
-H 'Content-Type: application/json' -d @- | jq -r .processed_file | base64 -d > 'sample.redacted.pdf'
import base64
import requests
# Specify the input and output file paths
filename_in = "sample.pdf"
filename_out = "sample.redacted.pdf"
# Read the file and do base64 encoding
with open(filename_in, "rb") as f:
    b64_file_content = base64.b64encode(f.read())
    b64_file_content = b64_file_content.decode("utf-8")
# Make the request, fail on HTTP errors, and load the results as JSON
r = requests.post(url="http://localhost:8080/process/files/base64",
                  json={"file": {"data": b64_file_content, "content_type": "application/pdf"}})
r.raise_for_status()
results = r.json()
# Decode and write the file to disk
with open(filename_out, "wb") as f:
    f.write(base64.b64decode(results["processed_file"]))
from privateai_client import PAIClient
from privateai_client.objects import request_objects
import base64
# Specify the input and output file paths
filename_in = "sample.pdf"
filename_out = "sample.redacted.pdf"
file_type = "application/pdf"
client = PAIClient(url="http://localhost:8080/")
# Read from file
with open(filename_in, "rb") as b64_file:
    file_data = base64.b64encode(b64_file.read())
    file_data = file_data.decode("ascii")
# Make the request
file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)
# Write to file
with open(filename_out, 'wb') as redacted_file:
    processed_file = resp.processed_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)
Each of these requests redacts the file contents and returns the redacted document as a base64-encoded string.
info
An example Python script showing how to process files with Private AI's Python client using the base64 route can be found here.
Attention
It is important that the proper MIME type is provided with the base64-encoded string. Failing to pass the proper MIME type may lead to unexpected behavior. Check out the Supported File Types page for proper MIME types.
Check out the API reference for more details on the base64 endpoint.
Processing files with the uri endpoint
To process files with the /process/files/uri endpoint, you must mount a volume when starting the container.
In addition, the service requires access to a folder where the redacted files will be stored. This folder is set with the PAI_OUTPUT_FILE_DIR environment variable, which must point to a folder that is mounted into the container as the output folder.
docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-v <full path to files>:<path in container> \
-v <full path to output>:<path in container> \
-e PAI_OUTPUT_FILE_DIR=<path to mounted folder in container> \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>
The following example command mounts the files folder in the admin home folder as the input location and the output folder as the output location.
docker run --rm -v /home/admin/license.json:/app/license/license.json \
-v /home/admin/files:/home/admin/files \
-v /home/admin/output:/home/admin/output \
-e PAI_OUTPUT_FILE_DIR=/home/admin/output \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:3.1.0-cpu
Common Pitfall
Mounting a folder over an existing OS or app folder in the container may lead to unexpected behavior.
info
An example Python script showing how to process files with Private AI's Python client using the URI route can be found here.
Once the container is running with the above command, you can redact files with:
{
  "uri": "/home/admin/files/sample.pdf"
}
echo '{"uri": "/home/admin/files/sample.pdf"}' \
| curl --request POST --url 'http://localhost:8080/process/files/uri' \
-H 'Content-Type: application/json' -d @- | jq -r .processed_file | base64 -d > 'sample.redacted.pdf'
import requests
PATH_TO_PDF_FILE = "/home/admin/files/sample.pdf"
response = requests.post(
    "http://localhost:8080/process/files/uri",
    json={
        "uri": PATH_TO_PDF_FILE
    }
)
# Fail on HTTP errors, then show the JSON response
response.raise_for_status()
print(response.json())
from privateai_client import PAIClient
from privateai_client.objects import request_objects
client = PAIClient(url="http://localhost:8080")
filepath = "/home/admin/files/sample.pdf"
req_obj = request_objects.file_uri_obj(uri=filepath)
resp = client.process_files_uri(req_obj)
Upon successful completion, the above command will save the redacted file under /home/admin/output/sample.redacted.pdf.
A note on permissions
Files created by the container will have the owner and permissions of the user running the Docker service. This is commonly root in default installations. However, you can change the user running the container with the docker --user option.
The following command runs the same container as the current user.
docker run --rm -v /home/admin/license.json:/app/license/license.json \
-e PAI_OUTPUT_FILE_DIR=/home/admin/output \
-v /home/admin/files:/home/admin/files \
-v /home/admin/output:/home/admin/output \
--user $(id -u):$(id -g) \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:3.1.0-cpu
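To confirm the effect of the --user option, the owner of a redacted file can be compared with the current user on the host. This is a minimal sketch for Linux or macOS hosts using only the Python standard library. file_owner_ids and current_user_ids are hypothetical helper names, and the path in the usage comment is the example output location used in the commands above.

```python
import os


def file_owner_ids(path):
    """Return the (uid, gid) that own the file at `path`."""
    info = os.stat(path)
    return info.st_uid, info.st_gid


def current_user_ids():
    """Return the effective (uid, gid) of the current process."""
    return os.geteuid(), os.getegid()


# Example usage on the host after processing a file:
# file_owner_ids("/home/admin/output/sample.redacted.pdf") == current_user_ids()
```

If the comparison is false and the owner uid is 0, the container was most likely started without --user and wrote the file as root.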
Check out the API reference for more details on the uri endpoint.