Processing Files
[New in version 3.0]
Private AI file processing automatically extracts text from documents and images, redacts it, and returns a redacted file in the same format as the original.
How Does It Work?
Private AI supports file redaction through two endpoints that accept either a URI or a base64-encoded file: /v3/process/files/uri and /v3/process/files/base64.
The uri endpoint lets you point to a file on a mounted drive and redact it without first reading the file into memory or worrying about saving the redacted content. The redacted contents are automatically saved at a user-specified location, with the .redacted suffix added to the original file name. For example, given the file /some/path/my-doc.pdf, the uri endpoint redacts it and creates a file named my-doc.redacted.pdf with the redacted contents at the location specified by the user. When using the uri endpoint, the file extension is used to determine the file type.
Common Pitfall
Passing a file with no extension or with the wrong extension to the uri endpoint may lead to unexpected behavior.
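The naming rule described above can be sketched in a few lines of Python. This is only an illustration: the helper name expected_redacted_path is hypothetical, and it assumes the service inserts .redacted before the final extension, as in the my-doc.pdf example. It also rejects paths without an extension, since the uri endpoint relies on the extension to detect the file type.

```python
from pathlib import PurePosixPath


def expected_redacted_path(source_path: str, output_dir: str) -> str:
    """Return the path where the redacted copy of source_path is expected.

    Assumption: ".redacted" is inserted before the final extension,
    as in the my-doc.pdf -> my-doc.redacted.pdf example.
    """
    source = PurePosixPath(source_path)
    if not source.suffix:
        # The uri endpoint uses the extension to determine the file type.
        raise ValueError(f"{source_path!r} has no file extension")
    return str(PurePosixPath(output_dir) / f"{source.stem}.redacted{source.suffix}")


print(expected_redacted_path("/some/path/my-doc.pdf", "/files/output"))
```

How the service names files with several extensions (for example .tar.gz) is not covered here, so treat the result for such names as a guess.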
The base64 endpoint, on the other hand, is ideal if you want to save the redacted file yourself. To use the base64 endpoint, you first need to read the file into memory, encode its contents with base64, and send the result to the endpoint. You also need to pass the MIME type of the file as a hint to the file processing pipeline.
Check out the Supported File Types page for extensions and MIME types for both endpoints.
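As a rough sketch of the steps above, the request body for the base64 endpoint can be assembled as follows. The helper name build_base64_payload is hypothetical, and guessing the MIME type from the file name with Python's mimetypes module is an assumption made for convenience; the guessed value should still be checked against the Supported File Types page.

```python
import base64
import mimetypes


def build_base64_payload(file_bytes: bytes, filename: str) -> dict:
    """Build the JSON body for /v3/process/files/base64 from raw file bytes."""
    # Assumption: the MIME type guessed from the file name is one the
    # service supports; verify it against the Supported File Types page.
    content_type, _ = mimetypes.guess_type(filename)
    if content_type is None:
        raise ValueError(f"cannot guess a MIME type for {filename!r}")
    return {
        "file": {
            # JSON cannot carry raw bytes, so the contents travel as base64 text.
            "data": base64.b64encode(file_bytes).decode("ascii"),
            "content_type": content_type,
        }
    }


payload = build_base64_payload(b"%PDF-1.4 example bytes", "sample.pdf")
print(payload["file"]["content_type"])
```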
Diving Deeper
Processing files with the uri endpoint
To process files with the /v3/process/files/uri endpoint, you must mount a volume when starting the container.
docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-v <full path to files>:<path in container> \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>
In addition, the service requires access to a folder where the redacted files will be stored. Set this folder with the PAI_OUTPUT_FILE_DIR environment variable, which must point to a folder mounted into the container or to an existing subfolder of a mounted folder.
docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-v <full path to files>:<path in container> \
-e PAI_OUTPUT_FILE_DIR=<path to mounted folder in container> \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>
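Before starting the container, it can help to sanity-check that the value chosen for PAI_OUTPUT_FILE_DIR falls under one of the mounted container paths. The sketch below is a hypothetical pre-flight check, not part of the product; the function name is_inside_mount is made up for illustration, and the check only compares path strings, so it cannot tell whether a subfolder actually exists.

```python
from pathlib import PurePosixPath


def is_inside_mount(output_dir: str, mount_points: list[str]) -> bool:
    """Return True if output_dir is a mount point or lies below one."""
    out = PurePosixPath(output_dir)
    # The folder itself plus every ancestor, e.g. /files/output, /files, /
    candidates = (out, *out.parents)
    return any(PurePosixPath(mount) in candidates for mount in mount_points)


print(is_inside_mount("/files/output", ["/files"]))
print(is_inside_mount("/data/output", ["/files"]))
```

Comparing whole path components rather than string prefixes avoids treating /filesystem as if it were inside /files.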
This example command mounts a files folder located in the admin user's home folder.
docker run --rm -v /home/admin/license.json:/app/license/license.json \
-v /home/admin/files:/files \
-e PAI_OUTPUT_FILE_DIR=/files/output \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:3.1.0-cpu
Common Pitfall
Mounting a folder over an existing OS or application folder in the container may lead to unexpected behavior.
Once the container is running with the above command, you can redact files with:
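import base64

import requests

# Path to the file to redact.
PATH_TO_PDF_FILE = "test.pdf"

# Read the file into memory and encode its contents with base64.
with open(PATH_TO_PDF_FILE, "rb") as pdf_file:
    pdf_file_in_base64 = base64.b64encode(pdf_file.read())

# Note: this request goes to the base64 endpoint, which returns the
# redacted file in the response body.
response = requests.post(
    "http://localhost:8080/v3/process/files/base64",
    json={
        "file": {
            "data": pdf_file_in_base64.decode(),
            "content_type": "application/pdf",
        }
    },
)
response.raise_for_status()
print(response.json())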
Upon successful completion of a request to the uri endpoint for /files/sample.pdf, the redacted file is saved on the host under /home/admin/files/output/sample.redacted.pdf.
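For reference, a request to the uri endpoint itself might look like the sketch below. The shape of the request body is an assumption: this page does not show it, so the "uri" field name is a guess to be confirmed against the API reference. The path is the one seen inside the container (/files/sample.pdf), not the host path, and the helper names are hypothetical.

```python
import json
import urllib.request

URI_ENDPOINT = "http://localhost:8080/v3/process/files/uri"


def build_uri_request(container_path: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST request for the uri endpoint."""
    # ASSUMPTION: the body is {"uri": "<path inside the container>"}.
    # Check the API reference for the actual schema.
    body = json.dumps({"uri": container_path}).encode("utf-8")
    return urllib.request.Request(
        URI_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def redact_by_uri(container_path: str) -> dict:
    """Send the request to a running container and return the JSON response."""
    with urllib.request.urlopen(build_uri_request(container_path)) as response:
        return json.load(response)


request = build_uri_request("/files/sample.pdf")
print(request.full_url, request.data.decode("utf-8"))
```

Calling redact_by_uri("/files/sample.pdf") requires the container started with the example command to be running on localhost:8080.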
A note on permissions
Files created by the container will have the owner and permissions of the user running the docker service. In default installations, this is commonly root. However, you can change the user running the container with the docker run --user option.
This command runs the same container as the current user.
docker run --rm -v /home/admin/license.json:/app/license/license.json \
-e PAI_OUTPUT_FILE_DIR=/files/output \
-v /home/admin/files:/files \
--user $(id -u):$(id -g) \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:3.1.0-cpu
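To confirm that the --user option had the intended effect, you can compare the owner of a redacted file with the current user. This is a minimal sketch for POSIX systems; the helper name owned_by_current_user is hypothetical and the path in the comment is only the example output location used on this page.

```python
import os


def owned_by_current_user(path: str) -> bool:
    """Return True if the file at path is owned by the user running this script."""
    # st_uid is the numeric owner id; os.getuid() is only available on POSIX.
    return os.stat(path).st_uid == os.getuid()


# Example call, once a redacted file exists on the host:
# owned_by_current_user("/home/admin/files/output/sample.redacted.pdf")
```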
Check out the API reference for more details on the uri endpoint.
Processing files with the base64 endpoint
When using the /v3/process/files/base64 endpoint, there is no need to mount a folder into the container.
docker run --rm -v <full path to your license.json file>:/app/license/license.json \
-p 8080:8080 -it crprivateaiprod.azurecr.io/deid:<version>
The file is first read into memory and its contents are encoded, then the encoded contents are passed to the file processing endpoint. On Linux, this can be done with the base64 shell command. Assuming you have the file sample.pdf saved in the current folder:
echo '{"file": {"data": "'$(base64 -w 0 sample.pdf)'", "content_type": "application/pdf"}}' \
| curl --request POST --url 'http://localhost:8080/v3/process/files/base64' \
-H 'Content-Type: application/json' -d @-
This command will redact the file contents and return the redacted document as a base64-encoded string.
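Since the base64 endpoint leaves saving to you, the returned string has to be decoded and written to disk. The sketch below shows only that last step. The name of the response field that holds the string is not given on this page, so the hypothetical helper save_redacted_file takes the encoded string directly rather than guessing at the response layout.

```python
import base64
from pathlib import Path


def save_redacted_file(encoded: str, destination: str) -> int:
    """Decode a base64 string and write it to destination.

    Returns the number of bytes written.
    """
    data = base64.b64decode(encoded)
    return Path(destination).write_bytes(data)


# Example with a stand-in string; in practice, pass the base64 string
# taken from the endpoint's response.
example = base64.b64encode(b"redacted contents").decode("ascii")
print(save_redacted_file(example, "sample.redacted.pdf"))
```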
Common Pitfall
It is important that the proper MIME type is provided with the base64-encoded string. Failing to pass the proper MIME type may lead to unexpected behavior. Check out the Supported File Types page for proper MIME types.
Check out the API reference for more details on the base64 endpoint.