Processing PDF Files

Private AI can scan PDF files for PII and create de-identified or redacted copies. The supported entity types work across every supported file type and cover localized variants of PII (Personally Identifiable Information), PHI (Protected Health Information), and PCI (Payment Card Industry) entities. See Supported Languages and Supported Entity Types for a more detailed look.


If you'd like to try it yourself, please visit our free interactive web demo. No code or account is necessary.

How PDFs Are Processed

PDFs are processed as follows:

  1. First, each page in the PDF is rendered as an image. The result is similar to a PDF created by a photocopier scan. PDF is a complicated format, so rendering to images ensures that all PII is properly captured.
  2. Each page is then processed as an image.
  3. A new PDF is created using the redacted/de-identified images produced in the previous step.
  4. If specified, an invisible, de-identified text layer is created using the OCR system output. This ensures that the resulting PDF is searchable and allows text to be copied and pasted.

Check out our OCR Guide to see the available OCR modes.
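
The four processing steps can be pictured as a simple per-page data flow. The sketch below is only an illustration of that flow, not Private AI's implementation: `process_pdf`, `OutputPage`, and the `render`, `redact`, and `text_layer` callables are all hypothetical names, and the real rendering, redaction, and OCR happen inside the service.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class OutputPage:
    image: bytes               # redacted raster image of the page
    text_layer: Optional[str]  # invisible de-identified text, if requested


def process_pdf(
    source_pages: List[bytes],
    render: Callable[[bytes], bytes],
    redact: Callable[[bytes], bytes],
    text_layer: Optional[Callable[[bytes], str]] = None,
) -> List[OutputPage]:
    """Hypothetical outline of the render -> redact -> rebuild flow."""
    output = []
    for page in source_pages:
        image = render(page)      # step 1: rasterise the page
        redacted = redact(image)  # step 2: process the page as an image
        # Step 4 (optional): de-identified text derived from OCR output
        text = text_layer(image) if text_layer else None
        # Step 3: the new PDF is assembled from the redacted images only
        output.append(OutputPage(image=redacted, text_layer=text))
    return output
```

Because only the rendered images flow through such a pipeline, nothing else from the original file (attachments, an existing text layer) carries over to the output.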

Constraints

  • Any attachments in a PDF file are removed.
  • If the PDF document to be de-identified already has an invisible text layer, it is discarded and replaced with a new text layer created using OCR.

Parameters

Below are the parameters that control the behaviour of the PDF De-identifier. These parameters shall be specified under pdf_options.

Parameter | Explanation | Default
density | PDFs are converted into images using this DPI value. Smaller values result in lower-resolution images, which take up less storage space and process faster, at the cost of output quality and redaction accuracy. | 200
max_resolution | PDFs are converted into images using the density DPI value. Any resulting image whose longest side is larger than this value is resized down to it, whilst preserving aspect ratio. | 3000
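
To see how the two parameters interact, the sketch below estimates the size of a rendered page and builds a request body with custom values. It rests on several assumptions: `rendered_size` and `build_payload` are hypothetical helpers, sizes are taken to be in pixels, the rounding used by the service is unknown, and `pdf_options` is assumed to sit at the top level of the request body next to `file`.

```python
from typing import Tuple


def rendered_size(width_in: float, height_in: float,
                  density: int = 200, max_resolution: int = 3000) -> Tuple[int, int]:
    """Estimate the page image size in pixels (rounding is an assumption)."""
    width_px = round(width_in * density)
    height_px = round(height_in * density)
    longest = max(width_px, height_px)
    if longest > max_resolution:
        # Scale both sides by the same factor to preserve the aspect ratio
        width_px = round(width_px * max_resolution / longest)
        height_px = round(height_px * max_resolution / longest)
    return width_px, height_px


def build_payload(file_content_base64: str,
                  density: int = 200, max_resolution: int = 3000) -> dict:
    """Request body with pdf_options (top-level placement is an assumption)."""
    return {
        "file": {"data": file_content_base64, "content_type": "application/pdf"},
        "pdf_options": {"density": density, "max_resolution": max_resolution},
    }


print(rendered_size(8.5, 11))               # US Letter at the defaults -> (1700, 2200)
print(rendered_size(8.5, 11, density=300))  # capped by max_resolution -> (2318, 3000)
```

Under these assumptions, raising density beyond the point where the longest side reaches max_resolution adds rendering work without increasing the final image size.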

Support Matrix

 | CPU Container | GPU Container | Community API | Professional API | PrivateGPT UI
Supported? | Yes | Yes | Base64 Only | Yes | No

Sample Request


Please sign up for a free API key to run this code.

Request Body
{
    "file": {
        "data": "<base64-encoded PDF content>",
        "content_type": "application/pdf"
    },
    "entity_detection": {
        "return_entity": true
    }
}

cURL
echo '{
          "file": {"data": "'$(base64 -w 0 sample.pdf)'",
          "content_type": "application/pdf"},
          "entity_detection": {"return_entity": true}
      }' \
| curl --request POST --url 'https://api.private-ai.com/community/v3/process/files/base64' \
       -H 'Content-Type: application/json' \
       -H 'x-api-key: <YOUR KEY HERE>' \
       -d @- \
       | jq -r .processed_file \
       | base64 -d > 'sample.redacted.pdf'

Python
import base64

import requests

# Download the sample PDF and base64-encode it for the JSON payload
file_url = "https://paidocumentation.blob.core.windows.net/$web/sample.pdf"
filename_out = "/path/to/output/sample.redacted.pdf"
file_content = requests.get(file_url).content
file_content_base64 = base64.b64encode(file_content).decode()

url = "https://api.private-ai.com/community/v3/process/files/base64"

headers = {"Content-Type": "application/json", "x-api-key": "<INSERT API KEY>"}

payload = {
    "file": {
        "data": file_content_base64,
        "content_type": "application/pdf",
    },
    "entity_detection": {
        "return_entity": True
    }
}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # stop on an HTTP error instead of failing on a missing key

# The redacted PDF is returned base64-encoded in "processed_file"
with open(filename_out, "wb") as f:
    f.write(base64.b64decode(response.json()["processed_file"]))

Python Client
import base64

from privateai_client import PAIClient
from privateai_client.objects import request_objects

filename_in = "sample.pdf"
filename_out = "sample.redacted.pdf"

file_type = "application/pdf"
client = PAIClient(url="https://api.private-ai.com/community/", api_key="<YOUR API KEY>")

# Read the PDF and base64-encode it as an ASCII string
with open(filename_in, "rb") as pdf_file:
    file_data = base64.b64encode(pdf_file.read())
    file_data = file_data.decode("ascii")

# Build the request and send it to the base64 file-processing route
file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)

# Decode the returned file and write the redacted PDF to disk
with open(filename_out, "wb") as redacted_file:
    processed_file = resp.processed_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)
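
The samples above decode processed_file without inspecting the rest of the response. A slightly more defensive handler might look like the sketch below. It assumes the response body is a JSON object with the fields listed under Sample Response, and that the values in languages_detected are scores where higher means more likely; `decode_processed_file` and `most_likely_language` are hypothetical helper names, not part of any Private AI library.

```python
import base64
from typing import Any, Dict, Optional


def decode_processed_file(body: Dict[str, Any]) -> bytes:
    """Return the redacted PDF bytes from a parsed response body."""
    encoded = body.get("processed_file")
    if not encoded:
        raise ValueError("response body has no 'processed_file' field")
    # validate=True rejects characters outside the base64 alphabet
    return base64.b64decode(encoded, validate=True)


def most_likely_language(body: Dict[str, Any]) -> Optional[str]:
    """Return the key with the highest score in languages_detected, if any."""
    scores = body.get("languages_detected") or {}
    return max(scores, key=scores.get) if scores else None
```

With the Python sample above, this could be used as `pdf_bytes = decode_processed_file(response.json())` before writing the bytes to disk.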

Sample Response

{
    "processed_file": "Base64 Encoded File Content of the Redacted File",
    "processed_text": "string",
    "entities": "List[Entity]",
    "entities_present": true,
    "languages_detected": {"lang_1": 0.67, "lang_2": 0.74}
}
© Copyright 2024 Private AI.