Processing DOCX Files

Private AI can scan Microsoft Word DOC and DOCX files for PII (Personally Identifiable Information) and create de-identified or redacted copies. The supported entity types work across every file type, and localized variants of PII, PHI (Protected Health Information), and PCI (Payment Card Industry) entities are detected. For details, see Supported Languages and Supported Entity Types.

info

If you'd like to try it yourself, please visit our free interactive web demo. No code or account is necessary.

How DOCX Files Are Processed

attention

Word document support is a new feature. Depending on the complexity of a document, some of its elements might not be properly de-identified. Whilst we work on expanding support, please consider rendering the document as a PDF and processing the PDF instead. This will ensure all content is processed and redacted.
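One way to follow this advice is to render the document to PDF before submitting it. The sketch below assumes LibreOffice is installed and its `soffice` executable is on the PATH. That tooling choice is an assumption made for illustration, not something Private AI prescribes, and both helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_pdf_convert_command(doc_path: str, out_dir: str) -> list[str]:
    """Build a LibreOffice headless command that renders a document to PDF.

    Assumption: LibreOffice's `soffice` executable is available on PATH.
    """
    return [
        "soffice",
        "--headless",
        "--convert-to", "pdf",
        "--outdir", out_dir,
        doc_path,
    ]


def render_to_pdf(doc_path: str, out_dir: str = ".") -> Path:
    """Run the conversion and return the path where the PDF is expected."""
    subprocess.run(build_pdf_convert_command(doc_path, out_dir), check=True)
    # LibreOffice names the output after the input file's stem.
    return Path(out_dir) / (Path(doc_path).stem + ".pdf")
```

The resulting PDF would then be submitted in place of the Word file.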

DOCX files are processed by extracting each element and handling it according to the table below. The Behaviour column specifies how each element appears in the de-identified or redacted file.

Property Type | Details | Behaviour
Core properties | Author, Category, Comments, Content Status, Identifier, Keywords, Language, Last Modified By, Subject, Title, Version | Redact
Headers and footers | Any content in headers and footers, such as text and images | Redact
Tables | Table objects with text and images | Redact
Images | The Images page provides a more detailed look at image processing | Redact; unsupported image types are removed
Text content | Main body content | Redact
Text boxes | Floating text boxes | Passthrough; will change to Remove in a future release
Embedded links | Hyperlinks to internet pages or documents | Remove
External elements | Tables and charts embedded from another document or file, such as an Excel chart | Passthrough; please process these separately
Embedded audio & video | Videos and audio clips | Passthrough; will change to Remove in a future release
Review comments | Comments from document reviews | Passthrough; will change to Remove in a future release
Shape objects | Shapes containing text | Passthrough; will change to Redact in a future release

info

Graphical content where text is present will be OCRed and then redacted. Check out our OCR Guide to see the available OCR modes.
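Because several element types are currently passed through unredacted, it can help to check a document for them before processing. The following is a minimal sketch using only the Python standard library. It assumes the usual Office Open XML package layout (review comments in `word/comments.xml`, embedded objects under `word/embeddings/`, text box content in `w:txbxContent` tags). These markers are heuristics chosen for illustration, not part of Private AI's API, and `find_passthrough_elements` is a hypothetical helper.

```python
import zipfile


def find_passthrough_elements(docx_file) -> list[str]:
    """Return labels for passthrough-type elements that appear to be present.

    `docx_file` may be a path or a binary file-like object.
    Assumption: part names and tags follow the usual Office Open XML layout.
    """
    found = []
    with zipfile.ZipFile(docx_file) as package:
        names = package.namelist()
        # Review comments live in their own package part.
        if "word/comments.xml" in names:
            found.append("review comments")
        # Embedded objects (for example Excel workbooks) are stored here.
        if any(name.startswith("word/embeddings/") for name in names):
            found.append("embedded objects")
        # Text box content is wrapped in w:txbxContent in the main body.
        if "word/document.xml" in names:
            if b"<w:txbxContent" in package.read("word/document.xml"):
                found.append("text boxes")
    return found
```

An empty list does not guarantee that everything will be redacted; it only means none of these particular markers were found.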

Constraints

  • Some formatting, such as the alignment of components or styling, may not be entirely preserved because the redacted version of the document introduces labels.

How DOC Files Are Processed

DOC files are processed by converting them into DOCX files, following the process described above, and then converting the result back into DOC files.
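Since DOC and DOCX files go through the same pipeline, a client mainly needs to pick the right content type when building a request. The sketch below assumes the payload shape shown in the Sample Request section, and assumes that `application/msword` (the standard MIME type for legacy DOC files) is what the API expects for DOC input. `build_payload` and `CONTENT_TYPES` are hypothetical names.

```python
import base64
from pathlib import Path

# Assumption: standard MIME types for Word formats, keyed by file extension.
CONTENT_TYPES = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".doc": "application/msword",
}


def build_payload(path: str) -> dict:
    """Build a Base64 request payload for a local DOC or DOCX file."""
    file_path = Path(path)
    suffix = file_path.suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    encoded = base64.b64encode(file_path.read_bytes()).decode("ascii")
    return {
        "file": {"data": encoded, "content_type": CONTENT_TYPES[suffix]},
        "entity_detection": {"accuracy": "high", "return_entity": True},
    }
```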

Support Matrix

           | CPU Container | GPU Container | Demo API    | Prod API | PrivateGPT UI
Supported? | Yes           | Yes           | Base64 Only | Yes      | No

Sample Request

info

Please sign up for a free API key to run this code.

import base64

import requests

# Download the sample document and Base64-encode it for the JSON payload.
file_url = "https://paidocumentation.blob.core.windows.net/$web/sample.docx"
file_content = requests.get(file_url).content
file_content_base64 = base64.b64encode(file_content).decode("ascii")

url = "https://api.private-ai.com/deid/v3/process/files/base64"

headers = {"Content-Type": "application/json", "x-api-key": "<INSERT API KEY>"}

# The content type identifies the file as a DOCX document.
payload = {
  "file":{
    "data": file_content_base64,
    "content_type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  },
  "entity_detection": {
    "accuracy": "high",
    "return_entity": True
  }
}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # Fail early on HTTP errors such as an invalid API key.
print(response.json())
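The redacted document comes back Base64-encoded in the `processed_file` field of the response. A minimal sketch of writing it to disk follows, assuming the response body parses to a JSON object with the fields listed under Sample Response. `save_processed_file` is a hypothetical helper.

```python
import base64
from pathlib import Path


def save_processed_file(result: dict, out_path: str) -> Path:
    """Decode the Base64 `processed_file` field and write it to disk."""
    processed = result.get("processed_file")
    if not processed:
        raise ValueError("Response has no 'processed_file' field")
    target = Path(out_path)
    target.write_bytes(base64.b64decode(processed))
    return target
```

With the request above, this could be called as `save_processed_file(response.json(), "sample.redacted.docx")`.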

Sample Response

{
  "processed_file": "Base64 Encoded File Content of the Redacted File",
  "processed_text": "string",
  "entities": "List[Entity]",
  "entities_present": true,
  "languages_detected": {"lang_1": 0.67, "lang_2": 0.74}
}
© Copyright 2024 Private AI.