Processing Excel (XLS/XLSX) Files

Private AI supports scanning Microsoft Excel XLS and XLSX files for PII and creating de-identified or redacted copies. All of Private AI’s supported entity types are detected in both file types, including localized variants of PII (Personally Identifiable Information), PHI (Protected Health Information), and PCI (Payment Card Industry) entities. See the Supported Languages and Supported Entity Types pages for details.

How XLSX Files Are Processed

Similar to CSV files, cell contents of XLSX files are processed using the method described for Tabular Data in the Structured Data Guide. In addition to cell contents, the following elements are handled:

Property Type	Details	Behaviour
Core properties	Author, Category, Comments, Content Status, Identifier, Keywords, Language, Last Modified By, Subject, Title, Version	Redact
Headers and footers	Any content in headers and footers, such as text and images. Can appear when the document is printed	Passthrough, will change to Redact in a future release
Images	The Images page provides a more detailed look at image processing	Redact, unsupported image types are removed
Text boxes	Floating text boxes	Passthrough, will change to Remove in a future release
Embedded links	Hyperlinks to internet pages or documents	Remove
External elements	Tables and charts embedded from another document or file, such as an Excel chart or table object	Passthrough, please process these separately
Embedded audio & video	Videos and audio clips	Remove
Review comments	Comments from document reviews	Passthrough, will change to Remove in a future release
Shape objects	Shapes containing text	Passthrough, will change to Redact in a future release
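
Because several element types in the table above currently pass through unredacted, it can help to check a workbook for them before submitting it. The following is a minimal sketch using only the Python standard library. It is not part of the Private AI API: `PASSTHROUGH_HINTS` and `find_passthrough_parts` are hypothetical names, and the part-name prefixes are an assumption based on how Excel commonly lays out the parts of an XLSX package. A drawing part can also hold images or charts, so a match is a hint to inspect the file, not proof that a text box or shape is present.

```python
import io
import zipfile

# Hypothetical mapping from XLSX package part-name prefixes to the
# pass-through element types in the table above. The prefixes follow the
# usual Excel packaging layout and are an assumption, not a guarantee.
PASSTHROUGH_HINTS = {
    "xl/comments": "review comments",
    "xl/threadedComments/": "review comments",
    "xl/drawings/": "text boxes or shape objects",
    "xl/externalLinks/": "external elements",
}


def find_passthrough_parts(xlsx_bytes: bytes) -> dict[str, list[str]]:
    """Return {element type: [part names]} for parts that may pass through."""
    hits: dict[str, list[str]] = {}
    # An XLSX file is a ZIP archive, so its parts can be listed directly.
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as archive:
        for name in archive.namelist():
            for prefix, label in PASSTHROUGH_HINTS.items():
                if name.startswith(prefix):
                    hits.setdefault(label, []).append(name)
    return hits
```

An empty result suggests none of these parts are present. Headers and footers are stored inside the worksheet parts themselves, so this check does not cover them.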

Note: Graphical content that contains text is processed with OCR and then redacted. You can configure the OCR system by setting it as an environment variable or by sending it in the request object. See our OCR Guide for the available OCR modes and their usage.

How XLS Files Are Processed

XLS files are converted to XLSX, processed as described above, and then converted back to XLS.
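
The sample requests on this page only show the XLSX content type. As a minimal sketch, the content type could be chosen from the file extension as below. `excel_content_type` is a hypothetical helper, and `application/vnd.ms-excel` is the standard MIME type for XLS files; that the API expects exactly this value for XLS input is an assumption to confirm against the API reference.

```python
from pathlib import Path

XLSX_CONTENT_TYPE = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
# Standard MIME type for legacy .xls files (assumed to be what the API expects).
XLS_CONTENT_TYPE = "application/vnd.ms-excel"


def excel_content_type(filename: str) -> str:
    """Pick a content type from the file extension (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    if suffix == ".xlsx":
        return XLSX_CONTENT_TYPE
    if suffix == ".xls":
        return XLS_CONTENT_TYPE
    raise ValueError(f"not an Excel file: {filename!r}")
```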

Constraints

Cell contents of XLSX files are processed using the method described for Tabular Data in the Structured Data Guide. This requires the data to be column-oriented and the headers to be on the first non-empty row.
Shape objects will not be preserved.
Formulas may not be preserved after redaction.
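
The layout constraint above can be checked before a sheet is submitted. The following is a rough sketch that works on rows already loaded as lists of cell values (for example, from a spreadsheet library of your choice). `find_header_row` and `layout_problems` are hypothetical helpers, and the two checks are simple heuristics, not the rules the service applies.

```python
def _trim(row):
    """Convert cells to stripped strings and drop trailing blank cells."""
    cells = ["" if cell is None else str(cell).strip() for cell in row]
    while cells and cells[-1] == "":
        cells.pop()
    return cells


def find_header_row(rows):
    """Return (index, headers) for the first non-empty row, or None."""
    for index, row in enumerate(rows):
        cells = _trim(row)
        if cells:
            return index, cells
    return None


def layout_problems(rows):
    """Return a list of human-readable layout warnings (empty if none)."""
    found = find_header_row(rows)
    if found is None:
        return ["no non-empty row found"]
    index, headers = found
    problems = []
    # Heuristic 1: every column under the header row should have a name.
    if "" in headers:
        problems.append(f"row {index + 1}: header row contains blank cells")
    # Heuristic 2: data rows should not be wider than the header row.
    for number, row in enumerate(rows[index + 1:], start=index + 2):
        if len(_trim(row)) > len(headers):
            problems.append(f"row {number}: more cells than headers")
    return problems
```

Row numbers in the warnings are 1-based so that they match what Excel displays.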

Support Matrix

	CPU Container	GPU Container	Community API	Professional API
Supported	Yes	Yes	Up to 10 MiB	No
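
A client can check the Community API limit from the support matrix above before uploading. This is a minimal sketch with hypothetical helper names. It assumes the 10 MiB limit applies to the raw file size; if the limit applies to the base64-encoded payload instead, the effective raw-file limit is lower, because base64 encoding grows the data by roughly one third.

```python
import os

COMMUNITY_MAX_BYTES = 10 * 1024 * 1024  # 10 MiB, from the support matrix


def fits_community_limit(size_bytes: int) -> bool:
    """True if a file of this size is within the assumed 10 MiB limit."""
    return 0 <= size_bytes <= COMMUNITY_MAX_BYTES


def file_fits_community_limit(path: str) -> bool:
    """Check the size of a file on disk without reading it into memory."""
    return fits_community_limit(os.path.getsize(path))
```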

Sample Request

Note: Connect with one of our privacy experts to run this code.

Python (request payload):

{
    "file": {
        "data": file_content_base64,
        "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    },
    "entity_detection": {
        "return_entity": True
    }
}

Shell (cURL):

# Build the JSON request, POST it, then decode the redacted file from the response.
# "base64 -w 0" is the GNU coreutils option that disables line wrapping.
echo '{
          "file": {"data": "'$(base64 -w 0 sample.xlsx)'",
          "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"},
          "entity_detection": {"return_entity": true}
      }' \
| curl --request POST --url 'https://api.private-ai.com/community/v4/process/files/base64' \
       -H 'Content-Type: application/json' \
       -H 'x-api-key: <YOUR KEY HERE>' \
       -d @- \
       | jq -r .processed_file \
       | base64 -d > 'sample.redacted.xlsx'

Python (requests):

import base64

import requests

# Download the sample workbook and base64-encode it for the JSON payload.
file_url = "https://paidocumentation.blob.core.windows.net/$web/sample.xlsx"
filename_out = "/path/to/output/sample.redacted.xlsx"
file_content = requests.get(file_url).content
file_content_base64 = base64.b64encode(file_content).decode()

url = "https://api.private-ai.com/community/v4/process/files/base64"

headers = {"Content-Type": "application/json", "x-api-key": "<INSERT API KEY>"}

payload = {
    "file": {
        "data": file_content_base64,
        "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    },
    "entity_detection": {
        "return_entity": True
    }
}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # stop on HTTP errors before reading the body

# Decode the redacted workbook from the response and write it to disk.
with open(filename_out, "wb") as f:
    f.write(base64.b64decode(response.json()["processed_file"]))

Python (privateai_client):

import base64

from privateai_client import PAIClient
from privateai_client.objects import request_objects

filename_in = "sample.xlsx"
filename_out = "sample.redacted.xlsx"

file_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

# Read the workbook and base64-encode it as an ASCII string.
with open(filename_in, "rb") as input_file:
    file_data = base64.b64encode(input_file.read())
    file_data = file_data.decode("ascii")

# Build the request object and send it.
file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)

# Decode the redacted workbook from the response and write it to disk.
with open(filename_out, 'wb') as redacted_file:
    processed_file = resp.processed_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)

Sample Response

{
    "processed_file": "Base64 Encoded File Content of the Redacted File",
    "processed_text": "string",
    "entities": "List[Entity]",
    "entities_present": true,
    "languages_detected": {"lang_1": 0.67, "lang_2": 0.74}
}