Processing Excel (XLS/XLSX) Files
Private AI supports scanning Microsoft Excel XLS and XLSX files for PII and creating de-identified or redacted copies. All of Private AI’s supported entity types are detected in both file types, including localized variants of PII (Personally Identifiable Information), PHI (Protected Health Information), and PCI (Payment Card Industry) entities. The Supported Languages and Supported Entity Types pages provide more detail.
How XLSX Files Are Processed
Similar to CSV files, cell contents of XLSX files are processed using the method described for Tabular Data in the Structured Data Guide. In addition to cell contents, the following elements are handled:
Property Type | Details | Behaviour |
---|---|---|
Core properties | Author, Category, Comments, Content Status, Identifier, Keywords, Language, Last Modified By, Subject, Title, Version | Redact |
Headers and footers | Any content in headers and footers, such as text and images, which can appear when the document is printed | Passthrough, will change to Redact in a future release |
Images | The Images page provides a more detailed look at Image processing | Redact, unsupported image types are removed |
Text boxes | Floating text boxes | Passthrough, will change to Remove in a future release |
Embedded links | Hyperlinks to internet pages or documents | Remove |
External elements | Tables and charts embedded from another document or file, such as an Excel chart or table object | Passthrough, please process these separately |
Embedded audio & video | Videos and audio clips | Remove |
Review comments | Comments from document reviews | Passthrough, will change to Remove in a future release |
Shape objects | Shapes containing text | Passthrough, will change to Redact in a future release |
info
Graphical content where text is present will be OCRed and then redacted. You can configure the OCR System by setting it as an Environment Variable or sending it in the request object. Check out our OCR Guide to further understand the OCR modes and their usage.
How XLS Files Are Processed
XLS files are processed by converting them to XLSX files, applying the process described above, and then converting the result back to XLS.
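Because both formats go through the same pipeline, a request for an XLS file differs from one for an XLSX file mainly in the content type it declares. The following is a minimal sketch of building the request body for either format. The helper name `build_payload` is hypothetical, and using `application/vnd.ms-excel` (the registered MIME type for legacy Excel files) for XLS is an assumption: the sample request on this page only shows the XLSX content type.

```python
import base64
from pathlib import Path

# Content types keyed by file extension. The XLSX value is the one used in the
# sample request; the XLS value is the registered MIME type for legacy Excel
# files and is an assumption here, not something this page confirms.
CONTENT_TYPES = {
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xls": "application/vnd.ms-excel",
}


def build_payload(filename: str, file_bytes: bytes) -> dict:
    """Build a base64 file-processing request body (hypothetical helper)."""
    suffix = Path(filename).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"Unsupported spreadsheet extension: {suffix!r}")
    return {
        "file": {
            "data": base64.b64encode(file_bytes).decode("ascii"),
            "content_type": CONTENT_TYPES[suffix],
        },
        "entity_detection": {"return_entity": True},
    }
```

The returned dictionary can be passed as the `json` argument of an HTTP client call in the same way as the payload in the sample request.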
Constraints
- Cell contents of XLSX files are processed using the method described for Tabular Data in the Structured Data Guide. This requires the data to be column-oriented, with the headers on the first non-empty row.
- Shape objects will not be preserved.
- Formulas may not be preserved after redaction.
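The constraints above can be checked before a file is submitted. The sketch below is one possible pre-flight check, not part of the Private AI API: it assumes a sheet has already been loaded into a list of rows (for example with a spreadsheet library of your choice) and that formulas appear as strings beginning with `=`. The function names `first_non_empty_row` and `preflight` are hypothetical.

```python
from typing import Any, List, Optional, Sequence


def _is_empty(cell: Any) -> bool:
    """Treat None and blank strings as empty cells."""
    return cell is None or (isinstance(cell, str) and not cell.strip())


def first_non_empty_row(rows: Sequence[Sequence[Any]]) -> Optional[int]:
    """Return the 0-based index of the first row with any content, or None."""
    for index, row in enumerate(rows):
        if any(not _is_empty(cell) for cell in row):
            return index
    return None


def preflight(rows: Sequence[Sequence[Any]]) -> List[str]:
    """Collect warnings about layouts that conflict with the constraints."""
    header_index = first_non_empty_row(rows)
    if header_index is None:
        return ["sheet is empty"]

    warnings: List[str] = []
    header = rows[header_index]
    body = rows[header_index + 1:]

    # Every column that holds data should have a header on the first non-empty row.
    width = max(len(row) for row in rows[header_index:])
    for col in range(width):
        used = any(col < len(row) and not _is_empty(row[col]) for row in body)
        has_header = col < len(header) and not _is_empty(header[col])
        if used and not has_header:
            warnings.append(f"column {col + 1} has data but no header")

    # Formulas may not be preserved after redaction, so flag them.
    for row_number, row in enumerate(body, start=header_index + 2):
        for col_number, cell in enumerate(row, start=1):
            if isinstance(cell, str) and cell.startswith("="):
                warnings.append(
                    f"formula at row {row_number}, column {col_number} may not be preserved"
                )
    return warnings
```

An empty list of warnings means the sheet has a header for every used column and contains no cells that look like formulas; it does not guarantee that the data is meaningfully column-oriented.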
Support Matrix
| | CPU Container | GPU Container | Community API | Professional API | PrivateGPT UI |
|---|---|---|---|---|---|
| Supported? | Yes | Yes | No | Yes | No |
Sample Request
info
Please sign up for a free API key to run this code.
{
    "file": {
        "data": file_content_base64,
        "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    },
    "entity_detection": {
        "return_entity": True
    }
}

# cURL: encode the file, post it, and decode the redacted copy from the response
echo '{
  "file": {"data": "'$(base64 -w 0 sample.xlsx)'",
           "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"},
  "entity_detection": {"return_entity": true}
}' \
| curl --request POST --url 'https://api.private-ai.com/community/v3/process/files/base64' \
  -H 'Content-Type: application/json' \
  -H 'x-api-key: <YOUR KEY HERE>' \
  -d @- \
| jq -r .processed_file \
| base64 -d > 'sample.redacted.xlsx'

# Python (requests): download the sample file, send it for processing, save the result
import base64

import requests

file_url = "https://paidocumentation.blob.core.windows.net/$web/sample.xlsx"
filename_out = "/path/to/output/sample.redacted.xlsx"

# Fetch the sample workbook and base64-encode it for the JSON payload
file_content = requests.get(file_url).content
file_content_base64 = base64.b64encode(file_content).decode()

url = "https://api.private-ai.com/community/v3/process/files/base64"
headers = {"Content-Type": "application/json", "x-api-key": "<INSERT API KEY>"}
payload = {
    "file": {
        "data": file_content_base64,
        "content_type": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    },
    "entity_detection": {"return_entity": True},
}

response = requests.post(url, json=payload, headers=headers)
response.raise_for_status()  # stop on an HTTP error instead of failing on the JSON lookup

# Decode the redacted workbook and write it to disk
with open(filename_out, "wb") as f:
    f.write(base64.b64decode(response.json()["processed_file"]))

# Python client: read a local file, send it for processing, save the result
import base64

from privateai_client import PAIClient
from privateai_client.objects import request_objects

filename_in = "sample.xlsx"
filename_out = "sample.redacted.xlsx"
file_type = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

client = PAIClient(url="https://api.private-ai.com/community/", api_key="<YOUR API KEY>")

# Base64-encode the input workbook
with open(filename_in, "rb") as b64_file:
    file_data = base64.b64encode(b64_file.read())
    file_data = file_data.decode("ascii")

file_obj = request_objects.file_obj(data=file_data, content_type=file_type)
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)

# Decode the redacted workbook and write it to disk
with open(filename_out, "wb") as redacted_file:
    processed_file = resp.processed_file.encode("ascii")
    processed_file = base64.b64decode(processed_file, validate=True)
    redacted_file.write(processed_file)
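
The samples above write only `processed_file` to disk. As a minimal sketch, the other fields listed under Sample Response can be read from the same JSON body. The helper name `summarize_response` is hypothetical, and the sketch assumes that `entities` is a JSON list and that `languages_detected` maps language codes to scores.

```python
import base64
import json


def summarize_response(body: str) -> dict:
    """Extract the redacted file and a few summary fields (hypothetical helper)."""
    data = json.loads(body)

    # The redacted workbook is returned as a base64 string.
    file_bytes = base64.b64decode(data["processed_file"], validate=True)

    # Assumed shape: {"<language code>": <score>, ...}
    languages = data.get("languages_detected") or {}
    top_language = max(languages, key=languages.get) if languages else None

    return {
        "file_bytes": file_bytes,
        "entities_present": bool(data.get("entities_present")),
        "entity_count": len(data.get("entities") or []),
        "top_language": top_language,
    }
```

With the `requests` sample, the function could be called as `summarize_response(response.text)`.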
Sample Response
{
    "processed_file": "Base64 Encoded File Content of the Redacted File",
    "processed_text": "string",
    "entities": "List[Entity]",
    "entities_present": true,
    "languages_detected": {"lang_1": 0.67, "lang_2": 0.74}
}