Detect Presence of PII in Files without Redaction
To run the example code in this guide, please sign up for your free test API key here.
While we excel at redacting PII, customers often only need detection. In this guide we demonstrate a simple way to use our API to detect the entities in a file and store them in a dictionary alongside the original file location. Used this way, the API operates as a data discovery tool.
First Run the File through our API
In this snippet we use our Python SDK to process a file as base64.
import base64

# `client` and `request_objects` are assumed to be set up already from the Python SDK
# Read the file and encode it as base64
with open('./sample_pdfs/Letter-of-Intent-pdf.pdf', "rb") as file:
    b64_file_data = base64.b64encode(file.read()).decode("ascii")
# Make the request
file_obj = request_objects.file_obj(data=b64_file_data, content_type='application/pdf')
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)
Grab the Entities and Associate Them with a File
Here we simply take the .entities object from the API response and add it to a dictionary, with the original file path stored under the path key. In this case we create a single dictionary that maps the file to its entities, but to process an entire directory of files you can build a list where each element is a dictionary as shown below, or emit each dictionary to a datastore of your choosing.
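from typing import Any, Dict
# Store the original file path under "path" and the detected entities under "entities"
ner_object: Dict[str, Any] = {"path": "./sample_pdfs/Letter-of-Intent-pdf.pdf", "entities": resp.entities}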
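To extend this to an entire directory, the same kind of record can be built once per file and collected in a list. The sketch below is a minimal illustration and not part of the SDK: `scan_directory` and `detect_entities` are hypothetical names, and `detect_entities` stands in for whatever function wraps the base64 request shown earlier and returns the entities from the response.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List


def scan_directory(
    directory: str,
    detect_entities: Callable[[Path], List[Dict[str, Any]]],
    pattern: str = "*.pdf",
) -> List[Dict[str, Any]]:
    """Build one {"path", "entities"} record per matching file in a directory."""
    records: List[Dict[str, Any]] = []
    # sorted() keeps the record order stable across runs
    for file_path in sorted(Path(directory).glob(pattern)):
        records.append(
            {"path": str(file_path), "entities": detect_entities(file_path)}
        )
    return records
```

Passing the detection step in as a callable keeps the directory walk independent of the API call, so it can be tried out with a stub before any requests are sent.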
View the Results
Now we have a clean record of all the PII detected, together with the file location for further inspection if necessary. It is shown here wrapped in a list, as it would be when processing several files. In this case we have kept the full entities list of dictionaries for simplicity, but in your own implementation you can keep just the components you need.
[
  {
    "path": "./sample_pdfs/Letter-of-Intent-pdf.pdf",
    "entities": [
      {
        "processed_text": "NAME_1",
        "text": "Sarah Jackson",
        "location": {
          "page": 1,
          "x0": 0.11588,
          "x1": 0.23794,
          "y0": 0.20727,
          "y1": 0.22227
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.9185,
          "NAME_GIVEN": 0.4492,
          "NAME_FAMILY": 0.4675
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Best Capital Corp",
        "location": {
          "page": 3,
          "x0": 0.11706,
          "x1": 0.27,
          "y0": 0.87,
          "y1": 0.88909
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8789
        }
      }
    ]
  }
]
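If the full entity payload is more than you need, it can be trimmed before storage. The following sketch assumes a record with a path and a list of entity dictionaries that each carry a best_label, as in the output above. `summarize_record` is a hypothetical helper, not an SDK function, and it keeps only the file path plus a count of detections per label.

```python
from collections import Counter
from typing import Any, Dict


def summarize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a {"path", "entities"} record to the path and per-label counts."""
    counts = Counter(entity["best_label"] for entity in record["entities"])
    return {"path": record["path"], "label_counts": dict(counts)}


# Example record using a subset of the fields from the sample output
record = {
    "path": "./sample_pdfs/Letter-of-Intent-pdf.pdf",
    "entities": [
        {"text": "Sarah Jackson", "best_label": "NAME"},
        {"text": "Best Capital Corp", "best_label": "ORGANIZATION"},
    ],
}
summary = summarize_record(record)
print(summary)
```

A summary like this avoids storing the detected text itself, which may be preferable when the discovery results are kept somewhere less protected than the source files.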
Wrap Up
The key in this guide is that we simply don't save the processed_file from the API response. It is still in the payload if you would like to use it, but in this case we discard it. It's that simple :)