Detect Presence of PII in Files without Redaction

To run the example code in this guide, please sign up for a free test API key.

While we excel at redacting PII, customers often just need simple detection. In this guide we demonstrate a simple way to use our API to detect the entities in a file and store them in a dictionary alongside the original file location. Used this way, the API operates as a data discovery tool.

First, Run the File through Our API

In this snippet we use our Python SDK to send a file to the API as a base64-encoded string.

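import base64

from privateai_client import PAIClient, request_objects

# Assumption: replace the placeholders with your own endpoint and API key
client = PAIClient(url="<your-api-url>", api_key="<your-api-key>")

# Read the file and base64-encode its contents
with open('./sample_pdfs/Letter-of-Intent-pdf.pdf', "rb") as file:
    b64_file_data = base64.b64encode(file.read()).decode("ascii")

# Make the request
file_obj = request_objects.file_obj(data=b64_file_data, content_type='application/pdf')
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)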
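
When scanning more than one file type, the encoding step can be wrapped in a small helper. The sketch below is only an illustration and not part of the SDK: encode_file is a hypothetical name, the content type is guessed from the file extension with the standard library's mimetypes module, and the application/octet-stream fallback is an assumption, so check which content types the API accepts before relying on it.

```python
import base64
import mimetypes
from pathlib import Path
from typing import Tuple


def encode_file(path: str) -> Tuple[str, str]:
    """Return (base64 string, guessed content type) for a file on disk.

    Hypothetical helper for illustration; not part of the Private AI SDK.
    """
    data = Path(path).read_bytes()
    b64_data = base64.b64encode(data).decode("ascii")
    # guess_type returns (type, encoding); type is None for unknown extensions
    content_type, _ = mimetypes.guess_type(path)
    # Assumed fallback for files whose type cannot be guessed
    return b64_data, content_type or "application/octet-stream"
```

The two return values map onto the data and content_type arguments passed to request_objects.file_obj in the snippet above.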

Grab the Entities and Associate Them with a File

Here we simply take the entities attribute from the API response and add it to a dictionary, with the original file path stored under the path key. In this case we create a single dictionary that maps one file to its entities. To process an entire directory of files, you can build a list in which each element is such a dictionary, or emit each dictionary to a datastore of your choosing.

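from typing import Any, Dict
# Keep the file path and its entities under the "path" and "entities" keys
ner_object: Dict[str, Any] = {"path": "./sample_pdfs/Letter-of-Intent-pdf.pdf"}
ner_object["entities"] = resp.entities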
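
For the directory case mentioned above, one possible shape is a loop that builds one record per file. This is a minimal sketch under stated assumptions: build_inventory and its detect parameter are hypothetical names, and detect stands in for any callable that sends a single file to the API and returns its entities list (for example, a wrapper around the request shown earlier).

```python
from pathlib import Path
from typing import Any, Callable, Dict, List


def build_inventory(
    directory: str,
    detect: Callable[[Path], List[Dict[str, Any]]],
    pattern: str = "*.pdf",
) -> List[Dict[str, Any]]:
    """Map every matching file under `directory` to its detected entities.

    `detect` is a hypothetical callable supplied by the caller; it is
    expected to return the entities list for one file.
    """
    inventory: List[Dict[str, Any]] = []
    # rglob walks subdirectories too; sorting keeps the output order stable
    for path in sorted(Path(directory).rglob(pattern)):
        inventory.append({"path": str(path), "entities": detect(path)})
    return inventory
```

Passing the detection step in as a function keeps the loop independent of the SDK, so it can be exercised with a stub before any API calls are made.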

View the Results

Now we have a clean dictionary containing all the PII detected, along with the file location for further inspection if necessary. It is shown below as the single element of a list, which is the shape you get when processing several files. We have kept the full list of entity dictionaries for simplicity, but in your own implementation you can keep only the fields you need.

[
  {
    "path": "./sample_pdfs/Letter-of-Intent-pdf.pdf",
    "entities": [
      {
        "processed_text": "NAME_1",
        "text": "Sarah Jackson",
        "location": {
          "page": 1,
          "x0": 0.11588,
          "x1": 0.23794,
          "y0": 0.20727,
          "y1": 0.22227
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.9185,
          "NAME_GIVEN": 0.4492,
          "NAME_FAMILY": 0.4675
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Best Capital Corp",
        "location": {
          "page": 3,
          "x0": 0.11706,
          "x1": 0.27,
          "y0": 0.87,
          "y1": 0.88909
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8789
        }
      }
    ]
  }
]
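
If only a yes/no answer and a label count are needed per file, each record can be reduced further. The following is a sketch rather than a prescribed approach: summarize is a hypothetical helper, the field names follow the sample output above, and the 0.5 score threshold is an arbitrary assumption you should tune for your data.

```python
from collections import Counter
from typing import Any, Dict


def summarize(record: Dict[str, Any], min_score: float = 0.5) -> Dict[str, Any]:
    """Reduce one {"path", "entities"} record to per-label counts.

    Hypothetical helper; `min_score` is an assumed threshold applied to
    the score of each entity's best_label.
    """
    counts = Counter(
        entity["best_label"]
        for entity in record["entities"]
        # Entities whose best label scores below the threshold are skipped
        if entity["labels"].get(entity["best_label"], 0.0) >= min_score
    )
    return {
        "path": record["path"],
        "contains_pii": bool(counts),
        "label_counts": dict(counts),
    }
```

Applied to the record shown above, this yields contains_pii set to True and label_counts of one NAME and one ORGANIZATION.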

Wrap Up

The key point of this guide is that we simply don't save the processed_file returned by the API. It is still in the payload if you would like to use it, but in this case we discard it. It's that simple :)

© Copyright 2024 Private AI.