Detect Entities in Text or Files without Redaction
Info
To run the example code in this guide, please sign up for your free test API key here.
In addition to de-identification and redaction, Private AI also supports entity detection. This is useful for data discovery and also allows Private AI to be used as a general-purpose Named Entity Recognition (NER) engine. In this guide we demonstrate how to use the ner/text endpoint, introduced in 3.9, to return entities in text, and describe an approach to do the same in files.
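The snippets that follow use our Python SDK and assume an initialized client. Here is a minimal setup sketch; the privateai_client package name and the PAIClient constructor arguments shown are assumptions, so point the URL at your own deployment and supply your own API key:
# Setup sketch -- package name and constructor arguments are assumptions; adjust for your deployment
from privateai_client import PAIClient, request_objects
import json

client = PAIClient(url="<YOUR PRIVATE AI URL>", api_key="<YOUR API KEY>")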
Detect entities in text (new in 3.9)
The ner/text route, introduced in 3.9, returns a list of detected entities. It can be thought of as a cut-down version of process/text that only returns the list of detected entities, with a key difference described in the next section. In this snippet we use our Python SDK to invoke the ner/text route on a short sentence and return the list of detected entities:
text_request = request_objects.ner_text_obj(text=["My sample name is John Smith"])
resp = client.ner_text(text_request)
The list of detected entities is found in the entities field:
print(json.dumps(resp.entities, indent=4))
Yields:
[
    [
        {
            "text": "John Smith",
            "location": {
                "stt_idx": 18,
                "end_idx": 28
            },
            "label": "NAME",
            "likelihood": 0.9105876684188843
        },
        {
            "text": "John",
            "location": {
                "stt_idx": 18,
                "end_idx": 22
            },
            "label": "NAME_GIVEN",
            "likelihood": 0.9043319821357727
        },
        {
            "text": "Smith",
            "location": {
                "stt_idx": 23,
                "end_idx": 28
            },
            "label": "NAME_FAMILY",
            "likelihood": 0.9326320886611938
        }
    ]
]
Process vs Detect Entities
Whilst similar, there is a key difference between the entities returned by the process/text and ner/text routes: process/text groups overlapping entity detections into a single entity object, whilst ner/text does not. This is evident in the previous example, where John Smith resulted in three different detections: John Smith, John and Smith. The corresponding process/text entity list is:
[
    {
        "processed_text": "NAME_1",
        "text": "John Smith",
        "location": {
            "stt_idx": 18,
            "end_idx": 28,
            "stt_idx_processed": 18,
            "end_idx_processed": 26
        },
        "best_label": "NAME",
        "labels": {
            "NAME": 0.9106,
            "NAME_GIVEN": 0.4522,
            "NAME_FAMILY": 0.4663
        }
    }
]
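For reference, the grouped list above comes from a call to process/text. Here is a sketch, assuming the process_text_obj request object and process_text client method follow the same pattern as the ner/text call:
# Sketch: send the same text to process/text instead of ner/text
process_request = request_objects.process_text_obj(text=["My sample name is John Smith"])
process_resp = client.process_text(process_request)
print(json.dumps(process_resp.entities, indent=4))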
The ner/text route provides the raw output of the entity detection engine and is recommended if you need details about all entities discovered in a text fragment, including overlapping ones. With the ner/text route you will be able to answer questions like "Does this text contain zip codes?" or "Does it contain a complete address?". This extra flexibility means you should be ready to implement your own post-processing logic.
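For example, here is a small post-processing sketch over the ner/text response from the first snippet; the LOCATION_ZIP label and the likelihood threshold are illustrative assumptions, and each detection is assumed to be a dictionary shaped like the output above:
# Sketch: check whether the first text item contains any zip code detection
detections = resp.entities[0]  # ner/text returns one list of detections per input text
has_zip = any(d["label"] == "LOCATION_ZIP" and d["likelihood"] > 0.5 for d in detections)
print(f"Contains a zip code: {has_zip}")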
Use the process/text route if non-overlapping logical entities are required, e.g. to count the number of detected entities.
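For instance, here is a counting sketch over the grouped process/text entities, assuming each entity is a dictionary shaped like the listing above:
# Sketch: count the non-overlapping entities and tally them by best label
from collections import Counter

grouped_entities = process_resp.entities  # grouped entities from process/text, as listed above
label_counts = Counter(e["best_label"] for e in grouped_entities)
print(f"{len(grouped_entities)} entities detected:", dict(label_counts))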
Detect entities in files
While the ner/text route only supports text at this time, it is still possible to achieve similar behaviour for files, with the caveat mentioned in the previous section: only grouped entities are accessible for files. In this snippet we use our Python SDK to process a file as base64.
import base64

# Read from file
with open('./sample_pdfs/Letter-of-Intent-pdf.pdf', "rb") as file:
    b64_file_data = base64.b64encode(file.read()).decode("ascii")

# Make the request
file_obj = request_objects.file_obj(data=b64_file_data, content_type='application/pdf')
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)
Here again, we simply take the .entities object from the API response and add it to a dictionary with the original file path set in the path key. In this case we are creating one dictionary to map the file to its entities, but to process an entire directory of files you can build a list where each element is such a dictionary, or emit each dictionary to a datastore of your choosing; a sketch of the directory loop appears at the end of this section.
from typing import Any, Dict, List

ner_objects: List[Dict[str, Any]] = []
ner_object = dict(path="./sample_pdfs/Letter-of-Intent-pdf.pdf", entities=resp.entities)
ner_objects.append(ner_object)
print(json.dumps(ner_objects, indent=4))
Now we have a nice clean dictionary with all the PII detected, and the file location for further inspection if necessary.
[
    {
        "path": "./sample_pdfs/Letter-of-Intent-pdf.pdf",
        "entities": [
            {
                "processed_text": "NAME_1",
                "text": "Sarah Jackson",
                "location": {
                    "page": 1,
                    "x0": 0.11588,
                    "x1": 0.23794,
                    "y0": 0.20727,
                    "y1": 0.22227
                },
                "best_label": "NAME",
                "labels": {
                    "NAME": 0.9185,
                    "NAME_GIVEN": 0.4492,
                    "NAME_FAMILY": 0.4675
                }
            },
            {
                "processed_text": "ORGANIZATION_2",
                "text": "Best Capital Corp",
                "location": {
                    "page": 3,
                    "x0": 0.11706,
                    "x1": 0.27,
                    "y0": 0.87,
                    "y1": 0.88909
                },
                "best_label": "ORGANIZATION",
                "labels": {
                    "ORGANIZATION": 0.8789
                }
            }
        ]
    }
]
Note how the text Sarah Jackson resulted in a single grouped entity rather than the three separate detections we saw in the text example above.
In this case we have kept the full list of entity dictionaries for simplicity, but in your own implementation you can keep just the components you need.
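To extend this to a whole directory, a loop like the following sketch builds the list described above; the glob pattern is a placeholder and error handling is omitted:
# Sketch: process every PDF in a directory and collect the detected entities per file
import base64
import glob
import json
from typing import Any, Dict, List

ner_objects: List[Dict[str, Any]] = []
for path in glob.glob("./sample_pdfs/*.pdf"):
    with open(path, "rb") as file:
        b64_file_data = base64.b64encode(file.read()).decode("ascii")
    file_obj = request_objects.file_obj(data=b64_file_data, content_type="application/pdf")
    request_obj = request_objects.file_base64_obj(file=file_obj)
    resp = client.process_files_base64(request_object=request_obj)
    ner_objects.append({"path": path, "entities": resp.entities})

print(json.dumps(ner_objects, indent=4))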
Wrap Up
Getting a list of entities contained in a text input or in a file is equally simple. The key in this guide is to access the entities field in the response. It's that simple 😀. See the API Reference to learn more about the other response fields, like processed_text and processed_file.