Detect Entities in Text or Files without Redaction

info

To run the example code in this guide, please sign up for your free test API key here.

In addition to de-identification and redaction, Private AI also supports entity detection. This is useful for data discovery and also allows Private AI to be used as a general-purpose Named Entity Recognition (NER) engine. In this guide we demonstrate how to use the ner/text endpoint, introduced in 3.9, to return entities in text, and describe an approach to do the same for files.
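
The snippets in this guide assume that the Private AI Python SDK (privateai_client) is installed and that a client has been created. The sketch below shows one possible setup; the URL and API key are placeholders, and the exact constructor arguments may differ between SDK versions, so check the SDK documentation for your version.

# Assumed setup for the snippets below; the URL and API key are placeholders.
import base64
import json
from typing import Any, Dict, List

from privateai_client import PAIClient, request_objects

# Point the client at your Private AI deployment or the hosted API.
client = PAIClient(url="https://your-private-ai-endpoint", api_key="YOUR_TEST_API_KEY")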

Detect entities in text (new in 3.9)

The ner/text route, introduced in 3.9, returns a list of detected entities. It can be thought of as a cut-down version of process/text that returns only the detections themselves, with a key difference described in the next section. In this snippet we use our Python SDK to invoke the ner/text route on a short sentence and return the list of detected entities:

text_request = request_objects.ner_text_obj(text=["My sample name is John Smith"])
resp = client.ner_text(text_request)

The list of detected entities is found in the entities field:

print(json.dumps(resp.entities, indent=4))

Yields:

[
    [
        {
            "text": "John Smith",
            "location": {
                "stt_idx": 18,
                "end_idx": 28
            },
            "label": "NAME",
            "likelihood": 0.9105876684188843
        },
        {
            "text": "John",
            "location": {
                "stt_idx": 18,
                "end_idx": 22
            },
            "label": "NAME_GIVEN",
            "likelihood": 0.9043319821357727
        },
        {
            "text": "Smith",
            "location": {
                "stt_idx": 23,
                "end_idx": 28
            },
            "label": "NAME_FAMILY",
            "likelihood": 0.9326320886611938
        }
    ]
]

Process vs Detect Entities

Whilst similar, there is a key difference between the entities returned by the process/text route and by ner/text: process/text groups overlapping entity detections into a single entity object, whilst ner/text does not. This is evident from the previous example, where John Smith produced three different detections: John Smith, John and Smith. The corresponding process/text entity list is:

[
  {
    "processed_text": "NAME_1",
    "text": "John Smith",
    "location": {
      "stt_idx": 18,
      "end_idx": 28,
      "stt_idx_processed": 18,
      "end_idx_processed": 26
    },
    "best_label": "NAME",
    "labels": {
      "NAME": 0.9106,
      "NAME_GIVEN": 0.4522,
      "NAME_FAMILY": 0.4663
    }
  }
]
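
For reference, the grouped list above can be produced with the process/text route via the same SDK. The sketch below assumes the SDK's process_text_obj request helper and process_text method, and that the response exposes the grouped list in its entities field, as in the file example later in this guide.

# Sketch: obtain the grouped (process/text) entity list for the same sentence.
process_request = request_objects.process_text_obj(text=["My sample name is John Smith"])
process_resp = client.process_text(process_request)
print(json.dumps(process_resp.entities, indent=4))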

The ner/text route provides the raw output of the entity detection engine and is recommended if you need details about all entities discovered in a text fragment, including overlapping ones. With the ner/text route you will be able to answer questions like Does this text contain zip codes? or Does it contain a complete address? This extra flexibility means you should be ready to implement your own post-processing logic, as in the sketch below.
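
As an example of such post-processing, the snippet below scans the labels returned by ner/text to flag any input text containing a zip code. The label name LOCATION_ZIP is an assumption here; verify it against the entity types supported by your version.

# Sketch: flag input texts whose ner/text detections include a zip code.
# The label LOCATION_ZIP is assumed; check it against your entity type list.
def contains_label(entities_for_text, label):
    return any(entity["label"] == label for entity in entities_for_text)

for index, detections in enumerate(resp.entities):
    if contains_label(detections, "LOCATION_ZIP"):
        print(f"Input text {index} contains a zip code")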

You should use process/text if non-overlapping logical entities are required, e.g. to count the number of detected entities.
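
Building on the process/text sketch above, counting is then a single len() call, assuming a single input text so that process_resp.entities holds the flat grouped list shown earlier.

# Sketch: count non-overlapping logical entities for the single input text.
print(f"{len(process_resp.entities)} logical entities detected")  # 1 for "John Smith"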

Detect entities in files

While the ner/text route only supports text at this time, it is still possible to achieve similar behaviour for files, with the caveat mentioned in the previous section: only grouped entities are accessible for files.

In this snippet we use our Python SDK to process a file as base64:

# Read from file
with open('./sample_pdfs/Letter-of-Intent-pdf.pdf', "rb") as file:
    b64_file_data = base64.b64encode(file.read()).decode("ascii")

# Make the request
file_obj = request_objects.file_obj(data=b64_file_data, content_type='application/pdf')
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)

Here again, we simply take the .entities object from the API response and add it to a dictionary, with the original file path set in the path key. In this case we create one dictionary mapping the file to its entities, but to process an entire directory of files you can build a list where each element is a dictionary as described below, or emit each dictionary to a datastore of your choosing.

ner_objects: List[Dict[str, Any]] = []
ner_object = dict(path="./sample_pdfs/Letter-of-Intent-pdf.pdf", entities=resp.entities)
ner_objects.append(ner_object)
print(json.dumps(ner_objects, indent=4))

Now we have a nice clean dictionary with all the PII detected, and the file location for further inspection if necessary.

[
  {
    "path": "./sample_pdfs/Letter-of-Intent-pdf.pdf",
    "entities": [
      {
        "processed_text": "NAME_1",
        "text": "Sarah Jackson",
        "location": {
          "page": 1,
          "x0": 0.11588,
          "x1": 0.23794,
          "y0": 0.20727,
          "y1": 0.22227
        },
        "best_label": "NAME",
        "labels": {
          "NAME": 0.9185,
          "NAME_GIVEN": 0.4492,
          "NAME_FAMILY": 0.4675
        }
      },
      {
        "processed_text": "ORGANIZATION_2",
        "text": "Best Capital Corp",
        "location": {
          "page": 3,
          "x0": 0.11706,
          "x1": 0.27,
          "y0": 0.87,
          "y1": 0.88909
        },
        "best_label": "ORGANIZATION",
        "labels": {
          "ORGANIZATION": 0.8789
        }
      }
    ]
  }
]

Note how the text Sarah Jackson resulted in a single grouped entity, instead of the three separate detections seen in the text example above.

In this case we have kept the full list of entity dictionaries for simplicity, but in your own implementation you can keep just the components you need, as in the sketch below.
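
For instance, the sketch below keeps only the detected text, the best label, and the page for each grouped entity; the field names follow the file response shown above.

# Sketch: keep only selected fields from each grouped entity in the file response.
slim_entities = [
    {
        "text": entity["text"],
        "label": entity["best_label"],
        "page": entity["location"]["page"],
    }
    for entity in resp.entities
]
print(json.dumps(slim_entities, indent=4))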

Wrap Up

Getting a list of entities contained in a text input or in a file is equally simple. The key in this guide is to access the entities field in the response. It's that simple 😀. See the API Reference to learn more about the other response fields like processed_text and processed_file.

© Copyright 2024 Private AI.