Detect Entities in Text or Files without Redaction
Info
To run the example code in this guide, please sign up for your free test API key here.
In addition to de-identification and redaction, Private AI also supports entity detection. This is useful for data discovery and also allows Private AI to be used as a general-purpose Named Entity Recognition (NER) engine. In this guide we demonstrate how to use the ner/text endpoint, introduced in 3.9, to return entities in text, and describe an approach to do the same in files.
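The snippets that follow use our Python SDK and assume an initialized client. Here is a minimal setup sketch; the privateai_client package name and the PAIClient constructor arguments shown are assumptions, so point the URL at your own deployment and supply your own API key:
# Setup sketch -- package name and constructor arguments are assumptions; adjust for your deployment
from privateai_client import PAIClient, request_objects
import json

client = PAIClient(url="<YOUR PRIVATE AI URL>", api_key="<YOUR API KEY>")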
Detect entities in text (new in 3.9)
The ner/text route, introduced in 3.9, returns a list of detected entities. It can be thought of as a cut-down version of process/text that only returns the list of detected entities, with a key difference described in the next section. In this snippet we use our Python SDK to invoke the ner/text route on a short sentence and return the list of detected entities:
text_request = request_objects.ner_text_obj(text=["My sample name is John Smith"])
resp = client.ner_text(text_request)
The list of detected entities is found in the entities field:
print(json.dumps(resp.entities, indent=4))
Yields:
[
    [
        {
            "text": "John Smith",
            "location": {
                "stt_idx": 18,
                "end_idx": 28
            },
            "label": "NAME",
            "likelihood": 0.9105876684188843
        },
        {
            "text": "John",
            "location": {
                "stt_idx": 18,
                "end_idx": 22
            },
            "label": "NAME_GIVEN",
            "likelihood": 0.9043319821357727
        },
        {
            "text": "Smith",
            "location": {
                "stt_idx": 23,
                "end_idx": 28
            },
            "label": "NAME_FAMILY",
            "likelihood": 0.9326320886611938
        }
    ]
]
Process vs Detect Entities
Whilst similar, there is a key difference between the entities returned by the process/text and ner/text routes: process/text groups overlapping entity detections into a single entity object, whilst ner/text does not. This is evident in the previous example, where John Smith resulted in three different detections: John Smith, John and Smith. The corresponding process/text entity list is:
[
    {
        "processed_text": "NAME_1",
        "text": "John Smith",
        "location": {
            "stt_idx": 18,
            "end_idx": 28,
            "stt_idx_processed": 18,
            "end_idx_processed": 26
        },
        "best_label": "NAME",
        "labels": {
            "NAME": 0.9106,
            "NAME_GIVEN": 0.4522,
            "NAME_FAMILY": 0.4663
        }
    }
]
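For reference, the grouped list above comes from a call to process/text. Here is a sketch, assuming the process_text_obj request object and process_text client method follow the same pattern as the ner/text call:
# Sketch: send the same text to process/text instead of ner/text
process_request = request_objects.process_text_obj(text=["My sample name is John Smith"])
process_resp = client.process_text(process_request)
print(json.dumps(process_resp.entities, indent=4))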
The ner/text route provides the raw output of the entity detection engine and is recommended if you need details about all entities discovered in a text fragment, including overlapping ones. With the ner/text route you will be able to answer questions like "Does this text contain zip codes?" or "Does it contain a complete address?". This extra flexibility means you should be ready to implement your own post-processing logic.
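For example, here is a small post-processing sketch over the ner/text response from the first snippet; the LOCATION_ZIP label and the likelihood threshold are illustrative assumptions, and each detection is assumed to be a dictionary shaped like the output above:
# Sketch: check whether the first text item contains any zip code detection
detections = resp.entities[0]  # ner/text returns one list of detections per input text
has_zip = any(d["label"] == "LOCATION_ZIP" and d["likelihood"] > 0.5 for d in detections)
print(f"Contains a zip code: {has_zip}")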
Use the process/text route if non-overlapping logical entities are required, e.g. to count the number of detected entities.
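For instance, here is a counting sketch over the grouped process/text entities, assuming each entity is a dictionary shaped like the listing above:
# Sketch: count the non-overlapping entities and tally them by best label
from collections import Counter

grouped_entities = process_resp.entities  # grouped entities from process/text, as listed above
label_counts = Counter(e["best_label"] for e in grouped_entities)
print(f"{len(grouped_entities)} entities detected:", dict(label_counts))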
Detect entities in files
While the ner/text route only supports text at this time, it is still possible to achieve similar behaviour for files, with the caveat mentioned in the previous section: only grouped entities are accessible for files. In this snippet we use our Python SDK to process a file as base64.
import base64

# Read from file
with open('./sample_pdfs/Letter-of-Intent-pdf.pdf', "rb") as file:
    b64_file_data = base64.b64encode(file.read()).decode("ascii")

# Make the request
file_obj = request_objects.file_obj(data=b64_file_data, content_type='application/pdf')
request_obj = request_objects.file_base64_obj(file=file_obj)
resp = client.process_files_base64(request_object=request_obj)
Here again, we simply take the .entities object from the API response and add it to a dictionary with the original file path set in the path key. In this case we are creating one dictionary to map the file to its entities, but to process an entire directory of files you can build a list where each element is such a dictionary, or emit each dictionary to a datastore of your choosing; a sketch of the directory loop appears at the end of this section.
from typing import Any, Dict, List

ner_objects: List[Dict[str, Any]] = []
ner_object = dict(path="./sample_pdfs/Letter-of-Intent-pdf.pdf", entities=resp.entities)
ner_objects.append(ner_object)
print(json.dumps(ner_objects, indent=4))
Now we have a nice clean dictionary with all the PII detected, and the file location for further inspection if necessary.
[
    {
        "path": "./sample_pdfs/Letter-of-Intent-pdf.pdf",
        "entities": [
            {
                "processed_text": "NAME_1",
                "text": "Sarah Jackson",
                "location": {
                    "page": 1,
                    "x0": 0.11588,
                    "x1": 0.23794,
                    "y0": 0.20727,
                    "y1": 0.22227
                },
                "best_label": "NAME",
                "labels": {
                    "NAME": 0.9185,
                    "NAME_GIVEN": 0.4492,
                    "NAME_FAMILY": 0.4675
                }
            },
            {
                "processed_text": "ORGANIZATION_2",
                "text": "Best Capital Corp",
                "location": {
                    "page": 3,
                    "x0": 0.11706,
                    "x1": 0.27,
                    "y0": 0.87,
                    "y1": 0.88909
                },
                "best_label": "ORGANIZATION",
                "labels": {
                    "ORGANIZATION": 0.8789
                }
            }
        ]
    }
]
Note how the text Sarah Jackson resulted in a single grouped entity rather than the three separate detections we saw in the text example above.
In this case we have kept the full list of entity dictionaries for simplicity, but in your own implementation you can keep just the components you need.
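To extend this to a whole directory, a loop like the following sketch builds the list described above; the glob pattern is a placeholder and error handling is omitted:
# Sketch: process every PDF in a directory and collect the detected entities per file
import base64
import glob
import json
from typing import Any, Dict, List

ner_objects: List[Dict[str, Any]] = []
for path in glob.glob("./sample_pdfs/*.pdf"):
    with open(path, "rb") as file:
        b64_file_data = base64.b64encode(file.read()).decode("ascii")
    file_obj = request_objects.file_obj(data=b64_file_data, content_type="application/pdf")
    request_obj = request_objects.file_base64_obj(file=file_obj)
    resp = client.process_files_base64(request_object=request_obj)
    ner_objects.append({"path": path, "entities": resp.entities})

print(json.dumps(ner_objects, indent=4))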
Wrap Up
Getting a list of entities contained in a text input or in a file is equally simple. The key in this guide is to access the entities field in the response. It's that simple 😀. See the API Reference to learn more about the other response fields, like processed_text and processed_file.