Processing Structured Data Sources

This guide illustrates how to use Private AI on Structured Data sources such as Databases, JSON, CSV and XML files. It is also applicable to Databricks and Snowflake.

info

In order to run the example code in this guide, please sign up for your free test api key here or run the container.

Tabular Data

note

If it is only required to redact certain Columns such as VisitNotes please see our PII Safe Sentiment Analysis Walkthrough.

In addition to raw text and files, you can use Private AI to process tabular data, such as found in a database, CSV or Microsoft Excel file. For example, consider the below table:

ClientNm DOfB HlthCareNm TempFd VisitNotes
Stefan Krugler 150368 2342345234 NYC Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in
Hari Seldon 201030 4947384944 LA Patient came in for a regular checkup.

One could try to create a rule-based system, but it is difficult:

  • The column names are all acronyms, DOfB in particular is an uncommon way of referring to a date of birth.
  • Each cell value is written as an integer to save space, removing the hint of / or . characters in identifying the date.
  • TempFd contains the city the individual lives in, but is stored in a temporary field.
  • VisitNotes is a varchar free text field. This field is particularly useful for ML use cases, for example to determine the main visit reasons or for case resolution.

In particular, longer string and blob fields in databases can be difficult.

Processing As a String

A simple approach would be to join each cell together and process as a single long string:

Request BodyPython
Copy
Copied
{
    "text": ["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."]
}
Copy
Copied
from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/deid/", api_key="<YOUR API KEY>")

text_request = request_objects.process_text_obj(text=["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."])

response = client.process_text(text_request)

print(response.processed_text)

However the result isn't great:

Copy
Copied
['ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, [NAME_1], [LOCATION_1], Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in, [NAME_MEDICAL_PROFESSIONAL_1], [LOCATION_2], Patient came in for a regular checkup.']

Formatting the Data Correctly

We can get better results by formatting the input in a "key1: value1, key2: value2, ... pattern, to make sure that each cell entry includes the column name for context. Note that it's important to put a colon at the end of each key. It is also better to process row by row, which scales better for larger tables:

Request BodyPython
Copy
Copied
{
    "text": [
      "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in",
      "ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."
    ],
}
Copy
Copied
from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/deid/", api_key="<YOUR API KEY>")

text_request = request_objects.process_text_obj(text=[
    "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in",
      "ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."])

response = client.process_text(text_request)

print(response.processed_text)

Whilst the result is much better, it's susceptible to False-Positive detections on column names. Also as the result is a long string, it is more difficult to relate the predictions back to individual cells:

Copy
Copied
['ClientNm: [NAME_1], DOfB: [DOB_1], HlthCareNm: [NUMERICAL_PII_1], TempFd: [LOCATION_CITY_1], VisitNotes: Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in', 
'ClientNm: [NAME_1], DOfB: [DOB_1], HlthCareNm: [HEALTHCARE_NUMBER_1], TempFd: [LOCATION_CITY_1], VisitNotes: Patient came in for a regular checkup.']

Best Practice: Correct Formatting With Link Batch

We can solve both of the issues with the previous approach by feeding the table in as a list with link_batch on. This is functionally equivalent to the example presented above and this allows us to easily get the results of each cell without any post-processing logic.

Request BodyPython
Copy
Copied
# Request 1:
{
    "text": [
      "ClientNm:", "Stefan Krugler", "DOfB:", "150368", "HlthCareNm:", "2342345234", "TempFd:", "NYC", "VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in",
    ],
    "link_batch": true
}

# Request 2:
{
    "text": [
      "ClientNm:", "Hari Seldon", "DOfB:", "201030", "HlthCareNm:", "4947384944", "TempFd:", "LA", "VisitNotes:", "Patient came in for a regular checkup.",
    ],
    "link_batch": true
}
Copy
Copied
from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/deid/", api_key="<YOUR API KEY>")

row_data = [["ClientNm:", "Stefan Krugler", "DOfB:", "150368", "HlthCareNm:", "2342345234", "TempFd:", "NYC", "VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in"], ["ClientNm:", "Hari Seldon", "DOfB:", "201030", "HlthCareNm:", "4947384944", "TempFd:", "LA", "VisitNotes:", "Patient came in for a regular checkup."]]

for r in row_data:
  text_request = request_objects.process_text_obj(text=r, link_batch=True)
  response = client.process_text(text_request)
  print(response.processed_text)

The result is the same, but it is now easy to get the output for each cell by looking at the corresponding response in the list:

Copy
Copied
['ClientNm:', '[NAME_1]', 'DOfB:', '[DOB_1]', 'HlthCareNm:', '[HEALTHCARE_NUMBER_1]', 'TempFd:', '[LOCATION_CITY_1]', 'VisitNotes:', 'Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in']
['ClientNm:', '[NAME_1]', 'DOfB:', '[DOB_1]', 'HlthCareNm:', '[HEALTHCARE_NUMBER_1]', 'TempFd:', '[LOCATION_CITY_1]', 'VisitNotes:', 'Patient came in for a regular checkup.']

In addition to the redacted result, the entity list (not shown) can be used. This approach can be adjusted to column-by-column, but row-by-row usually leads to better detection performance.

The examples above process the entire table but they can easily be adapted to process specific rows and columns.

JSON & XML

info

The approach described in this section is implemented in the process/files routes for JSON and XML.

A similar approach applies for JSON and XML. Consider the following JSON:

Copy
Copied
{
  "VisitDates": ["15/2/2023", "8/5/2023"],
  "PhysName": "Dr Walter",
  "PersonAttributes": {"Weight": 78.5, "Height": 189},
  "MedLoc": "Mt Western Cancer Center"
}

Whilst it is possible to send the JSON through as text, like one might to an LLM, this might corrupt the JSON - a lot of pre- and post-processing logic is necessary for a production application. Instead, we can parse the JSON and arrange it into a <key>: <value> ... pattern like above, add colons to the last key before each value and enable link_batch. The only difference is that a value might have multiple keys in the hierarchy, so we prepend all of those. The request payload then looks like:

Request BodyPython
Copy
Copied
{
    "text": [
      "VisitDates:", "15/2/2023", "VisitDates:", "8/5/2023", "PhysName:", "Dr Walter", "PersonAttributes:", "Weight:", "78.5", "PersonAttributes:", "Height:", "189", "MedLoc:", "Mt Western Cancer Center"
    ],
    "link_batch": true
}
Copy
Copied
from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/deid/", api_key="aa716098693d470ab5ff3156cd040aa3")

text_request = request_objects.process_text_obj(text=["VisitDates:", "15/2/2023", "VisitDates:", "8/5/2023", "PhysName:", "Dr Walter", "PersonAttributes:", "Weight:", "78.5", "PersonAttributes:", "Height:", "189", "MedLoc:", "Mt Western Cancer Center"], link_batch=True)

response = client.process_text(text_request)

print(response.processed_text)

Yields the following:

Copy
Copied
['VisitDates:', '[DATE_1]', 'VisitDates:', '[DATE_2]', 'PhysName:', '[NAME_MEDICAL_PROFESSIONAL_1]', 'PersonAttributes:', 'Weight:', '[PHYSICAL_ATTRIBUTE_1]', 'PersonAttributes:', 'Height:', '[PHYSICAL_ATTRIBUTE_2]', 'MedLoc:', '[ORGANIZATION_MEDICAL_FACILITY_1]']
© Copyright 2024 Private AI.