Processing Structured Data Sources

This guide illustrates how to use Private AI on Structured Data sources such as Databases, JSON, CSV and XML files. It is also applicable to Databricks and Snowflake.

info

In order to run the example code in this guide, please sign up for your free test api key here or run the container.

Tabular Data

note

If it is only required to redact certain Columns such as VisitNotes please see our PII Safe Sentiment Analysis Walkthrough.

In addition to raw text and files, you can use Private AI to process tabular data, such as found in a database, CSV or Microsoft Excel file. For example, consider the below table:

ClientNm DOfB HlthCareNm TempFd VisitNotes
Stefan Krugler 150368 2342345234 NYC Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in
Hari Seldon 201030 4947384944 LA Patient came in for a regular checkup.

One could try to create a rule-based system, but it is difficult:

  • The column names are all acronyms, DOfB in particular is an uncommon way of referring to a date of birth.
  • Each cell value is written as an integer to save space, removing the hint of / or . characters in identifying the date.
  • TempFd contains the city the individual lives in, but is stored in a temporary field.
  • VisitNotes is a varchar free text field. This field is particularly useful for ML use cases, for example to determine the main visit reasons or for case resolution.

In particular, longer string and blob fields in databases can be difficult.

Processing As a String

A simple approach would be to join each row together and process as a single long string:

Request BodycURLPythonPython Client
Copy
Copied
{
    "text": ["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."]
}
Copy
Copied
curl --location 'https://api.private-ai.com/community/v4/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
    "text": [
        "ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."
    ]
}'
Copy
Copied
import requests

r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
                  headers={"x-api-key": "<YOUR API KEY>"},
                  json={"text": ["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."]})

results = r.json()

print(results)
Copy
Copied
from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

text_request = request_objects.process_text_obj(text=["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."])

response = client.process_text(text_request)

print(response.processed_text)

However the result isn't great:

Copy
Copied
['ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, [NAME_1], [LOCATION_1], Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in, [NAME_MEDICAL_PROFESSIONAL_1], [LOCATION_2], Patient came in for a regular checkup.']

Formatting the Data Correctly

We can get better results by formatting the input in a "key1: value1, key2: value2, ... pattern, to make sure that each cell entry includes the column name for context. If multiple headers exist, the format should look like: "key1a: key1b: value1, key2: value2, .... Note that it's important to put a colon at the end of each key.

Request BodycURLPythonPython Client
Copy
Copied
{
    "text": [
      "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."
    ],
}
Copy
Copied
curl --location 'https://api.private-ai.com/community/v4/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
    "text": [
        "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."
    ]
}'
Copy
Copied
import requests

r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
                  headers={"x-api-key": "<YOUR API KEY>"},
                  json={"text": ["ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."]})

results = r.json()

print(results)
Copy
Copied
from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

text_request = request_objects.process_text_obj(text=[
    "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."])

response = client.process_text(text_request)

print(response.processed_text)

Whilst the result is much better, it's susceptible to False-Positive detections on column names. Also as the result is a long string, it is more difficult to relate the predictions back to individual cells:

Copy
Copied
['ClientNm: [NAME_1], DOfB: [DOB_1], HlthCareNm: [HEALTHCARE_NUMBER_1], TempFd: [LOCATION_CITY_1], VisitNotes: Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in, ClientNm: [NAME_2], DOfB: [DOB_2], HlthCareNm: [HEALTHCARE_NUMBER_2], TempFd: [LOCATION_CITY_2], VisitNotes: Patient came in for a regular checkup.']

Best Practice: Correct Formatting With Link Batch

info

These examples process the entire table in a single API request to ensure consistent entity numbering. For larger tables it is recommended to process row by row, without numbered markers.

We can solve both of the issues with the previous approach by feeding the table in as a list with link_batch on. This is functionally equivalent to the example presented above and this allows us to easily get the results of each cell without any post-processing logic.

Request BodycURLPythonPython Client
Copy
Copied
{
    "text": [
      "ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup.",
    ],
    "link_batch": True
}
Copy
Copied
curl --location 'https://api.private-ai.com/community/v4/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
    "text": [
        "ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."
    ]
}'
Copy
Copied
import requests

r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
                  headers={"x-api-key": "<YOUR API KEY>"},
                  json={"text": ["ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."]})

results = r.json()

print(results)
Copy
Copied
from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

text = ["ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."]

text_request = request_objects.process_text_obj(text=text, link_batch=True)
response = client.process_text(text_request)
print(response.processed_text)

The result is the same, but it is now easy to get the output for each cell by looking at the corresponding response in the list:

Copy
Copied
['ClientNm:', '[NAME_1]', ', DOfB:', '[DOB_1]', ', HlthCareNm:', '[HEALTHCARE_NUMBER_1]', ', TempFd:', '[LOCATION_CITY_1]', ', VisitNotes:', 'Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in', 'ClientNm:', '[NAME_2]', ', DOfB:', '[DOB_2]', ', HlthCareNm:', '[HEALTHCARE_NUMBER_2]', ', TempFd:', '[LOCATION_CITY_2]', ', VisitNotes:', 'Patient came in for a regular checkup.']

In addition to the redacted result, the entity list (not shown) can be used. This approach can be adjusted to column-by-column, but row-by-row usually leads to better detection performance.

The examples above process the entire table but they can easily be adapted to process specific rows and columns.

JSON & XML

info

The approach described in this section is implemented in the process/files routes for JSON and XML.

A similar approach applies for JSON and XML. Consider the following JSON:

Copy
Copied
{
  "VisitDates": ["15/2/2023", "8/5/2023"],
  "PhysName": "Dr Walter",
  "PersonAttributes": {"Weight": 78.5, "Height": 189},
  "MedLoc": "Mt Western Cancer Center"
}

Whilst it is possible to send the JSON through as text, like one might to an LLM, this might corrupt the JSON - a lot of pre- and post-processing logic is necessary for a production application. Instead, we can parse the JSON and arrange it into a <key>: <value> ... pattern like above, add colons to the last key before each value and enable link_batch. The only difference is that a value might have multiple keys in the hierarchy, so we prepend all of those. The request payload then looks like:

Request BodycURLPythonPython Client
Copy
Copied
{
    "text": [
      "VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"
    ],
    "link_batch": true
}
Copy
Copied
curl --location 'https://api.private-ai.com/community/v4/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
    "text": [
        "VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"
    ]
}'
Copy
Copied
import requests

r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
                  headers={"x-api-key": "<YOUR API KEY>"},
                  json={"text": ["VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"]})

results = r.json()

print(results)
Copy
Copied
from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")

text_request = request_objects.process_text_obj(text=["VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"], link_batch=True)

response = client.process_text(text_request)

print(response.processed_text)

Yields the following:

Copy
Copied
['VisitDates:', '[DATE_1]', ', VisitDates:', '[DATE_2]', ', PhysName:', '[NAME_MEDICAL_PROFESSIONAL_1]', ', PersonAttributes: Weight:', '[PHYSICAL_ATTRIBUTE_1]', 'PersonAttributes: Height:', '[PHYSICAL_ATTRIBUTE_2]', ', MedLoc:', '[ORGANIZATION_MEDICAL_FACILITY_1]']
© Copyright 2024 Private AI.