Processing Structured Data Sources

This guide illustrates how to use Private AI on Structured Data sources such as Databases, JSON, CSV and XML files. It is also applicable to Databricks and Snowflake.

info

In order to run the example code in this guide, please sign up for your free test api key here or run the container.

Tabular Data

note

If it is only required to redact certain Columns such as VisitNotes please see our PII Safe Sentiment Analysis Walkthrough.

In addition to raw text and files, you can use Private AI to process tabular data, such as found in a database, CSV or Microsoft Excel file. For example, consider the below table:

ClientNm	DOfB	HlthCareNm	TempFd	VisitNotes
Stefan Krugler	150368	2342345234	NYC	Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in
Hari Seldon	201030	4947384944	LA	Patient came in for a regular checkup.

One could try to create a rule-based system, but it is difficult:

The column names are all acronyms, DOfB in particular is an uncommon way of referring to a date of birth.
Each cell value is written as an integer to save space, removing the hint of / or . characters in identifying the date.
TempFd contains the city the individual lives in, but is stored in a temporary field.
VisitNotes is a varchar free text field. This field is particularly useful for ML use cases, for example to determine the main visit reasons or for case resolution.

In particular, longer string and blob fields in databases can be difficult.

Processing As a String

A simple approach would be to join each row together and process as a single long string:

Request BodycURLPythonPython Client

Copy

Copied

{
    "text": ["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."]
}

Copy

Copied

curl --location 'https://api.private-ai.com/community/v3/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
    "text": [
        "ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."
    ]
}'

Copy

Copied

import requests

r = requests.post(url="https://api.private-ai.com/community/v3/process/text",
                  headers={"x-api-key": "<YOUR API KEY>"},
                  json={"text": ["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."]})

results = r.json()

print(results)

Copy

Copied

from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/community/", api_key="<YOUR API KEY>")

text_request = request_objects.process_text_obj(text=["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."])

response = client.process_text(text_request)

print(response.processed_text)

However the result isn't great:

Copy

Copied

['ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, [NAME_1], [LOCATION_1], Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in, [NAME_MEDICAL_PROFESSIONAL_1], [LOCATION_2], Patient came in for a regular checkup.']

Formatting the Data Correctly

We can get better results by formatting the input in a "key1: value1, key2: value2, ... pattern, to make sure that each cell entry includes the column name for context. If multiple headers exist, the format should look like: "key1a: key1b: value1, key2: value2, .... Note that it's important to put a colon at the end of each key.

Request BodycURLPythonPython Client

Copy

Copied

{
    "text": [
      "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."
    ],
}

Copy

Copied

curl --location 'https://api.private-ai.com/community/v3/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
    "text": [
        "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."
    ]
}'

Copy

Copied

import requests

r = requests.post(url="https://api.private-ai.com/community/v3/process/text",
                  headers={"x-api-key": "<YOUR API KEY>"},
                  json={"text": ["ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."]})

results = r.json()

print(results)

Copy

Copied

from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/community/", api_key="<YOUR API KEY>")

text_request = request_objects.process_text_obj(text=[
    "ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."])

response = client.process_text(text_request)

print(response.processed_text)

Whilst the result is much better, it's susceptible to False-Positive detections on column names. Also as the result is a long string, it is more difficult to relate the predictions back to individual cells:

Copy

Copied

['ClientNm: [NAME_1], DOfB: [DOB_1], HlthCareNm: [HEALTHCARE_NUMBER_1], TempFd: [LOCATION_CITY_1], VisitNotes: Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in, ClientNm: [NAME_2], DOfB: [DOB_2], HlthCareNm: [HEALTHCARE_NUMBER_2], TempFd: [LOCATION_CITY_2], VisitNotes: Patient came in for a regular checkup.']

Best Practice: Correct Formatting With Link Batch

info

These examples process the entire table in a single API request to ensure consistent entity numbering. For larger tables it is recommended to process row by row, without numbered markers.

We can solve both of the issues with the previous approach by feeding the table in as a list with link_batch on. This is functionally equivalent to the example presented above and this allows us to easily get the results of each cell without any post-processing logic.

Request BodycURLPythonPython Client

Copy

Copied

{
    "text": [
      "ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup.",
    ],
    "link_batch": True
}

Copy

Copied

curl --location 'https://api.private-ai.com/community/v3/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
    "text": [
        "ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."
    ]
}'

Copy

Copied

import requests

r = requests.post(url="https://api.private-ai.com/community/v3/process/text",
                  headers={"x-api-key": "<YOUR API KEY>"},
                  json={"text": ["ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."]})

results = r.json()

print(results)

Copy

Copied

from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/community/", api_key="<YOUR API KEY>")

text = ["ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."]

text_request = request_objects.process_text_obj(text=text, link_batch=True)
response = client.process_text(text_request)
print(response.processed_text)

The result is the same, but it is now easy to get the output for each cell by looking at the corresponding response in the list:

Copy

Copied

['ClientNm:', '[NAME_1]', ', DOfB:', '[DOB_1]', ', HlthCareNm:', '[HEALTHCARE_NUMBER_1]', ', TempFd:', '[LOCATION_CITY_1]', ', VisitNotes:', 'Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in', 'ClientNm:', '[NAME_2]', ', DOfB:', '[DOB_2]', ', HlthCareNm:', '[HEALTHCARE_NUMBER_2]', ', TempFd:', '[LOCATION_CITY_2]', ', VisitNotes:', 'Patient came in for a regular checkup.']

In addition to the redacted result, the entity list (not shown) can be used. This approach can be adjusted to column-by-column, but row-by-row usually leads to better detection performance.

The examples above process the entire table but they can easily be adapted to process specific rows and columns.

JSON & XML

info

The approach described in this section is implemented in the process/files routes for JSON and XML.

A similar approach applies for JSON and XML. Consider the following JSON:

Copy

Copied

{
  "VisitDates": ["15/2/2023", "8/5/2023"],
  "PhysName": "Dr Walter",
  "PersonAttributes": {"Weight": 78.5, "Height": 189},
  "MedLoc": "Mt Western Cancer Center"
}

Whilst it is possible to send the JSON through as text, like one might to an LLM, this might corrupt the JSON - a lot of pre- and post-processing logic is necessary for a production application. Instead, we can parse the JSON and arrange it into a <key>: <value> ... pattern like above, add colons to the last key before each value and enable link_batch. The only difference is that a value might have multiple keys in the hierarchy, so we prepend all of those. The request payload then looks like:

Request BodycURLPythonPython Client

Copy

Copied

{
    "text": [
      "VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"
    ],
    "link_batch": true
}

Copy

Copied

curl --location 'https://api.private-ai.com/community/v3/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
    "text": [
        "VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"
    ]
}'

Copy

Copied

import requests

r = requests.post(url="https://api.private-ai.com/community/v3/process/text",
                  headers={"x-api-key": "<YOUR API KEY>"},
                  json={"text": ["VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"]})

results = r.json()

print(results)

Copy

Copied

from privateai_client import PAIClient
from privateai_client import request_objects

client = PAIClient(url="https://api.private-ai.com/community/", api_key="<YOUR API KEY>")

text_request = request_objects.process_text_obj(text=["VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"], link_batch=True)

response = client.process_text(text_request)

print(response.processed_text)

Yields the following:

Copy

Copied

['VisitDates:', '[DATE_1]', ', VisitDates:', '[DATE_2]', ', PhysName:', '[NAME_MEDICAL_PROFESSIONAL_1]', ', PersonAttributes: Weight:', '[PHYSICAL_ATTRIBUTE_1]', 'PersonAttributes: Height:', '[PHYSICAL_ATTRIBUTE_2]', ', MedLoc:', '[ORGANIZATION_MEDICAL_FACILITY_1]']