Processing Structured Data Sources
This guide illustrates how to use Private AI on Structured Data sources such as Databases, JSON, CSV and XML files. It is also applicable to Databricks and Snowflake.
info
In order to run the example code in this guide, please sign up for your free test api key here or run the container.
Tabular Data
note
If it is only required to redact certain Columns such as VisitNotes
please see our PII Safe Sentiment Analysis Walkthrough.
In addition to raw text and files, you can use Private AI to process tabular data, such as found in a database, CSV or Microsoft Excel file. For example, consider the below table:
ClientNm | DOfB | HlthCareNm | TempFd | VisitNotes |
---|---|---|---|---|
Stefan Krugler | 150368 | 2342345234 | NYC | Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in |
Hari Seldon | 201030 | 4947384944 | LA | Patient came in for a regular checkup. |
One could try to create a rule-based system, but it is difficult:
-
The column names are all acronyms,
DOfB
in particular is an uncommon way of referring to a date of birth. -
Each cell value is written as an integer to save space, removing the hint of
/
or.
characters in identifying the date. -
TempFd
contains the city the individual lives in, but is stored in a temporary field. -
VisitNotes
is a varchar free text field. This field is particularly useful for ML use cases, for example to determine the main visit reasons or for case resolution.
In particular, longer string and blob fields in databases can be difficult.
Processing As a String
A simple approach would be to join each row together and process as a single long string:
{
"text": ["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."]
}
curl --location 'https://api.private-ai.com/community/v4/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
"text": [
"ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."
]
}'
import requests
r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
headers={"x-api-key": "<YOUR API KEY>"},
json={"text": ["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."]})
results = r.json()
print(results)
from privateai_client import PAIClient
from privateai_client import request_objects
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")
text_request = request_objects.process_text_obj(text=["ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, Stefan Krugler, 150368, 2342345234, NYC, Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, Hari Seldon, 201030, 4947384944, LA, Patient came in for a regular checkup."])
response = client.process_text(text_request)
print(response.processed_text)
However the result isn't great:
['ClientNm, DOfB, HlthCareNm, TempFd, VisitNotes, [NAME_1], [LOCATION_1], Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in, [NAME_MEDICAL_PROFESSIONAL_1], [LOCATION_2], Patient came in for a regular checkup.']
Formatting the Data Correctly
We can get better results by formatting the input in a "key1: value1, key2: value2, ...
pattern, to make sure that each cell entry includes the column name for context. If multiple headers exist, the format should look like: "key1a: key1b: value1, key2: value2, ...
. Note that it's important to put a colon at the end of each key.
{
"text": [
"ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."
],
}
curl --location 'https://api.private-ai.com/community/v4/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
"text": [
"ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."
]
}'
import requests
r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
headers={"x-api-key": "<YOUR API KEY>"},
json={"text": ["ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."]})
results = r.json()
print(results)
from privateai_client import PAIClient
from privateai_client import request_objects
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")
text_request = request_objects.process_text_obj(text=[
"ClientNm: Stefan Krugler, DOfB: 150368, HlthCareNm: 2342345234, TempFd: NYC, VisitNotes: Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in, ClientNm: Hari Seldon, DOfB: 201030, HlthCareNm: 4947384944, TempFd: LA, VisitNotes: Patient came in for a regular checkup."])
response = client.process_text(text_request)
print(response.processed_text)
Whilst the result is much better, it's susceptible to False-Positive detections on column names. Also as the result is a long string, it is more difficult to relate the predictions back to individual cells:
['ClientNm: [NAME_1], DOfB: [DOB_1], HlthCareNm: [HEALTHCARE_NUMBER_1], TempFd: [LOCATION_CITY_1], VisitNotes: Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in, ClientNm: [NAME_2], DOfB: [DOB_2], HlthCareNm: [HEALTHCARE_NUMBER_2], TempFd: [LOCATION_CITY_2], VisitNotes: Patient came in for a regular checkup.']
Best Practice: Correct Formatting With Link Batch
info
These examples process the entire table in a single API request to ensure consistent entity numbering. For larger tables it is recommended to process row by row, without numbered markers.
We can solve both of the issues with the previous approach by feeding the table in as a list with link_batch on. This is functionally equivalent to the example presented above and this allows us to easily get the results of each cell without any post-processing logic.
{
"text": [
"ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup.",
],
"link_batch": True
}
curl --location 'https://api.private-ai.com/community/v4/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
"text": [
"ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."
]
}'
import requests
r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
headers={"x-api-key": "<YOUR API KEY>"},
json={"text": ["ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."]})
results = r.json()
print(results)
from privateai_client import PAIClient
from privateai_client import request_objects
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")
text = ["ClientNm:", "Stefan Krugler", ", DOfB:", "150368", ", HlthCareNm:", "2342345234", ", TempFd:", "NYC", ", VisitNotes:", "Prescribed Stefan 10mg of Ibuprofen for his migraines. Will call him on +1 324 4325 5462 when I get the test results in", "ClientNm:", "Hari Seldon", ", DOfB:", "201030", ", HlthCareNm:", "4947384944", ", TempFd:", "LA", ", VisitNotes:", "Patient came in for a regular checkup."]
text_request = request_objects.process_text_obj(text=text, link_batch=True)
response = client.process_text(text_request)
print(response.processed_text)
The result is the same, but it is now easy to get the output for each cell by looking at the corresponding response in the list:
['ClientNm:', '[NAME_1]', ', DOfB:', '[DOB_1]', ', HlthCareNm:', '[HEALTHCARE_NUMBER_1]', ', TempFd:', '[LOCATION_CITY_1]', ', VisitNotes:', 'Prescribed [NAME_GIVEN_1] [DOSE_1] of [DRUG_1] for his [CONDITION_1]. Will call him on [PHONE_NUMBER_1] when I get the test results in', 'ClientNm:', '[NAME_2]', ', DOfB:', '[DOB_2]', ', HlthCareNm:', '[HEALTHCARE_NUMBER_2]', ', TempFd:', '[LOCATION_CITY_2]', ', VisitNotes:', 'Patient came in for a regular checkup.']
In addition to the redacted result, the entity list (not shown) can be used. This approach can be adjusted to column-by-column, but row-by-row usually leads to better detection performance.
The examples above process the entire table but they can easily be adapted to process specific rows and columns.
JSON & XML
info
A similar approach applies for JSON and XML. Consider the following JSON:
{
"VisitDates": ["15/2/2023", "8/5/2023"],
"PhysName": "Dr Walter",
"PersonAttributes": {"Weight": 78.5, "Height": 189},
"MedLoc": "Mt Western Cancer Center"
}
Whilst it is possible to send the JSON through as text, like one might to an LLM, this might corrupt the JSON - a lot of pre- and post-processing logic is necessary for a production application. Instead, we can parse the JSON and arrange it into a <key>: <value> ...
pattern like above, add colons to the last key before each value and enable link_batch. The only difference is that a value might have multiple keys in the hierarchy, so we prepend all of those. The request payload then looks like:
{
"text": [
"VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"
],
"link_batch": true
}
curl --location 'https://api.private-ai.com/community/v4/process/text' \
--header 'Content-Type: application/json' \
--header 'x-api-key: <YOUR KEY HERE>' \
--data '{
"text": [
"VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"
]
}'
import requests
r = requests.post(url="https://api.private-ai.com/community/v4/process/text",
headers={"x-api-key": "<YOUR API KEY>"},
json={"text": ["VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"]})
results = r.json()
print(results)
from privateai_client import PAIClient
from privateai_client import request_objects
client = PAIClient(url="https://api.private-ai.com/community/v4/", api_key="<YOUR API KEY>")
text_request = request_objects.process_text_obj(text=["VisitDates:", "15/2/2023", ", VisitDates:", "8/5/2023", ", PhysName:", "Dr Walter", ", PersonAttributes: Weight:", "78.5", ", PersonAttributes: Height:", "189", ", MedLoc:", "Mt Western Cancer Center"], link_batch=True)
response = client.process_text(text_request)
print(response.processed_text)
Yields the following:
['VisitDates:', '[DATE_1]', ', VisitDates:', '[DATE_2]', ', PhysName:', '[NAME_MEDICAL_PROFESSIONAL_1]', ', PersonAttributes: Weight:', '[PHYSICAL_ATTRIBUTE_1]', 'PersonAttributes: Height:', '[PHYSICAL_ATTRIBUTE_2]', ', MedLoc:', '[ORGANIZATION_MEDICAL_FACILITY_1]']