Introduction

This is the end-user documentation for Private AI. The documentation is organised as follows:

Getting Started illustrates how to get started.
Fundamentals contains detailed documentation on each feature, such as filters.
Guides & Integrations contains a number of guides on how to use Private AI with LLMs and integrate with various services like dashboards.
API Reference contains full details for the Private AI REST API, including code samples and an interactive demo.
Using the Container describes how to install and run the container locally and in production.
Web Demos contains interactive demos that you can try on your own without needing your own API key or container. See how we do!
Frequently Asked Questions contains answers to frequently asked questions.
Supported Entity Types lists the entity types that Private AI currently supports.
Translated Labels lists the labels for entity types that Private AI currently supports.
Supported Languages lists the languages that Private AI currently supports.
Supported File Types lists the file types that Private AI currently supports.
Release Notes contains release notes for the current and past Private AI container versions.
Acknowledgements contains acknowledgements to some of the great datasets and libraries our software is built upon.

Background

This section covers some basic concepts like the definition of PII and provides some background detail on Private AI. In addition to the information below, please see our blog page and research publications.

What is PII?

PII stands for "Personally Identifiable Information" and encompasses any form of information that could be used to identify someone. Common examples of PII include names, phone numbers and credit card numbers. These directly identify someone and are hence called 'direct identifiers'.

In addition to direct identifiers, PII also includes 'quasi-identifiers', which on their own cannot uniquely identify a person, but can greatly increase the likelihood of re-identifying an individual when grouped together. Examples of quasi-identifiers include nationality, religion and prescribed medications. For example, consider a company with 10,000 customers. Knowing that a particular customer lives in Delaware isn't likely to allow for re-identification, but knowing that they live in Delaware, are Buddhist, are male, hold Dutch nationality and take heart medication probably is!
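The narrowing effect of combined quasi-identifiers can be sketched in a few lines of Python. This is a minimal illustration only: the eight records, the attribute names and the `anonymity_set_sizes` helper are all invented for this example and are not part of Private AI. It counts how many records still match as each known attribute is added.

```python
# Hypothetical toy records; names and values are invented for illustration.
CUSTOMERS = [
    {"state": "Delaware", "religion": "Buddhism", "sex": "male", "nationality": "Dutch"},
    {"state": "Delaware", "religion": "Buddhism", "sex": "male", "nationality": "American"},
    {"state": "Delaware", "religion": "Buddhism", "sex": "female", "nationality": "Dutch"},
    {"state": "Delaware", "religion": "Christianity", "sex": "male", "nationality": "Dutch"},
    {"state": "Delaware", "religion": "None", "sex": "female", "nationality": "American"},
    {"state": "Ohio", "religion": "Buddhism", "sex": "male", "nationality": "Dutch"},
    {"state": "Ohio", "religion": "Christianity", "sex": "female", "nationality": "American"},
    {"state": "Texas", "religion": "Buddhism", "sex": "male", "nationality": "Dutch"},
]


def anonymity_set_sizes(records, known):
    """Return (attribute, remaining_matches) after each known attribute is applied in turn."""
    sizes = []
    candidates = list(records)
    for attribute, value in known.items():
        # Keep only the records consistent with everything known so far.
        candidates = [r for r in candidates if r[attribute] == value]
        sizes.append((attribute, len(candidates)))
    return sizes


# What an attacker is assumed to know about the target.
KNOWN = {"state": "Delaware", "religion": "Buddhism", "sex": "male", "nationality": "Dutch"}
SIZES = anonymity_set_sizes(CUSTOMERS, KNOWN)

for attribute, count in SIZES:
    print(f"after adding {attribute}: {count} matching record(s)")
```

In this toy data, no single attribute isolates the target, but the combination of all four leaves exactly one matching record.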

What is considered PII also depends on the relevant legislation, such as the General Data Protection Regulation (GDPR) or California Consumer Privacy Act (CCPA). The GDPR, for instance, provides the following definition of personal data: "'Personal data' means any information relating to an identified or identifiable natural person ('data subject'); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person." (source: GDPR website)

The CCPA defines 'personal information' as "information that identifies, relates to, or could reasonably be linked with you or your household. For example, it could include your name, social security number, email address, records of products purchased, internet browsing history, geolocation data, fingerprints, and inferences from other personal information that could create a profile about your preferences and characteristics." (source: CCPA website)

Even the party that the information identifies, relates to or could be linked with differs between the two laws ('data subject' in the GDPR vs. 'you or your household' in the CCPA).

What is De-identification?

De-identification is the process of obscuring information that might reveal someone's identity. De-identification plays a key role in data minimization, which means collecting only absolutely necessary personal data. Not only does that protect individuals' privacy from the data collector (e.g., corporation, government), but it also prevents significant harm to individuals and data collectors in the event of a data breach.

It is often claimed that redaction, anonymization and de-identification don't work. This is largely due to a number of high-profile datasets that were improperly de-identified by companies claiming they were anonymized. We wrote about this in our article Data Anonymization: Perspectives from a Former Skeptic. Another key reason is that legacy de-identification systems rely on rule-based PII detectors, which are usually made up of regular expressions (regexes).

warning

The terms redaction, anonymization, and de-identification are frequently used interchangeably. However, this practice can be inaccurate and potentially risky. To understand the distinctions between these terms, please read our article Demystifying De-identification.

Why is PII Identification and Removal Hard?

Identifying and removing PII requires going beyond removing direct identifiers like names, phone numbers, credit card numbers and social security numbers. For example, quasi-identifiers such as illnesses, sexual orientation, religious beliefs and prescribed medications can all be considered PII.

In addition to the breadth of what is considered PII, real-world data contains many edge cases that need to be considered. For example, what about a person who is named Paris, or June? What about an internal office extension of x324? Even clearly defined PII types can take on many different forms. The United States, for example, has a different driver's license format in each state, on top of the different formats each country uses. Credit card numbers can also be split up in automatic speech recognition (ASR) transcripts: "Could I have the first four digits of the card please? Four five six seven. Thanks, the next four please? One three two five."

For these reasons, it is tough to develop rule-based or regex-based systems that perform well on real-world data. To this end, Private AI relies on the latest advancements in Machine Learning (ML) to identify PII based on context. The Private AI team includes linguists and privacy experts who make informed decisions on what is and is not considered PII, in line with current privacy legislation.
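The weakness of rule-based detection on the ASR example above can be seen in a small Python sketch. The regular expression and the `find_card_numbers` helper below are assumptions made for this illustration (a deliberately simple rule for 16 digits in groups of four), not a description of any particular legacy system or of Private AI. The written card number uses made-up digits.

```python
import re

# Illustrative rule: four groups of four digits, optionally separated by a space or hyphen.
CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")


def find_card_numbers(text):
    """Return every substring that the simple card-number rule matches."""
    return CARD_RE.findall(text)


# A card number written as digits (made-up digits for illustration).
written = "My card number is 4567 1325 9876 5432."

# The ASR-style transcript from the text: the digits are spoken and split across turns.
spoken = (
    "Could I have the first four digits of the card please? "
    "Four five six seven. Thanks, the next four please? One three two five"
)

print(find_card_numbers(written))  # the rule finds the written number
print(find_card_numbers(spoken))   # the rule finds nothing in the transcript
```

The rule catches the written form but returns no matches for the transcript, because the card digits appear as words and are interleaved with other speech. Extending the rule to cover spelled-out digits would in turn start matching ordinary sentences that merely contain number words, which is why context matters.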

Why is Privacy Important in Machine Learning?

Modern neural network models such as transformers excel at memorizing training data and can leak sensitive information at inference time. A good example of things going wrong is the ScatterLab Lee-Luda chatbot scandal, where a chatbot trained on intimate conversations started using memorized PII (such as home addresses) in conversations with other people. Even classification models such as those trained for sentiment analysis have been shown to retain sensitive data in input embeddings, allowing for PII to be extracted. For these reasons, neural network models such as transformers should never be trained on personal data without some privacy mitigation steps being taken.

Removing all identifying information also helps improve fairness. A model can't discriminate on the basis of age and gender if they have been removed from the input data!

Why Synthetic PII?

Generating synthetic PII has two key advantages. Firstly, any PII identification errors become much harder to find, because missed PII blends in with the synthetic values around it. An attacker must first identify which PII is real, and then use this PII to re-identify target subjects. Secondly, synthetic PII eliminates data shift between training and inference. Transformer models in particular are typically pre-trained on large corpora of natural text, so replacing PII with synthetic PII rather than redaction markers keeps fine-tuning data close to the pre-training data, reducing any accuracy loss that might be induced by redaction.

Private AI's synthetic PII generation system relies on ML to generate PII that is more realistic and better fits the surrounding context than legacy systems relying on lookup tables. While the synthetic PII generation system is still in beta, it was successfully used to eliminate the accuracy loss caused by redaction in the CoLA (Corpus of Linguistic Acceptability) subtask of the GLUE benchmark. A copy of the results is available upon request.
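To make the difference between redaction and synthetic replacement concrete, here is a minimal Python sketch. It is an illustration under stated assumptions only: the entity spans are hand-labelled rather than detected, the `replace_spans` helper and label names are hypothetical, and the synthetic values come from a fixed lookup table (the legacy approach mentioned above), not from Private AI's ML-based generator.

```python
def replace_spans(text, spans, make_replacement):
    """Replace each (start, end, label) span with make_replacement(label, original_text)."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # text between entities is kept unchanged
        pieces.append(make_replacement(label, text[start:end]))
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last entity
    return "".join(pieces)


text = "Anna lives in Delaware."
# Hand-labelled spans (start, end, label); a real system would detect these.
spans = [(0, 4, "NAME"), (14, 22, "LOCATION")]

# Redaction: each entity becomes a marker, which no longer reads like natural text.
redacted = replace_spans(text, spans, lambda label, original: f"[{label}]")

# Synthetic replacement from a hypothetical lookup table: the sentence stays natural.
SYNTHETIC_VALUES = {"NAME": "Maria", "LOCATION": "Vermont"}
synthetic = replace_spans(text, spans, lambda label, original: SYNTHETIC_VALUES[label])

print(redacted)
print(synthetic)
```

The redacted sentence contains bracketed markers that a model pre-trained on natural text has rarely seen, while the synthetic sentence keeps the same shape as the original.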