Integrating with Hugging Face: Privacy-Preserving Sentiment Analysis
Info: To run the example code in this guide, please sign up for your free test API key here.
This tutorial illustrates how to use Private AI to de-identify data before training transformer models with Hugging Face. This matters because transformer models are capable of capturing and learning from sensitive details present in their training data, which makes privacy considerations more crucial than ever.
A clear illustration of the privacy risks in machine learning is the AOL search data leak in 2006, where search queries of 650,000 AOL users were publicly released. Despite anonymizing the usernames, reporters from The New York Times were still able to identify an individual solely based on their search queries. This underscores the potential risks associated with training AI models on sensitive datasets, even when care is taken to anonymize the data.
To address this, we'll show how to fine-tune a sentiment analysis model while ensuring that personally identifiable information (PII) is removed from the training set. We'll be using the IMDB dataset to fine-tune a DistilBERT model, a smaller, faster variant of BERT that retains over 95% of BERT's performance. For the deidentification process, we'll use the Private AI Docker container and the Python Client to interface with it.
This tutorial builds on the foundations laid out in this Hugging Face sentiment analysis tutorial. For a detailed understanding of the sentiment analysis process and the Hugging Face tools used, we highly recommend reviewing that tutorial. Our focus here is on addressing privacy concerns and demonstrating the usage of Private AI's tooling in this context.
Let's dive in!
Setting Up the Environment
First, we need to ensure that our environment has the necessary dependencies. Each of these libraries plays a crucial role in our project:
- datasets: This is a library for easily loading and preprocessing datasets. We'll be using it to load and preprocess the IMDB dataset.
- transformers: This is the core library we'll be using for our model. It provides us with the pre-trained DistilBERT model and other utilities like tokenizers and trainers which are vital for fine-tuning our model.
- huggingface_hub: This library allows us to save, load, and interact with models stored on the Hugging Face model hub. We'll use it to store our fine-tuned model, making it easy to access and use for inference in the future.
- privateai_client: This is the Private AI Python Client, which we'll use to communicate with the Private AI Docker container that handles data deidentification.
- numpy: Will be used for computing performance metrics.
Let's go ahead and install these libraries. Use the following command in your Python environment:
pip install datasets transformers privateai_client numpy
Additionally, we need to install git-lfs, which is required for managing large files with Git. This will be useful for working with our model repository. In a notebook running on a Debian-based system (the leading ! runs the line as a shell command), you can install it using:
!apt-get install git-lfs
Last but not least, we need to pull and run the Private AI Docker container, which is responsible for data deidentification. Follow the instructions on Private AI's Quickstart Guide to set up the Docker container. Note that using the GPU version of the container with the proper hardware will significantly speed up the deidentification process.
With everything set up, we are now ready to start processing our data!
Loading the Dataset
To start, we load the IMDB dataset and create smaller subsets for faster training and testing. This can be accomplished easily with the datasets library:
from datasets import load_dataset
# Load dataset and create smaller subsets
imdb = load_dataset("imdb")
small_train_dataset = imdb["train"].shuffle(seed=42).select(list(range(3000)))
small_test_dataset = imdb["test"].shuffle(seed=42).select(list(range(300)))
Having these subsets ready, we can proceed with data preprocessing.
Initializing the Private AI client and the Tokenizer
The Private AI client is responsible for connecting with the Private AI Docker container, where the deidentification of the data occurs. The arguments passed to the PrivateAI constructor define where the container can be reached and could vary depending on your specific setup.
We will also use the DistilBERT tokenizer to convert our text data into a format that the model can understand.
from transformers import AutoTokenizer
from privateai_client import PrivateAI
# Change these values to match your local setup:
scheme = "http"
host = "localhost"
port = "8080"
# Initialize Private AI client
client = PrivateAI(scheme=scheme, host=host, port=port)
# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
Preprocessing and De-identifying Data
The following function is the crux of our preprocessing. It takes in text examples, deidentifies them using the Private AI client, and then tokenizes the deidentified text:
from privateai_client import request_objects
def preprocess_batch(examples):
    text_request = request_objects.process_text_obj(
        text=examples["text"],
        link_batch=False,
        processed_text=request_objects.processed_text_obj(pattern="[BEST_ENTITY_TYPE]")
    )
    response = client.process_text(text_request)
    return tokenizer(response.processed_text, truncation=True)
To deidentify each batch of examples, we first have to create a process_text_obj with the content of the examples and various settings for the deidentification process:
- link_batch: This argument decides whether the list of input texts is treated as one context by the deidentification. When set to False, it ensures the individual texts of the dataset share no context with each other. This is important in our case because each review in our dataset is independent of the others.
- processed_text: This argument allows us to customize how the deidentified text looks. In this case, we've set the pattern parameter to "[BEST_ENTITY_TYPE]". This means that any personally identifiable information (PII) detected will be replaced by a marker describing its type.
The created request object is sent to the Private AI Docker container for processing by client.process_text(...). For more details and options on these and other parameters, please refer to the Python Client documentation and the API documentation.
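To make the effect of the marker pattern more concrete, here is a toy stand-in that does not call Private AI at all. It uses two simple regular expressions to mimic the shape of the output, where each detected item is replaced by a bracketed marker naming its type. The function name toy_redact, the regular expressions, and the two labels are illustrative assumptions; the real container detects many more entity types and does not work this way internally.

```python
import re

# Toy stand-in for the deidentification step. This is NOT Private AI's
# detection logic; the two regexes and their labels are illustrative only.
TOY_PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def toy_redact(texts):
    """Replace each match with a marker naming its entity type."""
    redacted = []
    for text in texts:
        for label, pattern in TOY_PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        redacted.append(text)
    return redacted


batch = [
    "Great movie! Mail me at jane.doe@example.com for more reviews.",
    "Boring. Call 555-123-4567 if you want my ticket.",
]
for line in toy_redact(batch):
    print(line)
# Great movie! Mail me at [EMAIL_ADDRESS] for more reviews.
# Boring. Call [PHONE_NUMBER] if you want my ticket.
```

Note that the sentiment-bearing words ("Great", "Boring") are untouched, which is why this kind of replacement tends to leave the classification task intact.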
With the preprocessing function defined, we can apply it to both our training and testing datasets with the Dataset.map function:
# Map the preprocess function to the train and test datasets
tokenized_train = small_train_dataset.map(preprocess_batch, batched=True, batch_size=10)
tokenized_test = small_test_dataset.map(preprocess_batch, batched=True, batch_size=10)
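As a rough mental model of what batched mapping does, the sketch below slices a dict of columns into chunks of ten and hands each chunk to a function, which must return one output entry per input row. It is a simplified illustration of the calling convention, not the library's implementation, and the helper names iter_batches and fake_preprocess are made up for this example. With a batch size of 10, the 3000 training items lead to 300 calls of the preprocessing function, and therefore 300 requests to the container.

```python
def iter_batches(columns, batch_size):
    """Yield dict-of-lists slices, the shape a batched map function receives."""
    n_rows = len(next(iter(columns.values())))
    for start in range(0, n_rows, batch_size):
        yield {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }


def fake_preprocess(examples):
    # Stand-in for preprocess_batch: returns one output entry per input text.
    return {"n_chars": [len(text) for text in examples["text"]]}


columns = {
    "text": [f"review {i}" for i in range(25)],
    "label": [i % 2 for i in range(25)],
}

batch_sizes = []
for batch in iter_batches(columns, batch_size=10):
    out = fake_preprocess(batch)
    # The output must line up row by row with the input batch.
    assert len(out["n_chars"]) == len(batch["text"])
    batch_sizes.append(len(batch["text"]))

print(batch_sizes)  # [10, 10, 5]
```

A larger batch size means fewer round trips to the container but larger requests; the value of 10 used above is the tutorial's choice, not a requirement.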
Now, with our deidentified and tokenized data in hand, we can proceed to train the model.
Training and Evaluating the Model
Since the training and evaluation steps do not differ from the usual procedure with Hugging Face's transformers library, we won't dive into the specifics here. For detailed steps on model training and evaluation, please refer back to the original tutorial. For your convenience, all the necessary code is provided in the code block below, without further explanation.
from transformers import DataCollatorWithPadding, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import numpy as np
from datasets import load_metric
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
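def compute_metrics(eval_pred):
    # load_metric is deprecated in recent datasets releases; the separate
    # evaluate library (evaluate.load) is the suggested replacement.
    load_accuracy = load_metric("accuracy")
    load_f1 = load_metric("f1")
    logits, labels = eval_pred
    # Pick the class with the highest logit for each example
    predictions = np.argmax(logits, axis=-1)
    accuracy = load_accuracy.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = load_f1.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}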
repo_name = "finetuning-sentiment-model-3000-samples-deidentified"
training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=True,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()
trainer.push_to_hub()
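To get a feel for the size of this training run, the following back-of-the-envelope calculation derives the number of optimizer steps from the settings above. It is only a sketch and assumes a single device and no gradient accumulation; with several GPUs the per-device batch size is multiplied by the device count and the step numbers shrink accordingly.

```python
import math

train_items = 3000
eval_items = 300
train_batch_size = 16  # per_device_train_batch_size
eval_batch_size = 16   # per_device_eval_batch_size
epochs = 2             # num_train_epochs

# Assumes one device and no gradient accumulation; the last, smaller
# batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_items / train_batch_size)
total_steps = steps_per_epoch * epochs
eval_batches = math.ceil(eval_items / eval_batch_size)

print(steps_per_epoch, total_steps, eval_batches)  # 188 376 19
```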
Experimental Summary
To assess the impact of deidentification on model performance, we ran a series of experiments across two different models and two different datasets. This allows us to generalize beyond a single model-dataset combination. All results were produced by the script presented above with minor modifications (namely replacing the dataset, model, and tokenizer names, and the number of output labels of the model).
The models used were "distilbert-base-uncased" (as shown above) and "roberta-base". These represent a compact yet powerful transformer model and a larger, more complex one, allowing us to see whether the impact of deidentification differs across complexity scales.
For the datasets, we chose the "imdb" dataset, which we had been working with earlier, and the "sentiment" configuration of the "tweet_eval" dataset. This gives us a view of how deidentification impacts performance on different kinds of text data: long-form movie reviews and short-form tweets.
Here's a table summarizing the results from the experiments:
(The F1 scores for the tweet dataset were calculated using the "macro" averaging method.)
Model | IMDB, no deidentification | IMDB, with deidentification | tweet_eval (sentiment), no deidentification | tweet_eval (sentiment), with deidentification
---|---|---|---|---
DistilBERT | Accuracy: 0.86, F1-score: 0.8599 | Accuracy: 0.8633, F1-score: 0.8629 | Accuracy: 0.66, F1-score: 0.6604 | Accuracy: 0.6533, F1-score: 0.6531
RoBERTa | Accuracy: 0.91, F1-score: 0.9132 | Accuracy: 0.91, F1-score: 0.9103 | Accuracy: 0.6833, F1-score: 0.6858 | Accuracy: 0.67, F1-score: 0.678
Across the board, the performance metrics show that deidentification has a minimal impact on model performance: no metric changes by more than about 0.013, which for accuracy corresponds to roughly four of the 300 evaluation examples. This is expected, as the specific content of the PII fields typically does not affect the sentiment of the text. Hence, the sentiment can be inferred equally well even when this information is replaced by markers.
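For readers unfamiliar with the macro averaging used for the tweet results, here is a minimal pure-Python sketch. The tweet sentiment task has three classes, so a single binary F1 score is not defined; macro averaging computes one F1 score per class and takes their unweighted mean. The function names and the tiny example labels are made up for illustration. With the f1 metric used in the training script, the equivalent is likely to pass average="macro" to its compute call, though the exact modification used for the experiments is not shown in this tutorial.

```python
def f1_for_class(predictions, references, label):
    """F1 score for one class, treating it as the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == label and r == label for p, r in pairs)
    fp = sum(p == label and r != label for p, r in pairs)
    fn = sum(p != label and r == label for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(predictions, references):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(references) | set(predictions))
    scores = [f1_for_class(predictions, references, label) for label in labels]
    return sum(scores) / len(scores)


# Made-up labels for a three-class task (0, 1, 2).
references = [0, 0, 1, 1, 2, 2]
predictions = [0, 1, 1, 1, 2, 0]

# Per-class F1 scores are 0.5, 0.8 and 2/3, so the macro average is about 0.6556.
print(round(macro_f1(predictions, references), 4))  # 0.6556
```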
Below is a summary of the number and types of redacted entities in the training and evaluation splits of both datasets. Check the full list of supported entity types for a detailed description.
It's worth noting that the data also contain more sensitive PII, though it appears less frequently. For example, over 20 birth dates, primarily of Twitter users, were successfully redacted by the Private AI Docker container.
Statistic | IMDB, training split (3000 items) | IMDB, evaluation split (300 items) | tweet_eval (sentiment), training split (3000 items) | tweet_eval (sentiment), evaluation split (300 items)
---|---|---|---|---
Total number of redacted entities | 35534 | 3039 | 7966 | 764
Top 5 most frequent entity types | NAME: 8294, OCCUPATION: 5862, NAME_GIVEN: 5079, NAME_FAMILY: 4176, DURATION: 1611 | NAME: 659, OCCUPATION: 537, NAME_GIVEN: 388, NAME_FAMILY: 328, ORIGIN: 171 | ORGANIZATION: 1556, NAME: 1382, USERNAME: 838, NAME_GIVEN: 723, EVENT: 617 | USERNAME: 151, ORGANIZATION: 98, NAME: 90, NAME_FAMILY: 60, POLITICAL_AFFILIATION: 58
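The tutorial does not state how these counts were collected. One possible approach, sketched here under that caveat, is to tally the bracketed markers left in the deidentified text when the "[BEST_ENTITY_TYPE]" pattern is used. The helper name count_entity_markers and the sample sentences are made up for illustration, and the API response may offer a more direct way to obtain per-entity information.

```python
import re
from collections import Counter

# Matches markers such as [NAME] or [NAME_GIVEN]. Any bracketed upper-case
# word already present in the original text would be counted as well.
MARKER = re.compile(r"\[([A-Z][A-Z_]*)\]")


def count_entity_markers(texts):
    """Tally marker types across a list of deidentified texts."""
    counts = Counter()
    for text in texts:
        counts.update(MARKER.findall(text))
    return counts


# Made-up examples of deidentified reviews.
deidentified = [
    "[NAME] gives a great performance as a [OCCUPATION].",
    "I watched it with [NAME_GIVEN] and loved [NAME]'s direction.",
]

counts = count_entity_markers(deidentified)
print(sum(counts.values()))   # 4
print(counts.most_common(1))  # [('NAME', 2)]
```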
Our experiments demonstrate the effectiveness of the deidentification process in preserving the utility of the data for our machine learning task, while ensuring privacy.
Conclusion
Congratulations on completing this privacy-focused tutorial! By adding a deidentification step to our preprocessing function, we keep the PII detected by Private AI out of the training data, so the model never sees it during training. This is a vital consideration in any data science project, especially when working with potentially sensitive datasets. Always remember: with great data comes great responsibility!