Integrating with Hugging Face: Privacy-Preserving Sentiment Analysis

Info: To run the example code in this guide, please sign up for a free test API key.

This tutorial illustrates how to use Private AI to de-identify data before training transformer models with Hugging Face. This matters because transformer models in particular are capable of capturing and learning from sensitive details present in their training data, which makes privacy considerations more crucial than ever.

A clear illustration of the privacy risks in machine learning is the AOL search data leak in 2006, where search queries of 650,000 AOL users were publicly released. Although the usernames had been anonymized, reporters from The New York Times were still able to identify an individual solely from their search queries. This underscores the potential risks associated with training AI models on sensitive datasets, even when care is taken to anonymize the data.

To address this, we'll show how to fine-tune a sentiment analysis model while ensuring that personally identifiable information (PII) is removed from the training set. We'll be using the IMDB dataset to fine-tune a DistilBERT model, a smaller, faster variant of BERT that retains over 95% of BERT's performance. For the deidentification process, we'll use the Private AI Docker container and the Python Client to interface with it.

This tutorial builds on the foundations laid out in this Hugging Face sentiment analysis tutorial. For a detailed understanding of the sentiment analysis process and the Hugging Face tools used, we highly recommend reviewing that tutorial. Our focus here is on addressing privacy concerns and demonstrating the usage of Private AI's tooling in this context.

Let's dive in!

Setting Up the Environment

First, we need to ensure that our environment has the necessary dependencies. Each of these libraries plays a crucial role in our project:

datasets : This is a library for easily loading and preprocessing datasets. We'll be using it to load and preprocess the IMDB dataset.
transformers : This is the core library we'll be using for our model. It provides us with the pre-trained DistilBERT model and other utilities like tokenizers and trainers which are vital for fine-tuning our model.
huggingface_hub : This library allows us to save, load, and interact with models stored on the Hugging Face model hub. We'll use it to store our fine-tuned model, making it easy to access and use for inference in the future.
privateai_client : This is the Private AI Python Client, which we'll use to communicate with the Private AI Docker container that handles data deidentification.
numpy : Will be used for computing performance metrics.

Let's go ahead and install these libraries. Use the following command in your Python environment:

pip install datasets transformers huggingface_hub privateai_client numpy

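Additionally, we need to install git-lfs, which is required for managing large files with Git and is useful when working with our model repository. In a notebook running on a Debian-based system, you can install it with the command below (the leading ! runs a shell command from a notebook cell):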

!apt-get install git-lfs

Last but not least, we need to pull and run the Private AI Docker container, which is responsible for data deidentification. Follow the instructions on Private AI's Quickstart Guide to set up the Docker container. Note that using the GPU version of the container with the proper hardware will significantly speed up the deidentification process.

With everything set up, we are now ready to start processing our data!

Loading the Dataset

To start, we load the IMDB dataset and create smaller subsets for faster training and testing. This can be accomplished easily with the datasets library:

from datasets import load_dataset

# Load dataset and create smaller subsets
imdb = load_dataset("imdb")
small_train_dataset = imdb["train"].shuffle(seed=42).select(range(3000))
small_test_dataset = imdb["test"].shuffle(seed=42).select(range(300))

Having these subsets ready, we can proceed with data preprocessing.

Initializing the Private AI client and the Tokenizer

The Private AI client is responsible for connecting with the Private AI Docker container, where the deidentification of the data occurs. The arguments passed to the PrivateAI constructor define where the container can be reached and could vary depending on your specific setup.

We will also use the DistilBERT tokenizer to convert our text data into a format that the model can understand.

from transformers import AutoTokenizer
from privateai_client import PrivateAI

# Change these values to match your local setup:
scheme = "http"
host = "localhost"
port = "8080"

# Initialize Private AI client
client = PrivateAI(scheme=scheme, host=host, port=port)

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
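
Because the connection values vary between setups, one option is to read them from environment variables rather than editing the script. The following is a minimal sketch of that idea, not part of the Private AI client: the function name container_settings and the variable names PAI_SCHEME, PAI_HOST and PAI_PORT are hypothetical and chosen only for illustration. The defaults mirror the values used above.

```python
import os
from typing import Mapping, Optional


def container_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return scheme/host/port for the container, falling back to the tutorial defaults.

    The PAI_* variable names are hypothetical; pick whatever suits your deployment.
    """
    env = os.environ if env is None else env
    settings = {
        "scheme": env.get("PAI_SCHEME", "http"),
        "host": env.get("PAI_HOST", "localhost"),
        "port": env.get("PAI_PORT", "8080"),
    }
    # Fail early on obviously wrong values instead of at the first request.
    if settings["scheme"] not in ("http", "https"):
        raise ValueError(f"unsupported scheme: {settings['scheme']!r}")
    if not settings["port"].isdigit():
        raise ValueError(f"port must be numeric: {settings['port']!r}")
    return settings


print(container_settings({}))
```

With such a helper, the client could be created as PrivateAI(**container_settings()), since the dictionary keys match the scheme, host and port arguments shown above.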

Preprocessing and De-identifying Data

The following function is the crux of our preprocessing. It takes in text examples, deidentifies them using the Private AI client, and then tokenizes the deidentified text:

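from privateai_client import request_objects

def preprocess_batch(examples):
    """Deidentify a batch of reviews, then tokenize the deidentified text."""
    text_request = request_objects.process_text_obj(
        text=examples["text"],  # list of raw review texts in this batch
        link_batch=False,  # treat each text as its own context
        # Replace each detected entity with a marker naming its type
        processed_text=request_objects.processed_text_obj(pattern="[BEST_ENTITY_TYPE]")
    )
    # Send the request to the Private AI container
    response = client.process_text(text_request)
    # Tokenize the deidentified texts returned for the batch
    return tokenizer(response.processed_text, truncation=True)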

To deidentify each batch of examples, we first create a process_text_obj containing the example texts along with settings that control the deidentification process:

link_batch : This argument decides whether the list of input texts is treated as one context during deidentification. When set to False , the individual texts of the dataset share no context with each other. This is important in our case because each review in our dataset is independent of the others.
processed_text : This argument allows us to customize how the deidentified text looks. In this case, we've set the pattern parameter to "[BEST_ENTITY_TYPE]". This means that any personally identifiable information (PII) detected will be replaced by a marker describing its type.

The created request object is sent to the Private AI Docker Container for processing by client.process_text(...). For more details and options on these and other parameters, please refer to the Python Client documentation and the API documentation.
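
To make the effect of marker replacement concrete, here is a toy sketch. It does not call Private AI and performs no detection of its own: the entity spans are supplied by hand, and the function apply_markers is a hypothetical helper written for this illustration. It also assumes that a marker is simply the entity type in square brackets.

```python
def apply_markers(text, entities):
    """Replace each (start, end, label) span in text with a [LABEL] marker.

    Toy stand-in for illustration only: a real deidentification service
    finds the spans itself; here they are passed in by the caller.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        if start < cursor:
            raise ValueError("overlapping entity spans")
        pieces.append(text[cursor:start])  # untouched text before the entity
        pieces.append(f"[{label}]")        # marker describing the entity type
        cursor = end
    pieces.append(text[cursor:])           # untouched text after the last entity
    return "".join(pieces)


review = "Tom Hanks is great in this film."
print(apply_markers(review, [(0, 9, "NAME")]))  # prints "[NAME] is great in this film."
```

The sentiment-bearing words ("great") are left intact, while the identifying span is reduced to its type.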

With the preprocessing function defined, we can apply it to both our training and testing datasets with the Dataset.map function:

# Map the preprocess function to the train and test datasets
tokenized_train = small_train_dataset.map(preprocess_batch, batched=True, batch_size=10)
tokenized_test = small_test_dataset.map(preprocess_batch, batched=True, batch_size=10)
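
With batched=True, the mapped function receives a dictionary of column lists (here at most 10 texts at a time) and returns a dictionary of new column lists. The sketch below emulates that calling convention in plain Python. It is a simplified illustration under our own assumptions, not the datasets library's implementation; map_batched and fake_preprocess are hypothetical names.

```python
def map_batched(rows, fn, batch_size):
    """Apply fn to column-wise batches of rows and merge the returned columns."""
    results = []
    for i in range(0, len(rows), batch_size):
        chunk = rows[i:i + batch_size]
        # Convert a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(batch)
        for name, values in new_columns.items():
            # In this simplified version, each input row needs exactly one output value.
            if len(values) != len(chunk):
                raise ValueError(f"column {name!r} has the wrong length")
        for j, row in enumerate(chunk):
            merged = dict(row)
            merged.update({name: values[j] for name, values in new_columns.items()})
            results.append(merged)
    return results


def fake_preprocess(batch):
    # Stand-in for preprocess_batch: returns one value per input text.
    return {"n_chars": [len(text) for text in batch["text"]]}


rows = [{"text": "good"}, {"text": "bad"}, {"text": "fine"}]
print(map_batched(rows, fake_preprocess, batch_size=2))
```

This is why preprocess_batch can pass examples["text"] directly as a list: each call handles a whole batch, so a batch of 10 means one request to the container per 10 reviews.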

Now, with our deidentified and tokenized data in hand, we can proceed to train the model.

Training and Evaluating the Model

Since the training and evaluation steps do not differ from the usual procedure with Hugging Face's transformers library, we won't dive into the specifics here. For detailed steps on model training and evaluation, please refer back to the original tutorial. For your convenience, all the necessary code is provided in the code block below, without further explanation.

from transformers import DataCollatorWithPadding, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
import numpy as np
# load_metric has been removed from recent versions of datasets; the evaluate
# library replaces it (requires: pip install evaluate scikit-learn).
import evaluate

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Load the metrics once instead of on every evaluation call
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}

repo_name = "finetuning-sentiment-model-3000-samples-deidentified"

training_args = TrainingArguments(
    output_dir=repo_name,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    save_strategy="epoch",
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_test,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()

trainer.push_to_hub()

Experimental Summary

To assess the impact of deidentification on model performance, we ran a series of experiments across two different models and two different datasets. This allows us to generalize beyond a single model-dataset combination. All results were produced by the script presented above with minor modifications (namely replacing the dataset, model and tokenizer names, and the number of output labels of the model).

The models used were "distilbert-base-uncased" (as shown above) and "roberta-base". These represent a smaller, distilled transformer model and a larger one, allowing us to see whether the impact of deidentification differs with model size.

For the datasets, we chose the "imdb" dataset used earlier in this tutorial and the "sentiment" configuration of the "tweet_eval" dataset. This shows how deidentification impacts performance on different kinds of text data: long-form movie reviews and short-form tweets.
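
The combinations described above can be written down as a small grid. This is a sketch of how such runs might be enumerated, not the script we used: the dictionary layout and the name experiment_grid are hypothetical. The label counts assume two classes for IMDB (as in the code above) and three classes (negative, neutral, positive) for the tweet_eval sentiment configuration.

```python
from itertools import product

MODELS = ["distilbert-base-uncased", "roberta-base"]

# Dataset name -> settings that change between runs (layout is illustrative).
DATASETS = {
    "imdb": {"config": None, "num_labels": 2},
    "tweet_eval": {"config": "sentiment", "num_labels": 3},
}


def experiment_grid():
    """Enumerate every model / dataset / deidentification combination."""
    runs = []
    for model_name, dataset_name, deidentify in product(MODELS, DATASETS, [False, True]):
        spec = DATASETS[dataset_name]
        runs.append({
            "model": model_name,          # also used as the tokenizer name
            "dataset": dataset_name,
            "config": spec["config"],
            "num_labels": spec["num_labels"],
            "deidentify": deidentify,     # whether the Private AI step is applied
        })
    return runs


for run in experiment_grid():
    print(run)
```

Each entry carries the values that differ between runs: the model and tokenizer name, the dataset and its configuration, and the num_labels argument passed to the classification model.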

Here's a table summarizing the results from the experiments: (The F1 scores for the tweet dataset were calculated using the "macro" averaging method)

Model	Dataset	Deidentification	Accuracy	F1-score
DistilBERT	IMDB	No	0.86	0.8599
DistilBERT	IMDB	Yes	0.8633	0.8629
DistilBERT	Twitter	No	0.66	0.6604
DistilBERT	Twitter	Yes	0.6533	0.6531
RoBERTa	IMDB	No	0.91	0.9132
RoBERTa	IMDB	Yes	0.91	0.9103
RoBERTa	Twitter	No	0.6833	0.6858
RoBERTa	Twitter	Yes	0.67	0.678

Across the board, the performance metrics show that deidentification has a minimal impact on model performance: accuracy changes by at most about 1.3 percentage points, which corresponds to four of the 300 evaluation examples. This is expected, as the specific content of the PII fields typically does not affect the sentiment of the text. Hence, the sentiment can be inferred equally well even when this information is replaced by markers.
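
For reference, the "macro" averaging used for the tweet dataset's F1 scores computes an F1 score per class and then takes their unweighted mean, so each class counts equally regardless of how many examples it has. The following plain-Python sketch shows the calculation on a tiny made-up example; it is an illustration of the definition, not the code that produced the numbers above.

```python
def macro_f1(predictions, references):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(references) | set(predictions))
    scores = []
    for label in labels:
        pairs = list(zip(predictions, references))
        tp = sum(1 for p, r in pairs if p == label and r == label)
        fp = sum(1 for p, r in pairs if p == label and r != label)
        fn = sum(1 for p, r in pairs if p != label and r == label)
        denominator = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: 0 = negative, 1 = neutral, 2 = positive
references = [0, 1, 2, 2]
predictions = [0, 2, 2, 2]
# Per-class F1: class 0 -> 1.0, class 1 -> 0.0, class 2 -> 0.8; mean = 0.6
print(round(macro_f1(predictions, references), 4))
```

Note how the single missed neutral example pulls the macro score down sharply, even though three of four predictions are correct.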

Below is a summary of the number and types of redacted entities in the training and evaluation splits of both datasets. Check the full list of supported entity types for a detailed description.

It's worth noting that the data also contain more sensitive PII, although it appears less frequently. For example, more than 20 birth dates, primarily of Twitter users, were redacted by the Private AI Docker container.

Dataset	Split	Total number of redacted entities	Top 5 most frequent entity types
IMDB	Training (3000 items)	35534	NAME: 8294, OCCUPATION: 5862, NAME_GIVEN: 5079, NAME_FAMILY: 4176, DURATION: 1611
IMDB	Evaluation (300 items)	3039	NAME: 659, OCCUPATION: 537, NAME_GIVEN: 388, NAME_FAMILY: 328, ORIGIN: 171
Twitter	Training (3000 items)	7966	ORGANIZATION: 1556, NAME: 1382, USERNAME: 838, NAME_GIVEN: 723, EVENT: 617
Twitter	Evaluation (300 items)	764	USERNAME: 151, ORGANIZATION: 98, NAME: 90, NAME_FAMILY: 60, POLITICAL_AFFILIATION: 58
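
One way to approximate a tally like this for your own data is to count the markers in the deidentified text. The sketch below is only an illustration and not necessarily how the figures above were obtained. It assumes that markers are uppercase entity types in square brackets, as produced by the pattern used in this tutorial; count_markers is a hypothetical helper. Text that already contained bracketed uppercase words before deidentification would be counted as well.

```python
import re
from collections import Counter

# Assumed marker shape: an uppercase entity type in square brackets, e.g. [NAME_GIVEN]
MARKER = re.compile(r"\[([A-Z][A-Z0-9_]*)\]")


def count_markers(texts):
    """Count entity-type markers across a list of deidentified texts."""
    counts = Counter()
    for text in texts:
        counts.update(MARKER.findall(text))
    return counts


sample = [
    "[NAME] plays a [OCCUPATION] in this film.",
    "[NAME_GIVEN] and [NAME] were great.",
]
counts = count_markers(sample)
print(sum(counts.values()), counts.most_common())
```

Applied to the processed_text values returned by the container, this would yield a total and a per-type breakdown similar in form to the summary above.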

Our experiments demonstrate the effectiveness of the deidentification process in preserving the utility of the data for our machine learning task, while ensuring privacy.

Conclusion

Congratulations on completing this privacy-focused tutorial! By adding a deidentification step to our preprocessing function, we remove detected PII from the training data before the model sees it, which reduces the risk of the model learning sensitive details. This is a vital consideration in any data science project, especially when working with potentially sensitive datasets. Always remember: with great data comes great responsibility!