The following section provides some performance figures for Private AI's CPU and GPU containers on various AWS instance types, including the hardware in the system requirements.


  • Throughput is given in words, where a word denotes a whitespace-separated piece of text.
  • Words are roughly equivalent to tokens, such as those used by OpenAI to measure volume.
  • Throughput and latency tests are performed with a 100-word (500-character) test input.
  • Unless otherwise stated, tests are run using the default "high" accuracy_mode.
  • All benchmarks used concurrency settings optimized for throughput.

Please contact us if you require any further information.

Key Takeaways

  • Private AI utilises the latest in transformer technology to deliver the highest possible PII detection performance. At the same time, Private AI runs tens of times faster than BERT-style models and hundreds of times faster than LLMs without compromising accuracy.
  • CPU instances are fine for most use cases, with even a single CPU core able to process 500 words/s.
  • For very large deployments, GPU instances are recommended. A single low-cost inference instance such as the g4dn.2xlarge ($0.752 per hour) can process 1 GB of Unicode text in under an hour.
  • Hardware type matters. m5zn instances powered by recent Intel Xeon CPUs with AVX512 VNNI support perform over 3X faster than generic instances like c5. For this reason, it is recommended to use the hardware specified in the system requirements.
  • It is best to avoid AWS Fargate, which is typically provisioned with older CPUs like those found in c5 instances.
  • Latency scales approximately linearly with request length. Throughput is unaffected.
  • Batching does not improve throughput significantly. Instead, it is recommended to use a large number of concurrent requests.
  • GPU instances are recommended when processing files.
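The recommendation to use many concurrent requests rather than batching can be sketched as follows. This is a minimal illustration, not a Private AI client: `process_concurrently`, `send_request`, and the `max_workers` value are hypothetical names and settings, and `send_request` stands in for whatever function actually calls the container.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def process_concurrently(
    texts: Iterable[str],
    send_request: Callable[[str], str],
    max_workers: int = 32,
) -> List[str]:
    """Send one request per text, keeping up to max_workers requests in flight.

    send_request is a placeholder supplied by the caller; it is an assumption
    of this sketch and not part of any Private AI library.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order, so results line up with texts.
        return list(pool.map(send_request, texts))


if __name__ == "__main__":
    # Stand-in for a real HTTP call, used only to show the call pattern.
    def fake_send(text: str) -> str:
        return text.upper()

    print(process_concurrently(["first text", "second text"], fake_send, max_workers=2))
```

The right value for `max_workers` depends on the instance type and container configuration, so it is best found by measurement.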


The table below shows the performance of the CPU container on various AWS instance types:

Instance Type | Throughput (words/sec) | Latency for 100 word request (ms)
c5.large | 161 | 620
c5a.large | 123 | 816
m5.large | 143 | 698
m5n.large | 369 | 271
m5zn.large | 503 | 199

For best throughput, it is recommended to use single logical core workers. The table below shows the scaling efficiency when running the container on multiple CPU cores:

Instance Type | Logical CPU Cores | Throughput (words/sec) | Latency for 100 word request (ms) | Scaling Efficiency (%)
m5zn.large | 1 | 503 | 198.82 | 100
m5zn.3xlarge | 6 | 1878 | 53.2 | 62
m5zn.6xlarge | 12 | 2457 | 40.64 | 41
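The scaling efficiency figures appear consistent with dividing the measured throughput by the throughput that perfect linear scaling from the single-core result would give. The sketch below assumes that definition; the function name is illustrative only.

```python
def scaling_efficiency(throughput: float, cores: int, single_core_throughput: float) -> float:
    """Return measured throughput as a percentage of ideal linear scaling."""
    ideal = cores * single_core_throughput
    return 100.0 * throughput / ideal


if __name__ == "__main__":
    single_core = 503  # m5zn.large, words/sec
    for name, cores, throughput in [
        ("m5zn.large", 1, 503),
        ("m5zn.3xlarge", 6, 1878),
        ("m5zn.6xlarge", 12, 2457),
    ]:
        pct = scaling_efficiency(throughput, cores, single_core)
        print(f"{name}: {pct:.0f}%")
```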

The default accuracy_mode value, high, offers the best PII detection performance. It can be changed to trade PII detection performance for speed:

Accuracy Mode | Throughput (words/sec) | Latency for 100 word request (ms)
standard | 2583 | 38.67
standard high & standard high multilingual | 1473 | 67.83
high & high multilingual | 503 | 198.82


Below are benchmarks of the GPU container running on a g4dn.2xlarge instance, optimized for throughput with 128 concurrent requests:

Accuracy Mode | Throughput (words/sec) | Latency for 100 word request (ms)
standard | 57000 | 198
standard high & standard high multilingual | 50700 | 229
high & high multilingual | 21000 | 530

The above GPU container benchmarks are optimized for throughput. Latency as low as 10ms can be achieved when using a lower number of concurrent requests.
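As a rough planning aid, the throughput figures can be turned into processing time and instance cost estimates. The sketch below is a back-of-the-envelope calculation only: the function names are illustrative, it assumes sustained throughput equal to the benchmark figure, and it uses the $0.752 hourly price quoted for the g4dn.2xlarge, which may differ by region and over time.

```python
def estimate_hours(total_words: int, words_per_sec: float) -> float:
    """Hours needed to process total_words at a sustained throughput."""
    return total_words / words_per_sec / 3600.0


def estimate_cost(total_words: int, words_per_sec: float, price_per_hour: float) -> float:
    """Instance cost in dollars for processing total_words."""
    return estimate_hours(total_words, words_per_sec) * price_per_hour


if __name__ == "__main__":
    words = 100_000_000  # hypothetical workload size
    for mode, throughput in [("standard", 57000), ("high", 21000)]:
        hours = estimate_hours(words, throughput)
        cost = estimate_cost(words, throughput, price_per_hour=0.752)
        print(f"{mode}: {hours:.2f} h, ${cost:.2f}")
```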


Below are benchmarks of the CPU and GPU container for typical deployments to help with file processing estimates. PDFs were pre-processed to split large files into 5-page chunks to improve throughput via parallelization.

Instance Type | Throughput (pages/sec)
g4dn.2xlarge | 1.41
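The pre-processing step of splitting a large PDF into 5-page chunks can be planned with a small helper that computes the page ranges. This is a sketch of the chunking logic only: the function name is hypothetical, and the actual splitting and submission of each chunk are left to whichever PDF tooling is in use.

```python
from typing import List, Tuple


def chunk_page_ranges(num_pages: int, chunk_size: int = 5) -> List[Tuple[int, int]]:
    """Return inclusive, 1-based (first_page, last_page) ranges of at most chunk_size pages."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size - 1, num_pages))
        for start in range(1, num_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    # A 12-page document becomes three chunks that can be processed in parallel.
    print(chunk_page_ranges(12))
```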
© Copyright 2022, 2023 Private AI.