Throughput and latency of the Private AI container vary greatly depending on the hardware provisioned to the container. We strongly recommend using the hardware specified in the system requirements.


The table below shows the performance of the CPU container on various AWS instance types:

| Instance Type | Throughput (requests/sec) | Latency (ms) |
|---------------|---------------------------|--------------|
| c5.large      | 1.61                      | 620          |
| c5a.large     | 1.23                      | 816          |
| m5.large      | 1.43                      | 698          |
| m5n.large     | 3.69                      | 271          |
| m5zn.large    | 5.03                      | 199          |
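Figures like these can be reproduced with a simple concurrent benchmark harness that issues requests in parallel and records per-request latency. The sketch below is a minimal illustration, not the harness used for the numbers above; the request function is a stand-in (here a `time.sleep` stub) for a real call to the container's REST API.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(request_fn, n_requests=32, concurrency=4):
    """Measure throughput (req/s) and mean latency (ms) of request_fn
    when executed with the given number of concurrent workers."""
    latencies = []

    def timed():
        t0 = time.perf_counter()
        request_fn()
        latencies.append(time.perf_counter() - t0)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        for _ in range(n_requests):
            pool.submit(timed)
    # Exiting the context manager waits for all requests to finish.
    elapsed = time.perf_counter() - start
    return n_requests / elapsed, sum(latencies) / len(latencies) * 1000

# Stand-in for a real call to the container (hypothetical 10 ms round trip):
throughput, latency_ms = benchmark(lambda: time.sleep(0.01))
```

In a real run, `request_fn` would POST a fixed test input to the container and the concurrency setting would be tuned as described in the notes below.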

For best throughput, it is recommended to use workers that each run on a single logical core. The table below shows the scaling efficiency when running the container on multiple CPU cores:

| Instance Type | Logical CPU Cores | Throughput (requests/sec) | Latency (ms) | Scaling Efficiency (%) |
|---------------|-------------------|---------------------------|--------------|------------------------|
| m5zn.large    | 1                 | 5.03                      | 198.82       | 100                    |
| m5zn.3xlarge  | 6                 | 18.78                     | 53.2         | 62                     |
| m5zn.6xlarge  | 12                | 24.57                     | 40.64        | 41                     |
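Scaling efficiency here is the measured throughput expressed as a fraction of ideal linear scaling from the single-core baseline. A quick check of the m5zn.3xlarge row:

```python
def scaling_efficiency(throughput_n, throughput_1, n_cores):
    """Measured throughput relative to ideal linear scaling (percent)."""
    return throughput_n / (throughput_1 * n_cores) * 100

# m5zn.3xlarge: 6 logical cores at 18.78 req/s vs. 5.03 req/s on one core
print(round(scaling_efficiency(18.78, 5.03, 6)))  # 62, matching the table
```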

The default accuracy_mode value, high, offers the best PII detection performance; however, it can be changed to trade PII detection performance for speed:

| Accuracy Mode                              | Throughput (requests/sec) | Latency (ms) |
|--------------------------------------------|---------------------------|--------------|
| standard                                   | 25.83                     | 38.67        |
| standard high & standard high multilingual | 14.73                     | 67.83        |
| high & high multilingual                   | 5.03                      | 198.82       |


Below are benchmarks of the GPU container running on a g4dn.2xlarge instance, optimized for throughput with a concurrency of 128:

| Accuracy Mode                              | Throughput (requests/sec) | Latency (ms) |
|--------------------------------------------|---------------------------|--------------|
| standard                                   | 570                       | 198          |
| standard high & standard high multilingual | 507                       | 229          |
| high & high multilingual                   | 210                       | 530          |

The GPU container benchmarks above are optimized for throughput. Latency as low as 10 ms can be achieved by using a lower number of concurrent requests.
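The throughput/latency trade-off at a given concurrency can be sanity-checked with Little's law (in-flight requests ≈ throughput × latency). This is an illustrative calculation, not part of the benchmark itself; the gap between the ideal figure and the measured 570 req/s reflects queueing and request overhead.

```python
# Little's law: concurrency = throughput x latency, so
# ideal throughput = concurrency / latency.
concurrency = 128
latency_s = 0.198  # 198 ms, standard mode on g4dn.2xlarge

ideal_throughput = concurrency / latency_s
print(round(ideal_throughput))  # ~646 req/s upper bound; 570 req/s measured
```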


  • Unless otherwise stated, tests are run using the default "high" accuracy_mode.
  • All tests are performed with a 100-word / 500-character test input.
  • Processing time scales linearly with the length of the input text.
  • All benchmarks used concurrency settings optimized for throughput.
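Because processing time scales linearly with input length, latency for other input sizes can be estimated by simple proportion. The baseline numbers below come from the tables above; the helper function itself is illustrative, not part of the product.

```python
BASE_CHARS = 500          # benchmark test input size
BASE_LATENCY_MS = 198.82  # high mode on m5zn.large (single core)

def estimate_latency_ms(n_chars):
    """Estimate latency assuming linear scaling with input length."""
    return BASE_LATENCY_MS * n_chars / BASE_CHARS

print(round(estimate_latency_ms(1000), 2))  # ~397.64 ms for 1,000 characters
```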

For the full benchmark report, please contact us.

© Copyright 2022, Private AI.