Info: Looking for accuracy benchmarks? Please download our Whitepaper

Benchmarked against container version 4.2.0

NER Benchmarks

The following section provides NER performance figures for Private AI's CPU and GPU containers on various VM instance types, including the hardware listed in the system requirements.

These numbers were computed by generating load on the process/text route using the default settings (i.e., HIGH_AUTOMATIC accuracy mode and heuristics coreference). Requests to the process/text route were created using an internal dataset of English examples of varied length. The load was scaled to a concurrency level that maximizes the throughput of the process/text endpoint; therefore, you can expect lower latency under a lighter load. A latency as low as 10ms can be achieved on a 100-word input when using a GPU deployment.
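
For reference, a single request against the process/text route with the default settings can be issued as in the sketch below. The host and port are assumptions and should be adjusted to match your deployment.

```python
# Minimal sketch of a single request to the process/text route with default
# settings. The host and port (localhost:8080) are assumptions; adjust them
# to match your deployment.
import requests

resp = requests.post(
    "http://localhost:8080/process/text",
    json={"text": ["John Smith called from 416-555-0199 about his account."]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```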

NER Performance on CPU

The table below illustrates the performance of the CPU container on various instance types:

Platform | Instance Type | Throughput1 (words/sec) | Average Latency2 (ms)
Azure | Standard_E2_v5 (2 vCPUs, 16GB RAM) | 513 | 1022
Azure | Standard_E8_v5 (8 vCPUs, 64GB RAM) | 1719 | 304
AWS | m7i.xlarge (4 vCPUs, 16GB RAM) | 834 | 628
AWS | m7i.4xlarge (16 vCPUs, 64GB RAM) | 1843 | 285

1 Throughput is given in words per second, where a word denotes a whitespace-separated piece of text.

2 The average example length used for the testing is 131 words. The values in this column are the average latency over all examples.

When using the STANDARD or STANDARD_MULTILINGUAL accuracy mode, you should expect a throughput that is around 4 to 5 times these numbers. Similarly, the STANDARD_HIGH and STANDARD_HIGH_MULTILINGUAL accuracy modes will deliver a throughput that is around 3 times these numbers.

Note that the coreference_resolution settings model_prediction and combined have a significant impact on performance: you can expect throughput to drop by roughly a factor of 10 when either of these options is enabled.
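
As an illustration, the sketch below shows how a non-default accuracy mode and coreference setting might be passed in the request body. The exact field names, their placement, and the accepted values are assumptions here; consult the API reference that matches your container version.

```python
# Hypothetical payload illustrating how accuracy mode and coreference settings
# might be passed to process/text. Field names, placement, and value casing
# are assumptions -- check the API reference for your container version.
import requests

payload = {
    "text": ["Sample text to de-identify."],
    "entity_detection": {
        "accuracy": "standard",  # faster than the default high_automatic mode
    },
    # model_prediction / combined coreference can cut throughput by ~10x;
    # the location of this setting in the payload is an assumption.
    "coreference_resolution": "heuristics",
}

resp = requests.post("http://localhost:8080/process/text", json=payload, timeout=30)
print(resp.status_code)
```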

NER Performance on GPU

The table below contains the benchmarks of the GPU container running on different instance types equipped with a single GPU. Note that the Private AI GPU container is designed to run on a single GPU and will not leverage multiple GPUs.

Platform | Instance Type | Throughput1 (words/sec) | Average Latency2 (ms)
Azure | Standard_NC4as_T4_v3 (4 vCPUs, 28GB RAM) | 11900 | 131
Azure | Standard_NC8as_T4_v3 (8 vCPUs, 56GB RAM) | 11450 | 137
AWS | g4dn.2xlarge (8 vCPUs, 32GB RAM) | 12100 | 538
AWS | g4dn.4xlarge (16 vCPUs, 64GB RAM) | 14000 | 186
AWS | g5.4xlarge (16 vCPUs, 64GB RAM) | 28200 | 453

1 Throughput is given in words per second, where a word denotes a whitespace-separated piece of text.

2 The average example length used for the testing is 131 words. The values in this column are the average latency over all examples.

When using the STANDARD accuracy mode, you should expect a throughput that is around 4 times these numbers. Similarly, the STANDARD_HIGH accuracy mode will deliver a throughput that is 3 times these numbers.

Using the model_prediction or combined coreference resolution modes with the GPU container is not recommended because of the significant impact these modes have on throughput.

PDF and Image Benchmark

Private AI recommends that documents be processed on GPU instances. Below are benchmarks of the GPU container with all document features enabled, including the default OCR, object detection, and NER modes.

Note that PDFs are processed as images, so processing one page of a PDF is roughly equivalent to processing one image.

Note also that the processing time of PDFs and images may vary depending on image size, resolution, and the amount of text.

Instance Type | Throughput (pages/sec)
g4dn.2xlarge | 1.41
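
For illustration, a PDF can be submitted for processing roughly as in the sketch below. The process/files/base64 route name and the payload shape are assumptions based on typical usage; verify them against the API reference for your container version.

```python
# Rough sketch of sending a PDF to the document-processing route. The
# process/files/base64 route name and payload shape are assumptions --
# verify against the API reference for your container version.
import base64
import requests

with open("statement.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:8080/process/files/base64",
    json={"file": {"data": encoded, "content_type": "application/pdf"}},
    timeout=300,  # multi-page PDFs take longer than plain text
)
print(resp.status_code)
```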

Audio Benchmark

The throughput of Private AI's audio processing is provided below for the GPU and CPU images on common AWS instance types.

Instance Type | Image Type | Throughput (RTFx1)
g4dn.2xlarge | GPU | ~23.0
m7i.4xlarge | CPU | ~2.8

1 The RTFx (inverse real-time factor) measures how many minutes of audio can be processed in one minute. An RTFx value of 30 means that an hour of audio is processed in 2 minutes.
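
The arithmetic behind these RTFx figures can be sketched as follows, using the GPU and CPU values from the table above.

```python
# Small helper illustrating the RTFx arithmetic from the footnote above:
# processing time = audio duration / RTFx.
def processing_minutes(audio_minutes: float, rtfx: float) -> float:
    """Return wall-clock minutes needed to process `audio_minutes` of audio."""
    return audio_minutes / rtfx

# One hour of audio at RTFx ~23 (GPU) vs ~2.8 (CPU), per the table above.
print(round(processing_minutes(60, 23.0), 1))  # ~2.6 minutes on GPU
print(round(processing_minutes(60, 2.8), 1))   # ~21.4 minutes on CPU
```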

As you can see from the above results, the GPU image is around 8 times faster than the CPU image.

Additional Guidelines

Hardware

Hardware type matters. m5zn instances powered by recent Intel Xeon CPUs with AVX512 VNNI support perform over 3X faster than generic instances like c5. For this reason, it is recommended to use the hardware specified in the system requirements.

As such, it is best to avoid AWS Fargate, which is typically provisioned with older CPUs similar to those found in c5 instances.

Scaling Considerations

Latency

The latency on the process/text endpoint scales approximately linearly with the request length. To reduce latency, one can call the process/text endpoint with smaller requests; in general, this will not improve throughput. However, feeding the models with very short inputs (i.e., a few words to a couple of sentences) may reduce the models' accuracy because of the lack of context.
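
A minimal sketch of this chunking approach is shown below; the 500-word chunk size is an illustrative assumption chosen to keep chunks large enough to preserve context.

```python
# Illustrative sketch of splitting a long document into smaller requests to
# reduce per-request latency. The chunk size and host are assumptions.
import requests

def chunk_words(text: str, max_words: int = 500):
    """Yield whitespace-delimited chunks of at most max_words words."""
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])

long_document = "..."  # your text here
for chunk in chunk_words(long_document):
    resp = requests.post(
        "http://localhost:8080/process/text",
        json={"text": [chunk]},
        timeout=30,
    )
    resp.raise_for_status()
```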

Throughput

To maximize throughput, it is recommended to use a large number of concurrent requests. Batching smaller requests together does not improve throughput significantly.

For very large deployments, GPU instances are recommended. A single low-cost inference instance such as the g4dn.2xlarge (~$0.752 USD per hour) can process 1GB of Unicode text in under an hour.
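
A simple way to generate concurrent load is with a thread pool, as in the sketch below. The worker count is an assumption and should be tuned to your instance type.

```python
# Sketch of driving the container with many concurrent requests to maximize
# throughput. The worker count and host are assumptions to tune per instance.
from concurrent.futures import ThreadPoolExecutor
import requests

def deidentify(text: str) -> dict:
    resp = requests.post(
        "http://localhost:8080/process/text",
        json={"text": [text]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

texts = ["Example record %d for John Smith." % i for i in range(1000)]
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(deidentify, texts))
```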

Note that scaling GPU instances requires balancing GPU and CPU resources. As a rule of thumb, larger models like the high accuracy NER model are GPU bound, while smaller models like the standard accuracy NER model are CPU bound (on a GPU instance with too few CPU cores).

It is best to experiment with a few configurations to find the one that best fits your data.

Kubernetes Deployments

You should expect slightly lower throughput and slightly higher latency when running the Private AI container on Kubernetes deployments using any of the instance types above. This is due to the overhead of the Kubernetes environment and to resources possibly being reserved for Kubernetes processes.
