Benchmarks
Info: Looking for accuracy benchmarks? Please download our Whitepaper.
The following section provides performance figures for Private AI's CPU and GPU containers on various AWS instance types, including the hardware listed in the system requirements.
Notes:
- Throughput is given in words, where a word denotes a whitespace-separated piece of text (see the word-count sketch after these notes).
- Words are roughly equivalent to tokens, such as those used by OpenAI to measure volume.
- Throughput and latency tests are performed with a 100-word/500-character test input.
- Unless otherwise stated, tests are run using the default "high" accuracy_mode.
- All benchmarks used concurrency settings optimized for throughput.
Please contact us if you require any further information.
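To make the throughput figures concrete, the sketch below estimates processing time from a word count, using whitespace splitting to match the definition above. The throughput figure is taken from the CPU table in this section; the file path is a placeholder, not part of the product.

```python
# Estimate processing time for a text file from a benchmark throughput figure.
# The 503 words/sec value (m5zn.large, "high" mode) comes from the CPU table
# below; "sample.txt" is a placeholder path.

def word_count(text: str) -> int:
    """Count words as whitespace-separated pieces of text, per the note above."""
    return len(text.split())

def estimated_seconds(text: str, words_per_sec: float) -> float:
    return word_count(text) / words_per_sec

with open("sample.txt", encoding="utf-8") as f:
    doc = f.read()

print(f"~{estimated_seconds(doc, words_per_sec=503):.1f}s on one m5zn.large core")
```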
Key Takeaways
- Private AI utilises the latest in transformer technology to deliver the highest possible PII detection performance. At the same time, Private AI runs tens of times faster than BERT-style models and hundreds of times faster than LLMs without compromising accuracy.
- CPU instances are fine for most use cases, with even a single CPU core able to process 500 words/s.
- For very large deployments, GPU instances are recommended. A single low-cost inference instance such as the g4dn.2xlarge ($0.752 per hour) can process 1GB of unicode text in under an hour.
- Hardware type matters. m5zn instances powered by recent Intel Xeon CPUs with AVX512 VNNI support perform over 3X faster than generic instances like c5. For this reason, it is recommended to use the hardware specified in the system requirements.
- It is best to avoid AWS Fargate, which is typically provisioned with older CPUs like the c5.
- Latency scales approximately linearly with request length. Throughput is unaffected.
- Batching does not improve throughput significantly. Instead, it is recommended to use a large number of concurrent requests (see the concurrent-request sketch after this list).
- GPU instances are recommended when processing files.
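As a rough illustration of the concurrency recommendation above, the sketch below fans a list of texts out over many simultaneous requests rather than batching them into one call. The endpoint URL, port, and payload shape are assumptions for illustration only; check them against the container's API reference.

```python
# Sketch: many concurrent requests instead of one large batch.
# NOTE: the URL and payload shape below are assumptions for illustration;
# consult the container's API reference for the actual schema.
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:8080/process/text"  # hypothetical endpoint

def deidentify(text: str) -> dict:
    resp = requests.post(URL, json={"text": [text]})  # hypothetical schema
    resp.raise_for_status()
    return resp.json()

texts = ["John Smith lives in Toronto."] * 256  # sample workload

# 128 workers mirrors the concurrency used for the GPU benchmarks below.
with ThreadPoolExecutor(max_workers=128) as pool:
    results = list(pool.map(deidentify, texts))
```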
CPU
The table below illustrates the performance of the CPU container on various AWS instance types:
| Instance Type | Throughput (words/sec) | Latency for 100-word request (ms) |
|---|---|---|
| c5.large | 161 | 620 |
| c5a.large | 123 | 816 |
| m5.large | 143 | 698 |
| m5n.large | 369 | 271 |
| m5zn.large | 503 | 199 |
For best throughput, it is recommended to use single-logical-core workers. The table below illustrates the scaling efficiency when running the container on multiple CPU cores:
| Instance Type | Logical CPU Cores | Throughput (words/sec) | Latency for 100-word request (ms) | Scaling Efficiency (%) |
|---|---|---|---|---|
| m5zn.large | 1 | 503 | 198.82 | 100 |
| m5zn.3xlarge | 6 | 1878 | 53.2 | 62 |
| m5zn.6xlarge | 12 | 2457 | 40.64 | 41 |
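Scaling efficiency here appears to be throughput relative to perfect linear scaling from the single-core figure: for example, 1878 / (6 × 503) ≈ 0.62, i.e. 62%.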
The default accuracy_mode value, high, offers the best PII detection performance; however, it can be changed to trade PII detection performance for speed (a per-request sketch follows the table):
| Accuracy Mode | Throughput (words/sec) | Latency for 100-word request (ms) |
|---|---|---|
| standard | 2583 | 38.67 |
| standard high & standard high multilingual | 1473 | 67.83 |
| high & high multilingual | 503 | 198.82 |
GPU
Below are benchmarks of the GPU container running on a g4dn.2xlarge instance when optimized for throughput with a concurrency of 128:
| Accuracy Mode | Throughput (words/sec) | Latency for 100-word request (ms) |
|---|---|---|
| standard | 57000 | 198 |
| standard high & standard high multilingual | 50700 | 229 |
| high & high multilingual | 21000 | 530 |
The above GPU container benchmarks are optimized for throughput. Latency as low as 10 ms can be achieved when using fewer concurrent requests.
Below are benchmarks of the CPU and GPU container for typical deployments to help with file-processing estimates. PDFs were pre-processed to split large files into 5-page chunks to improve throughput via parallelization (a splitting sketch follows the table).
| Instance Type | Throughput (pages/sec) |
|---|---|
| g4dn.2xlarge | 1.41 |
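A minimal sketch of the 5-page pre-splitting step is shown below, using the pypdf library as one possible implementation; the library choice and file paths are assumptions, not part of the product.

```python
# Sketch: split a large PDF into 5-page chunks so the chunks can be
# processed in parallel. pypdf is one possible library choice here;
# the paths are placeholders.
from pypdf import PdfReader, PdfWriter

CHUNK_PAGES = 5

reader = PdfReader("large_input.pdf")
num_pages = len(reader.pages)
for start in range(0, num_pages, CHUNK_PAGES):
    writer = PdfWriter()
    for i in range(start, min(start + CHUNK_PAGES, num_pages)):
        writer.add_page(reader.pages[i])
    with open(f"chunk_{start // CHUNK_PAGES:04d}.pdf", "wb") as f:
        writer.write(f)
```

At the benchmarked 1.41 pages/sec, a 1,000-page corpus would take roughly 12 minutes on a single g4dn.2xlarge.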
Audio
Below are benchmarks of the CPU and GPU containers for typical deployments to help with audio file processing estimates.
| Instance Type | Standard ASR Throughput | Premium ASR Throughput |
|---|---|---|
| g4dn.2xlarge (GPU image) | Not available for GPU image | 20x realtime |
| m5zn.xlarge (CPU image) | 2x realtime | Not available for CPU image |
Note: Audio benchmarks are given as a multiple of the audio length. For example, "20x realtime" means 20 minutes of audio will be processed in a minute.
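To turn a realtime multiple into a wall-clock estimate, divide the audio duration by the multiple; a tiny helper (illustrative only) is shown below.

```python
# Estimate wall-clock processing time from a realtime multiple.
# E.g. 60 minutes of audio at 20x realtime -> 3 minutes.
def processing_minutes(audio_minutes: float, realtime_multiple: float) -> float:
    return audio_minutes / realtime_multiple

print(processing_minutes(60, 20))  # 3.0
print(processing_minutes(60, 2))   # 30.0
```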