Info: Looking for accuracy benchmarks? Please download our Whitepaper

Benchmarked against container version 4.2.0

NER Benchmarks

The following section provides NER performance figures for Private AI's CPU and GPU containers on various VM instance types, including the hardware listed in the system requirements.

These numbers were computed by generating load on the process/text route using the default settings (i.e., HIGH_AUTOMATIC accuracy mode and heuristics coreference). Requests to the process/text route were created using an internal dataset of English examples of varied length. The load was scaled to a concurrency level that maximizes the throughput of the process/text endpoint; therefore, you can expect lower latency under a lighter load. A latency as low as 10ms can be achieved on a 100-word input when using a GPU deployment.
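
For reference, a single request against the process/text route with the default settings can be issued as in the sketch below. The host and port are assumptions and should be adjusted to match your deployment.

```python
# Minimal sketch of a single request to the process/text route with default
# settings. The host and port (localhost:8080) are assumptions; adjust them
# to match your deployment.
import requests

resp = requests.post(
    "http://localhost:8080/process/text",
    json={"text": ["John Smith called from 416-555-0199 about his account."]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```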

NER Performance on CPU

The table below illustrates the performance of the CPU container on various instance types:

Platform | Instance Type | Throughput1 (words/sec) | Average Latency2 (ms)
Azure | Standard_E2_v5 (2 vCPUs, 16GB RAM) | 513 | 1022
Azure | Standard_E8_v5 (8 vCPUs, 64GB RAM) | 1719 | 304
AWS | m7i.xlarge (4 vCPUs, 16GB RAM) | 834 | 628
AWS | m7i.4xlarge (16 vCPUs, 64GB RAM) | 1843 | 285

1 Throughput is given in words per second, where a word denotes a whitespace-separated piece of text.

2 The average example length used for the testing is 131 words. The values in this column are the average latency over all examples.

When using the STANDARD or STANDARD_MULTILINGUAL accuracy mode, you should expect a throughput that is around 4 to 5 times these numbers. Similarly, the STANDARD_HIGH and STANDARD_HIGH_MULTILINGUAL accuracy modes will deliver a throughput that is around 3 times these numbers.

Note that the coreference_resolution settings model_prediction and combined have a significant impact on performance: you can expect throughput to drop by roughly a factor of 10 when either of these options is enabled.
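
As an illustration, the sketch below shows how a non-default accuracy mode and coreference setting might be passed in the request body. The exact field names, their placement, and the accepted values are assumptions here; consult the API reference that matches your container version.

```python
# Hypothetical payload illustrating how accuracy mode and coreference settings
# might be passed to process/text. Field names, placement, and value casing
# are assumptions -- check the API reference for your container version.
import requests

payload = {
    "text": ["Sample text to de-identify."],
    "entity_detection": {
        "accuracy": "standard",  # faster than the default high_automatic mode
    },
    # model_prediction / combined coreference can cut throughput by ~10x;
    # the location of this setting in the payload is an assumption.
    "coreference_resolution": "heuristics",
}

resp = requests.post("http://localhost:8080/process/text", json=payload, timeout=30)
print(resp.status_code)
```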

NER Performance on GPU

The table below contains the benchmarks of the GPU container running on different instance types equipped with a single GPU. Note that the Private AI GPU container is designed to run on a single GPU and will not leverage multiple GPUs.

Platform | Instance Type | Throughput1 (words/sec) | Average Latency2 (ms)
Azure | Standard_NC4as_T4_v3 (4 vCPUs, 28GB RAM) | 11900 | 131
Azure | Standard_NC8as_T4_v3 (8 vCPUs, 56GB RAM) | 11450 | 137
AWS | g4dn.2xlarge (8 vCPUs, 32GB RAM) | 12100 | 538
AWS | g4dn.4xlarge (16 vCPUs, 64GB RAM) | 14000 | 186
AWS | g5.4xlarge (16 vCPUs, 64GB RAM) | 28200 | 453

1 Throughput is given in words per second, where a word denotes a whitespace-separated piece of text.

2 The average example length used for the testing is 131 words. The values in this column are the average latency over all examples.

When using the STANDARD accuracy mode, you should expect a throughput that is around 4 times these numbers. Similarly, the STANDARD_HIGH accuracy mode will deliver a throughput that is 3 times these numbers.

Using the model_prediction or combined coreference resolution modes with the GPU container is not recommended because of the significant impact these modes have on throughput.

PDF and Image Benchmark

Private AI recommends that documents be processed on GPU instances. Below are benchmarks of the GPU container with all document features enabled, including the default OCR, object detection, and NER modes.

Note that PDFs are processed as images, so processing one page of a PDF is roughly equivalent to processing one image.

Note also that the processing time of PDFs and images may vary depending on image size, resolution, and the amount of text.

Instance Type | Throughput (pages/sec)
g4dn.2xlarge | 1.41
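
For illustration, a PDF can be submitted for processing roughly as in the sketch below. The process/files/base64 route name and the payload shape are assumptions based on typical usage; verify them against the API reference for your container version.

```python
# Rough sketch of sending a PDF to the document-processing route. The
# process/files/base64 route name and payload shape are assumptions --
# verify against the API reference for your container version.
import base64
import requests

with open("statement.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")

resp = requests.post(
    "http://localhost:8080/process/files/base64",
    json={"file": {"data": encoded, "content_type": "application/pdf"}},
    timeout=300,  # multi-page PDFs take longer than plain text
)
print(resp.status_code)
```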

Audio Benchmark

The throughput of Private AI's audio processing is provided below for the GPU and CPU images on common AWS instance types.

Instance Type | Image Type | Throughput (RTFx1)
g4dn.2xlarge | GPU | ~23.0
m7i.4xlarge | CPU | ~2.8

1 The RTFx (inverse real-time factor) measures how many minutes of audio can be processed in one minute. An RTFx value of 30 means that an hour of audio is processed in 2 minutes.
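
The arithmetic behind these RTFx figures can be sketched as follows, using the GPU and CPU values from the table above.

```python
# Small helper illustrating the RTFx arithmetic from the footnote above:
# processing time = audio duration / RTFx.
def processing_minutes(audio_minutes: float, rtfx: float) -> float:
    """Return wall-clock minutes needed to process `audio_minutes` of audio."""
    return audio_minutes / rtfx

# One hour of audio at RTFx ~23 (GPU) vs ~2.8 (CPU), per the table above.
print(round(processing_minutes(60, 23.0), 1))  # ~2.6 minutes on GPU
print(round(processing_minutes(60, 2.8), 1))   # ~21.4 minutes on CPU
```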

As you can see from the above results, the GPU image is around 8 times faster than the CPU image.

Additional Guidelines

Hardware

Hardware type matters. m5zn instances powered by recent Intel Xeon CPUs with AVX512 VNNI support perform over 3X faster than generic instances like c5. For this reason, it is recommended to use the hardware specified in the system requirements.

As such, it is best to avoid AWS Fargate, which is typically provisioned with older CPUs similar to those found in c5 instances.

Scaling Considerations

Latency

The latency on the process/text endpoint scales approximately linearly with the request length. To reduce latency, one can call the process/text endpoint with smaller requests; in general, this will not improve throughput. However, feeding the models with very short inputs (i.e., a few words to a couple of sentences) may reduce the models' accuracy because of the lack of context.
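
A minimal sketch of this chunking approach is shown below; the 500-word chunk size is an illustrative assumption chosen to keep chunks large enough to preserve context.

```python
# Illustrative sketch of splitting a long document into smaller requests to
# reduce per-request latency. The chunk size and host are assumptions.
import requests

def chunk_words(text: str, max_words: int = 500):
    """Yield whitespace-delimited chunks of at most max_words words."""
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])

long_document = "..."  # your text here
for chunk in chunk_words(long_document):
    resp = requests.post(
        "http://localhost:8080/process/text",
        json={"text": [chunk]},
        timeout=30,
    )
    resp.raise_for_status()
```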

Throughput

To maximize throughput, it is recommended to use a large number of concurrent requests. Batching smaller requests together does not improve throughput significantly.

For very large deployments, GPU instances are recommended. A single low-cost inference instance such as the g4dn.2xlarge (~$0.752 USD per hour) can process 1GB of Unicode text in under an hour.
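
A simple way to generate concurrent load is with a thread pool, as in the sketch below. The worker count is an assumption and should be tuned to your instance type.

```python
# Sketch of driving the container with many concurrent requests to maximize
# throughput. The worker count and host are assumptions to tune per instance.
from concurrent.futures import ThreadPoolExecutor
import requests

def deidentify(text: str) -> dict:
    resp = requests.post(
        "http://localhost:8080/process/text",
        json={"text": [text]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()

texts = ["Example record %d for John Smith." % i for i in range(1000)]
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(deidentify, texts))
```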

Note that scaling GPU instances requires balancing GPU and CPU resources. As a rule of thumb, larger models like the high accuracy NER model are GPU bound, while smaller models like the standard accuracy NER model are CPU bound (on a GPU instance with too few CPU cores).

It is best to experiment with a few configurations to find the one that best fits your data.

Kubernetes Deployments

You should expect slightly lower throughput and slightly higher latency when running the Private AI container on Kubernetes deployments using any of the instance types above. This is due to the overhead of the Kubernetes environment and to resources possibly being reserved for Kubernetes processes.
