Concurrency
The recommended concurrency level, that is, the optimal number of simultaneous requests to make to the container, is covered below for the CPU and GPU containers. The recommendation is driven primarily by the compute requirements of Private AI's Neural Network models, such as those used for PII detection. For a full example of how to make concurrent requests, please visit our examples repository.
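As a minimal sketch of making concurrent requests, the snippet below fans a list of texts out to the container using a thread pool. The container address, the /process/text route, the payload shape, and the deidentify helper are illustrative assumptions here; adapt them to your deployment and API version.

```python
import requests
from concurrent.futures import ThreadPoolExecutor

# Assumed container address, route, and payload shape; adjust to
# your deployment and API version.
URL = "http://localhost:8080/process/text"
CONCURRENCY = 4  # number of simultaneous in-flight requests

def deidentify(text: str) -> dict:
    # Send one document per request to the container.
    response = requests.post(URL, json={"text": [text]}, timeout=30)
    response.raise_for_status()
    return response.json()

texts = ["John Smith lives in Toronto.", "Call me at 555-0199."]

# The thread pool caps the number of simultaneous requests at CONCURRENCY.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(deidentify, texts))

for result in results:
    print(result)
```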
CPU
For Neural Network inference workloads, CPUs don't require inputs to be batched together to achieve good hardware utilization. In practice, because of network overhead and pre-/post-processing code, it is best to use a low concurrency level, such as one request per container instance.
GPU
Unlike CPUs, GPUs require inputs to be batched together and processed as a single large input to achieve optimal hardware utilization, which creates a tradeoff between latency and throughput. A concurrency level of 32 requests per container instance strikes a good balance between the two; to optimize for latency instead, 16 concurrent connections is recommended.
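The sketch below shows one way to cap in-flight requests to a GPU container at the recommended level, using asyncio with a semaphore and the httpx client. As in the earlier sketch, the endpoint and payload shape are assumptions; the semaphore value is the tunable concurrency level (32 for throughput, 16 to favor latency, or 1 for a CPU container instance).

```python
import asyncio
import httpx

# Assumed endpoint; same illustrative payload shape as the sketch above.
URL = "http://localhost:8080/process/text"
CONCURRENCY = 32  # per-container cap; use 16 to favor latency, 1 for CPU

async def deidentify(client: httpx.AsyncClient, sem: asyncio.Semaphore, text: str) -> dict:
    # The semaphore ensures at most CONCURRENCY requests are in flight,
    # giving the GPU container a steady stream of inputs to batch.
    async with sem:
        response = await client.post(URL, json={"text": [text]}, timeout=30.0)
        response.raise_for_status()
        return response.json()

async def main(texts: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient() as client:
        tasks = [deidentify(client, sem, text) for text in texts]
        return await asyncio.gather(*tasks)

texts = [f"Record {i}: Jane Doe, 555-0199" for i in range(100)]
results = asyncio.run(main(texts))
```

Because the semaphore throttles on the client side, the container never sees more than the configured number of simultaneous inputs, regardless of how many documents are queued.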
The guidance above is a recommended starting point for most installation types. For both CPU and GPU instance types, it is recommended that you tune the volume of simultaneous requests based on your unique traffic patterns and use-case-specific volumes. Acceptable latency, desired throughput, and tolerance for variability all greatly influence how you manage the overall load and performance of your Private AI installation.