The recommended level of concurrency, i.e. the optimal number of simultaneous requests to make to the container, is covered below for the CPU and GPU containers. The recommended concurrency level is driven primarily by the compute requirements of Private AI's neural network models, such as those used for PII detection. For an example of how to make concurrent requests, please visit our examples repository.
For neural network inference workloads, CPUs don't require inputs to be batched together to achieve good hardware utilization. In practice, because of network overhead and pre/post-processing code, it is best to use a low level of concurrency, such as 2 per container instance. If latency isn't a concern, a value of 32 is recommended.
Unlike CPUs, GPUs require inputs to be batched together and processed as a single large input to achieve optimal hardware utilization. This means that there is a tradeoff between latency and throughput: higher concurrency allows larger batches and better throughput, but each request waits longer. A concurrency level of 32 per container instance is a good tradeoff between the two; concurrency levels as low as 8 do not significantly impact throughput. If latency isn't a concern, a value of 128 will ensure maximum hardware utilization.
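As a rough illustration of the guidance above, the following Python sketch caps the number of in-flight requests at a chosen concurrency level using a thread pool. It is not taken from the examples repository. The names `RECOMMENDED_CONCURRENCY`, `process_concurrently` and `send_request` are hypothetical, and `send_request` stands in for whatever function actually calls the container, so no endpoint or payload format is assumed here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence, Tuple

# Concurrency levels per container instance, taken from the guidance above.
# The dictionary layout and key names are illustrative only.
RECOMMENDED_CONCURRENCY: Dict[Tuple[str, str], int] = {
    ("cpu", "balanced"): 2,       # low concurrency keeps latency down
    ("cpu", "throughput"): 32,    # latency is not a concern
    ("gpu", "balanced"): 32,      # tradeoff between latency and throughput
    ("gpu", "throughput"): 128,   # maximum hardware utilization
}


def process_concurrently(
    texts: Sequence[str],
    send_request: Callable[[str], str],
    concurrency: int,
) -> List[str]:
    """Call send_request once per text, with at most `concurrency` calls in flight.

    Results are returned in the same order as `texts`.
    """
    # max_workers bounds how many requests run at the same time.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send_request, texts))


if __name__ == "__main__":
    # Placeholder for a real call to the container (hypothetical).
    def fake_send(text: str) -> str:
        return text.upper()

    level = RECOMMENDED_CONCURRENCY[("gpu", "balanced")]
    print(process_concurrently(["first input", "second input"], fake_send, level))
```

A thread pool is used here because the work is I/O-bound from the client's point of view; an asynchronous HTTP client with a semaphore would be an equally reasonable way to enforce the same limit.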