
[Question] Multiple model inputs and GPU allocations #269

Open · msyulia opened this issue Aug 29, 2024 · 0 comments
msyulia commented Aug 29, 2024

Hi!

I wasn't sure whether to file this as a bug or whether it works as intended.

I'm currently facing an issue where a model deployed via the Triton ONNX backend, with up to a hundred inputs, shows a relatively high nv_inference_compute_input_duration_us. From my understanding, this metric also includes the time spent copying tensor data to the GPU. Is it possible that each input results in a separate GPU allocator call?

From what I see in ModelInstanceState::SetInputTensors (https://github.com/triton-inference-server/onnxruntime_backend/blob/main/src/onnxruntime.cc#L2273), inputs are processed sequentially and each input results in a call to CreateTensorWithDataAsOrtValue. Could this lead to separate GPU allocations and copies per input, and therefore a long nv_inference_compute_input_duration_us? Or does copying tensor data to the GPU happen before the request is passed to the ONNX backend?
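For reference, here is a rough sketch of the per-input pattern I'm describing, as I understand it from reading the code. This is not the actual backend source; the buffer-acquisition callback and struct names are hypothetical, and I'm only trying to illustrate the "one buffer + one OrtValue per input" loop I'm asking about:

```cpp
// Simplified, hypothetical sketch of per-input tensor binding.
// Assumption: each input gets its own destination buffer and its own
// OrtValue via CreateTensorWithDataAsOrtValue (which wraps existing
// memory; any host-to-GPU copy has to happen elsewhere).
#include <onnxruntime_c_api.h>

#include <string>
#include <vector>

struct InputDesc {
  std::string name;
  std::vector<int64_t> shape;
  size_t byte_size;
  ONNXTensorElementDataType dtype;
};

void BindInputsSequentially(
    const OrtApi* ort, const OrtMemoryInfo* gpu_mem_info,
    const std::vector<InputDesc>& inputs,
    void* (*acquire_buffer)(size_t),  // hypothetical per-input buffer source
    std::vector<OrtValue*>& ort_values)
{
  for (const auto& input : inputs) {
    // If this maps to a separate device allocation + copy for every input,
    // ~100 inputs would mean ~100 small allocations/copies per request,
    // which is what I suspect shows up in
    // nv_inference_compute_input_duration_us.
    void* dst = acquire_buffer(input.byte_size);

    OrtValue* value = nullptr;
    OrtStatus* status = ort->CreateTensorWithDataAsOrtValue(
        gpu_mem_info, dst, input.byte_size,
        input.shape.data(), input.shape.size(),
        input.dtype, &value);
    if (status != nullptr) {
      ort->ReleaseStatus(status);
      continue;  // error handling elided in this sketch
    }
    ort_values.push_back(value);
  }
}
```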
