
[Question] Multiple model inputs and GPU allocations #269

Open · msyulia opened this issue Aug 29, 2024 · 0 comments
msyulia commented Aug 29, 2024

Hi!

I wasn't sure whether to file this as a bug or whether it works as intended.

I'm currently facing an issue where a model deployed via the Triton ONNX backend, with up to a hundred inputs, shows a relatively high nv_inference_compute_input_duration_us. From my understanding, this metric also includes the time spent copying tensor data to the GPU. Is it possible that each input results in a separate GPU allocator call?

From what I see in ModelInstanceState::SetInputTensors (https://github.com/triton-inference-server/onnxruntime_backend/blob/main/src/onnxruntime.cc#L2273), inputs are processed sequentially and each input results in a call to CreateTensorWithDataAsOrtValue. Could this lead to separate GPU allocations and copies per input, and therefore a long nv_inference_compute_input_duration_us? Or does copying tensor data to the GPU happen before the request is passed to the ONNX backend?
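For reference, here is a rough sketch of the per-input pattern I'm describing, as I understand it from reading the code. This is not the actual backend source; the buffer-acquisition callback and struct names are hypothetical, and I'm only trying to illustrate the "one buffer + one OrtValue per input" loop I'm asking about:

```cpp
// Simplified, hypothetical sketch of per-input tensor binding.
// Assumption: each input gets its own destination buffer and its own
// OrtValue via CreateTensorWithDataAsOrtValue (which wraps existing
// memory; any host-to-GPU copy has to happen elsewhere).
#include <onnxruntime_c_api.h>

#include <string>
#include <vector>

struct InputDesc {
  std::string name;
  std::vector<int64_t> shape;
  size_t byte_size;
  ONNXTensorElementDataType dtype;
};

void BindInputsSequentially(
    const OrtApi* ort, const OrtMemoryInfo* gpu_mem_info,
    const std::vector<InputDesc>& inputs,
    void* (*acquire_buffer)(size_t),  // hypothetical per-input buffer source
    std::vector<OrtValue*>& ort_values)
{
  for (const auto& input : inputs) {
    // If this maps to a separate device allocation + copy for every input,
    // ~100 inputs would mean ~100 small allocations/copies per request,
    // which is what I suspect shows up in
    // nv_inference_compute_input_duration_us.
    void* dst = acquire_buffer(input.byte_size);

    OrtValue* value = nullptr;
    OrtStatus* status = ort->CreateTensorWithDataAsOrtValue(
        gpu_mem_info, dst, input.byte_size,
        input.shape.data(), input.shape.size(),
        input.dtype, &value);
    if (status != nullptr) {
      ort->ReleaseStatus(status);
      continue;  // error handling elided in this sketch
    }
    ort_values.push_back(value);
  }
}
```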
