Error deploying custom trained model on NVIDIA Jetson Xavier

I trained an object detection model using Roboflow and would like to deploy the model for inference on an NVIDIA Jetson AGX Xavier.

NVIDIA Jetson AGX Xavier Specs:

  • L4T 35.3.1 (JetPack 5.1.1)
  • Ubuntu 20.04.5 LTS
  • CUDA 11.4.315
  • CUDNN 8.6.0.166
  • TensorRT 8.5.2.2

I have followed both the legacy documentation and the current documentation.

Legacy Method

When following the legacy method, I can run the server fine using this line:

sudo docker run --net=host --gpus all roboflow/inference-server:jetson

I receive an error when attempting to run inference using this line:

base64 YOUR_IMAGE.jpg | curl -d @- \
"http://localhost:9001/your-model/42?api_key=YOUR_KEY"

When I instead use the hosted API, however, it works great:

base64 YOUR_IMAGE.jpg | curl -d @- \
"https://detect.roboflow.com/your-model/42?api_key=YOUR_KEY"

The error I receive when trying to run inference locally is shown here:

{
    "error": "This execution contains the node 'StatefulPartitionedCall/assert_equal_1/Assert/AssertGuard/branch_executed/_139', which has the dynamic op 'Merge'. Please use model.executeAsync() instead. Alternatively, to avoid the dynamic ops, specify the inputs [Identity]"
}

Current Method

When using the current method, I cannot run the roboflow inference server using this line:

sudo docker run --privileged --net=host --gpus all \
  --mount source=roboflow,target=/cache -e NUM_WORKERS=1 \
  roboflow/roboflow-inference-server-trt-jetson:latest

The error I receive is:

OSError: libcurand.so.10: cannot open shared object file: No such file or directory

I’ve checked, and libcurand.so.10 is present in /usr/local/cuda/lib64 on the host.

I’ve tried using

export LD_LIBRARY_PATH=/usr/local/cuda/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

with no success.
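As far as I can tell, a host-side export does not propagate into the container, so the library would need to be visible inside the image itself. One way to check that (a sketch, assuming the image provides /bin/sh):

# Check whether libcurand is visible *inside* the container; a host-side
# `export LD_LIBRARY_PATH=...` does not reach the container's environment.
sudo docker run --rm --gpus all --entrypoint /bin/sh \
  roboflow/roboflow-inference-server-trt-jetson:latest \
  -c "ldconfig -p | grep libcurand"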

My questions are:

  1. Which deployment method should I be following?
  2. Has anyone run into these errors before and have any thoughts on what might be going wrong?

Hi, I am having the same issue here, but only with our newly trained model. I am using the legacy documentation, which is where the Universe project page redirects me.

We have trained an OD model via Roboflow Train (cobe/1, MODEL TYPE: ROBOFLOW 3.0 OBJECT DETECTION (FAST)) and are trying to deploy it on a Jetson Nano (4GB) through the Roboflow inference docker image. We get the same error:

{
    "error": "This execution contains the node 'StatefulPartitionedCall/assert_equal_1/Assert/AssertGuard/branch_executed/_139', which has the dynamic op 'Merge'. Please use model.executeAsync() instead. Alternatively, to avoid the dynamic ops, specify the inputs [Identity]"
}
Our tech stack had been tested and used before training our new model. We tested it with another, older OD model from Roboflow Universe (Face Detection Object Detection Dataset and Pre-Trained Model by Mohamed Traore, v15), and that still works well, with no problem during inference. The only difference I see between that model and ours is that the one that works is Roboflow 2.0, while ours is a Roboflow 3.0 OD model. I can see that Roboflow 3.0 was only introduced yesterday, so it is still very new (Announcing Roboflow Train 3.0).

On the docker side we don’t get errors, and the weights for the newly trained model are downloaded without a problem, but inference via curl gives the error above, and inference from Python gives the following error, since the roboflow python package doesn’t get predictions back from the docker container:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nano/.local/lib/python3.6/site-packages/roboflow/models/object_detection.py", line 266, in predict
    colors=self.colors,
  File "/home/nano/.local/lib/python3.6/site-packages/roboflow/util/prediction.py", line 520, in create_prediction_group
    for prediction in json_response["predictions"]:
KeyError: 'predictions'

The Python interface also works perfectly with the older OD model mentioned above.

@Lenny Please let us know how to proceed, or provide a way to fall back to Roboflow 2.0 models if that is the problem. Alternatively, if you see any other difference between our model (cobe/1) and the one that works for us (face-detection-mik1i/15), let us know, or tell us how we can train a similar model to that one.

@Jack_Schultz, when was your model trained (maybe around 11.07.23)? Did you try deploying other OD models from Roboflow Universe?

Hi @Jack_Schultz! Looks like there was an error in our documentation. For Jetsons, instead of using --gpus all you should use --runtime=nvidia. I’ve updated our docs to reflect that.
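Applied to the legacy server command from the original post, that would be, for example:

# Legacy server command with --runtime=nvidia in place of --gpus all:
sudo docker run --net=host --runtime=nvidia roboflow/inference-server:jetson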

Hi @David_Mezey. Yes, the model was indeed trained as ROBOFLOW 3.0 OBJECT DETECTION (FAST) on 7/7/2023. This is my first time training a model in Roboflow, so I don’t have previous versions to test against. I tried deploying the model from this project trained using ROBOFLOW 2.0 OBJECT DETECTION (FAST) on 11/18/2022; that deployment worked with no errors using the legacy method on an NVIDIA AGX Xavier.

It seems that the issue might lie with the new Roboflow 3.0 models.

Hi @Paul, thank you for your suggestion. I tried using the --runtime=nvidia flag on both the legacy and current methods, with the same errors as in the original message.


Hi @Paul, thank you for helping with this. Do you maybe have an idea of what could cause the issue with our newer models (mine is cobe/1), given that we are able to deploy older models (face-detection-mik1i/15 and stop-sign-detection-uv4u8/1)? Are there any preprocessing steps we are not supposed to use when deploying to Jetson devices, or might it really be the new Roboflow Train method? Thank you for your answer in advance.

Hey @Jack_Schultz and @David_Mezey! I just noticed in the original post that you say you are running JetPack 5.1.1. For that, we have a specific docker image: roboflow/roboflow-inference-server-trt-jetson-5.1.1. Can you give that a try and see if the error persists? I have an Orin Nano running JP 5.1.1, and I was able to run Roboflow 3.0 models on it without error. Again, sorry for the unclear docs. I’ve pushed an update to clarify.
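For example, substituting that image (together with the --runtime=nvidia flag from earlier) into the run command from the original post gives something like:

# Current-method run command with the JetPack 5.1.1 image substituted in:
sudo docker run --privileged --net=host --runtime=nvidia \
  --mount source=roboflow,target=/cache -e NUM_WORKERS=1 \
  roboflow/roboflow-inference-server-trt-jetson-5.1.1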

Hi @Paul. I was able to run the docker image roboflow/roboflow-inference-server-trt-jetson-5.1.1 with no problems.

However, when I try to then run

base64 YOUR_IMAGE.jpg | curl -d @- \
"http://localhost:9001/[my-model]/[my-version]?api_key=[my-api-key]"

I get this response from the roboflow server:

INFO:     127.0.0.1:37412 - "POST /[my-model]/[my-version]?api_key=[my-api-key] HTTP/1.1" 307 Temporary Redirect

and it has been hung up here for ~1 hour.

The documentation mentions that compilation may take 5-10 minutes on the Jetson device. Is the output I am seeing typical? Do you think it is just taking a while, or is there something else going wrong here?

🤔 I tried the same command you posted, and it is running just fine on my Orin Nano. Can you run curl --version? When I run it, I get:

curl 7.68.0 (aarch64-unknown-linux-gnu) libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Release-Date: 2020-01-08
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS brotli GSS-API HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP UnixSockets

I’ve also seen the temporary redirect happen when a trailing / is included in the request url. For example, if you hit "http://localhost:9001/[my-model]/[my-version]/?api_key=[my-api-key]" you would get that redirect and no inference would occur (note the / before the ?).
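A quick way to see the difference is to print only the HTTP status code for both forms (a sketch, reusing the placeholders from your command):

# Without the trailing slash, inference runs (expect 200):
base64 YOUR_IMAGE.jpg | curl -s -o /dev/null -w "%{http_code}\n" -d @- \
  "http://localhost:9001/[my-model]/[my-version]?api_key=[my-api-key]"

# With a trailing slash before the ?, only the redirect happens (expect 307):
base64 YOUR_IMAGE.jpg | curl -s -o /dev/null -w "%{http_code}\n" -d @- \
  "http://localhost:9001/[my-model]/[my-version]/?api_key=[my-api-key]"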

I think you are right: I had an additional / before the ?api_key, hence the redirect.

Omitting the /, I’m now seeing an error on the client side:

{
"message":"An error occurred when calling the Roboflow API to acquire the model artifacts. The endpoint to debug is https://api.roboflow.com/ort/[model]/[version]?api_key=[api-key]&device=1421220031096&nocache=true&dynamic=true. The error was: 429 Client Error: Too Many Requests for url: https://api.roboflow.com/ort/[model]/[version]?api_key=[api-key]&device=1421220031096&nocache=true&dynamic=true."
}

and on the server side:

Downloading model artifacts from Roboflow API
INFO:     127.0.0.1:56590 - "POST /[model]/[version]?api_key=[api-key] HTTP/1.1" 500 Internal Server Error

Output of curl --version:

curl 7.68.0 (aarch64-unknown-linux-gnu) libcurl/7.68.0 OpenSSL/1.1.1f zlib/1.2.11 brotli/1.0.7 libidn2/2.2.0 libpsl/0.21.0 (+libidn2/2.2.0) libssh/0.9.3/openssl/zlib nghttp2/1.40.0 librtmp/2.3
Release-Date: 2020-01-08
Protocols: dict file ftp ftps gopher http https imap imaps ldap ldaps pop3 pop3s rtmp rtsp scp sftp smb smbs smtp smtps telnet tftp 
Features: AsynchDNS brotli GSS-API HTTP2 HTTPS-proxy IDN IPv6 Kerberos Largefile libz NTLM NTLM_WB PSL SPNEGO SSL TLS-SRP UnixSockets

I think this means you’ve hit a usage limit. Can you DM me the email associated with your account? I can check and see if there’s something wrong with your limits.

It looks like I am not able to send a DM to your account. I was able to send a message to my account moderator and tag you, however. Let me know if that doesn’t work and if there is an alternative way to reach you.

Hi @Paul, thanks again for looking into this. Our setup is not the same as @Jack_Schultz’s: I have a Jetson Nano (4GB) with the following specs:

NVIDIA Jetson Nano Developer Kit:

  • L4T 32.7.1 (JetPack 4.6.1)
  • Ubuntu 18.04.6 LTS
  • Kernel version: 4.9.253-tegra
  • CUDA 10.2.300 (architecture 5.3)
  • OpenCV 4.1.1 (without CUDA)
  • CUDNN 8.2.1.32
  • TensorRT 8.2.1.8
  • VisionWorks 1.6.0.501
  • VPI 1.2.3
  • Vulkan 1.2.70

Everything is as shipped in the JetPack 4.6.1 SDK image (JetPack SDK 4.6.4 | NVIDIA Developer). In terms of functionality, that is essentially the latest release; the later versions (up to v4.6.4) only added security updates.

Do you see any problems with our versions that might cause the issue? Also, should we follow the legacy docs (which we do now, as that is where the “Implement on Jetson Nano” button redirects by default) or the new docs? And if the latter, is there a special image for us as well?

To help look into the issue, I have tried the deployment with a few other models. I could deploy successfully with face-seg/1, traffic-sing-speedlimit/1, and character-detection-iis85/2 (all of these are 2.0 models), while I faced the same issue as described in the first message with object-detection-obkad/5, icon-coglc/1, and carbon-model-2/1 (all of which are 3.0 models). To me it is becoming more and more clear that the model version determines whether a model is deployable on our Nano. Is there a possibility to fall back to 2.0 training somehow for the time being?

Thanks again for your time; we really like the idea behind Roboflow and that you make computer vision projects possible for the community!

@Paul @Jack_Schultz Any update on this?

Hi @David_Mezey. This issue was solved for me using the docker image:

roboflow/roboflow-inference-server-trt-jetson-5.1.1

Additionally, my account’s device limit was set to 0, so @Paul was able to reset my account on his end to allow detections.


Hi @David_Mezey, the link in the app should point to https://docs.roboflow.com/deploy/enterprise-nvidia-jetson; I’ll fix that. Try using roboflow/roboflow-inference-server-trt-jetson:0.5.4. I am able to infer using this image on a Jetson Nano with a Roboflow 3.0 model. Note that you may get a warning about the TensorRT execution provider, but it should fall back to the CUDA provider in that case.


Hi @Paul, thank you for helping. I am now getting the same error as @Jack_Schultz:

{"message":"An error occurred when calling the Roboflow API to acquire the model artifacts. The endpoint to debug is https://api.roboflow.com/ort/cobe/1?api_key=<API_KEY>&device=1421621054884&nocache=true&dynamic=true. The error was: 429 Client Error: Too Many Requests for url: https://api.roboflow.com/ort/cobe/1?api_key=<API_KEY>&device=1421621054884&nocache=true&dynamic=true.

Could you please set the number of allowable devices for us as well?

Thank you in advance.

Hey David! Sorry for the late reply. Your device limits should be updated now.

Hi @Paul, thank you for helping with the device limits. I could get one step further: the server now downloads my model weights successfully, but I am unfortunately still getting an internal server error upon inference, with the following error message:

{"message":"[ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Exception during initialization: /home/onnxruntime/onnxruntime/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.cc:801 SubGraphCollection_t onnxruntime::TensorrtExecutionProvider::GetSupportedList(SubGraphCollection_t, int, int, const onnxruntime::GraphViewer&, bool*) const [ONNXRuntimeError] : 1 : FAIL : TensorRT input: /model.22/Range_output_0 has no shape specified. Please run shape inference on the onnx model first. Details can be found in https://onnxruntime.ai/docs/execution-providers/TensorRT-ExecutionProvider.html#shape-inference-for-tensorrt-subgraphs\n"}

I suspect this might be connected to what you mentioned here about falling back to the CUDA provider: Error deploying custom trained model on NVIDIA Jetson Xavier - #15 by Paul

I got the same error with both roboflow/roboflow-inference-server-trt-jetson:0.5.4, which you suggested, and roboflow/roboflow-inference-server-trt-jetson:latest, which was in the docs.

Do you have any idea how I could overcome this issue?

🤔 Can you post the docker run command you are using? Also, are you trying to run a model that you trained on Roboflow, or one that you trained yourself and then uploaded?

@Paul Thank you for the quick reply.

I run the following docker command (with the suggested version 0.5.4):

sudo docker run --privileged --net=host --runtime=nvidia \
  --mount source=roboflow,target=/cache -e NUM_WORKERS=1 \
  roboflow/roboflow-inference-server-trt-jetson:0.5.4

But I also tried:

sudo docker run --privileged --net=host --runtime=nvidia \
  --mount source=roboflow,target=/cache -e NUM_WORKERS=1 \
  roboflow/roboflow-inference-server-trt-jetson:latest

Then, for inference, I send a request like this:

base64 test_cobe.jpg | curl -d @- "http://localhost:9001/cobe/1?api_key=API_KEY"

We trained the model on Roboflow, and the image sent in the above test request is one of the images from the test set.