Unable to run model on GPU using Nvidia Orin AGX and Jetpack 6.2

  • Project Type: Object Detection
  • Operating System & Browser: Jetpack 6.2 Edge Deployment
  • Project Universe Link or Workspace/Project ID: simplot -workflows - yolov8
  • Do you grant Roboflow Support permission to access your Workspace for troubleshooting? (Yes/No): Yes

I did a fresh installation of the complete system, including flashing the OS and JetPack 6.2, on my Nvidia Orin AGX. I want to run inference on my edge device in offline mode. For that I followed the instructions on Install on Jetson - Roboflow Inference.
Upon running inference server start, it says:

GPU detected. Using a GPU image.
Pulling image: roboflow/roboflow-inference-server-gpu:latest
404 Client Error for http+docker://localhost/v1.53/images/create?tag=latest&fromImage=roboflow%2Froboflow-inference-server-gpu: Not Found (“no matching manifest for linux/arm64/v8 in the manifest list entries: no match for platform in manifest: not found”)

I continued my container setup by manually starting the container for JetPack 6.2, using:
sudo docker run -d \
--name inference-server \
--runtime nvidia \
--read-only \
-p 9001:9001 \
--volume ~/.inference/cache:/tmp:rw \
--security-opt="no-new-privileges" \
--cap-drop="ALL" \
--cap-add="NET_BIND_SERVICE" \
roboflow/roboflow-inference-server-jetson-6.2.0:latest

To test the installation, I forked the YOLOv8 object detection example: Deploy YOLOv8 Object Detection Models to the NVIDIA Jetson. When I run inference with the internet connected, it runs pretty fast (even though api_url=localhost:9001). So my first question: isn’t it supposed to run locally?
When I disconnect the internet, it takes 10 to 12 seconds to return a result, and I do not see GPU usage in either jtop or nvidia-smi. So I was wondering why it is not using the GPU and why it takes so long to get results.

My code is as follows:

import cv2
import os
import json
from dotenv import load_dotenv
from inference_sdk import InferenceHTTPClient

# Load env vars
load_dotenv("../.env")

def draw_boxes(image, predictions):
    for pred in predictions:
        x = int(pred["x"])
        y = int(pred["y"])
        w = int(pred["width"])
        h = int(pred["height"])
        label = pred["class"]
        conf = pred["confidence"]

        # Convert center → top-left
        x1 = int(x - w / 2)
        y1 = int(y - h / 2)
        x2 = int(x + w / 2)
        y2 = int(y + h / 2)

        # Draw box
        cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

        # Label text
        text = f"{label} {conf:.2f}"
        cv2.putText(
            image,
            text,
            (x1, y1 - 10),
            cv2.FONT_HERSHEY_SIMPLEX,
            0.6,
            (0, 255, 0),
            2,
        )
    return image

# Connect to inference server
client = InferenceHTTPClient(
    api_url="http://localhost:9001",
    api_key=os.getenv("ROBOFLOW_API_KEY"),
)

# Open webcam (0 = default camera)
cap = cv2.VideoCapture(0)

if not cap.isOpened():
    raise RuntimeError("Could not open camera")

print("Press 'c' to capture & run inference")
print("Press 'q' to quit")

while True:
    ret, frame = cap.read()
    if not ret:
        break
    # Show live feed
    cv2.imshow("Camera", frame)

    key = cv2.waitKey(1) & 0xFF

    # Capture on 'c'
    if key == ord("c"):
        img_path = "../data/captured_frame.jpg"
        cv2.imwrite(img_path, frame)

        result = client.run_workflow(
            workspace_name="simplot",
            workflow_id="yolov8",
            images={"image": img_path},
            use_cache=False
        )

        # Extract predictions (this path is typical for workflows)
        predictions = result[0]["model_predictions"]["predictions"]["predictions"]

        # Draw boxes on a copy
        output_frame = frame.copy()
        output_frame = draw_boxes(output_frame, predictions)

        # Show result window
        cv2.imshow("Detections", output_frame)

        # Optional: save result image
        cv2.imwrite("../output/detection_result.jpg", output_frame)

        # print(json.dumps(result, indent=2))
        print("Detections shown")


    # Quit on 'q'
    elif key == ord("q"):
        break


cap.release()
cv2.destroyAllWindows()

Hi @Abdul1 ,

I’ll try to reproduce this tomorrow, but my assumption is that it is running locally on CPU in both cases; there’s just a big delay after disconnecting the internet (a bottleneck, not the inference itself). We’ll look into it.

Thanks, Erik

Thanks Erik. Please let me know when you figure out what is causing this bottleneck.
Also, why is the model running on CPU and not GPU?
On the other hand, when I use the same workflow on video (using the webcam), it works fine offline and apparently uses the GPU (I see GPU usage in jtop).

Hi @Abdul1 ,

I believe the main problem with your script (1st post) is that it’s sequential: you capture one image, send it to Docker, Docker runs inference, returns the results, and so on, instead of doing it in parallel.

And the problem with jtop is that it samples at 1 Hz, so you’d see 0% usage most of the time (unless you sample exactly when inference is happening).

So when you switched from image to video input, you went from sequential to parallel execution (which utilizes the GPU much better in terms of % usage).

So you already solved the issue; it’s now running correctly on GPU :)
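If you want to confirm where the time goes, a quick sanity check is to time the round trip of a single request from the client side: once the model is warm, the inference itself should take well under a second, so a 10-12 second round trip points at a network/timeout delay rather than the model. A minimal sketch (timed_call is a hypothetical helper, not part of the SDK):

```python
import time

def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# With the client from the script above (names taken from the thread):
# result, elapsed = timed_call(
#     client.run_workflow,
#     workspace_name="simplot",
#     workflow_id="yolov8",
#     images={"image": img_path},
# )
# print(f"round trip: {elapsed:.2f}s")
```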

Hi @erik_roboflow
Thank you for your swift response. I still want to utilize the image based inference for the project needs. I was wondering:

  1. What is causing the delay when the internet is disconnected? Are there any limitations of the Docker image?
  2. If it is not using the GPU on a single image, what is the reason behind that?
  3. How can we keep the channel open for inference? I cannot do parallel processing, because the system captures an image every second in real time and we need inference in real time.

Best,
Abdul

Hi Abdul,

1. What is causing the delay when the internet is disconnected? Are there any limitations of the Docker image?

Delays generally happen when the application attempts to sync, cache, or validate models/data with remote servers, leading to timeouts.

Check out: Offline Mode | Roboflow Docs

One thing that might help is our Dedicated Deployments functionality, which lets you spin up a GPU pre-configured with your workflow. It will still have that early delay, but will then stay “warm” for as long as you need it.

A few recommended things to keep in mind for Offline Operation:

  1. Pre-cache models while online — Run at least one successful inference with internet connected to download and cache model weights

  2. Pass API key to Docker container — Include -e ROBOFLOW_API_KEY=your_key in your docker run command so authentication happens at container startup, not per-request

  3. Verify cache volume is correctly mounted — Ensure ~/.inference/cache:/tmp:rw mapping persists model data between container restarts

  4. Consider Enterprise Offline Mode — For production air-gapped deployments, Roboflow Enterprise provides dedicated offline licensing with a local license server

  5. Monitor container resources — Use docker stats inference-server to track memory/CPU usage during extended operation
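As a sketch of point 1 (pre-caching), a single successful inference while online is enough to populate the cache. The helper below is illustrative (warm_up is a hypothetical name; the client would be the InferenceHTTPClient from your script):

```python
def warm_up(client, workspace, workflow, sample_image):
    """Run one inference while online so the server downloads and
    caches the model weights before the device goes offline."""
    result = client.run_workflow(
        workspace_name=workspace,
        workflow_id=workflow,
        images={"image": sample_image},
    )
    return result is not None

# On the Jetson (names from the thread; client comes from inference_sdk):
# client = InferenceHTTPClient(api_url="http://localhost:9001", api_key=...)
# assert warm_up(client, "simplot", "yolov8", "sample.jpg")
```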


2. Why Is GPU Not Being Used for Single Image Inference?

Potential causes include: cold start (HTTP requests via InferenceHTTPClient don’t keep the model loaded in GPU memory between calls) and missing TensorRT config (GPU execution providers not explicitly set).

A potential fix is to add this to your Docker command:

-e ONNXRUNTIME_EXECUTION_PROVIDERS="[TensorrtExecutionProvider,CUDAExecutionProvider,CPUExecutionProvider]"
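Putting this together with the command from the first post, the full docker run invocation might look like the following. This is a sketch: it merges the original flags with the API-key and execution-provider suggestions from this thread, and has not been verified on JetPack 6.2.

```shell
sudo docker run -d \
  --name inference-server \
  --runtime nvidia \
  --read-only \
  -p 9001:9001 \
  --volume ~/.inference/cache:/tmp:rw \
  --security-opt="no-new-privileges" \
  --cap-drop="ALL" \
  --cap-add="NET_BIND_SERVICE" \
  -e ROBOFLOW_API_KEY=your_key \
  -e ONNXRUNTIME_EXECUTION_PROVIDERS="[TensorrtExecutionProvider,CUDAExecutionProvider,CPUExecutionProvider]" \
  roboflow/roboflow-inference-server-jetson-6.2.0:latest
```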

3. How to Keep the Channel Open for Real-Time Inference?

Use Roboflow’s InferencePipeline instead of InferenceHTTPClient:

from inference import InferencePipeline

def on_prediction(predictions, video_frame):
    print(predictions)

pipeline = InferencePipeline.init(
    model_id="your-model/version",
    video_reference=0,  # Your image source
    on_prediction=on_prediction,
    api_key="YOUR_API_KEY",
)

pipeline.start()
pipeline.join()

InferencePipeline is Roboflow’s optimized solution for Jetson devices: it loads the model once, keeps it warm in GPU memory, and handles continuous inference without per-request overhead.
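If you need to stay with single-image capture rather than a video stream, an alternative sketch is to reuse one client in a long-running loop: the server keeps the model loaded after the first request, so only the first call pays the cold-start cost. The helper below is illustrative (infer_loop, capture_frame, and run_one are hypothetical names, not SDK APIs):

```python
import time

def infer_loop(client, capture_frame, run_one, interval_s=1.0, max_iters=None):
    """Capture a frame every interval_s seconds and run inference on it,
    reusing the same client so the server-side model stays loaded.

    capture_frame() returns an image path (or None to stop);
    run_one(client, img) performs a single inference call."""
    results = []
    i = 0
    while max_iters is None or i < max_iters:
        img = capture_frame()
        if img is None:
            break
        results.append(run_one(client, img))
        i += 1
        time.sleep(interval_s)
    return results
```

In your script, run_one would wrap the client.run_workflow call and capture_frame would save the webcam frame to disk, as the first post already does.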

Thank you,

Bar Shimshon

Thank you Bar Shimshon for addressing all my questions.

My deployment is in the field on an NVIDIA Jetson at a remote location, so the Dedicated Deployments functionality will not work for us in this use case.

I checked Offline Mode | Roboflow Docs; it says weights can be cached for up to 30 days. My deployment will be offline for up to 3 months; will it lose the cached weights?

As an Enterprise user, I was able to deploy this on the NVIDIA Jetson. Will I be charged for that deployment? If yes, is it per model or usage-based?

For your point on pre-caching models while online, I am doing that already, and I will try passing the API key to the Docker container.

I do not understand “Consider Enterprise Offline Mode”; is this a different method to deploy the model in offline mode?

Thanks,
Abdul

Hi Abdul,

Enterprise Offline Mode is not a different deployment method; it’s the same Docker-based deployment you’re already using, but with an additional License Server component that extends offline capability.

Indeed, the model will stop working after 30 days offline.

Roboflow Enterprise customers can configure Roboflow Inference to cache weights for up to 30 days. The weights themselves remain cached on disk, but the license lease expires, meaning inference requests will fail until the lease is renewed via the internet or a License Server connection.

See: License Server | Roboflow Docs

A workaround for this could be the following:

Deploy the Roboflow License Server on any machine with periodic internet access (even via cellular/satellite); your Jetson connects to it locally to renew the lease:

# Jetson points to License Server instead of internet
--env LICENSE_SERVER=<your-license-server-ip>

Essentially: if you wish to firewall the Roboflow Inference Server from the internet, you will need the Roboflow License Server, which acts as a proxy for the Roboflow API and your models’ weights.

How it works:

  • Deploy a License Server on a machine with internet access (in your DMZ)

  • Your Jetson connects to the License Server (not the public internet)

  • The License Server renews weight leases automatically

# On Jetson, point to your License Server
sudo docker run --net=host --env LICENSE_SERVER=10.0.1.1 \
  --mount source=roboflow,target=/tmp/cache \
  roboflow/roboflow-inference-server-jetson-6.2.0:latest

It may also be worthwhile to contact your Roboflow Enterprise representative to discuss extended offline lease options for your specific use case; they may be able to configure a longer expiration period for field deployments.

Finally, regarding the payment question: since you’re already an Enterprise customer, I recommend contacting your Roboflow representative to clarify whether billing for self-hosted Jetson deployments is per device, per model, or usage-based.

Thank you,

Bar Shimshon
