Serverless V2 execution time varies WILDLY, seems really incorrect

  • Project Type: Workflow
  • Operating System & Browser: N/A
  • Project Universe Link or Workspace/Project ID: N/A
  • Do you grant Roboflow Support permission to access your Workspace for troubleshooting? (Yes/No): No

When I run a workflow on the Serverless V2 endpoint, the execution time varies wildly. For example, I tested this with a simple object detection workflow with only a yolov8n object detection model. Running this on a simple 480x480 image claimed to be anywhere from 400ms to 7000ms of execution time.
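For reference, my numbers came from simple repeated calls along these lines (a rough sketch; the endpoint URL, API key, and workflow path are placeholders for my actual ones, and the x-processing-time header is the server-reported time in seconds):

```python
import time

import requests

# Placeholders -- substitute your own workspace/workflow and API key.
URL = "https://serverless.roboflow.com/infer/workflows/my-workspace/my-workflow"
API_KEY = "YOUR_API_KEY"

def processing_ms(headers):
    """Convert the server-reported x-processing-time header (seconds) to ms."""
    raw = headers.get("x-processing-time")
    return None if raw is None else float(raw) * 1000.0

def run_once(image_url):
    """Send one workflow request; return (wall-clock ms, server-reported ms)."""
    t0 = time.monotonic()
    resp = requests.post(URL, json={
        "api_key": API_KEY,
        "inputs": {"image": {"type": "url", "value": image_url}},
    })
    return (time.monotonic() - t0) * 1000.0, processing_ms(resp.headers)

def measure(n=10, image_url="https://example.com/test-480x480.jpg"):
    """Fire n sequential requests and print wall-clock vs server-reported time."""
    for _ in range(n):
        wall, server = run_once(image_url)
        print(f"wall={wall:.0f}ms  server={server:.0f}ms")
```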

This seems like a bug in the timekeeping, as inference execution time obviously shouldn’t vary by ~20x on the same input. Additionally, even my fastest executions were an order of magnitude slower than the 17 ms yolov8n execution time claimed here: (Legacy) Serverless Hosted API | Roboflow Docs

Can you look into why the execution time is so long on Serverless V2? It’s cost me far more time/credits than I would expect.

Hi @jk1 ,

This is likely because we’re just migrating the backend, which will be more flexible in the future (support for VLMs/larger models). When was this tested?

Thanks, Erik

Thanks @erik_roboflow

I’ve been testing this on and off for a few days. Latest was maybe 2:30pm pacific today (2/11/26). Can I ask how the load balancer would impact execution time?

@erik_roboflow I see in your post edit there may have been some confusion about why this is occurring. Can you let me know when this issue will be mitigated? No fun paying 100-200x more credits than I expect. Thank you.

Hey @jk1, Erik unfortunately mixed some things up. What you are likely seeing is the “cold start” time of your model. When your request hits a server that doesn’t have the model loaded, it has to download the weights and load them into its GPU memory, which can take a few seconds.

When there is lots of contention for the GPU memory, your model will be evicted to make space for others’. In Serverless v2, this happens pretty frequently because all requests can get routed to all backend servers (and so, counterintuitively, the higher the load you put your model under, the better it performs, because most servers will already have it “hot” when the request comes in).
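To make the intuition concrete, here is a toy simulation (a completely made-up model, not our actual scheduler): every request lands on a uniformly random server, and a server counts as “hot” for you only if the most recent request it served was yours. Your hit rate ends up tracking your share of total traffic.

```python
import random

def hot_fraction(num_servers, our_requests, other_requests, seed=0):
    """Toy model: requests land on random servers; a server is 'hot' for us
    only if the most recent request it served used our model."""
    rng = random.Random(seed)
    last = [None] * num_servers  # which model each server served last
    events = ["ours"] * our_requests + ["other"] * other_requests
    rng.shuffle(events)
    hits = total = 0
    for model in events:
        server = rng.randrange(num_servers)
        if model == "ours":
            total += 1
            hits += last[server] == "ours"
        last[server] = model
    return hits / total

# When our traffic dominates the pool, most servers stay hot; when it is a
# trickle, almost every request cold-starts:
# hot_fraction(20, 1500, 500)  -> roughly 0.75
# hot_fraction(20, 100, 1900)  -> roughly 0.05
```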

What Erik is referring to is an upcoming update we’ve been testing this week on a subset of traffic that will be able to more intelligently route requests to servers that already have your model loaded to reduce the frequency you’ll encounter cold starts. This should improve the performance and predictability of the Serverless backend. But what you’re observing is normal and expected behavior.

If you need more predictability, our Dedicated Deployments offer GPU machines that are completely allocated to you and once loaded your model will stay hot since nobody else is contending for that memory space. (This is at the downside of needing to cover the full cost of the GPU even when you’re not utilizing it fully, vs paying per request when you are sharing it with others.)

@brad , thanks for the quick response and for the color on the routing upgrade… quick suggestion, perhaps there should be an “unimportant” flag for requests that are not urgent from an E2E latency standpoint, but would rather run on a hot GPU, even if it means queuing behind other requests.

However something still feels fishy with the timing. I’ve attached 3 screenshots below. Each is running the same extremely simple model-free workflow, which takes an image and simply returns its height. There seem to be 3 distinct patterns:

  • Runs with ~300-500ms exec time (probably 8/10 requests)
  • Runs with exactly 100ms exec time (1/10 requests); I’m aware this is the minimum billing increment
  • Runs with 5000-20000ms exec time (1/10 requests)

Even with a cold start, I would expect execution time here to be essentially 0 every time, so billed at the 100ms minimum.
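(For clarity, here’s the billing model I’m assuming; correct me if this is wrong:)

```python
def billed_ms(exec_ms, increment_ms=100):
    """My assumed billing model: execution time rounds up to the nearest
    increment, with one increment as the floor (not confirmed by Roboflow)."""
    increments = -(-exec_ms // increment_ms)  # ceiling division
    return max(1, increments) * increment_ms

# billed_ms(1)     -> 100  (a near-zero run still costs one increment)
# billed_ms(450)   -> 500
# billed_ms(11065) -> 11100
```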

I guess that leaves me with 3 questions.

  1. Why is there 300-500ms exec time on the majority of the runs? Assuming execution on your end looks like [DL+load workflow → DL+load model weights → run workflow], and that this particular workflow should be <1ms to run, does the execution time include network latency on the DL steps?
  2. What’s happening when there’s massive (>5000ms) exec time? Why does this ever happen?
  3. Separately, when I run a simple yolov8n workflow, I never get a 100ms exec time, even if I send a bunch of rapid requests. I assume at least one of them finds a server that’s hot. This doesn’t seem to reconcile with the claimed 17ms exec time in the docs. Am I doing something incorrectly?

Thanks again - appreciate you taking the time to look into the issue

quick suggestion, perhaps there should be an “unimportant” flag for requests that are not urgent from an E2E latency standpoint, but would rather run on a hot GPU, even if it means queuing behind other requests.

We don’t currently have a good way to do that on Serverless (but the version we’re testing where we created our own custom load balancing and routing layer specific to the characteristics of Workflows and CV models will help a lot – should be live broadly in the next few weeks).

However something still feels fishy with the timing.

Interesting; we don’t currently run any tests without models, but that’s a good idea and certainly looks worth investigating. The Workflow Builder UI is a bit of a special case because it needs to fetch the spec and recompile the execution engine on changes (another thing that has to happen on each machine), but cold starts shouldn’t take anywhere near that long. We’ll have to dig a little deeper to see if there’s an additional source of latency on that part of things.

Separately, when I run a simple yolov8n workflow, I never get a 100ms exec time, even if I send a bunch of rapid requests.

How rapidly? With the current infra, GPU scale, and load, you need to be sending many requests per second to get hot requests all the time; then things should stabilize. This will improve soon with the new smart routing layer.
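For concreteness, “many requests per second” from the client side could look something like the sketch below (the endpoint and payload are placeholders; `burst` just keeps several requests in flight at once):

```python
import concurrent.futures

import requests

# Placeholders -- substitute your own workspace/workflow and API key.
URL = "https://serverless.roboflow.com/infer/workflows/my-workspace/my-workflow"
PAYLOAD = {
    "api_key": "YOUR_API_KEY",
    "inputs": {"image": {"type": "url", "value": "https://example.com/test.jpg"}},
}

def burst(send, num_requests, workers=10):
    """Run `send` concurrently so several requests are always in flight."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, range(num_requests)))

def send_one(_):
    """One workflow call; returns the server-reported processing time (seconds)."""
    resp = requests.post(URL, json=PAYLOAD)
    return float(resp.headers.get("x-processing-time", "nan"))

# Usage (hits the network): times = burst(send_one, 50)
```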

@brad

thanks - can you or someone else let me know when this is figured out? I’d really like to start using this at scale; it’s just a bit hard to commit when the billing per call is so uncertain.

I was sending 1-2 requests/sec. This is the same rate at which I was trying the no-model flow. I assumed I was hitting a hot server when I got the 100ms exec time in the no-model flow, so I assumed at least some of the yolov8 flow requests at the same rate would also hit a hot server.

Also, is the ‘x-processing-time’ response header the canonical way to view execution time for a serverless API call?

Good call isolating the no-model workflow — that’s a really useful test case. I’ve been running that same test on the current production deployment and I’m consistently seeing the expected 100ms minimum. So the intermittent spikes you were hitting should be resolved at this point.

To your questions:

The >5000ms spikes were related to an infrastructure issue on our end that has since been addressed.

For yolov8n at 1-2 req/sec on shared serverless — you may not consistently land on a server with your model already loaded, since requests get distributed across the backend pool. The smart routing update Brad mentioned will significantly improve this by preferring servers that already have your model hot. The 17ms number in the docs reflects pure model execution time, not the full request lifecycle.

Would you mind re-testing the no-model workflow and letting us know if you’re still seeing unexpected times? If you can share specific request timestamps that’d be really helpful for us to try to correlate on our end.

Regarding x-processing-time: yes, that’s the right header to look at. It gets a bit trickier with the new system that’s rolling out, though: the processing time returned will be from the orchestration node for the workflow, which isn’t what we’ll bill on directly; we’ll separately report remote exec time in an additional header. The models execute on separate GPU nodes, and that remote execution time is what billing will be based on. We’re still working on getting that information passed through cleanly in the response. Overall, though, we’re seeing billed processing time reduced by 2-5x with the new system, so it should be strictly cheaper.

Regarding the suggestion for a “run on hot GPU even if queued” flag: the new routing layer will essentially be trying to do this. Priority queuing where you can express a preference for, e.g., slower response time for lower cost isn’t something we have yet, but we’re actively optimizing this system and this is a great idea.

I’d be happy to jump on a call to discuss your specific use cases — would also be good to look at the different options we have to figure out the cheapest way to run your workloads. Let me know if that’d be useful.

Would be happy to hop on a call. I’ll send you a PM to arrange - let me know if that doesn’t come through.

Here’s 5 requests to my no-model workflow (all via the HTTP API so hopefully avoiding whatever web UI issues there might be).

The 1st one processes in ~50ms. Still seems kinda slow for no model at all, but I guess as a user I’m indifferent to anything under 100ms.

The next 3 are 400-500ms.

The final one is unfortunately >10000ms.

date Fri, 13 Feb 2026 00:56:45 GMT
executionId 1770944205262255775_618b
requestId 9074b6c740b206f3b5d422d7d7230e58
processingTime 0.054318904876708984

date Fri, 13 Feb 2026 00:57:09 GMT
executionId 1770944229000249360_4670
requestId 86188c77b8214c71201c22633c8d06f1
processingTime 0.40620899200439453

date Fri, 13 Feb 2026 00:57:26 GMT
executionId 1770944246140443215_7a62
requestId 46a744da5e4819d171d7b0ca6e8a78e4
processingTime 0.42233800888061523

date Fri, 13 Feb 2026 00:57:46 GMT
executionId 1770944266505727484_8186
requestId c4d2e5644b08311933c2db505f07f5d8
processingTime 0.4050741195678711

date Fri, 13 Feb 2026 01:09:28 GMT
executionId 1770944957071142802_bb9f
requestId cb1fe6759599f2b550a000ab2a0ea788
processingTime 11.065425157546997
