Project Universe Link or Workspace/Project ID: N/A
Do you grant Roboflow Support permission to access your Workspace for troubleshooting? (Yes/No): No
When I run a workflow on the Serverless V2 endpoint, the execution time varies wildly. For example, I tested this with a simple object detection workflow containing only a yolov8n object detection model. Running it on a simple 480x480 image reported execution times anywhere from 400ms to 7000ms.
This seems like a bug in the timekeeping, as inference execution time obviously shouldn't vary by ~20x on the same input. Additionally, even my fastest executions were an order of magnitude slower than the 17ms yolov8n execution time claimed here: (Legacy) Serverless Hosted API | Roboflow Docs
Can you look into why the execution time is so long on Serverless V2? It's cost me far more time/credits than I would expect.
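For reference, here's roughly the kind of loop I'm using to measure this (a minimal sketch; the endpoint URL, API key, and payload below are placeholders, not real values):

```python
import json
import statistics
import time
import urllib.request

# Placeholders -- substitute a real serverless endpoint and API key.
ENDPOINT = "https://serverless.example.invalid/my-workflow"
API_KEY = "YOUR_API_KEY"

def time_request(payload: dict) -> float:
    """Wall-clock latency in ms for one POST to the workflow endpoint."""
    req = urllib.request.Request(
        f"{ENDPOINT}?api_key={API_KEY}",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=30):
        pass
    return (time.perf_counter() - start) * 1000.0

def summarize(latencies_ms: list) -> dict:
    """Min / median / max over a batch of latency samples."""
    return {
        "min": min(latencies_ms),
        "median": statistics.median(latencies_ms),
        "max": max(latencies_ms),
    }

# With a live endpoint: samples = [time_request({"image": "..."}) for _ in range(20)]
# The numbers below are stand-ins matching the spread I observed:
print(summarize([412.0, 431.0, 455.0, 498.0, 6900.0]))
# {'min': 412.0, 'median': 455.0, 'max': 6900.0}
```

Note this measures full round-trip wall-clock time, so it includes network latency on top of whatever the endpoint reports as execution time.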
This is likely because we're currently migrating the backend, which will be more flexible in the future (support for VLMs/larger models). When was this tested?
I've been testing this on and off for a few days. Latest was maybe 2:30pm Pacific today (2/11/26). Can I ask how the load balancer would impact execution time?
@erik_roboflow I see in your post edit there may have been some confusion about why this is occurring. Can you let me know when this issue will be mitigated? No fun paying 100-200x more credits than I expect. Thank you.
Hey @jk1, Erik unfortunately mixed some things up. What you are likely seeing is the "cold start" time of your model. When your request hits a server that doesn't have the model loaded, it has to download the weights and load them into its GPU memory, which can take a few seconds.
When there is lots of contention for the GPU memory, your model will be evicted to make space for others'. In Serverless v2 this happens pretty frequently because all requests can get routed to all backend servers (so, counterintuitively, the higher the load you put your model under, the better it performs, because most servers will already have it "hot" when the request comes in).
What Erik is referring to is an upcoming update we've been testing this week on a subset of traffic that will more intelligently route requests to servers that already have your model loaded, reducing how often you encounter cold starts. This should improve the performance and predictability of the Serverless backend. But what you're observing is normal and expected behavior.
If you need more predictability, our Dedicated Deployments offer GPU machines that are completely allocated to you; once loaded, your model will stay hot since nobody else is contending for that memory space. (The trade-off is that you cover the full cost of the GPU even when you're not utilizing it fully, versus paying per request when you share it with others.)
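To make the eviction dynamic concrete, here's a toy model of it (purely illustrative, not our actual scheduler): GPU memory behaves roughly like an LRU cache of loaded models.

```python
from collections import OrderedDict

class ToyGpuCache:
    """Toy LRU model of GPU memory holding at most `capacity` models.

    A request for a loaded model is a cheap "hot" hit; a miss simulates
    a cold start (download + load) and may evict the least-recently-used
    model to make room.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.loaded = OrderedDict()  # model_id -> None, in LRU order

    def request(self, model_id: str) -> str:
        if model_id in self.loaded:
            self.loaded.move_to_end(model_id)  # refresh recency
            return "hot"
        if len(self.loaded) >= self.capacity:
            self.loaded.popitem(last=False)  # evict least-recently-used
        self.loaded[model_id] = None
        return "cold"

cache = ToyGpuCache(capacity=2)
print(cache.request("yolov8n"))  # cold (first load)
print(cache.request("yolov8n"))  # hot
print(cache.request("clip"))     # cold
print(cache.request("sam2"))     # cold, evicts yolov8n
print(cache.request("yolov8n"))  # cold again: it was evicted
```

This is also why higher request rates help: the more often a model is requested, the more recently used it stays on each server, and the less likely it is to be evicted.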
@brad, thanks for the quick response and for the color on the routing upgrade. Quick suggestion: perhaps there should be an "unimportant" flag for requests that aren't urgent from an E2E latency standpoint but would rather run on a hot GPU, even if it means queuing behind other requests.
However, something still feels fishy with the timing. I've attached 3 screenshots below. Each is running the same extremely simple workflow with no model: it takes an image as input and simply returns its height. The results seem to fall into 3 buckets:
Runs with ~300-500ms exec time (probably 8/10 requests)
Runs with exactly 100ms exec time (1/10 requests) - I'm aware this is the minimum billing increment
Runs with 5000-20000ms exec time (1/10 requests)
Even with a cold start, I would expect execution time here to be pretty much 0 every time, so it should always bill at the 100ms minimum increment.
I guess that leaves me with 3 questions.
Why is there 300-500ms exec time on the majority of the runs? Assuming execution on your end looks like [download+load workflow → download+load model weights → run workflow], and that this particular workflow should take <1ms to run, does the execution time include network latency on the download steps?
What's happening when there's massive (>5000ms) exec time? Why does this ever happen?
Separately, when I run a simple yolov8n workflow, I never get a 100ms exec time, even if I send a bunch of rapid requests; I assume at least one of them should find a server that's hot. This doesn't seem to reconcile with the claimed 17ms exec time in the docs. Am I doing something incorrectly?
Thanks again - appreciate you taking the time to look into the issue
Quick suggestion: perhaps there should be an "unimportant" flag for requests that aren't urgent from an E2E latency standpoint but would rather run on a hot GPU, even if it means queuing behind other requests.
We don't currently have a good way to do that on Serverless. (But the version we're testing, with our own custom load balancing and routing layer built around the characteristics of Workflows and CV models, will help a lot; it should be live broadly in the next few weeks.)
However something still feels fishy with the timing.
Interesting; we don't run any tests without models currently, but that's a good idea and certainly looks like something worth looking into. The Workflow Builder UI is a bit of a special case because it needs to fetch the spec and recompile the execution engine on changes (another thing that needs to happen on each machine), but the cold starts shouldn't take anywhere near that long. We'll have to dig a little deeper to see if there's an additional source of latency on that part of things.
Separately, when I run a simple yolov8n workflow, I never get a 100ms exec time, even if I send a bunch of rapid requests.
How rapidly? In the current infra, with the current GPU scale and load, you need to be sending many requests per second to get hot requests all the time; then things should stabilize. This will improve soon with the new smart routing layer.
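For illustration, "many requests per second" means something like the burst below (a sketch; send_one is a stand-in for whatever client call you're making, returning the reported exec time in ms):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def send_burst(send_one, n: int = 50, workers: int = 10) -> list:
    """Fire n requests concurrently; send_one() -> exec time in ms."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda _: send_one(), range(n)))

def hot_fraction(exec_times_ms, hot_threshold_ms: float = 150.0) -> float:
    """Fraction of requests that look like hot-server hits."""
    hits = sum(1 for t in exec_times_ms if t <= hot_threshold_ms)
    return hits / len(exec_times_ms)

# Stand-in for a real request function, for demonstration only:
fake_send = lambda: random.choice([100.0, 100.0, 450.0, 6000.0])
times = send_burst(fake_send, n=40)
print(f"hot fraction: {hot_fraction(times):.0%}")
```

At a sustained rate like this, more of the backend pool keeps your model resident, so the hot fraction climbs; at 1-2 req/sec spread across many servers, it stays low.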
Thanks - can you or someone else let me know when this is figured out? I'd really like to start using this at scale; it's just a bit hard to when the billing per call is so uncertain.
I was sending 1-2 requests/sec, the same rate I was trying with the no-model flow. I assumed I was hitting a hot server when I was getting the 100ms exec time in the no-model flow, so I assumed at least some of the yolov8 flow requests at the same rate would also hit a hot server.
Also, is the "x-processing-time" response header the canonical way to view execution time for a serverless API call?
Good call isolating the no-model workflow - that's a really useful test case. I've been running that same test on the current production deployment and I'm consistently seeing the expected 100ms minimum. So the intermittent spikes you were hitting should be resolved at this point.
To your questions:
The >5000ms spikes were related to an infrastructure issue on our end that has since been addressed.
For yolov8n at 1-2 req/sec on shared serverless: you may not consistently land on a server with your model already loaded, since requests get distributed across the backend pool. The smart routing update Brad mentioned will significantly improve this by preferring servers that already have your model hot. The 17ms number in the docs reflects pure model execution time, not the full request lifecycle.
Would you mind re-testing the no-model workflow and letting us know if you're still seeing unexpected times? If you can share specific request timestamps, that'd be really helpful for correlating on our end.
Regarding "x-processing-time": yes, that's the right header to look at. With the new system that's rolling out it's a bit tricky, though: the processing time returned will actually come from the orchestration node for the workflow, which isn't what we'll bill on directly; we'll separately report remote exec time in an additional header. The models execute on separate GPU nodes, and that remote execution time is what billing will be based on; we're still working on getting that information passed through cleanly in the response. Overall, though, we're seeing billed processing time reduced by a factor of 2-5x with the new system, so it should be strictly cheaper.
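As a concrete sketch of the rounding involved (illustrative only: this assumes the header carries seconds as a decimal string and that billing rounds elapsed time up to the next 100ms increment with a 100ms floor, both of which are assumptions on my part rather than a spec):

```python
import math

BILLING_INCREMENT_MS = 100.0  # minimum/rounding increment discussed above

def billed_ms(header_value: str) -> float:
    """Convert an x-processing-time style value (assumed: seconds as a
    decimal string) into billed milliseconds, rounding up to the next
    100ms increment with a 100ms floor."""
    elapsed_ms = float(header_value) * 1000.0
    increments = max(1, math.ceil(elapsed_ms / BILLING_INCREMENT_MS))
    return increments * BILLING_INCREMENT_MS

print(billed_ms("0.017"))  # 100.0 -> a 17ms execution still bills the 100ms minimum
print(billed_ms("0.412"))  # 500.0
print(billed_ms("7.0"))    # 7000.0
```

This also shows why the docs' 17ms figure and a 100ms billed time can both be true at once: pure model execution can be far below the minimum billing increment.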
Regarding the suggestion for a "run on hot GPU even if queued" flag: the new routing layer will essentially be trying to do this. Priority queuing where you can express a preference (e.g. slower response time for lower cost) isn't something we have yet, but we're actively optimizing this system and this is a great idea.
I'd be happy to jump on a call to discuss your specific use cases; it would also be good to look at the different options we have to figure out the cheapest way to run your workloads. Let me know if that'd be useful.