What has happened to serverless API?

  • Project Type: Serverless API
  • Operating System & Browser: -
  • Project Universe Link or Workspace/Project ID:
  • Do you grant Roboflow Support permission to access your Workspace for troubleshooting? (Yes/No): yes

I’ve been using the Serverless API to run my workflows for about a year now, and it’s been working quite well. Sometimes I get spikes in inference, but not often. However, over the last week or so the workflows have stopped working, and my app stops working with them.

The problem is not just in my app; it also happens in the workflow tool here in Roboflow. I’ve also seen a clear spike in usage, from 0.1 to 0.4 credits per day to 9.4 credits so far today, with few successful inferences.

This makes little sense to me. Has anyone else experienced this?

Like I said, the workflows used to work really well for me, but this last week has been terrible. This is a major issue because an ad for my app is going out this week and it’s not possible to pause it… just the worst timing.

And now I’m up to 33.4 credits today, just from running the workflow inside roboflow. This is crazy.

Good afternoon @agoransson,
My name is Ford and I am a Support Engineer at Roboflow. Happy to help here.

To help me triage further, do you grant Roboflow Support Permission to access your workspace? Additionally, can you please provide the name of your erroring workflow?

Thank you!

Of course, access away! ISSF_25_PISTOL

Testing now, it seems to have “fixed itself”. The workflow works again, after spending 39.something credits on “internal server error” responses.

So you might not need to debug this issue, Ford.

Thank you for noticing.

Hi @agoransson,
My name is Ford and I am a Support Engineer at Roboflow. I’m incredibly sorry for what you experienced today, especially ahead of your launch. This is not the experience you deserve, and we’re committed to making this right.

I’m glad to hear the 500 error has resolved and inference is back to normal execution times. We have addressed the core issue.

Earlier today, a malfunctioning disk storage component in our Serverless infrastructure caused model weight loading to stall intermittently. This resulted in elevated 504 timeout errors. Requests would wait up to two minutes for model loading before timing out, rather than failing due to an issue with your workflow or images. While the overall error rate was around 3%, some workloads were more heavily impacted depending on timing and which models were being loaded. Unfortunately, our usage tracking was unable to distinguish these infrastructure-level stalls from normal processing time, so credits were charged for the full duration of those timed-out requests even though no inference was performed.

We’ve refunded all affected credits to your account. We’re also implementing improvements to our disk storage monitoring, runtime alerting, and recovery procedures to catch this class of issue faster in the future.
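
For anyone who wants a client-side guard against the kind of two-minute stall described above, here is a minimal sketch, not an official Roboflow client. It assumes you supply your own `send(timeout_s)` function that performs the request and raises `TimeoutError` when the timeout expires; `call_with_deadline`, its parameters, and the default values are all hypothetical names chosen for illustration. Whether a client-side timeout also limits what is billed depends on the server and is not something this sketch can guarantee.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_deadline(
    send: Callable[[float], T],
    timeout_s: float = 15.0,
    max_attempts: int = 2,
    backoff_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send(timeout_s); retry only on TimeoutError, at most max_attempts times.

    `send` is a caller-provided (hypothetical) function that performs one
    request with the given timeout and raises TimeoutError if it expires.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return send(timeout_s)
        except TimeoutError:
            if attempt == max_attempts:
                # Give up instead of piling more requests onto a stalled backend.
                raise
            # Exponential backoff between attempts: backoff_s, 2*backoff_s, ...
            sleep(backoff_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With an HTTP library such as `requests`, `send` could wrap a call like `requests.post(url, json=payload, timeout=timeout_s)` and re-raise `requests.exceptions.Timeout` as `TimeoutError`. Keeping `max_attempts` small is deliberate: during an incident like this one, aggressive retries mostly add more stalled requests.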