Unexpected credit spike + Serverless V2 reliability concerns — Core plan, production app

Hi! I’m on the Core plan running 5 different models (YOLO26m-pose, RF-DETR-Seg-Medium) via the Serverless V2 API for a production application.

Two issues I’d like help with:

1. Unexpected credit usage spike

Over the past week I’ve seen a significant unexpected jump in credit consumption which took me by surprise. This coincides with the infrastructure issues reported by other users on April 15 (disk storage failure causing 504 timeouts and credits charged for failed requests). Could someone from the team review my workspace’s credit usage for the past 7-10 days and confirm whether any credits were incorrectly charged during infrastructure incidents? I’d appreciate a refund for any credits consumed by failed/timed-out requests.

2. Serverless V2 reliability for production use

We’re about to go live with a commercial product that depends on the Serverless V2 API. I’ve seen several recent reports of 502 Bad Gateway errors and variable execution times on V2. I’d like to understand:

  • Is the V2 serverless backend considered production-ready and stable?

  • Has the smart routing update mentioned by Brad in February been fully rolled out?

  • What SLA or reliability guarantees, if any, exist for Core plan customers using the Serverless V2 API?

Any clarity here would be greatly appreciated - it will directly inform our deployment strategy going forward.

Thanks

  • Project Type: Keypoint detection (yolo26m-pose)
  • Operating System & Browser:
  • Project Universe Link or Workspace/Project ID: rudys-workspace
  • Do you grant Roboflow Support permission to access your Workspace for troubleshooting? (Yes/No): Yes

Hi Rudy,

Thanks for the detailed write-up — digging into your account turned up something worth investigating.

First, to be clear: these are two separate issues. Your workspace wasn’t impacted by the Apr 15 infrastructure incident. We refunded every workspace whose requests were affected by that outage, and your workspace wasn’t among them because your Apr 15 traffic came in after the incident had been resolved. What you’ve been seeing is a distinct issue specific to your account.

What’s actually happening. Pulling your ingress logs for the last 14 days: 7,605 of 7,788 requests failed with 500s, and every single failure is on your keypoint-detection models. Your segmentation models (front-teeth-segmentation/3, lower-arch-segmentation/14, upper-arch-segmentation/2) ran cleanly the whole time. The failures start Apr 8 — well before Apr 15.
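To put those counts in proportion, here is the arithmetic as a small Python sketch. The two numbers are the ones quoted above; the function and variable names are only for illustration.

```python
# Counts quoted above for the 14-day ingress log window.
TOTAL_REQUESTS = 7_788
FAILED_REQUESTS = 7_605


def summarize(total: int, failed: int) -> dict:
    """Return the succeeded count and failure rate for a request window."""
    if total <= 0 or not 0 <= failed <= total:
        raise ValueError("expected total > 0 and 0 <= failed <= total")
    return {"succeeded": total - failed, "failure_rate": failed / total}


summary = summarize(TOTAL_REQUESTS, FAILED_REQUESTS)
print(f"succeeded: {summary['succeeded']}")
print(f"failure rate: {summary['failure_rate']:.1%}")
```

That is 183 successful requests and a failure rate of roughly 97.7% over the window.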

The keypoint models (YOLO26-pose weights) were registered as object-detection models in our backend when they were uploaded. At inference time, the detection post-processor tries to parse pose output tensors and throws a 500. Our upload path should have caught that task-type mismatch and rejected the upload — it didn’t, and that’s the bug on our end. We’ve merged a validation check that now rejects type-mismatched uploads at the SDK boundary (roboflow-python#457). It’s pending a package release, but once it ships this can’t happen silently again.
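For anyone curious what a task-type check of this kind can look like, here is a minimal sketch. It is not the code from roboflow-python#457. The suffix-based mapping, the task names, and the names `infer_task`, `validate_upload`, and `TaskTypeMismatch` are all assumptions made up for illustration.

```python
class TaskTypeMismatch(ValueError):
    """Raised when the weights' implied task differs from the project's task."""


def infer_task(model_type: str) -> str:
    """Guess the task from a model_type string.

    Assumption: task is encoded in the name ("-pose", "-seg"); a real
    implementation would more likely use an explicit lookup table.
    """
    name = model_type.lower()
    if name.endswith("-pose"):
        return "keypoint-detection"
    if "-seg" in name:
        return "instance-segmentation"
    return "object-detection"


def validate_upload(model_type: str, project_task: str) -> str:
    """Reject the upload if model_type implies a different task than the project."""
    inferred = infer_task(model_type)
    if inferred != project_task:
        raise TaskTypeMismatch(
            f"model_type {model_type!r} implies {inferred!r}, "
            f"but the project is {project_task!r}"
        )
    return inferred


# A pose model uploaded to a keypoint project passes the check.
print(validate_upload("yolo26m-pose", "keypoint-detection"))

# The same weights registered against an object-detection project are rejected
# at upload time instead of failing later at inference time.
try:
    validate_upload("yolo26m-pose", "object-detection")
except TaskTypeMismatch as err:
    print(f"rejected: {err}")
```

The point of the design is to fail loudly at upload, so a mismatch never reaches the post-processor where it can only surface as a 500.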

On credits. Your failed keypoint requests consumed 1.8 credits over the two-week window. I’ve rounded up and added 5 credits to your workspace.

What unblocks you today. Re-upload your keypoint models with -t yolo26m-pose on the upload command. That forces the correct model_type string so they register as keypoint, and they should run cleanly.
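As a rough sketch, the re-upload could be scripted like this. Only the `-t yolo26m-pose` flag comes from the instructions above. The `upload_model` subcommand and the `-w`, `-p`, `-v`, and `-m` flags are written from memory of the Roboflow CLI, so treat them as assumptions and confirm them with `roboflow upload_model --help`. The weights path is a placeholder, and reading `tooth-keypoint-detection/10` as project plus version number is also an assumption.

```python
import shlex


def build_upload_command(workspace, project, version, model_path,
                         model_type="yolo26m-pose"):
    """Build the argv for a model upload with an explicit model type.

    Assumed flags (verify with --help): -w workspace, -p project,
    -v version, -m model path. The -t flag is the one named in this thread.
    """
    return [
        "roboflow", "upload_model",
        "-w", workspace,
        "-p", project,
        "-v", str(version),
        "-t", model_type,
        "-m", model_path,
    ]


cmd = build_upload_command(
    workspace="rudys-workspace",
    project="tooth-keypoint-detection",
    version=10,
    model_path="./weights",  # placeholder: directory holding the trained weights
)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Building the command as a list keeps the model type explicit for every model, which makes it harder to repeat the upload without `-t`.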

One question that’d help us close the loop: what upload path did you originally use for tooth-keypoint-detection/10 — web app, SDK, or direct API? That tells us whether there’s a non-SDK path where the same validation gap still exists.

On V2 reliability. Straight answer, since I’m the one responsible for the service.

Yes, V2 is production-ready. It’s fully on the new async backend — the smart-routing work Brad mentioned in February is rolled out, not pending — and we run every entry-point service inside the Roboflow app on the same fleet, along with a lot of customers running production workloads.

I’m not going to pretend the last three weeks have been super smooth, though. There have definitely been some edge cases to fix in the new backend and in the new model implementations in inference 1.0. We’ve also been migrating to a new cloud provider to get the GPU capacity we need, and the Apr 15 disk failure plus some earlier networking issues came from that side. We’ve been working with the provider on fault tolerance. Uptime for serverless.roboflow.com over the last 30 days is 99.7%. Most incidents hit a subset of the fleet rather than the whole thing. During the Apr 15 incident, the majority of customers (>97%) making requests at that time were unaffected, because the disk failure only caused new model loads to fail, and only on the subset of nodes that were being assigned the new models. That’s cold comfort when you’re in the impacted subset, but it’s the risk profile of shared infrastructure.
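For planning purposes, here is what 99.7% translates to if you read it as time-based availability over a 30-day window. That reading is an assumption: the figure could equally be request-based, in which case the conversion below does not apply.

```python
# Figures quoted above; the time-based interpretation is an assumption.
UPTIME = 0.997
WINDOW_HOURS = 30 * 24  # 30-day window

downtime_hours = (1 - UPTIME) * WINDOW_HOURS
downtime_minutes = downtime_hours * 60

print(f"implied downtime: {downtime_hours:.2f} h ({downtime_minutes:.0f} min)")
```

That comes to about 2.16 hours over the 30 days. It is an aggregate figure, so it says nothing about how that time was split across incidents or which part of the fleet each one touched.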

If you need a formal SLA, happy to loop in an account rep — depending on your volume, a dedicated deployment might make sense and takes shared-fleet variance out of the picture. Plenty of production workloads live happily on shared serverless too though.

Bottom line for your situation: other than the registration bug we just found, every one of your requests should have succeeded. Serverless V2 itself isn’t what’s been breaking you. Re-upload the keypoint models with the right -t flag and I’d expect your traffic to run cleanly.

Thanks again for the report.

— Thomas