Hi Rudy,
Thanks for the detailed write-up — digging into your account turned up something worth investigating.
First, to be clear, these are two separate issues. Your workspace wasn’t impacted by the Apr 15 infrastructure incident. We refunded every workspace whose requests were affected by that outage, and your workspace wasn’t among them because your Apr 15 traffic came in after the incident had been resolved. What you’ve been seeing is a distinct issue specific to your account.
What’s actually happening. Pulling your ingress logs for the last 14 days: 7,605 of 7,788 requests failed with 500s, and every single failure is on your keypoint-detection models. Your segmentation models (front-teeth-segmentation/3, lower-arch-segmentation/14, upper-arch-segmentation/2) ran cleanly the whole time. The failures start Apr 8 — well before Apr 15.
The keypoint models (YOLO26-pose weights) were registered as object-detection models in our backend when they were uploaded. At inference time, the detection post-processor tries to parse pose output tensors and throws a 500. Our upload path should have caught that task-type mismatch and rejected the upload — it didn’t, and that’s the bug on our end. We’ve merged a validation check that now rejects type-mismatched uploads at the SDK boundary (roboflow-python#457). It’s pending a package release, but once it ships this can’t happen silently again.
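To make the mismatch concrete, here is a simplified Python sketch of the kind of check involved. It is an illustration only, not the code in roboflow-python#457: the function names and task labels are hypothetical, and inferring the task from a "-pose" suffix is an assumption based on the yolo26m-pose type string.

```python
# Illustrative sketch only. Names and task labels are hypothetical,
# not taken from the actual roboflow-python validation code.

def task_from_model_type(model_type: str) -> str:
    """Infer the task from a model_type string (assumed naming convention)."""
    if model_type.endswith("-pose"):
        return "keypoint-detection"
    return "object-detection"


def validate_upload(project_task: str, model_type: str) -> None:
    """Raise instead of silently registering weights under the wrong task."""
    declared_task = task_from_model_type(model_type)
    if declared_task != project_task:
        raise ValueError(
            f"model_type {model_type!r} implies {declared_task}, "
            f"but the project expects {project_task}"
        )


# A pose model_type on a keypoint project passes quietly.
validate_upload("keypoint-detection", "yolo26m-pose")

# A detection-style model_type on a keypoint project is rejected up front,
# rather than failing later at inference time.
try:
    validate_upload("keypoint-detection", "yolo26m")
except ValueError as err:
    print(f"rejected: {err}")
```

The point is simply that the failure moves from inference time (a 500 on every request) to upload time (one clear error).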
On credits. Your failed keypoint requests consumed 1.8 credits over the two-week window. I’ve rounded up and added 5 credits to your workspace.
What unblocks you today. Re-upload your keypoint models with -t yolo26m-pose on the upload command. That forces the correct model_type string so they register as keypoint, and they should run cleanly.
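If you script your uploads, here is a rough sketch of what the re-upload could look like. Only the -t yolo26m-pose part comes from the instruction above. The use of the roboflow upload_model CLI command and the other flags are assumptions to check against roboflow upload_model --help, and the workspace name and weights path are placeholders.

```python
import shlex

# Assumed CLI shape; confirm the flags with `roboflow upload_model --help`.
argv = [
    "roboflow", "upload_model",
    "-w", "your-workspace",            # placeholder: your workspace ID
    "-p", "tooth-keypoint-detection",  # project part of tooth-keypoint-detection/10
    "-v", "10",                        # version part of tooth-keypoint-detection/10
    "-t", "yolo26m-pose",              # the fix: explicit pose model type
    "-m", "path/to/weights-dir",       # placeholder: directory holding the weights
]

# Print a copy-pasteable command so it can be reviewed before anything runs.
print(shlex.join(argv))
```

Repeat with the project and version of each keypoint model that was affected.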
One question that’d help us close the loop: what upload path did you originally use for tooth-keypoint-detection/10 — web app, SDK, or direct API? That tells us whether there’s a non-SDK path where the same validation gap still exists.
On V2 reliability. Straight answer, since I’m the one responsible for the service.
Yes, V2 is production-ready. It’s fully on the new async backend — the smart-routing work Brad mentioned in February is rolled out, not pending — and we run every entry-point service inside the Roboflow app on the same fleet, along with a lot of customers running production workloads.
I’m not going to pretend the last three weeks have been super smooth, though. There have definitely been some edge cases to fix on the new backend and in the new model implementations in inference 1.0. We’ve also been migrating to a new cloud provider to get the GPU capacity we need, and the Apr 15 disk failure plus some earlier networking issues came from that side. We’ve been working with the provider on fault tolerance. Uptime for serverless.roboflow.com over the last 30 days is 99.7%. Most incidents hit a subset of the fleet rather than the whole thing. On Apr 15, the large majority of customers making requests at the time (>97%) were unaffected, because the disk failure only caused new model loads to fail on the subset of nodes that were being assigned new models. That’s cold comfort when you’re in the impacted subset, but it’s the risk profile of shared infrastructure.
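For a sense of scale, here is the back-of-the-envelope arithmetic behind that uptime figure. It assumes the 99.7% is a simple fraction of wall-clock time over a 30-day window, which is a simplification: as noted above, most incidents only touch part of the fleet, so this is an upper bound on what a typical workspace would see.

```python
uptime = 0.997            # 99.7% over the window
window_hours = 30 * 24    # 30 days = 720 hours

# Time not counted as "up" under the time-fraction assumption.
downtime_hours = (1 - uptime) * window_hours

print(f"{downtime_hours:.2f} hours (~{downtime_hours * 60:.0f} minutes)")
# prints: 2.16 hours (~130 minutes)
```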
If you need a formal SLA, happy to loop in an account rep — depending on your volume, a dedicated deployment might make sense and takes shared-fleet variance out of the picture. Plenty of production workloads live happily on shared serverless too though.
Bottom line for your situation: other than the registration bug we just found, every one of your requests should have succeeded. Serverless V2 itself isn’t what’s been breaking you. Re-upload the keypoint models with the right -t flag and I’d expect your traffic to run cleanly.
Thanks again for the report.
— Thomas