I built a program that does this with GPT-4 Vision by extracting one frame from each second of video, asking the model whether each frame represents the moment, and then returning the sequence of frames most likely to match. It works great, but using GPT-4 Vision at that scale is very expensive!
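For reference, a minimal sketch of that per-second sampling loop, assuming the openai and opencv-python packages (the model name, prompt, and video path are placeholders):

```python
import base64

import cv2
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def frame_matches_moment(frame, prompt):
    """Ask GPT-4 Vision whether a single frame depicts the described moment."""
    _, buffer = cv2.imencode(".jpg", frame)
    b64 = base64.b64encode(buffer.tobytes()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Does this frame show: {prompt}? Answer yes or no."},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=5,
    )
    return "yes" in response.choices[0].message.content.lower()


video = cv2.VideoCapture("gameplay.mp4")  # hypothetical video file
fps = max(int(video.get(cv2.CAP_PROP_FPS)), 1)
matching_seconds, index = [], 0
while True:
    ok, frame = video.read()
    if not ok:
        break
    if index % fps == 0 and frame_matches_moment(frame, "Mario fighting Bowser"):
        matching_seconds.append(index // fps)  # timestamp in seconds
    index += 1
video.release()
print(matching_seconds)
```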
Yes, it is possible to build such a program using Roboflow. You can use Roboflow’s Video Inference API to run computer vision models on frames in videos at a custom FPS. This can be used to identify specific moments in a video, similar to how you’ve been using GPT-4 Vision.
However, to identify specific moments like “Mario fighting Bowser” or “Steph Curry shooting a 3 pointer”, you would need a trained model that can recognize these specific events. If you have a dataset of images labeled with these events, you can use Roboflow to train a model to recognize them.
Once you have a trained model, you can use it to analyze frames from the video, or you can compute an embedding vector for each frame and use those vectors with a video search template to enable searching your video with text queries. Either output can be used to find the time in the video where the event occurs.
Here’s a high-level overview of the steps you would need to follow:
1. Collect a dataset of images representing the events you’re interested in (e.g., “Mario fighting Bowser”, “Steph Curry shooting a 3-pointer”).
2. Use Roboflow to train a model on this dataset.
3. Use the Roboflow Video Inference API to run your trained model on frames from the video (see the sketch after this list).
4. Use the model’s output to search the video for the events you’re interested in.
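As a rough sketch of step 3, assuming the roboflow Python package and a hypothetical project and version (the API key, project ID, and file name below are placeholders):

```python
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")                # placeholder API key
project = rf.workspace().project("your-project-id")  # hypothetical project
model = project.version(1).model                     # hypothetical version number

# Submit the video for hosted inference at a custom frame rate.
job_id, signed_url, expire_time = model.predict_video(
    "gameplay.mp4",  # hypothetical local video file
    fps=1,           # sample one frame per second
    prediction_type="batch-video",
)

# Poll until the batch job finishes, then scan the per-frame predictions
# for frames where the trained model detects the event.
results = model.poll_until_video_results(job_id)
print(results)
```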
Please note that while this approach can help reduce costs compared to using GPT-4 Vision, there will still be costs associated with using the Roboflow Video Inference API and training a model.
For more details on how to use Roboflow with video, you can refer to this blog post. For more information on how to search video frames with Roboflow, you can refer to this guide.
Hi! I’m Roboflow’s AI-powered bot. This is an automated response generated with knowledge from our docs, blog, and forum. If this solves your issue, please mark it as the solution using the checkbox icon. If this doesn’t solve your issue, please follow up with what you’ve tried/why, and the community will continue to chime in as usual.
CLIP is awesome here! You create an embedding for each frame and then compare those embeddings to the embedding of the text prompt (text and images share the same embedding space).
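For example, a minimal sketch with the open-source CLIP package (the frame files and the prompt below are placeholder assumptions):

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Embed the text prompt once.
text_tokens = clip.tokenize(["Steph Curry shooting a 3 pointer"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(text_tokens)
    text_emb /= text_emb.norm(dim=-1, keepdim=True)

# Embed each extracted frame (hypothetical frame_000.jpg, frame_001.jpg, ...).
frame_paths = ["frame_000.jpg", "frame_001.jpg", "frame_002.jpg"]
frame_embs = []
with torch.no_grad():
    for path in frame_paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        emb = model.encode_image(image)
        frame_embs.append(emb / emb.norm(dim=-1, keepdim=True))

# Cosine similarity between the prompt and every frame; the highest-scoring
# frames are the best candidates for the queried moment.
scores = (torch.cat(frame_embs) @ text_emb.T).squeeze(1)
for i in scores.argsort(descending=True)[:3].tolist():
    print(frame_paths[i], scores[i].item())
```

Because text and images share the same embedding space, the frames with the highest similarity scores are the ones most likely to depict the queried moment, and their indices map directly back to timestamps in the video.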