Need Guidance on Finding a Person of Interest from Live CCTV/Webcam Feed

Hi everyone, I’m new here!

I’m working on a university project for my graduation and research. My goal is to identify a person of interest from a live CCTV or webcam feed by uploading a reference image. However, I’m unsure how to get started or which approach to take.

I’ve explored several models and workflows but can’t decide on the best one. I also have some key questions:

  1. Do I need to train a model from scratch, or can I use an existing image embedding model for direct inference?
  2. If I use embeddings, how should I compare them efficiently with frames from the live feed?
  3. Would this approach be computationally expensive, and are there any optimizations to reduce the cost?

I’m feeling a bit stuck and would really appreciate any guidance or suggestions from experienced members. Thanks in advance for your help!

Hey there! It all depends on what you’re trying to accomplish in terms of accuracy, speed, cost, compute, etc. There is really no right or wrong answer here.

You can train from scratch if needed, but first try a pre-trained model to see if it works for you, such as this one: People Detection Object Detection Dataset and Pre-Trained Model by Leo Ueno.

Comparing embeddings “efficiently” depends on what you mean by efficient. If you’re thinking about embeddings rather than object detection, see this post: Launch: Embeddings in Workflows.
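To make the comparison step a bit more concrete, here is a minimal sketch, assuming each detected person crop in a frame has already been turned into an embedding vector by whatever model you pick. The function names and the 0.8 threshold are hypothetical and only for illustration; the threshold in particular would need tuning on your own data.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)

def best_match(reference, candidates, threshold=0.8):
    """Return (index, score) of the candidate most similar to the reference,
    or None if no candidate reaches the (assumed) threshold."""
    scores = [cosine_similarity(reference, c) for c in candidates]
    if not scores:
        return None
    i = max(range(len(scores)), key=scores.__getitem__)
    return (i, scores[i]) if scores[i] >= threshold else None

# Toy vectors standing in for real embeddings.
reference = [1.0, 0.0, 0.0]
frame_embeddings = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [1.0, 1.0, 1.0]]
print(best_match(reference, frame_embeddings))
```

With real embeddings you would normally do the same thing with a vectorized library rather than Python loops, but the logic is the same: score every person in the frame against the reference and only accept the top score if it clears a threshold.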

I do not know what computationally intensive or expensive means to you, but what I’ve shared so far is fairly lightweight for both training and inference.

I want to detect and track a person of interest in a real-time webcam or CCTV feed.
Which approach would be good for me?

I want good results. For now, I have devised a project in which I create embeddings using ViT and DINOv2 models, store them, and then compare them against the live feed from the webcam. It gets intermediate results, but it often flags other people as the person of interest.

Can you guide me further on this path?

This is a good approach. Our team often uses CLIP for embeddings to do things like this but I’ll see if anyone has other ideas.

Great @trevorhlynn, thanks for your response. I will add CLIP to the pipeline. What do you think about adding Cascade- or DeepFace-like algorithms for separate face comparisons, with CLIP or DINOv2 handling bodily features?

Yes, I think if you have visibility of a face then something like DeepFace would be better.
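One possible way to combine the two signals, as a rough sketch only: use the face score when a face is visible and fall back to the body score when it is not. The function name `fused_score` and the 0.7 face weight are hypothetical choices for illustration, not something taken from DeepFace, CLIP, or DINOv2, and both scores are assumed to be similarities already scaled to the same range.

```python
def fused_score(body_score, face_score=None, face_weight=0.7):
    """Blend body and face similarity scores.

    face_score is None when no face was visible in the crop, in which
    case only the body score is used. face_weight is an assumed value.
    """
    if face_score is None:
        return body_score
    return face_weight * face_score + (1.0 - face_weight) * body_score

# No face visible: the body score is used as-is.
print(fused_score(0.6))
# Face visible: the face score dominates the blend.
print(fused_score(0.6, face_score=0.9))
```

Weighting the face more heavily reflects the point above that a face comparison is the better signal when a face is visible; the exact weight is something to tune.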
