How can I make a network learn the shape and color of an object separately?

I saw the paper and code of a person re-identification library by NVIDIA: NVlabs/DG-Net, "Joint Discriminative and Generative Learning for Person Re-identification" (CVPR'19 Oral).

It says there are two different networks that focus on a person's body shape and clothing separately.

Sorry if the question is noobish.

Intuitively, I would run the image through an edge detector and then train on that output so the network learns the structure of the pedestrian. The network that focuses on color and clothing would instead be fed the original RGB image rather than the edge map. A rough sketch of what I mean is below.
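Something like this, using OpenCV (just a sketch of the idea; the filename and Canny thresholds are placeholders, not anything from DG-Net):

```python
import cv2

# Hypothetical input image; any pedestrian crop would do.
img_bgr = cv2.imread("pedestrian.jpg")

# Branch 1 input: an edge map that keeps only structure/shape.
gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, threshold1=100, threshold2=200)

# Branch 2 input: the original RGB image, which keeps color/clothing cues.
img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)

# One network would be trained on `edges`, the other on `img_rgb`.
print(edges.shape, img_rgb.shape)  # e.g. (H, W) and (H, W, 3)
```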

I skimmed through the code, but I didn't find any snippet that does this.

Is my assumption about the edge detector and RGB inputs correct?

Hi Lakshay, before you dive too deep, I want to note that we are not the originators of that paper or the architecture used to create the model, so I will answer as best I can.

From what I can understand:
To get the “shape” of the pedestrian, you can create an instance segmentation model, as it will give you a “mask” (an outline of the subject) when running inference.
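For example, here is a minimal sketch using torchvision's pretrained Mask R-CNN as one possible instance segmentation model (the filename and score threshold are placeholders, and this is not the method DG-Net itself uses):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pretrained COCO model; label 1 is "person" in COCO.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

img = Image.open("pedestrian.jpg").convert("RGB")  # hypothetical file
with torch.no_grad():
    out = model([to_tensor(img)])[0]

# Keep confident person detections and binarize their soft masks.
keep = (out["labels"] == 1) & (out["scores"] > 0.8)
person_masks = out["masks"][keep, 0] > 0.5  # (N, H, W) boolean masks
print(person_masks.shape)
```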

If you want to look at the RGB values of the clothing (to know what color it is) or other pixels, you can use the OpenCV library for that, or train another model with clothing labels and pass the detected bounding boxes to a classification model that identifies the clothing color. Alternatively, you could label the clothing with instance segmentation and have your project return the RGB values inside the detected clothing region when running inference.
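As a rough sketch of the OpenCV route (the filename and box coordinates are hypothetical, standing in for whatever your detector returns), you could average the pixels inside a detected clothing box:

```python
import cv2

img_bgr = cv2.imread("pedestrian.jpg")  # hypothetical file

# Suppose a detector returned this clothing bounding box (x1, y1, x2, y2).
x1, y1, x2, y2 = 40, 80, 120, 200  # hypothetical coordinates
crop = img_bgr[y1:y2, x1:x2]

# Average BGR value of the crop as a crude clothing-color estimate;
# a mask from instance segmentation would give a tighter region.
mean_b, mean_g, mean_r = crop.reshape(-1, 3).mean(axis=0)
print(f"mean RGB ≈ ({mean_r:.0f}, {mean_g:.0f}, {mean_b:.0f})")
```

Averaging over a segmentation mask instead of the full box would avoid mixing in background pixels, at the cost of needing clothing-level masks in your labels.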