Interpreting YOLOv8 -> TFLite output

We’ve trained a YOLOv8n model for a single class (Cone) and image size 1920 and converted it to a fully quantized TFLite model to run on a Coral Edge TPU. When running the TFLite model using the TensorFlow Python library, the output is an array of dimensions 1x5x75600. How do we interpret these results into a collection of bounding boxes? And how do we set a custom confidence threshold?

Hi @bababooey1234 !

The process of taking the output of a model and getting something meaningful (like a set of bounding boxes) is sometimes referred to as “decoding” the model output. For YOLOv8, you have a couple options:

  1. You can upload your weights to Roboflow then use the hosted endpoint to run your model. We take care of the decode for you and give you some easy-to-parse JSON. Here are the docs on that: Upload Weights - Roboflow

  2. You can tackle the decode on your own. The output of the model is an array of candidate detections with dimensions (batches, 4 + num_classes, num_candidate_detections). Each candidate detection is made up of (xc, yc, w, h, class_1_conf, class_2_conf, ..., class_N_conf), where xc, yc is the center point of the candidate detection and w, h are its width and height. class_N_conf is the confidence that the box belongs to the Nth class. In your case you’ll only have a single confidence, and that is the detection confidence to compare against when applying a confidence threshold. To get a set of meaningful bounding boxes, you’ll then need to run all of your candidate detections through Non-Maximum Suppression (NMS), which is the process of deduplicating overlapping candidate detections in favor of the most confident detection. A rough sketch follows below.
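
For what it’s worth, here is a minimal sketch of option 2 in plain NumPy for the single-class case. It assumes `output` is the raw (1, 5, 75600) array from the interpreter (already dequantized if your model is fully quantized), with channels ordered (xc, yc, w, h, conf); depending on your export the coordinates may be normalized 0-1 or in input-pixel units, so check before scaling. NMS still needs to run on what this returns (see later in this thread).

import numpy as np

def decode_single_class(output, conf_threshold=0.25):
    # output: raw model output with shape (1, 5, num_candidates),
    # channels ordered as (xc, yc, w, h, conf) for a single-class model.
    preds = output[0].T                      # -> (num_candidates, 5)
    boxes_xywh = preds[:, :4]                # center-x, center-y, width, height
    scores = preds[:, 4]                     # single-class confidence

    # Apply your custom confidence threshold here.
    keep = scores >= conf_threshold
    boxes_xywh, scores = boxes_xywh[keep], scores[keep]

    # Convert xywh (center format) to xyxy (corner format) for NMS / drawing.
    boxes_xyxy = np.empty_like(boxes_xywh)
    boxes_xyxy[:, 0] = boxes_xywh[:, 0] - boxes_xywh[:, 2] / 2
    boxes_xyxy[:, 1] = boxes_xywh[:, 1] - boxes_xywh[:, 3] / 2
    boxes_xyxy[:, 2] = boxes_xywh[:, 0] + boxes_xywh[:, 2] / 2
    boxes_xyxy[:, 3] = boxes_xywh[:, 1] + boxes_xywh[:, 3] / 2
    return boxes_xyxy, scores                # still needs NMS

If your output tensor is int8, dequantize it first with the scale and zero point from output_details[0]['quantization'] before running this.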

Hope that helps!

Hi, I need to use the same procedure. I don’t know what the correct Python code is to call and use the TFLite model in Python. Is it possible to share your code? Thanks.

@bababooey1234

Hello, on the TensorFlow website there is a code sample (the example for a model that doesn’t have SignatureDefs defined):

import numpy as np
import tensorflow as tf

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="converted_model.tflite")
interpreter.allocate_tensors()

# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Test the model on random input data.
input_shape = input_details[0]['shape']
input_data = np.array(np.random.random_sample(input_shape), dtype=np.float32)
interpreter.set_tensor(input_details[0]['index'], input_data)

interpreter.invoke()

# The function `get_tensor()` returns a copy of the tensor data.
# Use `tensor()` in order to get a pointer to the tensor.
output_data = interpreter.get_tensor(output_details[0]['index'])
print(output_data)


You just have to load input data according to your model’s input shape and dtype.

You can check it here: https://www.tensorflow.org/lite/guide/inference?hl=en#load_and_run_a_model_in_python
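
And to feed a real image instead of random data, something like this works (a rough sketch reusing `interpreter` and `input_details` from the snippet above; the file name and 640x640 size are placeholders, and fully quantized models additionally need the input quantized with the stored scale and zero point):

import numpy as np
from PIL import Image

# Placeholder image and size; use your own file and the model's input size.
img = Image.open("example.jpg").convert("RGB").resize((640, 640))
input_data = np.asarray(img, dtype=np.float32) / 255.0     # HWC, values in 0..1
input_data = np.expand_dims(input_data, axis=0)            # -> (1, H, W, 3)

# Fully quantized models expect int8/uint8 input; apply the stored quantization.
if input_details[0]['dtype'] != np.float32:
    scale, zero_point = input_details[0]['quantization']
    input_data = (input_data / scale + zero_point).astype(input_details[0]['dtype'])

interpreter.set_tensor(input_details[0]['index'], input_data)
interpreter.invoke()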

Hi,
Thanks for the explanation; it works for bbox predictions.
But how about segmentation models?
I have trained a “YOLOv8s-seg” model, converted it to TFLite, and the outputs are:
[1, 160, 160, 32] from output_details[1]['index'], which is supposed to be the mask protos,
and
[1, 40, 8400] from output_details[0]['index'], which is supposed to be the coordinates of detected objects, class labels, and confidence scores.

I have only 4 classes in my dataset,
and the model works perfectly fine before conversion,
but after conversion I am lost as to how to make sense of these dimensions.

You’re on the right track! With YOLOv8 instance segmentation, the output has dimensions [num_batch, 4 + num_classes + num_masks, num_candidate_detections], so each candidate detection is one column of the [1, 40, 8400] output. For YOLOv8 you can see that num_masks is 32, which matches up with the last dimension of the mask protos output. For each detection you’ll want to do some matrix multiplication to combine the mask prediction with the mask protos to compute the mask. See this code in the ultralytics repo: ultralytics/ops.py at 30fc4b537ff1d9b115bc1558884f6bc2696a282c · ultralytics/ultralytics · GitHub

Note, you’ll also need to ensure your NMS function can handle the extra mask dimensions on your predictions array.
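
A rough NumPy sketch of that mask step, loosely following the linked ops.py logic (not the exact ultralytics implementation): it assumes `protos_hwc` is the [1, 160, 160, 32] output with the batch dimension removed, and `mask_coeffs` is the (n, 32) slice of the detections that survived NMS.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def masks_from_protos(protos_hwc, mask_coeffs):
    # protos_hwc: (160, 160, 32) mask prototypes; mask_coeffs: (n, 32) per-detection coefficients.
    h, w, c = protos_hwc.shape
    protos = protos_hwc.transpose(2, 0, 1).reshape(c, -1)     # -> (32, 160*160)
    masks = sigmoid(mask_coeffs @ protos).reshape(-1, h, w)   # -> (n, 160, 160)
    return masks > 0.5   # binary masks; still need upsampling + cropping to each box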


@Paul Thank you for your prompt response. I now have a better understanding, and I know I have to do some matrix multiplications. I was going to use the same function you shared, but in order to apply NMS, I need to know the coordinates of the bboxes and their confidence scores and class labels.
If I look at some of the values of the matrix with dimensions [1, 40, 8400]:

print((output_data[0][0][0:4]))
print((output_data[0][0][4:6]))
print((output_data[0][5][0:4]))
print((output_data[0][5][4:6]))
print((output_data[0][10][0:4]))
print((output_data[0][10][4:6]))

the output values are something like this:

[     3.5684      9.4033      16.021      20.085]
[     24.542      31.756]
[ 2.1164e-06  1.3497e-06  1.1561e-06  1.3197e-06]
[ 1.2169e-06  1.3582e-06]
[   -0.55169    -0.28594    0.067442     0.23666]
[    0.19009     0.08357]

and for each of the 40 rows I have 8400 values,
and 160x160x32 does not fit into this calculation, unless I am supposed to drop some indexes and keep others.
I hope my question makes sense and I am not making some naive mistake.
Thank you for your patience and time.
I am also not sure whether the values in 1x160x160x32 are mask values or something else; as far as I am aware, they should be binary mask values, but they are float values and not between 0 and 1.

PS: for the 1x40x8400 matrix, the values range below 640, which is the dimension of the input image, so they may be the coordinates of the bboxes (but in which order: xyxy, xywh, or something else? And what are the labels for those boxes, if they are indeed boxes?). But what about indexes 4, 5, 6, and 7, where the values are less than zero?

Hi @Paul Thank you for your prompt and helpful response. It makes more sense now,
but I am still not able to understand the structure of the output. I am aware I need to do some resizing for the matrix multiplications etc., but before that I need NMS, and for NMS I need to know which value belongs to what.
e.g. from 1x40x8400, I have 40 rows and 8400 candidates, but in what order are they? Is it the same as for the bboxes (which I doubt, as the values do not look like that)?
e.g:

print((output_data[0][0][0:4]))
print((output_data[0][0][4:6]))
print((output_data[0][5][0:4]))
print((output_data[0][5][4:6]))
print((output_data[0][10][0:4]))
print((output_data[0][10][4:6]))

will show values something like this:

[     3.5684      9.4033      16.021      20.085]
[     24.542      31.756]
[ 2.1164e-06  1.3497e-06  1.1561e-06  1.3197e-06]
[ 1.2169e-06  1.3582e-06]
[   -0.55169    -0.28594    0.067442     0.23666]
[    0.19009     0.08357]

so for indexes 4, 5, 6, and 7 the values are less than one, and the rest of the values at all indexes are less than 637, which makes sense if these are coordinates in the 640x640 input image. But in which order, and where is the class label for these values? And what is the confidence?
If I don’t have this information, I am not sure how to proceed from here.
Thank you for your patience and time.

Hi @rsadiq, it sounds like this is the ordering/formatting

And this is what I’m seeing for the function definition linked by Paul:

def process_mask_native(protos, masks_in, bboxes, shape):
    """
    It takes the output of the mask head, and crops it after upsampling to the bounding boxes.
    Args:
      protos (torch.Tensor): [mask_dim, mask_h, mask_w]
      masks_in (torch.Tensor): [n, mask_dim], n is number of masks after nms
      bboxes (torch.Tensor): [n, 4], n is number of masks after nms
      shape (tuple): the size of the input image (h,w)
    Returns:
      masks (torch.Tensor): The returned masks with dimensions [h, w, n]
    """

And this may help for NMS: How to code Non-Maximum Suppression (NMS) in plain NumPy
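
In case that link moves, here is one minimal greedy NMS in plain NumPy (a sketch assuming boxes are already in xyxy order and scores is a 1-D array of confidences):

import numpy as np

def nms(boxes, scores, iou_threshold=0.45):
    # boxes: (n, 4) in xyxy order, scores: (n,). Returns indices of kept boxes.
    order = scores.argsort()[::-1]          # sort by decreasing confidence
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        # IoU of the most confident remaining box against the rest.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_rest - inter + 1e-7)
        order = order[1:][iou <= iou_threshold]  # drop boxes overlapping the kept one
    return np.array(keep, dtype=int)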

It does sound like that, I agree. But if I look at the values, they don’t add up.
I am sorry, but I couldn’t find any documentation or help on reorganizing the TFLite output.

num_masks = 32
num_classes = 4
num_predictions = 8400

output2 = np.reshape(output2, (num_predictions, 4 + num_classes + num_masks))  # reshape to [num_predictions, 4 + num_classes + num_masks]
boxes = output2[:, :4]
scores = output2[:, 4:5]
classes = output2[:, 5:5+num_classes]
masks = output2[:, 5+num_classes:]

print("BOX: ",boxes[:1])
print("SCORE: ",scores[:1])
print("CLASS: ",classes[:1])

This should be right, if I am ordering properly?
I should have boxes with their scores and class labels?
But the values in the labels and scores are just not making sense. Do I need to apply any other formatting that I am missing?

BOX:  [[     3.5684      9.4033      16.021      20.085]]
SCORE:  [[     24.542]]
CLASS:  [[     31.756      33.959      37.623      39.702]]

A score of 24.542? And class labels above 30? If the class labels were somehow in the 0-1 range, I would apply argmax, but here they just look like bbox coordinates, all of the numbers in some order.

It seems like you may have a flipped dimension. If the output you are seeing is out.shape = [1, 40, 8400], then the first candidate detection would be out[0, :, 0], a 40-element vector. The 40 elements are [xc, yc, w, h, c1, c2, c3, c4, m1, m2, ..., m32]. So you could reuse the NMS that was working for you previously by passing out[:, :8, :]. If you do this, you need to keep track of which indices made it through NMS so you can match up the 32-element mask vectors. Alternatively, you can update your NMS function to handle the larger input vectors and essentially ignore the extra 32 elements (but keep them around so that you can compute the masks for each prediction after NMS).
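
A small NumPy sketch of that slicing, assuming `out` is the raw (1, 40, 8400) array and the layout above (the variable names here are just for illustration):

import numpy as np

preds = out[0].T                 # (1, 40, 8400) -> (8400, 40): one row per candidate
boxes_xywh   = preds[:, :4]      # xc, yc, w, h
class_scores = preds[:, 4:8]     # one score per class (4 classes in this case)
mask_coeffs  = preds[:, 8:]      # 32 mask coefficients per candidate

scores    = class_scores.max(axis=1)      # detection confidence
class_ids = class_scores.argmax(axis=1)   # predicted class index

# Threshold, then run NMS on boxes/scores; keep the surviving indices so each
# kept detection can be matched back to its row of mask_coeffs for the mask step.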


@Paul thanks man.
It’s more than helpful. Appreciate it.

Since the OP mentioned they wanted to use an Edge TPU for prediction, I am wondering if they ever actually got it working, because due to a quantization issue the exported Edge TPU TFLite model previously wouldn’t detect anything. @bababooey1234

Hi, I am working on YOLOv8 in Edge TPU format. After converting model.pt into model-edgetpu.tflite format, I cannot load the Edge TPU model. Can you share the lines of code that load the Edge TPU model and output the prediction? Thank you a lot for your support.
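
For reference, the usual pattern from the Coral docs looks roughly like this (a sketch, not verified on your model; the model path is a placeholder, and the delegate library name depends on your platform: libedgetpu.so.1 on Linux, libedgetpu.1.dylib on macOS, edgetpu.dll on Windows):

import numpy as np
import tflite_runtime.interpreter as tflite

# Load the Edge TPU delegate so the compiled ops run on the TPU.
interpreter = tflite.Interpreter(
    model_path="model-edgetpu.tflite",
    experimental_delegates=[tflite.load_delegate("libedgetpu.so.1")])
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Placeholder input; prepare a real image according to the input shape/dtype as shown earlier.
dummy = np.zeros(input_details[0]['shape'], dtype=input_details[0]['dtype'])
interpreter.set_tensor(input_details[0]['index'], dummy)
interpreter.invoke()
prediction = interpreter.get_tensor(output_details[0]['index'])
print(prediction.shape)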

Hello mohamed,
I have the same problem as you; how can I solve it? I have an output of shape (1, 10, 8400); there are 6 classes + 4 = 10, but what about the list of 8400? How do I get the prediction boxes and the associated class? Can you please help me? Thank you very much.


Ciao @Paul, thanks so much for this hint.
I went through your suggestions; the output of my YOLOv8-to-TFLite model is a float32 [1, 6, 3549].
My decoding algorithm is the following:

  1. First I iterate over the size-6 array (which I take to be a single detection);
  2. For each element of the size-6 array, which has size 3549, I take the elements in position 0 (xc), position 1 (yc), position 2 (width), position 3 (height); then, starting from position 4, there is the confidence.
  3. In my case, as per the test training, there are just 2 classes (hence 4 + 2 = 6). So, in order to get the confidence of a class, I simply take the values in positions 4 and 5.

Now, if these steps are OK (?), I proceed with NMS and then try to draw a Rect (all this within an Android application). The issue I am having is that the boxes are not being drawn properly (see image). As you can see from the image, they are really small and do not fit the object I want to detect.
The code I am using for constructing Rect is the following:

                    final float xPos = detection[0][i][0];
                    final float yPos = detection[0][i][1];
                    final float w = detection[0][i][2];
                    final float h = detection[0][i][3];
                    final RectF rectF = new RectF(
                            Math.max(0, xPos - w / 2),
                            Math.max(0, yPos - h / 2),
                            Math.min(bitmap.getWidth() - 1, xPos + w / 2),
                            Math.min(bitmap.getHeight() - 1, yPos + h / 2));

Where am I wrong? Any hint you could provide will be really appreciated.
Many thanks,
m

Hey @kubermario ! In the picture you attached, is the box the green part near the top right of the image? And is most of the green just the label for that small box?

The steps you posted to go through NMS look good to me. Can you post one or two example predictions before and after NMS?

Here are some pitfalls I’ve encountered before:

  • Are your coordinates relative? If so, then they will all be fractions less than 1, in which case you’ll need to scale them by the width and height of the image (see the sketch after this list).

  • During NMS, are you sure you are sorting the correct way? It’s easy to accidentally sort detections by increasing instead of decreasing confidence (or to sort by IOU), in which case you’ll be keeping the worst-fitting candidates instead of the best-fitting ones.

  • I’m not too familiar with Android development but double check that the coordinate system is similar to other image processing systems, meaning the origin is in the upper left, x is positive to the right, y is positive down.
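
If the coordinates do turn out to be relative, here is a quick way to check and convert them (a Python sketch; the 640x640 input size is just an assumption, and the same arithmetic applies in the Android code above):

import numpy as np

# boxes_xywh: (n, 4) candidate boxes as (xc, yc, w, h) straight from the model.
# If every value is <= 1.0, the model is emitting relative coordinates and they
# must be scaled to pixels before building the RectF.
if boxes_xywh.max() <= 1.0:
    img_w, img_h = 640, 640   # assumed input size; use your bitmap dimensions
    boxes_xywh *= np.array([img_w, img_h, img_w, img_h], dtype=boxes_xywh.dtype)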
