Advanced Vision FOMO adding keras layers

I love Vision FOMO with the Arduino Portenta and LoRa Vision Shield. It does exactly what it is supposed to do. My after school group has had some good success with our RC cars using FOMO to drive between a 2 object track.

I am working on line following and FOMO works as planned (see images below). So I am coding various types of PID and other controllers to process the data to drive the car by extracting: Stop, GoRight, GoLeft and StraightFaster from the FOMO data. It is not going well.

I keep thinking this extraction of information would be best as a machine learning problem on top of the FOMO data. I have done similar stuff using TensorflowJS here, but not with EdgeImpulse. Today I may try training a regular FOMO model with the 4 labels, but I am fairly sure it will fail.

What I want to do is add a few classification layers to the end of the FOMO model. Does anyone have any suggestions, and would this dramatically slow down FOMO, which presently is ridiculously fast (72 ms)?

I can do my normal brute-force trial-and-error approach, which takes months but is normally successful, or I can try adding whatever Keras layers a machine learning expert suggests. Any suggestions?

Here are the FOMO line images I get so far. The data is good and clean; it is what to do with the data that gets messy. I have a few ideas on the go, but I think changing the ML model is the best approach.

So training for specific car commands did not work well.

Reading the above post, I guess my question is: Can the output from FOMO be the input for a second model on a micro-controller? The Portenta has 2 cores and I have good ways to communicate between the cores (link here). It would be really interesting to have both cores running an Edgeimpulse model, with one sending data to the other. Too bad the inner M4 core is so slow.


The crappy model above, trained on what the car should do (1right, 2left, 3fast) runs better than any of my other models trying to extract information from where the lines are showing up. So much for my predictive ability.


For any beginners wanting a good laugh.

I took the positive from the above model and tried a model fully based on what my Line following track might look like. Here are my model results


Oh well!

Kind of connected to this post: Passing FOMO results data (to a different core on the Arduino Portenta)

sorry, i somehow missed this last week…

so the main output of FOMO before the bounding box conversion is something equivalent to a segmentation map; by default a (96,96) input with 3 classes (e.g. {implied_background, white_line, witches_hat}) will reduce to a (12,12,3) tensor representing the distribution of classes across the entire 12x12 output. note: it’s the logits, not a softmax, at this point.
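
To make the "logits, not a softmax" point concrete, here is a minimal NumPy sketch. It assumes a made-up (12, 12, 3) array standing in for a real FOMO output and the class order {implied_background, white_line, witches_hat}; the helper name and the fake data are illustrative only.

```python
import numpy as np

CLASS_NAMES = ["implied_background", "white_line", "witches_hat"]  # assumed order

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw logits to per-cell class probabilities over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Stand-in for a FOMO output: background everywhere, a line down column 5.
logits = np.zeros((12, 12, 3), dtype=np.float32)
logits[..., 0] = 4.0   # background has the largest logit by default
logits[:, 5, 1] = 9.0  # white_line has the largest logit in column 5

probs = softmax(logits)                      # (12, 12, 3), each cell sums to 1
cell_classes = probs.argmax(axis=-1)         # (12, 12) grid of class indices
line_cells = np.argwhere(cell_classes == 1)  # (row, col) pairs of white_line cells
```

The `cell_classes` grid is the kind of spatial information a follow-on classifier would work from.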

In expert mode it’s

    model = Conv2D(filters=32, kernel_size=1, strides=1,
                activation='relu', name='head')(head_of_mobile_net.output)
    logits = Conv2D(filters=num_classes, kernel_size=1, strides=1,
                    activation=None, name='logits')(model)  # say (12, 12, 3)

what you want to do is further reduce with two important points

  1. you want to keep some spatial info (e.g. are witches_hat & road_line on the left, right or the middle) and
  2. you want to change the output distribution to the 4 {stop, right, left, straight} class set.

you can do 1) by a stack of strided 2d convolutions until you get to a “small enough” tensor that you can flatten to do 2) as standard logistic regression.

just writing from my head but it’ll be something like …

    steering_head = Conv2D(filters=8, kernel_size=3, strides=2, padding='same',
                           activation='relu')(logits)  # (6,6,8)
    steering_head = Conv2D(filters=8, kernel_size=3, strides=2, padding='same',
                           activation='relu')(steering_head)  # (3,3,8)
    steering_head = Flatten()(steering_head)  # (72)
    steering_head = Dense(units=4, activation='softmax')(steering_head)

this won’t add much more compute, but will keep the important aspects you need. the main tunables for model size / latency are the #filters on the conv layers; try to keep the values small (maybe even 4 would work?). note: it’s important to Flatten, not PoolXYZ, before the classifier; the whole point is you don’t want to lose the spatial info…
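
A back-of-envelope check of the "won't add much more compute" claim, as a minimal sketch in plain Python. The helper names are hypothetical; the formulas are the standard Conv2D output-size and parameter-count rules. Note that the (6,6,8) and (3,3,8) shapes quoted above correspond to `padding='same'`; Keras' default `padding='valid'` would give 5x5 and then 2x2.

```python
def conv_out(size: int, kernel: int, stride: int, padding: str) -> int:
    """Spatial output size of a Conv2D along one axis."""
    if padding == "same":
        return -(-size // stride)         # ceil(size / stride)
    return (size - kernel) // stride + 1  # 'valid'

def conv_params(kernel: int, in_ch: int, filters: int) -> int:
    """Weights plus biases of a Conv2D layer."""
    return kernel * kernel * in_ch * filters + filters

def head_summary(filters: int, padding: str, grid: int = 12,
                 in_ch: int = 3, num_outputs: int = 4):
    """Return (flattened_size, total_params) for two strided convs + Dense."""
    s1 = conv_out(grid, 3, 2, padding)
    s2 = conv_out(s1, 3, 2, padding)
    flat = s2 * s2 * filters
    params = (conv_params(3, in_ch, filters)
              + conv_params(3, filters, filters)
              + flat * num_outputs + num_outputs)
    return flat, params

for filters in (8, 4):
    for padding in ("same", "valid"):
        print(filters, padding, head_summary(filters, padding))
```

Under these assumptions the whole head is about 1,100 parameters with 8 filters and about 400 with 4, which is tiny next to the MobileNet torso.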

Happy to elaborate more (unsure your level of familiarity with this)


Oh, forgot to add. This extra part could be trained independently after the original FOMO model. E.g. drive the car around yourself for a bit and record your actions. That gives you training data. You can add sensible augmentations, including stuff like a left/right flip of the FOMO output, which would correspond to a turn_left / turn_right label swap.
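
The flip augmentation could look something like this minimal NumPy sketch. It assumes recorded FOMO outputs of shape (H, W, C) and the 0stop / 1right / 2left / 3straight label numbering used elsewhere in this thread; the function names are hypothetical.

```python
import numpy as np

# Assumed label indices for the steering classifier.
STOP, RIGHT, LEFT, STRAIGHT = 0, 1, 2, 3
FLIPPED_LABEL = {STOP: STOP, RIGHT: LEFT, LEFT: RIGHT, STRAIGHT: STRAIGHT}

def flip_example(fomo_map: np.ndarray, label: int):
    """Mirror a (H, W, C) FOMO output left/right and swap the turn label."""
    flipped = fomo_map[:, ::-1, :].copy()  # axis 1 is the width axis
    return flipped, FLIPPED_LABEL[label]

def augment(maps: np.ndarray, labels: np.ndarray):
    """Return the dataset with a mirrored copy of every example appended."""
    flipped = [flip_example(m, int(l)) for m, l in zip(maps, labels)]
    new_maps = np.stack([m for m, _ in flipped])
    new_labels = np.array([l for _, l in flipped], dtype=labels.dtype)
    return np.concatenate([maps, new_maps]), np.concatenate([labels, new_labels])
```

This doubles the recorded data for free; stop and straight examples keep their label, only the turn labels swap.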

Thanks @matkelcey this is really interesting and makes some sense to me, actually was the answer I was hoping for. I am going to continue working with my students on post FOMO algorithms for the next 2 weeks until the course is finished, then may take a stab at this.

@matkelcey The part about training independently is interesting. My pre-edgeimpulse background was with tensorflowJS, where using JavaScript I could “freeze” layers. Could I train a FOMO model using only my white lines, then add more layers, freeze the FOMO layers, and train it again with labels of goLeft, goRight, goStraight-Faster? (Stop is the default if nothing is classified.)

Yeah spot on, I actually forgot to say this, but as you mention it’s usually more stable to freeze the torso of the model (FOMO) and only train the new classifier. You can think of it along the same lines as transfer learning.
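
A minimal Keras sketch of the freeze-then-train idea. The `fomo` model here is a tiny stand-in with random weights, not the real MobileNetV2-based FOMO; a real project would use its trained model instead. Layer sizes follow the steering head suggested earlier, with `padding='same'` assumed.

```python
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Dense, Flatten, Input
from tensorflow.keras.models import Model

# Stand-in for an already-trained FOMO model: (96, 96, 1) -> (12, 12, 3) logits.
inputs = Input(shape=(96, 96, 1))
x = Conv2D(8, kernel_size=3, strides=8, padding='same', activation='relu')(inputs)
logits = Conv2D(3, kernel_size=1, activation=None, name='logits')(x)
fomo = Model(inputs, logits, name='fomo_stand_in')

# Freeze the torso so only the new layers receive gradient updates.
for layer in fomo.layers:
    layer.trainable = False

# New classifier head on top of the frozen logits.
h = Conv2D(8, kernel_size=3, strides=2, padding='same', activation='relu')(fomo.output)
h = Conv2D(8, kernel_size=3, strides=2, padding='same', activation='relu')(h)
h = Flatten()(h)
steering = Dense(4, activation='softmax', name='steering')(h)

combined = Model(fomo.input, steering)
combined.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```

Calling `combined.fit(...)` on (image, steering label) pairs would then update only the two new conv layers and the dense layer.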

the classifier will always return a result so you have to be a bit more explicit about stop; simplest way is to just add a stop class and train for it (e.g. using instances in the original training data where all output cells were background)
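
One way to pick out those stop examples, as a minimal NumPy sketch. It assumes FOMO outputs of shape (H, W, C) with the implied background class at index 0; the function name is hypothetical.

```python
import numpy as np

BACKGROUND = 0  # index of FOMO's implied background class

def is_stop_example(fomo_logits: np.ndarray) -> bool:
    """True when every cell of a (H, W, C) FOMO output is background."""
    return bool(np.all(fomo_logits.argmax(axis=-1) == BACKGROUND))

# An empty frame (background logit largest everywhere) counts as a stop example.
empty = np.zeros((12, 12, 3), dtype=np.float32)
empty[..., BACKGROUND] = 5.0

# A frame with a single white_line cell does not.
with_line = empty.copy()
with_line[6, 6, 1] = 9.0
```

Frames where `is_stop_example` returns True could be given the stop label when building the classifier's training set.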

@matkelcey I have shared a FOMO lines model with you on the expert network ID=109651 called rocksetta-lines-fomo-0stop-1right-2left-3straight That link might work if you are logged in.

Presently I have trained it only on the FOMO lines and here is the code that seems relevant:

    cut_point = mobile_net_v2.get_layer('block_6_expand_relu')
    #! Now attach a small additional head on the MobileNet
    model = Conv2D(filters=32, kernel_size=1, strides=1,
                activation='relu', name='head')(cut_point.output)
    logits = Conv2D(filters=num_classes, kernel_size=1, strides=1,
                    activation=None, name='logits')(model)
    return Model(inputs=mobile_net_v2.input, outputs=logits)

I am in no rush to do this but would like to understand what you are thinking I should try.

  1. train model using only “lines” - done
  2. freeze all present layers
  3. add a few more layers as mentioned above
  4. retrain model with “lines” but now with the specific bounding box labels: (0stop, 1right, 2left, 3straight)
  5. Use the output to control my toy car.

And here is the entire model in Keras expert mode.

import os
import tensorflow as tf
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import BatchNormalization, Conv2D
from tensorflow.keras.models import Model
from ei_tensorflow.constrained_object_detection import models, dataset, metrics, util

def build_model(input_shape: tuple, weights: str, alpha: float,
                num_classes: int) -> tf.keras.Model:
    """ Construct a constrained object detection model.

    Args:
        input_shape: Passed to MobileNet construction.
        weights: Weights for initialization of MobileNet where None implies
            random initialization.
        alpha: MobileNet alpha value.
        num_classes: Number of classes, i.e. final dimension size, in output.

    Returns:
        Uncompiled keras model.

    Model takes (B, H, W, C) input and
    returns (B, H//8, W//8, num_classes) logits.
    """

    #! First create full mobile_net_V2 from (HW, HW, C) input
    #! to (HW/8, HW/8, C) output
    mobile_net_v2 = MobileNetV2(input_shape=input_shape,
    #! Default batch norm is configured for huge networks, let's speed it up
    for layer in mobile_net_v2.layers:
        if type(layer) == BatchNormalization:
            layer.momentum = 0.9
    #! Cut MobileNet where it hits 1/8th input resolution; i.e. (HW/8, HW/8, C)
    cut_point = mobile_net_v2.get_layer('block_6_expand_relu')
    #! Now attach a small additional head on the MobileNet
    model = Conv2D(filters=32, kernel_size=1, strides=1,
                activation='relu', name='head')(cut_point.output)
    logits = Conv2D(filters=num_classes, kernel_size=1, strides=1,
                    activation=None, name='logits')(model)
    return Model(inputs=mobile_net_v2.input, outputs=logits)

def train(num_classes: int, learning_rate: float, num_epochs: int,
          alpha: float, object_weight: int,
          best_model_path: str,
          input_shape: tuple) -> tf.keras.Model:
    """ Construct and train a constrained object detection model.

    Args:
        num_classes: Number of classes in datasets. This does not include
            implied background class introduced by segmentation map dataset
        learning_rate: Learning rate for Adam.
        num_epochs: Number of epochs passed to
        alpha: Alpha used to construct MobileNet. Pretrained weights will be
            used if there is a matching set.
        object_weight: The weighting to give the object in the loss function
            where background has an implied weight of 1.0.
        train_dataset: Training dataset of (x, (bbox, one_hot_y))
        validation_dataset: Validation dataset of (x, (bbox, one_hot_y))
        best_model_path: location to save best model path. note: weights
            will be restored from this path based on best val_f1 score.
        input_shape: The shape of the model's input
        max_training_time_s: Max training time (will exit if est. training time is over the limit)
        is_enterprise_project: Determines what message we print if training time exceeds

    Returns:
        Trained keras model.

    Constructs a new constrained object detection model with num_classes+1
    outputs (denoting the classes with an implied background class of 0).
    Both training and validation datasets are adapted from
    (x, (bbox, one_hot_y)) to (x, segmentation_map). Model is trained with a
    custom weighted cross entropy function.
    """

    nonlocal callbacks

    num_classes_with_background = num_classes + 1

    input_width_height = None
    width, height, input_num_channels = input_shape
    if width != height:
        raise Exception(f"Only square inputs are supported; not {input_shape}")
    input_width_height = width

    #! Use pretrained weights, if we have them for configured
    weights = None
    if input_num_channels == 1:
        if alpha == 0.1:
            weights = "./transfer-learning-weights/edgeimpulse/MobileNetV2.0_1.96x96.grayscale.bsize_64.lr_0_05.epoch_441.val_loss_4.13.val_accuracy_0.2.hdf5"
        elif alpha == 0.35:
            weights = "./transfer-learning-weights/edgeimpulse/MobileNetV2.0_35.96x96.grayscale.bsize_64.lr_0_005.epoch_260.val_loss_3.10.val_accuracy_0.35.hdf5"
    elif input_num_channels == 3:
        if alpha == 0.1:
            weights = "./transfer-learning-weights/edgeimpulse/MobileNetV2.0_1.96x96.color.bsize_64.lr_0_05.epoch_498.val_loss_3.85.hdf5"
        elif alpha == 0.35:
            weights = "./transfer-learning-weights/keras/mobilenet_v2_weights_tf_dim_ordering_tf_kernels_0.35_96.h5"

    if (weights is not None) and (not os.path.exists(weights)):
        print(f"WARNING: Pretrained weights {weights} unavailable; defaulting to random init")
        weights = None

    model = build_model(

    #! Derive output size from model
    model_output_shape = model.layers[-1].output.shape
    _batch, width, height, num_classes = model_output_shape
    if width != height:
        raise Exception(f"Only square outputs are supported; not {model_output_shape}")
    output_width_height = width

    #! Build weighted cross entropy loss specific to this model size
    weighted_xent = models.construct_weighted_xent_fn(model.output.shape, object_weight)


    #! Wrap bbox datasets with adapters for segmentation maps
    train_segmentation_dataset = dataset.bbox_to_segmentation(
        train_dataset, input_width_height, input_num_channels,
        output_width_height, num_classes_with_background)
    validation_segmentation_dataset = dataset.bbox_to_segmentation(
        validation_dataset, input_width_height, input_num_channels,
        output_width_height, num_classes_with_background)

    #! Initialise bias of final classifier based on training data prior.
        model, train_segmentation_dataset, num_classes_with_background)

    #! Create callback that will do centroid scoring on end of epoch against
    #! validation data. Include a callback to show % progress in slow cases.
    callbacks = callbacks if callbacks else []
    callbacks.append(metrics.CentroidScoring(validation_dataset, output_width_height, num_classes_with_background))

    #! Include a callback for model checkpointing based on the best validation f1.
            monitor='val_f1', save_best_only=True, mode='max',
            save_weights_only=True, verbose=0)),
              epochs=num_epochs, callbacks=callbacks, verbose=0)

    #! Restore best weights.

    return model

model = train(num_classes=classes,

override_mode = 'segmentation'
disable_per_channel_quantization = False

Looks like I asked this question on a different forum and got a different suggestion. Here is the link

I like having multiple possible solutions.

Hi @matkelcey, I had to do some raw sensor work before coming back to this issue. I will DM you and see if there is a chance for a phone or zoom call. I get the gist of what you mentioned at Advanced Vision FOMO adding keras layers - #6 by matkelcey but I am not sure how to put it into practice.

It’s not the expert mode changes, they make sense; it’s more the big picture of what to do: how to merge FOMO as a base model with a 4-output classification of the FOMO data. It does sound really interesting, I hope we can connect.

@matkelcey so what I am not yet understanding is do I train 2 models or just one in your suggestions above here?

For 2 models, I use what I have already trained (a FOMO model detecting the white line the car follows) as the base model for your suggestions above with 4 classes, and retrain that model. (I don’t yet understand how to use one model as a base model for the second model; basically, how to do transfer learning on edgeimpulse with your own model.)

Or do I just make the changes you suggested and train the model ONCE using my 4 classes (stop, left, right, fast)?

What I don’t understand with the one model idea is how the FOMO part understands that the line is the important thing to learn. The bounding boxes will be huge compared to the precise bounding boxes I have been using. Do I just trust, that the model, with enough training data will figure it out?

My thinking from the original post was that it’d be two models for training.

A FOMO one that takes the camera input (96, 96, 1) and returns (12,12,3) describing the location of {implied_background, white_line, witches_hat}

Then another model that takes the output of FOMO (12,12,3) and collapses all the way down to (4) (being {stop, right, left, straight})

These could be trained as completely separate projects; the input training data for the second coming from some output data of the first.
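
A minimal NumPy sketch of building the second project's training set from recorded outputs of the first. The recording format (a list of `(fomo_output, action_name)` pairs), the label order and the function name are all assumptions for illustration.

```python
import numpy as np

ACTIONS = ["stop", "right", "left", "straight"]  # assumed label order

def build_steering_dataset(recordings):
    """Turn [(fomo_output, action_name), ...] into (X, y) arrays.

    fomo_output is a (12, 12, 3) array saved from the first model;
    action_name is what the driver did on that frame.
    """
    X = np.stack([np.asarray(out, dtype=np.float32) for out, _ in recordings])
    y = np.array([ACTIONS.index(action) for _, action in recordings],
                 dtype=np.int64)
    return X, y

# Fake recordings standing in for a real driving session.
rng = np.random.default_rng(0)
recordings = [(rng.normal(size=(12, 12, 3)), a)
              for a in ["left", "left", "straight", "stop"]]
X, y = build_steering_dataset(recordings)  # X: (4, 12, 12, 3), y: (4,)
```

`X` and `y` are then ordinary classifier training data, independent of the camera images the FOMO model was trained on.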

For deployment you can run them sequentially or, with some surgery, turn them into a single model.
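
Both deployment options can be sketched in Keras. The two models below are small stand-ins with random weights, not the real FOMO or steering networks, and wiring one model onto the other's output is only one possible reading of "surgery".

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Conv2D, Dense, Flatten, Input
from tensorflow.keras.models import Model

# Stand-ins for two separately trained models (random weights here).
fomo_in = Input(shape=(96, 96, 1))
fomo_out = Conv2D(3, kernel_size=8, strides=8, name='fomo_logits')(fomo_in)
fomo = Model(fomo_in, fomo_out, name='fomo_stand_in')            # -> (12, 12, 3)

steer_in = Input(shape=(12, 12, 3))
s = Flatten()(steer_in)
steer_out = Dense(4, activation='softmax')(s)
steering = Model(steer_in, steer_out, name='steering_stand_in')  # -> (4,)

frame = np.random.rand(1, 96, 96, 1).astype('float32')

# Option 1: run sequentially, feeding the first model's output to the second.
sequential_result = steering(fomo(frame)).numpy()

# Option 2: wire the second model onto the first's output to get one model.
single = Model(fomo.input, steering(fomo.output), name='fomo_plus_steering')
single_result = single(frame).numpy()
```

Both routes compute the same thing; the single model is simply more convenient to export as one artifact.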