How to: image vector creation with masked autoencoder

Hi there,

The architecture I would like to replicate can be found in this paper:

I am curious to find out whether anyone has any experience with masked autoencoders in KNIME and how to set it up.

The first problem I have is building a vision transformer in KNIME. I usually work with the Keras nodes, but from my understanding it is not currently possible to build a vision transformer with the Keras nodes, am I right?

If anybody has any ideas on how to approach this problem or has examples to share, I would be very grateful.

My goal is to use the vector from the autoencoder to enhance an existing NLP-based classifier. I tried several CNN-based architectures before, but due to the limited amount of labelled data the results are far from satisfactory. I think an unsupervised approach might work better, because I have a lot of similar but unlabelled images available. I stumbled across the paper above, which I think is pretty neat. I'm also curious to hear about any other methods out there for unsupervised image vector creation.
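For anyone wondering what the masked-autoencoder pretext task looks like concretely: the image is split into non-overlapping patches and a large random subset (commonly ~75%) is hidden; the network only sees the remaining patches and must reconstruct the rest. A minimal NumPy sketch of just that masking step (patch size, mask ratio, and array names are illustrative, not from any KNIME node):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    h, w, c = image.shape
    ph, pw = h // patch_size, w // patch_size
    patches = image[:ph * patch_size, :pw * patch_size, :].reshape(
        ph, patch_size, pw, patch_size, c)
    # -> (num_patches, patch_size * patch_size * c)
    return patches.transpose(0, 2, 1, 3, 4).reshape(ph * pw, -1)

def random_mask(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches; return kept patches and their indices."""
    rng = rng or np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return patches[keep_idx], keep_idx

image = np.random.rand(32, 32, 3)
patches = patchify(image, patch_size=8)   # 16 patches, each of length 8*8*3 = 192
visible, keep_idx = random_mask(patches)  # only 4 patches survive 75% masking
```

The encoder is run on `visible` only; the decoder gets mask tokens for the dropped positions and is trained to reconstruct the hidden patches. After pretraining, the encoder output is the image vector.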


Hi @e2dboden,

As you can see from the many responses, this is quite a tough question :sweat_smile: Let me try to give some hints, even though I don't have experience with masked autoencoders.

Did I understand correctly that a vision transformer is a concatenation of certain layers? If so, this should be possible via the Keras Layer nodes (see here for a list). If it's more complex, it's probably more convenient to build the vision transformer with the DL Python Network Creator – KNIME Hub or the DL Python Network Editor – KNIME Hub.
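It is more than a plain stack of layers: a ViT encoder block is patch embeddings, then multi-head self-attention, then an MLP, with residual connections and layer norms around each. Just to make the core attention step and its shapes concrete, here is a single-head sketch in plain NumPy (weights are random placeholders; inside the DL Python Network Creator you would express this with Keras layers such as `MultiHeadAttention` instead):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over patch tokens."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (num_patches, num_patches)
    return softmax(scores) @ v               # weighted mix of all patches

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 64))       # 16 patch embeddings of dim 64
Wq, Wk, Wv = (rng.standard_normal((64, 64)) * 0.1 for _ in range(3))
out = self_attention(tokens, Wq, Wk, Wv)     # (16, 64), same token count
```

Because every patch token attends to every other one, this is the part that plain sequential Keras Layer nodes struggle to express, which is why the scripting nodes are the more comfortable route.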

You could also use an existing trained network, if one is available, and modify it (or parts of it). You can find an example of this here: Fine-tune VGG16 (Python) – KNIME Hub
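The pattern in that example boils down to: take a pretrained backbone, freeze it, and put a small trainable head on top that emits the image vector. A hedged Keras sketch of that idea (here with `weights=None` to keep the snippet self-contained; the linked workflow uses ImageNet weights, and the 128-dim head is just an illustrative choice):

```python
import tensorflow as tf

# Backbone without its classification head; weights=None avoids a download
# here, whereas the KNIME example loads pretrained ImageNet weights.
base = tf.keras.applications.VGG16(include_top=False, weights=None,
                                   input_shape=(224, 224, 3))
base.trainable = False  # freeze the convolutional backbone

x = tf.keras.layers.GlobalAveragePooling2D()(base.output)
embedding = tf.keras.layers.Dense(128, name="image_vector")(x)
model = tf.keras.Model(base.input, embedding)
```

In a DL Python Network Creator node you would end such a script by assigning the model to `output_network`, then train only the head downstream.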

But let's see whether this bump to the topic works and someone more experienced in deep learning has an idea!

Kind regards, Lukas