RNN Keras in KNIME (Sequence to vector)

knimeoutjie · January 30, 2020, 12:54pm

I need someone to help me build an RNN in KNIME with Keras. I did it in Python but it is ugly and dirty. Would appreciate any pointers.

Sequence to Vector. Actually it is a whole sequence of data which leads to a single binary output.

nemad · January 30, 2020, 1:51pm

Hello @knimeoutjie,

and welcome to the KNIME community.
Could you please provide a more detailed description of the network you want to build?

Best regards,

Adrian

knimeoutjie · January 30, 2020, 2:29pm

Hi Adrian,

I have come up with an example from science which would be very similar to my final dataset. At least on the level of input which is my biggest problem at this point.

Here the example from science (ignore units, I know this is not very scientific):

Imagine you have a cylinder which is running full of water from a constant running source. You want to build a RNN model to predict if the cylinder is full or not.

You choose 30 000 cylinders with random radius (30 000 random between 3 – 7) and height (30 000 random between 10 – 30). You also get a random time (let's say in seconds) that the source is open for (30 000 random between 10 – 30).

If you use a constant flow into the cylinder of 100, you get an even spread of full and not full cylinders.

We pass the radius, height and seconds (time unit) running to the RNN. You make another feature which is the volume that has flowed into the cylinder after 1 second and then after 2, 3, 4, etc. Since the time unit (let’s say seconds) is also random, some cylinders will be full and some will not be by the time the last second is reached.

What is nice about this example is that the output is measurable and the input sequence varies from sample to sample. We know the volume of each cylinder and thus if it will be full or not. So we have good training data.

Summary

The RNN will only receive the radius, height and seconds of running water from constant source. It also receives the volume after 1 second, 2 seconds, etc, etc, but it does not know the formula of the cylinder nor the full volume of it. How accurate can it predict if a specific cylinder is full or not?
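For reference, the toy dataset described above could be generated roughly as follows. This is only a sketch under assumptions: integer radius, height and seconds, the standard cylinder volume formula for the hidden capacity, and the names `FLOW_RATE` and `make_cylinder_sample` are made up for illustration. It is not the Python code mentioned in the post.

```python
import math
import random

FLOW_RATE = 100  # constant inflow per second, as stated in the post


def make_cylinder_sample(rng):
    """Return one sample: static features, the volume sequence and the label."""
    radius = rng.randint(3, 7)      # assumption: integer values
    height = rng.randint(10, 30)
    seconds = rng.randint(10, 30)
    # Capacity of the cylinder; used only for the label, never shown to the model.
    capacity = math.pi * radius ** 2 * height
    # Volume that has flowed in after 1, 2, ..., `seconds` seconds.
    volume_in = [FLOW_RATE * t for t in range(1, seconds + 1)]
    is_full = int(volume_in[-1] >= capacity)
    return {
        "radius": radius,
        "height": height,
        "seconds": seconds,
        "volume_in": volume_in,
        "full": is_full,
    }


rng = random.Random(0)  # fixed seed so the sketch is reproducible
samples = [make_cylinder_sample(rng) for _ in range(30_000)]
```

Each sample carries a sequence whose length equals its own `seconds` value, which is the varying-length input the rest of the thread is about.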

Like I said, this varying in sequence length (matrix) to vector (full or not) fits my real dataset very well. The advantage of doing this first is that this example is very measurable and when I did it in Python it worked very well. I am happy to share my Python code, but it is super messy.

I would appreciate any help,

Leon

nemad · January 31, 2020, 12:10pm

Hi Leon,

I fear I don’t understand your example.
So you have for each cylinder the three features [radius, height, time] where time is the seconds that water runs into the cylinder. In addition you have a sequence of length time which contains for each time unit the volume of water that the cylinder contains?
But isn’t that sequence the same for all cylinders as the volume of water flowing into them is constant? Of course smaller cylinders will fill up faster and therefore the sequence is shorter.
I don’t know the network you already have but I would actually only use the three features and ignore the sequence, as they already contain all the information necessary to complete the task.

Cheers,

Adrian

knimeoutjie · January 31, 2020, 1:37pm

Hi Adrian,

Thanks a lot!

That might be the case but remember my final data set resembles the features I have manufactured and they cannot be reduced. I therefore came up with this example on purpose although it might be a little bit artificial. My real data set has sequence length of various lengths and I want to use a RNN sequence to vector setup. Also please keep in mind that the purpose is learning and learning to use KNIME and not so much making this specific model work with a 100% accuracy. Sure it would be interesting to play around afterward but my first goal is to set up the RNN to take a sequence of input of varying length and feed that through the RNN.

I do appreciate your comments a lot!
Leon

nemad · January 31, 2020, 2:41pm

Hello Leon,

I see, so in your setting you have sequences of varying lengths and you want to extract a vector describing each sequence, right?
The main challenge you face is to pad your sequences so that they all have the same length.
Don’t worry, you can make use of Keras’ masking during training.
The best strategy for padding depends on the format of your data.
If you have a collection column containing the sequences, then you can do the padding by first applying the Split Collection Column node followed by a Missing Value node in which you replace all missing numerical values by 0.
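Outside KNIME, the same zero-padding idea might look like the NumPy sketch below. The helper name `pad_with_zeros` and the example values are hypothetical; it pads at the end of each sequence, which is what filling missing values with 0 after splitting a collection produces.

```python
import numpy as np


def pad_with_zeros(sequences, max_len=None):
    """End-pad variable-length sequences with 0 so they all share one length."""
    if max_len is None:
        max_len = max(len(seq) for seq in sequences)
    padded = np.zeros((len(sequences), max_len), dtype=float)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq  # real values first, zeros stay at the end
    return padded


volumes = [[100, 200, 300], [100, 200, 300, 400, 500]]
padded = pad_with_zeros(volumes)
# padded[0] is [100, 200, 300, 0, 0]; padded[1] is unchanged
```

Using 0 as the padding value is safe here only because a real volume in this example is never 0.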
You then need to define your network, for this I’d recommend the following structure:
Keras Input Layer -> Keras Masking Layer -> Keras LSTM Layer -> Keras Dense Layer.
The first represents the input, the second performs the masking of the 0 values, the LSTM then reduces your sequence into a vector and the dense layer finally produces your output.
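In plain Keras, the node chain above corresponds roughly to the sketch below. The function name, the number of LSTM units and the sigmoid activation for the binary output are assumptions, not settings taken from this thread.

```python
def build_sequence_to_vector_model(max_len, n_features=1, units=32):
    """Input -> Masking -> LSTM -> Dense, mirroring the node chain above."""
    # Imported inside the function so TensorFlow is only needed when it is called.
    from tensorflow import keras

    return keras.Sequential([
        keras.Input(shape=(max_len, n_features)),     # padded sequences
        keras.layers.Masking(mask_value=0.0),         # skip the zero-padded steps
        keras.layers.LSTM(units),                     # sequence -> single vector
        keras.layers.Dense(1, activation="sigmoid"),  # vector -> binary output
    ])
```

Calling `build_sequence_to_vector_model(max_len=30)` and then compiling with a binary cross-entropy loss would be one way to get something trainable. Because the LSTM does not return sequences, it emits only its last state, which is the "sequence to vector" part.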

Cheers,

nemad

knimeoutjie · February 1, 2020, 2:35pm

Hi Adrian,

I will do exactly that and it seems quite close to what I did in Python.
r = radius
h = height
t = time in second

When generating the r, h and t values I used the ‘Random Numbers Generator’ to output integers. I then used two separate ‘Joiner’ nodes to get them in one table (inner join). I have 3000 rows for now. What turns out to be a little bit trickier is to generate a sequence of the length of the t (in seconds) value and have the t value increase by one until t is reached. This would result in the set of matrices that would be needed as input, except that I will work out the volume of each record and add that as a feature.

My problem for now is to reshape the data so that each row becomes a matrix with increasing t value.
If I can do that I know that my example will take off big time!

I do appreciate and enjoy your help. Please help again!
Leon

NB: The output is a 1 by 1 vector. In this case binary, empty or full.

nemad · February 3, 2020, 12:20pm

Hi Leon,

ah I see, so you actually have a [t, 3] matrix?
Unfortunately, handling matrices is still a bit awkward in KNIME.
There are essentially two strategies:

Use our image processing extension, as pictures can also be seen as kinds of matrices.
Provide 3 input layers for your model where each receives one sequence (collection). One for the radius, height and time.

For your example, I’d recommend the second approach, as it doesn’t require learning our image processing extension.
In order to create the collections, I’d use the group by node, but you’ll first have to add a column that identifies the rows belonging together.
I am still not 100% sure on the format of your data, so if I misunderstood something, please let me know.

Cheers,

Adrian

knimeoutjie · February 3, 2020, 2:29pm

knimeoutjie · February 3, 2020, 2:40pm

Hi Adrian,

I made some data in Excel quickly and pasted it in the post. Not all these columns would be needed in the end. Those that will be passed to the RNN would be number (as you suggested, a unique number for the training sample), second (clearly increasing within the sample), radius (constant for our example), height (constant for our example) and volume run in (the amount of water in the cylinder at time t). The label is given in Full/Not Full. The first training example is clearly not filled up and the second is. The column total volume will not be given to the model. I would rather give you this example to clear up any misunderstanding about the data.

I am stuck at the moment in the generation of the data and I think you are already thinking about the next step (feeding the RNN). I am interested in both.

Firstly then I have this in KNIME

That part was easy. Now I want to generate two sequences from the two rows. The two sequences will be the bigger matrix described above. Those two lines must be translated into two sequences.

Then I have to feed that matrix of matrices to the RNN. I wanted you to be very clear on my data since it was still a little bit ambiguous.

Many, many thanks for helping me!
Leon

nemad · February 3, 2020, 5:21pm

Hi Leon,

ok, I’ll give you a quick outline on one way to create the sequences:

Loop over the rows in your last post using the Chunk Loop Start
Use the Table Row to Variable node to extract the seconds
Create an empty table with as many rows as you have seconds (use the flow variable button)
Cross join the current row with the empty table using the Cross Joiner
Use the Counter Generation together with the Math Formula to simulate the constant flow of water
Use the Group By node without any groups to aggregate the whole table into your desired sequences
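As a rough Python analogue of the expansion steps above (not a KNIME workflow), one summary row can be turned into one row per second like this. `expand_row`, `FLOW_RATE` and the column names are hypothetical and follow the columns described earlier in the thread.

```python
FLOW_RATE = 100  # assumed constant inflow per second


def expand_row(sample_id, radius, height, seconds):
    """Turn one summary row into `seconds` rows, one per time step."""
    rows = []
    for second in range(1, seconds + 1):      # plays the role of the counter
        rows.append({
            "number": sample_id,
            "second": second,
            "radius": radius,                 # repeated, as after a cross join
            "height": height,
            "volume_in": FLOW_RATE * second,  # plays the role of the math formula
        })
    return rows


# Two samples with different sequence lengths (3 and 4 seconds).
table = expand_row(1, radius=5, height=20, seconds=3)
table += expand_row(2, radius=3, height=10, seconds=4)
```

The result is the long table with several rows per training sample that the later posts refer to.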

I know I could have created the workflow but I think this way you will learn more about the KNIME AP

Cheers,

Adrian

knimeoutjie · February 4, 2020, 6:39am

Hi Adrian,

It was so easy to do that it almost feels like cheating. Apologies for the screen print, but I am still figuring out the copy function from the data sheet.

I have added a few extra columns for now (will drop them later). I can send you my workflow if you want but I don’t think that is necessary. I did however not add the label yet and I did not use the ‘Group By’ node. I would be interested to know what you wanted me to use the Group By node for.

and also now is the right moment to comment on your (t x 3) matrix comments (please).

Leon

nemad · February 4, 2020, 8:28am

Haha that’s what we like to hear
In the table you displayed in your post, each training instance is represented by multiple rows but in order to train your RNN each training instance must be represented by a single row. In order to get there you can use the GroupBy node. It allows you to aggregate multiple rows (a single group) into a single row, using a variety of aggregation functions. In your case you will need to aggregate the volume as a list.
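The aggregation described here, one row per training instance with the volumes collected as a list, can be sketched in plain Python as follows. The tuple layout and the function name are assumptions for illustration.

```python
from collections import defaultdict

# Long-format rows: (sample number, second, volume run in).
long_rows = [
    (1, 1, 100), (1, 2, 200), (1, 3, 300),
    (2, 1, 100), (2, 2, 200),
]


def group_volume_as_list(rows):
    """Group by sample number and aggregate the volumes as an ordered list."""
    grouped = defaultdict(list)
    # Sorting by (number, second) keeps each list in time order.
    for number, _second, volume in sorted(rows):
        grouped[number].append(volume)
    return dict(grouped)


sequences = group_volume_as_list(long_rows)
# sequences == {1: [100, 200, 300], 2: [100, 200]}
```

The grouping key is the sample number only; grouping by a column that changes within a sample, such as the volume itself, would yield more than one row per sample.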

Regarding the (t x 3) comment: The way I understood it was that each time step in your time series should be represented by the three features Length, Radius and Volume.
In the current version of KNIME this is a little tricky to achieve as we only have collections in the base version. One way of dealing with matrices is our Image Processing extension but for your example this would be an overkill.
As an alternative you can simply have three Keras Input Layers where each represents one of the sequences. In order to get to the (t x 3) input to your RNN you can simply concatenate the sequences with a combination of Keras Reshape Layers and Keras Concatenate Layers.
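The reshape-then-concatenate idea can be illustrated with NumPy arrays standing in for the three inputs. The values are made up, and the operations only mimic what the Reshape and Concatenate layers would do for a single sample.

```python
import numpy as np

t = 4  # number of time steps in this sample

# Three separate sequences of length t, one per feature.
length = np.full(t, 20.0)   # constant over time
radius = np.full(t, 5.0)    # constant over time
volume = np.array([100.0, 200.0, 300.0, 400.0])

# "Reshape": give every sequence an explicit feature axis, (t,) -> (t, 1).
columns = [seq.reshape(t, 1) for seq in (length, radius, volume)]

# "Concatenate": join along the feature axis to obtain the (t, 3) matrix.
matrix = np.concatenate(columns, axis=-1)
# matrix[2] is [20.0, 5.0, 300.0]
```

Each row of `matrix` is one time step and each column is one feature, which is the layout an LSTM expects for a single sample.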

Cheers,

Adrian

knimeoutjie · February 4, 2020, 11:20am

Hi Adrian,

I have done it slightly differently from what you suggested, since grouping it only by volume at t did not result in only one row. Your comments please, but I grouped it by seconds (t) and volume at time t, to get a nice row.

Clearly the two resulting vectors are now stored in the row.

Your comments please? Should I not have padded the sequences with zeros up to the maximum length before putting them in this beautiful row?

Regarding your (t,3) comments. Does this still apply to me or were you imagining that my data would have a different shape?

Super thanks!
Leon

nemad · February 4, 2020, 3:46pm

There are many ways to skin a cat.
As long as you ended up with the desired result, there is nothing to comment on.

You will have to do padding at some point anyway, so you can just as well do it before you create the collections.

Perhaps I simply misunderstood what you wanted to do. I thought your plan was to create a time series where for each time step you have the three features Length, Radius and Volume.
From a modeling perspective that’s actually unnecessary because Length and Radius stay constant, so you can just append them to the output of your RNN before you calculate the final output of the network.
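A hedged Keras sketch of that idea: only the volume sequence goes through the LSTM, and the two constant features are appended to the LSTM output before the final dense layer. The function name, input names and layer sizes are assumptions.

```python
def build_model_with_static_features(max_len, units=32):
    """LSTM over the volume sequence; constants appended to the LSTM output."""
    # Imported inside the function so TensorFlow is only needed when it is called.
    from tensorflow import keras
    from tensorflow.keras import layers

    seq_in = keras.Input(shape=(max_len, 1), name="volume_sequence")
    static_in = keras.Input(shape=(2,), name="length_and_radius")

    x = layers.Masking(mask_value=0.0)(seq_in)   # ignore zero-padded steps
    x = layers.LSTM(units)(x)                    # sequence -> vector
    x = layers.Concatenate()([x, static_in])     # append the constant features
    out = layers.Dense(1, activation="sigmoid")(x)

    return keras.Model(inputs=[seq_in, static_in], outputs=out)
```

Compared with a (t x 3) input, this avoids repeating the same two constants at every time step.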

Cheers,
Adrian

knimeoutjie · February 8, 2020, 9:47am

Hi Andrew,

Apologies for being quiet, I have been busy with some other stuff. BUT I am still very keen on building this.

Ok the padding and Grouping went ok (I think)

With the Keras nodes things are getting a bit obscure because there are no output tables…

My Keras LSTM Layer does not like the dimensions that I am passing. I get this error:

Clearly there are some major gaps in my understanding, but I have to say I really like KNIME more and more.

Thanks a LOT!
Leon

knimeoutjie · February 8, 2020, 10:35am

Apologies, Adrian, not Andrew…my fault

knimeoutjie · February 8, 2020, 11:34am

Hi Adrian,

I would appreciate if you could comment on one other thing I think about now. From my data print you can see that I padded the zeros towards the end of the sequence (which is mapped in a vector now). Would it not have been better to put the zeros in front of the actual sequence? Does that question make sense? Front padding vs End padding if I can make up terms. If so, how would I do this in KNIME?

Thanks a million!
Leon

nemad · February 10, 2020, 9:18am

Hi Leon,

that’s a lot of questions, so let me start at the beginning.
The output of the Keras nodes is a network architecture that you then have to train on your data with the Keras Network Learner.
For an example on how to use the Deep Learning framework, you can check out this workflow by Scott.

For your use-case padding in the front might indeed be a better approach but it’s slightly more complicated and there exists a simple alternative:
In the LSTM node there exists an option called “Go Backwards” which when enabled will let the network read the sequence from back to front.
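To see what reading backwards does to an end-padded sequence, here is a small NumPy illustration with made-up values. In plain Keras the corresponding LSTM argument is `go_backwards=True`.

```python
import numpy as np

end_padded = np.array([100, 200, 300, 0, 0])    # zeros appended at the end

# Order in which a backwards-reading LSTM sees the end-padded sequence.
read_order = end_padded[::-1]                   # [0, 0, 300, 200, 100]

# True front padding, for comparison: zeros first, values in original order.
front_padded = np.array([0, 0, 100, 200, 300])
```

In both cases the padding comes before the real values, but reading backwards also reverses the real values, so the two are not identical inputs.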

Cheers,

Adrian

knimeoutjie · February 10, 2020, 9:26am

Thanks Adrian,

After I did the last post I tried to implement the Keras Network Learner, but I see that there is a dependency on Python. I don’t mind setting up Python, but it would be good to understand why it is needed before I do it. I see it is best to use the Anaconda environment? Is that still the best option for Python with KNIME?

Many thanks,
Leon