Label multiple CSV rows as one observation

Sorry, I don't really know how to describe the topic better, but here is my problem:

Being rather new to KNIME, I am trying to create a predictive ML model based on data I received from a friend.
The data consists of multiple CSV files (24 million lines each) containing 20 kHz readings from 7 acceleration sensors as float values, plus a timestamp.

My problem is the labeling. Let's say every 500,000 lines form one separate data recording. So I would like to tell KNIME that rows…

  • ‘x’ through ‘y’ are to be labeled “OK”
  • ‘y + 1’ through ‘z’ are to be labeled “not OK”
  • etc.

…while making clear that these lines depend on each other and are not independent observations.

I hope I have made the problem clear; if not, I will gladly try to explain it better.

Thank you!
Rob

Hi @Rob and welcome to the KNIME community forum,

To do this you can use the Rule Engine node and the ROWINDEX value. For example:

$$ROWINDEX$$ < 3 => "OK"
$$ROWINDEX$$ >= 3 AND $$ROWINDEX$$ < 10 => "not OK" 

:blush:

What do you want to predict?

Hey @armingrudd, thank you for your quick reply. That is already a great help in finding my way into KNIME.

From my understanding, this adds an additional column to my table according to the rules.
But does it also make the classifier understand that, e.g., rows 3-9 are one observation and not seven independent observations?

I want to predict the “OK” and “not OK” states described by the sensor data. In the “OK” state, the sensors produce different values within a given time window (e.g. 500,000 consecutive lines) than in the “not OK” state.

You can use the Rule Engine once you have grouped your rows into unique observations. Hence my question about what you want to predict. Group your rows into observations containing the result, and then use the Rule Engine node to assign a class based on the result value.
How you group depends on your value of interest, and it will obviously be much easier if your window has a constant length (always 500,000 rows). Maybe check out the Moving Aggregation node and Chunk Loop Start.

With the chunk loop, you can define a fixed number of rows to which all actions are applied until the loop ends. But it only works for a fixed-length window. By the way, with 500,000 rows per observation and 24 million rows, that is only 48 actual observations to train from? Not sure one can do much with so little data!
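
If it helps to see what the chunking amounts to outside of KNIME, here is a rough pandas sketch (e.g. for a Python Script node); the file name is a placeholder, and the only assumption carried over is the fixed chunk size of 500,000 rows:

```python
import pandas as pd

CHUNK_SIZE = 500_000          # rows per recording, as described above

# Placeholder file name; in practice this would be one of the 24-million-row CSVs.
df = pd.read_csv("sensor_recording.csv")

# Every block of 500,000 consecutive rows gets the same id. This is essentially
# what Chunk Loop Start iterates over, one chunk per loop pass.
df["chunk_id"] = df.index // CHUNK_SIZE   # assumes the default 0..N-1 row index

# 24,000,000 rows / 500,000 rows per chunk = 48 chunks, i.e. 48 observations.
print(df["chunk_id"].nunique())
```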


Thank you very much, I will give that a try.

Fortunately there is more data - there are many files with 24 million rows each. Otherwise I'd agree that it would be too little.
Well, to be honest, I still think it is not enough to train a decent model, but I am trying to show that it is possible to do this in KNIME. It is no problem to get much more data, as the machines run 24/7.

Thank you for your assistance, @beginner. Now I have one, hopefully last, question on that topic. Maybe it is a mistake in my thinking about how the model is trained, but the result I get from chunking the input data is something like this:

Index     Time              Sensor_1   Sensor_2   Result
1         09:00:00:000000   -2,889     16,469     OK
2         09:00:00:000050   -7,267     11,366     OK
3         09:00:00:000100   -6,712      6,991     OK
…
500,000   09:00:25:000000   -10,715    12,643     OK

What I want is something like this, meaning that the “OK” is global for the whole block of 500,000 rows:

Index     Time              Sensor_1   Sensor_2   Result
1         09:00:00:000000   -2,889     16,469
2         09:00:00:000050   -7,267     11,366
3         09:00:00:000100   -6,712      6,991     OK
…
500,000   09:00:25:000000   -10,715    12,643

From my point of view this makes a difference, because the 500,000 rows depend on each other: they represent a development over time. A single row on its own does not have any meaning; the same values can occur in the “OK” and in the “not OK” state. Only together with the other rows is there supposed to be a pattern that a model can be trained on.

Or is there an error in my thinking about this?

Thanks again!
Rob

I’m repeating myself: you need to group your 500,000 rows into one. ML algorithms assume 1 row = 1 observation. Since you can assign a class to these 500,000 rows, you need to encode whatever makes them OK or not OK into feature(s), e.g. new columns.
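
As a rough sketch of what that grouping could look like (e.g. in a Python Script node): the file name, the label mapping and the choice of aggregates below are placeholders; only Sensor_1/Sensor_2 and the 500,000-row chunks come from this thread.

```python
import pandas as pd

CHUNK_SIZE = 500_000

df = pd.read_csv("sensor_recording.csv")      # placeholder file name
df["chunk_id"] = df.index // CHUNK_SIZE       # one id per 500,000-row block

# Collapse each chunk into a single feature row (the "new columns").
features = df.groupby("chunk_id")[["Sensor_1", "Sensor_2"]].agg(
    ["mean", "std", "min", "max"]
)
features.columns = ["_".join(col) for col in features.columns]

# Attach the known class per chunk; this mapping is only a placeholder.
labels = {0: "OK", 1: "not OK"}
features["Result"] = features.index.map(labels)

# Now 1 row = 1 observation, which is what the learner expects.
print(features)
```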


@Rob I think you might have to answer some questions for yourself (and for us, if we are to be able to help you further).

The first question is: do you have a ground truth? Do you know (in retrospect) which blocks of 500k lines are OK or not OK? If you have this ground truth, you can proceed from there. If you do not have it, you will have to employ anomaly-detection techniques and then check whether someone can interpret the periods where you found anomalies, which might or might not hint at a ‘Not-OK’ status.

https://www.knime.com/white-papers
(section IoT)

If we assume you have the ground truth you will have to employ techniques of data reduction (to one line). One extremely brutal technique would be to transpose all 500k lines to columns to produce a table that has one entry for each instance. That would be extremely hard to handle and all following models would require a lot of dimension reduction.
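
Just to illustrate what that transposition would mean in practice, a pandas sketch with placeholder names (not a recommendation):

```python
import pandas as pd

df = pd.read_csv("sensor_recording.csv")        # placeholder file name
df["chunk_id"] = df.index // 500_000            # one id per 500k-row block
df["pos"] = df.groupby("chunk_id").cumcount()   # position within the block

# One row per chunk, one column per (sensor, position) pair.
wide = df.pivot(index="chunk_id", columns="pos", values=["Sensor_1", "Sensor_2"])

# With 7 sensors and 500,000 positions this would be 3.5 million columns,
# which is why heavy dimension reduction would be needed afterwards.
print(wide.shape)
```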

For a start, you could try removing highly correlated columns, as they might not add any relevant information to what you are trying to find.

An alternative could be to provide each line with ‘only’ the preceding 10 or 100 lines, depending on what kind of relationship you expect between them and what kind of trend an expert would expect in the sensor readings.
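
A sketch of such lag features (KNIME's Lag Column node does the same thing; the window of 10 and the names below are only assumptions):

```python
import pandas as pd

N_LAGS = 10   # or 100 - to be chosen based on the expected relationship

df = pd.read_csv("sensor_recording.csv")   # placeholder file name

# Give each row the preceding N_LAGS values of a sensor as extra columns.
for lag in range(1, N_LAGS + 1):
    df[f"Sensor_1_lag{lag}"] = df["Sensor_1"].shift(lag)

# Repeat for the other sensors; the first N_LAGS rows then contain missing
# lag values and can be dropped.
df = df.dropna()
```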

Then you could employ simpler aggregation functions like mean, min, max, kurtosis and skewness, or other trend measures, to describe your entity.
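
For example, a per-chunk aggregation with those statistics might look roughly like this (again with placeholder names, building on the same chunk grouping as above):

```python
import pandas as pd
from scipy.stats import kurtosis, skew

df = pd.read_csv("sensor_recording.csv")   # placeholder file name
df["chunk_id"] = df.index // 500_000

# Per-chunk summary statistics, including the higher moments mentioned above.
stats = df.groupby("chunk_id")[["Sensor_1", "Sensor_2"]].agg(
    ["mean", "min", "max", skew, kurtosis]
)
stats.columns = ["_".join(col) for col in stats.columns]
print(stats)
```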

If you are dealing with time series, you might try a tool like tsfresh and see whether you can extract time-related trends. I am not 100% sure it can handle grouped time series, but it might be worth a try.

In general, it would make sense to know more about the logic and inner workings of this machine, what experts would expect, and which measures they deem relevant.

