New time-series nodes

Hello,

Today I was playing around with the new nodes in KNIME 2.11:

  • Time Series Missing Value
  • Seasonality Correction

I did not really understand them while studying the example 050010_Energy_Usage_Time_Series_Prediction.

Here is my data and what I want to achieve. I collect energy data at irregular time intervals of roughly 5 minutes:


row ID    date-time-stamp       energy
Row325143 2013-12-21T17:37:20   783,66
Row325201 2013-12-21T17:42:20   783,75
Row325259 2013-12-21T17:47:20   783,95
Row325317 2013-12-21T17:52:20   784,13
Row325375 2013-12-21T17:57:20   784,32
Row325443 2013-12-21T18:02:46   784,54
Row325501 2013-12-21T18:07:20   784,83
Row325559 2013-12-21T18:12:20   785,06
Row325617 2013-12-21T18:17:20   785,20
Row325675 2013-12-21T18:22:20   785,24

The problem with this is that one cannot use all the nice prediction and mining functions from the energy examples available here, as the date-time-stamp is very uneven. So I have to turn this into data with regular date-time-stamps. I tried to do this with the missing value node, but could not find the right settings.

This is my expected result: resample all data points to 15-minute intervals with linearly interpolated energy values and a delta calculation in the last column:

row ID    date-time-stamp     energy regular-date-time   resampled-energy delta
Row325143 2013-12-21T17:37:20 783,66
Row325201 2013-12-21T17:42:20 783,75
Row325259 2013-12-21T17:47:20 783,95 2013-12-21T17:45:00 783,86           0,57
Row325317 2013-12-21T17:52:20 784,13
Row325375 2013-12-21T17:57:20 784,32
Row325443 2013-12-21T18:02:46 784,54 2013-12-21T18:00:00 784,43           0,71
Row325501 2013-12-21T18:07:20 784,83
Row325559 2013-12-21T18:12:20 785,06
Row325617 2013-12-21T18:17:20 785,20 2013-12-21T18:15:00 785,13
Row325675 2013-12-21T18:22:20 785,24
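
For reference, this is roughly what that computation looks like in Python/pandas (just a sketch; the file and column names are my assumptions, and pandas puts the diff on the later row rather than the earlier one):

import pandas as pd

# Load the irregular readings (file name and decimal-comma handling assumed).
df = pd.read_csv("energy.csv", parse_dates=["date-time-stamp"], decimal=",")
series = df.set_index("date-time-stamp").sort_index()["energy"]

# Build the regular 15-minute grid, union it with the raw index,
# interpolate linearly by actual time distance, then keep only the grid.
grid = pd.date_range(series.index.min().ceil("15min"),
                     series.index.max().floor("15min"), freq="15min")
resampled = (series.reindex(series.index.union(grid))
                   .interpolate(method="time")
                   .reindex(grid))

delta = resampled.diff()  # difference between consecutive resampled values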

From there I could continue with the big-data-time-series white paper...

Happy about any help!

Hi,

This one (see attachment) is really heavy, but it could give you some inspiration.

Any suggestion to simplify this is welcome; I'd like to learn how to do this properly. The delta is not on the first line but on the last line, sorry for that. If you try this workflow, beware: the recursive loop end needs a parameter.

Best regards

Fabien

Wow... that is a way harder problem than I first expected.

Here's my take on it. Not really simpler than Fabien's, not as complete, but different, for some more inspiration.

I can't see how the more complex time series nodes can help here. The simple ones, however, can help us extract raw data, with which we can reduce the problem to a more general one. The hardest part is the linear interpolation. The only node that can do that on its own (as far as I know) is Time Series Missing Values, but it uses the row count as the distance. So we must compute it "manually". Fabien solved that by using Lag Columns, while I used some grouping trickery. My way may look a bit shorter, but it is much harder to generalise. A third way might be to use a Group Loop Start on quarters, then something to select the rows to compute on. And a fourth way might be a Java Snippet. (But where's the sport in that?)
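
To make that "manual" computation concrete, here is the time-weighted linear interpolation in plain Python (a sketch; the function and variable names are mine, not from either workflow):

from datetime import datetime

def interpolate_at(t, t0, y0, t1, y1):
    # Weight by the actual time distance between the surrounding samples,
    # not by row count.
    w = (t - t0).total_seconds() / (t1 - t0).total_seconds()
    return y0 + w * (y1 - y0)

# Example from the table above: the 17:45:00 grid point lies between
# the 17:42:20 and 17:47:20 readings.
y = interpolate_at(datetime(2013, 12, 21, 17, 45, 0),
                   datetime(2013, 12, 21, 17, 42, 20), 783.75,
                   datetime(2013, 12, 21, 17, 47, 20), 783.95)
print(round(y, 2))  # 783.86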

I don't think there's a "proper" way to do such tasks. The fold we're trying to do is a little too special to be supported by general nodes.

fabienc,

thanks a lot for spending so much time & brains on this. Now I know why I could not fix it with 2 or 3 nodes.

I am currently investigating your model. That is pretty big!

Next time I will post the data to make it easier. I am astonished that the community is providing complete models to help. I really appreciate your support.

Cheers, Stefan

Marlin,

many kudos for your help. Right now I am comparing the models. As I am still a beginner this will take some time.

What puzzles me a little bit: this is a very typical problem if you collect real-world sensor data. I have several more problem classes like this...

Stay tuned for my feedback ;-)

fabienc,

Wow, that's wizard work for me as a KNIME beginner. I understood the approach and have just two more questions:

1) How does the "Group by (node 32)" work? I do not really understand why this parameter generates such a nice output.

2) The "cross-joiner (node 8)" will generate a lot of data, right?

If the energy data set has n values and the output should have m regular time stamps, the table will have n * m rows?

If yes, I will run into a huge memory problem with that. I collect about 9,000 values for ONE sensor in ONE month. As I have 26 sensors and data from 39 months, this is probably more than KNIME can handle? Is there an easy way to restrict the cross-joiner, maybe to 1 hour?
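
To give a feeling for the numbers, here is my own back-of-the-envelope calculation (assuming a 15-minute target grid and one cross join per sensor across all 39 months):

values_per_sensor = 9_000 * 39        # ~351,000 raw readings
grid_per_sensor = 39 * 30 * 24 * 4    # ~112,320 quarter-hour stamps
rows = values_per_sensor * grid_per_sensor
print(f"{rows:,}")                    # ~39 billion rows, per sensor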

Thanks a lot. Really cool.

@stefferber

Marlin,

many very good suggestions in your flow. Finding the right quarter looks like an essential first step. What about this formula:

$quarter-per-day$ = floor(($Hour$*60+$Minute$)/15)

as there are 96 quarters per day: 0...95

and then calculating the quarter's start minute within the hour is easy:

$minutes-per-hour$ = mod($quarter-per-day$, 4)*15

Your calculation of the distance in seconds to the next quarter inspired me. Hence, I left the signum in, as it keeps track of which value I have to filter later.
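
Translated into plain Python for illustration (the names are mine; the signed distance is my reading of Marlin's trick):

from datetime import datetime

def quarter_info(ts):
    # Quarter-of-day index (0..95), the quarter's start minute within
    # the hour, and the signed distance in seconds to the nearest
    # quarter boundary.
    quarter_per_day = (ts.hour * 60 + ts.minute) // 15
    minute_in_hour = (quarter_per_day % 4) * 15
    secs_into_quarter = (ts.minute % 15) * 60 + ts.second
    # Positive = just after a boundary, negative = shortly before the next one.
    dist = secs_into_quarter if secs_into_quarter <= 450 else secs_into_quarter - 900
    return quarter_per_day, minute_in_hour, dist

print(quarter_info(datetime(2013, 12, 21, 17, 47, 20)))  # (71, 45, 140)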

Anyway, now I have it. Tomorrow I will post my solution with 12 "linear" nodes (no loops, no joiner) here. I am too tired now.

@stefferber

Stefferber,

Yes, that's a great formula. Didn't see that one. I guess a fresh mind helps. ;)

I agree that this sounds like a recurring problem, but I'm not sure what would be a good way to turn it into a node. I mean, one could just create a resampling node, but that's not really general or modular. I have a feeling there's some bigger, more useful pattern behind it that I just can't see.

So what are examples of these other problem classes you mentioned? Maybe if we close in on the missing parts, some idea will emerge and inspire someone to take it up.

To answer one of your questions to Fabien: you cannot restrict the cross joiner by itself, but you can use a chunk loop or a recursive loop with a row filter to step through parts of huge lists, and then apply e.g. joiners only to these parts. I understand your reluctance to use loops, and I try to avoid them too, but with one exception: if I have a lot of data (i.e. hopefully always), I often use one around everything else. It gets a CSV reader and a filter at the beginning and a CSV writer at the end, and helps me keep data if something goes wrong. (And with real-world data, something probably will go wrong.) Not as fast from a computational standpoint, but faster if you include restarts in the calculation.
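
Outside of KNIME, the same checkpointing idea looks like this (a sketch; the file names, column names and chunk size are made up):

import pandas as pd

# Process a huge file in chunks and append results incrementally,
# so a crash only costs the current chunk.
reader = pd.read_csv("sensor_data.csv", parse_dates=["date-time-stamp"],
                     chunksize=100_000)
for i, chunk in enumerate(reader):
    result = chunk[chunk["energy"].notna()]  # stand-in for the real work
    result.to_csv("resampled.csv", mode="a", header=(i == 0), index=False)

In a real resampling job you would also carry the last row of each chunk over to the next one, so that interpolation across chunk boundaries still works.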

Fabienc & Marlin,

thanks a lot for your lessons in KNIME. From your examples I put this together.

This time I included all the data to make it easier to follow. In addition, there is a graph (attached) explaining all the involved calculations and the naming of data points.

How to generalize from this? I think that is not that difficult (see the sketch after the list):

  1. Input: a date-time series of data streams
  2. Parameters:
    1. start-date-time-stamp, end-date-time-stamp, regular interval (also a date-time type)
    2. interpolation type: linear, polynomial, square, ...
    3. number of regression data points: >= 2
    4. fill up missing data points: YES/NO
  3. Output: a regular date-time series with interpolated data
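
As a sketch of what such a node's configuration could look like (entirely hypothetical, just the list above restated as code):

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ResampleConfig:
    # Hypothetical configuration mirroring the parameter list above.
    start: datetime                # start-date-time-stamp
    end: datetime                  # end-date-time-stamp
    interval: timedelta            # regular interval
    interpolation: str = "linear"  # linear, polynomial, square, ...
    regression_points: int = 2     # number of regression data points, >= 2
    fill_missing: bool = True      # fill up missing data points: YES/NO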

What do you think?

@stefferber

Wow, Stefferber, that's a really clean and simple flow, good job!

I think it would be great if the String Manipulator could be replaced, but the only ways I can think of would make the flow larger again, and not make it more efficient. There's definitely room for a new node there. Also, changing the order of your last nodes to get the Row Filter to the earliest possible point might have a small effect on performance, albeit by breaking the separation of stages. Apart from that, I think you found the best possible solution. Congratulations!

Now about the generalization... Yes, your proposal is one possibility. But what about spike handling? What if the base sampling is irregular by design, to enhance precision where it is most important? What if data from different categories is sampled almost, but not quite, the same way? All of that probably happens all the time, and all of it is possible now, if sometimes in a very complicated way. The thing is: none of it would get much easier if we just added such an equal-interval-resampling node, which means such a node would only be useful in a very small subset of situations. And integrating it all into one node defeats the purpose of a modular system. So I agree that there's probably "something" that could be done, but I still don't think we have a good idea of what that might be...

Marlin,

thanks for your additions. I agree with the Row Filter efficiency improvement. I will do that.

The Date-Time-to-String-and-back-to-Date-Time detour exposes a KNIME weakness: there is no complete set of operators on the date-time type.

Removing spikes and data defects is still missing in my model. If you look more carefully, you will find them in there. According to the measurement, my house required 435 kWh in 5 minutes, i.e. about 5,000 kW of power (435 kWh in 5 minutes corresponds to 435 × 12 ≈ 5,220 kW) ;-) That is about 5% of my closest nuclear power plant's production! In reality, I only have 43,5 kW on my house mains inbound.

Sometimes data is missing for several hours or days. Sometimes the energy counter drops to 0 and starts from there.

Regarding the general node, I understand your arguments. Maybe we start with a custom node and learn from there. Is there already a collection of optional nodes where this could fit in?

Happy X-mas and thanks a lot for your help on this. I learned a lot in just a few days!

@stefferber

Hi,

It's a little bit late, and I see that a lot of work has already been done on this "little subject", but reporting back is always a good thing, isn't it?

The GroupBy (node 32) is there to remove the byproduct of the loop combined with the Cross Joiner: the whole dataset is copied into the results once per loop iteration. I used that trick because I couldn't do what I really wanted. The better way seemed to be to code it in a Java Snippet (I prefer Python) or to use variables in loops. I tried; the problem I couldn't cope with is:

when you try to pass date-time values into flow variables you obtain strings, and I couldn't find out how to convert them back to dates in order to calculate time differences.
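
For what it's worth, in a snippet one can parse such a string back and take differences; in Python it would look like this (a sketch, assuming the ISO format used in the data above):

from datetime import datetime

s1, s2 = "2013-12-21T17:42:20", "2013-12-21T17:47:20"
t1 = datetime.strptime(s1, "%Y-%m-%dT%H:%M:%S")
t2 = datetime.strptime(s2, "%Y-%m-%dT%H:%M:%S")
print((t2 - t1).total_seconds())  # 300.0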

Yes! You did it. I will keep it in my Local directory toolbox!

Hi all, greetings!

I am quite new to KNIME.

I have time series data in minutely form:

2014-03-10T00:00:00 56.0
2014-03-10T00:01:00 56.0
2014-03-10T00:02:00 56.0
2014-03-10T00:03:00 5.0
2014-03-10T00:04:00 154.0
2014-03-10T00:05:00 154.0
2014-03-10T00:06:00 154.0
2014-03-10T00:07:00 154.0
2014-03-10T00:08:00 154.0
2014-03-10T00:09:00 154.0
2014-03-10T00:10:00 56.0

....................

2014-03-10T00:55:00 56.0
2014-03-10T00:56:00 5.0
2014-03-10T00:57:00 5.0
2014-03-10T00:58:00 5.0
2014-03-10T00:59:00 5.0
2014-03-10T01:00:00 5.0
2014-03-10T01:01:00 5.0
2014-03-10T01:02:00 5.0
2014-03-10T01:03:00 5.0
2014-03-10T01:04:00 154.0
2014-03-10T01:05:00 154.0
2014-03-10T01:06:00 154.0
2014-03-10T01:07:00 154.0
2014-03-10T01:08:00 154.0
2014-03-10T01:09:00 154.0
2014-03-10T01:10:00 154.0

the data goes on for 24 hours... Is there a simpler way to resample minutely data into hourly data, with the other column being averaged or interpolated, similar to pandas' resample in Python?
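
For reference, this is what I mean by pandas resample (the file and column names are just placeholders):

import pandas as pd

df = pd.read_csv("minutely.csv", parse_dates=["timestamp"], index_col="timestamp")
hourly = df["value"].resample("1H").mean()  # or .interpolate() instead of .mean()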

Thanks

Please start a new thread for your question; this one is really outdated.

Cheers, Iris