I have a table coming from a database where there are measures every 5'; but in some cases, some measures are missing (=> full rows are missing)
Does somebody know if there is a way to generate rows in order to have all the measures interpolationg the missing values with the adjacent values?
Here's one approach. Note: this approach assumes the time values are in exact intervals of 5 minutes. If not, you will have to modify this or find a different approach.
1) Start by converting your date and time values to Long values so that you can perform linear regression where time is the independent variable and your measurement is the dependent variable (I include a Java Snippet to do this in the attached workflow). (Note: java has a method to convert date/times to milliseconds since Jan 1 1970 GMT).
2) Use a Linear regression node to perform the regression (assuming your response is linear with time, if not use an appropriate regression model)
3) Next generate a separate table containing a Date And Time column with rows for every time value you expect to have in your data table, even missing time values. Convert these Date and Time values to Long values just as you did before with your data table.
4) Next join the two tables, keeping all values from the table created in step 3 and joining to rows in your data table.
5) Use a Regression Predictor node to apply the parameters from the linear regression of step 2 to the joined table in step 4.
6) Finally, use a Column Merger node to keep measurements from the original table unless the value was missing, in which case use the predicted (calculated) measurement value from the regression.
Sample workflow attached.
the above method relies on a predictive relationship with time for the measurement. If there isn't one, an alternative is:
If your table doesn't already having missing rows from time points where no data was collected then generate a new table with all the possible time values using Time Generator node. Take your main table and join it into this generated table using joiner node, use time columns as the joining columns, and use a full outer join. Make sure your main table time column is in the same time format first. you should now have missing rows where there was no data captured.
now use the lag column node and apply lag interval of +1 on your measurement column,
now use another lag column node and apply lag interval of -1 on your measurement column.
now use maths formula node to take the average of these lag columns.
now use column merger node, choose your original data column as primary column and the averaged data from the lag columns as secondary column. So this puts in the averaged data from the lag columns if the original measured data is missing.
finally a column filter node to remove those lag columns and merged lag column.
hope that gives the desired output.
If there are places where there are 2 or more consecutive missing data points, then you will need to increase the lag intervals to more than 1.
Hi Don & Simon
Thanks a lot for these 2 methods where the core looks to be similar in the 2 cases; I need now to check if the linear regression model is ok for the case or if I have to use the lag column node.
I will keep you informed.
We will publish a time series missing value node with the next release :-)
It will give you the possiblity to chose between last, next, average and linear method for resolving missings, stay tuned.
@dnaki: Nice workflow!!
Did you know that you can use the Time Difference Node instead of your Java Snippet? Okay, I have to admit, the result will be negative instead.
Thanks for the tip, Iris! In general, I prefer to avoid the use of a Java Snippet if possible when developing workflows for others.
Thanks to all for your help.
Finally, I have used Don's workflow in which I have replaced the linear regression by Simon idea to use the lag columnnode (btw, it works only for positive values fot the lag interval and I had to sort the table twice to get the lag of +1 and -1;
It looks that the new node described by "Iris" fits with my needs, and I will replace the current workflow by this node as soon as it is available. The current method works well if 1 or 2 rows are missing ; if more, it needs to be improved.
You are correct, the Lag column only seems to support +1 or higher, strangely it doesnt support negative values, which seems a strange omission.
Anyone from the KNIME Team reading... can this feature be added, to allow Lag Columns in the reverse direction. I would assume this would be programatically straight forward?
@Simon: I logged it in our bug tracing system, but unfortunately not for 2.10.
To finalize the workflow , as I have more than 20 columns on which I have to apply the described process, I thought using Column List Loop but it seems difficult to access the column content in a java snippet when we have the name or the index; I have seen in a previous post that Dynamic Column Access should be available in the release 2.7 but I did not find it; do you have an idea to avoid repeating 20 times the workflow?
with 2.10 we published the Time Series Missing Value node. It does work on multiple columns at once.
(if you want to uuse the other flow take a look into the example workflow 01006_loopColumnsManipulateEach. There it is demonstrated with the math formula node, how the colums are set to a unique name and renamed afterwards)