Thanks in advance for reading and maybe even solving my question that I could not resolve on my own and did not find a (similar) solution in the forum (yet…).
Here we go: I do have a data set with a data quality issue. I do have items (A, B, C, …) over time (every month) with some attributes (a,b,c, d, …). Correctly this attribute exists for every month the item exists. In reality people enter the attributes in the first month and forget sometimes in following ones. In addition, the item may start to exist in one month, keeps existing for some months and may disappear and potentially even reappear in months further down the line, often then with different attributes.
I tried to provide a simplified version of the real data in the attached workflow. I tried “guessing” the missing attributes from previous months with lag columns within two loops (one on the item, the inner one on the year-month) but failed…
One idea could be to employ a program to impute the missing values. Some time ago I set up an example using the R package Amelia.
Please also note the additional links in the example. It is possible that in the meantime there are more advanced methods available. Also success might very much depend on how regular your missing numbers are missing and if a pattern can be established.
Thanks, @mlauber71, for that approach. However, there is a certainly logic to fill the missings in my case. As the attributes are always available when an item appears the first time, I can base the algorithm on that fact and make it more a deterministic thing rather than guessing.
Something like this (in a kind of pseud-code):
for each item
for each year-month
if missing(attribute(item, current year-month)) AND not missing (item (current year-month)
then attribute (item, current year-month) = attribute (item, previous year-month)
end
end
While typing I guess what I am missing is how to set this up in KNIME and esp. how to get the attribute-value of the previous month.
To shift the date back a month a year or a certain time. I would suggest you make yourself a table with the values from the time before and join that back to your current values.
Then you can use the Rule Engine to replace the missing values