Auto-Calculating & Populating Missing Data In A Table

creedssmith · November 22, 2023, 10:47pm

Using monthly home sale data from to 2009 to 2023 for each of 82 separate neighborhoods, there are months with zero sale data. See Image.

As you can see, the range of months of missing data varies greatly (zeros are there simply because I used the Missing Value node, which can just as easily be converted back to “?”).

I can’t use the pre-formatted functions in the Missing Value node, such as Mean, Median, or Moving Average, as the range of years’ data and sporadic zeros would create huge errors. Next Value or Previous Value won’t work due to the unevenly placed zeros.

I’m thinking that perhaps the Rule Engine or Math Formula nodes might be able to help me “average” the previous non-zero value with the next available non-zero value, but have not come up with a way to automatically seek the non-zero value in each direction since there may be 1-6 or more zeros clustered in the column.

Any suggestions are appreciated. A single neighborhood of data is attached in excel format

~$temp.xlsx (165 Bytes)

Thanks!

mlauber71 · November 23, 2023, 6:58am

@creedssmith one thing you could do is try the R/Rstudio package Amelia to have a more complex importation of values

I have a python project using several packages somewhere unfinished…

Also there was this suggestion

AlexanderFillbrunn · November 23, 2023, 8:18am

Hi,
What you can also do is impute missing values using a regression tree or linear regression. You split your data into those rows with missing and those without missing mean ClosePrice. Then you train your model on the table where the values exist to predict the ClosePrice. In the other table, you remove the Mean(ClosePrice) column using a Column Filter (its only missing values anyways) and then let the predictor node (for regression tree or linear regression) recreate it with predicted values. This might work better, especially since your averaging might be out of whack when you have a missing value where the previous or next non-missing value is in a different neighborhood.
To make sure the model you train is any good, you could perform a simple cross validation (https://www.youtube.com/watch?v=kQXxf_VX1ww) and check how much the values are off before you actually employ the model for imputation.
Kind regards,
Alexander

creedssmith · November 23, 2023, 4:38pm

@AlexanderFillbrunn, thanks for the advice.

Using the Row Splitter, I sent the populated “mean close price” data to the Simple Regression Tree Learner and the dates missing MCP data to the Column Filter and then the Simple Regression Tree Predictor. I included the “Year”, “Month”, and “Quarter” columns and removed the “Mean Close Price” and all other columns. These are however, the columns I did send to the Learner with the populated data.

However, I get a Console message from the Predictor node “Required data column “Mean(ClosePrice)” does not exist in table” when there is no MCP column. The Predictor node will not execute.

So, I reset the Column Filter to allow the MCP column, used the Rule Engine to set the MCP column to “blank” for each of the prior months with missing data, converted the Rule Engine output column from string to integer using the String to Number node, and was then able to run the Regression Tree Predictor for output.

Does this all seem fairly appropriate relative to what you were thinking, or did I go off-course?

I will start checking the quality of the output a bit later.

BTW, I ran just one neighborhood. So I will have to figure out how to loop the full series of 86 neighborhoods to each run independently and accumulate the results.

Thanks!

AlexanderFillbrunn · November 23, 2023, 4:55pm

Hi,
Did you select the Mean(ClosePrice) column as target column in the Simple Regression Learner? And it should be excluded automatically, but can you make sure it is also not in the list of feature variables in the same node? Then the predictor should definitely not need the Mean(ClosePrice) to be included in the table.
For doing this for multiple neighborhoods, you can use the Group Loop Start node with a Loop End and your imputation flow in-between.
Kind regards,
Alexander

Daniel_Weikert · November 23, 2023, 5:44pm

Why not using a linear interpolation. Missing value seems to support that
br

AlexanderFillbrunn · November 24, 2023, 8:17am

Hi @Daniel_Weikert,
When there are multiple missing values in a row, linear interpolation will assign different values to each like so:
1, ?, ?, 4 → 1, 2, 3, 4
But the data does not seem to be sorted in a fashion where this makes sense. It is not a time series after all.
Kind regards,
Alexander

creedssmith · November 24, 2023, 11:27am

@AlexanderFillbrunn, yes you are correct, I must have done something wrong the first time, so recreated the regression series nodes and bypassed the two steps I had added earlier.

I am moving on to testing the quality, then the loops.

Thanks for the help !

creedssmith · November 24, 2023, 11:47am

@AlexanderFillbrunn, I don’t want to burn too much of your time, but just a quick visual of the known Mean Close Price and the Predicted Mean Close Price seem a bit weird. Should the prediction essentially repeat the same values it’s using for the prediction?

Please see the known data on the left and the predicted data on the right. Thanks

AlexanderFillbrunn · November 24, 2023, 12:10pm

Hi,
This does not seem so bad. What I would do: separate the dataset where you do know the MCP into a training (70%) and test (30%) set. Then you train your model on the 70% and apply it with the predictor on the 30%. Then you can compare the actual numbers and see how much you are off.
Kind regards,
Alexander

creedssmith · November 24, 2023, 1:17pm

@AlexanderFillbrunn I will do so. Thanks for all of you help

marzukim · November 25, 2023, 12:06pm

just my approach which is may not be the best solution, and unsure if it meets your expectations.
the attached workflow calculates the mean within cluster zero/missing values.

KNIME_mean-MM.knwf (119.2 KB)
rgds
linux knime 5.1.x

creedssmith · November 25, 2023, 3:23pm

@marzukim, WOW. I tracked your progression of each step allowing for incremental adjustments and a final outcome presenting a gradual trend where there was a great deal of missing data and substantial jumps in the values available, both up and down from previous and following known data.

@mlauber71 @AlexanderFillbrunn @Daniel_Weikert I wish to thank you and each other contributor here for your ideas. If I ever get half as smart as you guys, I may just accomplish this project which is far above my abilities

creedssmith · November 26, 2023, 6:39pm

@marzukim, I’m simply confirming your solutions. The process ending with the Column Merger node is a second option and the process ending with the Missing Value node is the first option. I don’t see where you ever combined the results, so I believe you offered two separate solutions.

Thanks again, creed

marzukim · November 27, 2023, 12:23am

yes, my approach employs the workflow route that ended in column merger nodes. the second approach, i’m using recommended approach by Daniel, utilizes the linear interpolation option. additionally, Markus and Alexander have also proposed the solutions. it is up to your preference to choose the method that best suits your requirements.

happy knimining …
rgds.

system · February 25, 2024, 12:24am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.