Regarding SMOTE

Hi again. I've been using KNIME these days and I'm quite interested in the SMOTE algorithm. Is it built to consider both nominal and numerical variables when creating new rows?



By design, the algorithm only considers numerical variables, though the implementation also introduces some randomness in the nominal attributes.

So what happens is the following: The algorithm iteratively considers different reference objects (rows) and determines their respective nearest neighbors, using a Euclidean distance computed on the numerical columns only. It then generates a new row based on the reference object, from which it copies all non-numerical columns, while the numerical columns are assigned values that lie on the line between the reference object and its chosen neighbor.

So in the end you get newly generated values in numerical columns and copied values in nominal ones (there is still some randomness in there because the algorithm picks different reference objects, i.e. the new rows will not all have the same nominal value).
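
To make the steps above more concrete, here is a minimal Python sketch of the idea. It is not the KNIME implementation, only an illustration of the description above. The function name `smote_like_rows`, its parameters (`k`, `n_new`, `seed`), the random choice of the reference object, and the sample data are all assumptions made up for this example.

```python
import math
import random

def smote_like_rows(rows, numeric_keys, k=2, n_new=4, seed=0):
    """Generate synthetic rows from a list of dicts (hypothetical helper)."""
    rng = random.Random(seed)

    def distance(a, b):
        # Euclidean distance over the numerical columns only.
        return math.sqrt(sum((a[c] - b[c]) ** 2 for c in numeric_keys))

    synthetic = []
    for _ in range(n_new):
        # Assumption: the reference object is picked at random.
        ref = rng.choice(rows)
        # Brute-force neighbor search: compare against every other row.
        others = [r for r in rows if r is not ref]
        neighbors = sorted(others, key=lambda r: distance(ref, r))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position on the line, in [0, 1)
        new_row = dict(ref)  # nominal columns are copied from the reference
        for c in numeric_keys:
            # Numerical columns are interpolated between reference and neighbor.
            new_row[c] = ref[c] + gap * (neighbor[c] - ref[c])
        synthetic.append(new_row)
    return synthetic

# Made-up minority-class rows: two numerical columns, one nominal column.
minority = [
    {"age": 25.0, "income": 30.0, "city": "Berlin"},
    {"age": 27.0, "income": 32.0, "city": "Zurich"},
    {"age": 40.0, "income": 60.0, "city": "Berlin"},
]
new_rows = smote_like_rows(minority, ["age", "income"])
for row in new_rows:
    print(row)
```

Every generated row carries a `city` value that already exists in the input, while `age` and `income` fall somewhere between two existing rows.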

Please also note that the implementation is somewhat inefficient. The calculation of the neighbors is brute-force and could be sped up with an index structure, but that has yet to be implemented.
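
As an illustration of what such an index could look like, here is a small k-d tree sketch in plain Python. It is only one possible index method and not something taken from KNIME. The names `build_kdtree` and `nearest` and the sample points are made up for this example. It returns only the single nearest point, so a SMOTE-style search would still need to be extended to k neighbors and to skip the query row itself.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree as nested tuples: (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, depth=0, best=None):
    """Return (distance, point) for the stored point closest to target."""
    if node is None:
        return best
    point, left, right = node
    d = math.dist(point, target)
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(target)
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, target, depth + 1, best)
    return best

# Made-up 2-D points standing in for the numerical columns of some rows.
points = [(2.0, 3.0), (5.0, 4.0), (9.0, 6.0), (4.0, 7.0), (8.0, 1.0), (7.0, 2.0)]
tree = build_kdtree(points)
query = (9.0, 2.0)
kd_result = nearest(tree, query)
# Brute-force search over all points, for comparison.
brute_result = min((math.dist(p, query), p) for p in points)
print(kd_result, brute_result)
```

Both searches return the same point, but the tree can skip whole branches, which tends to pay off as the number of rows grows (less so when there are many numerical columns).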

Hope it helps,