I have a tricky question, but I'm hoping someone can answer it.
Problem (in brief): I do not understand how the Logistic Regression Learner node handles predictor variable categories.
Problem (complete): I am using the Logistic Regression Learner node to run a multivariate analysis on two datasets (same variable list, different values). I have a dependent variable (nominal, values 0 or 1) and two independent variables (ordinal, domain 1,2,3). There are no missing values. All variables are strings (i.e., KNIME nominal/categorical). When the analysis runs, the reference category is the last value of the dependent variable's domain (1=success). So far so good...
If I run the analysis on dataset A and dataset B, KNIME generates regression coefficients and statistics using independent variables (y1,y2) with values in categories 2 and 3. Again, so far so good...
Here's the kicker: to get measures for the independent variables with values in category 1, I copy the Logistic Regression Learner node and check "Use order from column domain...First value is chosen as reference for dummy variables." Dataset A generates results for categories 1 (accurate) and 2, while Dataset B wrongly generates results for categories 1 (inaccurate) and 3.
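To make the terminology concrete, here is a minimal Python sketch of dummy coding where the first value of the domain order is treated as the reference category. It is only an illustration of the general technique, not KNIME's actual implementation; the function name `dummy_columns` is hypothetical, and the sample values are the first eight values of Var_Ay1 listed in this post.

```python
def dummy_columns(values, domain):
    """Build 0/1 indicator columns for every category except the reference.

    Assumption for illustration: the FIRST entry of `domain` is the
    reference category, so it gets no column (and hence no coefficient).
    """
    reference = domain[0]
    return {
        category: [1 if v == category else 0 for v in values]
        for category in domain
        if category != reference
    }


# First eight values of Var_Ay1 from this post, as strings (nominal).
var_ay1 = ["3", "3", "3", "3", "3", "1", "2", "3"]

ascending = dummy_columns(var_ay1, ["1", "2", "3"])   # reference is "1"
reversed_ = dummy_columns(var_ay1, ["3", "2", "1"])   # reference is "3"

print(list(ascending))  # categories that would get a coefficient
print(list(reversed_))
```

Under this assumption, the categories that receive coefficients depend only on the domain order passed in, not on how the rows are sorted.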
Again, the domain for both independent variables is (1,2,3). The first 16 values for all four variables (Ay1,Ay2,and By1,By2) are as follows...
Var_Ay1 ("RDNG..."): 3,3,3,3,3,1,2,3,3,3,3,3,3,2,2,2
Var_Ay2 ("MATH..."): 3,1,2,1,3,1,3,3,3,2,1,3,3,1,3,1
Var_By1 ("RDNG..."): 3,3,2,3,2,2,3,2,3,3,2,3,2,2,1,3
Var_By2 ("MATH..."): 2,3,2,1,1,1,2,1,3,3,1,3,3,1,1,3
Sort ascending by Var_By1 ("RDNG..."), then sort ascending by Var_By2 ("MATH...")
Var_By1 ("RDNG..."): 1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2
Var_By2 ("MATH..."): 1,1,1,1,1,1,3,3,3,1,1,1,1,1,1,1
Sort ascending by Var_By2 ("MATH..."), then sort ascending by Var_By1 ("RDNG...")
Var_By1 ("RDNG..."): 1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2
Var_By2 ("MATH..."): 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,...2
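One thing that may be worth checking is the order in which each category first appears in the unsorted data, in case the stored column domain follows that order rather than 1,2,3. Whether KNIME derives the domain order this way is an assumption that has not been verified here. The sketch below (the helper name `first_appearance_order` is hypothetical) just computes that order for the first 16 unsorted values of each variable listed above.

```python
def first_appearance_order(values):
    """Return the distinct values in the order they first occur."""
    # dict preserves insertion order, so this de-duplicates while keeping order.
    return list(dict.fromkeys(values))


# First 16 unsorted values of each variable, copied from this post.
samples = {
    "Var_Ay1": [3, 3, 3, 3, 3, 1, 2, 3, 3, 3, 3, 3, 3, 2, 2, 2],
    "Var_Ay2": [3, 1, 2, 1, 3, 1, 3, 3, 3, 2, 1, 3, 3, 1, 3, 1],
    "Var_By1": [3, 3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 3, 2, 2, 1, 3],
    "Var_By2": [2, 3, 2, 1, 1, 1, 2, 1, 3, 3, 1, 3, 3, 1, 1, 3],
}

orders = {name: first_appearance_order(vals) for name, vals in samples.items()}

for name, order in orders.items():
    print(name, order)
```

If the domain shown in the node's output spec matches these orders instead of 1,2,3, that would suggest the "first value" used as reference differs between variables and datasets.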
Just to confuse matters further, I've tried adding an Edit Nominal Domain node before the Logistic Regression Learner node. If I configure Var_By1 and Var_By2 to have a reversed domain order (3,2,1), the results from the Logistic Regression Learner node with the option checked ("Use order from column domain...First value is chosen as reference for dummy variables.") are correct: they report categories 1 and 2 for Var_By1 and Var_By2. But this doesn't work when I try to reproduce it for Var_Ay1 and Var_Ay2.
To conclude: I know I'm missing something obvious here. But I want to understand the process -- and specifically how I should sort/order the data (or even the workflow) so I can reliably perform analyses using different data.