I have a dataset of 300 samples (classified into 2 classes or categories) and 370 continuous variables or predictors (normalized, centered and scaled). I want to determine the area under the curve (AUC) of each variable for sample classification. How can I do that without having to do it manually for each variable? I guess I need a loop, but I am new to KNIME and have not found such an example on the software's YouTube channel.
It depends a bit on what exactly you want to do. Do you aim to train a model such as Logistic Regression for each variable independently?
I created a small workflow that demonstrates both my understanding of your task and how to solve it: https://kni.me/w/fYJZ8PpflUSTy932
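For readers who want to see the idea outside KNIME, here is a minimal Python sketch of "one AUC per variable". It is not the linked workflow: it skips the logistic regression and ranks samples by the raw variable values. For a single predictor the fitted probability is a monotone function of that predictor, so the AUC is the same (or 1 minus it if the coefficient is negative). The column names and data are made up for illustration.

```python
def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs in which the
    positive sample has the higher score; ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 6 samples, 3 predictors, class in `labels`.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "var_a": [0.1, 0.2, 0.3, 0.7, 0.8, 0.9],  # separates the classes
    "var_b": [0.9, 0.8, 0.7, 0.3, 0.2, 0.1],  # separates them, inverted
    "var_c": [0.1, 0.8, 0.5, 0.5, 0.2, 0.9],  # weak predictor
}

# The "loop": one AUC per column, collected into a single table.
auc_per_variable = {name: auc(values, labels)
                    for name, values in columns.items()}

for name, value in auc_per_variable.items():
    print(f"{name}\t{value:.3f}")
```

The pairwise comparison is quadratic in the number of samples, which should be fine for 300 rows and 370 columns.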
Thank you very much, Adrian and Hans. I understand almost all steps. I am new to KNIME (I come from Stata) and some nodes are unfamiliar to me, but I will work from these.
On the other hand, is there any way to rank the predictors based on their AUC? I want to generate a table with the ranked predictors and their AUC, from highest to lowest. Is there any node that collects the predictors with their AUC and ranks them?
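Conceptually, the ranking is just a descending sort of the collected (predictor, AUC) table. A small Python sketch of that step, with made-up predictor names and AUC values (in KNIME, a Sorter node applied to the loop output would presumably play the same role):

```python
# Hypothetical collected results: predictor name -> AUC (values made up).
auc_by_predictor = {"var_a": 0.91, "var_b": 0.55, "var_c": 0.78}

# Sort from highest to lowest AUC.
ranked = sorted(auc_by_predictor.items(), key=lambda kv: kv[1], reverse=True)

for rank, (name, value) in enumerate(ranked, start=1):
    print(f"{rank}\t{name}\t{value:.2f}")
```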
Unfortunately, your workflow doesn’t contain the data, so I can’t debug it on my machine.
You can ship your data with the workflow by putting it into a data folder inside of the workflow folder (if it isn’t there yet, you can just create it).
By getting stuck, do you mean that it fails with an error?
If so, then the problem might be that the data is linearly separable which is an issue for the Iteratively reweighted least squares solver I used in the original workflow. The simplest solution would be to switch to Stochastic average gradient in the Logistic Regression Learner Dialog.
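To illustrate why linearly separable data can cause trouble: the likelihood keeps improving as the coefficient grows, so there is no finite optimum for a solver to converge to. The sketch below uses plain gradient ascent on a made-up one-variable dataset with no intercept. It is neither KNIME's Iteratively reweighted least squares nor its Stochastic average gradient implementation; it only shows the coefficient drifting upward the longer the fit runs.

```python
import math

def fit_weight(xs, ys, steps, lr=0.5):
    """Gradient ascent on the logistic log-likelihood, single weight,
    no intercept and no regularization (illustration only)."""
    w = 0.0
    for _ in range(steps):
        grad = sum((y - 1.0 / (1.0 + math.exp(-w * x))) * x
                   for x, y in zip(xs, ys))
        w += lr * grad
    return w

# Made-up, perfectly separable data: negative x -> class 0, positive x -> class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w_short = fit_weight(xs, ys, steps=100)
w_long = fit_weight(xs, ys, steps=10000)
print(w_short, w_long)  # the longer run ends at a larger weight
```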
I cannot attach the data because it is privately owned. Is there a way to share it just with you, so that readers can benefit from the discussion without having access to the data themselves? If not, and if it is necessary, I may delete the variable names and some rows, hoping that the issue did not arise specifically from the deleted information.
Anyway, I changed the solver to Stochastic average gradient and the flow worked until the Joiner (Labs) node. It says that this node is missing (KNIME console: WARN MISSING Joiner (Labs) 0:11 Node can’t be executed - Node “Joiner (Labs)” not available from extension “KNIME Base nodes” (provided by “KNIME AG, Zurich, Switzerland”; plugin “org.knime.base” is installed)). How can I install a single node? I am running KNIME 4.1.3 on Windows 10.
Ok, I see. If it worked with the other solver, then I believe there is no further need for me to debug the workflow.
The Joiner (Labs) was added with the 4.2.0 release, so you won’t find it in 4.1.3.
However, you can also use the normal Joiner node in its place, to get the workflow running.
Thank you, Adrian. However, the final result shows AUC values ranging from 0 to 1, and almost 1/3 of them are below 0.50. I cannot identify the problem. I have attached a sample of the dataset with the column names removed. Do you know where the problem is?
Thank you very much in advance,
The output changed after configuring the ColumnFilter. Maybe you have to make some adjustments to your LogReg Learner? Why are you not satisfied with this outcome? What did you expect the outcome to look like?
I would expect all AUC values to be between 0.5 and 1. While some may be around 0.5 (the classifier is no better than chance), I would not expect any of them to be 0 (the predicted classification is always wrong).
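As a side note on what an AUC below 0.5 means: it says that the score ranks the classes in the wrong order, not that the variable is uninformative. Flipping the score (or scoring against the other class) turns an AUC of a into 1 - a, so an AUC of exactly 0 is a perfect ranking pointed the wrong way. A minimal sketch with made-up scores:

```python
def auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]        # made-up scores for class 1

a = auc(scores, labels)
flipped = auc([1 - s for s in scores], labels)  # same scores, inverted

print(a, flipped)  # the two values sum to 1
```

Many values far below 0.5 therefore tend to point at a mix-up in which column or which class is being scored, rather than at hundreds of genuinely "wrong" predictors.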
I reviewed the workflow and did not identify any obvious mistake. However, the results show that approximately 1/3 of all AUC values are <0.50, and more than 30 predictors have an AUC of 0.00 for the first dataset, which seems very unusual. The same occurs when either of the two additional datasets (included) is used (all three datasets use the same predictors, but the variable holding the true class, in the first column of each dataset, differs in each Excel file). Is there anything wrong in the workflow?
I am using KNIME 4.1.3 on Windows 10. I would prefer not to update the KNIME version, to avoid problems with my existing workflows, unless it is necessary.
I just realized that HansS’ answer might be a bit confusing to you. The screenshot shows that only the class prediction is kept, but that’s actually the issue with the workflow. What you need to keep is the probability of the positive class (one of the P (Col0= columns).
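To see why the probability column matters, here is a small Python sketch with made-up numbers (not taken from the attached data). Thresholding the probabilities into hard class predictions throws away the ordering within each predicted class, so the AUC computed from the predictions is generally different from, and here lower than, the AUC computed from the probabilities.

```python
def auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1, 1]
prob_pos = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]          # hypothetical P(class = 1)
hard_pred = [1 if p >= 0.5 else 0 for p in prob_pos]  # class prediction only

auc_prob = auc(prob_pos, labels)   # uses the full ranking
auc_hard = auc(hard_pred, labels)  # only two distinct score values remain

print(f"{auc_prob:.3f} {auc_hard:.3f}")
```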
That’s the first bug, but there is a second one: the Joiner needs to be configured to append only the class column (Col0). To do this, go to the Column Selection tab in the node dialog and move all columns except for Col0 to the left.
With the described changes, a substantial part of the AUCs becomes 1, but there is still a good number of features with a lower AUC, some of them even below 0.5.