AUC for many independent variables

Good evening,

I have a dataset of 300 samples (classified into 2 classes or categories) and 370 continuous variables, or predictors (normalized, centered, and scaled). I want to determine the area under the curve (AUC) of each variable for sample classification. How can I do that without doing it manually for each variable? I guess I need a loop, but I am new to KNIME and have not found such an example on the software's YouTube channel.

Thank you,
Marc

Good morning @MarcB,

it depends a bit on what exactly you want to do. Do you aim to train a model such as Logistic Regression for each variable independently?
I created a small workflow that demonstrates both my understanding of your task and how to solve it: https://kni.me/w/fYJZ8PpflUSTy932

Kind regards,

Adrian

Hi @MarcB

See this forum post: Control columns in input in loop. The example uses a regression model, but it can easily be replaced by a classification model.
gr Hans
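For readers who want to see what such a loop computes, here is a plain-Python sketch of the per-predictor AUC. For a single continuous predictor, the AUC of "predict the positive class when the value is large" equals the Mann-Whitney U statistic normalized by the number of positive/negative pairs, and a univariate logistic regression applies a monotone transform to the predictor, so it yields essentially the same AUC. All names and data below are randomly generated stand-ins for the real dataset (20 predictors shown instead of 370):

```python
# Per-predictor AUC without a model: for one continuous predictor, the
# AUC equals the Mann-Whitney U statistic divided by n_pos * n_neg.
# All data here are randomly generated stand-ins for the real dataset.
import random

def auc(scores, labels):
    """AUC of continuous `scores` against binary `labels` (1 = positive)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

random.seed(0)
labels = [random.randint(0, 1) for _ in range(300)]      # 2 classes
# Standardized hypothetical predictors; only var0 is informative.
data = {
    f"var{i}": [random.gauss(y if i == 0 else 0.0, 1.0) for y in labels]
    for i in range(20)
}
aucs = {name: auc(col, labels) for name, col in data.items()}
print({k: round(v, 2) for k, v in list(aucs.items())[:3]})
```

In the KNIME workflow, the column loop plays the role of the dictionary comprehension here, and the ROC node reports the AUC for each iteration.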

Thank you very much, Adrian and Hans. I understand almost all the steps. I am new to KNIME (I come from Stata) and some nodes are unfamiliar to me, but I will work from these.

On the other hand, is there any way to rank the predictors by their AUC? I want to generate a table with the ranked predictors and their AUCs, from highest to lowest. Is there a node that collects the predictors with their AUCs and ranks them?

Thank you very much in advance,
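In KNIME, a Sorter node applied to the collected predictor/AUC table can produce this ranking; outside KNIME, the step itself is a one-liner. The predictor names and AUC values below are purely hypothetical:

```python
# Rank predictors by AUC, highest first (hypothetical names and values).
aucs = {"age": 0.81, "bmi": 0.55, "marker_x": 0.93, "noise": 0.50}
ranked = sorted(aucs.items(), key=lambda kv: kv[1], reverse=True)
for rank, (name, value) in enumerate(ranked, start=1):
    print(f"{rank}. {name}: {value:.2f}")
```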

Hi Adrian,

I imported the workflow and made minor changes (my own dataset, plus a normalization node to standardize the predictors), but the learner (Logistic Regression) got stuck; see the attached file:

AUC_Loop.knwf (50.3 KB)

How can I solve this?

Thank you,
Marc

Hi Marc,

unfortunately, your workflow doesn’t contain the data, so I can’t debug it on my machine.
You can ship your data with the workflow by putting it into a data folder inside of the workflow folder (if it isn’t there yet, you can just create it).
By getting stuck, do you mean that it fails with an error?
If so, the problem might be that the data are linearly separable, which is an issue for the iteratively reweighted least squares solver I used in the original workflow. The simplest solution would be to switch to Stochastic average gradient in the Logistic Regression Learner dialog.
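The separability issue can be reproduced in a toy setting: on linearly separable data the logistic-regression likelihood has no finite maximum, so an unregularized iterative solver keeps inflating the weights instead of converging. A minimal sketch with hypothetical one-dimensional data:

```python
# On linearly separable data the logistic log-likelihood keeps increasing
# for ever-larger weights, so plain gradient ascent never settles.
# Toy, perfectly separated 1-D data (hypothetical).
import math

xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]                      # separated exactly at x = 0
w = 0.0
for _ in range(2000):
    grad = sum((y - 1 / (1 + math.exp(-w * x))) * x for x, y in zip(xs, ys))
    w += 0.1 * grad                    # gradient stays positive: w diverges
print(w)
```

Regularization (or a solver configuration that includes it) keeps the weights bounded, which is why changing the solver settings can resolve the stall.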

Hi there @MarcB,

see here how to share workflows:

Welcome to KNIME Community!

Br,
Ivan

Hi again Adrian, and thank you.

I cannot attach the data because it is privately owned; is there a way to share it just with you, so that readers can benefit from the discussion without having access to the data themselves? If that is not possible and the data are necessary, I could delete the variable names and some rows, hoping that the issue does not arise specifically from the deleted information.

Anyway, I changed the solver to Stochastic average gradient and the workflow ran until the Joiner (Labs) node, which is reported as missing (KNIME console: WARN MISSING Joiner (Labs) 0:11 Node can’t be executed - Node “Joiner (Labs)” not available from extension “KNIME Base nodes” (provided by “KNIME AG, Zurich, Switzerland”; plugin “org.knime.base” is installed)). How can I install a single node? I am running KNIME 4.1.3 on Windows 10.

Thank you,

Thank you, Ivan. I will adhere to these guidelines in the future.

Best regards,
Marc

Ok, I see. If it worked with the other solver, then I believe there is no further need for me to debug the workflow.
The Joiner (Labs) was added with the 4.2.0 release, so you won’t find it in 4.1.3.
However, you can use the normal Joiner node in its place to get the workflow running.

And how should I configure the Joiner node? Since I haven’t seen the final results, I am not sure what comes out of each preceding node.

Thank you,
Marc

The joiner is used to rejoin the class column which was filtered out in the loop.

Thank you, Adrian. However, the final result shows AUC values ranging from 0 to 1, and almost a third are below 0.50. I cannot identify the problem. I have attached a sample of the dataset with the column names removed; do you know where the problem is?
Thank you very much in advance,

Sample.knwf (1.7 MB)

Hi @MarcB

I think you need to configure the Column Filter again (only keep the probability of the positive class).
(screenshot: Column Filter configuration)
gr. Hans

Thank you Hans. I updated the node, but it did not change the output: there are still many AUC values below 0.50, and some of them are even 0.00.

The output changed after configuring the Column Filter. Maybe you have to make some adjustments to your LogReg Learner? Why are you not satisfied with this outcome? What did you expect it to look like?

I would expect all AUC values to be between 0.5 and 1. While some may be around 0.5 (the classifier is no better than chance), I would not expect any of them to be 0 (the predicted classification is always wrong).

Hi Adrian (other inputs are obviously welcome!),

I reviewed the workflow and did not identify any obvious mistake. However, the results show that approximately a third of all AUC values are below 0.50, with more than 30 predictors at an AUC of 0.00 for the first dataset, which seems very unusual. The same occurs with either of the two additional datasets (included): all three datasets use the same predictors, but the variable holding the true class, in the first column of each dataset, differs in each Excel file. Is there anything wrong in the workflow?

I am using KNIME 4.1.3 on Windows 10. I would prefer not to update KNIME, to avoid problems with my existing workflows, unless it is necessary.

Sample.knwf (3.4 MB)

Thank you very much in advance,
Marc

Hi Marc,

I just realized that HansS’ answer might be a bit confusing to you. The screenshot shows that only the class prediction is kept, but that is actually the issue with the workflow. What you need to keep is the probability of the positive class (one of the P (Col0= columns).
That is the first bug, but there is a second one: the Joiner needs to be configured to append only the class column (Col0). To do this, go to the Column Selection tab in the node dialog and move all columns except Col0 to the left.
With these changes, a substantial share of the AUCs becomes 1, but there is still a good number of features with lower AUCs, some of them even below 0.5.

Cheers,
Adrian
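A quick way to see why scoring the wrong column produces AUCs of 0: ranking by the probability of the wrong class means ranking by 1 − p, which mirrors the ROC curve and reports 1 − AUC, so a perfectly separating predictor shows 0 instead of 1. A pure-Python sketch with toy, hypothetical data:

```python
# If the ROC is computed on the wrong class's probability column, the
# scores are effectively 1 - p and the reported AUC becomes 1 - AUC.
# Toy data below are hypothetical.
labels = [1, 1, 1, 0, 0, 0]
p_pos  = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]   # P(class = positive)
p_neg  = [1 - p for p in p_pos]           # P(class = negative)

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc(p_pos, labels))  # 1.0: perfect ranking
print(auc(p_neg, labels))  # 0.0: same predictor, wrong column
```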

Thank you Adrian, I will incorporate these changes and come back with feedback.
Best regards,
Marc