I am trying to find the best classification model for accident crash data. I developed this workflow in KNIME, and it gives an accuracy of 100% for the different models I tried (SVM, RF, MLP).
Please point out where I am making a mistake, because 100% accuracy is rarely possible.
You could check which variables mostly drive the results. It is possible that you have some sort of data leak, where a variable other than the target explains all of the results by itself.
You could try to get the variable importances (the H2O.ai Random Forest Learner provides them) and see which variables have the most influence on your result.
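Outside KNIME, the same check takes a few lines of scikit-learn. This is only a sketch: the file name "accidents.csv" and the target column name are assumptions based on your description, not your actual data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# "accidents.csv" and the column names are assumptions, not the actual file.
df = pd.read_csv("accidents.csv")
y = df["Accident Severity (Two Classes)"]
X = pd.get_dummies(df.drop(columns=["Accident Severity (Two Classes)"]))

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# A single feature with importance close to 1.0 is a strong hint of leakage.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```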
I had a look at your data, and the column “Total Number of Deaths” exactly matches the column to predict, “Accident Severity (Two Classes)” (with different constant values, but this doesn’t matter). This is why you get 100% accuracy.
An easy way to check for this kind of problem (among others, such as correlation between variables) is to train a Decision Tree Learner and look at the first split of the decision tree obtained after training, as shown in the snapshot below:
The workflow is almost the same as the one you uploaded. I have just added the -Decision Tree- node and a -String Manipulation- node to convert the “Accident Severity (Two Classes)” variable to a nominal one.
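The same first-split check can be reproduced in code. A depth-1 decision tree is enough: if its single split already separates the two classes perfectly, the splitting variable is the leak. A scikit-learn sketch with the same assumed file and column names:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("accidents.csv")  # assumed file name
y = df["Accident Severity (Two Classes)"].astype(str)  # treat target as nominal
X = pd.get_dummies(df.drop(columns=["Accident Severity (Two Classes)"]))

# A depth-1 tree is enough: we only care about the first split.
tree = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```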
When I try to plot the ROC curve for Random Forest (RF) classification, the ROC Curve node offers the proper prediction columns to include and plots the ROC,
but in the case of SVM (support vector machine), the prediction column for “Accident Severity (Two Classes)” is not shown in the left-hand column, so I cannot move it to the right-hand column and plot the ROC.
What is the procedure, or an alternative, for drawing a ROC curve when using a classification model other than RF? I need to compare different models (RF, SVM, LR, ANN, etc.) and select the model with the best predictions and accuracy.
I had a look at the SVM Predictor configuration, and there is an option that you need to check to generate the probabilities for every class. I have modified the workflow accordingly to be able to generate the ROC curves. I have also added normalization of the variables, since this is recommended when working with an SVM model. The workflow now looks as shown in the screenshot below.
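For comparison only (this is not the KNIME workflow itself), the same pipeline, that is, normalization, an SVM that outputs class probabilities, and a ROC curve, can be sketched in scikit-learn; file and column names are again assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score

df = pd.read_csv("accidents.csv")  # assumed file name
y = df["Accident Severity (Two Classes)"]
X = pd.get_dummies(df.drop(columns=["Accident Severity (Two Classes)"]))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# probability=True plays the role of the KNIME option: without it the SVM
# only outputs hard class labels, and no ROC curve can be drawn.
model = make_pipeline(StandardScaler(), SVC(probability=True))
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]  # P(class = model.classes_[1])
fpr, tpr, _ = roc_curve(y_test, proba, pos_label=model.classes_[1])
print("SVM AUC:", roc_auc_score(y_test, proba))
```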
Please be aware that you can store the data inside the workflow and upload it here with the workflow already executed; this makes it easier to run your workflow. I have therefore created a folder called “_LOCAL_DATA” inside the workflow and put your file in it. The way to point the -CSV Reader- node to this file is as follows:
The URL should start with “knime://knime.workflow/_LOCAL_DATA/” to indicate that the data is under the workflow-local folder “_LOCAL_DATA”.
With respect to your question:
Not all machine learning models generate probabilities associated with the classes, and how they do so depends on the ML algorithm. You’ll find this probability information in most tree-based ensemble models, and you can also get it from ANNs. The way you obtain this information, when available, differs between methods.
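As a scikit-learn illustration of this point (again a sketch with assumed file and column names): tree ensembles expose probabilities directly, while a plain linear SVM only exposes a decision function, whose scores can still be used to draw a ROC curve:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("accidents.csv")  # assumed file name, as above
y = df["Accident Severity (Two Classes)"]
X = pd.get_dummies(df.drop(columns=["Accident Severity (Two Classes)"]))
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tree ensembles expose class probabilities directly via predict_proba.
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("RF AUC:", roc_auc_score(y_test, rf.predict_proba(X_test)[:, 1]))

# LinearSVC has no predict_proba; its decision_function (signed distance to
# the separating hyperplane) still ranks examples, which is all a ROC needs.
svm = LinearSVC().fit(X_train, y_train)
print("SVM AUC:", roc_auc_score(y_test, svm.decision_function(X_test)))
```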