Optimisation of SVM parameters

Dear Keerthan,
There are a few things I wish to ask you, but I forgot to include them in yesterday’s reply:

  • On (c) (Naïve Bayes), there was a larger difference between the accuracy reported in the “Best parameters” Component results:
    [screenshot]
    and the accuracy calculated directly when I apply only those same “best parameters” (i.e., with no loops) to the Learner node, in a “manual check”:
    [screenshot]

  • On the SVM, there was a similar difference, but this time much smaller:
    0.2834 (from “Best parameters”) versus 0.2546 (from a manual check with the same parameters).

  • On the MLP, the difference was:
    0.7165 (from “Best parameters”) versus 0.5454 (from a manual check with the same parameters).

  • And on the PNN, the situation was worse:
    0.2992 (from “Best parameters”), but the “Minimum standard deviation” (the Theta Minus) = 0.976, versus the “Threshold standard deviation” (which I expected to be the Theta Plus, but it is higher than the former, and so it is not applicable). How can this be?
    [screenshot]

Would you mind helping me understand the origin of these differences?

Thanks once again for all your effort in this “Evangelism task”.
B.R.,
Rogério.

Hi @rogerius1st ,

@k10shetty1 is on vacation; that is why he is not answering.

Please consider this, however:

The framework we put in place is really general, but the results you get mostly depend on:

  • Your data, that is, the distribution of the features in the sample of data you are training the model with. The sample, however big (and yours is really small), is just an approximation of the reality it describes, of which we are not experts and cannot help you with.
  • The model trained. Each model you have trained has its own training procedure, and those parameters control that training. In order to answer your questions we would need to recall the details of each model’s training and work through what each of these parameters means.
    You did not select one model, you selected many, and this multiplies the effort of understanding the various algorithms and what each of those parameters controls. Moreover, each model might need a different kind of data preparation.

To help you we would need to:

  • Understand your data.
  • Recall how the models are trained and what each parameter means.

Please notice, however, that you are asking us to perform a gigantic task.
This task would not even work if the results you are getting are simply due to chance, as they are highly dependent on how you partition train and test.
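To make that partitioning point concrete, here is a minimal sketch, in Python/scikit-learn rather than KNIME and on a placeholder dataset of a similar size (not your data, and the SVM here is only a stand-in for any of your models): the same model with the same fixed parameters is scored on ten different train/test splits.

```python
# Illustrative sketch only: same model, same parameters, different splits.
# The dataset and model are placeholders, not the data from this thread.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)  # 178 rows, a similar order of magnitude

scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed, stratify=y
    )
    model = SVC(C=1.0, gamma="scale")  # fixed "best" parameters
    scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

print("accuracy per split:", [round(s, 3) for s in scores])
print("spread:", round(max(scores) - min(scores), 3))
```

With only a couple of hundred rows, a noticeable spread between splits is normal, so differences between a cross-validated number and a single manual check can easily come from the partitioning alone.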

The approach we can provide guidance on is a bit more basic, but more in line with the work of many business data scientists in this domain: try and try and try, then pick the combination which gives the best results (after proper validation) and explain the model which performs best with XAI or by analyzing the structure of that model (for example, decision trees and logistic regression are interpretable after training).
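As an aside, this is what “analyzing the structure of that model” can look like outside KNIME; a small sketch, assuming scikit-learn and a placeholder dataset, where the rules of a trained decision tree are simply printed out.

```python
# Sketch: an interpretable model can be read off directly after training.
# Placeholder data, not the data from this thread.
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_wine(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# The learned rules themselves are the explanation.
print(export_text(tree, feature_names=load_wine().feature_names))
```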

When no attempt increases the performance, usually this has nothing to do with a single step (such as parameter optimization) but rather with something more fundamental, like how the data was collected and whether the task at hand is even feasible.

In your case you can achieve good performance even with little data if you select the right model, rather than performing parameter optimization. It seems to me that without proper data preparation many of the models you selected won’t train, no matter what parameters you adopt.

Given this, please consider understanding (on your own) what each of those parameters means and reading papers on how they should be optimized and/or how the data should be prepared (for example, MLP requires normalization, and so on).
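For the normalization example specifically, here is a hedged sketch (scikit-learn with a placeholder dataset, not your workflow or your data) of how much the same MLP, with the same parameters, can change when the features are standardized first.

```python
# Sketch of the "MLP requires normalization" point. Same model, same
# parameters; the only change is whether features are standardized.
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0)
scaled = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000, random_state=0),
)

print("MLP without normalization:", cross_val_score(raw, X, y, cv=5).mean().round(3))
print("MLP with normalization:   ", cross_val_score(scaled, X, y, cv=5).mean().round(3))
```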

If you can’t do the above on your own, please consider the AutoML components (blog post), which do all these steps automatically for you, optimizing as much as possible within a reasonable amount of time.

Given all of that:

  1. I downloaded your workflow.
  2. I used the same Partitioning node for all models; that is the very least we can do before comparing performances (even if you use the same seed and settings in every Partitioning node, it would be a pain to control and change them in all their occurrences). See the sketch after this list.
  3. I added the AutoML component and compared its output to yours.
  4. The best result I could find this way is Gradient Boosted Trees with an accuracy of 100% on the validation set!
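Regarding point 2, the idea of freezing one split and reusing it for every model could look like this outside KNIME; the models below are placeholders standing in for the ones in your workflow, and the dataset is a stand-in as well.

```python
# Sketch: evaluate every candidate model on exactly the same split,
# otherwise the comparison mixes model differences with partitioning luck.
from sklearn.datasets import load_wine
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "MLP": make_pipeline(StandardScaler(), MLPClassifier(max_iter=2000, random_state=0)),
    "Gradient Boosted Trees": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"{name:>22}: {acc:.3f}")
```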

The main issue here is that you have really little data: 180ish rows!
With this little data you want to do so much! Too much!
Partitioning first for a global train and test split.
Then partitioning the train set (127 rows) again for parameter optimization with cross-validation.
Then expecting the test set (55 rows) to yield statistically significant performance metrics.
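To see why 55 test rows cannot give tight estimates, a rough normal-approximation confidence interval for an observed accuracy p on n rows is p ± 1.96·sqrt(p·(1−p)/n). The small sketch below evaluates it for a few accuracy values; the numbers are only illustrative, not taken from your results.

```python
# Rough sketch of the uncertainty on an accuracy measured on 55 rows.
import math

def accuracy_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for an observed accuracy."""
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)

for acc in (0.25, 0.55, 0.72):  # illustrative values only
    lo, hi = accuracy_ci(acc, 55)
    print(f"observed accuracy {acc:.2f} on 55 rows -> 95% CI ~ [{lo:.2f}, {hi:.2f}]")
```

With n = 55 the interval is on the order of ±0.13 around accuracies near 0.5, so fairly large differences between runs are well within noise.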

Also, why did you pick those particular models? Was this an assignment?

On the other hand, the performance is measured with accuracy on a multiclass string target that was binned from a numerical target.

Did you consider doing a regression from the start? Did you consider reducing the number of classes to two?
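If you want to try the regression route, here is a minimal sketch (again scikit-learn with a placeholder numeric-target dataset, not your data) of modelling the numeric target directly instead of binning it into classes first.

```python
# Sketch of the regression alternative: score the numeric target with an
# error metric instead of accuracy on binned classes. Placeholder data.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)  # y is numeric
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

reg = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
pred = reg.predict(X_te)
print("MAE on numeric target:", round(mean_absolute_error(y_te, pred), 2))

# If classes are still needed downstream, the numeric predictions can be
# binned after modelling, e.g. into just two classes.
```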

I added the AutoML Regression to the workflow too.

Open the View of the three components in the workflow to inspect the results.

[screenshot]


I am having a bit of a déjà vu here 🙂 since I tried to build a workflow comparing several multiclass models with seemingly similar data before, and we also discussed the structure and quality of the data:

My impression then was that it had more to do with the task and also with the target. I suggested that formulating it as a regression task might help, but we never managed to finish that conversation. I wonder whether, given more data, an SVM might perform better on a regression task.

A support vector classifier came out on top the last time around, although the model overall was not very good.

Another relevant link might be this one.