How do you know if your dataset is good before using ML algorithms?

mlauber71 · February 13, 2024, 9:49pm

Is it this dataset, “Heart Disease” from the UCI Machine Learning Repository, or another one from Kaggle? Or is it this one, and if so, what is the source of that? There was also this discussion we had (how can I find if the model have overfitting - #9 by mlauber71).

Typically such datasets are reasonably well prepared and are meant to be used in machine learning cases. Sometimes they tend to be ‘too good’ so as to spoil aspiring data scientists and give them a false impression of the real world …

So if the results are not to your satisfaction, it may be that the method or the data preparation did not work.

As @rfeigel already said: domain knowledge is most often necessary to understand what is going on. If you do not use a prepared dataset, like one from a Kaggle competition, you will have to invest some time to bring the data into a shape that truthfully represents your task, so that the data has any chance at all of containing what you are looking for.

To check a dataset you could start by checking the correlations between the features and the target variable. If there are only weak correlations, it is unlikely that you will be able to get a good model.

Another thing to check is whether the data has some variation: columns where all values are constant or nearly constant, or where one value dominates (highly imbalanced features), will not tell a model much.
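One possible way to screen for such columns is to look at the share of rows taken by the most frequent value. The sketch below assumes a threshold of 95%, which is an arbitrary choice for illustration, and uses invented column names.

```python
from collections import Counter


def dominant_share(values):
    """Fraction of rows taken by the most frequent value (1.0 means constant)."""
    counts = Counter(values)
    return counts.most_common(1)[0][1] / len(values)


def low_variation_columns(columns, threshold=0.95):
    """Names of columns whose most frequent value covers at least `threshold` of rows."""
    return [name for name, values in columns.items()
            if dominant_share(values) >= threshold]


# Hypothetical toy table with 50 rows per column.
columns = {
    "country": ["DE"] * 50,            # constant
    "smoker": ["no"] * 49 + ["yes"],   # nearly constant (98% "no")
    "sex": ["f", "m"] * 25,            # balanced
}

flagged = low_variation_columns(columns)
print(flagged)
```

A flagged column is not automatically useless (a rare value can still matter), so this is a prompt to look closer, not a rule for dropping columns.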

If your target variable is imbalanced, that does not need to stop you from building a model; you could for example use a different metric like AUCPR (area under the precision-recall curve) to help you.
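To illustrate why AUCPR helps with an imbalanced target, here is a small hand-rolled average precision, which is one common way to summarise the precision-recall curve. This is a sketch for understanding, with made-up labels and scores; in practice you would use a library routine such as scikit-learn's `average_precision_score`.

```python
def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (labels must be 0 or 1)."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    ap, tp, i = 0.0, 0, 0
    while i < len(ranked):
        # Rows sharing the same score are treated as one threshold.
        j, tp_here = i, 0
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp_here += ranked[j][1]
            j += 1
        tp += tp_here
        precision = tp / j                    # j rows are predicted positive so far
        ap += (tp_here / total_pos) * precision  # recall gain times precision
        i = j
    return ap


# Hypothetical imbalanced target: 2 positives among 20 rows.
y_true = [1, 0, 1] + [0] * 17
good_scores = [1 - 0.05 * k for k in range(20)]  # positives land at ranks 1 and 3
constant_scores = [0.0] * 20                     # a model that cannot rank at all

ap_good = average_precision(y_true, good_scores)
ap_constant = average_precision(y_true, constant_scores)
print(f"good ranking: {ap_good:.3f}, no ranking: {ap_constant:.3f}")
```

The model that cannot rank would still reach 90% accuracy by always predicting the majority class, while its average precision falls to the share of positives (10% here). That is the reason a metric like this is more informative on imbalanced targets.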

Further hints about models you could check here:

Another pre-check I like is to just run some (H2O.ai) models, which will give us variable importance. You can then check with domain knowledge whether the ‘leading’ variables make any sense at all, or whether they might be too good to be true (most likely constituting a leak, that is, information about the target that a real model will not be allowed to have).
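As a rough stand-in for the ‘too good to be true’ check (this is not H2O's variable importance, just a simple screen you can run without any model), you could test how well each single feature separates the target on its own. The sketch below uses a pairwise ROC AUC per feature; the feature names, the data and the 0.99 threshold are assumptions for illustration.

```python
def single_feature_auc(values, y_true):
    """ROC AUC of one numeric feature used directly as a score, direction-agnostic."""
    pos = [v for v, y in zip(values, y_true) if y == 1]
    neg = [v for v, y in zip(values, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes in y_true")
    # Share of (positive, negative) pairs ranked correctly; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return max(auc, 1 - auc)  # a perfectly inverted feature is just as suspicious


def suspicious_features(columns, y_true, threshold=0.99):
    """Features that separate the target almost perfectly on their own."""
    return [name for name, values in columns.items()
            if single_feature_auc(values, y_true) >= threshold]


# Hypothetical data: "discharge_code" is only recorded after the diagnosis.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
columns = {
    "age": [29, 41, 45, 52, 58, 63, 67, 70],
    "discharge_code": [0, 0, 0, 9, 0, 9, 9, 9],
}

leak_candidates = suspicious_features(columns, y_true)
print(leak_candidates)
```

Whether a flagged feature really is a leak is again a domain question: the screen only tells you which columns deserve a second look before you trust a model built on them.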

Welcome to the (real) world of data science

Maybe another remark: if you encounter problems, like when you said the heart disease dataset would not give you good results, you could share the workflows and approaches you have used so the KNIME community can have a look and give further advice on what could be done.