I need to know what test or node I should use to ensure that my data are balanced. I don't mean that it has missing values or needs normalizing; I mean that it is a real dataset without random or meaningless values, so that when I use ML algorithms on it, the results will be good enough
to make decisions on. I used to use the heart dataset (to follow along with videos I watch), but I figured out it was a bad dataset because all ML algorithms gave me bad results. Still, I can't test all algorithms every time.
I would appreciate any information from your education or experience that can help me.
thanks all
I don't think there's a simple answer to assessing "data quality." It's important to have sufficient domain knowledge as a starting point. Read this for some help on dealing with unbalanced data. A lot of datasets, e.g. for fraud detection, are naturally unbalanced.
Typically such datasets are reasonably well prepared and are meant to be used in machine learning use cases. Sometimes they tend to be 'too good', spoiling aspiring data scientists and giving them a false impression of the real world …
So if the results are not to your satisfaction, maybe the method or the preparation did not work.
As @rfeigel already said: domain knowledge is most often necessary to understand what is going on. If you do not use a prepared dataset, like one from a Kaggle competition, you will have to invest some time to bring the data into a shape that truthfully represents your task, so that the data has any chance at all of containing what you are looking for.
To check a dataset, you could start by checking the correlations with the target variable. If there are only weak correlations, it is very unlikely that you will be able to get a good model.
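A minimal sketch of such a first correlation check, assuming a pandas DataFrame loaded from a hypothetical file `heart.csv` with a numeric target column named `target`:

```python
import pandas as pd

df = pd.read_csv("heart.csv")  # hypothetical file name

# Pearson correlation of every numeric feature with the target,
# sorted by absolute strength
corr = (
    df.corr(numeric_only=True)["target"]
    .drop("target")
    .sort_values(key=abs, ascending=False)
)
print(corr)

# If no feature exceeds, say, |r| > 0.2, a strong model is unlikely;
# the 0.2 threshold here is an illustrative rule of thumb, not a standard
weak = corr[corr.abs() < 0.2]
print(f"{len(weak)} of {len(corr)} features correlate only weakly with the target")
```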
Another thing to check is whether the data has some variation, i.e. that not all values in a column are constant or nearly constant, or that a column's categories are highly imbalanced.
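A short sketch of that check (again assuming a pandas DataFrame from the hypothetical `heart.csv`), flagging columns dominated by a single value:

```python
import pandas as pd

df = pd.read_csv("heart.csv")  # hypothetical file name

for col in df.columns:
    # share of rows taken up by the single most frequent value
    top_share = df[col].value_counts(normalize=True, dropna=False).iloc[0]
    if top_share > 0.95:  # illustrative cut-off, tune to your data
        print(f"{col}: {top_share:.1%} of rows hold one value -- nearly constant")
```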
If your target variable is imbalanced, that does not need to stop you from building a model; you could, for example, use a different metric like AUCPR (area under the precision-recall curve) to help you.
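A hedged sketch, using scikit-learn, of scoring a model on an imbalanced target with AUCPR (average precision) instead of plain accuracy; the synthetic dataset and the random forest are placeholders for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# synthetic, heavily imbalanced binary classification problem
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # probability of the rare class

print(f"AUCPR: {average_precision_score(y_te, proba):.3f}")
# A no-skill baseline for AUCPR is the positive rate (~0.05 here),
# so compare against that rather than against 0.5.
```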
You can find further hints about models here:
Another pre-check I like is to just run some (H2O.ai) models, which will give you variable importances. You can check with domain knowledge whether the 'leading' variables make any sense at all, or whether they might be too good to be true (most likely constituting a leak, i.e. information about the target that a real model will not be allowed to have).
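A rough sketch of that pre-check with the H2O Python API; the file name and target column are assumptions:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()
frame = h2o.import_file("heart.csv")          # hypothetical path
frame["target"] = frame["target"].asfactor()  # treat target as categorical

# exclude stacked ensembles since they do not expose variable importances
aml = H2OAutoML(max_runtime_secs=120, seed=1,
                exclude_algos=["StackedEnsemble"])
aml.train(y="target", training_frame=frame)

# variable importance of the best single model; sanity-check the top
# variables against domain knowledge (too good often means a leak)
print(aml.leader.varimp(use_pandas=True))
```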
Welcome to the (real) world of data science
Maybe another remark: if you encounter problems, you could share the workflows and approaches you have used, so the KNIME community might have a look and give further advice on what could be done; for example, when you said the heart disease dataset would not give you good results.
First of all, thank you for all the valuable information you have provided.
Regarding your question: yes, it is the same data as in the previous question. I have noticed that no matter which machine learning algorithm I use, the results do not exceed 80% at best, and even adjusting the parameters does not improve them.
I mentioned at the beginning of a series on this link (https://www.youtube.com/playlist?list=PLiOBAYiI6xJjHSOtryFfr0FzZYPqTBOZi) that when applying linear regression, it gave a result opposite to what is known for cholesterol. Perhaps this is repeated in more than one column, as there is no strong relationship in the data.
The dataset cited seems to be from this page. It might be based on another set, but I was not able to track it down further.
The question is what you mean by 80%. Accuracy out of the box for this task goes to 85% without much data preparation. Some details might depend on the exact split into test and training data. On Kaggle, some more elaborate preparations reached 90%, though I have not checked them in detail.
You will have to decide for yourself whether the results are good enough to be used in your business case. In this case: is it better to scare some people who might later turn out not to have problems, or to miss some who might later develop a disease?
Also, there might be a case where the model detected all the right signs of a health problem (or of a defect, in predictive maintenance) but the problem is not fully present yet; technically this might be a wrong classification, but for all practical purposes it is not.
Some models do contain variable importance metrics. There are also two Jupyter notebooks in the /data/ folder: one to run an XGBoost model and one to inspect the H2O.ai GBM model in greater detail (along these lines).
If you want, you can explore these models. I might write a more elaborate blog about what is happening here; but one can just feed in new data with a "Target" variable as string (0/1) and a "row_id", split that into test and training, make sure the Python environment is running, and just run this. If you are patient, you can increase the time the AutoML process runs.
I built a very simple XGBoost model with default settings using your data. I realize you're after a priori methods to assess data quality, but sometimes it's better to try reasonable models to help make the assessment; a sketch of such a baseline follows the numbers below. AutoML is very useful. @mlauber71 has given you a lot of good advice.
Accuracy - 86.9%
Cohen's kappa - 0.736
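For reference, a minimal sketch of that kind of default-settings XGBoost baseline; the file and column names here are assumptions, not the exact workflow I used:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, cohen_kappa_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.read_csv("heart.csv")                     # hypothetical file
X, y = df.drop(columns=["target"]), df["target"]  # 0/1 target assumed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = XGBClassifier().fit(X_tr, y_tr)           # all default settings
pred = model.predict(X_te)

print(f"Accuracy:      {accuracy_score(y_te, pred):.3f}")
print(f"Cohen's kappa: {cohen_kappa_score(y_te, pred):.3f}")
```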
@rfeigel … or just use XGBoost and be done with it. On Mac and Linux, the Python H2O AutoML will include XGBoost in the models it tests.
One can also try to combine that with some hyperparameter optimization:
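A hedged sketch of what such an optimization could look like on top of the XGBoost baseline above, using scikit-learn's RandomizedSearchCV; the parameter ranges are illustrative assumptions, not recommendations:

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# illustrative search space over a few common XGBoost knobs
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(2, 8),
    "learning_rate": uniform(0.01, 0.3),
    "subsample": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    XGBClassifier(), param_dist, n_iter=25,
    scoring="accuracy", cv=5, random_state=0,
)
search.fit(X_tr, y_tr)  # X_tr / y_tr from the baseline sketch above
print(search.best_params_, search.best_score_)
```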
In a lot of cases it will not be a question of 0.84 vs. 0.85, which might shift anyway, but of thinking about what to do, where to set the cut-off for a prescribed action, and how to integrate some cost estimates.