Hi everybody, I’m a newbie in data science, so here is my idea. I think the result of any football match can be predicted from the statistics available before the match, and I’ve developed a multinomial logistic regression for this. To explain it better: if there is a match A vs B on 12/12/2022, I predict the result from all the statistics about A and B before that match. The main statistics come from understat.com and are, for example, mean goals, mean goals against, mean expected goals, mean expected goals against, and others that I’ve explained in the black box in the workflow.
toto2021.knwf (424.3 KB)
The problem is that the accuracy is around 50%, so my question is: how can I improve the accuracy of the model?
I’ve found these suggestions, but I don’t know whether they are possible in KNIME and, if so, how:
- Feature Scaling and/or Normalization
- Class Imbalance
- Optimize other scores
- Hyperparameter Tuning - Grid Search
- Explore more classifiers
- Error Analysis
Can you help me please?
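To put the 50% figure in context (this relates to the "Class Imbalance" and "Optimize other scores" items above), here is a minimal Python sketch that compares plain accuracy with a majority-class baseline and with balanced accuracy. The outcome labels and the example lists are made up for illustration and do not come from the workflow.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Share of matches where the predicted outcome equals the real one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def majority_baseline(actual):
    """Accuracy obtained by always predicting the most frequent outcome."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)

def recall_per_class(actual, predicted):
    """For each outcome, the share of its matches that were predicted correctly."""
    totals = Counter(actual)
    hits = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {label: hits[label] / totals[label] for label in totals}

def balanced_accuracy(actual, predicted):
    """Mean of the per-class recalls; not inflated by a dominant class."""
    recalls = recall_per_class(actual, predicted)
    return sum(recalls.values()) / len(recalls)

# Hypothetical outcomes: H = home win, D = draw, A = away win
actual    = ["H", "H", "H", "H", "H", "D", "D", "D", "A", "A"]
predicted = ["H", "H", "H", "H", "H", "H", "H", "H", "A", "H"]

print("accuracy:         ", accuracy(actual, predicted))
print("majority baseline:", majority_baseline(actual))
print("recall per class: ", recall_per_class(actual, predicted))
print("balanced accuracy:", balanced_accuracy(actual, predicted))
```

If the model's accuracy is only slightly above the majority baseline, or the recall for draws is close to zero, that suggests the model mostly predicts the dominant outcome, and a score such as balanced accuracy may be a more informative target than plain accuracy.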
Hi @bruss and welcome to KNIME Forum
Cool to create a model that can predict the outcome of a football match.
Your workflow is not so easy to understand, but I can give you some suggestions to improve the accuracy of your model.
- I think you need more data (matches) and fewer features
- Your features will determine whether you are able to create a model that meets your expectations.
- Do some analysis on the feature importance of your current feature set, and see what kind of features add the most to the prediction and get inspired for more/other features.
- Some ideas for new features:
  - ranking position of the home team vs the away team
  - goal difference of the home team vs the away team
  - historical head-to-head results
  - probability of a win, loss or draw given the result of the previous match(es) for the home and away team
  - average number of goals scored and conceded in the last x matches
But always make sure that no information from the match itself (which, in the training data, has already been played) finds its way into its features.
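The "last x matches" idea and the leakage warning above can be sketched together in plain Python. This is only an illustration under assumptions: the field names (`home`, `away`, `home_goals`, `away_goals`), the function `add_form_features` and the three example matches are hypothetical, and the list is assumed to be sorted by date.

```python
from collections import defaultdict, deque

def add_form_features(matches, window=3):
    """Return a copy of each match with pre-match rolling goal averages.

    `matches` must be sorted by date. Each feature uses only games played
    strictly before the match it is attached to.
    """
    # team -> (scored, conceded) pairs of its most recent `window` matches
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    for m in matches:
        row = dict(m)
        for side in ("home", "away"):
            past = history[m[side]]
            n = len(past)
            # None marks a team with no earlier match in the data
            row[f"{side}_avg_scored"] = sum(s for s, _ in past) / n if n else None
            row[f"{side}_avg_conceded"] = sum(c for _, c in past) / n if n else None
        rows.append(row)
        # Only now, after the features are stored, record this match's result.
        history[m["home"]].append((m["home_goals"], m["away_goals"]))
        history[m["away"]].append((m["away_goals"], m["home_goals"]))
    return rows

# Hypothetical matches, sorted by date
matches = [
    {"date": "2022-11-01", "home": "A", "away": "B", "home_goals": 2, "away_goals": 0},
    {"date": "2022-11-20", "home": "B", "away": "A", "home_goals": 1, "away_goals": 1},
    {"date": "2022-12-12", "home": "A", "away": "B", "home_goals": 3, "away_goals": 2},
]

features = add_form_features(matches, window=3)
for row in features:
    print(row["date"], row["home_avg_scored"], row["home_avg_conceded"],
          row["away_avg_scored"], row["away_avg_conceded"])
```

The important design choice is the order inside the loop: the features for a match are computed first, and its result is added to the history afterwards, so the score of the 12/12/2022 match never influences its own features.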
Thank you Hans for your suggestions. I have already started from a model with more data (all matches of 2021, see here:
toto2021.knwf (435.2 KB)
). I start my logistic regression learner with only 1 or 2 features and then add more, but the accuracy is always around 50% (quite low). I also tried a Naive Bayes learner, but I don’t understand how it works; do you know?
Can you help me, for example, to find the outliers in the logistic regression model?
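One possible reading of "outliers" here is the matches where the model was most confidently wrong. A minimal Python sketch of that kind of error analysis, assuming the predicted class probabilities have been exported next to the real outcome (the row layout, the function `most_surprising` and the numbers are hypothetical):

```python
import math

def most_surprising(rows, top=3):
    """Rank matches by how little probability the model gave the real outcome."""
    scored = []
    for row in rows:
        p_actual = row["probs"][row["actual"]]
        # Negative log-likelihood: large when the model was confidently wrong.
        # The floor avoids log(0) if a probability was exported as exactly 0.
        loss = -math.log(max(p_actual, 1e-12))
        scored.append((loss, row["match"]))
    scored.sort(reverse=True)
    return scored[:top]

# Hypothetical predictions: H = home win, D = draw, A = away win
rows = [
    {"match": "A vs B", "actual": "H", "probs": {"H": 0.70, "D": 0.20, "A": 0.10}},
    {"match": "C vs D", "actual": "A", "probs": {"H": 0.80, "D": 0.15, "A": 0.05}},
    {"match": "E vs F", "actual": "D", "probs": {"H": 0.40, "D": 0.30, "A": 0.30}},
]

ranking = most_surprising(rows)
for loss, match in ranking:
    print(f"{match}: loss {loss:.2f}")
```

Looking at the matches at the top of this ranking can show whether the errors share a pattern (for example, mostly draws, or mostly early-season games with little history), which is often more useful than removing them as outliers.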