Greetings everyone!
I have a question about the confusion matrix produced when selecting the best parameters. The first image is the entire workflow to date, which shows a phase where I run a grid search over hyperparameters with the goal of using the best result for the final model. As shown below, the best hyperparameter combination gave me the following confusion matrix, together with 92% accuracy and other stellar classification statistics.
However, when I apply the chosen parameters from that search in the “Using best parameters” phase, I get the following confusion matrix with a suspect 100% accuracy, rather than something that resembles the statistics from the first step. I am optimizing to maximize accuracy.
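For reference, here is a rough scikit-learn sketch of the two-step procedure I am trying to reproduce in the workflow (the dataset, model, and parameter grid below are placeholders, not my actual setup): grid-search on the training partition only, then score the refit best model once on a held-out test set.

```python
# Hypothetical stand-in for the workflow: grid search on training data only,
# then a single evaluation of the best model on the untouched test partition.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = make_classification(n_samples=1000, random_state=0)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

param_grid = {"n_estimators": [50, 200], "max_depth": [3, None]}  # example grid
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)            # the test rows are never seen here

y_pred = search.best_estimator_.predict(X_test)
print(search.best_params_)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))   # should land near the CV score, not 100%
```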
I'm wondering what might be the cause and how I can fix this issue.
Thank you again, everyone, for helping me learn the ropes of data science and machine learning. I look forward to continuing this process. You all have been great thus far!
~Cole K.
Now that I have thought about this further… is it possible that the reason for the 100% accuracy is that the algorithm simply memorized the data the second time?
Further update
Unfortunately, after doing some tweaking, I still have the issue of the holdout test set returning 100% accuracy, even after I reconfigured the partitioning. I still don't know why the data partitioned into the bottom “test” workflow returns 100% accuracy.
Here are the flow variables that I sent to the learner in the test sub-workflow:
And the dialog box:
If anyone has ideas on how to change the workflow so that the test accuracy is no longer a suspect 100%, I would be very grateful for your assistance.
Many thanks!
~Cole K.
Hi Cole,
If you are unsure, you can try a hyperparameter set that scored very low accuracy in the search, just to test whether the holdout result follows it.
On the other hand, your test dataset is quite small. Maybe that's the cause of the suspiciously high accuracy.
Hi @Cole_Kingsbury, any chance (if possible) you could share sample data and the workflow with us? At first glance, a couple of things come to mind:
- Data Leakage: Test data might have been exposed to the model during training, which would result in overly optimistic performance (e.g., 100% accuracy).
- Incorrect Split: If the data is time-series, random splitting could allow the model to “see” future data, leading to inaccurate results. For time-series, the data should be split chronologically.
- Random Split: If the split is random, especially with imbalanced data, the training and test sets may contain very similar examples, which makes the model perform unusually well (see the sketch after this list).
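For illustration, here is a minimal pandas/scikit-learn sketch of the two splitting strategies mentioned above; the column names "date" and "target" are hypothetical, not from your workflow.

```python
# Two ways to split that avoid the pitfalls above (hypothetical column names).
import pandas as pd
from sklearn.model_selection import train_test_split

def chronological_split(df: pd.DataFrame, frac: float = 0.8):
    """Time-series case: train on the earliest rows, test on the latest ones."""
    df = df.sort_values("date")
    cut = int(len(df) * frac)
    return df.iloc[:cut], df.iloc[cut:]

def stratified_split(df: pd.DataFrame, frac: float = 0.8):
    """Imbalanced case: keep the class ratio the same in train and test."""
    return train_test_split(df, train_size=frac,
                            stratify=df["target"], random_state=0)

# Usage (given a DataFrame `df`):
# train_df, test_df = chronological_split(df)   # or stratified_split(df)
```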
If you need further assistance or clarification, feel free to reach out. Best of luck with the troubleshooting, and I’m sure you’ll have it sorted out soon.
Best,
Alpay