Questions on overfitting and H2O model deployment

Greetings everyone!

After spending quite a few hours configuring the H2O workflow below to make predictions from a train and a test data set, I'm starting to think about applying this model to my own data, which the model hasn't seen yet. This will be my first model ever; I believe the term for this is model deployment.

The workflow below shows the current setup, with the H2O to MOJO and MOJO Writer nodes linked at the top. Three questions:

  • Is this the right setup?
  • What should I check to confirm the model isn't substantially overfitting? (I sketch below the kind of check I have in mind.)
  • How can I configure the nodes so that only the best model is saved, rather than simply the most recent one?
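
For context on the second question, this is roughly the check I have in mind, written against H2O's Python API directly rather than the KNIME nodes (the nodes wrap the same library). It is only a sketch: the file names, the target column, and the use of AUC are placeholders for my actual setup, and part of my question is whether this is even the right comparison.

```python
import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

# Placeholder file and column names standing in for my train/test split
train = h2o.import_file("train.csv")
test = h2o.import_file("test.csv")
target = "label"
features = [c for c in train.columns if c != target]

# Treat the target as categorical so this is a classification problem
train[target] = train[target].asfactor()
test[target] = test[target].asfactor()

rf = H2ORandomForestEstimator(ntrees=200, seed=42)
rf.train(x=features, y=target, training_frame=train)

# Compare performance on the training data vs. the held-out test set;
# a large gap (e.g. near-perfect training AUC but much lower test AUC)
# would suggest overfitting.
print("train AUC:", rf.model_performance(train=True).auc())
print("test  AUC:", rf.model_performance(test_data=test).auc())
```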

Any help would be most welcome. Loving this process so far!

Many thanks!

~Cole K.

@Cole_Kingsbury I think the idea is to store the best parameters and then train another model using them.
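
In the H2O Python API that would look something like the sketch below, continuing from the names defined in your sketch above (the hyperparameter values and the AUC sorting metric are just examples, not a recommendation for your data): search over parameters with cross-validation, keep only the best model, and export that one as a MOJO.

```python
from h2o.estimators import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch

# Example hyperparameter ranges to search over; adjust for your data
hyper_params = {"ntrees": [100, 200], "max_depth": [10, 20, 30]}

grid = H2OGridSearch(
    model=H2ORandomForestEstimator(nfolds=5, seed=42),
    hyper_params=hyper_params,
)
grid.train(x=features, y=target, training_frame=train)

# Sort the grid by cross-validated AUC and keep only the best model,
# then export it as a MOJO for deployment (instead of saving whichever
# model happened to be trained last).
best = grid.get_grid(sort_by="auc", decreasing=True).models[0]
mojo_path = best.download_mojo(path=".")
print("Best model saved to:", mojo_path)
```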


Many thanks for this resource, I really appreciate it. Some follow-up questions. I now have this at the end of my “line”, following your heart disease template example. I'm trying to minimize error, and in particular false positives need to be kept to a minimum.

My main question is how to connect the H2O Random Forest Learner at the end with the H2O Cross Validation Loop Start:

I have a single data table of train/test data (geochemical measurements), so I'm not sure how these nodes should be connected so that I can then apply the best model to my new geochemical data. Thanks again for the resource you shared.
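
In case it helps to show what I think is going on, here is my rough understanding of that wiring in the H2O Python API, continuing from the sketch above (again only a sketch; the new-data file name is a placeholder, and I may have the cross-validation part wrong, which is what I'm asking about):

```python
# Cross-validation on the single train/test table (what I understand the
# H2O Cross Validation Loop Start to be doing): nfolds handles the folds.
rf_cv = H2ORandomForestEstimator(ntrees=200, nfolds=5, seed=42)
rf_cv.train(x=features, y=target, training_frame=train)

# Per-fold metrics, plus the confusion matrix to keep an eye on false positives
print(rf_cv.cross_validation_metrics_summary())
print(rf_cv.confusion_matrix())

# Finally, apply the chosen model to new geochemical data it has never seen
new_data = h2o.import_file("new_geochemical_samples.csv")  # placeholder file name
predictions = rf_cv.predict(new_data)
print(predictions.head())
```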

~Cole K.