Deep learning workflow adaptation to new data

dataminer_1 · May 18, 2021, 8:36pm

I am having a lot of trouble adapting the 09_Wide_and_Deep_Learning_on_Census_Dataset example workflow to use other data sets. I like this workflow because it auto-configures the model shape and handles string and numerical input data transparently.

I want to adapt this example workflow for use with other tabular data sets, such as the Wisconsin Breast Cancer data set from the UCI Machine Learning Repository. I have also tried other data sets, such as the Boston Housing data set. All of my test models come out perfect (100% accurate), so I must have made a mistake in the configuration. I suspect that the target variable somehow sneaked into the predictor variables (the workflow is very complex).

I want to use this workflow on 10,000 records of clinical data to predict the incidence of precancerous colon polyps at the UCI Medical Center. My bagged boosted tree model is only about 70% accurate, and I am hoping that the deep learning model can beat that. Using other data must require changes to the workflow configuration beyond specifying the target variable, but I can’t figure out what I missed.
Any suggestions anyone?

stelfrich · May 20, 2021, 2:44pm

Hi @dataminer_1,

Does your target variable show up in the first or second output port of the Preprocessing metanode? If yes, it seems to have snuck in somehow. The configuration of the Preprocessing metanode is not very robust: whenever your input data changes (even if only the column types change), you need to go in and open the views of the Column Selection and Column Filter nodes.
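If it is easier to check this outside the workflow, the same test can be sketched in plain Python on an exported copy of the metanode output. This is only a rough illustration, not part of the example workflow: the function name `find_target_leaks`, the row layout (a list of dicts), and the sample column names are all assumptions made up for the example.

```python
def find_target_leaks(rows, predictor_columns, target_column):
    """Return predictor columns that look like the target variable.

    A column is flagged if its name matches the target name (ignoring
    case and surrounding whitespace) or if its values are identical to
    the target values in every row.
    """
    target_values = [row[target_column] for row in rows]
    leaks = []
    for col in predictor_columns:
        same_name = col.strip().lower() == target_column.strip().lower()
        # row.get() tolerates predictor names that are absent from the rows
        same_values = bool(rows) and [row.get(col) for row in rows] == target_values
        if same_name or same_values:
            leaks.append(col)
    return leaks


# Hypothetical rows, as they might look after exporting a preprocessing output
example_rows = [
    {"clump_thickness": 5, "class": "benign", "class_copy": "benign"},
    {"clump_thickness": 8, "class": "malignant", "class_copy": "malignant"},
    {"clump_thickness": 1, "class": "benign", "class_copy": "benign"},
]

leaks = find_target_leaks(example_rows, ["clump_thickness", "class_copy"], "class")
print(leaks)
```

An empty result does not prove there is no leakage (a predictor can encode the target without being an exact copy), but any column flagged here would explain a model that scores 100%.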

I have tried to adapt the workflow to the Wisconsin Breast Cancer dataset, but it seems that some of the lower-level components do not handle empty input tables well. This is a problem for that dataset because its columns are either all categorical (if you interpret the 1-10 domain as categories) or all numerical. Either way, one of the metanodes that handle the data receives an empty table and will not execute successfully. I seem to remember that the Boston Housing dataset has a mixed set of variables, so that might work better. Unfortunately, I have run out of time and can’t investigate this any further.
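To illustrate the empty-table problem, here is a minimal Python sketch of the general idea: split the predictor columns by type and only run the branches that actually received columns. It does not reproduce how the KNIME metanodes work internally; the function names, the branch labels, and the sample data are hypothetical.

```python
def split_columns_by_type(rows):
    """Group column names into numeric and categorical by the values seen."""
    numeric, categorical = [], []
    if not rows:
        return numeric, categorical
    for col in rows[0]:
        values = [row[col] for row in rows]
        # bool is a subclass of int in Python, so exclude it explicitly
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            numeric.append(col)
        else:
            categorical.append(col)
    return numeric, categorical


def branches_to_run(rows, target_column):
    """Return only the preprocessing branches that have at least one column."""
    numeric, categorical = split_columns_by_type(rows)
    groups = {
        "numeric": [c for c in numeric if c != target_column],
        "categorical": [c for c in categorical if c != target_column],
    }
    # Drop empty groups so no branch is handed an empty table
    return {name: cols for name, cols in groups.items() if cols}


# Hypothetical all-numeric predictors with a string target
sample_rows = [
    {"clump_thickness": 5, "cell_size": 1, "class": "benign"},
    {"clump_thickness": 8, "cell_size": 7, "class": "malignant"},
]

branches = branches_to_run(sample_rows, "class")
print(branches)
```

With all-numeric predictors, only the numeric branch is returned, so there is no categorical branch left to fail on an empty input.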

Best regards,
Stefan
