Question about combining unstructured data with structured data

Hi all

I have a question about duo mining. First, I have 200 text documents with 150,000 terms after pre-processing. I used a random forest to select 1,000 features for subsequent document classification with an ANN. Because of the small sample size, I applied 10-fold cross-validation to make the predictions. Second, I have another set of numeric figures (3 continuous variables) that can be treated as market data to enrich the document classification.

Questions:

  1. If I use 2-stage modeling, i.e. results from the ANN + market data (3 variables as stated above), the sample size seems too small, as the output from the ANN is only 20 (output from 10-fold cross-validation).

  2. If I use the market data (3 variables) together with the selected features from the documents (1,000 features), then the contribution from the market data would presumably be negligible (3 vs 1,000).

As I don’t want to use voting or simple weighted approaches to combine the results, may I have your professional view on how to make a sensible prediction?

If this is possible, are there any built-in nodes to do it?

Thanks
Lawson

Hi Lawson,

Is there a specific reason why you want to use this two-step approach? What are you doing for pre-processing, given that you still have 150k terms afterwards? Some frequency-based filtering might help to first remove terms that occur in only a few documents, or even in a single one.
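
To illustrate the frequency-based filtering idea, here is a minimal Python sketch outside of any workflow tool. The function name `filter_terms_by_doc_frequency`, the `min_df` parameter, and the toy corpus are all made up for illustration; the right cutoff for 200 documents is something to experiment with.

```python
from collections import Counter


def filter_terms_by_doc_frequency(docs, min_df=2):
    """Keep only terms that occur in at least `min_df` documents.

    `docs` is a list of token lists, one list per document.
    (Hypothetical helper, not part of any library.)
    """
    # Count each term once per document: document frequency, not raw count.
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    vocabulary = {term for term, df in doc_freq.items() if df >= min_df}
    # Drop the rare terms from every document, preserving token order.
    filtered = [[t for t in tokens if t in vocabulary] for tokens in docs]
    return filtered, vocabulary


# Toy corpus, invented for illustration only.
docs = [
    ["market", "rally", "bank"],
    ["market", "bank", "merger"],
    ["rally", "market", "typo_xyz"],
]
filtered_docs, vocab = filter_terms_by_doc_frequency(docs, min_df=2)
print(sorted(vocab))  # ['bank', 'market', 'rally']
```

Terms seen in a single document ("merger", "typo_xyz") are dropped, which is usually where most of a very large vocabulary comes from.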

As for your second question, do you expect the three features to have an impact? If so, you should also expect that impact to show in the model, even though there are only 3 features.
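
A small Python sketch of why the feature count alone need not matter: learners that score each feature on how well it separates the classes (tree-based models, for example) pick up a useful column no matter how many other columns sit next to it. The decision stump below and the tiny table are invented for illustration (columns 0 to 3 stand in for term features, column 4 for one market variable); it says nothing about how your ANN behaves, where comparing cross-validated accuracy with and without the three variables would be the direct check.

```python
def best_stump(rows, labels):
    """Return (feature_index, threshold, accuracy) of the best single split.

    Each feature is scored on its own, so a useful column is found
    regardless of how many other columns the table has.
    (Hypothetical helper for illustration.)
    """
    n_features = len(rows[0])
    best = (None, None, 0.0)
    for j in range(n_features):
        values = sorted({row[j] for row in rows})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            predictions = [1 if row[j] > threshold else 0 for row in rows]
            hits = sum(p == y for p, y in zip(predictions, labels))
            # Also allow the inverted rule (predict 1 when value <= threshold).
            accuracy = max(hits, len(rows) - hits) / len(rows)
            if accuracy > best[2]:
                best = (j, threshold, accuracy)
    return best


# Made-up data: four binary term indicators plus one market variable.
rows = [
    [1, 0, 1, 0, 1.5],
    [0, 1, 1, 0, 1.0],
    [1, 1, 0, 1, 2.0],
    [0, 0, 1, 1, 6.0],
    [1, 1, 0, 0, 5.5],
    [0, 0, 1, 0, 5.0],
]
labels = [0, 0, 0, 1, 1, 1]

print(best_stump(rows, labels))  # (4, 3.5, 1.0): the market column wins
```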

Cheers,
Roland