I'm currently using a Naive Bayes Learner/Predictor model to classify my data. I've gotten to the point where I have 2,278,435 rows that I'm looking to run through the Document Vector node as a non-collection cell.
I've attempted partitioning to reduce the load, but found that processing hangs at 99% indefinitely (or what seems to be indefinitely) when the rows exceed 100K. Is there a maximum or a best practice for the volume processed by the Document Vector node?
The Document Vector node will create one row for each input document, which in your case is 2,278,435 rows. It will create one column for each unique term in your set of documents. Be aware that with many columns (>3000) this becomes very time-consuming and is no longer practical. How many unique terms do you have in your document set? Have you filtered them before?
Best practice would be to filter the terms and documents properly and check how many columns will be created, i.e. how many unique terms are in the corpus. Keep the number of unique terms below 3k. To be honest, I have never tried to create 2 million document vectors.
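To make the row and column behaviour concrete, here is a minimal Python sketch (not KNIME's implementation) that builds relative-term-frequency vectors for a tiny made-up corpus and then estimates the table size at the scale discussed above. The function name document_vectors, the whitespace tokenization, and the assumption of a dense table with 8-byte values are all illustrative; KNIME's actual storage may differ.

```python
def document_vectors(docs):
    """Return (vocabulary, rows): one row per document, one column per unique term."""
    vocabulary = sorted({term for doc in docs for term in doc.split()})
    column_of = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for doc in docs:
        terms = doc.split()
        row = [0.0] * len(vocabulary)
        for term in terms:
            # Relative term frequency: occurrences divided by document length.
            row[column_of[term]] += 1.0 / len(terms)
        rows.append(row)
    return vocabulary, rows


# Hypothetical toy corpus, tokenized on whitespace.
docs = ["red apple", "green apple", "red red wine"]
vocabulary, rows = document_vectors(docs)
print(len(rows), "rows x", len(vocabulary), "columns")

# Rough size estimate for the real data set, ASSUMING a dense table of
# 8-byte values (an assumption for illustration, not KNIME's storage model).
n_docs, n_terms = 2_278_435, 3_000
cells = n_docs * n_terms
dense_gb = cells * 8 / 1e9
print(f"{cells:,} cells, roughly {dense_gb:.1f} GB if stored densely")
```

Even at the suggested 3k column limit, the table has several billion cells, which is why reducing the number of unique terms matters so much here.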
First, thanks for the reply :)
Unfortunately I'm working with an extremely large dataset and using the relative term frequency functionality. I believe the values are unique at this point, which speaks to the diversity of the data set I guess.
I was playing around with the Parallel Chunk functionality from KNIME Labs, but it doesn't appear to be compatible with the Document Vector node, or I'm not quite grasping the error:
ERROR Parallel Chunk End Execute failed: Cell count in row "1" is not equal to length of column names array: 1809 vs. 1804
I was hoping this could help take advantage of CPU and my increased heap sizes, but I'm not able to get it to run properly.
The Parallel Chunk Loop node works fine for tasks that can be executed in a data-parallel way. This is not the case for document vector creation. The node needs to see all the data, since it creates a column for each unique term in the data set. When the data is split into chunks, the nodes executed in parallel each see only part of the data and cannot create a column for each term in the complete data set.
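The effect can be sketched in a few lines of Python (an illustration only, not what KNIME does internally): each chunk derives its columns from the terms it happens to contain, so the chunks end up with different column sets that cannot simply be concatenated. This is plausibly what the column count mismatch in the error above reflects. The helper name chunk_vocabulary and the toy documents are made up.

```python
def chunk_vocabulary(docs):
    """Columns a vectorizer would create if it only saw these documents."""
    return sorted({term for doc in docs for term in doc.split()})


# Hypothetical toy corpus split into two chunks, as a parallel loop would do.
docs = ["red apple", "green apple", "red wine", "white wine"]
chunks = [docs[:2], docs[2:]]

per_chunk = [chunk_vocabulary(chunk) for chunk in chunks]
full = chunk_vocabulary(docs)

for i, vocab in enumerate(per_chunk):
    print(f"chunk {i}: {len(vocab)} columns {vocab}")
print(f"full data set: {len(full)} columns {full}")
```

Neither chunk produces the five columns of the full data set, and the two chunks do not agree with each other either.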
Can you count how many unique terms you have in your data set (use a GroupBy)?
Can you then count in how many documents each term occurs (GroupBy and unique count over documents)?
Based on this information you can filter out many terms. Only keep terms that occur in more than 1% of the documents in your set. How many unique terms are there after filtering?
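Outside of KNIME, the counting and filtering steps above could look roughly like the following Python sketch. It stands in for the GroupBy nodes and is not a description of them; the function names, the whitespace tokenization, and the toy documents are assumptions for illustration.

```python
from collections import Counter


def document_frequency(docs):
    """Count in how many documents each term occurs."""
    df = Counter()
    for doc in docs:
        # set() ensures a term is counted once per document, however often it repeats.
        df.update(set(doc.split()))
    return df


def frequent_terms(docs, min_fraction=0.01):
    """Keep terms that occur in more than min_fraction of the documents."""
    df = document_frequency(docs)
    threshold = min_fraction * len(docs)
    return sorted(term for term, count in df.items() if count > threshold)


# Hypothetical toy corpus; a 50% threshold is used so the filter has a visible effect.
docs = ["a b", "a c", "a b d", "a a a"]
df = document_frequency(docs)
kept = frequent_terms(docs, min_fraction=0.5)
print(len(df), "unique terms before filtering,", len(kept), "after")
```

With the default min_fraction of 0.01 this corresponds to the 1% rule; the number of terms kept is the number of columns the document vectors would have.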