Missing Value for Collections

I'm developing a text classification workflow that uses DL4J nodes. I pass a training set of sentences into a Word2Vec Learner node and then feed the model and the same sentences into a Word Vector Apply node. A Split Collection Column node then splits the result into multiple columns, where each cell is itself a vector of values. However, some cells in these columns are missing and show up as ? in the output table.

When I pass this output to the DL4J Feedforward Learner node as feature vectors, it complains that missing values are not allowed.

I tried using the Missing Value node to replace all such missing values with an empty vector []. However, that node has no 'Fix Value' option for collection cells (unlike the one it offers for String cells).

How can I replace missing values in collection cells with an empty collection?

Kindly help.

Hi, missing values in a collection can't be replaced directly. You need to split up the collection with the Split Collection Column node, handle the missing values with the Missing Value node, and then create the collection again with the Create Collection Column node.
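The split / fill / recombine idea can be sketched in plain Python. This is only an illustration of the logic, not KNIME node code: the function name `fill_missing_vectors`, the use of `None` for a missing cell, and the choice to fill with zeros of a fixed dimension are all assumptions made for this example.

```python
from typing import List, Optional

Vector = List[float]


def fill_missing_vectors(
    row: List[Optional[Vector]], dim: int, fill: float = 0.0
) -> List[Vector]:
    """Replace missing (None) word vectors in one row with a constant vector.

    Hypothetical helper: zero-filling is an assumption for this sketch,
    not something the thread prescribes.
    """
    return [vec if vec is not None else [fill] * dim for vec in row]


# Each row is one sentence; each entry is one word vector or None (missing).
rows = [
    [[0.1, 0.2], None, [0.3, 0.4]],
    [None, [0.5, 0.6]],
]

# "Split" each row into its cells, fill the gaps, and collect them again.
filled = [fill_missing_vectors(row, dim=2) for row in rows]
```

After this step no entry in `filled` is missing, which is the property the downstream learner asks for.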

I hope this helps.

Cheers, Kilian

Thanks Kilian!

I have added a loop that processes each column from the Split Collection Column node that contains missing vector values: it splits the column again into simple double-value columns, replaces the missing values, and aggregates the result back into a vector.

I will let you know if there are any issues.

Thanks once again.

Hi,

could you maybe attach a workflow reproducing the behavior? If you are using the same sentences for the Apply node that you used for training, there should not be any missing values.

Cheers

David

 

Hi David,

True, the sentences used in the Word2Vec Learner and Word Vector Apply nodes are the same (output of Concatenate node). Please refer to the workflow picture that I had attached with my initial question.

In the console logs, for the Word Vector Apply node, I am seeing the following:

WARN  Word Vector Apply    0:330      12355 words are not contained in the WordVector vocabulary.

Hi sballa,

thanks for your answer. The Learner node has an option to omit words that occur below a certain frequency. The default is set to 5. Hence, all words occurring fewer than 5 times in the corpus will not be contained in the WordVector vocabulary. Therefore, they can't be matched in the Apply node, and you get missing values in the list of vectors. If you don't want that, you can set the option to zero.
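A small plain-Python sketch of the effect described above: words below the frequency threshold never enter the vocabulary, so looking them up later yields a missing entry. The names `build_vocab` and `lookup`, the whitespace tokenization, and the toy sentences are assumptions for illustration and do not reflect the node's actual implementation.

```python
from collections import Counter


def build_vocab(sentences, min_freq):
    """Keep only words that occur at least min_freq times in the corpus."""
    counts = Counter(word for s in sentences for word in s.split())
    return {word for word, count in counts.items() if count >= min_freq}


def lookup(sentence, vocab):
    """Return each word if it is in the vocabulary, else None (missing)."""
    return [word if word in vocab else None for word in sentence.split()]


sentences = ["the cat sat", "the dog sat", "the bird flew"]

# With a threshold of 2, only "the" and "sat" survive.
vocab = build_vocab(sentences, min_freq=2)
with_threshold = lookup("the bird sat", vocab)

# With a threshold of 0, every training word is kept.
full_vocab = build_vocab(sentences, min_freq=0)
without_threshold = lookup("the bird sat", full_vocab)
```

Here `with_threshold` contains a `None` for "bird" even though the sentence comes from the same corpus, while `without_threshold` has no gaps.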

Sorry that I didn't realize that earlier.

Cheers, David

 

Hi David,

Unfortunately, the missing value problem in the Word Vector Apply node occurs even after setting the Word2Vec Learner node's Minimum Word Frequency to 0 (with every other parameter left at its default value). It does not happen if the Word Vector Apply node's 'Calculate Document Mean Vector?' option is checked, but that outputs only mean word vectors, which are not sufficient for the downstream nodes of my workflow.
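To make the difference concrete, here is a plain-Python sketch of what a document mean vector is. It is not the node's code. The function name `mean_vector` is hypothetical, and skipping missing entries while averaging is only an assumption about why the mean output might show no gaps; nothing in this thread confirms how the node behaves internally.

```python
def mean_vector(vectors):
    """Average the available word vectors of one document.

    Assumption for this sketch: missing entries (None) are skipped.
    Returns None if no vector is available at all.
    """
    present = [v for v in vectors if v is not None]
    if not present:
        return None
    dim = len(present[0])
    return [sum(v[i] for v in present) / len(present) for i in range(dim)]


# One document with three words, the second of which has no vector.
doc = [[1.0, 2.0], None, [3.0, 4.0]]
mean = mean_vector(doc)
```

The result is a single vector per document, so the per-word vectors (and their order) are no longer available to later steps.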

Thanks, Sudha

 

Hi Sudha,

sorry for the late answer and the problems with the node. Could you maybe attach a real workflow? Or create one reproducing the behavior with dummy data if you can't share your data? That would be a great help.

Thanks

David
