Sentiment Classification Using Word Vectors - Deep learning 4J

Hi,

I was playing with the example workflow 08_Sentiment_Classification_Using_Word_Vectors from the KNIME server.

But I think the original workflow is not very useful for classification purposes, because it trains the model on the whole dataset and not on a partition.

I tried to modify the workflow so that the trained word vector model is applied to new test data, but I don't understand why the algorithm predicts everything wrong.

Please find attached the workflow. What am I missing? What is wrong?

Could you help me please?
Thanks in advance

Hi iiiaaa,

In the original workflow there are two steps that matter for the classification. The first is to embed the text samples into a vector space, which is done by the Word Vector Learner configured to perform Doc2Vec. This step does not use any ground-truth label information. Instead, we generate a distinct document id for each document and use it as the label. As a result, the learner produces one vector per document (together with a vector for each word), because each document corresponds to exactly one id. As you can see in the original workflow, only the document vectors (the vectors corresponding to the doc ids) are kept. The word vectors are discarded because we want to classify whole documents. Since no class labels enter this first step, it is not a problem to run it on the whole dataset (although for a classification task in general you are perfectly right).
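To make the difference between the two tagging schemes concrete, here is a minimal Python sketch. It is not the KNIME node or Doc2Vec itself. The helper `tag_documents` and the toy texts are hypothetical, and the sketch only shows which tag an embedding learner would train a vector for in each case.

```python
def tag_documents(texts, labels=None):
    """Pair each tokenised text with the tag a vector would be learned for.

    labels=None : every document gets its own id (one vector per document).
    labels given: documents share their class label (one vector per class).
    """
    tagged = []
    for i, text in enumerate(texts):
        tag = f"doc_{i}" if labels is None else labels[i]
        tagged.append((tag, text.lower().split()))
    return tagged


# Toy data, invented for illustration only.
texts = ["A great movie", "Terrible plot", "Great acting", "A terrible movie"]
labels = ["pos", "neg", "pos", "neg"]

by_id = tag_documents(texts)             # tagging as in the original workflow
by_label = tag_documents(texts, labels)  # tagging with ground-truth labels

# Number of distinct tags = number of document-level vectors learned.
n_vectors_by_id = len({tag for tag, _ in by_id})
n_vectors_by_label = len({tag for tag, _ in by_label})
print(n_vectors_by_id, n_vectors_by_label)
```

With doc ids, each of the four texts gets its own vector. With labels, only two vectors exist, one per class.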

Then, in the second step, we train a classifier on the document vectors created in the first step (you can see the first step as a kind of preprocessing that brings the data into a suitable form for the classifier). Here we use a Partitioner and train the model only on a portion of the data.
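Outside of KNIME, the partitioning step could look roughly like the following sketch. The function `partition`, the 70/30 split, and the toy rows are assumptions for illustration and do not reflect the Partitioner node's actual settings in the workflow.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle the rows reproducibly and split them into train and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)  # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy rows of (doc id, document vector, label), invented for illustration.
rows = [
    (f"doc_{i}", [float(i), float(i % 3)], "pos" if i % 2 == 0 else "neg")
    for i in range(10)
]

train, test = partition(rows, train_fraction=0.7)
print(len(train), len(test))
```

The classifier is then fitted on `train` only, and `test` is held back for scoring.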

Generally, for classification one would use doc ids rather than class labels with Doc2Vec, because the technique embeds text effectively into a (relatively) low-dimensional vector space in which the classes can be discriminated (have a look at the plot of the 2-dim PCA of the vectors in the original workflow).

If you use the ground-truth labels for Doc2Vec, you get something like a centroid for each class. This could be used, e.g., for similarity search.

I hope that helps. If you have further questions, I'm happy to help.
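As a rough sketch of such a similarity search, assuming one vector per class is already available: the vectors below are made up, and cosine similarity is one common choice of measure, not necessarily what any particular KNIME node uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_label(query, label_vectors):
    """Return the label whose class vector is most similar to the query vector."""
    return max(label_vectors, key=lambda label: cosine(query, label_vectors[label]))


# Hypothetical class vectors, as if learned with the labels as tags.
label_vectors = {"pos": [0.9, 0.1], "neg": [-0.8, 0.3]}

predicted = nearest_label([0.7, 0.2], label_vectors)
print(predicted)
```

The quality of the prediction then depends entirely on how good the query vector for the new document is.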

Hi Davek,

Thank you very much for the explanation. Now I understand why the original workflow (with the Word Vector Learner that does not use the ground-truth labels) is correct.

But I have two questions:

1) I have modified the workflow 08_sentiment_classification_using_word_vectors (attached) to try the second approach that you suggested: use the ground-truth labels in the Word Vector Learner to get something like a centroid of the classes, and then use a similarity search to predict the labels.

But I don't understand why the algorithm predicts everything wrong. What am I missing? Could you help me please?

2) Do you have any suggestions on how to choose the parameters of the Word Vector Learner? Any rules of thumb? Optimising them in a loop can be very time-consuming with deep learning.

Thanks in advance

Regards

 

Hi iiiaaa,

1) I've had a look at your workflow and I think you are doing nothing wrong. However, to explain the bad results, I added a small visualization of the resulting vectors (just a 2-dim PCA and a plot) to the workflow (see attachment). There one can clearly see that the vectors created by the Word Vector Apply node are really bad. The node basically just takes a sentence, looks up each word in the Word Vector Model to get the corresponding vector, and then calculates the mean of these vectors. This is a very basic baseline approach for converting documents into vectors using a Word Vector Model. Maybe you noticed the small warning message of the Apply node:

WARN  Word Vector Apply    0:65       7058 words are not contained in the WordVector vocabulary.

If a word of a sentence is not contained in the Word Vector Model vocabulary, it is simply skipped. In this case that affects quite a lot of words. So I think the training corpus is just too small to get sensible results with the centroid approach, because too many words are missing.

2) It is very hard to give general advice on hyperparameter tuning. Here I would look at the original paper and use the parameters they used as a starting point (although in this case the parameters do not seem to be the problem). Using more epochs is probably also a good idea. Unfortunately, you often just have to play around with the parameters a bit.
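For reference, the averaging described under 1) can be sketched in a few lines of Python. This is only an illustration of the idea, not the node's actual implementation. The function `mean_vector` and the two-word vocabulary are hypothetical.

```python
def mean_vector(tokens, word_vectors):
    """Average the vectors of all tokens found in the vocabulary.

    Returns (mean vector or None, number of tokens that were skipped).
    """
    found = [word_vectors[t] for t in tokens if t in word_vectors]
    missing = len(tokens) - len(found)
    if not found:
        return None, missing  # no known word: no vector can be built
    dim = len(found[0])
    mean = [sum(vec[i] for vec in found) / len(found) for i in range(dim)]
    return mean, missing


# Tiny made-up vocabulary: "cinematography" is not in it.
word_vectors = {"good": [1.0, 0.0], "film": [0.0, 1.0]}

vec, missing = mean_vector(["good", "film", "cinematography"], word_vectors)
print(vec, missing)
```

The skipped word contributes nothing to the mean, so when many words are missing the document vector is built from only a small part of the text.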

Regards

David
