Word Vector - Fields documentation (or explanation?)

Hey there,

I've been doing some testing with deep learning, focused on text processing, document classification, sentiment analysis, etc. I've checked the examples on the server, and it's pretty clear how to learn a vector and apply it to a set of documents. But I'd still like to understand better how the Word Vector node works; more specifically, what do the fields in the node configuration mean (seed, learning rate, minimum learning rate, etc.)?

I understand some of them are (kind of) self-explanatory, but what does "learning rate" really mean? Or batch size - is it the number of documents in each iteration? And epochs? :)

As I'm still learning all this machine learning/data analysis thing, some of those terms are not familiar to me.

I checked the site but couldn't find anything specific to KNIME: https://deeplearning4j.org/

Any documentation or tips would be greatly appreciated!

Gustavo

Not a word2vec expert here, but terminology such as `epochs` or `learning rate` sounds a lot like neural network jargon. This is the link provided in the KNIME documentation: https://deeplearning4j.org/word2vec

You will also find this link on the aforementioned page: https://www.quora.com/How-does-word2vec-work
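For what it's worth, the same knobs show up in other Word2Vec implementations, so a rough sketch outside of KNIME may help pin down the terminology. Below is what the analogous parameters look like in Python's gensim library (the parameter names, values, and toy corpus are gensim's/mine, not the KNIME node's, so treat this purely as an illustration of the concepts):

```python
from gensim.models import Word2Vec

# A toy corpus: each document is a list of tokens.
sentences = [
    "knime makes text mining easy".split(),
    "word vectors capture word meaning".split(),
]

model = Word2Vec(
    sentences,
    vector_size=100,   # dimensionality of the learned word vectors
    window=5,          # context window: how many neighboring words are considered
    min_count=1,       # ignore words that occur fewer times than this
    seed=42,           # random seed, so the vector initialization is reproducible
    alpha=0.025,       # (initial) learning rate: step size of each weight update
    min_alpha=0.0001,  # the learning rate decays linearly down to this value
    epochs=5,          # number of full passes over the whole corpus
    batch_words=10000, # gensim's "batch" is counted in words handed to each worker;
                       # other implementations may count documents/sentences instead
)

print(model.wv["word"][:5])  # first few dimensions of one learned word vector
```

So, roughly: the learning rate controls how strongly each training example nudges the weights, the minimum learning rate is the value it decays to by the end of training, an epoch is one complete pass over all documents, and the seed just makes the random initialization reproducible.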


This is easily an intimidating topic (even the non-"deep learning" part!). That's why I usually stick with simpler text representations whenever I can (e.g. the standard text processing nodes, BoW, nearest neighbors with string distance functions, etc.) - that will solve most common problems.

If text processing is what you want to do, you should definitely check out this reference (if you don't know it already): Introduction to Information Retrieval

Great information Geo, appreciate your help!

Yes, I'm focused on applying text processing to social media/web analysis. I've been using KNIME for a year and a half now, I'm pretty familiar with the text processing nodes, and I've been working mainly with BoW, co-occurrences, and n-grams to understand trending topics and conversations.

Now I want to advance a bit further and classify documents by category and sentiment, besides identifying potential topics automatically (the topic extractor isn't really good for this). I've checked the sample workflows in KNIME that use Decision Tree and SVM learners, but my document vector is causing processing problems (it's too large), and Word2Vec seemed to work best in this case.

Anyway, I appreciate your help; this is really useful information, as I want to learn more in this area.

Gustavo Velho

Hi,

I am trying to classify sentiment using the WordVectorLearner in KNIME, and I have taken help from the example Sentiment Classification Using Word Vectors (source: https://www.knime.org/nodeguide/analytics/deep-learning/sentiment-classification-using-word-vectors).

I don't understand why the vocabulary includes labels like "DOC_1", "DOC_2", etc. as words in my case, while in the example nothing like this happens. Why would it include the labels of the documents in the vocabulary? Please help me, I am kind of stuck and this is very confusing. I am unable to attach the file because of its large size.

Thanks.

Hi Gustavo,

About your problem that the document vectors get too large: this is a big problem when dealing with these kinds of vectors, and there are a few things that you could do.

  1. Filter terms that occur in only a few documents before creating the vector (group the BoW, row-filter the low-frequency terms, and reference-row-filter the original BoW). This removes terms / features that will have very little impact on your classification because they are almost constant, with value = 0 in nearly all documents except a very few. According to Zipf's law this cuts off the long tail of low-frequency terms and reduces the feature space a lot (see the sketch after this list).
  2. Use the Document Vector Hashing node. This node hashes the terms and creates vectors of a predefined length. Be aware that quality is likely to decrease due to hash collisions.
  3. Apply a PCA after step 1 to further reduce the feature space / document vectors.
  4. Use Doc2Vec to create semantic document embeddings with a predefined vector size.
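
To make points 1.-3. a bit more concrete, here is a minimal sketch outside of KNIME using Python's scikit-learn (the sample documents and parameter values are made up; in KNIME the equivalent steps are GroupBy / Row Filter / Reference Row Filter on the BoW, the Document Vector Hashing node, and the PCA node):

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "good movie great acting",
    "great movie good story",
    "bad acting boring story",
    "boring movie bad script",
]

# 1. Drop low-frequency terms up front: min_df=2 keeps only terms that appear
#    in at least two documents, cutting off the long Zipf tail of rare terms.
counts = CountVectorizer(min_df=2).fit_transform(docs)
print(counts.shape)  # (4, 7): "script" occurs in only one document and is dropped

# 2. Feature hashing: vectors of a fixed, predefined length regardless of the
#    vocabulary size; collisions between terms can reduce quality.
hashed = HashingVectorizer(n_features=1024).fit_transform(docs)
print(hashed.shape)  # (4, 1024)

# 3. Dimensionality reduction on the filtered vectors; TruncatedSVD plays the
#    role of PCA for sparse document-term matrices.
reduced = TruncatedSVD(n_components=2).fit_transform(counts)
print(reduced.shape)  # (4, 2)
```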

I hope this helps.

About Aleenah's question - why does the vocabulary include the labels "DOC_1", ...? Word2Vec creates word vectors. To classify documents you need document vectors. Doc2Vec is a trick that uses the Word2Vec learning method but also creates vectors for documents, not only for terms. To identify the vectors for the documents, these labels are included. A document is treated, in a sense, as a set of terms rather than as a single term.
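
If it helps, here is a minimal gensim Doc2Vec sketch (again outside of KNIME; the texts and tags are invented) showing how each document gets a label/tag of its own and a vector is learned for that label, which is why labels like DOC_1 show up alongside the ordinary words:

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each document is a bag of tokens plus a tag; the tag plays the same role
# as the "DOC_1", "DOC_2" labels you are seeing.
corpus = [
    TaggedDocument(words="great movie loved the acting".split(), tags=["DOC_0"]),
    TaggedDocument(words="boring plot terrible acting".split(), tags=["DOC_1"]),
]

model = Doc2Vec(corpus, vector_size=20, min_count=1, epochs=40)

# The learned document vectors are looked up via their labels ...
doc_vec = model.dv["DOC_1"]
# ... while ordinary word vectors are looked up via the words themselves.
word_vec = model.wv["acting"]

# gensim keeps the document tags in a separate lookup table (model.dv); the
# DL4J-based implementation behind the KNIME node stores them in the
# vocabulary itself, which is why DOC_1, DOC_2, ... appear there as "words".
print(doc_vec.shape, word_vec.shape)  # (20,) (20,)
```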

Cheers, Kilian