DL4J, Word Embeddings, Word2Vec, Vocabulary Extractor

Could you please help with the following problem.
I am using DL4J, Word Embeddings, Word2Vec.
After the Word2Vec Learner node, the Vocabulary Extractor made some mistakes – bad merges. Please see the rows marked with red lines.

The Bag of Words output is good; the Vocabulary Extractor output is bad.

What should I do to avoid these mistakes, as they corrupt all the calculations?

For now I am forced to analyze the bad combinations and insert them into the table of selected words to preserve the estimation.


Thank you in advance.

Hi @Vladimir_Savin,

the tokenizer of the Word2Vec node just splits on whitespace. For the Bag Of Words (or, more precisely, when the Documents are created), a more sophisticated technique is used (it can be selected in the configuration of the Strings To Document node). Hence, in your highlighted examples the terms do not seem to be separated by whitespace in the original data, so the Word2Vec node recognizes them as a single term. Sorry for the inconvenience.
As a workaround, you could tokenize the raw data (I assume every document is a String in your case) before feeding it to the Word2Vec node, i.e. insert a whitespace after every term. I’ll try to find a convenient way to do this in KNIME.

You could do it with NLTK in a Python Script node. On the linked page, the first example shows how to tokenize a string.
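As a minimal sketch of that tokenize-and-rejoin step, here is a plain-Python version using only the standard `re` module (NLTK's `word_tokenize` or `wordpunct_tokenize` would give a similar result); the sample string is made up for illustration:

```python
import re

def tokenize_with_whitespace(text):
    # Split into word tokens and individual punctuation tokens,
    # then rejoin with single spaces so every term is
    # whitespace-separated for the Word2Vec Learner node.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return " ".join(tokens)

print(tokenize_with_whitespace("HIPAA-compliant storage.Encryption required."))
# HIPAA - compliant storage . Encryption required .
```

Applied to each document String before the Word2Vec node, this guarantees that terms previously joined by punctuation are seen as separate tokens.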

Cheers
David

Hi David,

Thank you for the quick response and advice.

I looked through these sentences – they are split by dots.

Earlier, I removed the Punctuation Erasure node.

I will try to use the Python Script node.

Thank you again.

Best regards,

Vladimir Savin

Hi David,

I have manually updated the Word docs with additional whitespaces in the original data, marked in green; please see the attached doc.

Unfortunately, the results are the same.

Please help.

Hi @Vladimir_Savin,

I’m sorry to hear that. Could you maybe upload a workflow showing the problem?

Hi David, I sent you an email with my workflow. Thank you in advance.

Hi @Vladimir_Savin,

unfortunately, I have not received an email yet. Would it be possible to just attach the workflow to your next post?

Word2Vec_example_Jan2019.knwf (106.5 KB)
HIPAA_Contract.docx (12.3 KB)
ISO27001_Contract.docx (12.8 KB)

Please find my workflow and examples of the input data uploaded above.

Word2Vec_example_Jan2019_.knwf (1.3 MB)
Please see this workflow variant as well.

Hi @Vladimir_Savin,

thanks for the workflows. After the content Strings are converted to Documents, the N Chars Filter seems to connect some words if they are separated by punctuation followed by whitespace (we are going to look into why that happens). You can see this by inspecting the Documents with a Document Viewer node before and after the N Chars Filter node. As these words get connected, the Word2Vec Learner node no longer recognizes them as separate terms. To fix this, just use a Punctuation Erasure node before the N Chars Filter; this way the N Chars Filter won’t connect the words anymore. Then the Vocabulary Extractor outputs the same number of words as the Bag Of Words Creator (if duplicates are removed).
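To illustrate why the node order matters, here is a plain-Python sketch of the idea (this is not the KNIME nodes themselves; the sample text and the length threshold `n` are made up): erasing punctuation first keeps neighbouring words separated before any length-based filtering runs.

```python
import re

def erase_punctuation(text):
    # Replace every punctuation character with a space so that
    # words separated only by punctuation remain separate tokens.
    return re.sub(r"[^\w\s]", " ", text)

def n_chars_filter(tokens, n=3):
    # Keep only terms with at least n characters (a rough stand-in
    # for a length-based term filter).
    return [t for t in tokens if len(t) >= n]

text = "access control. audit logging,encryption"
tokens = n_chars_filter(erase_punctuation(text).split())
print(tokens)
# ['access', 'control', 'audit', 'logging', 'encryption']
```

Running the length filter on the raw text instead would leave `logging,encryption` as one fused token, which is the same kind of bad merge the Vocabulary Extractor showed.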

I attached a workflow showing my described solution.

I hope that helps.

Cheers

Word2Vec_example_Jan2019_fixed.knwf (237.5 KB)


Hi David,

It works.
Thanks a lot.

Best regards,

Vladimir Savin

Great to hear that! Could you maybe mark my answer as the solution? This way the thread gets displayed as solved in the overview.


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.