Could you please help with the following problem.
I am using DL4J, Word Embeddings, Word2Vec.
After the Word2Vec Learner node, the Vocabulary Extractor made some mistakes – bad mergers of words. Please see the rows marked with red lines.
The Bag of Words output is correct, but the Vocabulary Extractor output is not.
What should I do to avoid these mistakes, as they corrupt all subsequent calculations?
For now I am forced to analyze the bad combinations manually and insert them into the table of selected words to salvage the estimation.
The tokenizer of the Word2Vec node just splits on whitespace. For the Bag Of Words (or, more precisely, when the Documents are created), a more sophisticated technique is used (it can be selected in the configuration of the Strings To Document node). Hence, in your highlighted examples the terms do not seem to be separated by whitespace in the original data, so the Word2Vec node treats them as a single term. Sorry for the inconvenience.
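The tokenizer mismatch can be illustrated in plain Python. This is only a sketch: the regex tokenizer below is a stand-in assumption for whatever technique the Strings To Document node actually applies, and the vocabularies are simple sets of unique terms.

```python
import re

def whitespace_vocabulary(docs):
    """Unique terms as the Word2Vec Learner's tokenizer sees them:
    a plain split on whitespace, so punctuation stays attached."""
    return {term for doc in docs for term in doc.split()}

def document_vocabulary(docs):
    """Unique terms from a more careful tokenization (a simple \\w+ regex
    here, standing in for the Strings To Document tokenizer)."""
    return {term for doc in docs for term in re.findall(r"\w+", doc)}

docs = ["deep learning, word embeddings", "deep learning is fun"]

# The whitespace split keeps "learning," and "learning" as two distinct
# terms, so the two vocabularies end up with different sizes.
print(sorted(whitespace_vocabulary(docs)))
print(sorted(document_vocabulary(docs)))
```

When the two tokenizations agree, the two vocabularies have the same size; a size difference is a quick hint that terms were merged or split differently.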
As a workaround, you could tokenize the raw data (I assume every document is a String in your case) before feeding it to the Word2Vec node, i.e., insert a whitespace after every term. I'll try to find a convenient way to do this in KNIME.
You could do it with NLTK using a Python Script node. On the linked page, the first example shows how to tokenize a string.
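As a rough sketch of what such a Python Script node could do: split the raw string into terms and re-join them with single whitespaces, so the Word2Vec node's whitespace tokenizer sees every term separately. The regex here is a portable stand-in assumption; NLTK's `word_tokenize` (as in the linked example) would be the more thorough choice.

```python
import re

def tokenize_with_spaces(document: str) -> str:
    """Split a raw document into terms and re-join them with single
    whitespaces.  \\w+ matches runs of word characters; [^\\w\\s]
    matches individual punctuation marks as their own tokens."""
    tokens = re.findall(r"\w+|[^\w\s]", document)
    return " ".join(tokens)

# Punctuation glued to a word gets separated by whitespace:
print(tokenize_with_spaces("word2vec,embeddings.test"))
# -> "word2vec , embeddings . test"
```

In KNIME, this function body would go into the Python Script node, applied to the String column before the Word2Vec Learner.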
Thanks for the workflows. After the content Strings are converted to Documents, the N Chars Filter seems to connect some words if they are separated by punctuation followed by whitespace (we are going to look into why that happens). You can see this by inspecting the Documents with a Document Viewer node before and after the N Chars Filter node. As these words are then connected, the Word2Vec Learner node no longer recognizes them as separate terms. To fix this, just use a Punctuation Erasure node before the N Chars Filter; this way the N Chars Filter won't connect the words anymore. Then the Vocabulary Extractor outputs the same number of words as the Bag Of Words Creator (if duplicates are removed).
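To picture why the node order matters, here is a simplified plain-Python stand-in for the two nodes. This only models the behavior described above as an assumption, not the actual KNIME implementation: punctuation is stripped from each term first, so the length filter then operates on clean, separate terms.

```python
import string

def punctuation_erasure(terms):
    # Stand-in for the Punctuation Erasure node: strip leading/trailing
    # punctuation from each term and drop terms that become empty.
    stripped = (t.strip(string.punctuation) for t in terms)
    return [t for t in stripped if t]

def n_chars_filter(terms, n=3):
    # Stand-in for the N Chars Filter node: keep only terms with
    # at least n characters.
    return [t for t in terms if len(t) >= n]

terms = ["deep", "learning,", "word", "embeddings"]

# Erasure first, filter second: every term stays separate and clean.
print(n_chars_filter(punctuation_erasure(terms)))
```

With this ordering, no punctuation-adjacent terms are left for the filter to mangle, which is why the vocabulary counts line up afterwards.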
I have attached a workflow showing the solution described above.