I’m trying to compute string distances to use for clustering, but my result is NaN. Could you please review my WF and tell me what I’m doing wrong? Corpus Generation.knwf (31.9 KB)
Looks like (1) some settings in the WF are off, (2) there’s a bug in the TF-IDF node which can produce the NaNs, and (3) it would be good to have an example workflow which shows the node’s usage.
I will get back once I have a fix for these points; it will ship with the next Palladian release. Some patience, please.
Thank you, @qqilihq. Could you please explain which settings are off? Also, I will be on KNIME version 4.1 until February 2021, so I’d like the fix to work with that version.
@ipazin, if you want to reclassify my question, I’d like to have it as Text Mining, so that a non-Palladian solution can also be suggested to me.
Sure. I thought it was closely related to Palladian.
I fixed the data and got distances from the TF-IDF Similarity node. How can I cluster the documents now? I attached the fixed WF. Corpus Generation.knwf (18.6 KB)
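For what it’s worth, once you have a pairwise distance matrix, one common way to cluster documents is k-medoids (PAM-style). Below is a minimal plain-Python sketch; the function name, the deterministic initialization, and the toy data are all illustrative assumptions, not the KNIME node’s actual implementation.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster from a precomputed symmetric n x n distance matrix.

    Returns (medoid indices, cluster label per point).
    Illustrative sketch only; real implementations use smarter initialization.
    """
    n = len(dist)
    medoids = list(range(k))  # deterministic init: first k points
    labels = [0] * n
    for _ in range(max_iter):
        # Assign each point to its nearest medoid.
        labels = [min(range(k), key=lambda m: dist[i][medoids[m]])
                  for i in range(n)]
        new_medoids = []
        for m in range(k):
            members = [i for i in range(n) if labels[i] == m]
            if not members:  # empty cluster: keep the old medoid
                new_medoids.append(medoids[m])
                continue
            # New medoid = member minimizing total distance to its cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids, labels

# Toy example: four points on a line, two obvious clusters.
points = [0, 1, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]
print(k_medoids(dist, 2))  # clusters {0, 1} and {10, 11}
```

In KNIME terms this is roughly what a distance-matrix-based clustering node does internally; the sketch is just meant to show the mechanics.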
Could someone suggest how to simulate the Palladian TF-IDF node with the KNIME TF and IDF nodes?
Can’t you just calculate the TF-IDF multiplicatively using a Math Formula node?
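To illustrate what the multiplicative calculation would look like, here is a plain-Python sketch. The function name and the example numbers are made up, and the exact weighting/smoothing variant Palladian uses may differ, so treat this as the generic textbook formula only.

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """Generic (unsmoothed) TF-IDF: relative term frequency times log IDF."""
    tf = term_count / doc_length                # relative term frequency
    idf = math.log(num_docs / docs_with_term)   # inverse document frequency
    return tf * idf

# Example: a term appearing 3 times in a 100-word document,
# and in 10 of 1000 documents in the corpus.
print(tf_idf(3, 100, 1000, 10))  # tf = 0.03, idf = ln(100)
```

In KNIME, the same multiplication could be done in a Math Formula node on the columns produced by the TF and IDF nodes.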
I’m not asking how to calculate the node’s result; I’m asking how to emulate the TF-IDF node from Palladian.
Philipp, please also take a look at the node’s performance. I noticed two things:
- The node takes a long time for bigger inputs.
- It uses 100% of the processor; as a result, in my case the VPN lost connectivity, and after that any network operation became impossible. The processing time was about 30 minutes.
Performance was not a top priority, but the node’s logic is rather simple, and I’m not sure what could be optimized. I remember running it on a rather big text corpus back then (100,000 docs?!)
The issue you describe sounds more like KNIME hitting a memory limit (hence the CPU usage). I suggest keeping an eye on the memory gauge (enable “Heap status” in the KNIME preferences) next time you run it. If memory runs full, either increase it, or use simpler feature settings (word unigrams instead of n-grams, shorter length, …)
I fixed the issue you described above, but I need to tie up several other loose ends before releasing a new version. This will still take some time.
Thank you, Philipp. Memory is not an issue in my case; it is the heavy CPU use. If possible, the job should be spread across cores. I have 8 cores / 16 virtual cores.
Also, the node needs a better description; the configuration parameters are not mentioned in it.
Thanks for the feedback, I’ll see what we can do!
A couple more questions: the k-Medoids node gives an error message when I set 5 clusters, while k-Means does not.
The Document Vector node is losing the Prov column, so I cannot reference back to identify the document owner.
This is my last attempt to simulate document clustering: Corpus Generation.knwf (62.3 KB)
In any case, k-Medoids needs to be fixed for cluster counts higher than it can actually generate.
I am new to web scraping.
I just came to know through this thread that the Palladian nodes are good for web scraping.
I want to extract content based on certain keywords from a set of websites.
Further, I want to use the text for measuring cosine similarity, LDA, and so on.
Are any hints/workflows available?
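As a rough starting point for the similarity part: cosine similarity between two term-weight vectors (e.g. TF-IDF vectors of scraped pages) can be computed as below. This is a generic plain-Python sketch with made-up example vectors, independent of any specific KNIME or Palladian node.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors, which have no defined direction.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two toy term-weight vectors sharing one of two nonzero terms.
print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 0.5
```

In a KNIME workflow you would typically get these vectors from a Document Vector node and then apply a distance/similarity node, but the math underneath is just this.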
This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.