I’m trying to compute string distances to use for clustering, but my result is NaN. Could you please review my WF and tell me what I’m doing wrong? Corpus Generation.knwf (31.9 KB)
Looks like (1) some settings in the WF are off, (2) there’s a bug in the TF-IDF node which can produce the NaNs, and (3) it would be good to have an example workflow which shows the node’s usage.
I will get back once I have a fix for these points; it will ship with the next Palladian release. Some patience, please.
Thank you, @qqilihq. Could you please explain which settings are off? Also, I will be on KNIME version 4.1 until February 2021, so I’d like the fix to work with that version.
@ipazin, if you want to reclassify my question, I’d like to have it as Text Mining, so that a non-Palladian solution can also be suggested to me.
Sure. I thought it was closely related to Palladian.
I fixed the data and got distances from the TF-IDF Similarity node. How can I cluster the documents now? I attached the fixed WF. Corpus Generation.knwf (18.6 KB)
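For what it’s worth, once you have a pairwise distance matrix, one common way to cluster documents is k-medoids (PAM-style). Below is a minimal plain-Python sketch; the function name, the deterministic initialization, and the toy data are all illustrative assumptions, not the KNIME node’s actual implementation.

```python
def k_medoids(dist, k, max_iter=100):
    """Cluster from a precomputed symmetric n x n distance matrix.

    Returns (medoid indices, cluster label per point).
    Illustrative sketch only; real implementations use smarter initialization.
    """
    n = len(dist)
    medoids = list(range(k))  # deterministic init: first k points
    labels = [0] * n
    for _ in range(max_iter):
        # Assign each point to its nearest medoid.
        labels = [min(range(k), key=lambda m: dist[i][medoids[m]])
                  for i in range(n)]
        new_medoids = []
        for m in range(k):
            members = [i for i in range(n) if labels[i] == m]
            if not members:  # empty cluster: keep the old medoid
                new_medoids.append(medoids[m])
                continue
            # New medoid = member minimizing total distance to its cluster.
            new_medoids.append(
                min(members, key=lambda c: sum(dist[c][j] for j in members)))
        if new_medoids == medoids:  # converged
            break
        medoids = new_medoids
    return medoids, labels

# Toy example: four points on a line, two obvious clusters.
points = [0, 1, 10, 11]
dist = [[abs(a - b) for b in points] for a in points]
print(k_medoids(dist, 2))  # clusters {0, 1} and {10, 11}
```

In KNIME terms this is roughly what a distance-matrix-based clustering node does internally; the sketch is just meant to show the mechanics.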
Could someone suggest how to simulate the Palladian TF-IDF node with the KNIME TF and IDF nodes?
Can’t you just calculate the TF-IDF multiplicatively using a Math Formula node?
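To illustrate what the multiplicative calculation would look like, here is a plain-Python sketch. The function name and the example numbers are made up, and the exact weighting/smoothing variant Palladian uses may differ, so treat this as the generic textbook formula only.

```python
import math

def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    """Generic (unsmoothed) TF-IDF: relative term frequency times log IDF."""
    tf = term_count / doc_length                # relative term frequency
    idf = math.log(num_docs / docs_with_term)   # inverse document frequency
    return tf * idf

# Example: a term appearing 3 times in a 100-word document,
# and in 10 of 1000 documents in the corpus.
print(tf_idf(3, 100, 1000, 10))  # tf = 0.03, idf = ln(100)
```

In KNIME, the same multiplication could be done in a Math Formula node on the columns produced by the TF and IDF nodes.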
I’m not asking how to calculate the node’s result; I’m asking how to emulate the TF-IDF node from Palladian.
Philipp, please also take a look at the node’s performance. I noticed two things:
- The node takes a long time for bigger inputs.
- It uses 100% of the processor; as a result, in my case the VPN lost connectivity, and after that any network operation became impossible. The processing time was about 30 minutes.
Performance was not a top priority, but the node’s logic is rather simple, and I’m not sure what could be optimized. I remember running it on a rather big text corpus back then (100,000 docs?!)
The issue you describe sounds more like KNIME hitting a memory limit (hence the CPU usage). I suggest keeping an eye on the memory gauge (enable “Heap status” in the KNIME preferences) next time you run it. If memory runs full, either increase it, or use simpler feature settings (word unigrams instead of n-grams, shorter length, …)
I fixed the issue you described above, but I need to tie up several other loose ends before releasing a new version. This will still take some time.
Thank you, Philipp. Memory is not an issue in my case; it is the heavy CPU use. If possible, the job should be spread across cores. I have 8 cores / 16 virtual cores.
Also, the node needs a better description; the configuration parameters are not mentioned in it.
Thanks for the feedback, I’ll see what we can do!
A couple more questions: the k-Medoids node gives an error message when I set 5 clusters, while k-Means does not.
The Document Vector node is losing the Prov column, so I cannot reference back to identify the document owner.
This is my last attempt to simulate document clustering: Corpus Generation.knwf (62.3 KB)
In any case, k-Medoids needs to be fixed for cluster counts higher than it can actually generate.
I am new to web scraping.
I just came to know through this thread that the Palladian nodes are good for web scraping.
I want to extract content based on certain keywords from a set of websites.
Further, I want to use the text for measuring cosine similarity, LDA, and so on.
Are any hints/workflows available?
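As a rough starting point for the similarity part: cosine similarity between two term-weight vectors (e.g. TF-IDF vectors of scraped pages) can be computed as below. This is a generic plain-Python sketch with made-up example vectors, independent of any specific KNIME or Palladian node.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors, which have no defined direction.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Two toy term-weight vectors sharing one of two nonzero terms.
print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 0.5
```

In a KNIME workflow you would typically get these vectors from a Document Vector node and then apply a distance/similarity node, but the math underneath is just this.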
This topic was automatically closed 182 days after the last reply. New replies are no longer allowed.