Encountered duplicate row ID while using OPTICS cluster compute

paul_louis_cloo · August 21, 2020, 3:00pm

Hi,

I am very new to KNIME and am currently exploring the OPTICS algorithm for clustering. I have a CSV file with a column ‘Word’ containing a phrase, followed by 1024 columns named embedding_0…embedding_1023, which contain the BERT embedding of that phrase.

I have about 4900 such rows on which I would like to run the OPTICS algorithm.
To run OPTICS I am using a combination of two nodes, OPTICS cluster compute and OPTICS cluster assigner.

However, whenever I configure OPTICS cluster compute with the distance selection set to cosine, the execution fails with “Execute failed: Encountered duplicate row ID “Row N””, where N is the number of the row at which it failed. With another distance selection, such as Levenshtein, it works well.

I looked at the data itself, but there are no duplicates in it. DBSCAN was able to produce an output on the same data.
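
For anyone who wants to repeat the duplicate check outside KNIME, here is a minimal sketch. It assumes the CSV layout described above (a ‘Word’ column followed by embedding_* columns). The function name find_duplicates and the tiny in-memory sample are hypothetical and only stand in for the real file.

```python
import csv
import io
from collections import Counter


def find_duplicates(csv_file):
    """Return (duplicate words, duplicate embedding vectors) found in the CSV."""
    reader = csv.DictReader(csv_file)
    words = Counter()
    vectors = Counter()
    for row in reader:
        words[row["Word"]] += 1
        # Compare embeddings as raw strings so no float rounding is involved.
        embedding = tuple(v for k, v in row.items() if k.startswith("embedding_"))
        vectors[embedding] += 1
    dup_words = [w for w, n in words.items() if n > 1]
    dup_vectors = [v for v, n in vectors.items() if n > 1]
    return dup_words, dup_vectors


# Tiny in-memory stand-in for the real file (2 embedding columns instead of 1024).
sample = io.StringIO(
    "Word,embedding_0,embedding_1\n"
    "red apple,0.1,0.2\n"
    "green pear,0.3,0.4\n"
    "red apple,0.5,0.6\n"
)
dup_words, dup_vectors = find_duplicates(sample)
print(dup_words)    # ['red apple']
print(dup_vectors)  # []
```

With the real file, the sample would be replaced by an opened file handle, for example open(path, newline="", encoding="utf-8"). Two empty lists mean that neither a phrase nor an embedding vector is repeated.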

This is what my workflow looks like:

Any help is much appreciated.
justOPTICS.knwf (my workflow) (9.2 KB)
link to input file

Iris · August 24, 2020, 3:42pm

Hi @paul_louis_cloo

thank you for bringing this up.

I can confirm that this is a bug.
I will pass it on to our dev team.

Thank you for reporting.

paul_louis_cloo · August 24, 2020, 5:52pm

Thanks @Iris for letting me know and taking it forward!

HeliaPakzad996 · August 27, 2020, 7:14pm

Is the BERT embedding like Google BERT?

paul_louis_cloo · August 28, 2020, 4:07am

@HeliaPakzad996, yes, this is Google BERT. To be precise, the CSV was created in Python with the sentence-transformers library, using the model bert-large-nli-stsb-mean-tokens.
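
Roughly, a file with this layout can be produced as in the sketch below. This is not the exact script that was used: the helper name write_embeddings_csv and the dummy vectors are hypothetical, and the commented-out lines show how the vectors would be obtained from sentence-transformers, assuming the library is installed and the model can be downloaded.

```python
import csv
import io


def write_embeddings_csv(phrases, vectors, out_file):
    """Write one row per phrase: 'Word', then embedding_0 ... embedding_{dim-1}."""
    vectors = list(vectors)
    dim = len(vectors[0]) if vectors else 0
    writer = csv.writer(out_file)
    writer.writerow(["Word"] + [f"embedding_{i}" for i in range(dim)])
    for phrase, vec in zip(phrases, vectors):
        if len(vec) != dim:
            raise ValueError(f"expected {dim} values for {phrase!r}, got {len(vec)}")
        writer.writerow([phrase] + [float(x) for x in vec])


phrases = ["red apple", "green pear"]

# With sentence-transformers available, the vectors would come from the model:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("bert-large-nli-stsb-mean-tokens")
#   vectors = model.encode(phrases)  # one 1024-dimensional vector per phrase
# Dummy 3-dimensional vectors are used here so the sketch runs without the model.
vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

buffer = io.StringIO()
write_embeddings_csv(phrases, vectors, buffer)
print(buffer.getvalue())
```

To write to disk instead of the in-memory buffer, pass a file opened with open(path, "w", newline="", encoding="utf-8").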

system · September 4, 2020, 4:07am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.