Encountered duplicate row ID while using OPTICS cluster compute

Hi,

I am very new to KNIME and am currently exploring the OPTICS algorithm for clustering. I have a CSV file with a column ‘Word’ containing a phrase, followed by 1024 columns named embedding_0…embedding_1023, which contain the BERT embedding of that phrase.

I have around 4900 such rows on which I would like to run the OPTICS algorithm.
To run OPTICS I am using a combination of two nodes, OPTICS cluster compute and OPTICS cluster assigner.

However, whenever I configure OPTICS cluster compute with the distance selection set to cosine, the execution fails with “Execute failed: Encountered duplicate row ID “Row N””, where N is the row number at which it failed. With another distance selection, such as Levenshtein, it works well.

I looked at the data itself, but there are no duplicates in it. DBSCAN on the same data was able to produce an output.
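
For reference, a check along the following lines is one way to confirm that the input has no duplicates. It is only a minimal sketch, assuming the CSV layout described above (a ‘Word’ column followed by embedding_* columns); the function name find_duplicates and the tiny in-memory sample are hypothetical. It also lists all-zero vectors, since cosine distance is undefined for them, although nothing in this thread establishes that they are related to the error.

```python
import csv
import io
from collections import Counter


def find_duplicates(csv_file):
    """Return (duplicate_words, duplicate_vectors, zero_vector_rows) for a CSV
    with a 'Word' column followed by embedding_* columns."""
    reader = csv.DictReader(csv_file)
    emb_cols = [c for c in reader.fieldnames if c.startswith("embedding_")]
    words = Counter()
    vectors = Counter()
    zero_rows = []
    for i, row in enumerate(reader):
        vec = tuple(float(row[c]) for c in emb_cols)
        words[row["Word"]] += 1
        vectors[vec] += 1
        # Cosine distance is undefined for an all-zero vector
        if all(v == 0.0 for v in vec):
            zero_rows.append(i)
    dup_words = sorted(w for w, n in words.items() if n > 1)
    dup_vectors = [v for v, n in vectors.items() if n > 1]
    return dup_words, dup_vectors, zero_rows


# Tiny in-memory sample standing in for the real file (hypothetical data)
sample = io.StringIO(
    "Word,embedding_0,embedding_1\n"
    "red car,0.1,0.2\n"
    "blue car,0.3,0.4\n"
    "red car,0.1,0.2\n"
)
dup_words, dup_vectors, zero_rows = find_duplicates(sample)
print(dup_words, dup_vectors, zero_rows)
```

For the real file, the same function can be called with `open(path, newline="")` in place of the in-memory sample; three empty lists would mean no repeated phrases, no repeated vectors and no all-zero vectors.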

This is what my workflow looks like:
image

Any help is much appreciated.
justOPTICS.knwf(my workflow) (9.2 KB)
link to input file

Hi @paul_louis_cloo

Thank you for bringing this up.

This is a bug, and I can reproduce it.
I will forward it to our dev team.

Thank you for reporting.


Thanks @Iris for letting me know and taking it forward!


Is the BERT embedding the same as Google BERT?

@HeliaPakzad996, yes, this is Google BERT. To be precise, the CSV was created with the sentence-transformers library in Python, using the model bert-large-nli-stsb-mean-tokens.
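
To illustrate the file layout, here is a minimal sketch of how such a CSV could be written. It is not the exact script used: write_embedding_csv and fake_encode are hypothetical names, and the stand-in encoder returns three made-up numbers per phrase so that the example runs without downloading a model. With sentence-transformers, the encoder would be something like `SentenceTransformer("bert-large-nli-stsb-mean-tokens").encode`, which gives 1024 values per phrase.

```python
import csv
import io


def write_embedding_csv(phrases, encode, out):
    """Write one row per phrase: 'Word' followed by embedding_0..embedding_{d-1}."""
    vectors = encode(phrases)
    dim = len(vectors[0])
    writer = csv.writer(out)
    writer.writerow(["Word"] + [f"embedding_{i}" for i in range(dim)])
    for phrase, vec in zip(phrases, vectors):
        writer.writerow([phrase] + [float(x) for x in vec])


def fake_encode(phrases):
    # Stand-in encoder (assumption): 3 made-up numbers per phrase,
    # where a real model would return 1024 floats each
    return [[float(len(p)), 0.0, 1.0] for p in phrases]


buf = io.StringIO()
write_embedding_csv(["red car", "sky"], fake_encode, buf)
lines = buf.getvalue().splitlines()
print(lines[0])
```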
