Numeric Distances outputs 0 and DBSCAN raises an error about missing columns

Hi! I am new to the KNIME platform and am testing it on a workflow that clusters text items with DBSCAN. I read data from a CSV file, filter some rows and drop some columns, and then use the Text Embedder node with the OpenAI Embeddings Connector to obtain the embeddings of the text items. Because the embeddings come as a list, I use the Split Collection Column node to get the embedding values as separate columns, which I feed into the Numeric Distances node, which in turn feeds the DBSCAN node. The OpenAI embeddings come with 1532 dimensions. Everything seems to work fine until I run the Numeric Distances node. It does not raise any error, but it does not seem to determine the distances, and DBSCAN then raises this error: Missing columns: “Split Value 1”, “Split Value 2”, “Split Value 3”, “Split Value 4”, … <243 more>. In the Output panel of the Numeric Distances node I see a folder-like structure in which the folder distance-characteristics contains array-size [xint] → 1 and 0 [xstring] → METRIC.

I am not sure what this means, but it seems that the node cannot calculate the distances that need to be fed into the DBSCAN distance model port. So I am stuck, as there is no error message when the Numeric Distances node is executed. Does anyone know what the problem could be and how it can be fixed? Thanks!
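For comparison outside KNIME, the steps described above (list-valued embeddings, split into one column per value, then pairwise Euclidean distances) can be sketched in plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors and the `item_a`/`item_b`/`item_c` keys are made up, the helper names are hypothetical, and nothing here reflects how the KNIME nodes work internally. The column names simply mimic the "Split Value N" names from the error message.

```python
import math

# Hypothetical toy embeddings; real OpenAI embeddings have far more dimensions.
embeddings = {
    "item_a": [0.10, 0.20, 0.30],
    "item_b": [0.10, 0.25, 0.35],
    "item_c": [0.90, -0.40, 0.00],
}


def split_collection(rows):
    """Turn each list into named columns: 'Split Value 1', 'Split Value 2', ..."""
    return {
        key: {f"Split Value {i + 1}": value for i, value in enumerate(vector)}
        for key, vector in rows.items()
    }


def euclidean(row_a, row_b, columns):
    """Euclidean distance between two rows over the given column names."""
    return math.sqrt(sum((row_a[c] - row_b[c]) ** 2 for c in columns))


table = split_collection(embeddings)
columns = [f"Split Value {i + 1}" for i in range(3)]
keys = list(table)

# Full pairwise distance matrix, keyed by (row, row).
matrix = {(a, b): euclidean(table[a], table[b], columns) for a in keys for b in keys}

for (a, b), d in matrix.items():
    print(f"{a} -> {b}: {d:.4f}")
```

If the distances between different items come out non-zero in a check like this, the embedding values themselves are usable for a numeric distance.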

Hi,
Hard to tell, but have you tried to test your flow with a reduced dataset? Let’s say with 10 columns? And then 50, 250, 1000?
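The same idea, increasing the number of columns step by step, can be tried outside the workflow too. The following is a rough sketch with randomly generated stand-in data; `FULL_DIM`, the row count, and the column counts are arbitrary assumptions chosen to keep the example small, and the function names are hypothetical.

```python
import math
import random

random.seed(0)  # fixed seed so the toy data is reproducible

FULL_DIM = 16  # hypothetical; stands in for the real embedding size
rows = [[random.uniform(-1.0, 1.0) for _ in range(FULL_DIM)] for _ in range(5)]


def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_pairwise_distance(vectors, k):
    """Largest distance between any two rows, using only the first k columns."""
    truncated = [v[:k] for v in vectors]
    return max(
        euclidean(truncated[i], truncated[j])
        for i in range(len(truncated))
        for j in range(i + 1, len(truncated))
    )


# Grow the number of columns step by step and watch where things break.
results = {k: max_pairwise_distance(rows, k) for k in (2, 4, 8, 16)}
for k, d in results.items():
    print(f"first {k:>2} columns: max distance = {d:.4f}")
```

If some column count suddenly yields zero or an error, that narrows down where the problem starts.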

Hi. Thanks for your reply. I tried again with a reduced dataset of 20 rows, but the result is the same. The Output panel of the Numeric Distances node has a columns folder showing array-size [xint] → 128, which I suppose is related to the dimensionality of the embeddings (I reduced the 1532 OpenAI dimensions to 128 when I configured the Numeric Distances node). The columns folder then lists each of the 128 columns, like this: Split Value 1 … Split Value 128. The odd thing is that these columns, which are supposed to hold float numbers, are displayed as [xstring]. I do not know whether this causes the issue, but no error is displayed when the Numeric Distances node is executed. When I configure the Numeric Distances node with Cosine distance instead of Euclidean, a new element appears in the distance_characteristics folder: 0 [xstring] → FAIL_ON_MISSING_VALUES. But the rows do not seem to have missing values, unless a few cases of 0 and -0 count as missing values. No such element shows up for the Euclidean option, even if I choose the Fail on missing value option. In addition, the Text Embedder is configured to fail if there is a missing value in the text column, but it does not fail.
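The two doubts raised here, whether the values are really numbers and whether 0 or -0 could count as missing, can be checked on exported data with a small script. This is a hedged sketch: `classify` and `cosine_distance` are hypothetical helpers, the rule "None or NaN means missing" is an assumption rather than KNIME's definition, and whether the [xstring] label means the values are stored as text is not confirmed by anything in this thread.

```python
import math


def classify(value):
    """Label a cell as 'missing', 'number', 'numeric string' or 'non-numeric string'."""
    if value is None:
        return "missing"
    if isinstance(value, str):
        try:
            float(value)
        except ValueError:
            return "non-numeric string"
        return "numeric string"
    if isinstance(value, float) and math.isnan(value):
        return "missing"
    return "number"


def cosine_distance(a, b):
    """1 - cosine similarity; returns None when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return None  # cosine similarity is undefined for an all-zero vector
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (norm_a * norm_b)


# 0 and -0.0 are ordinary numbers under this rule, not missing values.
for cell in (0.0, -0.0, None, float("nan"), "0.12", "abc"):
    print(repr(cell), "->", classify(cell))

# An all-zero row is the one case where cosine distance cannot be computed.
print(cosine_distance([0.0, -0.0], [1.0, 2.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

Under these assumptions, individual 0 or -0 entries are harmless; only a row that is zero in every column would be a problem for cosine distance, which Euclidean distance handles without issue.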