I did the ‘spark k-means’ to cluster my data, then I finished, I want to extract the data for further analysis by using ‘table to spark’. However, I got error. I checked KNIME log but cannot find the solution:
Then I think maybe first fetch 100 rows to see the output of ‘spark k-means’ node, but it still did not work, as shown in KNIME log:
The error message said checking KNIME log, but I am still confused.
Can you help me to solve this problem? I want to see my output spark data from ‘spark k-means’ and ‘spark to table’ node.
If you could provide us with an example or more information (like the LOG file) about the versions and the data it might be helpful.
In general my experience with Spark (and KNIME) is that spark is very picky when it comes to:
- Non-Double numbers like integers (yes sounds strange)
- string variable (might have to be converted via dictionary or label encoding)
- continuous variables (the same value in all entries)
- NaN in columns
I had success overcoming problems with KNIME and spark nodes by eliminating all of these problems.
@DerekJin do you have a workflow and some test data to reproduce this?
As already mentioned by mlauber71, Spark does not like missing values and produce sometimes cryptic errors on that. You can use the Spark Missing Value node to eliminate them.
Actually there is no missing value in my data, my data is boolean matrix like this:
It only includes 0 and 1 for each cell, and they are double when I checked the types of columns.
Then my workflow is also simple like this:
the data can be read from ‘table to spark’, then I did spark k -means, it was also executable. But I got such an error when I tried to see the output from spark k -means.
i can’t reproduce this. Running local Spark with K-Means on a double table with 0 and 1 values works as expected (see attached workflow). Can you share more details about your setup? What Spark version you are running? How does the input data looks like, can you attach the data (with renamed column names if required)?
SparkKMeansBooleans.knwf (11.9 KB)
I am sorry to applying this message very late. I find my input dataset includes string, so I fix this problem. After I filter the string in the input matrix, I still get error about outofbound exception. My input matrx has 8030 columns, and the error is 8029 (ArrayIndexOutOfBoundsException) when I tried to get the output view from spark k-means. And I cannot use spark to table node after spark k means node(with same error).
Here is the sample data I used in my workflow. data.zip (247.3 KB)
looks like the Spark K-Means node has trouble with tables containing a column called
cluster. As a workaround you can use a Spark column filter or Spark column rename node in front to remove or rename the
cluster column before the Spark K-Means node. Thanks for reporting this.
This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.