I am new to Knime and relatively new to Big-Data.
I have logfile data which I get via Hive. My problem is, that all columns are automatically set as strings. Therefore the Spark statistics node doesn’t recognize any usable columns. How can I change that?
Furthermore I detected a few faulty rows in the data. How can I filter them out within Spark(I found only one approach to turn Spark to table, use rename nodes and then turn table to spark, but that is not an real option)?
I don’t know if my suggestion could be of any help, but you can change the datatype of each column of your spark dataframe basically in 2 ways:
by using the spark java snippet node (this means that you need to write your own custom solution using java, by modifying/override the current dataframe datatype schema of each column). I think that some moderators could provide you a simple example in order to understand how you can perform certain kind of operations within it.
by using the Spark SQL query (for example, if you want to get as output an integer instead of a String, you can type: “SELECT int(
your_column) FROM #TABLE#”; ). This is just a very short example, but if you take a look on the function list provided in spark sql you’ll fastly realize that you can perform a lots of transformations on the data.
Regarding filter in or filter out some specific rows, it’s the same approach that needs to be applied with spark SQL node. You need to write custom SQL queries in order to exclude those specific records by using for example specific WHERE condition, or specific subqueries for exclude something.
Anyway, this is just my personal suggestion. I’m not sure about its correctness.
Let’s also wait a better answer from a more expert user/moderator.
I would also recommend the Spark SQL node to convert e.g. Strings to int. Since we don’t have a Spark Row Filter yet (it is on the list), the Spark SQL is also the easiest Spark option for this.
You mentioned that you are pulling data from Hive. If possible you could also filter the data via the Database Row Filter node and then use Hive to Spark to get the result into Spark.
Finally, if the faulty rows are faulty because they contain missing values, you can also use the Spark Missing Value node.
Hope that helps,
the filtering via Database row filter went fine (though I had to find out first that you need type ’ value ’ ).
Using SQL did not work, because every value is within quotes. I dunno why it was like that. So I am going to need some stringt Manipulation for every freeking values sigh.
My temporary solution was to use Spark to table node, make the String manipulations and then Table to spark node. Ugly, but I worked for first try.
There may be a way to do the string manipulation in SQL (preferably). In any case it can still be done in Spark DataFrame Java Snippet.
What String manipulation are you doing here? (i.e. with the standard KNIME String Manipulation node).