I think there is a missing feature for data manipulation. I would welcome two nodes (SPSS has Data/Restructure with similar functionality, if I recall correctly):
The easier node: Rows to Columns (This reduces redundancy.)
Assume we have the following table:
|Label|Category|value|
|apple|1|150|
|apple|2|100|
|banana|1|200|
|banana|2|170|
After transformation it should be:
|Label|value-1|value-2|
|apple|150|100|
|banana|200|170|
or
|Category|value-apple|value-banana|
|1|150|200|
|2|100|170|
or a combined one:
|value-apple-1|value-apple-2|value-banana-1|value-banana-2|
|150|100|200|170|
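To make the three target layouts concrete, here is a minimal plain-Python sketch of the Rows to Columns idea applied to the example table. It is only an illustration, not KNIME code: the function name `rows_to_columns` and its parameters are made up for this example, and it assumes each (group, pivot) pair occurs at most once, so no aggregation is needed.

```python
from collections import defaultdict

# The example table from above, as (Label, Category, value) rows.
rows = [
    ("apple", 1, 150),
    ("apple", 2, 100),
    ("banana", 1, 200),
    ("banana", 2, 170),
]


def rows_to_columns(rows, group_idx, pivot_idx, value_idx, prefix="value"):
    """Hypothetical helper: one output row per group value,
    one output column per distinct pivot value."""
    table = defaultdict(dict)
    for row in rows:
        column = f"{prefix}-{row[pivot_idx]}"
        # Assumption: each (group, pivot) pair is unique, so a later
        # duplicate would simply overwrite the earlier value.
        table[row[group_idx]][column] = row[value_idx]
    return dict(table)


# First layout: grouped by Label, pivoted on Category.
by_label = rows_to_columns(rows, group_idx=0, pivot_idx=1, value_idx=2)

# Second layout: grouped by Category, pivoted on Label.
by_category = rows_to_columns(rows, group_idx=1, pivot_idx=0, value_idx=2)

# Combined layout: a single row with one column per (Label, Category) pair.
combined = {f"value-{label}-{category}": value for label, category, value in rows}

print(by_label)
print(by_category)
print(combined)
```

If duplicates per (group, pivot) pair can occur, the assignment would have to be replaced by an aggregation such as sum or mean.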
The slightly more complex node: Columns to Rows (This increases redundancy.)
From the tables generated by the previous node it recreates the original one.
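The reverse direction can be sketched in the same way. This is a rough plain-Python illustration, not KNIME code; `columns_to_rows` is a hypothetical name, and it assumes the wide columns are named `value-<key>` as in the example tables.

```python
def columns_to_rows(wide, prefix="value"):
    """Hypothetical helper: turn every '<prefix>-<key>' column of each
    group back into its own (group, key, value) row."""
    long_rows = []
    for group, columns in wide.items():
        for column, value in columns.items():
            # Strip the leading 'value-' to recover the pivot value.
            # It comes back as a string: the original type is not
            # stored in the column name.
            pivot_value = column[len(prefix) + 1:]
            long_rows.append((group, pivot_value, value))
    return long_rows


# The first wide layout from the example, keyed by Label.
wide = {
    "apple": {"value-1": 150, "value-2": 100},
    "banana": {"value-1": 200, "value-2": 170},
}

long_rows = columns_to_rows(wide)
for row in long_rows:
    print(row)
```

One corner case this makes visible: the Category values return as text ("1", "2") rather than numbers, so a real node would probably need a way to set the type of the recreated column.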
I know these nodes are not easy to build, but I do not think they are impossible (we already have the Domain Calculator node).
What is your opinion? Does this make sense, or is this doable somehow with the current infrastructure?
Thanks, gabor
With the Pivoting node it is possible to realize the first two cases of your example, with the following settings:
First case:
Pivot column: Category
Group column: Label
Aggregation: Value
Second case:
Pivot column: Label
Group column: Category
Aggregation: Value
The third case, the combined one, is not possible, but we want to improve the Pivoting node to support several pivoting columns. Unfortunately, the reverse operation (“unpivoting”) is also not possible, but the corresponding node is already in the works.
This is very important functionality in many data preparation tasks. Perhaps we can collaborate on specs to get a very clear picture of the scope of the functionality.
Other things (also available in SPSS and/or Clementine and similar systems) include an easier way to define a large number of derived (transformed) variables, and a data exploration node to profile data columns.
Some more advanced functionality would be the so-called “windowing” functionality that several of the major database providers now offer (moving and cumulative windows, aggregates, offsets, ranks, etc.). This is very useful for the data preparation stage of the data mining process.
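For readers unfamiliar with the term, here is a small plain-Python sketch of what such window operations compute on a single ordered column. It only mimics the general idea of database window functions; the function names are made up for illustration and do not correspond to any KNIME or SPSS API.

```python
def cumulative_sum(xs):
    """Cumulative window: running total up to and including each row."""
    total, out = 0, []
    for x in xs:
        total += x
        out.append(total)
    return out


def moving_average(xs, window):
    """Moving window: mean of the current row and up to
    window - 1 preceding rows."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def lag(xs, offset=1):
    """Offset: value from `offset` rows earlier, None where there is none."""
    n = len(xs)
    return [None] * min(offset, n) + xs[:max(0, n - offset)]


def rank(xs):
    """Rank with 1 = largest value; ties share the lowest rank and
    leave gaps afterwards."""
    ordered = sorted(xs, reverse=True)
    return [ordered.index(x) + 1 for x in xs]


values = [150, 100, 200, 170]
print(cumulative_sum(values))
print(moving_average(values, window=2))
print(lag(values))
print(rank(values))
```

In a database these would normally be computed per partition (for example per Label) and in a defined sort order; the sketch leaves both out to keep it short.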
I had a short absence from posting on here and working with Knime so I’m looking forward to discussing the topic further.
Hi Jay, I think the question in this thread is also the same. If no one else is going to implement these nodes, I am going to write them. I guess I will be able to finish this in the next two weeks, but I know my technical writing skills are not good enough. Does anyone want to create documentation for these nodes? Any ideas about features I have missed? (I know what I want, and hopefully I described it well in previous messages, but if I can identify other aspects in the design phase that help make the nodes more general, that is not a bad thing.) Is anyone else working on similar nodes? All the best, gabor
Hi Guys, I have added some comments on this thread describing an easy workaround for pivoting based on ONE categorical column. But I agree with you that the Pivot node needs some brush-up in order to allow multi-column pivoting and additional aggregation methods for categorical values. Cheers, Thomas
Thanks for your help. Once I have decided how I will implement these nodes, I would like to ask you to create some diagrams and descriptions showing the usage and some corner cases. In the meantime, could you give me some ideas on how to name the nodes, the columns that are the grouping variables, those that are the “identifiers” (their values are the same within each group), and those that are split into different columns?
Thanks, gabor
PS: Having reread the answer from fabian.dill, maybe I will be doing something useless. Well, it will be a good exercise for me regardless. (Hopefully the time frame will be enough to implement it before Easter.)