New nodes for rearranging data

aborg · January 15, 2009, 2:21pm

Hi all,

I think a data manipulation feature is missing. I would welcome two new nodes (SPSS has similar functionality under Data/Restructure, if I recall correctly):
The easier node: Rows to Columns (This reduces redundancy.)
Assume we have the following table:
|Label|Category|value|
|apple|1|150|
|apple|2|100|
|banana|1|200|
|banana|2|170|
After transformation it should be:
|Label|value-1|value-2|
|apple|150|100|
|banana|200|170|
or
|Category|value-apple|value-banana|
|1|150|200|
|2|100|170|
or a combined one:
|value-apple-1|value-apple-2|value-banana-1|value-banana-2|
|150|100|200|170|
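
To make the first two target tables concrete, here is a minimal sketch in plain Python (not KNIME code). The function name `rows_to_columns` and its parameters are hypothetical, chosen only for illustration, and the sketch assumes each (group, pivot) pair occurs at most once, so no aggregation is needed.

```python
# The example table from the post, as a list of dicts.
rows = [
    {"Label": "apple", "Category": 1, "value": 150},
    {"Label": "apple", "Category": 2, "value": 100},
    {"Label": "banana", "Category": 1, "value": 200},
    {"Label": "banana", "Category": 2, "value": 170},
]


def rows_to_columns(rows, group_col, pivot_col, value_col):
    """Return one row per group_col value and one column per pivot_col value.

    Hypothetical helper: if a (group, pivot) pair occurs twice, the later
    value silently overwrites the earlier one. A real node would have to
    aggregate or report an error instead.
    """
    out = {}
    for r in rows:
        # Create the output row for this group on first sight.
        target = out.setdefault(r[group_col], {group_col: r[group_col]})
        # New column name, e.g. "value-1" or "value-apple".
        target[f"{value_col}-{r[pivot_col]}"] = r[value_col]
    return list(out.values())


by_label = rows_to_columns(rows, "Label", "Category", "value")
by_category = rows_to_columns(rows, "Category", "Label", "value")

for row in by_label:
    print(row)
for row in by_category:
    print(row)
```

Swapping the roles of `Label` and `Category` is all that separates the first target table from the second.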

The bit more complex node: Columns to Rows (This increases redundancy.)
From the tables generated by the previous node it recreates the original one.
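
A rough sketch of the reverse direction, again in plain Python rather than KNIME code. The name `columns_to_rows` and the convention of recognising the split columns by a shared prefix are assumptions made for this example only.

```python
# The first "wide" table from the example above.
wide = [
    {"Label": "apple", "value-1": 150, "value-2": 100},
    {"Label": "banana", "value-1": 200, "value-2": 170},
]


def columns_to_rows(rows, id_col, prefix, pivot_col, value_col):
    """Turn every column whose name starts with prefix into its own row.

    Hypothetical helper: the part of the column name after the prefix
    becomes the pivot_col value, and the cell becomes the value_col value.
    """
    out = []
    for r in rows:
        for name, cell in r.items():
            if name.startswith(prefix):
                out.append({
                    id_col: r[id_col],
                    pivot_col: name[len(prefix):],
                    value_col: cell,
                })
    return out


long_rows = columns_to_rows(wide, "Label", "value-", "Category", "value")
for row in long_rows:
    print(row)
```

One corner case shows up immediately: the category comes back as the string `"1"` rather than the number `1`, because the original type is lost once the value has been folded into a column name.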

I know these nodes are not easy to implement, but I do not think they are impossible (we already have the Domain Calculator node).

What is your opinion? Does this make sense, or is this doable somehow with the current infrastructure?
Thanks, gabor

fabian.dill · January 15, 2009, 3:51pm

With the Pivoting node it is possible to realize the first two cases of your example, with the following settings.

First case:

- Pivot column: Category
- Group column: Label
- Aggregation: Value

Second case:

- Pivot column: Label
- Group column: Category
- Aggregation: Value

The third case (the combined one) is not possible yet, but we want to improve the Pivoting node to support several pivoting columns. Unfortunately, the reverse operation (“unpivoting”) is not possible either, but the corresponding node is already in the works.
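
To illustrate what pivoting on several columns would mean for the combined case, here is a minimal sketch in plain Python. It is not how the Pivoting node works internally; `pivot_multi` and its parameters are hypothetical names used only for this example.

```python
# The example table from the first post.
rows = [
    {"Label": "apple", "Category": 1, "value": 150},
    {"Label": "apple", "Category": 2, "value": 100},
    {"Label": "banana", "Category": 1, "value": 200},
    {"Label": "banana", "Category": 2, "value": 170},
]


def pivot_multi(rows, pivot_cols, value_col, group_cols=()):
    """Pivot on several columns at once (hypothetical helper).

    The new column name joins the values of all pivot columns, e.g.
    "value-apple-1". With no group columns, everything lands in one row.
    """
    out = {}
    for r in rows:
        key = tuple(r[c] for c in group_cols)
        target = out.setdefault(key, {c: r[c] for c in group_cols})
        suffix = "-".join(str(r[c]) for c in pivot_cols)
        target[f"{value_col}-{suffix}"] = r[value_col]
    return list(out.values())


combined = pivot_multi(rows, ["Label", "Category"], "value")
print(combined)
```

Passing `group_cols=["Label"]` with `pivot_cols=["Category"]` reproduces the first case, so the single-column pivot is just a special case of this sketch.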

Jay · January 16, 2009, 4:54pm

Hi,

This is very important functionality in many data preparation tasks. Perhaps we can collaborate on specs to get a very clear picture of the scope of the functionality.

Some previous comments: http://www.knime.org/node/244

Other things (also available in SPSS and/or Clementine and similar systems) include an easier way to define a large number of derived (transformed) variables and a data exploration node to profile data columns.

Some more advanced functionality would be the so-called “windowing” functionality that several of the major database vendors now offer (moving and cumulative windows, aggregates, offsets, ranks, etc.). This is very useful for the data preparation stage of the data mining process.
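
As a rough illustration of what such window operations compute, here is a small sketch in plain Python over a single ordered column. The function names are made up for this example and do not correspond to any KNIME node or database function.

```python
from itertools import accumulate

# One ordered column of values (illustrative data only).
values = [150, 100, 200, 170]

# Cumulative window: running total up to and including each row.
cumulative = list(accumulate(values))


def moving_average(xs, window):
    """Moving window: mean of the current row and up to window-1 rows before it."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def lag(xs, k):
    """Offset: value from k rows earlier, None where no such row exists.

    Assumes 0 <= k <= len(xs).
    """
    return [None] * k + xs[:len(xs) - k]


def dense_rank_desc(xs):
    """Rank: 1 for the largest value; equal values share a rank."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(xs), reverse=True))}
    return [order[v] for v in xs]


print(cumulative)
print(moving_average(values, 2))
print(lag(values, 1))
print(dense_rank_desc(values))
```

In a database these would normally be computed per group and in a defined sort order; the sketch leaves both out to keep the idea visible.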

I had a short absence from posting here and working with KNIME, so I’m looking forward to discussing the topic further.

Have a great day!

Jay

aborg · March 30, 2009, 6:06pm

Hi Jay,

I think the question in this thread is the same as well. If no one else is going to implement these nodes, I am going to write them. I guess I will be able to finish this in the next two weeks, but I know my technical writing skills are not good enough. Does anyone want to write documentation for these nodes? Any ideas about features I have missed? (I know what I want, and hopefully I described it well in the previous messages, but if I can identify other aspects during the design phase that make the nodes more general, that is not a bad thing.) Is anyone else working on similar nodes?

All the best, gabor

gabriel · March 31, 2009, 10:10am

Hi guys, I have added some comments on this thread describing an easy workaround for pivoting based on ONE categorical column. But I agree with you that the Pivot node needs some brush-up in order to allow multi-column pivoting and additional aggregation methods for categorical values. Cheers, Thomas

Jay · April 1, 2009, 3:44pm

Hi Gabor,

I am happy to help by writing up the documentation and/or contributing to specs however I can. Please let me know how I can help.

Best regards,

Jay

aborg · April 3, 2009, 5:10pm

Hi Jay,

Thanks for your help. Once I have decided how I will implement these nodes, I would like to ask you to create some diagrams and descriptions showing the usage and some corner cases. In the meantime, could you give me some ideas on how to name the nodes and the three kinds of columns: those that are the grouping variables, those that are the “identifiers” (their values are the same within each group), and those that are split into different columns?
Thanks, gabor

PS: Rereading the answer from fabian.dill, maybe I will end up doing something useless. Well, it will be a good exercise for me regardless. (Hopefully there will be enough time to implement it before Easter.)

aborg · April 7, 2009, 12:26pm

Hi Jay,

A working prototype (still really ugly, and not feature complete) is available in the HiTS repository: http://code.google.com/p/hits/source/browse/#svn/ie.tcd.imm.knime.util/trunk/ie.tcd.imm.knime.util (for checkout you might want to use the following address: http://hits.googlecode.com/svn/ie.tcd.imm.knime.util/trunk/ie.tcd.imm.knime.util).
I am not sure how familiar you are with KNIME development. If this information is enough to get started, I will check in a sample workspace so you can try it out and give feedback. If it is not enough, I am going to release an updateable plugin soon, but maybe only after the Easter holidays.
For the documentation part: I prefer XHTML with PNG files.
Thanks, gabor