I have just added preliminary support for roles to the RapidMiner integration extension. (Please test it if you have time. For now it only generates roles in the node output; it does not yet set them based on the node input.)
(A few notes about roles: a role identifies the intended purpose of a column and can be used to set the default configuration. For example, if a column has the label role, a learner node should use that column as the class labels by default.)
My questions are:
Have others implemented similar functionality for their KNIME nodes?
If so:
Did you also use Properties? (As far as I can see, properties are seldom used in the base KNIME nodes, but there are a huge number of third-party nodes out there.)
What is your key for that? (I would like to be compatible with existing implementations)
What are your values? (Currently in my case regular attributes are not signalled; the special roles are id, label, prediction, cluster, weight, batch, attribute, outlier, cost and base_value, but other values are also possible.) (Again, I would prefer to be compatible with existing solutions.)
Do you think it is important that each special role is held by a single column? I mean: should this be a rule (violating it causes an error), a suggestion (violation -> warning), or no rule at all (no action if it does not hold)?
How would you handle the case when no roles are specified, although one would be required? (Currently this is the case, so I guess no action.)
Usually when I think I have found something useful, the answer from the KNIME developers is that it is already in development :), so if you already have plans, I am open to being compatible with that solution.
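To make the questions about keys, values and uniqueness more concrete, here is a rough sketch. It is not the extension's code and does not use the KNIME API: column properties are modelled as plain `Map<String, String>` objects, the key `"role"` is a hypothetical placeholder (no shared key has been agreed on), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Sketch only: column properties are modelled as plain maps, not KNIME classes. */
final class ColumnRoles {
    /** Hypothetical property key; no shared key has been agreed on. */
    static final String ROLE_KEY = "role";

    /** The three options from the question: rule, suggestion, no rule. */
    enum Policy { ERROR, WARN, IGNORE }

    private ColumnRoles() {}

    /** An empty result means a regular column, which carries no role property. */
    static Optional<String> roleOf(Map<String, String> properties) {
        String value = properties.get(ROLE_KEY);
        if (value == null || value.trim().isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(value.trim());
    }

    /**
     * Checks that no role is held by more than one column.
     * The argument maps each column name to that column's properties.
     * Returns warnings under WARN, throws under ERROR, does nothing under IGNORE.
     */
    static List<String> checkUnique(Map<String, Map<String, String>> table, Policy policy) {
        List<String> warnings = new ArrayList<>();
        if (policy == Policy.IGNORE) {
            return warnings;
        }
        // Group column names by the role they declare.
        Map<String, List<String>> columnsByRole = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> column : table.entrySet()) {
            roleOf(column.getValue()).ifPresent(role ->
                columnsByRole.computeIfAbsent(role, r -> new ArrayList<>()).add(column.getKey()));
        }
        for (Map.Entry<String, List<String>> entry : columnsByRole.entrySet()) {
            if (entry.getValue().size() > 1) {
                String message = "Role '" + entry.getKey()
                        + "' is set on several columns: " + entry.getValue();
                if (policy == Policy.ERROR) {
                    throw new IllegalArgumentException(message);
                }
                warnings.add(message);
            }
        }
        return warnings;
    }
}
```

Roles are kept as free-form strings rather than an enum because other values are also possible; a missing property simply means "regular".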
I'm using the Column Properties for information about units in the MOE nodes. For example if a column contains energy values I set the column property "unit" with value "kcal/mol". I also have a unit converter node that parses this information.
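As an illustration of how a converter could consume such a property, here is a minimal sketch. It is not the MOE nodes' implementation: the properties are again a plain `Map<String, String>` rather than KNIME classes, the target unit kJ/mol and the class and method names are assumptions, and only the key "unit" and the value "kcal/mol" come from the description above.

```java
import java.util.Map;

/** Sketch only: reads a "unit" property and converts an energy value. */
final class UnitProperty {
    /** Property key named in the post. */
    static final String UNIT_KEY = "unit";
    /** 1 kcal = 4.184 kJ (thermochemical calorie). */
    static final double KJ_PER_KCAL = 4.184;

    private UnitProperty() {}

    /** Converts a value to kJ/mol based on the column's unit property. */
    static double toKilojoulesPerMole(double value, Map<String, String> properties) {
        String unit = properties.get(UNIT_KEY);
        if ("kcal/mol".equals(unit)) {
            return value * KJ_PER_KCAL;
        }
        if ("kJ/mol".equals(unit)) {
            return value; // already in the target unit
        }
        // Refuse to guess when the unit is missing or unknown.
        throw new IllegalArgumentException("Unsupported or missing unit: " + unit);
    }
}
```

Failing on an unknown unit, instead of silently passing the value through, is one way to limit the damage when another extension uses the same property name for something else.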
I was also thinking about abusing the column properties for some audit trail information but haven't done this yet.
I think it's a shame that column properties are not more widely used, but I also see the risk when different things get assigned to the same property name.
So I'm glad to hear that I'm no longer alone :-)
Thanks for the info. :) I see the problem of possibly incompatible information being stored under the same keys. That is the main reason I was asking for feedback. If you would like to cooperate on the design of this kind of functionality, that would be appreciated. I guess the KNIME guys are currently busy with the UGM, so it might be worth waiting a bit longer for their feedback.
Do you think it would work better if there were an RM-specific role key (like RapidMiner_role or similar) and a separate node to achieve compatibility with existing/future implementations? (Although it seems there is none yet.)
Or would it work better if this part (at least the specification) were in a separate plugin, so that anyone who would like to join this effort could just use that? (Would the Apache 2.0 licence be suitable for everyone?)
Or another idea that has not crossed my mind yet? ;)
I would like to implement/finish this functionality in the near future. Do you have a preference? I have read in the noding guidelines that auto-configuration should be preferred when available. Do you think this would be a good way to achieve it when there are multiple options? Or is it better to still report an error?
I am still not sure which would be the best option implementation-wise. Would others consider using the separate plugin, or would it be an unnecessary complication?
Looking at the default column names, it might be better to name the "label" role "class". What do you think? Is there a recommendation for such a naming scheme? What should happen if there is a label role when we choose class as the name? (The id role might also be problematic: in KNIME it corresponds to the row key, which does not have properties, but there can be other id columns too, for example numeric ones.)
My current preference for the rest: no warnings and no errors for possible problems, and set the roles according to the properties. There would be no role translation node; each substitution would be done inside the RM nodes.
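Roughly, that preference could look like the sketch below. This is an illustration under assumptions, not the extension's code: roles are passed as a plain map from column name to role string, the names `DefaultClassColumn` and `pick` are made up, and treating "class" as an alias of "label" is only one possible answer to the naming question.

```java
import java.util.Map;
import java.util.Optional;

/** Sketch only: picks a default class column from role information. */
final class DefaultClassColumn {
    private DefaultClassColumn() {}

    /**
     * Returns the first column (in the map's iteration order) whose role is
     * "label" or its assumed alias "class". An empty result means no role was
     * found, so the node configuration is left untouched: no warning, no error.
     */
    static Optional<String> pick(Map<String, String> roleByColumn) {
        for (Map.Entry<String, String> entry : roleByColumn.entrySet()) {
            String role = entry.getValue();
            if ("label".equals(role) || "class".equals(role)) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }
}
```

Passing a `LinkedHashMap` in table column order makes the choice deterministic when several columns carry the role.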
Do you have any comments?
Just in case anyone is interested (or can understand each other through code):
I have decided to go the API route for now; hopefully it will be in a usable state soon. If you are not satisfied or have other ideas, please open a ticket or send a pull request.