[FEATURE REQUEST] External Tool (Labs)

Geo · January 21, 2016, 4:20pm

In the help for the node "External Tool (Labs)" it is specified: "Make sure that the process generates valid row IDs in order to ensure joining of the result with the input table. E.g. if the output is CSV, make sure that the first column contains the original row IDs of the input table."

1) Would it be possible to add a checkbox that lifts the restriction that the first column must contain the original row IDs of the input table?

2) Furthermore, would it be possible to make the input port optional? Currently this requirement can already be circumvented: create an empty table with a single variable and no observations as input, then specify a command line that does not use the input table at all and produces a completely different output table.

weskamp · January 22, 2016, 7:36pm

The whole point of the External Tool (Labs) node is to take the input table and to combine it with the output from an external tool by joining these two tables together - even if this input is processed in multiple parallel batches. This is where the requirement to have rowIDs in the output comes from.
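To illustrate why the row IDs matter for the join, here is a minimal sketch in Python, not the node's actual implementation. The function names and the toy "tool" (which just returns the length of each value) are made up for illustration, and the batches run one after another rather than in parallel.

```python
import csv
import io


def fake_external_tool(batch_csv):
    """Hypothetical tool: for each input row, emit the row ID and the value's length."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row_id, value in csv.reader(io.StringIO(batch_csv)):
        writer.writerow([row_id, len(value)])  # the row ID stays in the first column
    return out.getvalue()


def process_and_join(table, batch_size):
    """Send the table to the tool in batches, then join results back by row ID."""
    row_ids = list(table)
    results = {}
    for start in range(0, len(row_ids), batch_size):
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        for row_id in row_ids[start:start + batch_size]:
            writer.writerow([row_id, table[row_id]])
        tool_output = fake_external_tool(buf.getvalue())
        for row_id, result in csv.reader(io.StringIO(tool_output)):
            results[row_id] = result  # keyed by row ID, so batch order is irrelevant
    # The join: each input row is paired with the output row that has the same ID.
    return {row_id: (table[row_id], results[row_id]) for row_id in row_ids}


joined = process_and_join({"Row0": "abc", "Row1": "de", "Row2": "f"}, batch_size=2)
```

Without the row ID in the tool's output, the results from different batches could not be matched back to their input rows.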

If you just want to make a command line call and read in the result, there are probably easier and more flexible ways to do that than to extend this already quite complex node?!?

Geo · January 23, 2016, 9:10pm

So which easier and more flexible ways do you suggest then?

weskamp · January 24, 2016, 1:49pm

What about using the normal (non-labs) "External Tool" node? If you really need an optional input port, it makes IMHO more sense for this node since e.g. the number of generated calls is fixed and does not depend on the size of the input table.

If this node is not flexible enough for you, you can always point it to some dummy file and read the actual contents using a "File Reader" or "CSV Reader" node. This way, you have much more flexibility concerning column names and types, separators etc. You could use a "red" connection between the two nodes to enforce their sequential execution.
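Outside KNIME, the same "run the command first, then read the real output file separately" pattern could look roughly like the Python sketch below. The command here is a stand-in that simply writes a small CSV file; the file name and the helper `run_then_read` are assumptions for illustration only.

```python
import csv
import subprocess
import sys
import tempfile
from pathlib import Path


def run_then_read(command, output_path):
    """Run the command to completion, then read the file it produced."""
    subprocess.run(command, check=True)  # raises if the command fails
    with open(output_path, newline="") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "converted.csv"
    # Stand-in for a real converter: a one-line script that writes a CSV file.
    script = "import sys; open(sys.argv[1], 'w').write('name,score\\nann,1\\nbob,2\\n')"
    rows = run_then_read([sys.executable, "-c", script, str(out_file)], out_file)
```

The reader runs only after the command has finished, which is what the "red" connection enforces between the two nodes.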

Geo · January 24, 2016, 2:46pm

Well, let me explain the use case for extending the (labs) External Tool node.

I use StatTransfer to import data in formats that KNIME cannot read natively, e.g. SPSS or Stata. StatTransfer can be used either with a shell command or with a specifically configured batch file.

To allow "on the fly" import of files, the most user-friendly solution I've stumbled upon so far is the labs-ET node. Using a Table Creator node with a single column is an easy workaround (and much more convenient than a dummy file) and can be happily hidden in a meta node, so that feature request is indeed secondary. However, the 1st-column rowID restriction is much more of a problem, because it makes it impossible to import the column name of the first column, while all the rest of the node works perfectly for this use case.

I do not see how making this restriction optional would really increase the complexity of this node. Alternatively, I guess it would be trivial for the ET node to simply copy the first column and import it both as rowID and regular column.
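As a rough Python sketch of that second idea (not how the node works today): the first column is used as the row ID and also kept as a regular, named column. The function name is hypothetical, and the sketch assumes the values in the first column are unique, since row IDs must be.

```python
import csv
import io


def read_with_duplicated_first_column(csv_text):
    """Use the first column as the row ID and also keep it as a regular column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # the first column's name is preserved here
    rows = {}
    for record in reader:
        row_id = record[0]
        if row_id in rows:
            raise ValueError(f"duplicate row ID: {row_id}")
        rows[row_id] = dict(zip(header, record))
    return header, rows


header, rows = read_with_duplicated_first_column("id,age\np1,34\np2,51\n")
```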

weskamp · January 24, 2016, 7:38pm

Well, then I would argue that you actually want an “SPSS Reader” or “Stata Reader” node for your use case, which is probably relevant for quite a number of users.

I don’t know how much effort it would be to make the join within the labs-ET node optional. In my comment concerning “complexity” I was mainly referring to the complexity of the interface from an end-user perspective.

Geo · January 24, 2016, 9:13pm

I’d agree that such nodes would be relevant, yet they would also come with their own complexity and maintenance burden. On the other hand, the labs ET could be so much more powerful with only one more checkbox.