The outputs of the File Reader and CSV Reader nodes (single source) don't match

armingrudd · August 15, 2018, 12:11pm

Hi,

Attached is a workflow in which I’ve shown that the output data of File Reader and CSV Reader nodes don’t match.
I read a CSV file named adult.csv with both nodes and used the output to build a decision tree model. When I read the source file with the File Reader node, I can use a Table Creator to make predictions with the model, but I cannot use the same Table Creator node for prediction when the CSV Reader provides the source data. The workflow has two Table Creator nodes, one for the CSV Reader case and one for the File Reader case, and I cannot use them interchangeably. (The one for the CSV Reader was created accidentally by one of my students and I cannot create another one!) When I fed the output of each Reader node directly into prediction, the same results came out.
CSV vs File Reader.knwf (2.4 MB)

ScottF · August 15, 2018, 3:05pm

Hi @armingrudd -

I took a look at your workflow. The discrepancy comes from the format of the adult.csv data: if you open it in a text editor, you will see that the fields are delimited by a comma followed by a space. In your workflow, the CSV Reader node does not explicitly account for the space, while the File Reader node has the “ignore spaces and tabs” option checked.

Similarly, the data in the Table Creator node for the CSV Reader includes leading spaces, while the data in the Table Creator node for the File Reader does not. This explains why you can’t use them interchangeably. Does that make sense?
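
To illustrate the mismatch outside KNIME, here is a minimal Python sketch using only the standard library `csv` module. The sample rows are made up in the style of adult.csv (they are not copied from the file), and `skipinitialspace=True` stands in as a rough analogue of the File Reader's “ignore spaces and tabs” option. It is not a description of how either KNIME node works internally.

```python
import csv
import io

# Hypothetical sample in the style of adult.csv: the header is separated by
# commas only, the data rows by a comma followed by a space.
SAMPLE = "age,workclass,education\n39, State-gov, Bachelors\n50, Private, HS-grad\n"

def read_rows(text, skip_space):
    """Parse CSV text; optionally drop whitespace that follows each comma."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=skip_space)
    return list(reader)

plain = read_rows(SAMPLE, skip_space=False)    # keeps the leading spaces
trimmed = read_rows(SAMPLE, skip_space=True)   # strips them

print(plain[1])    # ['39', ' State-gov', ' Bachelors']
print(trimmed[1])  # ['39', 'State-gov', 'Bachelors']

# A category value read with the leading space is a different string from
# the trimmed one, so the two tables are not interchangeable.
print(plain[1][1] == trimmed[1][1])  # False
```

The header row parses identically in both cases because it contains no spaces after the commas; only the data values differ.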

(This only jumped out at me when I was looking at the outputs of the Table Creator nodes in the Node Monitor - when clicking between the two I could see the space shifting the text over. Three cheers for the Node Monitor!)

armingrudd · August 15, 2018, 3:22pm

Great! Thank you so much @ScottF .
I checked the data file and, as you mentioned, the columns are delimited by commas in the header row and by a comma and a space in the rest of the data. When I set the column delimiter to ", " (comma and space), the problem is solved, but the column headers cannot be read (using the CSV Reader). So is there an option to solve this automatically in KNIME (e.g. an option to make the node skip spaces), or is that something we should handle manually?

ScottF · August 15, 2018, 3:31pm

I think in this case I would use the File Reader node, since it handles the discrepancy gracefully. The CSV Reader node makes the assumption that the headers and data have the same format, and I don’t think there’s a way around that (apart from reading the header and data separately and then concatenating, which is a lot of extra work).
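
For anyone curious what "reading the header and data separately" amounts to, here is a rough Python sketch of the idea, again on made-up rows in the style of adult.csv. The function name `read_mixed` and its parameters are hypothetical, and the naive `str.split` ignores quoting, so treat it as an illustration of the logic and not as a replacement for either KNIME node.

```python
# Hypothetical sample in the style of adult.csv: comma-only header,
# comma-plus-space data rows.
SAMPLE = "age,workclass,education\n39, State-gov, Bachelors\n50, Private, HS-grad\n"

def read_mixed(text, header_sep=",", data_sep=", "):
    """Split the first line with one delimiter and the remaining lines with another."""
    lines = text.splitlines()
    header = lines[0].split(header_sep)
    rows = [line.split(data_sep) for line in lines[1:] if line]
    return header, rows

header, rows = read_mixed(SAMPLE)
print(header)   # ['age', 'workclass', 'education']
print(rows[0])  # ['39', 'State-gov', 'Bachelors']

# Applying the data delimiter to the header fails to split it at all,
# which matches the unreadable column headers described above.
bad_header = SAMPLE.splitlines()[0].split(", ")
print(bad_header)  # ['age,workclass,education']
```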

armingrudd · August 15, 2018, 3:36pm

Thank you so much again for taking the time to solve my issue.
