Data quality of a big csv file

Hello,

When I try to read a large CSV file (631 MB, via the File Reader or CSV Reader node) to compute the sum of a column, I can see that the data is misread: the sum of the column is wrong. The File Reader fails on execution, and the CSV Reader also fails unless I allow short lines. Could you please help me handle this issue?

The input is a CSV file with a comma (",") as the column separator; it has a column header and uses double quotes as the quote character.
I calculate the sum with the GroupBy node (under Manual Aggregation, I select the column I want to sum).

Thank you in advance,
victorcib15

@victorcib15

Welcome to the KNIME forum. A bit more information on the structure of your file and the way you calculate the sum in KNIME (which nodes / code) would be helpful.

gr
Hans

The input is a CSV file with a comma (",") as the column separator; it has a column header and uses double quotes as the quote character.
I calculate the sum with the GroupBy node (under Manual Aggregation, I select the column I want to sum).
Do I need to provide more information?

Thank you in advance.

It is still not clear what your issue is. For dealing with missing values use

Hmm. What I can think of is to use a Statistics node, which will give you some information on the variable you want to sum, or to use the COL_SUM function in the Math Formula node. But the most likely cause, to me, is that some records (or just one :wink:) have a different value type, e.g. a string, or numbers that use a dot instead of a comma as the decimal separator (or the other way around).

gr
Hans
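One way to test the "different value type" idea outside KNIME is to scan the column for values that do not parse as numbers. This is only a minimal Python sketch: the function name `find_non_numeric`, the column name `amount`, and the sample data are made up for illustration, and it assumes a comma-separated file with double quotes as the quote character.

```python
import csv
import io

def find_non_numeric(csv_text, column):
    """Return (line_number, raw_value) pairs for values that float() rejects."""
    bad = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        value = row[column]  # None if the row is too short to have this column
        try:
            float(value)
        except (TypeError, ValueError):
            bad.append((reader.line_num, value))
    return bad

# Hypothetical sample: "amount" is the column to be summed.
sample = 'id,amount\n1,10.5\n2,"3,7"\n3,abc\n4,2\n'
problems = find_non_numeric(sample, "amount")
print(problems)
```

For a real file, the same loop could read from `open(path, newline="")` instead of a string, so the 631 MB file is streamed row by row rather than loaded at once.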

Well, even the Statistics node and the Math Formula node (with COL_SUM) give a wrong result. The sum should be ~545, but KNIME gives ~444. When I sum the same column from the same CSV file with other ETL tools (Pentaho and Alteryx), the result is correct.
The CSV file comes from a HeidiSQL (MySQL) export.

Generally speaking, this means that either the data loaded into KNIME has not been recognized as numbers or some lines have been skipped. If you can, please provide your data source for analysis.
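To check for skipped or short lines independently of KNIME, one option is to count how many fields each row has. The sketch below is an assumption-laden illustration, not a KNIME feature: `field_count_histogram` and the sample data are hypothetical, and the file is assumed to be comma-separated with double-quoted values.

```python
import csv
import io
from collections import Counter

def field_count_histogram(csv_text):
    """Count how many rows have each number of fields (header included)."""
    reader = csv.reader(io.StringIO(csv_text))
    return Counter(len(row) for row in reader)

# Hypothetical sample: the third line has only two fields.
sample = 'id,label,amount\n1,"a,b",10\n2,c\n3,d,5\n'
widths = field_count_histogram(sample)
print(widths)
```

If every row has the same number of fields here but the reader in KNIME still reports short lines, the reader settings (rather than the file) are the more likely culprit.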

Thank you for your help. My problem is solved: one of my columns has values containing a “#”, and I hadn't noticed that I had to clear the “Comment Char” field.
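For anyone hitting the same thing, here is a small Python sketch of why a comment character can change a sum. It is a simplified model, not KNIME's actual parser: it assumes everything from the comment character to the end of the line is dropped, and the function `sum_column` and the sample data are made up for illustration.

```python
import csv

def sum_column(csv_text, column, comment_char=None):
    """Sum a column; if comment_char is set, drop everything from it to end of line."""
    lines = csv_text.splitlines()
    if comment_char is not None:
        # Assumed behaviour: the comment runs from the character to the end of the line.
        lines = [line.split(comment_char, 1)[0] for line in lines]
    reader = csv.reader(lines)
    header = next(reader)
    index = header.index(column)
    total = 0.0
    for row in reader:
        # Truncated rows have no value in this column, so they add nothing
        # (comparable to tolerating short lines).
        if len(row) > index:
            total += float(row[index])
    return total

# Hypothetical sample: the second data row has a "#" inside a value.
sample = "id,label,amount\n1,plain,10\n2,item #7,20\n3,other,30\n"

print(sum_column(sample, "amount"))                    # no comment character
print(sum_column(sample, "amount", comment_char="#"))  # "#" treated as a comment
```

With the comment character set, the row containing “#” loses its last field, so its amount silently drops out of the total. That matches the symptoms above: short lines and a sum that is too low.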
