Read CSV file with format errors

mpreusse · November 4, 2021, 11:31am

I use the File Reader node to read a bunch of CSV files. The files are tab separated.

Unfortunately, some of the files are not properly formatted and contain lines where the tab is replaced with a space. This seems to be an error on data creation side.

The file reader throws an error because it cannot cast the types of the columns anymore:

Execute failed: Row with ID 'Row106314' can't 
be converted to the configured data types.

Is it possible to tell the file reader to skip lines with errors?

ipazin · November 4, 2021, 3:19pm

Hello @mpreusse,

I guess you mean columns are tab separated? Best would be if you can share your file. Then I or someone else can take a look at it and give it a try. If data is confidential create dummy with couple of rows that represent real one good enough regarding issue you are experiencing. That shouldn’t take long.

Br,
Ivan

mpreusse · November 5, 2021, 11:19am

Here is a minimal example. The first row is tab separated. The second row has a space between wrong and 2. With the setting Support short data rows you can read the file. However, a type conversion of column 2 will fail.

header1	header2	header3
correct	1	here
wrong 2	here

test.txt (52 Bytes)

This is not really an issue for individual small files because the error will be found when configuring the node.

In my case I was iterating a set of large CSV files where the error appeared in row number ~650k.

I can work around that of course by removing the rows that are not parsed propery (they will have null values in the last column) and doing the type conversions in a separate step.

However, I was wondering if there is a setting for the File reader to skip those lines during reading.

(For reference, I am reading the removed records from all RefSeq releases: Index of /refseq/release/release-catalog/archive).

ipazin · November 5, 2021, 3:13pm

Hello @mpreusse,

tnx for example. I see. Seems like an issue indeed as there is no option to skip faulty lines. (In general filtering options from Row Filter would be useful to reader nodes and as such could cover this use case as well.)

Possible solution is to use Transformation tab and force all columns to be String. This way File Reader won’t fail. Then if you want to fix “short data rows” you can filter them out and once fixed put them back in or simply leave them out of equation. And to get proper column type on remaining data use Column Auto Type Cast node. Unfortunately don’t have better idea…

Br,
Ivan

mpreusse · November 5, 2021, 5:30pm

Thanks @ipazin! That’s how I solved it. More options for the File reader would be nice but right now that’s the best solution.

system · November 12, 2021, 5:31pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.