CSV Reader long IDs

garrard · February 11, 2020, 10:03am

I’m using a CSV reader to import datatable from CSV file which has several columns with IDs related to other tables. The IDs are 18 digits long, When the reading is done, the columns are of a double type and rounded, so the least significant digits are all 0.

How to set the column type to a string and avoid this? One suggestion in another thread was to read IDs as Row IDs, however as I said my tables have many columns with ID related to other tables (as foreign keys), so this is no solution.

(File Reader node is not usable because of the CSV format.)

armingrudd · February 11, 2020, 10:27am

Hi @garrard,

How about unchecking the “Has Column Header” option in the configuration window of the CSV Reader node? (Also, you have to limit the number lines to be scanned in the “Limit Rows” tab > “Scan limit” option > input 1 or 0) So all column types would be string. Then you can handle splitting the first row (Row Splitter) and use it to assign column headers. Transpose the single row table and then use the Insert Column Header node to assign the headers.

But, may I ask why you have found the File Reader node not applicable here? The node has so many “Advanced” configurations which may help you read the file. Maybe you can share a sample file with us?

garrard · February 11, 2020, 10:47am

Thanks. I don’t have a header row. Setting Scan limit to 1 or 0 gives

Execute failed: For input string: “\N” In line 381 (Row380) at column #12 (‘Col12’)

(empty “cells” are denoted with “\N” in data file.)

I’m reading several files in the loop, where each file has a different column structure. If I use the File Reader instead of CSV Reader, it reports this kind of errors:

Execute failed: For input string: “some string” In line 1 (Row0) at column #3 (‘Col3’).

Strings are not enclosed in any quotes.

garrard · February 11, 2020, 1:31pm

To make it more clear, here’s my data (as in Notepad++).

\t is column delimiter, \n is line delimiter, there are no other delimiters. There are no quotes or other symbols used to encapsulate strings

If I import the same data with CSV Reader, I get:

Please note that the 18 digits columns are of type double. Interestingly, Col0 is at first glance OK however the last digits are wrong. The upper limits reported are wrong for all columns except for Col1.

This is the setup:

File Reader gives the same result. Already the preview is wrong:

with settings as for CSV Reader (where applicable).

armingrudd · February 11, 2020, 2:10pm

In the table preview section of the File Reader node, double click on column headers and then you can modify the corresponding column configurations (name, type, …).

Try this and let me know if your issue is solved.

garrard · February 11, 2020, 2:30pm

Thanks - as I mentioned above, I’m reading several files in the loop, where each file has a different column structure.

armingrudd · February 11, 2020, 2:52pm

What do you mean by different? Is it the order? Different number of columns? Types?

And if you provide me with a few file samples, I think I can help you better by providing an example workflow.

garrard · February 11, 2020, 2:55pm

Different data tables have different number of columns, different types, since they represent different content. I cannot provide the data due to confidentiality.

armingrudd · February 11, 2020, 3:05pm

Are you appending the the content of the files in the loop?
So how about reading all columns as a single column for each file and then use the Cell Splitter to split them?

garrard · February 11, 2020, 3:22pm

No, I have a directory of ‘raw’ CSV files which are read sequentially, the resulting tables equipped with column names and then separately written as .table files. The internal structure of tables should be persisted (column order).

garrard · February 11, 2020, 3:52pm

Following your advice, I tried to read entire lines with Line Reader and then use the Cell Splitter, however the Line Reader doesn’t preserve the tabs (huh?!). Is there any other way to read lines as they are?

mlauber71 · February 11, 2020, 5:00pm

You could try and use R’s readr package

In this example all columns are imported as strings first

read_tsv should do the trick

armingrudd · February 11, 2020, 5:36pm

You could use the CSV Reader without column delimiter so all columns will be in a single column and then you can split them with the Cell Splitter. Or you can follow the solution provided by @mlauber71.

garrard · February 11, 2020, 6:18pm

Nice trick @armingrudd, however CSV Reader doesn’t preserve tabs in the data, too, similar to Line Reader as I mentioned above.

I thought I could do the import and data preparation in “pure” Knime, but it seems it’s not sufficient for my challenge.

armingrudd · February 11, 2020, 9:26pm

Never give up! Here is why:

Actually, the CSV Reader node and also the File Reader node (if uncheck “ignore spaces and tabs”) preserve tabs but we cannot see them!! Just try the Cell Splitter with \t as delimiter and you will be surprised!

Here is an example workflow in which I have used both nodes to read a CSV file similar to your case and split it to have all fields as strings:
csv_tab.knwf (24.0 KB)

Pure KNIME, like it?

beginner · February 12, 2020, 6:54am

How many files are we talking here and is this a repetitive thing or a one-of thing?

I’m gonna say the heretic thing here but if it’s a one-of thing and not too many files, opening them with excel, setting the columns to text and then using excel reader might also be an option.

Albeit I do feel with a proper example from your side and some thinking it should probably work with pure knime.

garrard · February 12, 2020, 8:17am

Thanks, @armingrudd - I followed the settings in your example, the crucial thing was result as a list from Cell Splitter and using Split Collection Column afterwards.

@beginner, no worries, I have >200 files on a monthly basis (API calls), >20 GB plain text.

PS: It works with Line Reader, too

system · February 19, 2020, 8:17am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.