Load file with columns divided by one or more tabulations or manipulate rows by custom criterion

I’m loading a dataset which has its columns divided by tabulations. I used File Reader with \t as column separator, but it turns out that there are some rows where columns are divided by more than one tab. So, when I load the dataset, I get three or more extra columns with mostly missing values, except for those rows where more tabs exist. In such cases, the row has in between missing values corresponding to the columns divided by more than one tabulator.

A regular row looks like this

Column1 Column2 Column3 Column4
5.4 4.2 0.5 2.8

While a row with more than one tab between columns looks like this

Column1 Column2 Column3 Column4 Column5 Column6
5.4 4.2 ? ? 0.5 2.8

Causing than normal rows then look like this

Column1 Column2 Column3 Column4 Column5 Column6
5.4 4.2 0.5 2.8 ? ?

Any way of avoiding this by dividing the columns by one or more tabulations?

As I didn’t find a way to do this, I considered an alternative approach. I would load the dataset like that and then, somehow I would manipulate each row by moving column values when a missing value is found. After that, as all the aditional columns would only contain missing values, I could usse a missing value column filter node to get rid of them.

So, any way to manipulate rows like that? I thought about using Java Snippet node but I can’t find how to acces rows like vectors

1 Like

Hi @xandor19

The problem here is that (as far as I know :)) one cannot configure the -File Reader- node so that it interprets consecutive tab separators “\t” as being just one to get the file read and formated as you expect.

The idea so far that comes to my mind to solve this problem, is to firstly read the whole file using a -Line Reader- node, then use a -String Manipulation- node to replace several consecutive tabs “\t\t” by just one when it happens and then save the table again as a CSV file to eventually read it correctly using a -File Reader- node.

Would this solution be good enough in your case ?

Hope it helps.

Best
Ael

3 Likes

Hello, the problem is that when I load the file with Line Reader the tabs seem to be dismissed and I get all columns without any space between them in the String, check this out

1 Like

Hi @xandor19

Most probably the tabs are not visible in the table but they are present and hidden.

If the data is not confidential or sensitive, could you please upload your CSV (:blush: ) file here ?

Of course, in fact, its from UCI Machine Leanring Repository. It is not an Excel, just a regular txt
https://archive.ics.uci.edu/ml/datasets/seeds

1 Like

And it works as you say, tabs aren’t visible but are still there. I’m working on the regex

1 Like

Thanks @xandor19 for having validated the solution :smile: !

In the meanwhile I have implemented the following workflow if this can be of help too :wink:

20221004 Pikairos Load file with columns divided by one or more tabulations or manipulate rows by custom criterion.knwf (55.9 KB)

Best wishes
Ael

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.