Reading .txt files and separating them into columns

Saishiyam · February 22, 2022, 2:09am

Hi,

I have a file which looks similar to this:
I have to read the highlighted tables which is present in the file:

I tried splitting it using space delimeter, but it takes up each space and ends up each word in a new column.
Any help would be appreciated!
Thanks!

elsamuel · February 22, 2022, 3:03am

Can you share the file? What node are you using to read the file?

I tried splitting it using space delimeter, but it takes up each space and ends up each word in a new column.

Have you tried using multiple spaces or a tab?
What about specifying a constant width column?

Saishiyam · February 22, 2022, 3:45am

Hi, sure, here is the sample file. Haven’t tried that, pretty new to knime and also using 4.1.1 version.
test sample knime.txt (2.1 KB)

duristef · February 22, 2022, 9:49am

this workflow seems to solve your problem (at least on the test file) read text file – KNIME Hub

ScottF · February 22, 2022, 3:40pm

Looks like @duristef has already provided an interesting possible solution. You could also take a look at the Fixed Width File Reader node as another approach.

Saishiyam · February 23, 2022, 3:28am

Uhm haven’t tried that yet, will take a look. Thanks

Saishiyam · February 23, 2022, 3:29am

Hi, thank you. I can see that “^\d+.+$” has been used here. Can you please explain what does that regex do, as in how it works. Thanks!

Saishiyam · February 23, 2022, 3:36am

@duristef It works in the sample data, but creates an empty result actually when I apply the same on the original data Will check again

duristef · February 23, 2022, 6:36am

“^\d+.+$” matches rows beginning with at least one digit (^\d+), followed by at least one character (.+$). This is used to filter out all the rows that do not contain data.
The main problem I see in the file is that columns are right/left filled with both tabs and spaces, which makes it difficult to split them by position. That’s why I didn’t use the Fixed Width File Reader node myself

takbb · February 24, 2022, 10:27am

One of the problems here, as @duristef has already pointed out is the mix of tabs and spaces as delimiters. Clearly the report was designed to be read with fixed width spacing and if viewed in Notepad, all of the columns that you are interested in line up. Converting the white space into a delimiter character is therefore going to be problematic under some circumstances.

So, instead of trying to convert the white space into column “delimiters”, you might therefore get a better result if you convert tabs into a specific number of spaces (as an approximation to what Notepad does), and then attempt to split the columns positionally. I found that for the sample data, converting each tab into 8 spaces provides sufficient “wiggle room” that a positional cell split can then occur. After then trimming white space and converting to numerics, you get a reasonable result. Maybe this will work with your “real” data.

I borrowed from @duristef’s Line Reader but removed the regex and opted for a straight replacement of tabs with the “Cell Splitter By Position” replacing the “Cell Splitter”

Split Columns from report.knwf (14.2 KB)

system · May 25, 2022, 10:28am

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.