File Reader Node - can we force line endings only on CR/LR rather than LR alone?

Ted_M · April 3, 2019, 5:53pm

I’m creating a workflow to do pre-processing of some .txt files before passing them to a database.

The source data has non-printing characters that are causing some headaches. When I open it in Notepad++, I’m seeing Line Return characters in the middle of lines (which is why I can’t go directly to SQL Server through SSIS), which thankfully the FileReader node doesn’t care about as long as I allow “Quoted Strings may extend over multiple lines”.

Unfortunately, there seems to be a line where they include some combination of tabs and quotation marks that throws the file reader for a loop, and interprets the next 10,000 rows as being one text string, despite including standard Carriage Return + Line Return line ending within that “string”.

I tried letting the Line Return characters read as such, then using row numbers and joining the lines back together afterwards, which is a mess, but I end up with quotation marks that I can’t really get rid of everywhere, which isn’t ideal either.

What I’m really hoping for is some sort of solution that would say “this is a new line if and only if you get CR&LR together, otherwise obey the quoted strings may extend over multiple lines rule”, or something that would let me clean up the un-paired LR values prior to the use of the File Reader node.

Adding to the annoyance, I’m not allowed to run other programming languages on this machine, so I can’t try to tackle this with some sort of Powershell/Python/R Regex magic.

Any help?

Mark_Ortmann · April 4, 2019, 7:05am

@Ted_M,

is it an option for you to share that file with me, or create another file that has the same problem? In addition it would be nice if you could share your FileReader configuration with me.

Ted_M · April 4, 2019, 5:48pm

Absolutely.

I’ve attached a minimalist recreation of the file in question. It’s public-facing data, but I’m exercising an overabundance of caution.

(I’m also willing to accept that the problem may be intractable and needs to be addressed by asking them to sanitize their database inputs and/or extracts.)

Settings on the File Reader node as screen caps below.
I’ve attached a minimalistic recreation of the file in question. It’s public-facing data, but I’m exercising an overabundance of caution because the table was shared with us by a government agency, and I’d prefer to stay in their good graces and keep getting the data shared with us explicitly as files instead of trying to scrape their website.

(I’m also willing to accept that the problem may be intractable and needs to be addressed by asking them to sanitize their database inputs and/or extracts.)

Settings on the File Reader node as screen caps in the Word doc attached.

KNIME_TestFile_v2019_04_04.txt (4.5 KB)

NodeSettings.docx (57.7 KB)

Mark_Ortmann · April 8, 2019, 6:51am

@Ted_M,

to make a long story short, I don’t think with our current file reader it’s possible to parse this kind of csv file.

Are you allowed to access the DB directly from KNIME?
Are you allowed to make use of additional Java libraries?

Ted_M · April 8, 2019, 2:18pm

Good to know that I wasn’t missing some sort of obvious fix!

(This file’s formatting is basically corrupted, and I knew it was a “heavy lift”!)

Can’t access the DB directly, but I’m working with the owners to see if they can clean up their data and/or give us an extract with more unique delimiters like pipes instead of tabs.

Additional Java libraries are… probably not something I should use, but I’m interested to learn more about the possibilities there.

Mark_Ortmann · April 17, 2019, 7:24am

@Ted_M,

sorry it took me so long. I send you a private message with a workflow that hopefully solves your problem.

Best
Mark