I’m creating a workflow to do pre-processing of some .txt files before passing them to a database.
The source data has non-printing characters that are causing some headaches. When I open it in Notepad++, I’m seeing Line Return characters in the middle of lines (which is why I can’t go directly to SQL Server through SSIS), which thankfully the FileReader node doesn’t care about as long as I allow “Quoted Strings may extend over multiple lines”.
Unfortunately, there seems to be a line where they include some combination of tabs and quotation marks that throws the file reader for a loop, and interprets the next 10,000 rows as being one text string, despite including standard Carriage Return + Line Return line ending within that “string”.
I tried letting the Line Return characters read as such, then using row numbers and joining the lines back together afterwards, which is a mess, but I end up with quotation marks that I can’t really get rid of everywhere, which isn’t ideal either.
What I’m really hoping for is some sort of solution that would say “this is a new line if and only if you get CR&LR together, otherwise obey the quoted strings may extend over multiple lines rule”, or something that would let me clean up the un-paired LR values prior to the use of the File Reader node.
Adding to the annoyance, I’m not allowed to run other programming languages on this machine, so I can’t try to tackle this with some sort of Powershell/Python/R Regex magic.
is it an option for you to share that file with me, or create another file that has the same problem? In addition it would be nice if you could share your FileReader configuration with me.
I’ve attached a minimalist recreation of the file in question. It’s public-facing data, but I’m exercising an overabundance of caution.
(I’m also willing to accept that the problem may be intractable and needs to be addressed by asking them to sanitize their database inputs and/or extracts.)
Settings on the File Reader node as screen caps below.
I’ve attached a minimalistic recreation of the file in question. It’s public-facing data, but I’m exercising an overabundance of caution because the table was shared with us by a government agency, and I’d prefer to stay in their good graces and keep getting the data shared with us explicitly as files instead of trying to scrape their website.
(I’m also willing to accept that the problem may be intractable and needs to be addressed by asking them to sanitize their database inputs and/or extracts.)
Settings on the File Reader node as screen caps in the Word doc attached.
Good to know that I wasn’t missing some sort of obvious fix!
(This file’s formatting is basically corrupted, and I knew it was a “heavy lift”!)
Can’t access the DB directly, but I’m working with the owners to see if they can clean up their data and/or give us an extract with more unique delimiters like pipes instead of tabs.
Additional Java libraries are… probably not something I should use, but I’m interested to learn more about the possibilities there.