CSV / File writer conflict during parallel execution

Hi all,

While processing in parallel, I write results to a file (i.e. a CSV). I've noticed that once in a while, at different rows, even though I am processing the same data, the CSV structure gets broken because of "Too many data elements".

Now my assumption is that the file / CSV writer nodes occasionally run into conflict with each other. Anyone got an idea?

Thanks in advance
Mike

Hi @mw,

Just to confirm I understand you correctly:
You have a parallel loop which loads different CSV files.
And the loop execution stops with the error you mentioned?

As far as I know, this error means the loop end cannot concatenate the different input tables due to changing table structures (e.g. one file has 6 columns, the next one has 10; with the default options it expects the same number of columns for each input).

Without having example files, I would say that at least one file is read in with more columns than the rest (or the first loaded file has too few).
Could it be that, e.g. due to missing/wrong quote characters, some files end up with a different column structure?
Or that at least one file actually has a different column structure?
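
If you want to check that outside of KNIME, here is a minimal Python sketch (the file name is just a placeholder) that reports rows whose field count differs from the header's, which is the kind of situation that produces "Too many data elements":

```python
import csv

def find_ragged_rows(path, delimiter=",", quotechar='"'):
    """Report rows whose field count differs from the header's."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=delimiter, quotechar=quotechar)
        header = next(reader)
        expected = len(header)
        for line_no, row in enumerate(reader, start=2):
            if len(row) != expected:
                print(f"line {line_no}: {len(row)} fields, expected {expected}")

find_ragged_rows("results.csv")  # hypothetical file name
```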

Can you try enabling the "Allow changing table specifications" option in the loop end and see if that fixes your problem?

Otherwise, if you could provide example files, I could check in more detail what is wrong with them.

Hi @AnotherFraudUser,

Not exactly. The parallel execution finishes correctly, and afterwards I load the results, which were saved into a single CSV during parallel execution.

That is where I noticed the inconsistent behavior. I also executed the workflow without parallelism, and the CSV structure was perfectly fine. I inspected the CSV without splitting it into columns and can confirm that, e.g., one line somewhere in the middle started with data which actually belongs to another line.

Best
Mike

Hello @mw,

From your last description, it seems that you might have a special character in your CSV? Like a blank character or maybe an END OF LINE character or something?
Have you tried importing the CSV into Excel to cross-check the file structure?

Do you have an example of the file?

Best
Jerome

Hi @trj,

I can guarantee that this is not the case. First and foremost, because the CSV structure breaks occasionally and at different points in the entire data set.

Secondly, because I checked the individual files and their processing and could not reproduce it. Control characters, line breaks, and NULL (not missing) values were all eliminated before I opened this ticket.

Hence my assumption that, because the effect occurs randomly and only during parallel execution, it's caused by the "cloned" CSV writer nodes writing to the same file at the same time.

Normally, I'd assume the file system / operating system prevents writing to a file that is being accessed by another process.
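
To illustrate what I suspect, here is a minimal sketch, assuming unsynchronized appends from two processes (this is not the actual KNIME implementation, just a model of the race):

```python
import csv
import multiprocessing

PATH = "shared.csv"  # hypothetical shared output file

def worker(tag, n_rows):
    # Each "cloned" writer appends to the same file with no coordination.
    with open(PATH, "a", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        for i in range(n_rows):
            w.writerow(["RowStart", tag, i, "RowEnd"])

if __name__ == "__main__":
    open(PATH, "w").close()  # start from an empty file
    procs = [multiprocessing.Process(target=worker, args=(t, 100_000))
             for t in ("A", "B")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    # Depending on the OS and I/O buffering, a buffer flush can end
    # mid-line, so lines from the two processes may interleave and no
    # longer have exactly the 3 commas of a 4-field row.
    bad = sum(1 for line in open(PATH, encoding="utf-8")
              if line.count(",") != 3)
    print("malformed lines:", bad)
```

If this is indeed the cause, the usual remedies would be to serialize the writes, or to let each parallel branch write its own file and concatenate them afterwards.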

Best
Mike

@mw,

are you writing to the very same file? If that's the case, do some rows contain more elements than the number of columns of the tables you're writing, and some fewer? Do you have the append option active?

Good morning @Mark_Ortmann,

No, the table structure is fixed: all columns are always present in the same order.

I’ve created a workflow to replicate the issue. The workflow:

  1. Creates random data
  2. Removes quotes from the data just to eliminate quoting issues
  3. Adds constant values to mark the beginning and end of each row
  4. Writes in parallel
  5. Reads the result
  6. If that fails, it reads the raw data
  7. Counts the RowStart and RowEnd constants
  8. Extracts the original RowID
  9. Removes valid RowIDs from the original data to keep the faulty ones

It can be extended via benchmark nodes to test the randomness of the error, but I was able to reproduce it consistently. A rough script analogue of the final checks is sketched below.
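
For reference, here is a rough Python analogue of steps 7-9; the marker names and column positions are assumptions based on the workflow description above, not the actual node configuration:

```python
def faulty_row_ids(path, all_row_ids):
    """Return the RowIDs that never appear in a structurally intact line."""
    valid = set()
    with open(path, encoding="utf-8") as f:
        next(f)  # skip header
        for line in f:
            fields = line.rstrip("\r\n").split(",")
            # step 7: a line counts as intact only if it begins with
            # RowStart and ends with RowEnd
            if fields[0] == "RowStart" and fields[-1] == "RowEnd":
                valid.add(fields[1])  # step 8: original RowID (assumed column)
    return set(all_row_ids) - valid  # step 9: keep only the faulty ones
```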

Best
Mike
