Column header QC - can this be done?

I just put together a workflow to concatenate a large number of data files (plus some downstream processing). Seems to work ok so far.

Unfortunately, the column headers are not consistent between the files (as I found out when the above workflow kept throwing up errors). The individual files were generated over the last 18 months, and the system that produced them went through various upgrades etc., hence the mess.

To get a handle on the issue, I want an overview of all the different column header layouts. So I tried to set up another workflow that extracts the column headers (line 1) from each file, transposes them, and then appends all these columns. But that attempt hasn't been very successful so far.

I start with "List Files", then proceed to "TableRow To Variable Loop Start". In my original workflow this is followed by the "Variable Based File Reader", but that relies on consistent column headers, so I cannot use it here. Instead, I was hoping the "Line Reader" would do the trick (limiting the number of rows to 1). Now when I configure the "Line Reader" to use the "Location" variable (instead of a URL or file name), it dutifully reports "The 'url' parameter is controlled by a variable". But after hitting "ok", I get "Invalid settings: Invalid (empty) URL".

What's the culprit here? Is there any way to make this work?


 - j.


You should select a URL (any valid URL will be ok) in the normal part of the dialog and then set the flow variable. That way you will not get an error, and the node will work as expected.

Hope this helps, gabor

Thanks! That helped a great deal and made it work.

Next conundrum, however: my data files are tab-delimited, and the Line Reader lumps everything together in one line and strips out the tabs (I assume commas in a csv would be preserved)...

I do not think the Line Reader strips out tabs. Maybe they are just not visible.
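One quick way to tell whether the tabs are really gone or merely invisible is to look at the raw representation of the line, e.g. in Python (the two strings below are made-up examples of the two possibilities):

```python
line_with_tabs = "Experiment Code\tExperiment Name\tProject\n"
line_without_tabs = "Experiment CodeExperiment NameProject\n"

# repr() makes whitespace explicit: a real tab shows up as \t
print(repr(line_with_tabs))     # 'Experiment Code\tExperiment Name\tProject\n'
print(repr(line_without_tabs))  # 'Experiment CodeExperiment NameProject\n'
```

If the `\t` escapes are missing from the output, the tabs were genuinely removed rather than just hidden by the viewer.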

Hmm, it looks like the Line Reader does strip them out. I wrote the output to a csv and opened it in Notepad; this is what the first row looks like (truncated):


Experiment CodeExperiment NameProjectProject CodeScreen TypeAssay Name


whereas it should be (I manually added commas in place of the tabs to show the original columns):


Experiment Code,Experiment Name,Project,Project Code,Screen Type,Assay Name


Here is an example workflow where the tabs seem to be kept as they are. Could you share a workflow where it does not work as you expected? Thanks, gabor

Thanks for looking into this. I have attached a zip file that contains the basic workflow, two small example files (tab-delimited), and the result file (with the tabs stripped out).

It seems everything is there. The misleading result comes from the Advanced / "Pattern written out between data values" setting in the CSV Writer, which is \t in your case. Without that, you will get the tabs back.

With the Cell Splitter (\ as escape character, \t as delimiter) I was able to recover the data into 23 new columns.
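For reference, the Cell Splitter step does the equivalent of splitting the lumped string on the tab character; in Python terms (the example line is taken from the header quoted earlier in the thread):

```python
line = "Experiment Code\tExperiment Name\tProject\tProject Code\tScreen Type\tAssay Name"

# Splitting on the tab delimiter turns the single lumped cell back into columns
columns = line.split("\t")
print(columns)
# ['Experiment Code', 'Experiment Name', 'Project', 'Project Code', 'Screen Type', 'Assay Name']
```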

Hope this helps, gabor

Thanks, this was incredibly helpful. Everything works as advertised.