Column header QC - can tis be done?

pipetman · July 23, 2013, 8:22pm

I just put together a workflow to concatenate a large number of data files (plus some downstream processing). Seems to work ok so far.

Unfortunately, the column headers between the files are not consistent (as I found out when the above workflow kept throwing up errors). The individual files were generated over the last 18 month and the system that produced them went through various upgrades etc., hence the mess.

To get a handle on the issue, I want to get an overview of the all the different column header layouts. So I tried to set up another workflow that extracts the column headers (line 1) from each file, transposes them, and then appends all these columns. But that attempt wasn't very successful so far.

I start with "List Files", then proceed to "TableRow To Variable Loop Start". In my original workflow this is followed by the "Variable Based File Reader", but this relies on consistent column headers, so I can not use it here. Instaed, I was hoping the "Line Reader" would do the trick (limiting the number of rows to 1). Now when I configure the "Line Reader" to use the "Location" variable (instead instead of an URL of file name), it dutifully reports "The 'url' parameter is controlled by a variable". But after hitting "ok", I get "Invalid settings: Invalid (empty) URL".

What's the culprit here? Is there any way to make this work?

Thanks

- j.

aborg · July 23, 2013, 8:47pm

You should select a URL (any valid URL will be ok) in the normal part of the dialog and set the flow variable. That way you will not get an error and will work as expected.

Hope this helps, gabor

pipetman · July 24, 2013, 2:44am

Thanks! That helped a great deal and made it work.

Next conundrum, however, is that my data files are tab-delimited and the line reader lumps everything together in one line and strips out the tabs (I assume, commas in a csv would be preserved)...

aborg · July 24, 2013, 11:04am

I do not think Line Reader strips out tabs. Maybe those are just not visible.

pipetman · July 24, 2013, 2:18pm

Hmm, it looks like the line reader does strip them out. I wrote the output to a csv and just opened that in notepad and this is how the first row looks like (truncated):

Experiment CodeExperiment NameProjectProject CodeScreen TypeAssay Name

whereas it should be (I manually added commas instead of tabs to show the original columns):

Experiment Code,Experiment Name,Project,Project Code,Screen Type,Assay Name

aborg · July 24, 2013, 3:06pm

Here is an example workflow where it seems the tabs kept as they are. Could you share a workflow where it does not work as you expected? Thanks, gabor

tabhandling.zip

pipetman · July 24, 2013, 3:59pm

Thanks for looking into this. I have attached a zip file that contains the basic workflow, two small example files (tab delimited) and the result file (with the tabs stripped off).

columnheaderqc.zip

aborg · July 24, 2013, 4:50pm

It seems everything is there. The misleading result is part of the Advanced/Pattern written out between data values setting in CSV Writer, which is \t in your case. Without that, you will get back the tabs.

With the Cell Splitter (\ escape, \t delimiter) I was able to recover data to 23 new columns.

Hope this helps, gabor

tempflow.zip

pipetman · July 25, 2013, 3:06pm

Thanks, this was incredibly helpful. Everything works as advertised.