How to concatenate select cells in a table?

lparsons42 · July 20, 2018, 8:56pm

I am trying to read a FASTA protein file into KNIME, but the Bio Sequence Reader tool does not work for this particular file. My file looks something like this:

>PROTEIN_ACCESSION123 protein data
sequenceline1
sequenceline2
>PROTEIN_ACCESSION234 protein data
sequenceline1
sequenceline2
sequenceline3

What I want to create is a table that looks like this:
(PROTEIN_ACCESSION123 protein data) (sequenceline1.sequenceline2)
(PROTEIN_ACCESSION234 protein data) (sequenceline1.sequenceline2.sequenceline3)

In other words, I want to concatenate the sequence lines that belong to each protein, so that the full sequence for a given protein sits in the table cell next to its accession. Each line in this file ends with a newline (\n). I can easily do this in Perl by removing the newline from any line that is not an accession line (accession lines are easily distinguished by their leading “>” (greater-than) sign), making sure that every accession line is preceded by a newline, and then splitting the result into a table. I would like to accomplish this in KNIME, but I’m not sure how.

I read the FASTA file into KNIME with Line Reader, which gives a table where each line of the file is one row in a single column. If I can combine the sequence rows for each protein and then pair them with their corresponding accession, I should get the result I’m after.
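For reference, the grouping logic described above can be sketched in plain Python. This is only an illustration outside KNIME, not what any KNIME node does internally; the function name `group_fasta_lines`, the `sep` parameter, and the sample list `example_lines` are made up for this sketch. It assumes header lines start with “>” in the first column and that everything else belongs to the most recent header.

```python
from typing import Iterable, List, Tuple


def group_fasta_lines(lines: Iterable[str], sep: str = "") -> List[Tuple[str, str]]:
    """Pair each header line (leading '>') with its joined sequence lines."""
    records: List[Tuple[str, List[str]]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")  # drop the trailing newline
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Only a '>' in the first column starts a new record.
            records.append((line[1:], []))
        elif records:
            # Any other line is sequence data for the latest header.
            records[-1][1].append(line)
        # Lines that appear before the first header are ignored.
    return [(header, sep.join(parts)) for header, parts in records]


# Hypothetical input mirroring the sample file above.
example_lines = [
    ">PROTEIN_ACCESSION123 protein data\n",
    "sequenceline1\n",
    "sequenceline2\n",
    ">PROTEIN_ACCESSION234 protein data\n",
    "sequenceline1\n",
    "sequenceline2\n",
    "sequenceline3\n",
]

rows = group_fasta_lines(example_lines)
```

Each element of `rows` is an (accession line, concatenated sequence) pair, which corresponds to one row of the table I am after. Passing a different `sep` would keep a visible separator between the original sequence lines.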

Thank you!

s.roughley · July 20, 2018, 11:11pm

Have a look at the FASTA Sequence Extractor node in the Vernalis contribution (https://nodepit.com/node/com.vernalis.nodes.fasta.FASTASequenceNodeFactory_2). If you load the FASTA file(s) with either the Load text-based files (https://nodepit.com/node/com.vernalis.nodes.io.txt.LoadTxtNodeFactory) or Load Text Files (https://nodepit.com/node/com.vernalis.knime.io.nodes.load.txt.LoadTextFilesNodeFactory) nodes (also from Vernalis), you should be able to get to the result you want.

Steve

lparsons42 · July 23, 2018, 10:59am

Steve

Thank you for the suggestion. This is close to what I need, though I found that it parsed one accession line incorrectly (in a shortened UniProt file that I recently gave it as input). In particular, one UniProt entry had two consecutive equal signs in it, and that seems to have caused the tool to hiccup and split one line into two (producing one row in the table that has only an accession, followed by a row that has only the sequence). This is much closer than I have gotten with anything else so far; I might see if I can clean up the output downstream and go with it.

Thank you!

s.roughley · July 23, 2018, 11:23am

If you are able to post it, could you put the content of the entry which caused the hiccup into a post on the Vernalis forum, and I will see if we can figure out the problem?

Thanks

Steve

lparsons42 · July 23, 2018, 1:16pm

Steve
Thank you for your reply. I just posted it to the Vernalis forum moments ago. On closer inspection, the problem seems to come from an accession line that includes a greater-than symbol (which of course usually designates the start of such a line) in the middle of the description (where it was used to describe an enzymatic process).
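To illustrate the kind of input involved, here is a small Python sketch. The record below is hypothetical (it is not the actual UniProt entry), and the "naive" approach is only one plausible way such a split can arise; I am not claiming it is how the node was implemented.

```python
# Hypothetical record with a '>' inside the description.
text = (
    ">PROTEIN_ACCESSION345 hypothetical enzyme (A -> B)\n"
    "sequenceline1\n"
    "sequenceline2\n"
)

# Naive: treat every '>' as the start of a record.
naive_records = [chunk for chunk in text.split(">") if chunk]

# Safer: only a '>' in the first column of a line starts a record.
header_lines = [line for line in text.splitlines() if line.startswith(">")]
```

With the naive split, the single record falls apart into two chunks (one holding only part of the accession line, the other holding the rest plus the sequence), whereas checking the first column finds exactly one header.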

s.roughley · July 23, 2018, 5:59pm

Thanks @lparsons42. For anyone looking at a later date, this bug has been fixed; see the “FASTA Sequence Extractor Error” thread for details and links.

Steve