FASTA Sequence Extractor Error

I recently tried the FASTA Sequence Extractor downstream from Load Text Files, with an interest in reading a UniProt db into a table for a larger KNIME workflow. Ideally I would like to get a table out of this with a column for the accession and a column for the entire protein sequence from the db. It would be a bonus if we could also get a column with the description (from the accession row) and another with the length, but I can certainly live without those.

The problem I have run into is that at least one UniProt entry (in this case P26439) was split erroneously into two lines in the table created by FASTA Sequence Extractor. Looking at the accession line for this entry, I see that there is actually a “>” in the description:

sp|P26439|3BHS2_HUMAN 3 beta-hydroxysteroid dehydrogenase/Delta 5-->4-isomerase type 2 OS=Homo sapiens GN=HSD3B2 PE=1 SV=2

The result is that the line in the table for this accession has no sequence, and the next line in the table has the sequence without an accession. This entry is from Human UniProt, downloaded 15 November 2016; I downloaded Human UniProt again this morning and found that the same accession line still exists.

Thank you.

Thanks. That is really helpful. I have just pushed out a fix on the nightly build (v1.17.2). The parser assumed that ‘>’ would only ever appear at the start of a header line. I will put this onto the stable release in the next week or so. (See Update to v 1.17.2 (Nightly only) for details.)
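To illustrate the failure mode for anyone parsing FASTA by hand, here is a minimal Python sketch. It is not the node's actual code; `naive_split` and `parse_fasta` are hypothetical names, and the sequence is cut to its first residues for brevity. It contrasts splitting on every ‘>’ with treating ‘>’ as a record marker only when it is the first character of a line.

```python
def naive_split(text):
    """Split on every '>' - the assumption that breaks on P26439."""
    return [chunk for chunk in text.split(">") if chunk.strip()]


def parse_fasta(text):
    """Yield (header, sequence) pairs, treating '>' as a record marker
    only when it is the first character of a line."""
    header, seq_lines = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(seq_lines)
            header, seq_lines = line[1:].strip(), []
        elif header is not None:
            seq_lines.append(line.strip())
    if header is not None:
        yield header, "".join(seq_lines)


# Header as posted above; the sequence is truncated for this example.
EXAMPLE = (
    ">sp|P26439|3BHS2_HUMAN 3 beta-hydroxysteroid dehydrogenase/"
    "Delta 5-->4-isomerase type 2 OS=Homo sapiens GN=HSD3B2 PE=1 SV=2\n"
    "MGWSCLVTGAGGLLGQRIVRLLVEEKELKEIRALDKAFRPELREEFSKLQNRTKLTVLEG\n"
    "DILDEPFLKRACQDVSVVIHTACIIDVFGVTHRESIMNVNVKGTQLLLEACVQASVPVFI\n"
)

broken = naive_split(EXAMPLE)        # header cut at the inner '>'
records = list(parse_fasta(EXAMPLE))  # one intact record
```

With the naive split, the first chunk ends at "Delta 5--" and carries no sequence, while the second chunk carries the sequence without an accession, which matches the two-row symptom reported above.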

Steve

Thank you Steve! I will keep an eye out for the Vernalis nightly to become available through KNIME. I was certainly surprised to encounter the additional greater-than sign in the middle of a line. The first time I looked at the table output from FASTA Sequence Extractor, I thought it was the double dash that caused the hiccup, until I looked at the accession line itself.

I just reran FASTA Sequence Extractor (1.17.2v201807231541) and now I have a slightly different problem in the output. For the same accession line that I mentioned before - which had a greater-than sign in the line - the accession is now trimmed at that greater-than sign. In other words the sequence is correct - and correctly mapped - but the accession is trimmed short and I don’t have the full description of the protein.

Could you tell me what other settings you are using in the node dialog? I am out of the office for a few days but will try to get back to this when I return.

Steve

“Select a column containing the FASTA Sequence Cells”: “(String) File contents”
“Delete FASTA Sequence colum” - unchecked
“Select the FASTA Sequence source or type”: “SWISS-PROT”
“Extract full Header Row(s) from FASTA” - unchecked
“Extract the Sequence?” - checked

Thanks for that. I now have in my output using those settings (and removing any other nasty places where the ‘>’-only-at-start-of-header-line assumption was made):

Accession:    P26439
Name:         3BHS2_HUMAN 3 beta-hydroxysteroid dehydrogenase/Delta 5-->4-isomerase type 2 OS=Homo sapiens OX=9606 GN=HSD3B2 PE=1 SV=2
Sequence:     MGWSCLVTGAGGLLGQRIVRLLVEEKELKEIRALDKAFRPELREEFSKLQNRTKLTVLEGDILDEPFLKRACQDVSVVIHTACIIDVFGVTHRESIMNVNVKGTQLLLEACVQASVPVFIYTSSIEVAGPNSYKEIIQNGHEEEPLENTWPTPYPYSKKLAEKAVLAANGWNLKNGDTLYTCALRPTYIYGEGGPFLSASINEALNNNGILSSVGKFSTVNPVYVGNVAWAHILALRALRDPKKAPSVRGQFYYISDDTPHQSYDNLNYILSKEFGLRLDSRWSLPLTLMYWIGFLLEVVSFLLSPIYSYQPPFNRHTVTLSNSVFTFSYKKAQRDLAYKPLYSWEEAKQKTVEWVGSLVDRHKETLKSKTQ
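For anyone wanting to reproduce these columns outside the node, a minimal Python sketch follows. It assumes the `db|accession|name` header layout seen in this thread; `split_uniprot_header` and `to_row` are hypothetical names, not part of the Vernalis nodes, and the sequence is truncated for brevity.

```python
def split_uniprot_header(header):
    """Split 'db|accession|name ...' into its three parts.

    Assumes the UniProt-style layout seen in this thread; maxsplit=2 keeps
    any later '|' or '>' characters inside the name untouched.
    """
    if header.startswith(">"):
        header = header[1:]  # drop only the leading record marker
    db, accession, name = header.split("|", 2)
    return {"db": db, "accession": accession, "name": name.strip()}


def to_row(header, sequence):
    """Build one table row, adding the sequence length as an extra column."""
    row = split_uniprot_header(header)
    row["sequence"] = sequence
    row["length"] = len(sequence)
    return row


HEADER = (
    ">sp|P26439|3BHS2_HUMAN 3 beta-hydroxysteroid dehydrogenase/"
    "Delta 5-->4-isomerase type 2 OS=Homo sapiens OX=9606 GN=HSD3B2 PE=1 SV=2"
)
# Sequence truncated to its first residues for this example.
row = to_row(HEADER, "MGWSCLVTGAGGLLGQRIVRLLVEEKELKE")
```

Splitting on the first two `|` characters only, rather than on ‘>’, means the "5-->4" in the description cannot affect where the accession or name ends.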

I will push this as version 1.17.3 to the nightly build again tomorrow. (I need an update to another project to build first, as this build will also fix another issue.)

Steve

OK, that should now be fixed in v1.17.3 (again, nightly only at the moment).

Steve