I know this should be an easy task, but with a large FASTA file (say a UniProt database with ~42,000 sequences over ~460,000 lines) it turns out to be anything but. I know we have ‘Bio Sequence Reader’ from “NGS Tools” and ‘FASTA Sequence Extractor’ from Vernalis, but neither works correctly in this case. The former does not output sequences from a protein (or UniProt) FASTA file - instead it outputs the accessions twice - and the latter stalls indefinitely when trying to process the file (it can be left running for over 24 hours without ever finishing reading it).
The output I want is a table that has accessions and sequences. Sequence lengths could be useful, but they could of course be derived afterwards. One of the main challenges here - as alluded to by the numbers above - is that most sequences span multiple lines, so the lines belonging to each record need to be concatenated to recover the full-length sequence.
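To make the goal concrete, here is a minimal Python sketch of the parsing logic I'm after, not a working node. The function names (`read_fasta`, `accession_of`, `fasta_to_rows`, `write_table`) are just illustrative, and the accession extraction assumes UniProt-style headers of the form `>db|ACCESSION|ENTRY_NAME description`; for any other header layout it simply falls back to the first whitespace-delimited token.

```python
import csv
import io


def accession_of(header):
    """Extract an accession from a header line (without the leading '>').

    Assumption: UniProt-style 'db|ACCESSION|ENTRY_NAME ...' headers.
    Anything else falls back to the first whitespace-delimited token.
    """
    tokens = header.split()
    first = tokens[0] if tokens else ""
    parts = first.split("|")
    return parts[1] if len(parts) >= 3 else first


def read_fasta(handle):
    """Yield (accession, sequence) pairs from a FASTA text stream.

    Lines are read one at a time, so the whole file never has to be
    held in memory; wrapped sequence lines are joined per record.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield accession_of(header), "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield accession_of(header), "".join(chunks)


def fasta_to_rows(handle):
    """Return a list of (accession, sequence, length) rows."""
    return [(acc, seq, len(seq)) for acc, seq in read_fasta(handle)]


def write_table(rows, out):
    """Write the rows as CSV with a header line."""
    writer = csv.writer(out)
    writer.writerow(["accession", "sequence", "length"])
    writer.writerows(rows)


if __name__ == "__main__":
    # Tiny in-memory example; a real file would be opened with open(path).
    demo = io.StringIO(">sp|P12345|TEST_HUMAN Example\nMKT\nAAV\n")
    print(fasta_to_rows(demo))
```

Collecting the wrapped lines in a list and joining them once per record keeps the work linear in the file size, so a file of this size should not be a problem for this kind of approach.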
Has anyone had any luck with something like this? There are tools in OpenMS that can open FASTA files (such as DecoyDatabase) but I haven’t found any in their collection that reads FASTA into a table. I don’t see anything in the SeqAn collection that seems to be applicable here either.
Thank you!