Hello I need help regarding gene and protein sequences

Hello everyone. I am currently working on gene and protein sequencing in a bioinformatics project using KNIME.
I am trying to apply the k-mer counting method that is widely used in Python. Specifically, I want to create a filter of five characters, such as “ATACG”, and count how many times that combination occurs in a gene sequence.
The combinations should be checked by moving one letter at a time: take the five-character filter mentioned above, slide it over the sequence, and record the number of times it is found. The sequence example is given below.
I look forward to hearing from anyone. This is very urgent and important for me, so even the slightest help would be highly appreciated. Thank you in anticipation.

Hi @Faheeemahmed,

Are you looking for a solution that gives you a list of k-mers and their counts in a given biological sequence? If yes, this is going to be a computationally intensive task to do in KNIME. There is also the question of how you want the result to be presented, because every row in a KNIME table that holds a sequence can produce an arbitrary number of k-mers and the same number of counts.

Can you please point me to the widely used Python method that you mentioned? We could use that same method from KNIME through the Python integration.
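
For reference, overlapping k-mer counting in plain Python is often written with `collections.Counter`. The sketch below is only an illustration and not necessarily the method the original poster had in mind; the function name `count_kmers` and the short test sequence are made up for the example. It could in principle be adapted for a KNIME Python Script node, but reading and writing KNIME tables is left out here.

```python
from collections import Counter


def count_kmers(sequence, k=5):
    """Count every overlapping k-mer, moving the window one letter at a time."""
    sequence = sequence.upper()
    # range() is empty when the sequence is shorter than k, so the result is empty too.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical toy sequence, not taken from the thread.
counts = count_kmers("ATACGATACGT", k=5)
print(counts["ATACG"])
```

With k = 5 the 11-letter toy sequence yields 7 windows, two of which are ATACG. The number of distinct k-mers varies from sequence to sequence, which is why the shape of the output table needs to be decided up front.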

I am also missing the sequence example you mentioned.

Best,
Temesgen

Hello @temesgen-dadi, I am really glad to hear from you. I am working on a bioinformatics project, and I need to convert the biological sequences into a form suitable for training algorithms and finding useful patterns. I am running out of the time given by my instructor.
For k-mers, I want to create a filter like ‘AATGCATTA’
and slide it over the sequence in a row. Every time the filter matches within a row, the count should increase by 1, and the same should be done for all rows. Alternatively, it should take two or more filters and repeat the same process for each.
An example row is given below.

ATGTCAGAAACTTCCAGGACCGCCTTTGGAGGCAGAAGAGCAGTTCCACCCAATAACTCTAATGCAGCGGAAGATGACCTGCCCACAGTGGAGCTTCAGGGCGTGGTGCCCCGGGGCGTCAACCTGCAAGAGTTTCTTAATGTCACGAGCGTTCACCTGTTCAAGGAGAGATGGGACACTAACAAGGTGGACCACCACACTGACAAGTATGAAAACAACAAGCTGATTGTCCGCAGAGGGCAGTCTTTCTATGTGCAGATTGACTTCAGTCGTCCATATGACCCCAGAAGGGATCTCTTCAGGGTGGAATACGTCATTGGTCGCTACCCACAGGAGAACAAGGGAACCTACATCCCAGTGCCTATAGTCTCAGAGTTACAAAGTGGAAAGTGGGGGGCCAAGATTGTCATGAGAGAGGACAGGTCTGTGCGGCTGTCCATCCAGTCTTCCCCCAAATGTATTGTGGGGAAATTCCGCATGTATGTTGCTGTCTGGACTCCCTATGGCGTACTTCGAACCAGTCGAAACCCAGAAACAGACACGTACATTCTCTTCAATCCTTGGTGTGAAGATGATGCTGTGTATCTGGACAATGAGAAAGAAAGAGAAGAGTATGTCCTGAATGACATCGGGGTAATTTTTTATGGAGAGGTCAATGACATCAAGACCAGAAGCTGGAGCTATGGTCAGTTTGAAGATGGCATCCTGGACACTTGCCTGTATGTGATGGACAGAGCACAAATGGACCTCTCTGGAAGA

Hi @Faheeemahmed

Your question is still not clear to me. I am a trained bioinformatician. I don’t mean to be rude, but if this is homework and you couldn’t do it in time yourself, you had better talk to your instructor to get a better understanding of the problem as well as how to solve it.

If what you are looking for is a KNIME workflow that

  • takes a list of query strings (you are calling them filters/kmers) and another list of longer DNA sequences like the one you provided above
  • calculates the occurrence count of each query string in each DNA sequence

then the attached workflow might help.
count_occurrence.knwf (18.1 KB)
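
Outside KNIME, the same two-list counting can be sketched in plain Python. This is not the contents of the attached workflow, just a hypothetical illustration; `count_occurrences` and `count_table` are names invented here, and the toy inputs are made up. The sketch assumes overlapping matches should be counted, which differs from Python's built-in `str.count`, since that only counts non-overlapping matches.

```python
def count_occurrences(query, sequence):
    """Count occurrences of query in sequence, including overlapping ones."""
    if not query:
        return 0
    count = 0
    start = sequence.find(query)
    while start != -1:
        count += 1
        # Resume one position after the last hit so overlapping hits are found.
        start = sequence.find(query, start + 1)
    return count


def count_table(queries, sequences):
    """Return one row per sequence and one column per query string."""
    return [[count_occurrences(q, s) for q in queries] for s in sequences]


# Hypothetical toy inputs: two query strings, two short sequences.
print(count_table(["ATG", "GGG"], ["ATGATG", "GGGG"]))
```

Each inner list corresponds to one DNA sequence and holds the counts for the query strings in the order they were given.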

Best,
Temesgen

Hello again @temesgen-dadi, it is a pleasure to hear from you again.
I am trying, but maybe I could not make myself understood.

Actually, I am a PhD student and this is part of my research work. I need to finish this project soon because the deadline is approaching.
Could you share your email with me so that I can contact you directly and explain the problem in a better way?

Thank you in advance.
