Finding a Pattern in a Dataset

abuchberger · September 21, 2018, 1:48pm

Hello! I have a data set of sequences made up of different letters (i.e., amino acids forming proteins). I was wondering if there is a way to find a pattern or shared motif in these sequences (e.g., XXn or nXX, where X can be any letter). Do I need a learning node for this?

Jeany · September 26, 2018, 2:10pm

Hello! Thank you for your question. Do you already know the motif and want to filter your data for it? If so, you could use the Row Filter node with wildcards or regular expressions. With wildcards, entering *G?G* for the option “use pattern matching” keeps all sequences that contain the pattern GXG, where X is again any letter (* stands for any run of characters, ? for exactly one character). Another option is the String Manipulation node with the regexMatcher function.
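If it helps to see the matching logic outside of KNIME, here is a minimal Python sketch of the same GXG filter, once with the wildcard pattern and once with a regular expression. The example sequences are made up for illustration, and this only mirrors the idea of the filter; it is not how the Row Filter node is implemented.

```python
import re
from fnmatch import fnmatchcase

# Made-up example sequences (assumption, not from the original data set)
sequences = ["MKGAGLL", "AAGGA", "GTG", "MKTAYIAK"]

# Wildcard version: * = any run of characters, ? = exactly one character
wildcard_hits = [s for s in sequences if fnmatchcase(s, "*G?G*")]

# Regex version: G, then any single uppercase letter, then G, anywhere in the sequence
motif = re.compile(r"G[A-Z]G")
regex_hits = [s for s in sequences if motif.search(s)]

print(wildcard_hits)
print(regex_hits)
```

Both lists contain the same sequences, since the two patterns describe the same motif.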
Finding the motif from scratch is its own research question. You might want to start with an alignment; for that you can use our community nodes. SeqAn offers different options for sequence alignments (Community Nodes/SeqAn/Sequence Alignment).
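As a very rough first look before running an alignment, you could also count which short substrings (k-mers) occur in many of your sequences. The sketch below is only a naive illustration in Python with made-up sequences and a hypothetical helper count_kmers; it is not a substitute for an alignment or a dedicated motif-finding tool, and it only finds exact substrings without gaps.

```python
from collections import Counter

def count_kmers(sequences, k=3):
    """Count in how many sequences each substring of length k occurs."""
    counts = Counter()
    for seq in sequences:
        # Use a set so a k-mer is counted at most once per sequence
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts

# Made-up example sequences (assumption, not from the original data set)
sequences = ["MKGAGLL", "TTGAGAA", "GAGPP", "MKTAYIAK"]

counts = count_kmers(sequences, k=3)
# The k-mer shared by the most sequences, with its sequence count
print(counts.most_common(1))
```

A k-mer that appears in most sequences is a candidate worth checking with the Row Filter afterwards.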

Hope that helps,
Jeany