Create visual output based on partial string finding

Hi all,

I am not sure whether KNIME can perform this task, but hopefully someone can help me with it.

I have a list of partial sequences:

EZEO
HDZEZWI
SVBWG
ZSZS
KCOEOIUI
OIUIUPPEGB

and a list of full sequences:

HSVSGEZEODPDNCHDZEZWIQSNCD
GHSVBWGTDIEODMDBVAHWZSZSHS
TZUEUZDFNKCOEOIUIUPPEGBOPOJD

Now I want to find all the partial sequences in all the full sequences (the partial sequences might even overlap) and create an output in which the matched parts are highlighted, like this:

HSVSGEZEODPDNCHDZEZWIQSNCD
GHSVBWGTDIEODMDBVAHWZSZSHS
TZUEUZDFNKCOEOIUIUPPEGBOPOJD

Is this possible?

Best,
Sere

Hi @Sere,
how would you like to output that formatted string? To check whether a string exists within another string, you can use the String Manipulation node with the indexOf() function. It outputs the position in the queried string at which the searched string was found. However, this only works if it occurs only once. I might be able to give you more helpful advice if you tell me which output format you would prefer.
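For comparison outside KNIME, here is a minimal Python sketch of the same idea, using sequences from the first post. It uses Python's own str.find, not the String Manipulation node, so treat it only as an illustration: it reports the 0-based position of the first match and -1 when there is no match.

```python
full = "HSVSGEZEODPDNCHDZEZWIQSNCD"

# str.find returns the 0-based index of the first match, or -1 if absent.
pos_ezeo = full.find("EZEO")
pos_missing = full.find("ZSZS")  # this partial sequence is not in this full sequence

print(pos_ezeo)
print(pos_missing)
```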
Kind regards,
Alexander

Hi @AlexanderFillbrunn

I will have a long list of partial sequences and a long list of full sequences, so a few partial sequences will definitely be matched more than once. As for the output, I don't really mind: a Word file, an Excel file, or RTF would all do.

Hi,
what I meant was: would a partial sequence occur multiple times in a full sequence?
Kind regards
Alexander

Oh sorry. No, definitely not.

Hi @Sere,
to answer your question: Yes, it is possible, but only by really bending KNIME to your will using a bunch of veeery dirty tricks. Please have a look at the attached workflow. But disclaimer: If you are dealing with a lot of sequences, it might be easier to use one of the scripting nodes, e.g. Python or R.

The workflow goes through all the subsequences in a recursive loop and does the replacement one by one, always passing the table with the treated full sequences back to the loop start. Within the loop a String Manipulation node uses regular expressions to add parentheses around the detected subsequences. After the loop, a Column Expression node goes through the sequences again, inserting opening and closing HTML bold tags (<b> and </b>). Then we add some HTML tags at the beginning and end and write it using the CSV Writer. This is as ugly as it can get, but if it is stupid and it works, it ain’t stupid :wink:
Subsequences.knwf (37.6 KB)
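For anyone who prefers the scripting route mentioned above, the following is a rough Python sketch of the highlighting step. It is not a translation of the attached workflow: instead of inserting parentheses in a loop, it marks every character covered by any partial sequence and then wraps each covered run in <b> and </b>, so overlapping matches merge into one bold region. The function name highlight and its parameters are made up for this example.

```python
def highlight(full, partials, open_tag="<b>", close_tag="</b>"):
    """Wrap every region of `full` covered by a partial sequence in tags."""
    covered = [False] * len(full)
    for part in partials:
        part = part.strip()  # ignore stray leading/trailing whitespace
        if not part:
            continue
        start = full.find(part)
        while start != -1:
            for i in range(start, start + len(part)):
                covered[i] = True
            start = full.find(part, start + 1)

    out = []
    inside = False
    for ch, flag in zip(full, covered):
        if flag and not inside:
            out.append(open_tag)   # a covered run begins
        elif not flag and inside:
            out.append(close_tag)  # a covered run ends
        inside = flag
        out.append(ch)
    if inside:
        out.append(close_tag)
    return "".join(out)


partials = ["EZEO", "HDZEZWI", "SVBWG", "ZSZS", "KCOEOIUI", "OIUIUPPEGB"]
fulls = [
    "HSVSGEZEODPDNCHDZEZWIQSNCD",
    "GHSVBWGTDIEODMDBVAHWZSZSHS",
    "TZUEUZDFNKCOEOIUIUPPEGBOPOJD",
]

for seq in fulls:
    print(highlight(seq, partials))
```

The printed lines can be wrapped in the usual HTML tags and written to a file, similar to what the workflow does at the end.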

Hi @Sere

I was working on a different approach like this wf identify occurence multiple substring.knwf (316.6 KB). For every partial sequence it tells you whether it occurs and at what position.
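The same kind of result can be sketched in a few lines of Python. This is not the content of the workflow, just an illustration of the idea; the function name find_occurrences is invented here, positions are 0-based, and it assumes (as stated earlier in the thread) that a partial sequence occurs at most once per full sequence.

```python
def find_occurrences(partials, fulls):
    """Map each partial sequence to a list of (full sequence number, position)."""
    hits = {}
    for part in partials:
        found = []
        for number, seq in enumerate(fulls, start=1):
            pos = seq.find(part)  # -1 means "not found"
            if pos != -1:
                found.append((number, pos))
        hits[part] = found
    return hits


partials = ["EZEO", "HDZEZWI", "SVBWG", "ZSZS", "KCOEOIUI", "OIUIUPPEGB"]
fulls = [
    "HSVSGEZEODPDNCHDZEZWIQSNCD",
    "GHSVBWGTDIEODMDBVAHWZSZSHS",
    "TZUEUZDFNKCOEOIUIUPPEGBOPOJD",
]

hits = find_occurrences(partials, fulls)
for part, found in hits.items():
    print(part, found)
```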

gr. Hans

Hi @AlexanderFillbrunn

Thank you very much! …and I agree, if it works it works :slight_smile: The workflow works nicely, but when I change the substrings and sequences in the first two tables, I run into some problems. It seems that only the last substring is highlighted in the HTML output and all the others are ignored, although they are definitely present. Do I need to adjust some settings? Please find attached the workflow with my updated sequences.

Thanks!

Subsequences2.knwf (42.4 KB)

Hi @Sere,
I tried your downloaded workflow and had the same result. However, when I paste your sequences into a text editor and search for the partial sequences, I also don’t get any hits except for the last sequence. Are you sure the partial sequences are contained? Or maybe it would be better to use tools made for sequence data, such as SeqAn?
Kind regards
Alexander

@AlexanderFillbrunn

That's weird. If I paste them into a text editor and then search for the partial sequences, I do find them. The partial sequences are definitely there. I'll have a look at SeqAn. Thanks for the recommendation!

Cheers,
Sere

@Sere,
can you give me an example where, for example, the first subsequence “AALEKDYEEVGADSAEGDDEGEEY” is found?
Kind regards
Alexander

“AALEKDYEEVGADSAEGDDEGEEY” is found at the end of the 3rd sequence:

MRECISIHVGQAGVQIGNACWELYCLEHGIQPDGQMPSDKTIGGGDDSFNTFFSETGAGKHVPRAVFVDLEPTVIDEVRTGTYRQLFHPEQLITGKEDAANNYARGHYTIGKEIIDLVLDRIRKLADQCTGLQGFLVFHSFGGGTGSGFTSLLMERLSVDYGKKSKLEFSIYPAPQVSTAVVEPYNSILTTHTTLEHSDCAFMVDNEAIYDICRRNLDIERPTYTNLNRLISQIVSSITASLRFDGALNVDLTEFQTNLVPYPRIHFPLATYAPVISAEKAYHEQLTVAEITNACFEPANQMVKCDPRHGKYMACCLLYRGDVVPKDVNAAIATIKTKRTIQFVDWCPTGFKVGINYQPPTVVPGGDLAKVQRAVCMLSNTTAIAEAWARLDHKFDLMYAKRAFVHWYVGEGMEEGEFSEAREDMAALEKDYEEVGADSAEGDDEGEEY

Hi @Sere,
you have some trailing spaces in your partial sequences, that is why it does not work. You can either change that in the table creator or simply use a String Manipulation node with the strip() function to remove any leading and trailing whitespace.
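To show the effect in isolation, here is a tiny Python sketch (Python's str.strip, not the KNIME node; the trailing space in partial_raw is a made-up example of what such a table entry might look like):

```python
full = "GHSVBWGTDIEODMDBVAHWZSZSHS"
partial_raw = "ZSZS "  # hypothetical table entry with a trailing space

# The trailing space is part of the search string, so the match fails.
found_raw = partial_raw in full

# After stripping leading and trailing whitespace the match succeeds.
found_clean = partial_raw.strip() in full

print(found_raw, found_clean)
```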
Kind regards
Alexander

@AlexanderFillbrunn

Ok, I had a similar suspicion! Thanks for the big help! :raised_hands:

Cheers,
Sere
