Search string within multiline columns

Hello,

I am using the Chemical Identitiy Resolver node to search for CAS Numbers based on SMILES codes. I want to check know if the CAS number I have for the SMILES is among the result provided by the CIR node.

In other words I want to check if a string is among a list of strings. The problem is the format of the data. The CIR node gives the results as multiple lines in one column. I think that therefore, the rule engine does not do it´s job. Please find below an example:

1 Like

The problem isn’t the CAS CIR column, it’s your rules.

As stated in the description box, the LIKE function checks whether the value of the left expression is like the wildcard pattern defined by the right expression. You’ve flipped this around for some reason.

The rules should be:

$CAS CIR$ LIKE $CAS Search$ => TRUE
TRUE => FALSE
1 Like

Hi elsamuel,

changing the order at least parly solved it, but not for the multiline entries:

Hi @Alkaline
You can give a try to this (removing the asterisks) :

image

Check:

 regexMatcher($CAS CIR$, 
 	join( 
 		"((.|\n)*)(?i)(.*?(" ,
 		$CAS Search$ ,
 		")[^$]*)$"  
 	) 
)

image

BR

4 Likes

In that case, one option is to convert to a single line string using a String Manipulation node using the expression

replace($CAS CIR$, "\n", " ")

then use the Rule Engine node.

5 Likes

Hi,

thanks to your reply.

@gonhaddock
I added the asterix by purpose, as I thought I can thereby search using wildcard. Removing it did not help:

@elsamuel:
This expression helped to get it into one line. However, the rule enginge did not provide the expected result with or without wildcard

Hi @Alkaline
Maybe I wasn’t clear or extensive in my previous explanation.

‘Removing the asterisks’ meant a call of attention for your $CAS Search$ column in the case that you attempt the ‘regex match’ approach, as in this case, they won’t be needed. Not referred to the ‘Rule Engine’'s method.

Regards

1 Like

Hi gonhaddock,

ah I understand. Your method works, but for me only if the CAS of interest is the first CAS in the list. An idea how to solve the last part of this?


1234
res

Ah! only the first CAS…
I’ll rearrage it

Hi @Alkaline, I can see that this appears to be a bug (or undocumented feature!) in that in my test even just a wildcard * with no other characters fails a pattern match as soon as the string contains lines feeds. Certainly that would not be expected behaviour I think.

To be honest, I would say that the suggestion from @elsamuel

… probably provides the least complex workaround to your problem.

Depending on how the multi-line strings were created you might have to replace just newlines \n, or possibly replace carriage return +line feed \r\n with an alternative character that could be just a space or some more obscure character such as ¬ just so that the problem goes away, as suggested

Having done that, your code should work. (It worked for me when I set up test data similar to yours)

3 Likes

This is odd, as it seems to work for me:

image

Can you share the data you’re using?

1 Like

Try this, with a Cell Splitter node

2 Likes

@elsamuel , @takbb
The ‘Row0’ (in elsamuel table) has to return ‘False’ as $CAS Search$ is not in first position in $CAS CIR$ list. The same is happening for your ‘Row2’, because the Rules are not requesting it.

image

This table is what I understand now.

BR

1 Like

Hi @all,

being late to the party, I would agree with @takbb that this is a bug: newlines (\n) should be treated just as a character, from my understanding.

I’d go with the workaround to replace the newlines with some other character, as @elsamuel suggested. @Alkaline, if you want to match only for the first CAS, you can then take your first idea and have only the asterisks at the end of the CAS Search, that is make the string only match if the beginning matches using the rule engine logic proposed by @elsamuel.

Of course, @gonhaddock’s solution with the cell splitter works just as fine :slight_smile: I just wanted to tell you that I filed a bug under reference AP-17933 to let you know the newline issue is in our system. I don’t know whether it is a good idea to change the behavior from a backwards compatibility perspective, but instead have an explanatory text in the description of “LIKE” that newlines are not treated as regular characters would help.

Thanks all for reporting and proposing solutions/workaround!

Best regards,
Lukas

5 Likes

Hi all,

thank you all for your help. For me it looks that everyone has some slightly different results. But in the end we managed to solve the issue with the suggestions from everyone :slight_smile: . And thank you for taking care that the bug will be fixed.

3 Likes