Using NOT REGEX_LIKE in SQL DB QUERY

Hi,

I’m having issues trying to create a non-greedy match in a SQL query.

To summarise what I am trying to do:

(1) Inital query to bring in information that matches the biomarker I am searching for:

Which brings in 722k rows.

(2) I need to exclude some information that isn’t applicable for this search.

Which returns 42k rows. Also if i didn’t include the NOT condition it pulls out the number of rows i would expect to meet the undesirable criteria which was around 211k. It’s the NOT that seems to cause the issue.

But if instead of performing (2) I use a rule engine instead:
$RESULT_NAME_REPLACE$ MATCHES “(.?)VEGFR(.?)” OR $RESULT_VALUE_REPLACE$ MATCHES “(.?)VEGFR(.?)” OR $RESULT_COMMENTS_REPLACE$ MATCHES “(.?)VEGFR(.?)” => “Y”

For each requirement and filter out the values that match this condition it returns 522k rows, all of which meet the criteria of what I want.

After some desk research I think its to do with how it matches i.e. greedy vs non-greedy but I’ve currently not been able to find a work around in a DB friendly manner.

Note: Have also tried using a wildcard search i.e LIKE ‘%VEGFR%’ and this hasn’t worked either, as well as column IN (‘%VEGFR%’).

Any help would be appreciated as this has be wracking my brain all yesterday evening.

Have created a workaround by using a CASE statement to assign either a label to “filter” or “retain” which works but would be interested to know if anyone has any answers for why the NOT condition could not do the same.

Hi @gem4236 ,

MAYBE you have to create 2 groups for test your conditional as you need.

The first one brings “OK” values and de second one brings “Don’t want”. You can use logical test for it as:

(

image

) AND NOT (

image

)

I bring your images because you didn’t write here like “text/code”, but could you try it?

OR

Bring the data to inside knime and then make filters and manipulations that you need…

BR, Denis

1 Like

Thank you for your reply. That set up of query was my inital step but unfortunately didn’t behave as expected and pulled the data down to 42k rows.

Hi @gem4236 ,

Your question is somewhat difficult to give specific answers to because we don’t actually have much detail about what it is you are doing, even though you have clearly tried to provide detail.

What I mean is, we don’t know what your data looks like, and your SQL contains flow variables which you haven’t explicitly given us the values of. We can probably assume that they contain the values you have given in their names and I guess they also contain single quotes at the beginning and end for this to work syntactically.

I am also confused about whether you are saying the database query is/isn’t working, or whether it is the rule engine you are having trouble with (or both).

Database:
I was a little confused about why you would have two separate database queries, one containing the data you wanted, and the other containing the data you didn’t, as I couldn’t understand what you then intended to do to have these queries work together to somehow filter your data. From a database query perspective, and from what I can read into your question, I would say that the answer given by @denisfi is spot on, (paying particular attention to wrapping each group of “OR” conditions in parentheses) so if that isn’t working, then either your regex is wrong, or your data isn’t as expected, or we are misunderstanding what you are asking.

In your comment, you said you are trying to use non-greedy regex, but if we are to assume that your flow variables contain the regex that you display in their names, then they are not non-greedy (i.e. they are greedy)

Rule Engine:
Does the regex in the Rule Engine match the regex you had in your second database query? It doesn’t look like it, but as already mentioned, we don’t know what the values of your flow variables are, so we cannot be sure.

Further Wildcard searches in Rule Engine with LIKE use “*” as wild cards, not “%” but I wasn’t sure if your mention of trying LIKE ‘%VEGFR%’ was related to the Rule Engine, or the SQL. In SQL terms, IN ( ) is never going to work with wildcards unless you have a DB with an implementation of SQL I’ve not come across before. It would be nice sometimes if it did though! :wink:

The regex you have supplied in the Rule Engine is also “Greedy”. To make it non-greedy (or lazy) you would use .*? rather than .? wouldn’t you?

From regex101.com

? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
*? matches the previous token between zero and unlimited times, as few times as possible, expanding as needed (lazy)

So from this, the following Rule Engine condition:
$RESULT_NAME_REPLACE$ MATCHES “(.*?)VEGFR(.*?)”

would only find VEGFR with at most one character either side, I believe, and so would match any of:
xVEGFRy
VEGFRz
pVEGFR

but would not match
xxVEGFRy
VEGFRzz
ppVEGFR

Is that what you intended? I can’t tell without seeing some data and expected results, but I suspect it isn’t.

Maybe you could post a few rows of samples of data and the details of what your flow variables contain, plus your SQL statement in text form so we can work out what you might expect to match the different regex and we can help further.

1 Like

Its the SQL Query I’m having trouble with.

The regex to exclude works if i take it without the not condition, i.e. it filters down to the 211k rows i want to exclude. But its incorporating it into a NOT condition where it isn’t working. Filtering the total dataset down from 700k rows to 40k. Whereas is should be around the 500k mark.

I’ve tried variations of the NOT statement including using (.*?) but it has not worked. As well as looking at using wildcard searches i.e. NOT $result_value_replace$ LIKE “%VEGFR%”.

But if i read the data out after searching for EGFR in SQL and use the rule engine with rules like "$result_value_replace$ MATCHES “(.?)VEGFR(.?)” etc. It filters the data correctly.

Also there are no flow variables used in these queries. the $ dictates column names. And the $$ in the regex means to match everything. I’ve tried the regex without $$ and using .* and .*? but has not filtered correctly.

For the column values:

Result_name_replace; can take the form eGFR or KDR (VEGFR2).
Result_comments_replace: Estimated GFR was calculated…

All of which the rule engine identifies correctly.

The only workaround i have found is using a case statement in SQL to label the exclusions and then using a DB row filter. But this isn’t the most ideal and just frustrating why the SQL wont work. The database i am connected to is Snowflake, and perhaps this is the reason why it wont work.

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=&cad=rja&uact=8&ved=2ahUKEwifj9SGzer_AhWNUEEAHf1rBHUQFnoECA0QAw&url=https%3A%2F%2Fcommunity.snowflake.com%2Fs%2Farticle%2FRegular-Expression-with-non-greedy-search-in-Snowflake&usg=AOvVaw0YWEcHnyxD9RKkAE4M5aPm&opi=89978449

Hi @gem4236, ah ok, my apologies I saw the $$ and immediately thought flow-variables but of course the syntax is slightly different.

So this is using Snowflake and I now understand that $$ is the same as a single quote,

So
regexp_like ("RESULT_NAME_REPLACE", $$.*EGFR.*$$, 'c')
is the same as
regexp_like("RESULT_NAME_REPLACE", '.*EFGR.*', 'c')

Just as an aside, given you’re using regex, is there any particular reason for using $$. Given a $ sign is special to regex, whilst it won’t change anything, it surely makes it less readable. Maybe that’s a snowflake thing?

Ok, given that the issue is with SQL , have you tried running the SQL directly in snowflake (e.g. in snowsql, or snow-sight. If it isn’t doing any variable replacement, then KNIME doesn’t do anything with the SQL so in effect this becomes a Snowflake SQL question, rather than a KNIME one, which is fine, but in situations like this I generally find it easier to test the SQL directly in the database and then port it to KNIME than to try to debug it in KNIME.

I take it that when you write the query directly in snowflake, that it still doesn’t work the way you want it?

A different approach

If I have a query

select * from TABLE_A 
where regexp_like(X,'.*wanted.*')

and I want to return all those rows except where Y contains ‘notwanted’,

i.e not include these rows:

select * from TABLE_A 
where regexp_like(Y,'.*notwanted.*')

then I have a couple of options:

First option is the one that should work for you but you say it isn’t. I cannot say why it isn’t working without seeing some example data that it is returning (and shouldn’t be)

select * from TABLE_A
where  regexp_like(X,'.*wanted.*') 
AND NOT (regexp_like(Y,'.*notwanted.*') )

an alternative option, without using NOT

select * from TABLE_A
where  regexp_like(X,'.*wanted.*') 
MINUS
select * from TABLE_A
where regexp_like(Y,'.*notwanted.*')

so maybe you can try the MINUS set operator instead?

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.