Delete last two characters of a string in column

8bastian8 · September 5, 2021, 2:20pm

Hello knimers,

how can I delete the last two characters within a specific cell and take these two characters in a extra column. Should I use a string manipulation node?

Here the example:

publication_number
US2014377478A1
US9997800B2
WO2007036361A1
WO2004066414A3

INTO
publication_number
US2014377478
US9997800
WO2007036361
WO2004066414

next column:
A1
B2
A1
A3

Thank you !

BR
Bastian

aworker · September 5, 2021, 5:00pm

Hi @8bastian8

This question can be answered in many ways, for instance using a regular expression but here a will use a -String Manipulation- node to show a different possible solution purely KNIME coded.

Basically the problem is to know how to split starting by the end of a sentence instead of its beginning when the information to extract is at the end of the sentence. I take this opportunity to show how to answer this question using the string function “reverse()” which is not very often used in the forum examples. In this case, the solution using the string function reverse() in the -String Manipulation- node can be the following:

20210904 Pikairos Delete last two characters of a string in column.knwf (21.2 KB)

Essentially, in the first -String Manipulation- node, I’m reversing the string to extract from the end the two desired characters, instead of doing it from the beginning. This is done without needing to know the string length at all. Sometimes, this function can be very handy when needing to manipulate strings of variable unknown size starting from the end and not from the beginning.

Hope it helps.

Best

Ael

takbb · September 5, 2021, 6:02pm

Hi @8bastian8

As @aworker said, there could be a number of ways to do this, and as he mentions, a regular expression can be used. For example, you could use the “Regex Split” node, with the following regex pattern:

(.*)(.{2})

This would capture two groups (denoted by the brackets) and so create the two columns. The second column would capture the final two characters. The first column would comprise all of the other characters. You would then need to rename your columns as appropriate using Column Rename and maybe remove your original column with a Column Filter node, so ultimately may require more nodes than the String Manipulation.

The solution already provided for the String Manipulation node is perfectly valid, and is a nice example of using Reverse() for such a manipulation.

You can also use the length function to return the required parts of the substring:

e.g.
substr($publication_number$,length($publication_number$)-2)
and
substr($publication_number$,0,length($publication_number$)-2)

and obviously use whichever approach works for you.

bruno29a · September 5, 2021, 11:33pm

Hi @8bastian8 , what if some data had empty spaces at the end? Do they count as characters?

For example, if you had these (I enclosed them in quotes to show the empty spaces):
"US2014377478A1 " (2 spaces)
"US9997800B2 " (1 space)

Should that be splitted into:
“US2014377478A1” and " "
“US9997800B” and "2 "

Or should the spaces not count and give you the following?
“US2014377478” and “A1”
“US9997800” and “B2”

If you don’t want the spaces to count as a character, then for whichever of the above proposed solution you go for, you should also apply a strip() on the data to remove spaces at the beginning and at the end before applying any of the above solutions.

Daniel_Weikert · September 6, 2021, 1:43pm

You should be able to do it this way with string manipulation as well

substr($YOURCOLUMN$,length(strip($YOURCOLUMN$))-2)

br

8bastian8 · September 7, 2021, 5:49am

Many thanks for your professional help!

Now I have seen the problem, that not in every case the two last characters should be deleted:

When the publication number is: AT171013T

In this case the end is starting with a letter not a number: Can I implement a rule, to delete two characters when the end is starting with a number?
->AR080317A1
->delete: A1
And only one character when the end is starting with a normal letter?
->AT171013T
->delete: T

Many thanks!

Your solutions are very helpful so far!

BR Bastian

8bastian8 · September 7, 2021, 6:04am

And as well the approach that the spaces like @bruno29a said, should not be counted as characters?

8bastian8 · September 7, 2021, 6:54am

I tried this so far:
Delete last two characters_last character of string in column.knwf (7.6 KB)

8bastian8 · September 7, 2021, 7:24am

Last but not least I have found a solution.
I used Regex Split because it can easily work with patent numbers:

→ Parsing Patent Numbers

Patent identifiers such as “US5443036-X21” consisting of a (at most) two letter country code (“US”), a patent number (“5443036”) and possibly some application code (“X21”), which is separated by a dash or a space character, can be grouped by the expression ([A-Za-z]{1,2})([0-9])[ -](.*$) . Each of the parenthesized terms corresponds to the aforementioned properties.

Best regards,
Bastian

takbb · September 7, 2021, 7:34am

Hi @8bastian8 ,

Thank you for uploading your sample workflow. It always makes it much easier to assist when we can see what has been attempted so far.

You were on the right lines. The “simple” way to check if it ends in a digit is to use the regexMatcher function rather than lastIndexOfChar. Unfortunately lastIndexOfChar doesn’t work with regex patterns and so you can only use it for specific characters rather than character ranges. Alternatively, you could have used the indexOfChars and inspected to see if the last character were present in a string.

As your original question was looking to see if the last character was a digit, I modified your code also to see if the last character was a digit, rather than a letter which means I have effectively reversed your logic (and so you will see that the operations responding to true and false are the other way round to the way you had them)

Borrowing from @aworker’s idea, on using reverse, which as he demonstrated makes it easy to retrieve the last character, we could test this for example:

if(indexOfChars(substr(reverse(strip(column("publication_number"))),0,1), "0123456789") > -1)

My preference here though would be regexMatcher as the resultant code is less complex:

if(regexMatcher(strip(column("publication_number")), ".*[0-9]"))
{
    substr(strip(column("publication_number")),length(strip(column("publication_number")))-2)
}
else
{
   substr(strip(column("publication_number")),length(strip(column("publication_number")))-1)
}

The regexMatcher in the Column Expression node returns a boolean true or false and so can be happily used as the condition.

In the attached workflow, I also added a further column expression in the same node to perform the “removal” from the original string.

Another way to achieve this, is to use my “favourite” relatively undocumented feature of String Manipulation, which is the “ternary operator”. This allows for some conditions to be included in the script.

To get the last two characters, from the “stripped” string, only if the final non-space character is a digit, you can use the following script:

regexMatcher(strip($publication_number$),".*[0-9]").equals("True")
?substr(strip($publication_number$),length(strip($publication_number$))-2)
:substr(strip($publication_number$),length(strip($publication_number$))-1)

and to return the body of the string with the final two characters removed, under the same conditions, use this:

regexMatcher(strip($publication_number$),".*[0-9]").equals("True")
?substr(strip($publication_number$),0,length(strip($publication_number$))-2)
:substr(strip($publication_number$),0,length(strip($publication_number$))-1)

Note that I also wrapped the “strip” function around all references to the column, so that the resultant data is stripped of spaces. This would have the effect of stripping spaces from the front of the column too. If you only want to strip spaces from the end, then replace this with the stripEnd function.

What both of these do is check that the string matches the regex pattern “ends in a digit”. If the “regexMatcher” function returns the string "True", the result of the operation performed after the ? is returned.

Otherwise, the operation after the : is performed and the result returned

Note also that regexMatcher functions differs in the String Manipulation node as it returns a String "True" or "False" rather than a boolean, so the way of checking the result of the function differs from the way it appears in the Column Expression node.

I have attached a workflow demonstrating all of the above methods.

Delete last two characters_last character of string in column v2.knwf (12.7 KB)

ipazin · September 7, 2021, 9:32am

A post was split to a new topic: Calculate yield loss

8bastian8 · September 7, 2021, 7:47am

Many thanks for your help @takbb and your grateful time

The KNIME Forum is a wonderful place, where you find some help for every case! Thank you KNIMERS!

takbb · September 7, 2021, 7:52am

hi @manojmadhu , welcome to the community.

Please can I ask you to re-post this as a new question on the forum, as currently it is a reply to an existing question, which possibly won’t be seen, and will also add confusion. That way others will more likely see it and be able to assist.

Also, when you ask, it would be best to add some example data and state what you would expect the resultant outputs to be, for a simple case, so that people can better understand your requirement. many thanks

takbb · September 7, 2021, 7:56am

You’re welcome @8bastian8. It’s nice to have easily-understood requirements, and also great when it’s a team effort, with a variety of useful hints, ideas and approaches.

manojmadhu · September 7, 2021, 8:50am

ok
were to be post this …

ipazin · September 7, 2021, 9:43am

Hello there!

@manojmadhu I moved your question to another topic. To open new topic there is + Open Topic option on forum home page in right upper corner.

@8bastian8 sems following regex works as well in Regex Split node:
(.*)([A-Z].?)$

Br,
Ivan

system · September 14, 2021, 9:44am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.