Remove Part of Column Header

Hi,

Every month i would have different column header like JunApple, JulyApple, AugApple, DecemberApple, etc…
I know that we could use column rename (regex) but I am unfamiliar with regex

Could anyone help to provide a regex formula to retain “Apple” and remove any other characters for column header ?

Hello @mmngeoh
Can you develop further your challenge? I mean…
How your data looks like? Do you have many columns like that? if so, How do you expect to differentiate them, as standard Apple, Apple (#1), Apple (#2)… ?
Otherwise, do you expect just one $Apple$ column in your table?

Besides that, what about upper or lower case occurrences? Is it always a literal match “Apple”?
Why regex as there are some other options available?

BR

PS.- As a general rule ‘Column Rename (Regex)’ node can do it, with the following code:

Search String (regexp):
(.*?(Apple)[^$]*)$
Replacement:
Apple

5 Likes

@mmngeoh

Regex is not the most user friendly tool in the bag, but learning it does help in many cases where you need to modify strings.

What you might find useful are the websites (such as Regex101) that allow you to enter some text and a regex expression to see what the impact will be.

If you enter you text in the second Test String box and gonhaddock’s search string in the first Regular Expression box you can see how it works.

DiaAzul
LinkedIn | Medium | GitHub

1 Like

@gonhaddock,

hi, the abovementioned is just an example. To simplify it, there is one column called XXXApple, the XXX is in month format, like JunApple, JulApple, AugApple, etc. it would change monthly.

1 Like

Hi @mmngeoh , one of the problems here is that without real-world examples of your data, it is difficult to know what exact solution to give to your problem especially as regex is quite a precise tool. This is why in part @gonhaddock was asking “Besides that, what about upper or lower case occurrences? Is it always a literal match “Apple”?”.

You originally posed the question
“Could anyone help to provide a regex formula to retain “Apple” and remove any other characters for column header ?”

and that had me wondering whether you really wanted regex to keep “Apple” (which is what @gonhaddock supplied) or are you really asking for some other part of the string to be removed (and you actually don’t care if it’s an Apple or an Orange). So your question has now subtly changed and you really want to remove the prefix (month), which makes things a little clearer.

But this leads to a further question, because in your very first post you gave the example:
JunApple, JulyApple, AugApple, DecemberApple, etc…

But this means you sometimes have 3 letter months (Jun, Aug) but interchange these with whole names (July, December). Is that really the case? Without being specific in requirements, it is difficult to be specific in a solution. In your more recent post you do imply it is a three letter-prefix, but I will try to be slightly flexible :wink:

Anyway, I will assume that your requirement is that you have month names as prefixs to some value (for example “Apple”). Your month names are either 3 letters or whole names, and that they use Title case, e.g. Jan, Feb, November, December

In this situation, the following regex applied to a Column Rename (regex) node will return you everything that comes after the month:

JunApple, JulyApple, AugApple, DecemberApple, AprOrange etc…

(Although, lol, I hope you don’t have a column named Apricot, as you will just get back icot :wink: )

Search String:
January|February|March|April|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec(.*)

Replacement: $1

image

Note that it was important that I listed the whole names first, and the 3-letter variants later in the regex. Otherwise it will match a 3-letter and leave the rest of the month in the returned value.

e.g MarchApple would be returned as chApple

Also, if more than one column becomes “Apple” as a result, they will automatically be given a numeric # suffix:


becomes:

4 Likes

Thanks @takbb , we always learn something else from you.

All my questions are because when there is a generic question asking for regex code, followed by an “I’m not familiar with”; it is just when the backs and forwards start.

I posted a code that matches as group $1, literal “Apple” surrounded by any characters. Depending the data context, you want or you don’t want to match case as part of the validation. Afterwards if the condition is returning a TRUE valid $1, then replace with string sequence “Apple”

In fact I used the node as a ‘RegexMatch, Then’ function.

I like regex because I like problem solving… but regex code is like any other coding. You can develop a use case in a more or less efficient approach, but any unexpected casuistic or additional requirement will return you a bug. So when you ask for a regex code, better provide all the relevant information aiming to get a good piece of software.

When I see now @takbb 's provided code more questions come into my mind, like… It is Always the month in preceding position? We always get lots of fun and literature with these type of challenge.

3 Likes

If it’s always apple (for whatever reason) I would just replace it with that

@takbb Thanks for the Dollar-One reminder. Haven’t used it for quite a some time which means forgetting it exists
br

Hello @Daniel_Weikert
Doing it in this way, you will replace all the columns of the table with the same string sequence. KNIME will name them as array (Apple, Apple (#1), Apple (#2)…)

BR

@gonhaddock
Yes,
and I thought that is what he wanted?
br

My gut is that he is is processing monthly reports or exports that have variations on just one column name per Regex match. (At which point duplicate handling shouldn’t be an issue) If multiple column matches exist for the Regex formula, then a simple Regex rename approach wouldn’t make sense.

Another approach is to just read in all of the month columns as they exist in the original files “JanApple, FebApple, etc”, Then simply Sum all of the columns by Regex match in the GroupBy node into a new “Apple” column. This way you could apply it to a full table instead of having to do it monthly or process it in a loop.

Assuming that I am interpreting your underlying challenge / goal correctly that is…

If you need to retain the month information from those column headers and store them at the row level, then you are better off doing a transform to get that info down into the table 1st, and then proceed with the regex column rename or groupby sum approach.