Extract number from a String (by using Regex, String replacer or cell splitter)

Hi,

I have a pretty easy problem, I just dont know how to right properly the Regex.

My column is the following:

 

I want to extract only the trial (x), actually only the numbers that are in trial in a separate column. that means...

?

?

18

2

5

11

 

I tried regex split or string manipulation (remove chars), or string replacer or cell splitter, but I dont get it right.

Would you please help!

 

Hi,

Often it's easier not to perform the "Unique concatenate with count" in the first place, but to GroupBy instead (by hal_type in this case), and to count some other field. Then filter where hal_type = trial and you're there.

Regex only if the source data give you no other option!

Cheers
E

Hi Ergonomist. thank you for replying. but it is not the right solution.

Maybe, Ididnt explain it right.

I want to average the trials that are included only in ownership. so the new column should be like:

?

?

18

2

5

11

 

and then average 18, 2, 5, 11.  Not any other of the standalone trials.

Thats why I think the method you propose is not working. 

I am 100% sure this can happen with the Regex node.

I would be great someone to tell me how to split it as it look below:

Split1                Split2       Split3       Split4

Ownership        128              trial              18

Hi,

Indeed I missed out on the combinatory condition, sorry! I would use the Pivot node then, with Group column(s) identical to the GroupBy node, hal_type as Pivot column and Counts on another avaialble field as per my original proposal. You can filter the result for non-missing ownership, drop all columns but the trial one, and join the final result back into the original table or GroupBy result if so required.

Or maybe the pivot format is already convenient - it's not quite what you define as "great format" above yet, but you can get it there with some additional KNIME magic. :-)

Cheers
E

Many thanks Ergonomist.

I did what you mentioned with the pivot node, it works partially, I can get the ownership and trial pivoted by the purchase. Although, I am getting the message "skipped groups: "ownership, trial" because "group contains too many unique values

Therefore, I think the Regex would work ideally for me in this case.

If anybody knows how to write the Regex in order to do the following split please let me know.

Split1                Split2       Split3       Split4

Ownership        128              trial              18

 

The difficulty in this problem is that ownership or trial are not having standard position. Sometimes one is first, sometimes the other second etc. see below:

 

I wrote on Regex this: .*?(\d{1,}).* and it gives me only the number that comes first. How can I standardise it give me only the number that is in brackets of ownership?

SOLVED

I manage to solve it in an extreme long way. So, what I did is:

1) two "cell split" nodes, enter a delimiter: ownership, enter a delimiter: trial

2) I ended up with 4 columns as seen below.

 

the 3nd column is all my ownership ...so need to be cleaned. I took a "string manipulation" and run  remove chars for ) , (

my 1st and 4th column is the trial separated. I run string manipulation and removechars ) , ( trial

3) run a column combiner node for 1st and 4th column

4) removechars ","),"?") egain for the new combined column

5)Transformed to integer with string manipulation toInt($my new column$)

and Voila! it is solved.

If someone has a faster easier way, with Regex please let me know.

 

Good job! :)

If you were ever to further pursue the "Pivot" approach, you could define the "maximum unique values per group" at the bottom of the pivot node config screen. No promises about what's more efficient and/or elegant at the end of the day though, that may well depend on tastes, datasets and evolving needs over time.

-E

All,

 

I ran into a similar problem with the Groupby node using "Unique concatinate with count". It appears that the concatinate part generates a list in an order according to the first encountered element. In this particular example "trail" and "ownership" must have appeared in different orders for the unique groups. To fix this, I put in a "Sorter" node. Now the data come always in the same sequence [ie: trial(100), ownership(50)]. If you follow this with the column splitter using "ownership(" as the delimiter, then you get two columns:  [trail(100),] and [50)]. Follow this up with a few string manipulation nodes to remove the unwanted text and you're done.