Splitting cell by position

anjeesanjeet · January 20, 2023, 2:07pm

Hello, I have a dataset where I want to extract the data from a certain position but the data is mixed up, so I have to split it. I am attaching the data file here.
data Extractn.knwf (6.2 KB)

Wherever the row looks like : (F4-32-05 -) I want to remove the alphanumeric column and wherever, the data looks normal I want to keep it the same.

E.g. In the First row, the data looks like this: (Apple, F4-32-05 -), So, I want to remove till 05 -.

Wherever the data looks normal like Papaya, Greeny 4kg, I want to keep it as it is.

I am not sure, how I proceed. I have tried cell splitter and regex splitter too but didn’t get success. Can someone please help, it is necessary.

aworker · January 20, 2023, 2:34pm

Hi @anjeesanjeet

Thanks for the workflow with the data. I have the impression I’m missing something from your explanation. Your data is the following:

Apple, F4-32-05 -2kg Apple red in color
Mango, 32-4F-60 -5kg Mango
Oranges, yellow color 3kg
Papaya, Greeny 4kg
Grapes, 74-G4 -4kg

Could you please copy & paste in a post here how your data should look like for all the rows? The best is to do it using the “</>” from the post editor options. It would make easier for us to understand what you expect as result. Thanks in advance.

Best
Ael

anjeesanjeet · January 20, 2023, 4:38pm

Thank you @aworker for your quick response.

I want my result as:

Apple, F4-32-05 -2kg Apple red in color </> 2kg Apple red in color
Mango, 32-4F-60 -5kg Mango </> 5g Mango
Oranges, yellow color 3kg </> Oranges, Yellow color 3kg
Papaya, Greeny 4kg </> Papaya, Greeny 4kg
Grapes, 74-G4 -4kg </> 4kg

aworker · January 20, 2023, 4:46pm

Hi @anjeesanjeet

My pleasure to help.

Maybe is better to solve it using the following regex expression in a -String Manipulation- node:

regexReplace($Title$, "^.*\-", "")

The regex instruction replaces any sentence from the beginning until the last found “-” sign as seems to be the case in the kind of sentences that you need to remove.

Would this be a solution in your case ?

Hope it helps.

Best
Ael

anjeesanjeet · January 20, 2023, 5:01pm

Thank you @aworker. The solution you have given is quite closer to the solution. Actually, I have a different dataset that I can’t show publicly and the dataset I have given is just the demo one.

In a few of the rows, the solution you have removed everything and just left with the last word.

For eg: In the case of the following

Apple, F4-32-05 -2kg Apple red in color </> 2kg Apple red in color

I got the answer as color only.

anjeesanjeet · January 20, 2023, 5:15pm

@aworker Here is the original dataset. Kindly check for the reference

data Extractn.knwf (6.2 KB)

And I want the output as:

aworker · January 20, 2023, 5:48pm

As shown in my previous snapshot, the result I get with this regex solution on the sentence
Apple, F4-32-05 -2kg Apple red in color
is
2kg Apple red in color.

Maybe the regex you entered is not exactly the same as mine.

Besides this, the new workflow you have uploaded seems to be the same as the initial one with the same data. I cannot hence check the regex behaviour on the data you show in your last snapshot.

I’m uploading here the workflow I implemented to produce the results:

20230119 Pikairos Remove beginning of Sentence until - character using string manipulation regex.knwf (21.5 KB)

Hope it helps.

Best
Ael

anjeesanjeet · January 20, 2023, 5:51pm

Please check once the latest data I have given you. The file name is the same as before, but the data inside is different and the original one. The data file is in the message with the picture.

aworker · January 20, 2023, 5:53pm

I have double checked and the last workflow you have posted is exactly the same. Could you please make sure that it contains the new data and save it with a different name so that there is no confusion ? Thanks.

anjeesanjeet · January 20, 2023, 5:56pm

Sure, Here is the New dataset @aworker
Data.knwf (7.0 KB)

aworker · January 20, 2023, 6:03pm

Thanks @anjeesanjeet. Now I got it

When applied, this is what I get as result:

From the nature of your data example, I have the impression that the rules needed to solve your problem are a bit more complex that what it was explained initially.

Could you please show me what should be the expected result for all the rows of this last data?

Thanks
Ael

anjeesanjeet · January 20, 2023, 6:05pm

This is the resulting data @aworker

aworker · January 20, 2023, 6:11pm

Thanks @anjeesanjeet

I’m afraid, there is no easy rule for your problem because of content ambiguity. To put it simple, the rule you stated initially doesn’t hold for all the cases covered by your data and honestly, I believe that setting up a rule that fits all possible scenarios of your data is not straightforward. I wish I’m wrong but I’m concerned I may be not

Maybe others can help on this better than me and eventually bring a solution. Any volunteers in the forum ?

Best,
Ael

anjeesanjeet · January 20, 2023, 6:13pm

@aworker Thank you so much for your efforts. Hope so, someone come up with any solution.

aworker · January 20, 2023, 6:15pm

Thanks @anjeesanjeet, my pleasure. Fingers crossed

ArjenEX · January 20, 2023, 10:52pm

Hi @anjeesanjeet

To be honest, I have to echo @aworker here. The dataset and the structure of the text is all over the place meaning building a solid ruleset is extremely hard.

I did manage to get the expected output while using a Regexreplace function with: (^[A-Za-z \/]+[ ,])( [A-Z0-9- ]+)[ ][-][ ]\b[0-9-]{3,}\b|(^[A-Za-z \/]+[ ,])( [A-Z0-9- ]+)[ ][-][ ]|(^[A-Za-z \/]+[ ,])[A-Z0-9-\/]+[ ]|(^[A-Za-z \/]+[ ,])( [A-Z]+ [0-9A-Za-z]+)[ ][-][ ]

See results:

But again, the expected output is extremely subject to inconsistencies. To illustrate:

In row 2, 698-2700 should be omitted.
In row 5, 4.3-10 should be kept.
In row 13, 7-16 should be omitted.

The only way of capturing this is by saying that a substring of digits with a dash should be considered but excluded whenever it contains a period. I have to hardcode it’s length of 3 chars or longer by using:

\b[0-9-]{3,}\b

When omitting this, the Regex will also start to match sequences that it shouldn’t:

This is one example, but in a larger dataset with this kind of information, it’s quite likely that there will be other cases where the presence of a period should be omitted. As such, you would have keep stacking rule upon rule which is not very convenient.

Maybe a better Regexer than I am has a more efficient approach.

Mentioned Regex in KNIME:
parts data with regexreplace.knwf (15.2 KB)

To play around with this Regex: regex101: build, test, and debug regex

Hope this still helps in some way!

aworker · January 22, 2023, 4:23pm

Hi @anjeesanjeet,

As experimentally proved by @ArjenEX with quite sophisticated regex equations, the splitting you need to do cannot easily be done by just using a set of regex rules, or if it possible, it still needs more complex regex equations …

I have a few questions concerning your data and what eventually you need to achieve:

How big is your database, how many rows does your data set have ? ~100, ~1000, ~10,000, more?
I realize that the codes (i.e. “62-29-03 -” of “D2-83FE -”) that are used to split in your sentences, corresponds to a kind of code of products from a sort of warehouse. Do these codes appear uniquely once in your whole data set or might they appear more than once? If so, how often?
Do you need an exact solution (no errors at all) or would you accept a solution that produces a small % error that you could handle manually afterwards?

If the answers to previous questions are (1) more than thousands, (2) splitting codes only happen once and (3) a small % error is acceptable, then I guess a different approach could be tried to solve (split correctly) most of the rows your data set has. Please reach out if so to discuss further a possible alternative solution.

Best,

Ael

anjeesanjeet · January 23, 2023, 6:49am

@aworker As of now, I have a small dataset that I provided here but eventually, I have to work on a larger dataset later which is approx 27000, and errors there will create problems so the model should not have any errors.

aworker · January 23, 2023, 7:02am

Hi @anjeesanjeet

Are the product codes just before the splitting unique ? Eventually, do they appear only once in the dataset or could they appear more than once?

anjeesanjeet · January 23, 2023, 7:23am

@aworker They are unique values.