Regex Extractor - regex splits at a different size than expected

Hello,
I have found this topic and followed the regex presented there: Regex split to split the string without breaking the word

I used it in the Regex Extractor node: (.{1,60}[^\s]*)?\s? to split a string at the 60-character mark. However, it splits at 66 or 68 characters, depending on the length of the word, instead of moving the last word to the next group.

Is a change to the regex required to enforce a split at a maximum of 60 characters?

If possible, always provide a sample file so that others can help you out.

I need to split a string (e.g. Bay PT C.P.R. Station Grounds being PT of locations 8 3C PIC SRO PT 4 & 5 55R2614 Except PT 1 55R3261) into multiple parts made of whole words: the first is max 40 characters, the next is max 40 characters, and so on.

Hi @IrynaK I’m not entirely certain that this can be done with pure regex. Your requirement is subtly different to the post you referenced.

What the regex pattern you used does is match up to the first 60 characters and then also capture everything after that up to, but not including, the next whitespace. This ensures it doesn't break mid-word. That's why you see the results you are getting: the match isn't restricted to 60 characters, it returns 60 or so…
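To make the overshoot concrete, here is a minimal Python sketch. It assumes Python's `re` module treats this pattern the same way as the engine behind the Regex Extractor node, and it uses a limit of 10 with a made-up sample string so the effect is easy to see. `overshooting_chunks` is a hypothetical helper name, not something from the node.

```python
import re

def overshooting_chunks(text, limit):
    # Same shape as the original pattern: up to `limit` characters,
    # then any further non-whitespace characters to finish the current word.
    pattern = re.compile(r"(.{1,%d}[^\s]*)?\s?" % limit)
    # The pattern can also match the empty string, so drop empty results.
    return [m.group(1) for m in pattern.finditer(text) if m.group(1)]

chunks = overshooting_chunks("aaaa bbbb cccccccc dddd", 10)
print(chunks)                    # ['aaaa bbbb cccccccc', 'dddd']
print([len(c) for c in chunks])  # [18, 4]
```

The first 10 characters end just before `cccccccc`, and `[^\s]*` then swallows that whole word, so the first piece is 18 characters long instead of 10.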

Does this absolutely have to be a regex solution? If it does, then maybe an alternative is to guess the longest word you are likely to encounter, subtract its length from 60, and use that number in your regex in place of the 60. So it would give you, say, “50 or so” instead…

Although in your latest comment you are talking about 40, so I’m now slightly confused about the requirement, but hopefully you can see what I’m saying.

Not saying it definitely can’t be done with regex but I can’t think of a way at the moment.

Could you split by space and then divide into chunks of 40?
(By sample we mean an actual file :))
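The split-by-space idea can be sketched outside of regex entirely. The following is a minimal Python illustration, not a KNIME node configuration; `chunk_words` is a hypothetical helper, and it assumes words are separated by plain whitespace and that a single word longer than the limit should be kept whole rather than cut.

```python
def chunk_words(text, limit):
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit or not current:
            # The word fits, or it starts a new chunk
            # (a single word longer than `limit` is kept whole).
            current = candidate
        else:
            # The word does not fit: close the chunk and start a new one.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

text = (
    "Bay PT C.P.R. Station Grounds being PT of locations 8 3C PIC SRO "
    "PT 4 & 5 55R2614 Except PT 1 55R3261"
)
for piece in chunk_words(text, 40):
    print(len(piece), piece)
```

For the sample string this gives three pieces of 38, 34 and 27 characters. Python's standard `textwrap.wrap(text, width=40)` does a similar greedy wrap, although by default it also breaks overlong words.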

Hi @IrynaK,

As @takbb already pointed out, you are using a regular expression that splits input on the first whitespace after 40 characters. This leads to string lengths of >= 40.

Just use this expression instead:

(.{1,40})

I also attached an example workflow to my NodePit Space:

This already works out of the box. The example string split by 40 characters looks as follows… Just configure the Regex Extractor to split matches in rows or columns as you prefer.

Best regards,
Daniel
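One thing worth checking with a fixed-width pattern like `(.{1,40})` is where the cuts land. The Python sketch below assumes `re` behaves like the node's engine for this pattern and uses the sample string posted earlier in the thread.

```python
import re

text = (
    "Bay PT C.P.R. Station Grounds being PT of locations 8 3C PIC SRO "
    "PT 4 & 5 55R2614 Except PT 1 55R3261"
)

# Every match is simply the next run of up to 40 characters.
fixed = re.findall(r"(.{1,40})", text)
for piece in fixed:
    print(len(piece), repr(piece))
```

The pieces come out as 40, 40 and 21 characters, so the length limit holds, but the first cut falls inside the word "of" and the second inside "55R2614". If whole words matter, this pattern alone is probably not enough.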

OK, maybe I take back what I said about regex :wink:

(.{1,40})\s.*

Would collect up to 40 characters without breaking on the middle of a word… I think.

You can try that actual expression out as it’s saved on the following link…

Thank you very much for your help guys! (.{1,40}) seems to be working, but I will be testing the whole file later.

This forum is awesome and I cannot thank you enough :slight_smile:

So I ended up using (.{1,40})[\s.]*\s
It seems to split the string into pieces of 40 characters or fewer with whole words intact; however, I am unable to capture the last part of the string with this regex. Can you help?
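The missing tail can be reproduced with a short Python sketch, again assuming `re` matches this pattern the same way as the Regex Extractor. The pattern requires a whitespace character after every group, and the last piece of the string has none.

```python
import re

text = (
    "Bay PT C.P.R. Station Grounds being PT of locations 8 3C PIC SRO "
    "PT 4 & 5 55R2614 Except PT 1 55R3261"
)

pattern = re.compile(r"(.{1,40})[\s.]*\s")
matches = list(pattern.finditer(text))
groups = [m.group(1) for m in matches]
# Whatever follows the last match is never captured.
leftover = text[matches[-1].end():]

print(groups)
print(repr(leftover))
```

Here the third group stops at "55R2614 Except PT 1", because that is the longest run still followed by a space, and "55R3261" is left over.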

What happens if you put brackets around the second part, so you’d capture both? You might need to trim whitespace afterwards, depending on your use case.

e.g.
(.{1,40})([\s.]*\s)
Or
(.{1,40})([\s.]*)\s
Or maybe…
(.{1,40})\s*([\s.]*)

Unfortunately those did not work.

This formula works if I add a space at the end of the string: (.{0,40})[\s]
So I am wondering how I can say that the last group should be captured up to the end of the string rather than up to a whitespace?
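One possible way to express "whitespace or end of string" is an alternation with the `$` anchor. The sketch below is in Python and assumes the node's regex engine supports `(?:...)` and `$` in the same way; the pattern `(.{1,40})(?:\s+|$)` is a suggestion for illustration, not something tested in the Regex Extractor itself.

```python
import re

text = (
    "Bay PT C.P.R. Station Grounds being PT of locations 8 3C PIC SRO "
    "PT 4 & 5 55R2614 Except PT 1 55R3261"
)

# Workaround from the post above: append a space so the last group
# is also followed by whitespace.
with_space = re.findall(r"(.{0,40})[\s]", text + " ")

# Alternative: accept either whitespace or the end of the string.
with_anchor = re.findall(r"(.{1,40})(?:\s+|$)", text)

print(with_space)
print(with_anchor)
```

Both variants return the same three pieces for the sample string, including the final "55R2614 Except PT 1 55R3261". Neither handles a single word longer than 40 characters gracefully, so that case would need separate treatment.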

Hi.
Building on the regexes above, how about this one: (.{1,40}\b)|(.{1,40}\b$)

As @takbb states, you may have to trim white space depending on your needs.
There’s probably a way to tweak the regex to exclude the whitespace.

Hope this helps.
-Don
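For anyone wanting to see what the `\b` version returns before trimming, here is a small Python sketch. It assumes Python's `\b` matches the node's engine, which should hold for plain ASCII text like the sample.

```python
import re

text = (
    "Bay PT C.P.R. Station Grounds being PT of locations 8 3C PIC SRO "
    "PT 4 & 5 55R2614 Except PT 1 55R3261"
)

pattern = re.compile(r"(.{1,40}\b)|(.{1,40}\b$)")
raw = [m.group(0) for m in pattern.finditer(text)]
trimmed = [piece.strip() for piece in raw]
print(raw)      # the first two pieces keep a trailing space
print(trimmed)

# Caveat: \b also sits between punctuation and a letter, so an
# abbreviation can be cut. Shown with a limit of 10 for brevity.
abbreviation = re.findall(r".{1,10}\b", "Station C.P.R. Grounds")
print(abbreviation)  # ['Station C.', 'P.R. ', 'Grounds']
```

The greedy match backs up to the last word boundary, which is usually the position just after a space, hence the trailing whitespace. In this sketch the first alternative already matches the final piece, so the `$` branch is never needed. The abbreviation example shows that a limit falling inside something like "C.P.R." may still split it.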

Thank you, it seems to work!
