Using Regex for splitting a cell as a 'template'

patrickoliv · May 14, 2018, 12:51am

Hi, guys,
I am new to Knime and data mining, and I would like some help with splitting text using RegEx.

After scraping data from a website, I have several rows like this:
NameJohnAge23SexMaleNetWorth$700000
NameMaria AngelinaAge34SexFemaleNetWorth$800000

I would like to split each row into columns using a template:
Name | Age | Sex | NetWorth

How could I solve that?

Thanks in advance,
Patrick

morebento · May 14, 2018, 1:08am

Looks like fun content!

First of all, I would look at the way you scrape that data. Can you ensure it is done in a structured manner using the Palladian HttpRetriever and HtmlParser nodes, then the XPath node to parse the HTML? Use Google Chrome to find the XPath construct here.

Secondly, you can use the https://regex101.com/ website to interactively build a regular expression, which for your data could be something like “Name(.+)Age(.+)Sex(.+)NetWorth(.+)”

In Knime, feed your rows into a Regex Split node and put the expression above into it, and that will split the data into columns. Then all you need to do is rename and select the columns you want.
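
For anyone who wants to check the expression outside Knime first, here is a minimal Python sketch of the same idea. It only illustrates the pattern on the two sample rows from the question; the `split_row` helper is a hypothetical name, and using `fullmatch` assumes the whole cell has to match the pattern.

```python
import re

# The expression suggested above: one capture group per field.
PATTERN = re.compile(r"Name(.+)Age(.+)Sex(.+)NetWorth(.+)")

# The two sample rows from the original question.
rows = [
    "NameJohnAge23SexMaleNetWorth$700000",
    "NameMaria AngelinaAge34SexFemaleNetWorth$800000",
]


def split_row(row):
    """Return (name, age, sex, net_worth), or None if the row does not match.

    Hypothetical helper: fullmatch requires the pattern to cover the whole row.
    """
    match = PATTERN.fullmatch(row)
    return match.groups() if match else None


for row in rows:
    print(split_row(row))
```

Each tuple corresponds to the Name | Age | Sex | NetWorth template, so the groups map directly onto the columns to rename afterwards.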

patrickoliv · June 18, 2018, 11:43am

Hi, morebento,

Thanks for your answer! I tried it on regex101 and it’s definitely what I was looking for.

However, I still have problems in Knime when splitting the data: “Regex Split 0:39 293 input string(s) did not match the pattern or contained more groups than expected”

I am afraid that the data may contain hidden or special characters, but trying “N(.+)e” (that would theoretically return “am”) also did not work.

Would you know what may be the problem?

Thanks and best,
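
The thread ends without a confirmed answer, so the following is only a sketch of two possible causes, not a diagnosis. It assumes the Regex Split node needs the pattern to match the entire cell (as Java's `Matcher.matches()` does) and that the scraped cells may end in a line break, which `.` does not match by default. Python's `re.fullmatch` stands in for that behaviour here.

```python
import re

row = "NameJohnAge23SexMaleNetWorth$700000"

# The short test pattern from the post above.
test = re.compile(r"N(.+)e")

# Searching anywhere in the row succeeds, but (.+) is greedy and runs to the
# last "e" in the row, so the group is much longer than "am".
greedy_capture = test.search(row).group(1)

# A lazy group stops at the first "e" and gives the expected "am".
lazy_capture = re.search(r"N(.+?)e", row).group(1)

# Whole-string matching fails because the row ends in "0", not "e".
whole = test.fullmatch(row)

# Assumed hidden character: a trailing line break. "." does not match "\n",
# so the full template fails until the cell is stripped.
full = re.compile(r"Name(.+)Age(.+)Sex(.+)NetWorth(.+)")
with_newline = full.fullmatch(row + "\n")
after_strip = full.fullmatch((row + "\n").strip())

print(greedy_capture, lazy_capture, whole, with_newline, after_strip.groups())
```

If either assumption holds for the real data, trimming whitespace from the column before the split, or testing with a pattern that covers the whole cell such as “N(.+?)e.*”, would be worth trying.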
