Extract number sequence

I have transformed the data of a JSON in a column. Now in each cell is a big amount of numbers and letters. It looks like this:

{“rel”: “self”, “href”: “https://.net/api/report/eExecution/12345678”}]{ “rel”: “self”, “href”: “https://oup.net/api/ret/cution/12345677”}]

I need to filter the number sequences of 8 digits like in this example 12345678 and 12345677 in a new column but sometimes there are also 4 number sequences like this.

How can this be done?

You can do this with a regular expression. Here is a prototype using Regex Extractor node from Palladian extension:

Prototype
extractSequences.knwf (80.8 KB)

Overview:

Comments:

  • Regex is:
\b\d{8}\b
  • if used in e.g. Expressions node you need to escape the "":
\\b\\d{8}\\b
  • Regex Extractor node is good in that it easily outputs all matches (could not quite get this to work with Expressions node using regex_extract function - maybe someone else can have a go at that to avoid using Palladian)

  • I group by sample column and then aggregate using concatenate function - you can also change that function to set, list etc. or just keep the “Full Result” in its own row.

5 Likes

Thank you for your reply. I am working with knime 5.2.5 i dont find the “Regex Extractor” Node. My workflow looks like this if this might help to understand better. So the data was in an JSON and I transformed it with the JSON path node in a new column as a string.

1 Like

You can get it from here:

If you’re using v5.2 make sure to chose the proper KNIME version in the drop down menu at the top of the page. The download site URL for KNIME 5.2 is:

https://download.nodepit.com/5.2

-Philipp

2 Likes

Here’s the link to the installation guide as well:

I dodged it for a while, but never regretted installing it :-).

As long as your column is of type string it should work - you’d only have to point the Regex Extractor node to the correct columns (in my example it is the column “sample”).

1 Like

Hello @simonmng

It seems to me that the structure of a JSON file is not standard, the square brackets correlation is missed; besides that, there’s a comma missed in the group separation… the structure should be like this (being aware of quotation type):

[{"rel": "self", "href": "https://.net/api/report/eExecution/12345678"}, { "rel": "self", "href": "https://oup.net/api/ret/cution/12345677"}]

Take a look in this post:

Then a standard JSON extract workflow can be applied. You should check your data gathering process about why this is happening…


20240903_extract number sequence_v00.knwf (10.4 KB)

BR

PS.- Then you can use String Manipulation vs your latest column:

regexReplace($hrefs$, ".*?(\\d+)", "$1")
4 Likes

Works fine! Thank you very much :slight_smile:

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.