How to extract the URL from a string filled with nonsense

andrems · July 12, 2023, 7:10pm

Hello! I’m new to Knime and have read a lot of topics here, most of them really helpful, which allowed me to create my very own functional workflow! So thanks in advance.

My current issue is: I have successfully extracted the data from a sharepoint list and created an excel file with the results, however, one of the columns contains an URL inside a lot of nonsense characters which appear to be coming from the Sharepoint list (the URL is correct in the list, i think the problem appears when it is imported in Knime).

The URL appears like this:

{“Description”:“https://www.example.com”,“Url”:“https://www.example.com”}

Any idea how to get rid of all the elements that don’t belong and have only the URL link inside the quotation marks ?

I have tried the Manipulation String but I have little to zero experience with Regex

Thanks a Lot!

iCFO · July 12, 2023, 7:22pm

If you want to keep it simple, then you can run the below 2 basic regex replace formulas in sequential column expression nodes. You have to escape the special characters, which makes it a bit less straightforward.

regexReplace(column(“column1”), “.*,\\“Url\\”\:\\“”, “”)

regexReplace(column(“column1”), “\\”\}”, “”)

andrems · July 12, 2023, 7:58pm

Thanks a lot, I tried the method above, however it gave me the following Error:

Any idea of what might have happened?

iCFO · July 12, 2023, 8:07pm

The 1st part of the formula that references the column doesn’t follow the correct syntax. Try deleting that part and inserting the column reference via the column selection at the top left of the formula area. The formulas I added above are also meant for the Column Expression node in case you were using them elsewhere.

If you are using it in String nodes, then I believe that you would need to remove 1 of each of the escape characters. So it would appear as
2 \ characters instead of 3 \ characters before the special characters.

ArjenEX · July 12, 2023, 8:15pm

For the Column Expression, you can also nest the functions in one:

regexReplace(regexReplace(column("column1"), ".*,\\“Url\\”:\\“", ""),"\\”\\}", "")

Some alternatives:
Use a cell splitter and do a cleanup thereafter with a String Manipulation

Another alternative: after the cell splitter add the Regex Extractor node, select the predefined template for URL’s and click Use.

andrems · July 14, 2023, 4:29pm

Thanks for the help!

andrems · July 14, 2023, 4:29pm

Thanks a lot, it worked!

system · July 21, 2023, 4:30pm

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.