Extracting links from a text

Hi everyone,

I have a text (actually a fragment of an HTML page) containing several <a> tags, and I need to extract the values of their href attributes (the links).

For example: <ul class="user-results"><li class="user-card"><div class="user-card__content mod-host"><a class="user-card__profile-link" href="/people/ali-irani-98">… (and there are more <li> tags containing links)

I need the value of the href attributes.

How can I do that? Do I need to use a regex filter? If yes, what pattern should I use?

Thanks,
Armin

My quick take on it: split the cells by blanks, transpose, and filter the hrefs.
You would then have to clean up the remaining strings.

kn_example_html_separate_href.knwf (22.8 KB)
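
For anyone who wants to see the split-and-filter idea outside KNIME, here is a rough Python sketch. It is not the attached workflow, just the same steps (split on blanks, keep the href tokens, clean them up). The function name `extract_hrefs_by_split` is made up for illustration, and the sample is the fragment from the question up to the point where it is cut off. It assumes the href value contains no spaces and is written as `href=...` without blanks around the equals sign.

```python
# Fragment from the question, only up to the point where it is truncated.
sample = (
    '<ul class="user-results"><li class="user-card">'
    '<div class="user-card__content mod-host">'
    '<a class="user-card__profile-link" href="/people/ali-irani-98">'
)


def extract_hrefs_by_split(text):
    """Split on whitespace, keep tokens containing href=, then clean them."""
    links = []
    for token in text.split():
        if "href=" not in token:
            continue
        # Keep only what follows href=
        value = token.split("href=", 1)[1]
        # Cut at the closing '>' in case the tag ends inside this token
        value = value.split(">", 1)[0]
        # Remove the surrounding single or double quotes
        links.append(value.strip("\"'"))
    return links


print(extract_hrefs_by_split(sample))
```

This is quick to write but fragile, which is why the cleaning step is needed and why a real parser is usually the safer choice.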


xpath node maybe?

Thanks Markus. That worked fine.
But I think there should be a better approach, because in this solution transposing takes too long when there are multiple texts to extract links from.

My first idea was to use XPath, but this text is only part of an HTML page and the node cannot read it. Do you have any idea how to convert it to the correct format so that the XPath node can read it?

In case you want to use the HtmlParser from Palladian, you can apply the following workaround: convert the input column that holds the string to a binary cell, use this as input for the HtmlParser, and then use the XPath nodes as usual.

(The simple reason the input to the HtmlParser needs to be binary is that strings are treated as file paths.)
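
To illustrate the general idea of "parse the fragment leniently first, then pick out the hrefs", here is a minimal Python sketch. It uses Python's standard `html.parser` module as a stand-in; it is not the Palladian HtmlParser or the KNIME XPath node, and the names `HrefCollector` and `extract_hrefs` are hypothetical. The sample is the fragment from the question up to the point where it is cut off.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value is not None:
                self.hrefs.append(value)


def extract_hrefs(fragment):
    """Return all href values of <a> tags in an HTML fragment."""
    collector = HrefCollector()
    collector.feed(fragment)
    collector.close()
    return collector.hrefs


# Fragment from the question, only up to the point where it is truncated.
fragment = (
    '<ul class="user-results"><li class="user-card">'
    '<div class="user-card__content mod-host">'
    '<a class="user-card__profile-link" href="/people/ali-irani-98">'
)

print(extract_hrefs(fragment))  # ['/people/ali-irani-98']
```

A lenient HTML parser does not require the unclosed tags in the fragment to be balanced, so no manual cleanup is needed before the links are read.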

– Philipp


Thank you so much Philipp.
This solution made everything much faster and cleaner.

@armingrudd, have you tried applying the String to XML node before using the XPath node?

Yes, I did. That didn’t work. But the solution introduced by @qqilihq solved the issue and converted the text to XML perfectly.
