Extracting links from a text

armingrudd · November 25, 2018, 5:54pm

Hi everyone,

I have a text (actually part of an HTML document) containing several <a> tags with href attributes, and I need to extract the values of those attributes (the links).

For example:

<ul class="user-results"><li class="user-card"><div class="user-card__content mod-host"><a class="user-card__profile-link" href="/people/ali-irani-98">…

(and there are more <li> tags containing links)

I need the value of the href attributes.

How can I do that? Do I need to use a regex filter? If so, what pattern should I use?

Thanks,
Armin

mlauber71 · November 25, 2018, 10:47pm

My quick take on it: split the cells by blanks, transpose, and filter for the hrefs.
You would then have to clean up the remaining strings.

kn_example_html_separate_href.knwf (22.8 KB)
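
For readers working outside KNIME, the split-and-filter idea above can be sketched in a few lines of Python. This is only an illustration of the same logic, not the contents of the attached workflow: the function name `hrefs_by_splitting` is made up here, and the sketch assumes that attribute values are double-quoted and that the links contain no spaces.

```python
def hrefs_by_splitting(fragment: str) -> list[str]:
    """Split on whitespace, keep the href tokens, then clean them up.

    Assumptions (not from the thread): href values are wrapped in
    double quotes and contain no whitespace.
    """
    links = []
    for token in fragment.split():
        # After splitting, an href token looks like: href="/some/path">...
        if token.startswith('href="'):
            # Cleaning step: keep only the text between the first pair of quotes.
            links.append(token.split('"')[1])
    return links


sample = (
    '<ul class="user-results"><li class="user-card">'
    '<div class="user-card__content mod-host">'
    '<a class="user-card__profile-link" href="/people/ali-irani-98">'
)
print(hrefs_by_splitting(sample))  # ['/people/ali-irani-98']
```

The approach is fragile by design: a single-quoted attribute or a link containing a space would slip through, which is one reason a real HTML parser tends to be the safer choice.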

Geo · November 25, 2018, 10:49pm

XPath node, maybe?

armingrudd · November 27, 2018, 4:36am

Thanks Markus. That worked fine.
But I think there should be a better approach: with this solution, transposing takes too long when there are multiple texts to extract links from.

armingrudd · November 27, 2018, 4:38am

My first idea was to use XPath, but this text is only part of an HTML document and the node cannot read it. Do you have any idea how to convert it to the correct format so that the XPath node can read it?

qqilihq · November 27, 2018, 12:16pm

If you want to use the HtmlParser from Palladian, you can apply the following workaround: convert the input column that holds the string to a binary cell, use this as input for the HtmlParser, and then use the XPath nodes as usual.

(The simple reason the input to the HtmlParser needs to be binary is that strings are treated as file paths.)

– Philipp
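
The same parse-then-query idea can be tried outside KNIME. The sketch below is not the Palladian HtmlParser or the XPath node: it swaps in Python's standard-library `html.parser`, which tolerates incomplete HTML fragments, and collects what the XPath expression `//a/@href` would select. The names `HrefCollector` and `extract_hrefs` are invented for this illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href value of every <a> start tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(fragment: str) -> list[str]:
    """Return all <a href="..."> values found in an HTML fragment."""
    collector = HrefCollector()
    collector.feed(fragment)
    collector.close()
    return collector.hrefs


sample = (
    '<ul class="user-results"><li class="user-card">'
    '<div class="user-card__content mod-host">'
    '<a class="user-card__profile-link" href="/people/ali-irani-98">'
)
print(extract_hrefs(sample))  # ['/people/ali-irani-98']
```

Because the parser works on the tag structure rather than on whitespace, it does not depend on the quoting style or on how the attributes are spaced.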

armingrudd · November 27, 2018, 2:40pm

Thank you so much Philipp.
This solution made everything much faster and cleaner.

Geo · November 28, 2018, 9:43pm

@armingrudd, have you tried applying the String to XML node before using the XPath node?

armingrudd · November 29, 2018, 12:43am

Yes, I did. That didn’t work. But the solution suggested by @qqilihq solved the issue and converted the text to XML perfectly.
