Extracting multiple URLs from cells

iagovar · December 12, 2019, 12:57pm

So I’m working with a database dump from a forum.

I have a column that contains the forum message content. Most messages contain more than one URL, because people post images, links to other sites, and so on.

My intention is to extract all those URLs and put them in a new table.

I’m not sure how to do this. The Palladian URL extractor node only extracts one of them, and, as I said, most forum comments contain more than one URL.

Any ideas?

qqilihq · December 12, 2019, 1:05pm

The Palladian node works fine for me. For each URL, it’ll create a new row in the output table:

If it doesn’t work, please post a sample workflow.

iagovar · December 12, 2019, 1:20pm

But there’s more than one URL in each cell, and that’s the problem. Yes, it extracts more than one URL across the whole column, but not more than one per cell. That’s why I want a new table: the number of rows would be different.

qqilihq · December 12, 2019, 1:21pm

You could avoid the changed row count, e.g. with a chunk loop: process the input one row at a time, then group the URLs extracted from each row back into a single collection cell. This way the row count remains the same.
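
Outside of KNIME, the same idea (handle each row on its own, then keep that row's URLs together in one collection) could be sketched roughly as follows. This is only an illustration in plain Python: `URL_PATTERN` is a deliberately simple, assumed regex and not the pattern Palladian uses, and `urls_per_row` is a hypothetical helper name.

```python
import re

# Assumption: a deliberately simple URL pattern for illustration only.
# It is NOT the pattern used by the Palladian nodes.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def urls_per_row(messages):
    """Return one list of URLs per input message.

    The output has exactly as many entries as the input, which mirrors
    grouping the extracted URLs of each row into a collection cell.
    """
    return [URL_PATTERN.findall(message) for message in messages]


# Made-up sample data standing in for the forum message column.
messages = [
    "See https://example.com/a.png and http://example.org/page",
    "No links here",
    "Just one: https://example.com/b",
]

collected = urls_per_row(messages)
for message, urls in zip(messages, collected):
    print(len(urls), urls)
```

Rows without any URL end up with an empty list, so nothing is dropped and the result can be joined back to the original table by position.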

– Philipp

iagovar · December 12, 2019, 2:06pm

Hmm, I’ve never used it. I guess I could try with the NodePit examples.

I’ll try and come back if I have more questions, thanks.

qqilihq · March 4, 2020, 6:03pm

Coming back to an old topic:

Palladian 2.0 has a new Regex Extractor node which (among others) has a preset for extracting URLs. It can output the extraction results as separate rows (with a back-reference to the RowID of the source row), extract only the first occurrence, output a “Collection Cell”, or even produce a fine-grained JSON object which contains the results including their offsets within the input string.
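
To make the difference between these output styles concrete, here is a rough plain-Python sketch of three of them. It does not call or reproduce the Regex Extractor node: the regex, the function names, and the JSON field names (`value`, `start`, `end`) are all assumptions chosen for illustration.

```python
import json
import re

# Assumption: a simple illustrative URL pattern, not Palladian's preset.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_rows(table):
    """One output row per match, each tagged with its source row ID."""
    return [
        (row_id, match.group())
        for row_id, text in table.items()
        for match in URL_PATTERN.finditer(text)
    ]


def extract_first(text):
    """Only the first occurrence, or None if there is no match."""
    match = URL_PATTERN.search(text)
    return match.group() if match else None


def extract_with_offsets(text):
    """All matches as a JSON string, including start/end offsets."""
    return json.dumps(
        [
            {"value": m.group(), "start": m.start(), "end": m.end()}
            for m in URL_PATTERN.finditer(text)
        ]
    )


# Made-up sample table: row ID -> message text.
table = {
    "Row0": "See https://example.com/a.png and http://example.org/page",
    "Row1": "No links here",
}

rows = extract_rows(table)
first = extract_first(table["Row0"])
offsets_json = extract_with_offsets(table["Row0"])
print(rows)
print(first)
print(offsets_json)
```

The row-per-match form changes the row count but keeps the source row ID, so it can be joined back to the original messages. The offsets form keeps everything in one cell while still recording where each URL was found.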

More details about the Palladian 2.0 release are available here:
