Text processing on URLs

marit · January 29, 2018, 2:24pm

I'm trying to do text processing on a list of URLs to see which words appear most frequently in the URL paths. But when I replace the symbols [!#$%&'\"*+,.\?:;]+ with a space (" "), the whole document is still treated as one word, for example when I use the "Bag of Words creator".

For example, I want this URL

https://www.modernghana.com/news/788266/church-must-join-sexual-health-education-campaign.html

to give me something like ["church", "sexual", "health", "education", "campaign"], but what I get is just "church must join sexual health education campaign".

It's a bit hard to explain, but I hope you understand the problem.
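
To make the intended result concrete, here is a minimal Python sketch of the tokenization described above. It runs outside KNIME and is not the workflow's actual nodes. The function name `path_words` and the tiny stop-word set are hypothetical, chosen only for illustration. Note that the character class quoted above does not list "-" or "/", so this sketch treats every non-letter as a separator instead, which also drops the numeric ID.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical stop-word list, picked only so the result resembles the post.
STOP_WORDS = {"must", "join"}

def path_words(url, stop_words=STOP_WORDS):
    """Return the lowercase words found in the path of a URL."""
    path = urlparse(url).path
    # Drop a trailing file extension such as ".html".
    path = re.sub(r"\.[A-Za-z0-9]+$", "", path)
    # Keep alphabetic runs only, so "/", "-" and digits act as separators.
    tokens = re.findall(r"[a-z]+", path.lower())
    return [t for t in tokens if t not in stop_words]

urls = [
    "https://www.modernghana.com/news/788266/"
    "church-must-join-sexual-health-education-campaign.html",
]

# Count how often each word occurs across all URL paths.
counts = Counter(word for url in urls for word in path_words(url))

print(path_words(urls[0]))
# ['news', 'church', 'sexual', 'health', 'education', 'campaign']
```

The leading path segment "news" is kept here. Whether to drop it (for example by adding it to the stop words) depends on what should count as a word in the path.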

julian.bunzel · February 6, 2018, 4:46pm

Hey marit,

I can't reproduce your problem. Which nodes did you use?

I attached a screenshot of my solution. Tokenization of the document works without problems for me.

Cheers,

Julian

workflow.png
