Not able to strip out all HTML tags along with JavaScript in HTML content

angansen · November 2, 2015, 7:01pm

I am trying to strip all HTML tags and script tags from an HTML page so that only the useful information remains. For the stripping I am using regular expressions in the String Replacer node.

I am doing the following steps:

1. using "\r\n|\n|\r" to remove newline

2. applying "(<script.?)script>/ig" for stripping out inner content of the <script> tag

3. then at last applying "<([^>]+)>" to remove all html tags.

However, I am not able to remove the <script> tags at all. Can anybody please give me an idea of how to remove all HTML and JavaScript tags completely?

kilian.thiel · November 3, 2015, 5:00pm

If you have problems with the String Replacer node, you could try the Java Snippet node and define a few lines of Java code to replace all tags, e.g.

String input = "<b>some text</b>";
String stripped = input.replaceAll("<[^>]*>", "");

Note that with the Java Snippet node you can also include external libraries, e.g. JSoup, to parse HTML documents and extract the text.
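
As a rough sketch of the regex approach (not a robust HTML parser), the script blocks can be removed first, including their inner content, and the remaining tags afterwards. The class name HtmlStripper and the method strip are made up for this illustration. It assumes the script blocks are properly closed, and it does not decode entities such as &amp;.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of KNIME or any library.
class HtmlStripper {
    // (?is) = case-insensitive, and '.' also matches line breaks, so
    // multi-line <script>...</script> blocks are matched without having
    // to remove newlines first. '.*?' is non-greedy: it stops at the
    // first closing </script>.
    private static final Pattern SCRIPT =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    // Any remaining tag: '<', then everything up to the next '>'.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        // 1. Drop script blocks together with their JavaScript content.
        String noScripts = SCRIPT.matcher(html).replaceAll(" ");
        // 2. Replace every other tag with a space so that words separated
        //    only by markup do not run together.
        String noTags = TAG.matcher(noScripts).replaceAll(" ");
        // 3. Collapse whitespace (including newlines) into single spaces.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String input = "<p>Hello <b>world</b></p>\n"
                + "<script type=\"text/javascript\">\nvar x = 1 < 2;\n</script>"
                + "<p>Bye</p>";
        System.out.println(strip(input)); // prints: Hello world Bye
    }
}
```

The order matters: if the generic tag pattern runs first, the <script> and </script> tags disappear but the JavaScript between them stays in the text.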

Cheers, Kilian

angansen · November 9, 2015, 8:27am

Hi Kilian,

I want to use the KNIME node set to read content from multiple URLs using the HtmlParser and send it to an R snippet as a list variable. How can I do that? Is it possible using

kilian.thiel · November 16, 2015, 3:31pm

First define a list of URLs, e.g. with the Table Creator node (or read a CSV file with URLs), and provide these URLs to the HTTP Retriever node of the Palladian Extension. This node will download the content of each URL. Then use the Html Parser node to convert the result to XML cells. Finally, use the XPath node to extract the content from the HTML code.

Cheers, Kilian

gcarmich · March 15, 2016, 12:35pm

I implemented the flow suggested above, but my XPath node results still contain some HTML, and there is no space between some words that are separated by HTML code on the page. I set the XPath query string as:

/* and of type String(Single Cell)

Does the XPath query string need to be more specific to pull only the visible text correctly?

Thanks,

Gilbert
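
On the XPath question above, one possible direction is to select text nodes rather than whole elements and to skip script and style content. The sketch below shows the idea in plain Java with the JDK's built-in XPath support, outside KNIME. The class name VisibleTextXPath, the sample document and the query are assumptions for illustration. If the Html Parser output carries an XHTML namespace, the query in the XPath node would probably need a matching namespace prefix.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; assumes well-formed XHTML without namespaces.
class VisibleTextXPath {
    // Text nodes below <body>, excluding anything inside <script> or <style>.
    static final String QUERY =
            "//body//text()[not(ancestor::script) and not(ancestor::style)]";

    public static String visibleText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(QUERY, doc, XPathConstants.NODESET);
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                String piece = nodes.item(i).getNodeValue().trim();
                if (piece.isEmpty()) {
                    continue; // skip whitespace-only text nodes
                }
                if (sb.length() > 0) {
                    sb.append(' '); // separate text from neighbouring elements
                }
                sb.append(piece);
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>t</title></head><body>"
                + "<p>Hello<br/>world</p><script>var x = 1;</script>"
                + "<p>Bye</p></body></html>";
        System.out.println(visibleText(xhtml)); // prints: Hello world Bye
    }
}
```

Joining every text node with a space fixes words that run together across tags, but it also splits a word whose letters are wrapped in inline markup (e.g. a single bold letter), so the separator is a judgment call.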

kilian.thiel · March 18, 2016, 1:53pm

Hi Gilbert,

Another way to remove HTML tags from strings is to use the Java Snippet node in combination with the JSoup Java library. In the Java Snippet node it is possible to register and use external libraries. Use JSoup to parse the string and remove all HTML tags.

Cheers, Kilian

mobcdi · June 22, 2016, 6:18pm

Hi Kilian,

I am new to KNIME and have no experience with JSoup. Would the XML output from the HtmlParser not cause problems by wrapping the HTML code in extra XML? I was also wondering whether the same could be done using the Python snippet and the get_text() feature of the BeautifulSoup4 module?

Michael
