HtmlParser issue when a video is embedded

Hello,

I'm having a weird issue with HtmlParser. I'm collecting data from a forum:

HttpRetriever -> HtmlParser -> XPath

I use XPath so I can get the elements and contents I want from the page. It works fine, but...

When a forum page contains an embedded YouTube video, every < or > after the video is replaced by &lt; or &gt; (the HTML entities for those characters). As a result, XPath can no longer match anything in that part of the page.

Basically, this is what happens after the video:

<div class="name">text</div>

changes to

&lt;div class="name"&gt;text&lt;/div&gt;

I've tried a few things: a string replace followed by String to XML, and a Java Snippet to turn the HTTP Retriever result into a string, again followed by String to XML. Nothing works.

The only error I got, after replacing every &lt; and &gt; with the corresponding character, is the following:

Cell in row:"Row0" and column "Document" could not be parsed: The element type "iframe" must be terminated by the matching end-tag "</iframe>". Add missing value.
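
For readers who want to reproduce this class of error outside KNIME, here is a minimal sketch. It assumes Python's standard xml.etree.ElementTree as a stand-in for the strict XML parser behind String to XML, so the exact error wording differs from the message above. The helper name is_well_formed and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # Raised, for example, when an element has no matching end tag.
        return False


# Hypothetical fragments: the first iframe is never closed, the second is.
unclosed = '<div><iframe src="aaa"><div class="name">text</div></div>'
closed = '<div><iframe src="aaa"></iframe><div class="name">text</div></div>'

print(is_well_formed(unclosed))
print(is_well_formed(closed))
```

A strict XML parser has no recovery rules, so one unmatched iframe is enough to reject the whole document.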

Any guidance would be extremely helpful.

Thanks!

Gustavo

Hi Gustavo,

A few things to note:

1) HtmlParser is part of Palladian, so it would have been better to post this in the Palladian & Selenium board. Also, it is not really easy to help you, since you didn't post your workflow or indicate which site you are trying to parse. One can only guess, so here we go...

2) It seems that the video is contained in an iframe, which is often the case for embedded videos in forums, but for some reason the closing iframe tag seems to be missing. Did you check the source of the page? Is the iframe closing tag indeed missing?

3) With a missing closing tag (and probably something else before it due to poor copy/pasting), it is likely that the parser "believes" it is still inside a string parameter, hence the < and > are turned into &lt; and &gt;. This is the proper behavior given that the DOM is malformed.
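
To illustrate why XPath stops matching once the markup has been escaped, here is a small sketch. It assumes Python's standard xml.etree.ElementTree in place of the KNIME XPath node; the sample documents and the helper find_names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical parser output: the div after the video was swallowed by the
# iframe and serialized as escaped text instead of as an element.
escaped = '<body><iframe>&lt;div class="name"&gt;text&lt;/div&gt;</iframe></body>'
# The same content when the iframe is closed where it should be.
intact = '<body><iframe></iframe><div class="name">text</div></body>'


def find_names(markup: str) -> list:
    """Return the text of every div with class 'name'."""
    root = ET.fromstring(markup)
    return [el.text for el in root.findall(".//div[@class='name']")]


print(find_names(escaped))  # no div element exists, only text
print(find_names(intact))
# The "lost" markup is still there, but only as the iframe's text content:
print(ET.fromstring(escaped).find("iframe").text)
```

In the escaped version the div is plain character data inside the iframe, so no element query can reach it. That is why replacing the entities afterwards does not help unless the iframe itself is repaired first.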

Bottom line: there is really no easy solution besides "fixing" the page before parsing it. A web browser may be more tolerant than HtmlParser (based on validator.nu) and fill in the missing closing tag for display purposes, which would explain why the page still renders. Check anyway with the Developer Tools whether any error is triggered. Properly parsing the page with that defect may require a different solution, though.

Cheers,
Marco. 

Hi Marco,

Thanks for your reply. The 2nd point seems to make sense. I'll see if I can build a workaround, like adding a </iframe> where it should be.

Thanks!

Gustavo

Fixed. Thanks, Marco, for your insights! Guessing is extremely helpful most of the time, as it gives one ideas for fixing the problem. :)

It turns out that the iframe tag causing the problem was closed, but like this:

<iframe source="aaa" atribute="aa" />

This is what caused the problems with HtmlParser. Using a String Replacer, I could then close the tags as

<iframe></iframe>
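
A possible explanation, for what it's worth: HTML parsers ignore the trailing slash on non-void elements such as iframe, so the tag above counts as an opening tag that is never closed. The replacement can be sketched with a regular expression. This is only an illustration in Python, not the actual String Replacer configuration; the pattern and the function name close_iframes are assumptions.

```python
import re

# Matches a self-closed iframe tag and captures its attributes.
# Assumes attribute values contain no ">" character.
SELF_CLOSED_IFRAME = re.compile(r"<iframe\b([^>]*?)\s*/>", re.IGNORECASE)


def close_iframes(html: str) -> str:
    """Rewrite <iframe ... /> as <iframe ...></iframe>, keeping attributes."""
    return SELF_CLOSED_IFRAME.sub(r"<iframe\1></iframe>", html)


page = '<iframe source="aaa" atribute="aa" /><div class="name">text</div>'
print(close_iframes(page))
```

Iframes that already have an explicit closing tag are left untouched, because the pattern requires "/>" before the first ">".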

The workflow is like this:

HTTPRetriever -> HTTPResultsExtractor (to extract a string) -> String Replacer -> String to Binary Objects (for some reason, String to XML still raised an error) -> HTMLParser

Again, thanks for your ideas.

Gustavo

Glad you found a solution. Well done!

Cheers,
Marco.