Html Parser Node Character Encoding Problem

Hi Guys,

I am quite new to KNIME and I have a small but essential problem regarding the Html Parser node.

I am trying to extract data from Turkish websites and I use the Parser node a lot. However, the node's output shows that the parser does not retrieve Turkish characters correctly. I have attached an example to show the problem. All Turkish characters (ğ, ü, ş, ç, ö, ı) are represented by the same square-shaped symbol, so I cannot tell them apart and replace them afterwards.
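To illustrate the symptom outside of KNIME, here is a minimal Python sketch. It assumes, purely for illustration, that the page bytes are in ISO-8859-9 (a common Turkish encoding) and are decoded as UTF-8; I have not verified which encoding the affected sites actually use.

```python
# Assumption: the page is served as ISO-8859-9 (Latin-5) but read as UTF-8.
original = "ğüşçöı"
raw = original.encode("iso8859_9")  # the bytes as they would arrive from the server

# Decoding with the wrong charset: every invalid byte becomes U+FFFD,
# the replacement character that is usually drawn as a box or a question mark.
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the right charset restores the text.
correct = raw.decode("iso8859_9")

print(garbled)
print(correct)
```

Once the characters have been turned into the replacement symbol, the original letters are lost, which is why a search-and-replace on the output cannot repair them.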

Your help will be highly appreciated.

Hi,

the parser usually detects the encoding automatically. Can you post a URL where you encounter the issue?

Best,
Philipp

http://www.etstur.com/Yurtdisi-Tatil-Turlari/Kisa-Yunanistan

 

Fixed this. Updated nodes will be available within the next few days.

Thanks a lot.

Hi Guys,

Have you set any release date for the update?

Bora

Hi Philipp,

I check for updates every day, and some updates have been installed since last week, but the problem with the Html Parser node is still there.

I just checked the Palladian Feature for KNIME Workbench and the version number is as follows: 1.2.0.201411101944

Something seems wrong with my version.

Bora

 

Hi Bora,

you will need to add the nightly community link to your KNIME build to get the very latest updates. This page gives more details:

http://tech.knime.org/community

The risk is that the nightly releases of the nodes are less tested and so may contain bugs, but the advantage is that you get all the latest developments.

simon.

Hi Simon,

I checked the community link in my KNIME build and it looks like I already have it. However, Palladian's latest update is not installed on my system.

Hi boraster,

the update has been available since last week (apologies if I was not clear enough).

Have you already run "Check for Updates" from the menu? If this does not show any new versions: could you please check the version of the "Palladian Feature for KNIME Workbench" under "About KNIME" > "Installation Details"? This should read: 1.2.0.201408051613 [edit] Sorry, this was wrong; it should be 1.2.0.201411101944, of course.

If there are still any issues, please let me know.

Best,
Philipp

Sorry Bora,

the version number which I posted above was wrong. You obviously already have the latest available build (which includes the mentioned fix). I will double-check this later, when I'm at work, and get back to you.

Best,
Philipp

Hi Philipp
Thanks a lot for the prompt reply.
I am looking forward to hearing from you soon.
Regards
Bora

Hi Bora,

can you please download, import and run the attached workflow? (If you get an error message when opening the workflow, stating that the "Table Difference Checker" node is missing, you can ignore this.) Please check the output table of the Column Filter node (Node 6). This should be a string with correct Turkish encoding (i.e. no placeholder symbols).

If the string is not in correct encoding, please do the following, so that we can track down the issue:

  1. Go to KNIME's preferences > KNIME GUI and enable DEBUG logging.
  2. Reset the HtmlParser node.
  3. Re-run the HtmlParser node and post the Console output here.

Best,
Philipp

Hi Philipp,

the workflow you sent works fine!!

What should I do to get rid of the problem in my workflows?

Could you upload an example of how you're using the parser? Then I can have a look.

I was using the parser alone, not along with the retriever.

Now I have added the retriever, just as in your example workflow, and there seems to be no problem anymore.

That's it, huh!!! No need to do anything else?

Just out of curiosity, should I always use the retriever and the parser in combination (for language-specific characters, etc.)?

Yes, using the Retriever first and THEN the Parser is the recommended way (the Retriever, for example, extracts the page encoding and also handles cookies).
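To show the idea of "retrieve first, then parse" in isolation, here is a rough Python sketch. It is not how the Palladian nodes are implemented; the function names, the sample header and the sample page are all made up for illustration, and the fetch step is faked so that the example runs without network access.

```python
import re
from html.parser import HTMLParser


def charset_from_content_type(header, default="utf-8"):
    """Return the charset parameter of a Content-Type header, or a default."""
    match = re.search(r'charset=\s*"?([\w.:-]+)', header, flags=re.IGNORECASE)
    return match.group(1) if match else default


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def retrieve():
    """Hypothetical stand-in for the retriever: returns (Content-Type, raw bytes)."""
    body = "<html><head><title>Kısa Yunanistan Turları</title></head></html>"
    return "text/html; charset=ISO-8859-9", body.encode("iso8859_9")


def parse_title(content_type, raw):
    """Decode the bytes with the charset reported by the retriever, then parse."""
    html = raw.decode(charset_from_content_type(content_type))
    parser = TitleParser()
    parser.feed(html)
    return parser.title


content_type, raw = retrieve()
title = parse_title(content_type, raw)
print(title)
```

The point of the split is that the charset information travels with the raw bytes from the retrieval step to the parsing step, instead of the parser having to guess it.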

There have been some issue reports lately which were caused by downloading URLs directly with the parser. We'll add a warning message to future releases to avoid such problems.

Best,
Philipp

Thanks, Philipp.

Best regards,

Bora