Html Parser Node Character Coding Problem

boraster · November 10, 2014, 4:18pm

Hi Guys,

I am quite new to KNIME and I have a small but essential problem regarding the Html Parser node.

I am trying to extract data from Turkish web sites and I use the Parser node a lot. However, the node's output shows that the parser does not retrieve Turkish characters correctly. I have attached an example to show the problem. All Turkish characters (ğ, ü, ş, ç, ö, ı) are represented by the same square-shaped symbol, so I cannot replace them in any way.

Your help will be highly appreciated.

qqilihq · November 10, 2014, 5:48pm

Hi,

The parser usually detects the encoding automatically. Can you post a URL where you encounter the issue?

Best,
Philipp

boraster · November 10, 2014, 6:07pm

http://www.etstur.com/Yurtdisi-Tatil-Turlari/Kisa-Yunanistan

qqilihq · November 10, 2014, 8:32pm

Fixed this. The updated nodes will be available within the next few days.

boraster · November 11, 2014, 6:52am

Thanks a lot.

boraster · November 19, 2014, 3:54pm

Hi Guys,

Have you set any release date for the update?

Bora

boraster · November 20, 2014, 6:23am

Hi Philipp,

I check for updates every day, and some updates have been installed since last week, but the problem with the Html Parser node is still there.

I just checked the Palladian Feature for KNIME Workbench and the number is as follows: 1.2.0.201411101944

Something seems wrong with my version.

Bora

richards99 · November 20, 2014, 6:56am

Hi Bora,

You will need to add the nightly community link to your KNIME installation to get the latest updates. This page gives more details:

http://tech.knime.org/community

The risk is that the nightly releases of the nodes are less tested and so may contain bugs, but the advantage is that you get all the latest developments.

simon.

boraster · November 20, 2014, 8:55am

Hi Simon,

I checked for the community link in my KNIME installation and it looks like I already have it. However, Palladian's latest update has not been installed on my system.

qqilihq · November 20, 2014, 9:56am

Hi boraster,

The update has been available since last week (apologies if I was not clear enough).

Have you already run "Check for Updates" from the menu? If this does not show any new versions, can you please check the version of the "Palladian Feature for KNIME Workbench" under "About KNIME" > "Installation Details"? This should read: ~~1.2.0.201408051613~~ [edit] Sorry, this was wrong; it should be 1.2.0.201411101944, of course.

If there are still any issues, please let me know.

Best,
Philipp

qqilihq · November 20, 2014, 10:02am

Sorry Bora,

The version number I posted above was wrong. You obviously already have the latest available build (which includes the fix mentioned above). I will double-check this later, when I'm at work, and get back to you.

Best,
Philipp

boraster · November 20, 2014, 10:21am

Hi Philipp
Thanks a lot for the prompt reply.
I am looking forward to hearing from you soon.
Regards
Bora

qqilihq · November 20, 2014, 2:58pm

Hi Bora,

can you please download, import and run the attached workflow? (If you get an error message when opening the workflow, stating that the "Table Difference Checker" node is missing, you can ignore this.) Please check the output table of the Column Filter node (Node 6). This should be a string with correct Turkish encoding (i.e. no placeholder symbols).

If the string is not in correct encoding, please do the following, so that we can track down the issue:

1. Go to KNIME's preferences > KNIME GUI and enable DEBUG logging.
2. Reset the HtmlParser node.
3. Re-run the HtmlParser node and post the Console output here.

Best,
Philipp

htmlparserencodingtest.zip

boraster · November 20, 2014, 3:09pm

Hi Philipp,

the workflow you sent works fine!!

What should I do to get rid of the problem in my own workflows?

qqilihq · November 20, 2014, 3:35pm

Could you upload an example of how you're using the parser? Then I can have a look.

boraster · November 20, 2014, 3:41pm

I was using the parser alone, not along with the retriever.

Now I have added the retriever, just as in your example workflow, and there seems to be no problem any more.

That's it, huh!!! No need to do anything else?

Just out of curiosity, should I always use retriever and parser in combination? (of course, for language specific characters, etc)

qqilihq · November 20, 2014, 4:24pm

Yes, using the combination of Retriever THEN Parser is the recommended way. (The Retriever, for example, extracts the page encoding and also handles cookies.)
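
As a rough illustration of why the page encoding matters, here is a minimal Python sketch. It is not the Palladian implementation; the helper name `charset_from_content_type`, the sample header, and the choice of ISO-8859-9 as the page encoding are assumptions made purely for the example. It shows how bytes decoded with the wrong charset turn every Turkish character into the same placeholder symbol, while decoding with the charset declared by the server recovers the text.

```python
import re


def charset_from_content_type(header, default="utf-8"):
    """Hypothetical helper: return the charset named in a Content-Type
    header value, or `default` if the header does not declare one."""
    match = re.search(r'charset=\s*"?([\w.:-]+)', header, flags=re.IGNORECASE)
    return match.group(1) if match else default


# Simulated HTTP response (assumption: the server sends ISO-8859-9 bytes
# and declares that charset in its Content-Type header).
body = "ğ ü ş ç ö ı".encode("iso-8859-9")
content_type = "text/html; charset=ISO-8859-9"

# Decoding without the declared encoding: here we simply assume UTF-8.
# Each Turkish character becomes the replacement character U+FFFD.
guessed = body.decode("utf-8", errors="replace")

# Decoding with the encoding taken from the response header.
declared = body.decode(charset_from_content_type(content_type))

print(repr(guessed))
print(declared)
```

Once the characters have collapsed into the replacement symbol, the original letters cannot be recovered from the decoded string, which is why the encoding has to be known before parsing.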

There have been some issue reports lately which were caused by downloading URLs directly with the parser. We'll add a warning message to future releases to avoid such problems.

Best,
Philipp

boraster · November 20, 2014, 4:47pm

Thanks, Philipp.

Best regards,

Bora
