I am quite new to KNIME and I have a small but essential problem regarding the HtmlParser node.
I am trying to extract data from Turkish web sites and I use the parser node a lot. However, the node's output shows that the parser does not retrieve Turkish characters correctly. I attached an example to show the problem. All Turkish characters (ğ, ü, ş, ç, ö, ı) are represented by the same square-shaped symbol, so I cannot tell them apart or replace them afterwards.
The update has already been available since last week (apologies if I was not clear enough).
Have you already run the "Check for Updates" menu? If this does not show any new versions, can you please check the version of the "Palladian Feature for KNIME Workbench" in the "About KNIME" window > "Installation Details"? It should read: 1.2.0.201408051613 [edit] Sorry, this was wrong; it should be 1.2.0.201411101944, of course.
If there are still any issues, please let me know.
The version number which I posted above was wrong; you obviously already have the latest available build (which includes the mentioned fix). I will double-check this again later, when I'm at work, and will get back to you.
Can you please download, import and run the attached workflow? (If you get an error message when opening the workflow stating that the "Table Difference Checker" node is missing, you can ignore it.) Please check the output table of the Column Filter node (Node 6). It should contain a string with correct Turkish encoding (i.e. no placeholder symbols).
If the string is not in correct encoding, please do the following, so that we can track down the issue:
1. Go to KNIME's preferences > KNIME GUI and enable DEBUG logging.
2. Reset the HtmlParser node.
3. Re-run the HtmlParser node and post the Console output here.
Yes, using the combination of Retriever THEN Parser is the recommended way. (The Retriever, for example, extracts the page encoding and also handles cookies.)
There have been some issue reports lately which were caused by downloading URLs directly with the parser. We'll add a warning message to future releases to avoid such problems.
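To illustrate why the page encoding matters here, below is a minimal Python sketch. It is not how the Palladian nodes are implemented; the helper name `detect_charset`, the sample header value and the choice of ISO-8859-9 as the page encoding are assumptions made only for this example. It shows that decoding Turkish bytes without knowing the page's charset turns every letter into the same placeholder symbol, while decoding with the charset taken from the HTTP header gives the correct text.

```python
import re


def detect_charset(content_type_header, default="utf-8"):
    """Hypothetical helper: read the charset from a Content-Type header value."""
    match = re.search(r"charset=([\w-]+)", content_type_header, re.IGNORECASE)
    return match.group(1) if match else default


turkish = "ğüşçöı"

# Assumption: the server sends the page as ISO-8859-9 (Latin-5, Turkish).
raw_bytes = turkish.encode("iso-8859-9")

# Parsing the bytes directly and guessing UTF-8: the bytes are not valid
# UTF-8, so each one is replaced by the placeholder character U+FFFD.
wrong = raw_bytes.decode("utf-8", errors="replace")

# Retrieving first: take the charset from the response header, then decode.
charset = detect_charset("text/html; charset=ISO-8859-9")
right = raw_bytes.decode(charset)

print(repr(wrong))
print(right)
```

Once the letters have been replaced by the placeholder, the original characters cannot be recovered from the output, which matches the symptom described in the first post.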