Palladian Node "HTMLParser" does not parse HTML documents in UTF-8

I am trying to parse German HTML sites with the Palladian Node HTMLParser.
The HTML files are already downloaded on my system.
Naturally, the German sites contain a lot of umlauts (ä, ö, ü).

The parser does not seem to decode the documents as UTF-8.
Instead of the umlauts I am getting other symbols such as question marks or squares.
Does anybody have a workaround for this?

How are you reading the files and supplying them to the node?

Hi @qqilihq

I have them downloaded on my system. I read them with the List Files node and then parse them with the HTMLParser.

Hi there,

so, what is the input which you’re supplying to the HTML Parser? Is it a URL or a file path to these files, or is this the HTML markup as a string?

There is a known issue with proper encoding detection in some cases, but I currently cannot debug this as I’m on the road.

I suggest you try the following: Read the files to binary data and pass the binary object to the HTML parser. This way the encoding detection will work for sure. You can use the following node from the KNIME File Handling Nodes for that:

Please let me know if this helps!

– Philipp

PS: This is on our list for the upcoming Palladian update.

Hi @qqilihq,

Thanks for your help.

I am supplying the Parser with a File Path to the files.

I have tried your solution, but I am getting the warning “URL column Not set”.

Could you please post a minimal test workflow with an example file so that I can have a look?

Hi @qqilihq,

Here is a workflow with just two files.

UTF-8.knwf (6.5 KB)

Can you please share one of the HTML files as well? Thx.

Hey @qqilihq,

Unfortunately I am not allowed to share the files.

But I think I found the problem.
Every file starts with these 2 lines:

<?xml version="1.0" encoding="UTF-8"?>

<html xmlns="http://www.w3.org/1999/xhtml">

So I have XML and not HTML files?

Hey,

this is an XHTML header, so the file is processable with the HTML Parser.

In order to debug your issue, however, I’d need a test file. Feel free to strip sensitive data, but the general structure, the header, and an example of incorrectly decoded characters are necessary. Also, please keep the original encoding.

With this I can have a look.

Best,
Philipp

Hey @qqilihq,

Here is a file with a sample text.

It produces the same error with the umlauts.

Sample.zip (645 Bytes)

Thanks for the example file!

I had a look: This file does not contain any encoding clues, so the parser will simply assume a default encoding (afair this would be some ISO-tralala). It’s behaving the same way as a web browser would do:


Unlike a web browser, we currently do not allow overriding the default encoding. But you could manually add the necessary header to the files which you want to parse:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Then the Umlauts are properly decoded.

– Philipp
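To illustrate the effect described above, here is a minimal Java sketch that decodes the same UTF-8 bytes twice: once as UTF-8 and once with a single-byte default. ISO-8859-1 is an assumption here, since the post only says "some ISO" encoding, and the class and method names are purely illustrative. This is not the Palladian parser's code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    // Decodes raw bytes with the given charset, the way a parser does
    // once it has settled on an encoding (detected or default).
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // The three umlauts, written as Unicode escapes so that the
        // encoding of this source file does not matter.
        byte[] raw = "\u00e4\u00f6\u00fc".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the umlauts come back unchanged.
        System.out.println(decode(raw, StandardCharsets.UTF_8));

        // Assumed wrong default: every umlaut turns into two characters.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
    }
}
```

With ISO-8859-1, each umlaut turns into two characters (for example "ä" becomes "Ã¤"). Which replacement symbols appear in practice depends on the default the parser actually picks and on the font used for display.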

Hey @qqilihq,

thanks for your time so far.
When I open the file it looks normal.


Adding the header manually is not possible since I have 35,000 files.
But again thanks for your help so far.

You could add these relatively easily through the KNIME workflow – just append the header, write it to a temporary file, and then supply this to the parser.

– Philipp

Hi @qqilihq

Can you recommend me a node which can append the header to every row?

There are, as usual :), several ways to achieve this. I’d personally go for the following:

  1. Use a Chunk Loop Start node to process each file in isolation
  2. Use a Java Snippet node with some custom code to add the necessary headers and write the content to a temporary file
  3. Parse this temporary file using the HTML Parser
  4. Delete the temporary file
  5. End the loop

Instead of the Java coding, it should also be possible with a combination of several non-coding nodes.

Sorry, I cannot give a more detailed walk-through now as I’m currently loaded with other work, but I’m sure someone else here in the forum can jump in.

– Philipp
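Step 2 of the list above could be sketched in plain Java roughly as follows. This is only a sketch under assumptions: the files really are UTF-8, they contain a <head> element, and they do not already declare a charset. The class and method names are hypothetical, and inside a KNIME Java Snippet the code would need to be adapted to the snippet's input and output fields.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CharsetHeaderPatcher {
    // The header suggested earlier in this thread.
    static final String META =
        "<meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\" />";

    // Matches an opening <head> tag with or without attributes.
    // The tag name must be followed by whitespace or '>', so <header> is not matched.
    private static final Pattern HEAD_OPEN =
        Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    // Inserts the meta tag directly after the first opening <head> tag.
    // Markup without a <head> tag is returned unchanged.
    static String addMeta(String html) {
        Matcher m = HEAD_OPEN.matcher(html);
        if (!m.find()) {
            return html;
        }
        return html.substring(0, m.end()) + META + html.substring(m.end());
    }

    // Reads the source file as UTF-8 (assumption), adds the header, and
    // writes the result to a new temporary file whose path is returned.
    static Path writePatchedCopy(Path source) throws IOException {
        String html = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        Path tmp = Files.createTempFile("patched-", ".html");
        Files.write(tmp, addMeta(html).getBytes(StandardCharsets.UTF_8));
        return tmp;
    }
}
```

The path returned by writePatchedCopy would then be passed to the HTML Parser (step 3), and the file removed afterwards (step 4), for example with Files.deleteIfExists.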
