Palladian Node "HTMLParser" does not parse HTML documents in UTF-8

gnime · October 17, 2019, 7:51am

I am trying to parse German HTML sites with the Palladian Node HTMLParser.
The HTML files are already downloaded on my system.
Naturally the German sites contain a lot of umlauts (ä, ö, ü).

The Parser is not able to parse the documents in UTF-8.
Instead I am getting different symbols like question marks or squares.
Does anybody have a workaround for this?

qqilihq · October 17, 2019, 9:15pm

How are you reading the files and supplying them to the node?

gnime · October 17, 2019, 9:42pm

Hi @qqilihq

I have them downloaded on my system, then I am reading them with the List Files node and then I am parsing them with the HTMLParser.

qqilihq · October 18, 2019, 6:23am

Hi there,

so, what is the input which you’re supplying to the HTML Parser? Is it a URL or a file path to these files, or is this the HTML markup as a string?

There is a known issue with proper encoding detection in some cases, but I currently cannot debug this as I’m on the road.

I suggest you try the following: Read the files to binary data and pass the binary object to the HTML parser. This way the encoding detection will work for sure. You can use the following node from the KNIME File Handling Nodes for that:

Please let me know if this helps!

– Philipp

PS: This is on our list for the upcoming Palladian update.

gnime · October 18, 2019, 8:03am

Hi @qqilihq,

Thanks for your help.

I am supplying the Parser with a File Path to the files.

I have tried your solution, but I am getting the warning “URL column Not set”.

qqilihq · October 18, 2019, 8:16am

Could you please post a minimal test workflow with an example file so that I can have a look?

gnime · October 18, 2019, 8:40am

Hi @qqilihq,

Here is a workflow with just two files.

UTF-8.knwf (6.5 KB)

qqilihq · October 18, 2019, 9:34am

Can you please share one of the HTML files as well? Thx.

gnime · October 23, 2019, 9:07am

Hey @qqilihq,

Unfortunately I am not allowed to share the files.

But I think I found the problem.
Every file starts with these 2 lines:

<?xml version="1.0" encoding="UTF-8"?>

<html xmlns="http://www.w3.org/1999/xhtml">

So I have XML and not HTML files?

qqilihq · October 23, 2019, 9:34am

Hey,

this is an XHTML header, so the file is processable with the HTML Parser.

In order to debug your issue, I’d however need a test file. Feel free to strip sensitive data, but the general structure, the header, and an example of incorrectly decoded characters are necessary. Also, please keep the original encoding.

With this I can have a look.

Best,
Philipp

gnime · October 23, 2019, 11:12am

Hey @qqilihq,

Here is a file with a sample text.

It produces the same error with the umlauts.

Sample.zip (645 Bytes)

qqilihq · October 23, 2019, 2:12pm

Thanks for the example file!

I had a look: This file does not contain any encoding clues, so the parser will simply assume a default encoding (afair this would be some ISO-tralala). It’s behaving the same way as a web browser would do:

Contrary to a web browser, we currently do not allow overriding the default encoding. But you could manually add the necessary header to the files which you want to parse:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Then the Umlauts are properly decoded.
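
To illustrate why a missing encoding hint garbles the umlauts, here is a minimal Java sketch that decodes the same UTF-8 bytes once with the wrong charset and once with the right one. ISO-8859-1 is only an assumption standing in for the parser's default (the exact default is not confirmed here), and the class and method names are made up for this example.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    /** Decodes the bytes as ISO-8859-1 (assumed default; wrong for UTF-8 input). */
    static String decodeAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the bytes as UTF-8, which is what the files actually contain. */
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "Käse", written with a Unicode escape so the source file encoding does not matter.
        byte[] raw = "K\u00e4se".getBytes(StandardCharsets.UTF_8);

        String wrong = decodeAsLatin1(raw);
        String right = decodeAsUtf8(raw);

        // The two bytes of "ä" become two separate characters when decoded as ISO-8859-1.
        System.out.println(wrong.length()); // 5
        System.out.println(right.length()); // 4
    }
}
```

With the meta header in place, the parser knows to take the second path instead of falling back to its default.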

– Philipp

gnime · October 23, 2019, 2:21pm

Hey @qqilihq,

thanks for your time so far.
When I open the file it looks normal.

Adding the header manually is not possible since I have 35,000 files.
But again thanks for your help so far.

qqilihq · October 23, 2019, 2:34pm

You could add these relatively easily through the KNIME workflow – just append the header, write it to a temporary file, and then supply this to the parser.

– Philipp

gnime · October 23, 2019, 2:52pm

Hi @qqilihq

Can you recommend a node which can append the header to every row?

qqilihq · October 23, 2019, 7:07pm

There are, as usual, several ways to achieve this. I’d personally go for the following:

Use a Chunk Loop Start node to process each file in isolation
Use a Java Snippet node with some custom code to add the necessary headers and write the content to a temporary file
Parse this temporary file using the HTML Parser
Delete the temporary file
End the loop
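
The custom code for the second step could look roughly like the following plain Java sketch. It is not a finished Java Snippet node: the wiring to KNIME columns and flow variables is left out, the class and method names are hypothetical, and it assumes the source files really are UTF-8 encoded. If no <head> tag is found, it simply prepends the meta tag, which is a simplification.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetInjector {

    /** The header suggested earlier in this thread. */
    static final String META =
            "<meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\" />";

    /** Matches an opening <head> tag with optional attributes, but not <header>. */
    private static final Pattern HEAD_OPEN =
            Pattern.compile("<head(\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Inserts META right after the opening <head> tag, or prepends it if there is none. */
    static String addMeta(String html) {
        Matcher m = HEAD_OPEN.matcher(html);
        if (m.find()) {
            return html.substring(0, m.end()) + META + html.substring(m.end());
        }
        return META + html; // fallback, assumes the parser tolerates a leading meta tag
    }

    /** Reads the source as UTF-8, adds the meta tag, and writes the result to a temp file. */
    static Path writeWithMeta(Path source) throws IOException {
        String html = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        Path tmp = Files.createTempFile("with-meta-", ".html");
        Files.write(tmp, addMeta(html).getBytes(StandardCharsets.UTF_8));
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for one of the downloaded files.
        Path source = Files.createTempFile("sample-", ".html");
        Files.write(source, "<html><head><title>K\u00e4se</title></head><body></body></html>"
                .getBytes(StandardCharsets.UTF_8));

        Path tmp = writeWithMeta(source);
        System.out.println("Temporary file for the parser: " + tmp);

        // After parsing, remove the temporary file again.
        Files.deleteIfExists(tmp);
        Files.deleteIfExists(source);
    }
}
```

Inside the loop, the path returned by writeWithMeta would be the one handed to the HTML Parser, and the delete call would run once the parsed result is available.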

Instead of the Java coding, it should also be possible with a combination of several non-coding nodes.

Sorry, I cannot give a more detailed walk-through now as I’m currently loaded with other work, but I’m sure someone else here in the forum can jump in.

– Philipp

system · April 21, 2023, 9:39pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.