Wrong encoding in web text scraping

HURIMOZ · June 23, 2022, 4:29am

Hi everyone, I’m struggling to get the encoding right in my web text scraping task. I want to make sure UTF-8 is being used throughout the whole workflow.

I’m using the Web Text Scraper component, which relies on a Java-based library called BoilerPipe. The output text is fine when the page is in English, but the same task on a page in a different language produces characters that are not displayed properly.

See image below:

And here’s the workflow inside the Web Text Scraper:

It looks like the Java snippet for the text extractor is not using UTF-8. Or is it something else?
Any help welcome!

HURIMOZ · June 23, 2022, 6:03am

I forgot to include the code snippet of the text extractor:

// system imports
import org.knime.base.node.jsnippet.expression.AbstractJSnippet;
import org.knime.base.node.jsnippet.expression.Abort;
import org.knime.base.node.jsnippet.expression.Cell;
import org.knime.base.node.jsnippet.expression.ColumnException;
import org.knime.base.node.jsnippet.expression.TypeException;
import static org.knime.base.node.jsnippet.expression.Type.*;
import java.util.Date;
import java.util.Calendar;
import org.w3c.dom.Document;


// Your custom imports:
import java.net.MalformedURLException;
import java.net.URL;

import de.l3s.boilerpipe.BoilerpipeProcessingException;


import de.l3s.boilerpipe.extractors.DefaultExtractor;
// system variables
public class JSnippet extends AbstractJSnippet {
  // Fields for input columns
  /** Input column: "link_123456789" */
  public String c_link_123456789;

  // Fields for output columns
  /** Output column: "text" */
  public String out_text;

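// Your custom variables:
String text = "";
URL url;
// expression start
    public void snippet() throws TypeException, ColumnException, Abort {
// Enter your code here:
        try {
            url = new URL(c_link_123456789);
        } catch (MalformedURLException e) {
            // No valid URL for this row: emit empty text and skip the fetch,
            // so a stale or null url is never passed to the extractor.
            out_text = "";
            return;
        }
        try {
            // Fetches the page and extracts its main text; decoding is left to BoilerPipe.
            out_text = DefaultExtractor.INSTANCE.getText(url);
        } catch (BoilerpipeProcessingException ex) {
            out_text = "";
        }
// expression end
    }
}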

mlauber71 · June 23, 2022, 6:20am

@HURIMOZ welcome to the KNIME forum.

You could try setting UTF-8 as the default in the knime.ini and in the workspace.
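As a hedged illustration of what a default-encoding setting can and cannot fix: the garbled characters typically appear when bytes in one encoding are decoded with another, and that can happen inside a library that picks its own charset regardless of the JVM default. The sketch below is plain Java; `MojibakeDemo` and `roundTrip` are hypothetical names used only for this example and are not part of KNIME or BoilerPipe.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo class, not part of KNIME or BoilerPipe. */
public class MojibakeDemo {

    /** Encodes text as UTF-8, then decodes those same bytes with the given charset. */
    public static String roundTrip(String text, Charset decodeWith) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // If decodeWith is not UTF-8, multi-byte characters fall apart here.
        return new String(bytes, decodeWith);
    }

    /** Reports the JVM default charset, which a default-encoding setting is assumed to influence. */
    public static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }
}
```

Calling `MojibakeDemo.roundTrip("\u00e9", StandardCharsets.ISO_8859_1)` turns the single character é into the two characters Ã©, because the UTF-8 bytes C3 A9 are read as two separate Latin-1 characters. Decoding with `StandardCharsets.UTF_8` returns the original text. If the wrong charset is chosen explicitly at the point where the page is fetched, changing the default alone would not be expected to help.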

HURIMOZ · June 23, 2022, 8:24am

Hi @mlauber71 thanks for your help.
However, I’m still getting those weird characters.
I’ve changed the knime.ini file:

I’ve also changed the text file encoding in the preferences:

Not sure what I’m doing wrong here.

qqilihq · June 23, 2022, 10:01am

You’ll typically need to detect the encoding based on the input data (i.e. as specified by the website). For example, Asian-language websites often do not use UTF-8 encoding.
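The detection described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not how the Palladian nodes or BoilerPipe actually work: `CharsetSniffer`, `find`, and `decode` are hypothetical names, the 2048-byte peek window is arbitrary, and falling back to UTF-8 is an assumption. It checks the HTTP `Content-Type` header first, then a `<meta>` tag in the page, and only then falls back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of KNIME, BoilerPipe, or Palladian. */
public class CharsetSniffer {

    // Matches "charset=..." in a Content-Type header or in an HTML <meta> tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the text, or null if none is named or it is unsupported. */
    public static Charset find(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return null;
        }
    }

    /** Decodes a page body: HTTP header first, then <meta> tag, then a UTF-8 fallback. */
    public static String decode(byte[] body, String contentTypeHeader) {
        Charset cs = find(contentTypeHeader);
        if (cs == null) {
            // Peek at the first bytes as ISO-8859-1, which maps every byte to a char,
            // so ASCII markup such as <meta charset="..."> stays readable.
            int n = Math.min(body.length, 2048);
            cs = find(new String(body, 0, n, StandardCharsets.ISO_8859_1));
        }
        if (cs == null) {
            cs = StandardCharsets.UTF_8; // assumption: UTF-8 as the last resort
        }
        return new String(body, cs);
    }
}
```

The decoded string could then be handed to an extractor that accepts HTML text instead of a URL, so the extractor no longer has to guess the encoding. Whether that fits the BoilerPipe version bundled in the component has not been verified here.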

Therefore, in the Palladian nodes, we have quite a bit of “magic” to detect the encoding of a website (based on explicitly given tags and headers, plus some heuristics if these do not exist) and then to decode it properly. So, as an alternative solution, you might want to look at the following combo:

HURIMOZ · June 24, 2022, 10:44pm

Hi @mlauber71, it looks like the problem comes from the component input of the Web Text Scraper being locked and excluding UTF-8. See below:

I can’t seem to unlock the component to take out the UTF-8 exclusion. Why is that?

mlauber71 · June 25, 2022, 10:26am

@HURIMOZ this UTF-8 seems to be a variable. I don’t think this is the point where you can set something. You could try letting the variable through by disconnecting the component from the source, but I doubt it will make a difference.

Could you provide a sample file or workflow where this happens, so one might investigate?

HURIMOZ · June 25, 2022, 4:09pm

Hi @mlauber71 see package here.

mlauber71 · June 27, 2022, 3:49pm

@HURIMOZ I think with the suggested Palladian nodes you might be able to extract the data with the correct encoding. The Java package you are using seems to be focused on English-language content. There are some discussions about using it with other languages as well, but the settings were not immediately obvious.

kn_forum_44466_tahitian_web_extract.knwf (46.2 KB)

Another approach could be to use the R package and configure that:
