KNIME Python script UnicodeEncodeError

Hi community,

I am using the Python Script node in KNIME 4.7.0 to do some web scraping. However, some of the content triggers this error:
UnicodeEncodeError: 'gbk' codec can't encode character '\xd6' in position 38: illegal multibyte sequence

I can run the script in VS Code without any errors, but in KNIME, using the same environment, it fails.

The lines related to this problem should be:
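# Imports needed by the lines below; url and custom_user_agent are defined earlier in the script
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request(url, headers={"User-Agent": custom_user_agent})
page = urlopen(req)
content = page.read().decode("utf-8")
soup = BeautifulSoup(content)
title = soup.findAll("h2")
for t in title:
    print(t.get_text())  # per the traceback, the error is raised while writing to stdout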

I tried converting to the 'utf-8' character encoding, based on answers I found on Google, but it still returns an error.
Here is the error I get when I execute the script in the Python node:

Executing the Python script failed: Traceback (most recent call last):
  File "", line 29, in
  File "C:\Program Files\KNIME\plugins\org.knime.python3.scripting_4.7.0.v202211291148\src\main\python_kernel_launcher.py", line 652, in write
    self._get_original().write(s)
UnicodeEncodeError: 'gbk' codec can't encode character '\u0336' in position 24: illegal multibyte sequence

How can I solve this problem?

Thank you!
Haoran

Hi,

Please refer to "How to write a good post" and provide additional details:

  • Minimal scripting example to directly reproduce the issue
  • KNIME Analytics Platform version
  • Workflow to reproduce the error
  • KNIME log (View -> Open KNIME log)

Steffen

I also found this question, which is quite similar to mine:
https://forum.knime.com/t/ascii-problem-unicodeencodeerror-in-python-for-symbols-and-umlauts/11886

The solution there is to add the line "export LC_ALL=en_US.UTF-8" at the beginning of the script, but in my case this returns a syntax error.

I would print the text you receive to inspect it. I would also pass an explicit parser to BeautifulSoup (I am not sure what the default one is).
And, as already mentioned, you probably want to upload a sample workflow.
br
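
Since printing the raw text is exactly what fails here, one way to inspect it is to print an ASCII-only description instead. This is a minimal sketch using only the standard library; `non_ascii_report` and the sample string are hypothetical stand-ins for illustration, not part of KNIME or BeautifulSoup.

```python
import unicodedata
from typing import List


def non_ascii_report(text: str) -> List[str]:
    """Describe every non-ASCII character in text using ASCII-only output."""
    report = []
    for index, char in enumerate(text):
        if ord(char) > 127:
            # Fall back to "UNKNOWN" for code points that have no Unicode name
            name = unicodedata.name(char, "UNKNOWN")
            report.append(f"position {index}: U+{ord(char):04X} {name}")
    return report


# Stand-in for a scraped heading; U+0336 is the character from the traceback
received = "price\u0336 list"

# ascii() escapes non-ASCII characters, so this print works on any console codec
print(ascii(received))
for line in non_ascii_report(received):
    print(line)
```

In the original script, the same idea would apply to the value of `t.get_text()` inside the loop.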


@hryuann

I am guessing that you are using Windows with a Chinese locale. From your error, it appears that when you print your results, Python tries to encode the decoded string with the encoding of your operating system locale, the gbk codec. gbk is a Chinese encoding that does not cover all Unicode characters, so when Python reaches a character that cannot be converted, it raises the error.
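
The behaviour described above can be reproduced outside KNIME. The following is a minimal sketch, assuming the output stream really does use the gbk codec as the traceback suggests; `safe_text` is a hypothetical helper name, and escaping characters is a workaround for display only, not a fix for the underlying locale setting.

```python
def safe_text(text: str, encoding: str = "gbk") -> str:
    """Return text with characters the target codec cannot encode escaped."""
    # backslashreplace turns unencodable characters into \xNN or \uNNNN escapes
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


sample = "\xd6sterreich"  # starts with U+00D6, the character from the first error

try:
    # Roughly what print() has to do when the output stream uses gbk
    sample.encode("gbk")
except UnicodeEncodeError as err:
    print("reproduced:", err.reason)

# The escaped version contains only characters that gbk can encode
print(safe_text(sample))
```

Calling `print(safe_text(t.get_text()))` in the loop would be one way to keep the script running while the encoding question is investigated.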

I am guessing that it works in VS Code because the terminal window there supports UTF-8, so Python has an encoding target that covers all of the characters in your source material. Why KNIME behaves differently I cannot say without more information on your machine and on how the KNIME Python node generates its terminal window.

Language locales are a messy topic and create all sorts of problems that Unicode is intended to resolve. However, because there are many legacy files that use non-Unicode encodings, and legacy software (including software written in Python) that relies on them, the path to enforcing UTF-8 as the standard for reading and writing files is long and painful. Note: this applies to user files; the standard for code files has long been UTF-8.

One suggestion I can make is to force Python to ignore the locale setting and use UTF-8 encoding for strings. This cannot be done within KNIME; it must be set before the Python interpreter starts.

You can do that in two ways:

A/ Open PowerShell and enter:

$env:PYTHONUTF8=1
& 'C:\Program Files\KNIME\knime.exe'

This sets Python to use UTF-8 encoding for the session and then starts KNIME. Note that you may need to adjust the path to the KNIME application. This may be a good way to test whether it works for you. If $env:PYTHONUTF8=1 doesn’t work, then try $env:LC_ALL='en_US.UTF-8' (the value must be quoted, otherwise PowerShell tries to run it as a command).

B/ Set the environment variable

If the above works then you can set the PYTHONUTF8=1 environment variable permanently by opening system properties and adding it under Environment Variables…as a user variable.

Be aware that setting this environment variable applies it to every instance of Python that you run, which may cause some applications to break. There is a plan to make this the default setting in Python 3.15, but applications that depend on the locale encoding are still in use. So, use with caution.
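
To check whether either option took effect, the values Python actually sees can be printed from inside the Python Script node. This is a minimal sketch using only standard-library calls; `encoding_report` is a hypothetical helper name, and the `getattr` fallback is there because it is an assumption that KNIME's replacement for `sys.stdout` exposes an `encoding` attribute.

```python
import locale
import os
import sys


def encoding_report() -> dict:
    """Collect the settings that decide how Python encodes printed text."""
    return {
        # True when Python was started in UTF-8 mode (PYTHONUTF8=1 or -X utf8)
        "utf8_mode": bool(sys.flags.utf8_mode),
        # The raw environment variable as inherited by this process, if any
        "PYTHONUTF8": os.environ.get("PYTHONUTF8"),
        # Encoding of the output stream; None if the stream does not report one
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        # Encoding Python derives from the operating system locale
        "preferred_encoding": locale.getpreferredencoding(False),
    }


for key, value in encoding_report().items():
    print(f"{key}: {value}")
```

If `utf8_mode` is still False after starting KNIME from the PowerShell session, the variable probably did not reach the Python process that KNIME launches.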

I’ve had to make a lot of guesses and assumptions, so there is no guarantee that it will work or is the root cause of your problem. But it may give you something to try and clues as to where to look for the answer.

DiaAzul


Thank you DiaAzul!
I tried $env:LC_ALL=en_US.UTF-8 and got an error:

+ $env:LC_ALL=en_US.UTF-8
    + CategoryInfo : ObjectNotFound: (en_US.UTF-8:String) , CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException

It is telling me that it cannot recognize "en_US.UTF-8". I was able to run $env:PYTHONUTF8=1, but unfortunately it does not fix the problem.

Haoran

