In KNIME, Is there a way to search for non-ascii characters in a csv file

HI,
I have a 128505 row of data with columns A-K. When I add it to my python script the file is being rejected because of non-ascii characters. Is there a way to search for these non-ascii characters in my csv file using KNIME or if need be I can do it in my excel file.
Thanks for your help,
Scott

This could solve your issue in Python.
https://linuxconfig.org/syntaxerror-non-ascii-character-python-with-utf-8-encoding

Thank you for the info. I would prefer to remove the non ascii characters rather than excepting them in my code.

Hi there @sgilmour,

you can read data into KNIME, remove all non-ascii characters with regex and write it back.

Br,
Ivan

1 Like

That would work. I am up for a good challenge. How would I go about doing that? Which Nodes would I need to setup to configure. When you say read data into KNIME can I do this from my CSV file?
Thanks for all your help
Scott

1 Like

Hi @sgilmour,

first read data into KNIME using File Reader or CSV Reader (this can not be done from within CSV file). Then you need to create loop to loop over every column and apply appropriate regex (this one could work: regexReplace($column_name$,"[^u0000-u007F]+" ,"" )) in String Manipulation node to remove all non-ascii characters. When done use CSV Writer to write data back. For column looping check this example:

Br,
Ivan

6 Likes

Hi Ivan,
I took my file I created with CSV Writer and imported it into my CSV Reader and I am already getting an error in CSV Reader.
Execute Failed: New line in quoted string: (for closing quote missing) in line 11.

I know I have a lot of cells that have a need for a new line that would need to be changed since I have over 100,000 rows of data. Is there an easy way to fix this issue when creating my file in CSV Writer.
Thanks
Scott

You could try and use R with the readr package

Or you could try and use the file reader

Or you could try the advanced settings and allow non complete rows (citing from memory since I am at my mobile device at the moment)

1 Like

Hi there @sgilmour,

you created this file with KNIME? If so then you can try to remove non-ascii characters prior to writing file with CSV Writer node?

Br,
Ivan

Thanks.
Is there a recommended node I should use?
I am going to try the file reader node rather than the csv reader and try the loop option that was mentioned before.

Hi @sgilmour,

try approach and nodes from my second reply.

Br,
Ivan

Hi Ivan,
That is what I am trying now. File Reader ----- Column List Loop Start ------String Manipulation ---- Loop End
Just working on how to configure the nodes

1 Like

In String Manipulation it is listing all my columns as one big column. How do I get them to list as seperate columns.
Thanks
Scott

1 Like

Hi @sgilmour,

so if you check your data after File Reader node you’ll see that all your data is in one column. You can either define column separator in File Reader node or use Cell Splitter node after you read your data. Instead of column_name in String Manipulation node you should then use real column name.

Br,
Ivan

The Cell Splitter worked. and then I did the column list loop start node


Then I added the node string manipulation and it adding all my columns as 1 again I was going to add the regex expression to each column.

Here is what I have so far for the workflow.

Hi @sgilmour,

check Split input column name for output column names in Cell Splitter node to have proper column names (if haven’t already). Also you can filter columns you don’t need prior to loop. Then check example workflow I linked in one of my previous replies regarding how to loop over multiple columns and applying same manipulation over each one of them.

Br,
Ivan

Hi Ivan,
Is this what you were thinking. I am still working on configuring but wanted to make sure I am on the right track.

Hi Ivan,
I modeled my setup after the Looping over all columns and manipulation of each workflow. I am getting stuck when I reach the Math Formula Node. it is giving me an error no such column: Anonymous column when it is added in n the previous node.
Column List Loop Start

Then the Column Rename
image
Then the Math Formula Node


Then I get the error

Do I need to add something to my workflow to get it to work?

sample output of file.
test_output.xls (26 KB)

Hi @sgilmour,

yes! That is the one I was thinking of. You need to use String Manipulation instead of Math Formula node. Also seems to me you should use Column Filter node to remove all these columns you have in Exclude in Column List Loop Start node.

Br,
Ivan