File Reader - First Column Special Characters

Hello Knime Community!

I have been trying to solve this to no avail, and I don’t even know what terms to search for. In a lot of cases my file reader inserts a couple of special characters at the start of my first column name (see image). The strangest part is that when I read a whole folder, some files have the characters and others do not, which results in duplicated columns.

I can use the transform and rename feature in the reader, but I really want to prevent this from happening altogether.

Thank you in advance for all the support.

image

Hi @davehansen , that’s because the CSV file starts with a BOM (byte order mark) character, and the reader is not interpreting it with the right encoding.

To overcome this, set encoding to UTF-8:

Same thing for File Reader
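
To illustrate what the BOM does to the first column name, here is a small Python sketch (outside of KNIME, so only an analogy). The header and values are made up. Note that in Python it is the `utf-8-sig` codec that strips the BOM, while plain `utf-8` keeps it as an invisible character; that detail is Python-specific and is not a statement about how the KNIME readers behave.

```python
import codecs
import csv
import io

# Hypothetical file content: UTF-8 text with a leading BOM (bytes EF BB BF).
raw = codecs.BOM_UTF8 + "Store,Date,Sales\r\n1,2021-01-01,100\r\n".encode("utf-8")


def first_header(data: bytes, encoding: str) -> str:
    """Decode the bytes and return the name of the first column."""
    text = data.decode(encoding)
    reader = csv.reader(io.StringIO(text, newline=""))
    return next(reader)[0]


latin = first_header(raw, "cp1252")     # wrong encoding: BOM becomes visible junk
plain = first_header(raw, "utf-8")      # BOM survives as the invisible U+FEFF
clean = first_header(raw, "utf-8-sig")  # BOM is recognised and dropped

print(repr(latin), repr(plain), repr(clean))
```

This also shows why files with and without a BOM end up with different names for the same first column.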

Fantastic! This did remove the special characters. Thank you for those insights.

However, as with any data analytics solution, another problem has come up. The reader does not seem to be able to read the column headers, as seen in the first screenshot. If I uncheck the box I can see the headers, but then I need to add them back in as column names. Any ideas?

Image 1:
image

Image 2:
image

Hi @davehansen , I’m wondering if the quotes in the “header” are the issue. What is the format of your file? It looks like a CSV. Can you try with the CSV Reader?

@bruno29a

Yes, the file is a CSV, and the reader is a CSV Reader. I should also point out that I am reading a whole folder because I have multiple files. They should all be identical in structure, just with different sales numbers in each.

Thanks!

Hi @davehansen , is it possible to share one of the files? Preferably one with the headers you pointed out in your screenshot?

I think .csv files can’t be uploaded to the forum. You can just rename the extension to .txt to upload it.

Here are two different files with just the headers:
Headers Only File 1.txt (599 Bytes)
Headers Only File 1.txt (599 Bytes)

Hi @davehansen , it looks like you sent the same file twice.

Also, did you re-create them by copying and pasting the headers? This will not be a real sample.

1- Copy and paste will not create the BOM character
2- This sample is not CSV (comma-separated values), while your data is CSV
3- I am pretty sure I will not have any problem with these headers (you can try it yourself, the Readers will have no problem identifying these as headers)

It would be better if you uploaded one of the files as is (of course with modified extension only to be able to upload)

I just deleted the sales data from the original file and left everything else intact. I am not sure I follow how to change the extension without modifying the file.

I will not be able to upload the original file with all the sales data included (for security reasons).

But this is the original file structure:

The reason I assumed it was re-created is that in your screenshot you defined the delimiter as a comma, but the files you uploaded are delimited by tabs.

There is also no BOM character (I just confirmed by opening the file).
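
For anyone who wants to run the same two checks without opening the file in an editor, here is a rough Python sketch. The function name `inspect_header` is made up, and the delimiter guess is a simple character count on the first line, so quoted headers that contain commas or tabs could fool it.

```python
import codecs


def inspect_header(raw: bytes):
    """Return (has_bom, delimiter) for the first line of a file's raw bytes."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    # utf-8-sig drops a leading BOM if present and is harmless if it is absent.
    lines = raw.decode("utf-8-sig").splitlines()
    first_line = lines[0] if lines else ""
    # Heuristic: the candidate that occurs most often in the header line wins.
    delimiter = max([",", "\t", ";"], key=first_line.count)
    return has_bom, delimiter
```

Usage would be something like `inspect_header(Path("sample.txt").read_bytes())` with `Path` imported from `pathlib`. A tab-delimited file without a BOM gives `(False, "\t")`.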

So, I’m not having any issue with the headers with the file - as I said, you can try it for yourself.

So this is not helping to troubleshoot the issue.

Yes, I can understand this. In this case, in order to give a file with similar content, you should:

  1. Make a copy of the file, let’s call it sample.csv
  2. Open sample.csv, delete everything except first line (headers)
  3. Save file (at this point, the file should still be sample.csv)
  4. Rename the file from sample.csv to sample.txt (rename, not save as)
  5. Upload sample.txt
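
If doing this by hand keeps changing the file, steps 1 to 4 can also be scripted. Below is a minimal Python sketch; `make_header_sample` is a made-up helper name. It works on raw bytes, so the BOM and the delimiters are left exactly as they are. It assumes the original file is not itself called sample.csv and that the header does not contain line breaks inside quotes.

```python
import shutil
from pathlib import Path


def make_header_sample(original, sample_name="sample"):
    """Copy `original`, keep only its first line, and rename the copy to .txt."""
    original = Path(original)
    copy_path = original.with_name(sample_name + ".csv")
    shutil.copyfile(original, copy_path)   # step 1: work on a copy

    with copy_path.open("rb") as f:        # binary mode: nothing gets re-encoded
        header = f.readline()              # step 2: first line, incl. line ending
    copy_path.write_bytes(header)          # step 3: save with only the header

    txt_path = copy_path.with_suffix(".txt")
    copy_path.replace(txt_path)            # step 4: rename only, bytes unchanged
    return txt_path
```

Calling `make_header_sample("sales_2021.csv")` (file name is just an example) leaves the original untouched and produces sample.txt next to it.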

OK, so I followed your process, but my computer automatically changed the file to .txt:
sample.txt (602 Bytes)

Thanks @davehansen , this time the file is comma delimited, and I can see the BOM character.

However, I don’t have any issues having the line as headers:
Default encoding:

UTF-8:

In both encodings, the CSV Reader is able to recognize the line as headers.

Are you able to test with the sample.txt file and see if you have any issues with the headers?

@bruno29a

Thank you for hanging on and helping work through this. Yes, it works as you described, so your method is correct.

Here is where it gets hung up: I am using ‘Files in Folder’ to read the whole folder, which is usually no problem. In this case, though, there are folders within the main folder path, each with its own file (unfortunately these are received as zips and need to be unpacked). I have ‘Include subfolders’ checked. It begins reading as it normally would, but then fails with the first error presented above.

Once I pulled all the files out and into the main folder, the issue went away. At this point I am assuming that it has trouble reading through the folder structure (i.e. folders within the root folder).

Hi @davehansen , I’m guessing that there might be a file that’s not correct.

I would say just add a few files at a time to the folder and see which one makes the Reader fail. There might be a file where the header is messed up (wrong delimiter, a header spanning multiple lines with \r\n, etc.).

EDIT: Sorry, I missed your last part. So it looks like the files are OK… it’s only when it needs to read them via sub folders that it complains. I’m not sure why this would be.

That’s a very good guess, but the error message is not reflecting this, unless it’s a misleading message.

All of this solved my main issue with the BOM characters so that was a win. We will just need to add one more step to our process of file movement until I can solve the reading error. Not a deal breaker, we can manage that.

Thank you for the help!

No problem @davehansen , you can flag post #2 as the solution then.

For your other issue, you could open another thread if you want, but it’s really hard to reproduce.

I created the following structure, where I copied sample.csv (renamed from sample.txt, and with 1 line of data added) into different sub folders, but it still worked without any issue:

+ [New Folder]
+--[sub1]
|    +-sample.csv
+--[sub2]
|    +-sample.csv
+--[sub3]
|    +-[sub31]
|    |   +-sample.csv
|    +-sample.csv
+--sample.csv

So I have 1 file at the root level, 3 files in different 1 level sub folders, and 1 file in 2nd level of sub folder.
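
If you want to rebuild the same layout to test on your side, a short Python sketch could look like this. The file content is a made-up stand-in for sample.csv (BOM, header, one data row), and the tree is created in a temporary directory.

```python
import codecs
import tempfile
from pathlib import Path

# Stand-in for sample.csv: BOM + header + one data row (values are made up).
content = codecs.BOM_UTF8 + b"Store,Sales\r\n1,100\r\n"

root = Path(tempfile.mkdtemp()) / "New Folder"
for sub in ["", "sub1", "sub2", "sub3", "sub3/sub31"]:
    folder = root / sub                      # "" is the root level itself
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "sample.csv").write_bytes(content)

# Recursive listing, comparable to ticking "Include subfolders".
found = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.csv"))
print(len(found), found)
```

Pointing the reader at the printed root folder gives a clean baseline to compare against the real folder.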

I am able to read all these 5 files via Files in folder and Include subfolders without any issues:

It’s going to be hard to reproduce.

@davehansen you could use a try-catch setting to determine which file would cause problems and then skip the file or do some other manipulations.
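
The same idea expressed in Python, as a rough sketch rather than a KNIME workflow: read each file inside its own try/except, collect the ones that fail, and carry on with the rest. The helper name `read_folder` and the optional `expected_columns` check are assumptions for illustration.

```python
import csv
from pathlib import Path


def read_folder(root, expected_columns=None):
    """Read every .csv under `root`; return (rows_by_file, problems)."""
    rows_by_file, problems = {}, {}
    for path in sorted(Path(root).rglob("*.csv")):
        try:
            # utf-8-sig tolerates files with and without a BOM.
            with path.open(newline="", encoding="utf-8-sig") as f:
                rows = list(csv.reader(f))
            if not rows:
                raise ValueError("file is empty")
            if expected_columns is not None and rows[0] != expected_columns:
                raise ValueError(f"unexpected header: {rows[0]!r}")
            rows_by_file[path] = rows
        except (OSError, csv.Error, ValueError) as exc:
            # Record the failing file and skip it instead of aborting the run.
            problems[path] = str(exc)
    return rows_by_file, problems
```

The keys of `problems` then tell you exactly which files in which sub folders need a closer look, for example a tab-delimited file sitting among comma-delimited ones.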