This sounds easier than it is. I have to read files with an unknown number of columns and unknown column names, and I need KNIME to read all of these columns as strings.
For example, when reading a file with US postal codes, KNIME decided the column was an integer because there were no alphabetic characters. In the process, postal codes with leading zeros are damaged, so I can't convert them to strings after the read.
This is easy to do with the CSV Reader and File Reader nodes if you know the column names in advance. But since I don't know the column names in advance, I would have to change the File Reader configuration manually for every file submitted. That's not an option.
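To make the failure concrete outside of KNIME, here is a minimal pandas sketch (column name and data made up) of what happens to those postal codes:

import io

import pandas as pd

# Made-up sample of US postal codes with leading zeros
csv_data = "zip\n02139\n10001\n00501\n"

# Default type inference sees only digits, makes the column an integer,
# and silently drops the leading zeros
df = pd.read_csv(io.StringIO(csv_data))
print(df['zip'].tolist())   # [2139, 10001, 501]

# Forcing every column to string at read time keeps them intact
df = pd.read_csv(io.StringIO(csv_data), dtype=str)
print(df['zip'].tolist())   # ['02139', '10001', '00501']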
As a brute-force solution: load the CSV without telling the reader there is a header. The header row then mixes with the data, so every column is loaded as text (as long as no header is purely numeric, like 123). Then write the result to a temporary CSV and read that file back, this time specifying that there is a header. For small files this should work.
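For what it's worth, here is a minimal pandas sketch of the same idea (sample data made up); in pandas the temporary-file roundtrip isn't needed, because the first row can be promoted to column names directly:

import io

import pandas as pd

# Made-up sample where the column names are not known in advance
csv_data = "zip,city\n02139,Cambridge\n10001,New York\n"

# header=None keeps the header row in the data, and dtype=str
# forces every column to be read as text
df = pd.read_csv(io.StringIO(csv_data), header=None, dtype=str)

# Promote the first row to column names, then drop it from the data
df.columns = list(df.iloc[0])
df = df.drop(index=0).reset_index(drop=True)

print(df.dtypes)           # all object (string)
print(df['zip'].tolist())  # ['02139', '10001'] - leading zero intact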
Thank you both. The R solution would work well, but it would introduce something my team is not familiar with. I do love the option to read in the header: I can extract that first line and convert it to column names without too much effort.
Use the File Reader node. Once you load a CSV file, you can click on a column header to change the column properties, including changing a number to a string.
I happen to work on a workflow which allows you to dynamically set the column types. Your challenge can be accomplished by exploiting default node behavior.
There are three things that are important to know and adhere to:
1. The CSV must be saved by the source with enforced quotation, regardless of whether it is necessary (e.g. for int columns); see the sketch after this list.
2. The CSV Reader must have “Keep quotes” enabled (1st screenshot below).
3. Any unknown column must be forced to type String (2nd screenshot below).
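To illustrate the enforced quotation from point 1, here is a small sketch using Python's standard csv module (the file name is a placeholder):

import csv

rows = [
    ['zip', 'count'],
    ['02139', '434'],
    ['10001', '554'],
]

# QUOTE_ALL wraps every field in quotes, whether needed or not,
# so even purely numeric values reach the reader as "02139"
with open('quoted.csv', 'w', newline='') as f:
    csv.writer(f, quoting=csv.QUOTE_ALL).writerows(rows)

# Resulting file contents:
# "zip","count"
# "02139","434"
# "10001","554"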
Quick question @mlauber71, and apologies if I stray from this conversation's purpose with it, but this might help me with another challenge:
Do you happen to know all, or as many as possible, of the Python representations of the column types listed? I wonder if, via your solution, it would be possible to not only extract but also enforce data types during the read.
@mwiegand that should also be possible. It took ChatGPT only about four tries to put this together, and there still might be more data types out there, which also shows the limitations of this automatic programming approach …
import ast
import pandas as pd
# Create a sample CSV file
csv_data = '''
col1;col2;col3;col4;col5;col6;col7
1.0;true;2020-01-01;1:00:00;cat1;434;{'a', 'b', 'c'}
2.5;false;2020-02-01;2:00:00;cat2;554;{'d', 'e', 'f'}
3.0;true;2020-03-01;3:00:00;cat3;677;{'g', 'h', 'i'}
'''
# Write the sample file; in the original KNIME workflow var_csv_file
# would presumably come from a flow variable
var_csv_file = 'sample.csv'
with open(var_csv_file, 'w') as f:
    f.write(csv_data)
# Read the CSV file, specifying the data types of the columns
df = pd.read_csv(
    var_csv_file,
    sep=';',
    dtype={'col1': 'float', 'col2': 'bool', 'col5': 'category',
           'col6': 'int64', 'col7': 'str'},
    parse_dates=['col3'],
)
# Convert the col4 column to a timedelta object
df['col4'] = pd.to_timedelta(df['col4'])
# Parse the string representation in 'col7' into Python set objects
df['col7'] = df['col7'].apply(ast.literal_eval)
# View the data types of the columns
print(df.dtypes)
I used it that way, but with sep=None to auto-detect the separator and encoding='unicode_escape' to avoid some errors that were coming up, and it worked too. I'm not a Python pro; what do you think of this solution?
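Concretely, the modified read looks roughly like this (the path is a placeholder, and dtype=str is my assumption to match this thread's goal of reading everything as strings); note that sep=None needs the slower Python parsing engine for separator sniffing, which pandas would otherwise fall back to with a warning:

import pandas as pd

df = pd.read_csv(
    'some_file.csv',            # placeholder path
    sep=None,                   # sniff the separator from the file
    engine='python',            # required for sep=None
    encoding='unicode_escape',  # tolerate odd escape sequences
    dtype=str,                  # assumption: keep everything as strings
)
print(df.dtypes)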