columns get mixed up when creating bar chart or group by

Learn2019 · December 12, 2018, 11:30pm

I imported a CSV file and all the columns look fine in the File Reader (I have to use "allow short lines" for it to be read correctly). However, when I create a bar chart or use the GroupBy node, the columns get mixed up. For example, the gender column only has male, female and NA values, but for some reason the bar chart reads data from another column named hair color.

daria.goldmann · December 14, 2018, 6:02pm

Hi Learn2019,

What are the settings you are using in the configuration of the Bar Chart (JavaScript) node?
And what would you like to see as a result?

Kind regards,
Daria

Learn2019 · December 15, 2018, 12:44am

Daria,

Here is the configuration. I just need to see a bar chart with only three values: Male, Female and NA.

mlauber71 · December 15, 2018, 11:23am

Are you able to share the data and the workflow? If you are, we might take a look. These things come to mind:

I would recommend doing a group by on the columns in question to check whether they really do not contain the unwanted data (if they do, the problem is with the import).
You mentioned you had to use “short line” to import the data. CSV files can be tricky if a separating character also appears in the data itself. KNIME is sometimes not so great at handling that.
From the screenshot there seem to be quotes, and maybe even nested quotes, in your data; this could easily mess up the columns on import.
You could try doing the import with the R package Readr *1), or you could try the File Reader of KNIME instead of the CSV Reader *2); they give you more control over how to import the data.
If you can get the data in a more stable format than CSV, you might want to use that.

*1)

*2) KNIME Learning Center | KNIME
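
The group-by check suggested above can be sketched outside KNIME as well. This is only an illustration in Python, not the workflow from this thread: the sample rows, the column names and the helper `value_counts` are made up. It assumes one row is a "short line" with a field missing, which is enough to make a hair color show up in the gender column.

```python
import csv
import io
from collections import Counter

# Hypothetical sample data. The "Bob" row is a short line: one field is
# missing, so the remaining values slide one column to the left.
SAMPLE = (
    "name,first_appear,gender,hair_color\n"
    "Alice,1940,female,black\n"
    "Bob,male,brown\n"
    "Carol,1951,NA,red\n"
)

def value_counts(text, column):
    """Count the distinct values of one column, like a group by with a count."""
    reader = csv.DictReader(io.StringIO(text))
    return Counter(row[column] for row in reader)

counts = value_counts(SAMPLE, "gender")
print(counts)
```

If the counts for gender contain anything other than male, female and NA, the shift already happened during import and not in the chart.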

Learn2019 · December 17, 2018, 4:58pm

mlauber71,

Thank you very much for your help. It is definitely a problem with the import. I looked at it more carefully and it seems that several rows are not imported correctly. I used the File Reader node but it did not solve the problem. I am new to KNIME. I tried the R Source node but I am not sure how to make it work. I will continue working on it and post a reply when I solve it.

mlauber71 · December 17, 2018, 5:11pm

If you are able to upload the data or a sample that shows the potential problems, we would be able to have a look. I would recommend trying the R package or getting the data in a more ‘stable’ format, for example with a more exotic separator like “|” or even “¤” (code 207 in the extended ASCII code page 850). We used to call this the “run over turtle”; it is very uncommon, though still part of extended ASCII.
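
As a rough sketch of the separator idea: if the data can be parsed correctly once (or exported again at the source), it can be written back with “|” as the separator, so commas inside values no longer matter. The function name `rewrite_with_delimiter` and the sample row are assumptions for illustration only.

```python
import csv
import io

def rewrite_with_delimiter(text, new_delimiter="|"):
    """Parse comma-separated text (respecting quotes) and write it back
    with a different separator."""
    rows = list(csv.reader(io.StringIO(text)))
    out = io.StringIO()
    writer = csv.writer(out, delimiter=new_delimiter, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()

# Hypothetical sample: the date value contains a comma and is quoted.
SAMPLE = 'name,first_appear,publisher\nAlice,"1940, June",dc\n'
converted = rewrite_with_delimiter(SAMPLE)
print(converted)
```

This only helps when the new separator never occurs in the data itself, which is the reason for picking an uncommon character.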

Learn2019 · December 17, 2018, 6:21pm

Thank you! Here is the link:

https://drive.google.com/file/d/1_csuYvyT1daXAJZqFxelRDcrr1u2l5TW/view

mlauber71 · December 17, 2018, 8:50pm

With the help of R’s Readr I was able to load the data, and now there seems to be no mix-up in the data. Some columns still need serious work if you want to use them, though, like first_appear, which is a strange mixture of dates and missing values.

You should analyse the data with group bys and check it for plausibility, because even with Readr there is no guarantee that it caught all the quirks. It looks like a mixture of several databases with slightly different structures. You might want to check them.
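
One possible plausibility check, besides the group bys, is to compare the number of fields in each row with the number of header columns. The sketch below is an assumption about what such a check could look like, not something taken from the attached workflow; `rows_with_unexpected_width` and the sample rows are hypothetical.

```python
import csv
import io

def rows_with_unexpected_width(text):
    """Return (line_number, field_count) for every row whose number of
    fields differs from the header's."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) != len(header)
    ]

# Hypothetical sample: line 3 is too short, line 4 is too long.
SAMPLE = (
    "name,first_appear,gender,hair_color\n"
    "Alice,1940,female,black\n"
    "Bob,male,brown\n"
    "Carol,1951,NA,red,extra\n"
)
suspicious = rows_with_unexpected_width(SAMPLE)
print(suspicious)
```

Rows flagged this way are good candidates for a closer look, although a row with the right width can of course still be wrong.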

This form of quoting very likely drives a KNIME node mad

kn_example_comic.knar (2.2 MB)

Learn2019 · December 17, 2018, 9:35pm

Thank you so much! Much appreciated.

daria.goldmann · December 20, 2018, 10:33am

You could use the Line Reader node to just read the file line by line. Connect it to the Cell Splitter node to split each line using a specified delimiter. Both nodes let you keep the column headers, so you won’t lose this information. Then you could filter out unwanted rows and columns, e.g. those with missing values, using the respective filter nodes.
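
For readers who want to see the idea outside KNIME, the three steps can be sketched roughly like this in Python. The function names are made up to mirror the nodes, the sample data is hypothetical, and treating "NA" and empty cells as missing is an assumption.

```python
def read_lines(text):
    """Line Reader analogue: one string per line, no parsing."""
    return text.splitlines()

def split_cells(lines, delimiter=","):
    """Cell Splitter analogue: the first line gives the headers,
    all other lines are split on the delimiter."""
    header = lines[0].split(delimiter)
    rows = [line.split(delimiter) for line in lines[1:]]
    return header, rows

def drop_incomplete(header, rows, missing=("", "NA")):
    """Row Filter analogue: keep rows that have the expected width
    and no missing cells."""
    return [
        row for row in rows
        if len(row) == len(header)
        and not any(cell.strip() in missing for cell in row)
    ]

# Hypothetical sample: "Bob" has a missing value, "Carol" is a short line.
SAMPLE = "name,gender,hair_color\nAlice,female,black\nBob,NA,brown\nCarol,red\n"
header, rows = split_cells(read_lines(SAMPLE))
clean = drop_incomplete(header, rows)
print(header, clean)
```

Note that the plain split does not know about quotes, so it only works when the delimiter never appears inside a value.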

Best,
Daria

Learn2019 · December 20, 2018, 9:36pm

Thank you!!! This is very helpful.

mlauber71 · December 20, 2018, 9:59pm

<smartypants mode>
The problem is that you would lose the information when you might just as well have kept it. And in this specific dataset, with just the split by “,”, you will get a systematic shift further down the line because of the strange
, “1940, June”, dc
construct.
</smartypants mode>
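
The shift can be reproduced with a small Python sketch. The sample line is made up around the construct quoted above, and it assumes the file contains plain straight quotes (the curly ones here come from the forum rendering).

```python
import csv
import io

# Hypothetical line built around the construct from the post.
LINE = 'Alice, "1940, June", dc'

# Plain split on ",": the quoted date is cut in two, so every cell after
# it lands one column too far to the right.
naive = [cell.strip() for cell in LINE.split(",")]

# Quote-aware parsing; skipinitialspace=True is needed so that the quote
# following ", " is recognised as the start of a quoted field.
aware = next(csv.reader(io.StringIO(LINE), skipinitialspace=True))

print(len(naive), naive)
print(len(aware), aware)
```

The plain split yields four cells for a three-column row, while the quote-aware parse keeps the date together in one cell.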