clarification on why knime is treating String data type as integer for some data sets

Attached file for your reference, which I have made a comparison between the 2 data files. Can you please refer to those highlighted in yellow where I have indicated my questions?

why some are in integer while others are in string even though the data is in number form.

Please clarify on why knime is treating them as different data types.

Confusion_over_categorisation.xlsx (666.7 KB)


Hi @booramaravind,

Could you please share your workflow? I cannot reproduce this problem. For me, all numeric columns are integer. What is your KNIME version? Perhaps you have changed the types in the “Transformation” tab of the Excel Reader node.

What happens when you use a new Excel Reader node? Does the problem persist?

:blush:

1 Like

@armingrudd thanks for your reply
Part-B.knwf (18.3 KB)
DT &C_W3_PracticeExercise_Workflow- HR Attrition (9).knwf (97.5 KB)
Please go through highlighted questions in the excel sheet ,which i shared before, and give some clarity on why in one data set categorical data is considered as string, when the same scenario in another data set those are considered as integer, why?
KNIME veersion 4.5.2

Everything is working fine. You have provided an XLSX file but the workflow uses CSV. I converted the file to CSV and read it both with the CSV Reader node in your workflow and a new CSV Reader node.
The node in your workflow is configured to read those columns as string. A new CSV Reader node reads them as integer. If you just replace the CSV Reader in your workflow with a new CSV Reader node, the problem would be solved. Or you can change the column types when your read them in the “Transformation” tab of the CSV Reader node.

:blush:

2 Likes

@armingrudd Thanks for clarification,
so the problem is with data set type?

The problem is with your node configuration. You need to use a new node or change the configurations you have already set.

:blush:

1 Like

@armingrudd can you please tell me what needs to be change in node configuration.

In the node configurations, go to “Transformation” tab and change the column types for the desired columns. You can change types from string to integer.

2 Likes

@armingrudd thanks for reply , may i know the reason why to change data type.?

If you use a new CSV Reader then no need to change types, but if you wanna fix the current CSV Reader in your workflow to have integer values instead of string, you need to change the types in the “Transformation” tab of the node’s configuration window.

Hello @booramaravind
The data wrangling can become a very complex issue. We cannot answer with a single reason on why the automated selection interpreted one or the other format, without a detailed revision of the source and a process description on the data generation.

We cannot see the whole dataset in your pictures; The workflows that you shared are neither useful aiming to clarify your question, as the data is not embebed in the data section. Despite I can identify a CSV format sourcing the workflow based on the CSV reader :face_with_monocle:

I would ask some questions regarding the CSV source, like:

  • Was it generated as an output of a DB query?
  • Does it need some specific encoding type?
  • Was it originally an excel converted into CSV for some specific reason?
  • If it was an Excel, was it a collaborative document?
  • If it was collaborative, there were different regional configurations working with the document? And if do so, who saved it last?
    …

Why all this is important? Because when your column is not interpreting the the expected data content, then the way you process it, can lead you to loss of information issue. Some of them are easy to fix, some other are not that easy.

When you enforce a column to read numeric, it can lead you to a loss of information issue (guess part of the information stored as double, with comma 6,0 decimal or as text “150”), in this case this information will be replaced with nulls.

Reading the data as string preserves all the information but affects to the usability of the data, as extra manipulation is required further to fix the issue.

I’ve prepared a CSV with some of the typical issues interfering when reading from uncontrolled sources:

My suggestion is to spend some time trying to understand the origin of the issue, and judge the better digest approach. In the above example I could enforce to numeric for columns 1 and 3(!), but maybe is not a good approach for columns 2 and 4, where some action may be expected: or in the process, or in the source itself, or in the reading configuration, or post reading transformations …

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.