I am working on a dataset in which more than half of the variables are categorical, and I need to convert those columns to numerical ones. A couple of the variables are ordinal. Could you please confirm whether dummy encoding can be performed in KNIME? If so, how is it done and which node is used? If not, please suggest an alternative solution.
Please respond ASAP, as I need this urgently. Thanks in advance.
If it's urgent, I would recommend being as complete as possible. It would go a long way if you could elaborate on the current input, the expected output, an example data set, screenshots, workflows, etc.
There is a whole bunch of stuff you can do with a Variable Expression node.
Hello @booramaravind
Can you clarify a couple of points from your description? A more detailed explanation would help us help you, for example what the end goal of your data preparation is...
When you say "I need to convert those columns into numerical", there are two ways to do this:
convert to factor
dummy variables
Depending on the end goal of the analysis, you would prefer one or the other.
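To make the difference between the two options concrete, here is a minimal sketch in Python with pandas, outside KNIME. The tiny data frame is made up for illustration; only the column name and category labels come from this thread.

```python
import pandas as pd

# Made-up sample rows; labels taken from the "Type of Admission" column
df = pd.DataFrame({"Type of Admission": ["Emergency", "Urgent", "Trauma", "Urgent"]})

# Option 1: convert to factor, i.e. one integer code per category.
# The codes carry no real order; they only identify the category.
codes, labels = pd.factorize(df["Type of Admission"])

# Option 2: dummy variables, i.e. one 0/1 column per category.
dummies = pd.get_dummies(df["Type of Admission"], dtype=int)

print(codes.tolist())
print(dummies)
```

Integer codes keep the table narrow but suggest an ordering that distance-based or linear models may take literally; dummy columns avoid that at the cost of extra columns.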
When you say "there are a couple of ordinal variables", I interpreted that as "qualitative ordinal". If that is the case, you would need to provide some sample data showing how your data is structured...
@ArjenEX has already anticipated my answer by requesting a more detailed description of the challenge.
Thanks @ArjenEX and @gonhaddock
My goal is to predict the patient's length of stay in hospital:
Stay (in days) - already numerical
Below are the columns:
Department - gynecology, radiotherapy, surgery, TB & Chest disease, anesthesia
doctor_name - Dr John, Dr Simon, Dr Isaac, Dr Mark
gender - Male, Female
Type of Admission - Emergency, Urgent, Trauma
health_conditions - Asthama, Other, None, High Blood Pressure, Other, Heart disease
Insurance - Yes, No
Ordinal variable:
Severity of Illness - Minor, Moderate, Extreme
Please find attached below the workflow that I have built.
@booramaravind if you want a quick way of preparing data for machine learning, you could consider using the vtreat package, which would do it all in one step and also encode categorical data.
Thank you for sharing @booramaravind
However, the data is not included in the workflow. You can uncheck the reset workflow check box when exporting the workflow. Anyhow, the updated description gives us a more complete view of the scope.
I have some previous work already built with KNIME base nodes; it can be used to develop a component if you are not in a hurry, as I'm not currently working at my PC.
Maybe @ArjenEX's proposed solution is a way to explore as well...
The first step is the ordinal variables; you can replace them easily from a simple dictionary with the help of the "Value Lookup" node, which replaces the deprecated "Cell Replacer" node (try either node name).
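For reference, the dictionary substitution for the ordinal column looks like this as a minimal pandas sketch, outside KNIME. The numeric codes 1, 2, 3 are an assumption; any increasing sequence that respects Minor < Moderate < Extreme would do. The sample rows are made up.

```python
import pandas as pd

# Assumed dictionary: ordinal label -> integer rank (codes are illustrative)
severity_order = {"Minor": 1, "Moderate": 2, "Extreme": 3}

# Made-up sample rows for the "Severity of Illness" column
df = pd.DataFrame({"Severity of Illness": ["Moderate", "Minor", "Extreme", "Minor"]})

# Look each label up in the dictionary; unknown labels would become NaN
mapped = df["Severity of Illness"].map(severity_order)

# Fail loudly if a label is missing from the dictionary
if mapped.isna().any():
    raise ValueError("Found severity labels that are not in the dictionary")

df["Severity of Illness"] = mapped.astype("int64")
print(df)
```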
I've put together some previous work... draft version
For the categorical ones, I've created a component that can do the job. I've tested it with a public dataset. Let me know if it works properly with your data.
Hi @booramaravind
"Cell Replacer" being deprecated is news to me; it may be a change introduced with the version 5 settings. "Cell Replacer" should still be available in your system.
There aren't any details about versions on the node hub site.
It is a very simple substitution from your ordinal dictionary table into the data table. Once the ordinal columns are transformed (as numeric, i.e. integer), you can feed all the data into the component that I am sharing.
Once inside the "Dummy_Variables" component, numeric columns are bypassed, since it handles only string-type columns, creating Boolean dummies.
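The behaviour described here (numeric columns passed through, string columns turned into Boolean dummies) can be sketched in pandas as follows. This is not the component's actual implementation, only an illustration of the same idea on made-up rows, using column names from this thread.

```python
import pandas as pd

# Made-up sample rows; "Stay" is numeric, the other columns are strings
df = pd.DataFrame({
    "Stay": [3, 7, 5],
    "gender": ["Male", "Female", "Male"],
    "Insurance": ["Yes", "No", "Yes"],
})

# Pick only the non-numeric columns; numeric ones are left untouched
string_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# One Boolean column per category, named "<column>_<value>"
encoded = pd.get_dummies(df, columns=string_cols, dtype=bool)
print(encoded)
```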
The workflow that you shared didn't have any data saved within it, nor in the workflow-relative data area. You can export your workflow with embedded data by unchecking the "Reset Workflow(s) before export" check box; it is checked by default.
Hello @booramaravind
Any progress with this challenge?
As the subject is related to hospitals and human health, I consider it important; I wouldn't like to see you tangled up in a multicollinearity mess.
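One common source of multicollinearity with dummy variables is keeping a column for every category: the columns of one variable then always sum to 1, which is perfectly collinear with a model intercept. A usual remedy, shown here as a pandas sketch rather than as a KNIME recipe, is to drop one category per variable as the reference level. The sample rows are made up.

```python
import pandas as pd

# Made-up sample rows using labels from the "Department" column
df = pd.DataFrame({"Department": ["gynecology", "surgery", "radiotherapy", "surgery"]})

# Full set: one column per category, so every row sums to exactly 1
full = pd.get_dummies(df["Department"], dtype=int)

# Reduced set: the first category (alphabetically) becomes the reference level
reduced = pd.get_dummies(df["Department"], dtype=int, drop_first=True)

print(full.sum(axis=1).tolist())
print(list(reduced.columns))
```

With the reduced set, a row of all zeros stands for the dropped reference category, so no information is lost.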