Can someone help with the situation below?
I am working on a dataset in which more than half of the variables are categorical, and I need to convert those columns to numerical. A couple of them are ordinal. Could you please confirm whether dummy encoding can be performed in KNIME? If so, how is it done and which node is used? If not, please suggest an alternative solution.
Please respond ASAP, as I need this urgently. Thanks in advance.
If it’s urgent, I would recommend being as complete as possible. It would go a long way if you could elaborate on the current input, expected output, example data set, screenshots, workflows, etc.
There is a whole bunch of stuff you can do with a Variable Expression node.
Can you clarify a couple of points from your description? To help you, a more detailed explanation would be a plus, such as the end goal of your data preparation…
- When you say -“i need to convert those columns into numerical”- there are two ways to do this:
  - convert to factor
  - dummy variables
Depending on the end-goal analysis, you would prefer one or the other.
- When you say -“there are a couple of ordinal variables”- I interpret this as ‘qualitative ordinal’. If that is the case, you would need to provide some sample data showing how your data is structured…
@ArjenEX has already anticipated my answer by requesting a more detailed description of the challenge.
Thanks @ArjenEX and @gonhaddock
My goal is to predict the patient's length of stay in hospital.
Stay (in days) - already numerical
Below are the columns:
Department - gynecology, radiotherapy, surgery, TB & Chest disease, anesthesia
doctor_name - Dr John, Dr Simon, Dr Isaac, Dr Mark
gender - Male, Female
Type of Admission - Emergency, Urgent, Trauma
health_conditions - Asthama, Other, None, High Blood Pressure, Other, Heart disease
Insurance - Yes, No
Severity of Illness - Minor, Moderate, Extreme
Please find attached below the workflow that I have built so far.
Hospital LOS.knwf (34.1 KB)
@booramaravind if you want a quick way of preparing data for machine learning, you could consider using the vtreat package, which does it all in one step and also encodes categorical data.
Thank you for sharing @booramaravind
However, the data is not included in the workflow. You can uncheck the ‘reset workflow’ checkbox when exporting the workflow. Anyhow, the updated description gives us a more complete view of the scope.
I have some previous work already built with KNIME base nodes; it can be used to develop a component if you are not in a hurry, as I’m not currently working at my PC.
Maybe @ArjenEX’s proposed solution is worth exploring as well…
The first step is the ordinal variables; you can replace them easily from a simple dictionary with the help of the ‘Value Lookup’ node, which replaces the deprecated ‘Cell Replacer’ node (try either node name).
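For anyone following along outside KNIME, the dictionary replacement can be sketched in a few lines of plain Python. This is only an illustration: the ordering Minor < Moderate < Extreme and the integer codes 1 to 3 are my assumption based on the column description earlier in the thread, and the names `SEVERITY_ORDER` and `encode_ordinal` are made up.

```python
# Hypothetical dictionary for the ordinal column; the ordering and the
# integer codes are assumptions, adjust them to your data.
SEVERITY_ORDER = {"Minor": 1, "Moderate": 2, "Extreme": 3}


def encode_ordinal(values, mapping):
    """Replace each category with its integer rank from the dictionary.

    Raises KeyError on a value missing from the dictionary, so typos in
    the data are caught instead of silently becoming missing values.
    """
    return [mapping[v] for v in values]


# Made-up sample values for 'Severity of Illness'.
severity = ["Moderate", "Minor", "Extreme", "Minor"]
encoded = encode_ordinal(severity, SEVERITY_ORDER)
print(encoded)
```

Keeping the dictionary as a separate table (or dict) makes the chosen order explicit and easy to review, which matters because the model will treat the codes as ordered numbers.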
I’ve put together some previous work… draft version
For the categorical ones, I’ve created a component that can do the job. I’ve tested it with a public dataset. Let me know if it works properly with your data.
Thanks, but I am not able to find the Variable Expressions node in 4.7.3.
@gonhaddock thanks for sharing the flows, but I am not able to find the Value Lookup and Dummy Variables nodes in 4.7.3.
This is the first I have noticed that the ‘Cell Replacer’ node is deprecated; it may be a change made ahead of version 5. ‘Cell Replacer’ should be available in your system.
There aren’t any details about versions on the node hub site.
It is a very simple substitution from your ordinal dictionary table into the data table. Once the ordinal columns are transformed (to numeric, i.e. integer!) you can feed all the data into the component that I am sharing.
Once inside the ‘Dummy_Variables’ component, numeric columns are bypassed, since it handles only string-type columns, creating Boolean dummies.
The workflow that you shared didn’t have any data saved within it, nor in the workflow data area. You can export your workflow with embedded data by unchecking the ‘Reset Workflow(s) before export’ checkbox; it is checked by default.
Anyhow, I’ve prepared a workflow with dummy data based on your description:
20230621_cell_replacer_for_dummy_variables_v3.knwf (182.0 KB)
PS.- I found a bug in my Dummy_Variables workflow on the Hub (a misconfiguration of a column filter); I will update it later today …
I assume the KNIME team thought Value Lookup sounds more like Excel’s VLOOKUP
@Daniel_Weikert, it sounds terrific
Going back to the subject: the bug in the Hub workflow has been fixed.
I used the One to Many node with the Portuguese student data set available from kaggle.com
But you can also do your own encoding to create variables, although this takes a lot more effort.
Or, you can use Python’s ability to do something similar (but I forget what it is called).
As suggested by @PhilTroy, One to Many is basically one-hot encoding, as in e.g. sklearn or pandas’ get_dummies.
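To make concrete what one-hot encoding produces, here is a minimal plain-Python sketch. It is not how the One to Many node or pandas is implemented; the helper `one_hot`, the `<column>_<value>` naming, and the sample rows are made up for illustration. In pandas, `pandas.get_dummies` does this for you.

```python
def one_hot(rows, column):
    """Return copies of `rows` with `column` replaced by 0/1 indicator columns.

    One indicator column is created per distinct value, named
    "<column>_<value>", in sorted order of the values.
    """
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Keep every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Made-up sample rows using column names from earlier in the thread.
patients = [
    {"Stay": 5, "Insurance": "Yes"},
    {"Stay": 12, "Insurance": "No"},
    {"Stay": 3, "Insurance": "Yes"},
]
for row in one_hot(patients, "Insurance"):
    print(row)
```

Numeric columns such as `Stay` pass through untouched; only the named string column is expanded.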
Any progress with this challenge?
As the subject is related to hospitals and human health, which I consider important, I wouldn’t like to see you tangled up in a multicollinearity mess.
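To spell out the multicollinearity concern: a full set of dummies for one categorical column always sums to 1 in every row, so together they duplicate the intercept of a linear model. The usual remedy is to drop one reference level per column. Below is a minimal sketch with made-up data; the helper `drop_reference_level` is hypothetical, and the choice of ‘Emergency’ as the reference is arbitrary. In pandas, `get_dummies(..., drop_first=True)` has the same effect.

```python
# Made-up dummy columns for 'Type of Admission', one row per patient.
dummies = [
    {"Emergency": 1, "Trauma": 0, "Urgent": 0},
    {"Emergency": 0, "Trauma": 0, "Urgent": 1},
    {"Emergency": 0, "Trauma": 1, "Urgent": 0},
]

# Every row sums to 1: the columns are perfectly collinear with a constant.
row_sums = [sum(row.values()) for row in dummies]
print(row_sums)


def drop_reference_level(rows, reference):
    """Remove one dummy column; its level becomes the all-zeros baseline."""
    return [{k: v for k, v in row.items() if k != reference} for row in rows]


reduced = drop_reference_level(dummies, "Emergency")
for row in reduced:
    print(row)
```

After dropping the reference level, a row of all zeros means ‘Emergency’, and the remaining coefficients are read relative to that baseline.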
Hi @gonhaddock, the issue got resolved. Thanks for following up.