I built a workflow for a predictor model. All worked fine.
My predictor model is trained with data about universities.
Explanation: When a university is listed for the first time, an accreditation check needs to be carried out.
The second time, the same (known) university does not need to be checked again.
I have two questions please:
Question 1: In the test set, known (from the trained model) and unknown (new) universities are listed. For the known universities, a prediction (YES or NO) is listed.
For the unknown universities, a question mark “?” is shown.
Question: How can I replace the question mark with a “YES”? (The university needs to be checked when it appears for the first time.)
Question 2: In the basis datasheet (the datasheet to train the model), some universities are listed several times. So the value is once “YES” and the next time “NO”, as the university does not need to be checked again.
For universities that were tested several times, the correct value (NO) is shown in the test set.
However, universities that appear only once in the basis sheet used to train the model only have the value “YES”.
In that case, a “NO” should be shown the second time (e.g. when the university is listed in the test sheet).
(The university is already known and checked.)
However, the model does not recognize that. Do you have a solution for this issue?
@LukasB welcome to the KNIME forum. You might have to do some more explaining about what you want to do and what your data is. What is the target that your model is trying to predict? Is it the YES and NO, or something different?
And the question of how you deal with ‘old’ data (from before the universities were verified) very much depends on your use case, on what you want to do with it, and on what the data actually says in that regard.
If it is just about the missing YES/NO data, you could simply use a Rule Engine or Missing Value node and replace the missing values with (well) YES or NO, but I assume there is more to it.
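To illustrate the replacement idea outside of KNIME, here is a minimal Python sketch. It assumes the predictions arrive as a list of strings in which unknown universities show up as "?" or as a missing value (None); the function name fill_unknown and the sample values are made up for this example.

```python
def fill_unknown(predictions, default="YES"):
    """Replace missing predictions ("?", empty string or None) with a default label."""
    missing_markers = {"?", "", None}
    return [default if p in missing_markers else p for p in predictions]


# Hypothetical predictor output: two known universities, two unknown ones.
predictions = ["NO", "?", "YES", None]
print(fill_unknown(predictions))
```

In a KNIME workflow the equivalent step would sit after the predictor node, for instance in a Rule Engine or Missing Value node.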
Hi @mlauber71, thank you for your response. I have a list of students (applicants) who apply for an MSc. A precondition is that the (national or international) student completed a BSc at an accredited university. National universities are commonly known, but international universities often need to be checked (e.g. with Anabin, the Informationssystem zur Anerkennung ausländischer Bildungsabschlüsse). The list that I use for training a prediction model contains applications from the years 2017 - 2022. Therefore, many universities have already been checked. On the list, the first check is marked with YES (=> Check in Anabin). The second check (e.g. an applicant from the same university one year later) is marked with NO, as the university is already known (=> Do not check in Anabin). The target value is therefore = Check in Anabin (YES or NO).
I have two issues:
a) Universities that never appeared between 2017 - 2022 but appear on the test sheet are marked with “?”, as they are not recognized.
I want to replace the “?” with YES (Check in Anabin).
b) Universities that were only checked once between 2017 - 2022 only have the value “YES”. The model does not recognize that the second appearance (e.g. when the university first shows up again on the test sheet) needs to be “NO” (the university is already known).
If the name of the university is part of the model, you will not be able to get a result for unknown ones. You would also have to be sure that including such information works for your model in the first place. An alternative would be to create a model without the specific name but with characteristics that describe a university and could also appear in unknown ones.
How should the model ‘know’ about this unless you provide it as information? It could only deduce it from the year (but that runs the risk of some tautology).
To be honest, I still do not understand what you are doing. If the YES or NO is simply the question of whether the university has been checked, you would not need a fancy model for that. You would just check your old entries to see whether the university has already been checked and derive the YES or NO from that. That would be more of a rule (which can also be implemented in KNIME).
Maybe it is possible to upload some dummy data and your approach in order to get a better idea of what is going on (maybe there is some free student data on Kaggle).
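As a rough illustration of such a rule, here is a small Python sketch. It rests on assumptions that are not from the thread: the applications are processed in chronological order, a university is identified by its name after simple normalisation, and the names flag_checks, normalise and known are hypothetical.

```python
def normalise(name):
    """Assumption: two entries mean the same university if they match after trimming and lowercasing."""
    return name.strip().lower()


def flag_checks(universities, known=()):
    """Return "YES" for the first appearance of a university and "NO" afterwards.

    `known` holds universities that were already checked earlier,
    for example every university on the historical sheet.
    """
    seen = {normalise(n) for n in known}
    flags = []
    for name in universities:
        key = normalise(name)
        flags.append("NO" if key in seen else "YES")
        seen.add(key)  # from now on this university counts as checked
    return flags


# Hypothetical data: historical sheet and new test sheet.
history = ["Uni A", "Uni B", "Uni A"]
test_sheet = ["Uni B", "Uni C", "Uni C"]

print(flag_checks(history))
print(flag_checks(test_sheet, known=history))
```

With a lookup like this, a university that appeared only once in the historical data is treated as known, and a completely new one gets YES on its first appearance and NO after that, so no trained model is involved.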
Thank you for your feedback. May I ask two more questions:
a) As far as I understand, in my use case it is not possible to change the “?” to YES before the results (predicted data) are shown. However, is it possible to link the PMML Predictor with another node where “?” can be shown as “YES”?
b) How can I set the rule in KNIME for (training or deploying) my specific model, so that the second entry is always listed as “NO”? (for the column “Check in Anabin”)
Thank you for your feedback.