This forum thread is for questions related to KNIME self-paced course [L3-CD] Continuous Deployment and MLOps. For example, if you need help with an exercise or have some feedback, please post it here!
You can access the course here. This link will direct you to the learning management system.
I am surprised that, in L3-CD, the trainer and predictor workflows do the identical data preprocessing after the partitioning. I feel this will lead to incorrect results, possibly just marginally, whenever statistical measures are applied. In the exercise part 1 workflow, for example, there is outlier handling, but the criterion for whether a row is an outlier is based purely on the training set. Yet the test set is data "recorded" in the same way as the training set. That could lead to rows in the test set being detected as outliers that would not have been classified as outliers if the whole set had been used to determine the outlier threshold.
Thank you for your interest in the course. This thread is the right place to ask your question about the course.
We partition the data before all the other preprocessing steps to avoid data leakage, i.e., to make sure no information from the test set is used during the model training (edited). Since we want to build a model that works well on unseen data, we also need to evaluate it on unseen data. Here is a short video explaining the concept of data leakage.
Thanks for replying that quickly and providing your clip on data leakage. I feel I will need some time to digest that what I saw as a benefit is actually a flaw.
I have been mulling it over a bit. I do not see how applying data cleaning techniques to the whole data set results in data leakage, especially when the techniques result in the removal of rows.
Could you expand on that?
It also comes to my mind that the parameter connection from the training cleansing nodes to the test nodes seems to infringe the "no data leak" paradigm, i.e., that "no information from the training set is used in the … model … evaluation".
I just reread my previous answer and I think I indeed wrote it in a confusing way; what I should have written was: "make sure no information from the test set is used during the model training". Let me try to explain in more detail.
What we want to achieve is that the test data plays the role of unseen new data for which you're building the model and on which you want the model to perform well. Therefore, it is important that no information from the test data is used during training, because otherwise it can't be considered unseen by the model anymore.
If you take the whole dataset to do some calculations, for example, to calculate the quartiles to define the outliers or to calculate min and max to do min-max normalization, you use the information from the distribution of the whole dataset, including the test data, to prepare the data that will be used for the model training.
However, this would not be possible when you apply the model to new unseen data (independent of the data preparation technique), because this data comes only after the model is trained. Since you want your test data to mimic the unseen new data, no information from the test data should be used during training. In this case, during the evaluation, you can assume that the model will perform on unseen data as well as it performs on the test data.
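To make this concrete outside of KNIME, here is a minimal Python/pandas sketch of the same idea (the column name, partitioning ratio and IQR multiplier are just illustrative assumptions, not taken from the course workflows): the outlier bounds and normalization parameters are computed on the training partition only and then reused, unchanged, on the test partition.

```python
import numpy as np
import pandas as pd

# Illustrative data; in the course this would be the partitioned KNIME table.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(50, 10, 1000)})

# Partition first (e.g., 80/20), before any statistics are computed.
train = df.sample(frac=0.8, random_state=0)
test = df.drop(train.index)

# Outlier bounds from the TRAINING partition only (IQR rule, multiplier k = 1.5).
q1, q3 = train["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Min-max parameters from the TRAINING partition only.
vmin, vmax = train["value"].min(), train["value"].max()

def preprocess(part: pd.DataFrame) -> pd.DataFrame:
    """Apply the training-derived bounds and scaling to any partition."""
    kept = part[(part["value"] >= lower) & (part["value"] <= upper)].copy()
    kept["value_scaled"] = (kept["value"] - vmin) / (vmax - vmin)
    return kept

train_clean = preprocess(train)   # used for model training
test_clean = preprocess(test)     # used only for evaluation, no refitting
```

In the KNIME workflows, the parameter connection between the training and test branches plays exactly this role: it carries the training-derived statistics over to the test branch instead of recomputing them there.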
Consider an example: there are outlier rows O1 to O4, and rows I1 and I2 that are extremely close to the threshold, but still good.
We split the data into sets Tr and Te.
O1, O2, O3, I1 go into Tr
O4, I2 go into Te
We do outlier filtering on the Tr set. As O4 is missing from it, it could happen that I1 gets classified as an outlier or that O3 gets classified as good. I do not know which will happen, but I do not feel this matters.
We train on Tr, but we know that the training set is slightly faulty, as either I1 should also have been used or O3 should not have been used. Thus, I presume that the model will not be optimal.
We apply the outlier threshold of the Tr set to the Te set. It can happen that O4 will not be classified as an outlier or that I2 will get classified as an outlier. I do not know which one happens, and I am not sure that this matters, but I feel the performance is not measured as exactly as it could be.
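To illustrate my point with made-up numbers (purely hypothetical, not the exercise data): with the IQR rule, the upper bound computed on Tr alone can land below a borderline row that the whole-set bound would still accept.

```python
import numpy as np

def upper_bound(values, k=1.5):
    """Upper outlier bound from the IQR rule: Q3 + k * IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    return q3 + k * (q3 - q1)

bulk = list(range(40, 61))        # "normal" rows
o1, o2, o3, o4 = 95, 96, 97, 98   # clear outliers
i1, i2 = 78, 79                   # borderline rows, still good

full = bulk + [o1, o2, o3, o4, i1, i2]
tr = bulk[:17] + [o1, o2, o3, i1]   # O1, O2, O3, I1 go into Tr
te = bulk[17:] + [o4, i2]           # O4, I2 go into Te

print(upper_bound(full))  # 79.0 -> I1 and I2 are within the bound
print(upper_bound(tr))    # 70.0 -> I1 is dropped from Tr, and applying this
                          #         bound to Te also flags I2 as an outlier
```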
I guess I just have to accept, somehow, that there is something inherent here that is counter-intuitive to me.
However, when I fiddle with the input and change it to something that, I would presume, should make the score even worse, the contrary is the case. Did I do something wrong, or is the model just that funny?
Regarding the question about the model - I think it might indeed not be the best model. The focus in this course is on the deployment, not on the development of the best model, so I guess it might not always behave as expected.
Regarding the question about the Training space always being present in the CDDS extension - this is indeed not always the case, because the CDDS extension has 3 levels: in the course they are called Simple, Automated and Continuous deployment (in the documentation, they are called levels 1, 2, and 3, respectively). When only Level 1 of the CDDS extension is installed (for simple deployment), the Training environment is not needed and the Training space is not created during the CDDS installation.
Finally, apparently the Development space got deleted*. I am sorry for the inconvenience and thank you very much for reporting it. I reinstalled the CDDS extension and now it should work as described in the exercises. I hope you have time to try it out now.
Thank you for your comments! It will be useful feedback for the next course update. Please let us know if you find any other issues.
Best,
Lada
* P.S. This is why the exercise instructions mention that this part of the exercises includes collaboration with other learners and that it is important not to change or delete the CDDS spaces.
Coming back to your question about outliers and partitioning: there are different ways to detect outliers, and even the Numeric Outliers node, which applies the interquartile range rule to detect outliers, allows you to customize the interquartile range multiplier. This means that the threshold can differ. Next, if the outliers are extremely close to the threshold and also to the next non-outlier values, maybe they are not extreme enough to change the model behaviour significantly.
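As a rough illustration of the multiplier's effect (a small Python sketch with made-up data, not the node's internal code): increasing the multiplier widens the bounds, so fewer rows are flagged.

```python
import numpy as np

def iqr_bounds(values, k=1.5):
    """Outlier bounds from the IQR rule: [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(50, 10, 500), [110, 120, 130]])

for k in (1.0, 1.5, 3.0):
    lo, hi = iqr_bounds(data, k)
    n_out = np.sum((data < lo) | (data > hi))
    print(f"k={k}: bounds=({lo:.1f}, {hi:.1f}), flagged={n_out}")
```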
Maybe a simple way to think about it is that the idea of the test data is to mimic the data that you don't have yet. And you can't use the data you don't have yet in the preparation of your model.
Many thanks for getting to the bottom of the issue. Giving feedback is the least I can do for a course that is free of charge.
About the Training space: I am very sorry, my bad. I now see I was not looking at the correct type of diagram. It was the continuous one, not the simple…
With respect to the missing Development space: I was afraid that someone had deleted it erroneously. Thanks for reinstalling it. I could deploy, but I am still having an issue.
My KNIME installation is on the B drive while my Windows profile is on C. Does that error message mean that the path for the temporary files must be within the installation?
I could make the simple path and the continuous path work, but not the automated path. The upload worked, but not the sending to training, even though it tells me it should have worked.
I am pretty sure that you are the wrong person to address with this issue; however, I hope you know whom I should contact. I have now taken my L3 exam successfully.
First, in the exam I encountered a question that, to me at least, does not look like a question at all. I could attach a screenshot, but I thought you might not want me to disclose it here. It starts with "A team …".
Second, I would like to get an overview of my examination results so that I can figure out which questions I answered incorrectly and improve on them.
Thank you for your feedback about the certification question (and for not disclosing exam questions…). There was indeed a mistake in the question and we have fixed it now.
I am exploring the L3 courses. It's great that there is a new version of the self-paced courses, but because of network access restrictions we are not able to watch the videos. Is there any other way? Thank you.
Thanks for reaching out and letting us know that you can't access the videos. We will try to find a solution as soon as possible and will get back to you.
Good day,
in module 8, Continuous Deployment for Data Science (CDDS), step 1, you talk about using the Configure Monitoring Output component. Is there an example of what this component might look like? I cannot find any hint about it on the forum.
I found this workflow, Model Monitoring with Integrated Deployment, but I am not sure if we are referring to the same thing.
Thx, Kurt