String to Date&Time node and decision tree learner

ajatal · October 17, 2022, 8:45am

Hello Community,

I have been using the Decision Tree Learner and Predictor nodes together with String to Date&Time to predict a time. I have some dates, so I use the String to Date&Time node to convert them. However, I have three main challenges:

When I tried to increase the relative percentage (to 60% or 70%), the program could freeze while running the Decision Tree Learner node.
When I selected “Guess data type and format” in String to Date&Time, it reported “No suitable format found”. As an alternative, I manually changed the date format in String to Date&Time to match my dates. The console then showed the error message “Execute failed: Failed to parse date in row ‘Row0’: Text ‘2/1/2000’ could not be parsed at index 0”.
Whenever the program froze, I had to close it. When I re-open KNIME, I have to reconfigure the relative percentage in the Partitioning node, the class column of the Decision Tree Learner, and the maximum number of stored patterns for HiLite-ing.

Please let me know if any of you have tips or suggestions to share with me. Many thanks in advance.

dora_gcs · October 18, 2022, 1:08pm

Hello @ajatal ,

welcome to the Forum!

Could you share some sample data you are using?
Based on your post, I tried this data type with the String to Date&Time node, and it works for me:

Furthermore, it is hard to say at this point why your KNIME Analytics Platform (AP) freezes during execution. It could be wrong settings or insufficient memory allocated to the AP.
Could you share more details? Maybe the part of your workflow (WF) where you are facing this problem?

Have a nice day!

ajatal · October 18, 2022, 3:32pm

Hi @dora_gcs ,

Thank you for replying with your suggestion.

I have tried the settings from your screenshots. I am also sharing some sample data in the screenshots attached here.

The String to Date&Time node is kind of working now. However, the decision tree learner node doesn’t work.

Here is my full model:

In the CSV Reader, the input file contains the order number, the order entry date, the order received date, and the lead time (the days from entry date to received date).

I want to predict the new lead time or the new order received date based on these historical data.

If I select both “entry date” and “received date” in the String to Date&Time node, I get “WARN Decision Tree Learner 3:3 Class column “Purchase order received date” not found or incompatible” in the console. If I select only the entry date in String to Date&Time, the node works. This is strange, as both dates are in the same format.

Here are some of my sample data:

If I select either “entry date” or “received date” in the String to Date&Time node, I am not able to select it under Options → General → Class column of the Decision Tree Learner. I may also get the message “ERROR Decision Tree Learner 3:3 Execute failed: Cannot invoke “org.knime.base.node.mine.decisiontree2.model.DecisionTreeNode.getOwnIndex()” because “children[i]” is null” in the console.

Could you please take a further look?

Thanks a lot in advance.

mlauber71 · October 18, 2022, 4:24pm

@ajatal I think we have two issues here. The first is converting Strings to dates (A meta collection about handling data and time functions in KNIME, R and SQL (including databases and big data systems) – KNIME Hub). This should be relatively straightforward: you will have to find the right pattern, and if it does not work in one node you might just use two String to Date nodes in a row to get the job done.
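As a rough illustration of "finding the right pattern", here is a minimal Python sketch that tries several candidate formats in turn. It is not how the KNIME node works internally. The list of candidate formats and the reading of ‘2/1/2000’ as month/day/year are assumptions. In KNIME's Java-style patterns, a pattern such as `dd/MM/yyyy` expects two-digit day and month fields, whereas `d/M/yyyy` also accepts single digits, which would be consistent with the "could not be parsed at index 0" error reported above.

```python
from datetime import datetime

# Candidate patterns to try in order (an assumption for illustration only).
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y"]


def parse_date(text, formats=CANDIDATE_FORMATS):
    """Return the date from the first format that parses `text`."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # this pattern does not fit, try the next one
    raise ValueError(f"No suitable format found for {text!r}")


# Interpreted here as month/day/year, which is an assumption.
print(parse_date("2/1/2000").isoformat())
```

If the real data uses day/month/year instead, the last candidate would need to be `"%d/%m/%Y"`.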

The other thing is ‘predicting’ a future date. I do not think that just using date variables as targets will do you any good, since dates will be different in the future. I think you will have to use and predict date/time variables in a relative way: days-since-today or days-before-today (or whatever your anchor date is), and a prediction in the form of days-until-next-sale or similar. In this case you could try to predict a numeric (regression) value (Machine Learning Meta Collection (with KNIME) – KNIME Hub).

So you would predict the days until the next sale (?) and then derive a date by adding that number to your current (reference) date.
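A minimal Python sketch of that last step, assuming the model outputs a (possibly fractional) number of days; the function name and the example values are hypothetical:

```python
from datetime import date, timedelta


def predicted_date(reference, predicted_days):
    """Turn a predicted number of days into a calendar date."""
    # Round to whole days; whether to round or truncate is a design choice.
    return reference + timedelta(days=round(predicted_days))


# Hypothetical values: reference date and a model output of 12.4 days.
print(predicted_date(date(2022, 10, 19), 12.4))
```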

If you just have patterns of sales (without further items/variables) you might try to formulate the whole thing as a time series problem. You might have to do some research to achieve that.

https://medium.com/mlearning-ai/time-series-analysis-with-knime-an-introduction-7ce01a7ce055

ajatal · October 19, 2022, 7:02am

Hi @mlauber71 ,

Thank you for your kind suggestion by introducing time series to me. It is really helpful to learn more about the tool.

However, our goal is to predict the time to be spent (lead time) on future orders. Each order is placed separately, and it is hard to conclude whether any particular period brings more orders.

We have the historical data: the order number, the item number, the order entry date and the order received date. We want to use KNIME to predict new order received dates and compare them with the historical received dates. We want to see how close they are, so that we know whether the model is suitable for predicting future orders and their receiving dates or lead times.

Does my reason for trying this model make sense? Do you think the decision tree is the right option to achieve this goal?

Thank you again for your time.

dora_gcs · October 19, 2022, 9:19am

Hello @ajatal ,

I believe your use case can be handled with KNIME; you are in the right place.
Speaking of right places, have you already taken a look at our Hub? There are many example workflows on this topic for inspiration. (Here, for example, is a list of search results for ‘inventory’.)

Hope you can find something which could help you!

mlauber71 · October 19, 2022, 10:46am

@ajatal my impression is that you have to put more work into understanding and setting up your data. What information is there that might influence the orders and their timing? Do you have any data about that? Is it seasonal? Would it depend on the customer placing the order, and so on?

The next thing is the presentation of data to the model. As said before, using fixed dates in a model would not make much sense, since it would then only work for past data. You might be able to create something like a perfect model for the past which might be entirely useless for further predictions (if this is what you seek).

You have plenty of model types in KNIME (Can KNIME be used to show the employment and unemployment rate in the UK from the year 2008-2018, showing the prediction for the next year? - #11 by mlauber71) but you will have to prepare the data in the right way. And again: the problem might be better formulated as a time series task. But that depends on the data and what you want to achieve.

Maybe you could provide sample data representing your challenge without spilling any secrets. Sometimes Kaggle might have a similar case you could adapt.

ajatal · October 20, 2022, 8:07am

Hi @mlauber71

Thank you for the kind comment. Here is the sample data attached. It contains the purchase order number, item number, purchase order quantity, order entry date, receive date, requested date and lead time. Column G, the lead time, is the receive date minus the entry date (G = E - D).
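For reference, the calculation G = E - D can be sketched in Python as follows. The date format and the example dates are assumptions, since the sample file itself is not reproduced in the thread:

```python
from datetime import datetime


def lead_time_days(entry, received, fmt="%m/%d/%Y"):
    """Lead time in days: receive date (E) minus entry date (D)."""
    entry_date = datetime.strptime(entry, fmt)
    received_date = datetime.strptime(received, fmt)
    return (received_date - entry_date).days


# Hypothetical order entered on 2/1/2000 and received on 2/15/2000.
print(lead_time_days("2/1/2000", "2/15/2000"))
```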

So far I don’t have other data showing whether the receiving date could be affected by other factors. I need to run a suitable model on these data, excluding the lead time (Column G), to get a new lead time and see what the model thinks the lead time should be. Then I could compare the new lead time with the old lead time (Column G) to see how close they are, and use this model to predict future orders. Can you see my point? I don’t see the connection to time series or trends, as each order is placed separately. That’s why I chose a decision tree.

Please let me know if you might have further suggestion to me. Many thanks in advance.
Regards

Daniel_Weikert · October 20, 2022, 4:52pm

Maybe you can gather further data (features) for your model. Based on the screenshot I do not see many options for a model to learn a mapping function
br

mlauber71 · October 20, 2022, 5:07pm

@ajatal OK, to sum up your data, you have:

the lead time as a target
as features
** an item number (which might contain additional characteristics you might be able to add or which might be there implicitly if you have enough cases and the items are stable)
** the order quantity (basic correlation might be: smaller quantity = faster execution. Or slower execution - larger orders might be more important and get priority treatment)

There might be some sort of correlation between the items and the quantity ordered if there is any systematic correlation between what that item is and how long it might take to produce/get it. If this is influenced by other factors as well, you might only have seasonal information (holidays, weekends) to go with in your current dataset; or you could create features from past orders (have similar orders been fulfilled in the recent past, which might influence the ability to procure the items?).
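To make the "seasonal information" idea concrete, here is a small Python sketch that derives calendar features from an order entry date. The feature names are hypothetical, and holidays are left out because they would need an external calendar:

```python
from datetime import date


def calendar_features(entry_date):
    """Derive simple calendar features from an order entry date."""
    weekday = entry_date.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "weekday": weekday,
        "is_weekend": weekday >= 5,
        "month": entry_date.month,
    }


# 22 October 2022 was a Saturday.
print(calendar_features(date(2022, 10, 22)))
```

Features like these are relative to the calendar rather than fixed dates, so they remain usable for future orders.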

All this will depend on what might influence your production/procurement. You will have to talk to the business people who might know about this. If the ability to produce the items is heavily influenced by outside factors (strikes, shortages with providers/workers, regulations etc.) you could try to get data about that as well to enrich your model.

Currently it would seem that this can be formulated as a regression problem. The expected quality might very much depend on the conditions mentioned. You might want to upload a sample file, as it is hard to work with screenshots.
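As a minimal sketch of such a regression setup, the Python below uses the mean lead time per item as a baseline predictor and scores it with the mean absolute error on held-out rows. All rows are made up for illustration, and a per-item mean is only a simple baseline chosen here, not a method proposed in this thread; a real model (decision tree, gradient boosting, etc.) should at least beat it.

```python
from collections import defaultdict
from statistics import mean

# Made-up illustrative rows: (item_number, quantity, lead_time_days).
train = [("A", 10, 5), ("A", 20, 7), ("B", 5, 12), ("B", 8, 14)]
test = [("A", 15, 6), ("B", 6, 11), ("C", 3, 9)]


def fit_item_means(rows):
    """Learn the mean lead time per item plus an overall fallback mean."""
    by_item = defaultdict(list)
    for item, _quantity, lead in rows:
        by_item[item].append(lead)
    overall = mean(lead for _item, _quantity, lead in rows)
    return {item: mean(leads) for item, leads in by_item.items()}, overall


def predict(model, item):
    """Predict the item's mean lead time; unseen items get the overall mean."""
    item_means, overall = model
    return item_means.get(item, overall)


def mean_absolute_error(model, rows):
    """Average absolute gap between predicted and actual lead time."""
    return mean(abs(predict(model, item) - lead) for item, _quantity, lead in rows)


model = fit_item_means(train)
print(mean_absolute_error(model, test))
```

The quantity column is carried along but unused by this baseline; it is one of the features a proper regression model could exploit.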

ajatal · October 21, 2022, 6:06am

Hi @mlauber71,

Thank you for sharing your thoughts. I have attached an example file with more example data, such as the item class, the factory where the item is stored, and the mode of delivery, though I don’t think they have much correlation with the lead time.
Book2.xlsx (13.0 KB)

Our challenge is that, because of the large volume of data, we have not yet been able to conclude whether there is any correlation between the lead time and the quantity or the mode of delivery. So we need to try different approaches to see which one predicts a lead time closest to the actual values.

Please have a look and let me know your thoughts.

Thank you again for the kind help

ajatal · October 21, 2022, 6:08am

Hi @Daniel_Weikert ,

Thank you for looking into my case. I have attached a sample here with more data.
Book2.xlsx (13.0 KB)

It would be great to get your ideas also.

Thank you for your time.

ajatal · October 26, 2022, 3:20pm

Hi @mlauber71 ,

I would like to check whether you have any update on the sample file I shared last week.

Thanks a lot!

mlauber71 · October 26, 2022, 3:53pm

@ajatal could you explain which columns are features and confirm which column is the target (Lead time)? Then you might want to define what role the date variables might play with regard to what I have said about them.

And maybe you could comment on the remarks I made about your approach. You will have to think about what information your data might be able to provide.

You could of course just try to build a regression model with your data (without the date variables) and target, but it might not help you a great deal until you have sorted out your data preparation.

ajatal · October 26, 2022, 4:11pm

@mlauber71 Thank you so much for the tips and comments shared. I will have a try and might come back with further questions.

ajatal · October 27, 2022, 1:46pm

Hi @mlauber71

Thank you again for kindly introducing H2O. I have tried it, and here is the workflow:

In the CSV Reader node, I have deselected “limited data scanned”, so it should scan the entire data. However, when I view the result in the H2O to Table node, I can see only 1/5 of the data. I checked all the other nodes, but I could not find any other setting that would let me view the whole result. Would you mind giving me some suggestions on how to view the entire result?

I haven’t tried AutoML yet. Should I include it in the H2O workflow, or is it a separate model type?

Thanks a lot in advance.

mlauber71 · October 27, 2022, 3:17pm

I assume this is the 20% test data used to validate the model, on which you would base the statistics. I would suggest learning about creating machine learning models more broadly, so you can see which approaches could work for your business question.
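To illustrate why only about 1/5 of the rows would show up, here is a small Python sketch of an 80/20 partition. That the workflow splits the data 80/20 before training is the assumption stated above; the function name and seed are hypothetical:

```python
import random


def partition(rows, train_fraction=0.8, seed=42):
    """Shuffle rows and split them into a training and a test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatable splits
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_rows, test_rows = partition(range(1000))
# Predictions are only produced for the test part, i.e. 1/5 of the rows.
print(len(train_rows), len(test_rows))
```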

system · January 25, 2023, 3:18pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.