There must be a better way (hopefully!)

Hello KNIMErs

I have process I must KNIME and I have a solution, but I think the solution is horrible. Really horrible.

My process receives a string and from that string I need to extract / extrapolate / infer / calculate: the start date, the end date, the total length and the length from (what ever today is) until the end.

These strings come from different sources. The majority look like this: “125 years from 1 November 2004”. Some look like this: “999 years (less 10 days) from 1 November 1885” (with or without brackets), or this “999 years from and including 1 January 2020”, or this “From and including 25 December 1981 expiring on 24 December 2196”, or (my favourite) “125 years from 25 rstjsryjkdlkfyu 1997”, plus many other variants. Oh, and the date can come in any format (luckily I have not seen American date format).

My (horrible) solution it to have many filters that run in series e.g. if string starts “[0-9]{2,4} years from” then go to branch A, else go to branch B. Branch A: if string does not contain “including” go to Branch AA else go to branch AB and so on and so on and so on. Horrible!

Is there a better way?

Notes:
It does not matter in the date calculations are ± 1 day
Length figures are in months or years (rounded down)
I have about 22000 examples - i have attached 300
Book1.xlsx (13.5 KB)
Still on AP 4.7.7

Many Thanks

Frank

One more thing: I am not expecting a full workable flow as a reply. Ideas and skeletons (if required) would be amazing!

Have you got a limited number of different patterns because they are generated by systems or they are generated by people ?
if from systems, then maybe to do pattern match for each one. For example with string manipulation or column expression.

from humans : not an easy one, try to breakdown patterns like using regex : xxx years from , catch a year , like 4 digits and identify if a from is before or after.
or the chat GPT Api, or does the chatgpt chat gives you the answers you want ?
there is some video on knime AI that i need to watch, maybe your solution is in there.
good luck.

hello @bpoleselmazak

The strings have been generated by both humans and computers from historical and contempary documents. There appears to be many different patterns throughout the input data.

I have flirted with ChatGPT and the results have been inconsistent. I have made fine tuning models for openai for other projects, but the initial testing for this use case didn’t turn out well. This is maybe because to are asking it to do different complex tasks e.g.
Input string: “125 years from 25 December 1998” Required action: 1. identify length (125 years) 2. identify start date (1998/12/25) 3. calculate end date (1998/12/25 + 125 years) 4. calculate time until end (end date - today())
input string: “from and including 1 January 2006 to and including 31 December 2130” Required action: 1. calculate length (2130/12/31 - 2006/1/1) 2. identify start date (2006/1/1) 3. identify end date (2130/12/31) 4. calculate time until end (end date - today())

Alex

@FrankColumbo what I do not fully understand is what is supposed to be the result from a pattern like

999 years from and including 29 September 2020 and to and including 29 September 3019

If you consider employing LLMs this might be a point. You might have to give clear instructions.

What you could try to do is turn the whole thing into a vector store and add this as an information either to a local LLM or to ChatGPT (via API). Any maybe test different prompts. What might be essential is to make clear what are samples and what the expected result should be structured. ChatGPT usually is better at understanding this.

If you could separate your strings into groups that also might help. With ChatGPT it could be a thing to activate some plugins like Wolfram to improve the chance of it getting the math right (which is not the strong side of the base product since it is a language model).

2 Likes

hello @mlauber71

I am required to output: the length, the start date, the end date and the time until the end date. (unless corrupt) I will always be able to get this information using the input.

The issue is that the input varies in format(s) and the information within it. e.g.

  • “125 years from 1 January 2008” contains length and start date

  • “999 years from and including 29 September 2020 and to and including 29 September 3019” contains length, start date and end date

  • “Starting on 1 October 2017 and ending on 8 February 3007” contains start date and end date

I do like this “If you could separate your strings into groups that also might help”. I think you have nailed it for me! Overview below:

  • Create 4 column with null values: start date, end date, length and time until end

  • Does the string have only 1 date? Yes - that date == start date column

  • Does the string have 2 dates? Yes - earliest date == start date column, latest date == end date column

  • Does the string have “[0-9]{2,4} years”? Yes - that number == length column

  • Calculate columns that are null

Thank you

Frank

1 Like

Hi @FrankColumbo , I too was deliberating on if this is a job for AI. It certainly will be the kind of thing we’ll soon be using AI for.

But in the meantime…

extract dates and durations.knwf (210.9 KB)
Here is my take on a possible regex-y workflow.

I made some assumptions about patterns, and it can be adjusted, but for the most part in the sample data, I figured that the following patterns would find you “years duration”, “less days”, “from date” and “to date” working from left to right across your text strings.

Pattern ColumnName
.*?([0-9]+) years(.*) Years
.*less\s*(.+)\s*day(.*) LessDays
.*?([0-9]{1,2}\s*(?:st|nd|rd|th).*[0-9]{4})(.*) FromDate
.*?[from|commencing|on]\s*([0-9]{1,2}[ .-/].*?[0-9]{4})(.*) FromDate2
.*?([0-9]{1,2}\s*(?:st|nd|rd|th).*[0-9]{4})(.*) ToDate
.*?([0-9]{1,2}[ .-/].*?[0-9]{4})(.*) ToDate2

The idea here being that it would process the data for each “pattern” working from the left and moving to the right. Each regex pattern contains 2 capture groups. The first is the piece of data to be collected, and the second is the “remainder” of the text to the right when that piece of information is found.

To process the rows and also loop through the rules recursively required a recursive loop. On each iteration of the loop, it collects the text (if any) that was found using the pattern, reduces the text to what is left after that pattern has been found, and removes the top pattern from the list so it is not attempted again. It proceeds until there are no more patterns to process.

I also made use of a slightly upgraded version of my “Flexible Date Reader” component for handling such things as “the 1st day of January 2013” and returning this as “2013-01-01”.

It’s by no means perfect, and if the patterns of your text are wildly different it will need to be tweaked (and maybe it won’t be possible to tweak it enough… managing expectations!), but hopefully it helps.

Here is a sample of its output

Looking at this, its possible that I didn’t need to split parsing of FromDate into two patterns, and likewise ToDate. They were after thoughts to cover different formats, but hey it worked so I didn’t try to fix it any further!

2 Likes

@takbb impressive :slight_smile:. One approach in such cases could be to let an AI come up with the Regex patters if one has very divers (but somehow also structured) entries.

1 Like

Yes indeed @mlauber71, that’s certainly the way it’s headed.

AI is my coworker in that regard already “off screen” even if just to bounce around ideas, and a good many of the Java snippets I’ve put on the forum recently have been about 80% written by chatgpt with me then just stitching it in and tweaking it for KNIME.

2 Likes

Hello

thank you for your suggestions. I shall be trying out @takbb flow this afternoon / evening and will let you know.

Frank

1 Like

@takbb Awesome!

Thank you

Frank

2 Likes

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.