Extract Date in different formats present between muddled text


Looking for pure KNIME nodes which can help in extracting date of different formats present in a muddled text.
for example:

Unable to do it using regex, as I’m not good at it

And the solution present in other topics doesn’t fit to my requirement, as the data examples others have posted only has dates in a single column, so those are not helpful.

I don’t usually recommend LLMs right out of the box, but in this case I think it might be a fairly straightforward solution to an otherwise frustrating problem.

You could build a custom prompt for each line of your data, something like: "Return the date found in this text using mmyydddd format: " and then append your text.

Huggingface models are free to use (just sign up for an API key). Then you could use nodes from the new KNIME AI Extension (Labs). Maybe a workflow like this could get you started?

1 Like

@Asghar you can also tell ChatGPT to write the Regex syntax for you. If you have a limited number of variations this could be a straightforward task. Only thing to remember is that KNIME Regex nodes sometimes like to have a double escape but ChatGPT might know that.

If you could provide a complete sample of the variations you face the community might be able to help.

If your data is very sensitive you could try and employ a local LLM - depending on which model you use you might have to tweak the prompt to always give you a consistent result.

Hi @Asghar, is there just one date per row, present somewhere in the text, or is this a document containing lots of different dates? Do you have some sample text that you could upload as this may lead to other ideas too.

To extract dates from text that may contain dates written in different formats, regular expressions can be used in Python. Some examples:

import re

text = “On 01/14/2024 a meeting is scheduled, another on February 14, 2024 too. Not sure if 4/5/24 or April 5, 2024 will work?”

Extract common date formats

date_regex = r"\d{1,2}[-/\s]\d{1,2}[-/\s]\d{2,4}|\d{1,2} \w+ \d{2,4}"

dates = re.findall(date_regex, text)

Extract less common date formats

date_regex2 = r"\w+ \d{1,2}[a-z]*,? \d{2,4}"
more_dates = re.findall(date_regex2, text)

This extracts date strings like:

  • 01/14/2024
  • 02/14/2024
  • 4/5/24
  • April 5, 2024

It uses two regex patterns, one for common number-based date formats, and another for written month and year dates.

The regexes could be enhanced to support more date variants and locale formats as needed. This demonstrates how regular expressions can pull out dates from text even when they are written inconsistently.

Regards: 툰코-online.com/

1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.