Error with String Splitter (Regex) based on date and time

Hi, I’m trying to split a column that contains a log such as the one shown below. I want to split it in rows with each row containing the part of the log starting with the date/time (keeping the date/time). I’m using the following regex in the String Splitter (Regex) node but get an error Some input string(s) were missing or did not match the pattern. Played with advanced settings to no avail.

(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})

regex101 shows the split and I used the above regex in Python code successfully using re.split()

Log
2025-01-23 11:45:12 - John Doe (Additional comments)
Rejected by John Doe
2025-01-22 09:56:36 - David Smith (Additional comments)
Approval request sent to David Smith - david.smith@acme.com

Any help appreciated.
Thanks.

Hi @Ferko , Welcome to the KNIME forum.

What final output are you wanting to retain?

Are you purely wanting to keep the date/time or are you trying to ultimately merge each pair of rows into a single row of multiple columns

e.g

LogDate Text1 Text2
2025-01-23 11:45:12 John Doe (Additional comments) Rejected by John Doe
2025-01-22 09:56:36 David Smith (Additional comments) Approval request sent to David Smith - david.smith@acme.com

To just collect the date/times, adjust your pattern to this:

(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*
if set to “require whole string to match”
or just this, if not:
(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})

e.g.

You will still get this message, because for some rows, the pattern didn’t match, but it can be ignored in this case.
image

This would put your dates into a new column, and you could then follow this with a Row Filter/Row Splitter keeping those rows where Split is not Missing.

You could alternatively follow it with a Missing Value node and have missing values filled down from “previous value” which would place the date against the associated row too.

Hopefully they’re useful suggests but it depends what you want to do with it :wink:

Hi @takbb. Thanks for the quickly reply. Here is a diagram explaining what I’d like to do.

Once I have the rows as shown below, I can use the Cell Splitter to split the lines in columns (using \n) and then split the cell containing the date using regex.

I’m struggling with the first part. The regex needs to split the text in rows using the date/time that can appear multiple times. I understand (?= is needed to “lookhead to match without consuming the text”

(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})

Ah, ok, so if I understand correctly your data is initially in this form, so there are multiple entries per “NUMBER” contained within the one cell?

I wouldn’t try to achieve it all with the regex node. Instead I’d go for reassembling the data into a format which regex can more easily assist with. Since I already had a regex to extract the date when the date was split across multiple rows… let’s first split the data across multiple rows :wink:

To achieve that, the Cell Splitter, splitting the data on \n but generating a List turns the data into this:

After a quick rename of the column to remove the “_SplitResultList” suffix, ungroup can be used to unpack the List column into individual rows per item, and a nice side-effect of this is that the “NUMBER” value gets filled down against each item

This then puts us on track to use the String Splitter (regex) to pull out only the date data (if present) into a new column. Missing Value can then fill down those dates across all the rows to which it applies

and finally a groupby, grouping on NUMBER plus the “DATE” column and concatenating with \n, followed by a tidy up of the table gets you to the result (I hope! :slight_smile: )

see attached workflow:
regex date split.knwf (91.7 KB)


edit: – Having just re-read I gather that after this, you then want your data in column form, so I’ve added a second workflow with a modification to the end processing. This does indeed use “lookahead” regex, as it tries to generate three columns across different rows that can then be “merged”

(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \- (.*)|(.*)

So this looks ahead to see if the row contains the date and time, and if it does it captures the date/time, then skips the hyphen and captures the remaining text on that row. If it doesn’t contain date/time it just captures the whole entry into a third column

So you end up with this:

Tidying that up with a Missing Value, then a Row Filter and Table Manipulation results in this output:

regex date split -v2.knwf (101.5 KB)

1 Like

@takbb thanks a lot for your help and thorough answers. This is perfect. Amazing community. And Knime is a great product. I’ve been using it for a good month now and will never go back to Excel to process my data :wink: Wish you a good day.

2 Likes

You’re welcome @Ferko . Great that you’re moving on from Excel data manipulation, and thank you for marking the solution.

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.