Date String to Number

Hi All,

I am trying to clean some date columns in my project. The dates are written by month name.
image

I have tried to create a secondey Month Name/ Month Number table and wanted to use Find&Replace to clean the columns. Would that be the correct approach and would you have any recommendations?

Thank You.

Hi @ADonia

Welcome to the KNIME forum!

The question on how to approach is partially driven by what are you comfortable with, because there are many ways to do this (next to how big your data set is/workflow performance).

To give you some inspiration, I’ll outline three different approaches that achieve the same result:

Option 1 is the full manual way that is suitable if you can find your way through code easily. It uses the Column Expression node. If you don’t have it, I highly recommend adding it since it’s a very versatile and powerful node.

WF:

It:

  • Checks if the subject column contains December of Dec
  • If that is the case, replace it with 12 subsequently.
  • Repeats this for all months.

Downside: a lot of work to main this code properly if you need to make changes.
Upside: it only requires one node.
Note: I only did a few months for illustration purposes.

Option 2:
This option utilizes the Text Processing package that KNIME offers which has some dedicated tools for this, mainly the Dictionary Replacer. As the name suggest, it is capable of replacing terms in your text that are present in a dictionary that you feed to node.

The dictionary is basically the input that you already have whereby you associate the month with a number. This is also used as such within the Dictionary Replacer.

image

Downside: requires more nodes to convert the string to document and extract the updated text again.
Upside: replacement values easy manageable through the dictionary.
.
Option 3

This is similar to option 2 as it also uses a dictionary as input but stays away from the data type Document. Here it constructs all the necessary replacement statements in a more conventional way using a temporary string.

image

This is grouped to create one big statement and passed as a flow variable to a Column Expression node which is controlled by this.

Variable controlled Column Expression:
image

Downside: requires quite a few nodes to get the desired result.
Upside: replacement values easy manageable through the dictionary.

See WF with these examples:
Date String to Number cleansing.knwf (75.3 KB)

Have look around in the forum as well. Actions like this are widely discussed and you’ll definitely find more and probably better solutions and inspiration to solve this.

Hope this helps!

2 Likes

Hi @ADonia and welcome to the Knime Community.

There are definitely several ways to do this as @ArjenEX mentioned.

In addition to the options that @ArjenEX has suggested, you can also use the Rule Engine to add your rules - it would be similar to the Column Expressions solution, but instead of using coding and if…else, you would use the Rule Engine syntax.

Another way that I wanted to use is using the Extract Date&Time Fields node that can extract the month number, provided that we have a valid date. So using this method, I don’t need to create a dictionary or Rules for the month numbers. It avoids me having to manually build a dictionary or manually write the rules. More importantly, it avoids mistakes as there are quite a lot rules to write (I do see a couple of mistakes from @ArjenEX 's screenshots :smiley: - or perhaps that was on purpose)

This is how I achieved this:

Input (a subset of your data):
image

Results:
image

Explanation:
I first extract the month using the Cell splitter:
image

There’s a date format that is accepted with month such as: Jan, Feb, Mar, Apr, etc… basically the capitalized version of the first 3 letters of the month name, therefore I used the substring(month, 0, 3) which will give me the first 3 letters which will work whether it’s a full name or already with 3 letters.

I generate a valid dummy date using the formatted month name:

And this gives me this:
image

To convert to type date, I use the following settings:

Choose type Date, Locale as en-US and Date format as yyyy-MMM-dd
Note: This Date format (MMM) works only with Locale en-US

I get these dates after converting:

Then I use the ability to extract the month number by the Extract Date&Time Fields node:

And I get this:

After that, I just need to run a replace() function using values from the 2nd and 4th column as search and replace in the 1st column:

Here’s the workflow: Date String to Number.knwf (16.6 KB)

Note: Please provide some data that we can copy and paste that we can work with in the future :slight_smile:

2 Likes

Thank You for the fast response Guys.

ArjenEX

  • Really appreciated the different approaches offered. I faced a few issues trying to make them work until I discovered that you guys shared the example file.

I was able to follow Bruno’s method to clean the dates but I went back to Arjen different approaches as to learn the different tools in Knime(I am new to the system).

ArjenEX.
For the Third Method you shared. I have been receiving the following error message at the Bulk replacer step.
image

I have changed the output column field but still getting the same error response. Would you have any pointers.

Hi @ADonia
It’s hard to tell from just that one screenshot because it’s cut off a bit.
Prime suspect would be that you forgot to apply the flow variable. Whenever this is done properly, you will see this as message in the bottom in the Column Expression.

Thanks, Worked like a charm. I had to be careful with the naming of the columns.

1 Like

Hi @ADonia , the main goal behind my method was to avoid having to manually write the rules (January - 1, February - 2, etc), and the cleaning (not so much cleaning really, but standardizing rather) was just a way to reach that.

Anytime we can avoid manual input or manipulation is the best, as there is always risk of mistakes when doing these manually - as mentioned, I spotted a few of them already from the screenshots.

Not to discard what @ArjenEX has suggested (his method works), and also as I mentioned, the mistakes could have been done on purpose, but on purpose or not, the fact remains that the potential for mistakes is there.

For example, there was a mistake in option 1 in the replace statements for:
replace(replae(column("Date"), "October", "10"), "Nov", "10")

This essentially happened from a copy and paste from “Nov” to be modified for October, but “Nov” did not get modified.

Not to mention that you would have to write 12 rules manually.

Another mistake was in option 2 where both March and September translating to month 3:

And for option 3, I’m guessing the rules where based on option 2 as it has the same mistake:


image

With the Extract Date&Time Fields, basically it’s Knime that is applying the rules automatically. No need for manually created rules, no need to create 12 individual rules.

Just wanted to point out that the focus from my solution was not about “cleaning” the dates, but rather was to avoid manually writing down the rules which has the risk of incurring mistakes :slight_smile:

1 Like

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.