Flagging dates close in time

Is there a way of flagging dates which are close together in time (if I say a week, for example?). Trouble is, there is no base date from which to define it; it would be looking across to other values. A terrible explanation I know, but perhaps the below example would help!

Date Colour
01-Jan-22 Red
01-Feb-22 Blue
14-Feb-22 Yellow
15-Feb-22 Red
01-Mar-22 Blue
02-Mar-22 Blue
07-Mar-22 Blue
08-Mar-22 Red
01-Sep-22 Red

Those “Blue” rows in March I would like to be able to flag. But equally that could be in May, June or September. The ultimate goal would be to then look into each date cluster to see if nominal values are the same, but I think I can do that. It’s just trying to gather up similar dates which is the part stumping me for the time being.

Hi @JWebb

An Auto-Binner node came to mind when I read your description. I would opt for something like:

Use an Extract Date&Time Fields node, select the Week as Date Field. This will generate the corresponding week number for each date.

The Auto-Binner node will then start looking for clusters where you can select the method of binning to your liking. Here I opted for a simple setup with 5 bins whereby the bin number is equivalent to your “colour”

Perhaps this could be a starting point.

2 Likes

@JWebb

I don’t know if the following workflow will help. I have put it up on KNIME Hub here so that people can play with it.

The workflow uses KNIME distance nodes and hierarchical clustering to identify dates within seven days of each other.

  1. The sample data is generated by creating a date range between 1 Jan 2022 and 31 Dec 2022. It then randomly knocks out approximately 80% of the dates leaving a random distribution of days remaining. This should approximate your data.

  2. The top branch then gets the last date in the sequence, converts it to a variable which is then used in the Date&Time Difference node to calculate the number of days between the row date and the reference last day in the sequence. We don’t need the actual date, we only need to be able to calculate the number of days between rows.

  3. Switching to the bottom branch, we calculate the Manhattan distance (for a single variable this is just the number of days difference between any two rows) and feed the distance calculation and data to the Hierarchical Clustering node. The Hierarchical Clustering node calculates the maximum number of days between rows in a cluster.

  4. The Hierarchical Cluster viewer shows the hierarchy of clusters and cumulative number of rows in the cluster as the threshold cluster distance is increased. Note: The algorithm will cluster adjacent nodes together (distance one) and then add any nodes that are a bit further away to the next (aggregated) cluster further up the hierarchy until all rows are aggregated into one cluster. This enables clustering of more than seven rows if required.

  5. The top branch takes the number of days between the first and last date (the domain) and divides 7 days (our window) by this domain to get a normalised threshold value.

  6. The normalised threshold value is passed to the Cluster assigner to assign the clusters to the data.

  7. The groupby node groups on cluster id and counts the rows per cluster to demonstrate the clustering of the data.

Following two charts are from the Hierarchical Cluster View:


Hierarchy of clusters. Each node at the bottom is a row in the data, the hierarchy shows how they are clustered as the number of days between the minimum and maximum date increases in the cluster.


Clusters are ranked by number of days between min and max date. The number of days in the cluster is shown as the clusters are aggregated together.

Hope this helps
DiaAzul
LinkedIn | Medium | GitHub

2 Likes

That is really helpful! It looks like the more complex version (hierarchical clustering rather than auto-binning) is probably the way forward. It looks like your workflow assigns the date clusters nicely, then I just need a way to flag like nominal values (“colour” in this example) in the same cluster. I shall look into and learn how your workflow works and then develop it further for this identification. Thank you very much!

2 Likes

Calculating the datetime difference between rows and then add a rule engine based on that as the color column. Would that be an option as well?
br

I am not sure. To what would each row be compared?

I assumed you have some kind of rule like if difference is less then X then flag. As long as you found a solution for you great.
br

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.