Is there a way of flagging dates which are close together in time (if I say a week, for example?). Trouble is, there is no base date from which to define it; it would be looking across to other values. A terrible explanation I know, but perhaps the below example would help!

Date

Colour

01-Jan-22

Red

01-Feb-22

Blue

14-Feb-22

Yellow

15-Feb-22

Red

01-Mar-22

Blue

02-Mar-22

Blue

07-Mar-22

Blue

08-Mar-22

Red

01-Sep-22

Red

Those “Blue” rows in March I would like to be able to flag. But equally that could be in May, June or September. The ultimate goal would be to then look into each date cluster to see if nominal values are the same, but I think I can do that. It’s just trying to gather up similar dates which is the part stumping me for the time being.

The Auto-Binner node will then start looking for clusters where you can select the method of binning to your liking. Here I opted for a simple setup with 5 bins whereby the bin number is equivalent to your “colour”

The sample data is generated by creating a date range between 1 Jan 2022 and 31 Dec 2022. It then randomly knocks out approximately 80% of the dates leaving a random distribution of days remaining. This should approximate your data.

The top branch then gets the last date in the sequence, converts it to a variable which is then used in the Date&Time Difference node to calculate the number of days between the row date and the reference last day in the sequence. We don’t need the actual date, we only need to be able to calculate the number of days between rows.

Switching to the bottom branch, we calculate the Manhattan distance (for a single variable this is just the number of days difference between any two rows) and feed the distance calculation and data to the Hierarchical Clustering node. The Hierarchical Clustering node calculates the maximum number of days between rows in a cluster.

The Hierarchical Cluster viewer shows the hierarchy of clusters and cumulative number of rows in the cluster as the threshold cluster distance is increased. Note: The algorithm will cluster adjacent nodes together (distance one) and then add any nodes that are a bit further away to the next (aggregated) cluster further up the hierarchy until all rows are aggregated into one cluster. This enables clustering of more than seven rows if required.

The top branch takes the number of days between the first and last date (the domain) and divides 7 days (our window) by this domain to get a normalised threshold value.

The normalised threshold value is passed to the Cluster assigner to assign the clusters to the data.

The groupby node groups on cluster id and counts the rows per cluster to demonstrate the clustering of the data.

Following two charts are from the Hierarchical Cluster View:

Hierarchy of clusters. Each node at the bottom is a row in the data, the hierarchy shows how they are clustered as the number of days between the minimum and maximum date increases in the cluster.

That is really helpful! It looks like the more complex version (hierarchical clustering rather than auto-binning) is probably the way forward. It looks like your workflow assigns the date clusters nicely, then I just need a way to flag like nominal values (“colour” in this example) in the same cluster. I shall look into and learn how your workflow works and then develop it further for this identification. Thank you very much!