Detecting and defining per-month usage periods in multi-month data

Cardinal_Fire · January 27, 2022, 10:17am

Greetings Knimers, hope everyone is well.

I have server farm power usage data, periodically provided as csv, containing records per server, spanning a range of months. In those records one can see, per server, power On/Off events: sometimes they are minutes apart, sometimes hours. This usage needs billing on a monthly basis and to achieve this I can pivot the server power event data into columns for “Power On date/time” and “Power Off date/time” and then use wonderful tools like the Knime “Date & Time Difference” node to accurately demonstrate the period a given server was powered on for.

The problem I’ve run into (well, a couple of problems) is when a server has run from, lets say, the 11th of one month, to the 2nd of the next. Whilst in this case I can determine the duration, I need to be able to attribute one portion to one month’s billing and the remainder to the next month’s billing.

Here are two classic examples of this conundrum, one spans multiple months, the next spans two:

Node	Powered On	Powered Off	Power Event Period	Minutes
Enclosure 7, Server 8	2020-12-17T03:23:13.114Z	2021-02-08T11:39:53.804Z	PT1280H16M40.69S	76816
Enclosure 9, Server 2	2021-02-11T08:27:15.965Z	2021-03-02T12:36:58.327Z	PT460H9M42.362S	27609

I perceive I need to somehow punctuate the threshold between months with an artificially inserted MonthEnd & MonthStart record. I am certain I’m not alone in juggling data like this - has anyone found, or can anyone conj our, a solution?

Sincere thanks
Cardinal

PS: Another closely related issue is the tracking of server power state at the end of a set of records.

Using the first row entry as an example, imagine my data set, at one point in time, ended 2020-12-31, Encl7,Srv8 would be an open ended record (with the server left in an on-state needing capping). It’s debatable then if by the end of 2021-02 if the whole of January would be billed as no recorded power event would show until the “off” event 2021-02-08T11:39:53.804Z. Has anyone come across these data issues?

gonhaddock · January 27, 2022, 10:51am

Hello @Cardinal_Fire
I can’t say your problem is simple but i may suggest you an approach. You need another input that it will be your billing reporting granularity.

Then you have to run queries [ i guess in loops ] for each billing period and extract from each record in your data a billable time range applicable for the month. Then it doesn’t mind if a record has been working for months.

In each of the loop iterations you will have to calculate Effective.Starting and Effective.Ending for the billing period. In the case of ‘out of range’ you will apply the period starting or ending time.

With the approach that I am pointing to, you may be able to solve clapping issues commented in your PS as well.

I recently solved a challenge in the Forum about forecasting periods based in a weekly granularity, and another one to extract differentials from retail on a monthly basis schedule. In both of them I had to work in with the reporting granularity as input.

You can have a look to this two workflows aiming for insights:

BR

Cardinal_Fire · January 27, 2022, 11:19am

BR, firstly, thank you. I did take time to review past posts to see if any touched this subject but I didn’t happen up on this -thank you for forwarding it.

I too think there’s a solution for the first issue that’ll naturally solve, or provide support to, the latter issue, both being related. Let me take a look at this thread and see if it can help.

Update to follow.
Best regards
GC

PS: Yes, currently I’m using loops to process sorted event-sets per server per month period and also separate loop functions to provide me a “till receipt” of each server’s first start, and first end, datetime per month for reference.

Daniel_Weikert · January 27, 2022, 5:55pm

Do you have a sample dataset to work with to share?

gonhaddock · January 27, 2022, 7:10pm

Hi @Daniel_Weikert

20220127_usage_periods.knwf (74.9 KB)

You can start from here if you will; I won’t have too much time for this onwards

BR

gonhaddock · January 28, 2022, 7:38am

Same staff, however reporting granularity would to be in Date&Time format…
20220127_usage_periods2.knwf (74.8 KB)

BR

gonhaddock · February 1, 2022, 7:21pm

Hello @Cardinal_Fire
I’ve prepared a workflow, that I think it covers the two main bullets mentioned in your post (both of them are related as you said):

month’s portion billing estimates
apply caps to over-ranged usage periods

BR

Cardinal_Fire · February 2, 2022, 9:11am

@gonhaddock, I just wanted to thank you for helping here by taking up this challenge (I’ve been away and just returned to find your latest posts). I studied the workflows you originally shared, understood, and agreed this “billing reporting granularity” approach - I see where you’re coming from.

I experimented by using “Create Date & Time Range” node to create the first days of each month (with time stamp of 1 second past midnight), then used “Date & Time shift” node to duplicate and retract the date/time by 2 seconds, thus giving me two reference columns: one is the last day of the months, the other is the first day of the month: I have a rough collection of subsequent nodes seeking to use this info but this is not yet completed…

Today I am planning to investigate how I can use this new reporting granularity to provide start/end condition induced record termination (and re-start) of periods found in the input records. But first I will look at this new model and come back to you as soon as possible.

In the meantime, my sincere thanks - feedback/updates to follow. GC

Cardinal_Fire · February 3, 2022, 9:49am

Cardinal_Fire:

@gonhaddock, good morning. I’m increasingly convinced this model is the solution (which I am ready to mark as such) but I have discovered a data handling error in both your example which also manifests itself in my replica when real data is applied.

The Constant Value Column node seems to be doing precisely what it’s name suggests and setting the REPORT_ID column, without regard, to (in your model) 202012 and in my model (with compliant input data) 202006 (my data start point). [see attached grabs]

Suggestion/Question
Can you advise the intention/ purpose of the Constant Value Column node and consider whether this is this the best node for this role?
I would anticipate the goal is to match value to that shown in the corresponding “Powered On_Period” column, not set it regardless to the first available date in the granularity data?

So close to success - hope you can help.

2022-02-03 KnimeModeldataError1053×153 20.8 KB

Cardinal_Fire · February 3, 2022, 10:20am

PS: it could actually be the Table Row To Variable Loop Start blindly injecting the same farmed value into every row…

gonhaddock · February 3, 2022, 10:40am

Hi @Cardinal_Fire
I did a fast check and the model is performing right in HUB’s set up. If you check the Constant Value Column set up in the model, it is controlled by a variable. You must need to set up in the same way for your real workflow model.

This is the core feature in the whole loop process, as it is selecting for each loop step, one of the invoicing periods to compare the logics in the two following Rule Engines. This two steps of LOGIC indexing decides for each Power Event if Power Event item applies or not for the current in-loop period.

I don’t see this error manifesting in the example model. As looking at Complete Report output, I see that the logic is deciding correctly whether a Power Event range applies to curren in-loop (iteration) period.

Let me know if fixing the variable address the error.

PS.- I’ve just downloaded from Hub for double check, and there isn’t any miss version with my desktop project.

Cardinal_Fire · February 3, 2022, 12:08pm

Thanks for the quick response - good call but I can confirm my settings are all identical (as you’d expect them to be). With regard to “…as it is selecting for each loop step, one of the invoicing periods to compare the logics in the two following Rule Engines.” I see this and have two questions (as in both the model and my application of it the behaviour is the same.

How is the “choice” of input variable actually made (as it’s divorced from any main data processing influence prior to the loop start)? - I ask this as the evidence shows no choice is being made.
As per the included graphics above both models are only injecting one value over and over. I observe that your model works because that value 202012 is in range, but in mine, where it is 202006 the subsequent rule engines fail. The fault is this REPORT_MONTH variable input - which I why I asked.

2022-02-03 KnimeNodeError1256×424 58.4 KB

Example
As shown, in your example model, observe REPORT_ID in the output table of Constant Value Column, look at the REPORT_ID value for the second row - there you can see it is the same as the row above despite not being in any way related to the report date values in Powerer On_Period. This brilliant example model does work, but only because, by apparent chance, the Table Row to Variable Loop Start has injected a date that’s just in range. With live data, this whitewash of that field means the concept does not work.

I’m just wondering how that injection of the Reporting Granularity can be context specific per row rather than same value for the whole dataset (as this would solve the issue).

Respectfully, GC

gonhaddock · February 3, 2022, 3:07pm

I wouldn’t expect nothing but that , but just in case

Ok I see the point now. It is true that for a long dataset it is very inefficient (in fact is totally inefficient). But on my exemption score, with this workflow I was just trying to cover the conceptual approach to the caption. In fact I built in this way intentionally and let me explain that; I added the 202104 ‘REPORT_ID’ knowing it will return missed values, so at the end for the summary report, the missed value just pivots to the report position. It is a cheap null handling for the conceptual model.

Now for, let’s call it “your new well argued requested functionality”, and from the functional perspective the changes should be simple: connect the Table Row To Variable Loop Start’ to a ‘Rule Based Row Filter’ just before the ‘Constant Value Column’ node. The rules to apply are:
Powered On_Period <= Report_ID (variable) OR Powered Off_Period >= Report_ID (variable)
… more or less same stuff within the two Rule Engines for the validation (these may be removed); and an ‘Empty Table Switch’ afterwards to handle these occurrences, but…

As I don’t have from this position the business view about the continuity of the dataset etc; for my nice HUB project portfolio; it will require to add a null handling section, aiming to achieve a nice data gap cover for the Summary Report (null handling section).

You can start to work out these upgrades in your real model if you will, as I am busy now. I would be able to work later in the evening updating the HUB concept.

Thank you for the feedback.

Cardinal_Fire · February 3, 2022, 6:08pm

BR, again my thanks. I assure you, any perceived pressure-to-solve is only a) with great respect to your time thus far and for your convenience going forward and b) out of a strong desire to learn. I have not come across these specific nodes, nor have I seen or used value injection to chained nodes in this way: completion of the puzzle will demonstrate the method - I guess I’m eager to see this magic realised.

I am of course testing on my local model in an attempt to resolve, but can appreciate it would be good also to have your reference Hub artefact complete.

Hope to swap notes later. In the meantime, thank you. GC

gonhaddock · February 3, 2022, 10:05pm

Thanks for your encouraging comments, not needed indeed as I am learning as well by helping others with their use cases.

Go back to business. The workflow in HUB has been updated, I hope it covers all bullet points with the desired performance now. The main changes over the previous version are three (red marks in the picture):

As previously commented, a ‘Rule Based Row Filter’ validates the Power Event Period on a pre-processing position; and replaces the two ‘Rule Engine’ nodes with the following rule (include TRUES):

( $Powered On_Period$ <= $${IREPORT_ID}$$ ) AND ( $Powered Off_Period$ >= $${IREPORT_ID}$$ ) => TRUE
TRUE => FALSE

This change #1 generates a problem, as a gap in time events continuity will not be reflected in the summary report when pivoting the REPORT_ID.

To solve this, an ‘Empty Table Switch’ allows to introduce an unique empty register in the loop step, that can be easily filtered in the reporting section.
This latest change #2 brings another undesirable effect, because the Summary Report’s ‘Pivot’ node changes the default resorting of the columns, giving priority to the unreferenced ones.
As we want the columns to be ordered chronologically in an automated way, and aiming to achieve a Time-Line format; a new branch has been added to feed the ‘Reference Column Resorter’ at the very end of the workflow.

That’s all from this side. Regards

Cardinal_Fire · February 4, 2022, 6:14pm

@gonhaddock High Kudos. Not just for solving this challenge, but for your patient incremental model evolution. Thank you.

At my end I still have issues with the absence of valid BILL EFF_Start/End values, as described, blocking the conclusion of the model when real data is applied, but I cannot argue that your model doesn’t deliver the required outcome and I can work on the resolution of my issues.

Defining the reporting granularity was the innovative approach that was missing: it delineates the scope of billing perfectly and your model is was the key to demonstrating this.

Sincere thanks for this work, it was pivotal. Best regards. GC

Cardinal_Fire · February 10, 2022, 9:37am

PS: I discovered why both models fail when data other than the canned data is used: “Table Row to Variable Loop Start” can only insert a single variable many times. Instead of inserting the value associated with the row-data being processed, it’s just blindly stamping the first value it received from the start/end_report thread against every one of my ~1300 rows.

I’m therefore focused now on a loop that calculates the correct variable, for the dates in the associated row being processed, as this will provide the missing per-row context needed.

system · February 17, 2022, 9:38am

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.