Sampling of records

Hi

I have a task to take a sample from a database.

Here is what the database looks like:

Product, date, serial number
A, 2021-01-01, xxx
A, 2021-01-01, yyy
B, 2021-01-01, ccc
C, 2021-01-01, vvv
D, 2021-01-01, bbb

But of course there are more products and different dates.

I have to take a random sample of 100 records from each day, but the condition is that the occurrences of the Products must sum to 100. So when on one date there are 100 products (A, B, C…) but 150 records, only one record of each product will be chosen and there won't be duplicates. On the other hand, if there are 80 products and 150 records, I want at least one record of each product, and the gap (20) should be filled with 20 other random records, without too many duplicates.
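The rule above can be sketched in Python (a hypothetical `sample_day` helper, not part of any KNIME workflow; it assumes each day's records arrive as a list of `(product, date, serial)` tuples):

```python
import random
from collections import defaultdict

def sample_day(records, sample_size=100):
    """Sample up to sample_size records for one day, guaranteeing
    at least one record per product when possible."""
    if len(records) <= sample_size:
        return list(records)          # fewer records than needed: take all

    # group record indices by product
    by_product = defaultdict(list)
    for i, rec in enumerate(records):
        by_product[rec[0]].append(i)

    products = list(by_product)
    if len(products) >= sample_size:
        # more products than slots: one record each from a random
        # subset of products, so no product appears twice
        chosen = random.sample(products, sample_size)
        idx = [random.choice(by_product[p]) for p in chosen]
    else:
        # one record per product, then fill the gap with random leftovers
        idx = [random.choice(by_product[p]) for p in products]
        taken = set(idx)
        leftovers = [i for i in range(len(records)) if i not in taken]
        idx += random.sample(leftovers, sample_size - len(idx))
    return [records[i] for i in idx]
```

This is only a sketch of the three cases described above (more products than slots, fewer products than slots, fewer records than the sample size), not a replacement for the KNIME solutions below.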

Hi @89trunks , what if you have fewer than 100 records on a specific day?

EDIT: Also, is there any unique identifier per record? This would be useful for the case where you have fewer than 100 products.


If I have fewer than 100 records on some day, I would have to take all of them without multiplying them (without creating duplicates), so if there are 80 records I take all 80.

As for an identifier, I have a unique row number and also the serial number.

Hi @89trunks

I created this workflow, sampling_records.knwf (50.6 KB), with some sample data. First it creates a counter within every product, which makes it possible to select at least one record of every product. Next, based on the expected sample size (max_sample_size) configured in the wizard (100 in your dataset), it calculates the records to be sampled (the extras, via a stratified sample), or, in case there are more unique products than records needed, there is a final random sampling at the end.
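The counter-per-product idea can be sketched in Python (a made-up `sample_via_counter` helper for illustration; note that for the extras it uses a simple random sample, whereas the workflow uses a stratified one):

```python
import random
from collections import defaultdict

def sample_via_counter(records, max_sample_size=100):
    # step 1: assign a counter (1, 2, 3, ...) within every product,
    # in random order
    by_product = defaultdict(list)
    for rec in records:
        by_product[rec[0]].append(rec)
    firsts, extras = [], []
    for recs in by_product.values():
        random.shuffle(recs)
        firsts.append(recs[0])    # counter == 1: one record per product
        extras.extend(recs[1:])   # counter > 1: candidates to fill the gap

    # more unique products than records needed: final random sampling
    if len(firsts) >= max_sample_size:
        return random.sample(firsts, max_sample_size)

    # otherwise fill the gap (simple random here, stratified in the workflow)
    gap = min(max_sample_size - len(firsts), len(extras))
    return firsts + random.sample(extras, gap)
```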


gr. Hans


Hi @89trunks , thank you for confirming.

I put something together that covers all the use cases (see the different sample data sets); it looks like this:

In my samples, I've used only 10% of your data sizes (that is, 10 for 100, 15 for 150, and 8 for 80), so I have set my sample size to 10. You can set yours to 100. This can be configured in the Variable Creator node:
image

First use case: More than 100 records with more than 100 products
Input data:
image

Results:
image

Second use case: More than 100 records with 100 products
Input data:
image

Results:
image

Third use case: More than 100 records with 80 products
Input data:
image

Results:
image

Fourth use case: Fewer than 100 records
Input data:
image

Results:
image

Here, the results will be exactly the same as the input data, as it's not processed at all; it just takes everything as is.

There are other similar use cases, such as exactly 100 records with more than, exactly, or fewer than 100 products, but they work the same way as the use cases with more than 100 records.

Here’s the workflow: Sampling of records.knwf (55.8 KB)

Note that you may get different samples each time you run the workflow, as it retrieves the data randomly, but the rules remain the same.

EDIT: Had to re-upload the workflow, as one of the components was still linked to my library.


I guess my thought process is too simple here (apologies). Couldn't you just oversample, filter out duplicates, and then shuffle and take 100 of them?

Hi @Daniel_Weikert , shuffling was my first thought too, but given the different situations and use cases, it will not work in some of them.

@bruno29a and @HansS

It took me some time to reply :sweat_smile: Both ideas do what they should, so thank you both for your time and help.

I tried them and they work fine.

I was wondering how I can add one more condition.

I want my result to be somewhat close to an even distribution of Product counts, while still sampling randomly from the database and containing at least one record from each group.

When there are, for instance, the following counts: Product A - 2000 rows, Product B - 100, Product C - 400, and I want an end sample containing 100 records, the result is somehow Product B - 5, Product C - 10, and the remaining 85 are Product A.

Sometimes this is too big a discrepancy.

Hi @89trunks , I think you will have to define what “evenly” would mean.

In your case, you have a ratio of 20:1:4, which yielded a distribution of 85:5:10; that's not too bad considering the ratio.

What would you consider as even from a pool of 2000 Product A, 100 Product B, 400 Product C? Like 33, 34, 33?

What if you have 4000 Product A, 1 Product B, 2 Product C? What would the desired result be?

From a pool of 2000 Product A, 100 Product B, 400 Product C, I would want to have something like 33, 34, 33, or A: 40, B: 30, C: 30, as close as it can be.

From a pool of 4000 Product A, 1 Product B, 2 Product C, I would like to have A: 97, B: 1, C: 2.

This is stratified sampling. I have not read the whole thread, but have you already looked at the Partitioning node?
br

Hi @89trunks , in that case, I have re-worked the logic.

To get this kind of even distribution, I've reworked it so that it takes one record for each product at a time until it reaches 100 (or any number that's defined). It's a bit slower, but that's fine if you're doing 100.

New workflow looks like this:

For a sample of 2000 A, 100 B, 400 C, here’s the distribution:
image

And for a sample of 4000 A, 1 B, 2 C, here’s the distribution:
image

Here’s the new workflow: Sampling of record - even distribution.knwf (301.4 KB)

One thing to note: I think it's a known bug, but tagging @ipazin and @ScottF just in case.
image

If I manually define the Last row number to be 100 in the Row Filter, I get 100 records. However, if I use a variable to set that number to 100, I somehow get 101 records:


image

So I had to create another variable to do that adjustment:



image


This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.