# Generate multivariate data to fit a set of statistics

Hi!

I couldn’t find any hits for various searches on this topic… but maybe I don’t know what to search for!

TLDR// question:
Is there a way to generate 2 random variables with the below simple formula applied to calculate the third variable, where the output of the 2 random variable generators will fit the set of statistics that I have (mean and std dev) for all 3 of my data points?

I want to create a fictitious dataset with three numerical variables that are linked using real world stats that I have on the distributions of those variables.

The specific problem I have is to reverse engineer a dataset of fictitious Venture Capital funded companies using published statistics of the stage of financing (pre-seed, seed, early, late, growth), company valuation at time of each financing stage (pre money), size of investment in round of financing, % acquired in round.

The stats I have cover multiple years and each of the stages above, and give the mean, median, and 10th / 25th / 75th / 90th percentiles for each of the variables (valuation, investment size, % acquired).

The variables are inherently linked in that % acquired = size of investment / (valuation + size of investment).
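To make the link concrete, here is a minimal Python sketch of that formula and its inverse. The function names `pct_acquired` and `implied_valuation` are made up for illustration. It shows that only two of the three variables are free: once any two are fixed, the third follows.

```python
def pct_acquired(pre_money_valuation, investment):
    """Fraction of the company acquired in the round (formula from the post)."""
    post_money = pre_money_valuation + investment
    return investment / post_money


def implied_valuation(investment, pct):
    """Rearranged formula: pre-money valuation implied by investment and % acquired."""
    return investment * (1.0 - pct) / pct


# Illustrative numbers only (e.g. in millions): valuation 8, investment 2
share = pct_acquired(8.0, 2.0)          # 2 / (8 + 2)
valuation = implied_valuation(2.0, share)
print(share, valuation)
```

So three independently generated columns will almost never satisfy the equation; one of them has to be derived from the other two.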

My method:
I started with a blank table and created clusters that I will fit to the percentiles above. I then used the Gaussian node to generate the data from the statistics for each variable. For each cluster I used the mean of the time series statistics as my mean and the std dev of the same time series as my std dev, which is not great but is something to generate the data from.

This is OK for each variable independently: they look good enough to ‘fit’ the statistics I started with. It doesn’t need to be highly accurate anyway.

Problem
However, for a given company, because each stage’s three variables are randomly generated independently, they don’t follow the above equation for valuation, size and % acquired.

I’m considering using target shuffling in a loop until I satisfy a set of criteria that each company’s data fits within the distribution statistics I have, but I’m thinking there must be a better way to go about this.

Question
Is there a way to generate 2 random variables with the above simple formula applied to calculate the third variable, where the output of the 2 random variable generators will fit the set of statistics that I have (mean and std dev) for all three of my data points?

Any suggestions would be highly welcomed. I’m open to using R, python, Java script nodes but don’t profess to be adept with any of these languages.

Many thanks

Hello @SamWvan and welcome to the KNIME community

In a brainstorming mood, and based on the challenge description:

One concept to test is a Python Script node (my preference, as the code I have in mind is simple; it should be simple in R as well) that generates a random sample from a mean, a standard deviation and a sample size, as you did with your method. You can then calculate the percentile values of that sample and use the published percentiles as the constraint rule.

Running this scripting node within a KNIME loop you can test two different stochastic modelling approaches:

1. Iterate a predefined number of times n (e.g. n = 1E4) and accept the sample whose percentile vector is closest to the target percentile vector.

2. Iterate under a conditional statement until the sample and target percentile vectors agree to within a predefined tolerance.

Performance could be a problem, depending on data size and tolerance.
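As a rough sketch of the two approaches above (not tested inside a KNIME Python Script node), something like the following could work with NumPy. Everything named here is an assumption for illustration: the `TARGETS` percentiles and `PARAMS` means / std devs are made-up placeholders, the distance measure (mean relative gap between percentile vectors) is just one possible choice, and negative Gaussian draws are clipped to keep valuations and investments positive. Valuation and investment are drawn at random, and % acquired is derived from the formula so that it always holds per company.

```python
import numpy as np

PCTS = [10, 25, 50, 75, 90]

# Hypothetical target percentiles (10/25/50/75/90) for one stage: placeholders only
TARGETS = [
    np.array([4.2, 6.0, 8.0, 10.0, 11.8]),    # pre-money valuation
    np.array([1.4, 1.7, 2.0, 2.3, 2.6]),      # investment size
    np.array([0.12, 0.16, 0.20, 0.26, 0.33]), # % acquired (as a fraction)
]
# Hypothetical (valuation mean, valuation sd, investment mean, investment sd)
PARAMS = (8.0, 3.0, 2.0, 0.5)


def simulate(rng, n, val_mean, val_sd, inv_mean, inv_sd):
    """Draw valuation and investment, then derive % acquired from the formula."""
    # Gaussian draws as in the original method; clipped to stay positive (assumption)
    valuation = np.clip(rng.normal(val_mean, val_sd, n), 1e-6, None)
    investment = np.clip(rng.normal(inv_mean, inv_sd, n), 1e-6, None)
    pct = investment / (valuation + investment)
    return valuation, investment, pct


def distance(sample, targets):
    """Sum over the three variables of the mean relative percentile gap."""
    total = 0.0
    for values, target in zip(sample, targets):
        got = np.percentile(values, PCTS)
        total += np.mean(np.abs(got - target) / np.abs(target))
    return total


def best_of_n(targets, params, n_companies=500, n_iter=1000, seed=0):
    """Approach 1: fixed number of iterations, keep the closest sample."""
    rng = np.random.default_rng(seed)
    best, best_d = None, np.inf
    for _ in range(n_iter):
        sample = simulate(rng, n_companies, *params)
        d = distance(sample, targets)
        if d < best_d:
            best, best_d = sample, d
    return best, best_d


def until_tolerance(targets, params, tol, n_companies=500, max_iter=10000, seed=0):
    """Approach 2: stop as soon as the distance is within tol (or give up)."""
    rng = np.random.default_rng(seed)
    for i in range(1, max_iter + 1):
        sample = simulate(rng, n_companies, *params)
        d = distance(sample, targets)
        if d <= tol:
            return sample, d, i
    return None, None, max_iter


if __name__ == "__main__":
    (valuation, investment, pct), d = best_of_n(TARGETS, PARAMS, n_iter=200)
    print("best distance:", d)
```

Note that resampling from fixed parameters can only get as close as those parameters allow; if the derived % acquired stays far from its targets, the means and std devs themselves (or the choice of a Gaussian) would need adjusting rather than more iterations.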

BR
