I couldn’t find any hits for various searches on this topic… but maybe I don’t know what to search for!
Is there a way to generate two random variables and calculate a third from them with the simple formula below, so that all three variables fit the statistics (mean and std dev) that I have for each of them?
I want to create a fictitious dataset with three numerical variables that are linked using real world stats that I have on the distributions of those variables.
The specific problem I have is to reverse engineer a dataset of fictitious Venture Capital funded companies using published statistics of the stage of financing (pre-seed, seed, early, late, growth), company valuation at time of each financing stage (pre money), size of investment in round of financing, % acquired in round.
The stats I have cover multiple years and each of the stages above, and give the mean, median, and 10th / 25th / 75th / 90th percentiles for each of the variables (valuation, investment size, % acquired).
The variables are inherently linked in that % acquired = size of investment / (valuation + size of investment).
I started with a blank table and created clusters that I will fit to the percentiles above. I then used the Gaussian node to generate the data from the statistics for each variable. For each cluster I used the mean of the time series statistics as the mean and the std dev of the same time series as the std dev, which is not great but gives me something to generate the data from.
This is OK for each variable independently: they look good enough to ‘fit’ the statistics I started with, and it doesn’t need to be highly accurate anyway.
However, for a given company, because each stage’s three variables are generated independently at random, they don’t follow the above equation for valuation, size and % acquired.
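To illustrate the mismatch outside my workflow, here is a minimal Python sketch of what I think is happening. The means and std devs are made-up placeholders rather than my real statistics, and `independent_company` is just a name I picked for the example.

```python
import random

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical stage-level statistics (placeholders, not published figures)
VALUATION_MEAN, VALUATION_SD = 10.0, 3.0     # pre-money valuation, $M
INVESTMENT_MEAN, INVESTMENT_SD = 2.5, 0.8    # size of investment, $M
PCT_MEAN, PCT_SD = 0.20, 0.05                # fraction acquired in the round


def independent_company():
    """Draw the three variables independently, one Gaussian per variable."""
    valuation = random.gauss(VALUATION_MEAN, VALUATION_SD)
    investment = random.gauss(INVESTMENT_MEAN, INVESTMENT_SD)
    pct_acquired = random.gauss(PCT_MEAN, PCT_SD)
    return valuation, investment, pct_acquired


companies = [independent_company() for _ in range(1000)]

# Gap between the drawn % acquired and the value the formula implies
gaps = [abs(pct - inv / (val + inv)) for val, inv, pct in companies]
mean_gap = sum(gaps) / len(gaps)
print(f"mean absolute gap: {mean_gap:.3f}")
```

Each column on its own matches its target mean and std dev reasonably well, but the per-company gap is clearly non-zero, which is the inconsistency I am trying to get rid of.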
I’m considering using target shuffling in a loop until I satisfy a set of criteria that each company’s data fits within the distribution statistics I have, but I’m thinking there must be a better way to go about this.
Is there a way to generate two random variables and calculate the third from them with the simple formula above, so that all three variables fit the statistics (mean and std dev) that I have for each of them?
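To make the question concrete, this is roughly the shape of what I have in mind, as a rough Python sketch rather than a working solution. It assumes lognormal draws for valuation and investment size (my own assumption, chosen only to keep the values positive; the Gaussian node I used so far does not do this), and the target numbers are placeholders. `lognormal_params` and `linked_company` are hypothetical helper names.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable


def lognormal_params(mean, sd):
    """Convert a target mean/std dev into the mu/sigma of a lognormal."""
    sigma2 = math.log(1.0 + (sd / mean) ** 2)
    return math.log(mean) - sigma2 / 2.0, math.sqrt(sigma2)


# Hypothetical targets (placeholders, not published figures), in $M
VAL_MU, VAL_SIGMA = lognormal_params(10.0, 3.0)   # pre-money valuation
INV_MU, INV_SIGMA = lognormal_params(2.5, 0.8)    # size of investment


def linked_company():
    """Draw two variables at random and derive the third from the formula."""
    valuation = random.lognormvariate(VAL_MU, VAL_SIGMA)
    investment = random.lognormvariate(INV_MU, INV_SIGMA)
    pct_acquired = investment / (valuation + investment)
    return valuation, investment, pct_acquired


rows = [linked_company() for _ in range(5000)]
pcts = [pct for _, _, pct in rows]

# Compare these against the published mean / std dev for % acquired
print(f"% acquired mean:    {statistics.mean(pcts):.3f}")
print(f"% acquired std dev: {statistics.stdev(pcts):.3f}")
```

This guarantees the formula holds for every company, but the mean and std dev of the derived % acquired just fall out of the other two distributions. What I don’t know is how to steer them toward my published statistics, for example whether I need some correlation between valuation and investment size.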
Any suggestions would be very welcome. I’m open to using R, Python, or Java script nodes, but I don’t profess to be adept with any of these languages.