[SOLVED] Creating data with pre-determined correlations in R

Issue

This Content is from Stack Overflow. Question asked by Ian Hargreaves

I am looking to simulate a data set with pre-determined correlations between the variables. The code below is where I am at, but I want to be able to control the parameters of each variable individually.

In short, how do I change the SD, mean, min/max, intervals, skew and kurtosis for each variable individually?

library(tidyverse)
library(faux)

cmat <- c(1,   .195,  .346,  .674,  .561,  
         .195,  1,    .479,  .721,  .631,  
         .346, .479,   1,    .154,  .121, 
         .674, .721,  .154,   1,    .241, 
         .561, .631,  .121,  .241,   1)

nps_sales <- round(rnorm_multi(100, 5, 3, .5, cmat, 
                   varnames = c("NPS",
                                "change in NPS",
                                "sales (t0)",
                                "sales (t1)",
                                "sales (t2)")), 0) %>%
    tibble()



Solution

You have specified rnorm_multi(n = 100, vars = 5, mu = 3, sd = .5, r = cmat, ...), so every variable gets the same mean (3) and the same standard deviation (0.5). rnorm_multi will accept vectors of the appropriate length for mu and sd (e.g. mu = c(3,3,3,2,2) and sd = c(1,0.5,0.5,1,2)), which will set the means and standard deviations of each variable accordingly.
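As a minimal sketch of the per-variable approach, the snippet below passes vectors for mu and sd. The 3 x 3 correlation matrix, the variable names a, b and c, and the specific means and SDs are made up for illustration, and the argument names follow my understanding of the current faux signature (correlations passed as r). It also checks that the matrix is positive definite first. It is worth running the same check on the 5 x 5 matrix from the question, since pairwise correlations that each look plausible can still be jointly impossible.

```r
library(faux)

# Illustrative 3 x 3 correlation matrix (values invented for this sketch)
cmat <- matrix(c(1.0, 0.2, 0.3,
                 0.2, 1.0, 0.5,
                 0.3, 0.5, 1.0), nrow = 3, byrow = TRUE)

# A usable correlation matrix must be positive definite:
# all eigenvalues strictly greater than zero
ev <- eigen(cmat, symmetric = TRUE, only.values = TRUE)$values
stopifnot(all(ev > 0))

# One mean and one SD per variable (hypothetical values)
mu  <- c(3, 3, 2)
sds <- c(1, 0.5, 2)

# empirical = TRUE asks for the sample statistics to match the
# requested values exactly rather than only in expectation
dat <- rnorm_multi(n = 100, vars = 3, mu = mu, sd = sds, r = cmat,
                   varnames = c("a", "b", "c"), empirical = TRUE)

colMeans(dat)     # per-variable means
sapply(dat, sd)   # per-variable standard deviations
cor(dat)          # correlation matrix of the simulated data
```

Rounding the result afterwards, as in the question, will move the means, SDs and correlations slightly away from the requested values.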

Adjusting the other characteristics (min/max, skew, kurtosis, etc.) will be much more challenging, and may require a question on Cross Validated. The reason everyone uses the multivariate normal is that it’s easy to specify means, SDs, and correlations, but you can’t easily control the other aspects of the distributions. You can transform the results to achieve some level of skew/kurtosis, but this may not give you as much flexibility and control as you want (see e.g. here).
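One common way to "transform the results" is to push each normal column through pnorm() and then through the quantile function of the distribution you want. The sketch below uses base R only. The 0.5 correlation, the gamma shape, and the [0, 10] bounds are arbitrary choices for illustration, not values from the question. The same per-column transformation could in principle be applied to standard-normal columns produced by rnorm_multi.

```r
set.seed(42)
n <- 1000

# Target correlation between the two underlying normal variables (arbitrary)
cmat <- matrix(c(1.0, 0.5,
                 0.5, 1.0), nrow = 2, byrow = TRUE)

# Correlated standard normals: independent draws times the Cholesky factor
z <- matrix(rnorm(n * 2), nrow = n) %*% chol(cmat)

# Map each column to (0, 1), then through the target quantile function
u <- pnorm(z)
x_skewed  <- qgamma(u[, 1], shape = 2, rate = 1)  # right-skewed, positive
x_bounded <- qunif(u[, 2], min = 0, max = 10)     # bounded to [0, 10]

# Simple moment-based sample skewness
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skewness(x_skewed)                             # positive for a right-skewed variable
range(x_bounded)                               # stays inside [0, 10]
cor(z[, 1], z[, 2], method = "spearman")       # rank correlation before
cor(x_skewed, x_bounded, method = "spearman")  # rank correlation after
cor(x_skewed, x_bounded)                       # Pearson correlation after
```

Because the transformations are monotone, the rank (Spearman) correlation carries over unchanged, but the Pearson correlation generally shifts away from the value you specified. That is one reason this approach gives less control than you might hope.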


This question was asked on Stack Overflow by Ian Hargreaves and answered by Ben Bolker. It is licensed under the terms of CC BY-SA 2.5 - CC BY-SA 3.0 - CC BY-SA 4.0.
