Introduction

The R package SimCorMultRes is suitable for simulation of correlated binary responses (exactly two response categories) and of correlated nominal or ordinal multinomial responses (three or more response categories) conditional on a regression model specification for the marginal probabilities of the response categories. A more detailed description of SimCorMultRes can be found in Touloumis (2016). This vignette briefly describes the simulation methods proposed by Touloumis (2016), introduces how to simulate ordinal responses under a marginal adjacent-category logit model and illustrates the use of the core functions of SimCorMultRes.

Areas of Applications

This package was created to facilitate the task of carrying out simulation studies and evaluating the performance of statistical methods for estimating the regression parameters in a marginal model with clustered binary and multinomial responses. Examples of such statistical methods include maximum likelihood methods, copula approaches, quasi-least squares approaches, generalized quasi-likelihood methods and generalized estimating equations (GEE) approaches among others (see references in Touloumis 2016).

In addition, SimCorMultRes can generate correlated binary and multinomial random variables conditional on a desired dependence structure and known marginal probabilities even if these are not determined by a regression model (see third example in Touloumis 2016) or to explore approximations of association measures for discrete variables that arise as realizations of an underlying continuum (see second example in Touloumis 2016).

Simulation Methods

Let Y_it be the binary or multinomial response for subject i (i = 1, …, N) at measurement occasion t (t = 1, …, T), and let x_it be the associated covariates vector. We assume that Y_it ∈ {0, 1} for binary responses and Y_it ∈ {1, 2, …, J ≥ 3} for multinomial responses.

Correlated nominal responses

The function rmult.bcl simulates nominal responses under the marginal baseline-category logit model where β_tj0 is the j-th category-specific intercept at measurement occasion t and β_tj is the j-th category-specific parameter vector associated with the covariates at measurement occasion t. The popular identifiability constraints β_tJ0 = 0 and β_tJ = 0 for all t, imply that β_tj0^* = β_tj0 and β_tj^* = β_tj for all t = 1, …, T and j = 1, …, J − 1. The threshold Y_it = j ⇔ U_itj^NO = max {U_it1^NO, …, U_itJ^NO} generates clustered nominal responses that satisfy the marginal baseline-category logit model @ref(eq:MBCLM), where U_itj^NO = β_tj0 + β_tj^′x_it + e_itj^NO, and where the random variables {e_itj^NO : i = 1, …, N, t = 1, …, T and j = 1, …, J} satisfy the following conditions:

e_itj^NO follows the standard extreme value distribution for all i, t and j (mean = γ ≈ 0.5772, where γ is Euler’s constant, and variance = π²/6).
e_i₁t₁j₁^NO and e_i₂t₂j₂^NO are independent random variables provided that i₁ ≠ i₂.
e_itj₁^NO and e_itj₂^NO are independent random variables provided that j₁ ≠ j₂.

For each subject i, the association structure among the clustered nominal responses {Y_it : t = 1, …, T} depends on the joint distribution and correlation matrix of {e_itj^NO : t = 1, …, T and j = 1, …, J}. If the random variables {e_itj^NO : t = 1, …, T and j = 1, …, J} are independent then so are {Y_it : t = 1, …, T}.

Suppose the aim is to simulate nominal responses from the marginal baseline-category logit model where N = 500, T = 3, (β₁₀, β₁₁, β₁₂, β₂₀, β₂₁, β₂₂, β₃₀, β₃₁, β₃₂) = (1, 3, 2, 1.25, 3.25, 1.75, 0.75, 2.75, 2.25) and x_it = (x_i1, x_it2)^′ for all i and t, with $x_{i1}\overset{iid}{\sim} N(0,1)$ and $x_{it2}\overset{iid}{\sim} N(0,1)$. For the dependence structure, suppose that the correlation matrix R in the NORTA method has elements $$ \mathbf{R}_{t_1j_1,t_2j_2}=\begin{cases} 1 & \text{if } t_1=t_2 \text{ and } j_1=j_2\\ 0.95 & \text{if } t_1 \neq t_2 \text{ and } j_1=j_2\\ 0 & \text{otherwise }\\ \end{cases} $$ for all i = 1, …, 500.

# parameter vector
betas <-  c(1, 3, 2, 1.25, 3.25, 1.75, 0.75, 2.75, 2.25, 0, 0, 0)
# sample size
sample_size <- 500 
# number of nominal response categories
categories_no <- 4  
# cluster size
cluster_size <- 3 
set.seed(1) 
# time-stationary covariate x_{i1}
x1 <- rep(rnorm(sample_size), each = cluster_size)
# time-varying covariate x_{it2}
x2 <- rnorm(sample_size * cluster_size) 
# create covariates dataframe 
xdata <- data.frame(x1, x2) 
set.seed(321)
library("SimCorMultRes")
# latent correlation matrix for the NORTA method
equicorrelation_matrix <- toeplitz(c(1, rep(0.95,cluster_size - 1)))
identity_matrix <- diag(categories_no)
latent_correlation_matrix <- kronecker(equicorrelation_matrix, identity_matrix) 
# simulation of clustered nominal responses
simulated_nominal_dataset <- rmult.bcl(clsize = cluster_size,
                                         ncategories = categories_no, 
                                         betas = betas, xformula = ~ x1 + x2,
                                         xdata = xdata,
                                         cor.matrix = latent_correlation_matrix)
suppressPackageStartupMessages(library("multgee"))
# fitting a GEE model
nominal_gee_model <- nomLORgee(y ~ x1 + x2,
                               data = simulated_nominal_dataset$simdata,
                               id = id, repeated = time,
                               LORstr = "time.exch")
# checking regression coefficients
round(coef(nominal_gee_model), 2)
#> beta10   x1:1   x2:1 beta20   x1:2   x2:2 beta30   x1:3   x2:3 
#>   1.07   3.18   1.99   1.35   3.40   1.70   0.89   3.06   2.22

Correlated ordinal responses

Simulation of clustered ordinal responses is feasible under either a marginal cumulative link model or a marginal continuation-ratio model.

Marginal cumulative link model

The function rmult.clm simulates ordinal responses under the marginal cumulative link model where F is a cumulative distribution function (cdf), β_tj0 is the j-th category-specific intercept at measurement occasion t and β_t is the regression parameter vector associated with the covariates at measurement occasion t. The category-specific intercepts at each measurement occasion t are assumed to be monotone increasing, that is −∞ = β_t00 < β_t10 < β_t20 < ⋯ < β_{t(J − 1)0} < β_tJ0 = ∞ for all t. Using the threshold Y_it = j ⇔ β_{t(j − 1)0} < U_it^O1 ≤ β_tj0 clustered ordinal responses that satisfy the marginal cumulative link model @ref(eq:MCLM) are generated, where U_it^O1 = −β_t^′x_it + e_it^O1, and where {e_it^O1 : i = 1, …, N and t = 1, …, T} are random variables such that:

e_it^O1 ∼ F for all i and t.
e_i₁t₁^O1 and e_i₂t₂^O1 are independent random variables provided that i₁ ≠ i₂.

For each subject i, the association structure among the clustered ordinal responses {Y_it : t = 1, …, T} depends on the pairwise bivariate distributions and correlation matrix of {e_it^O1 : t = 1, …, T}. If the random variables {e_it^O1 : t = 1, …, T} are independent then so are {Y_it : t = 1, …, T}.

Suppose the goal is to simulate correlated ordinal responses from the marginal cumulative probit model where Φ denotes the cdf of the standard normal distribution (mean = 0 and variance = 1), N = 500, T = 4, (β₁₀, β₂₀, β₃₀, β₄₀) = (−1.5, −0.5, 0.5, 1.5), (β₁₁, β₂₁, β₃₁, β₄₁) = (1, 2, 3, 4) and $\mathbf x_{it}=x_{i}\overset{iid}{\sim} N(0,1)$ for all i and t. For the dependence structure, assume that e_i^O1 = (e_i1^O1, e_i2^O1, e_i3^O1, e_i4^O1)^′ are iid random vectors from a tetra-variate normal distribution with mean vector the zero vector and covariance matrix the correlation matrix $$\left( {\begin{array}{*{20}c} 1.00 & 0.85 & 0.50 & 0.15 \\ 0.85 & 1.00 & 0.85 & 0.50 \\ 0.50 & 0.15 & 1.00 & 0.85 \\ 0.15 & 0.85 & 0.50 & 1.00 \end{array} } \right).$$

set.seed(12345)
# sample size
sample_size <- 500
# cluster size
cluster_size <- 4
# category-specific intercepts
beta_intercepts <- c(-1.5, -0.5, 0.5, 1.5)
# time-varying regression parameters associated with covariates
beta_coefficients <- matrix(c(1, 2, 3, 4), 4, 1)
# time-stationary covariate 
x <- rep(rnorm(sample_size), each = cluster_size)
# latent correlation matrix for the NORTA method
latent_correlation_matrix <- toeplitz(c(1, 0.85, 0.5, 0.15))
# simulation of ordinal responses
simulated_ordinal_dataset <- rmult.clm(clsize = cluster_size,
                                       intercepts = beta_intercepts,
                                       betas = beta_coefficients,
                                       xformula = ~ x,
                                       cor.matrix = latent_correlation_matrix,
                                       link = "probit")
# first eight rows of the simulated dataframe
head(simulated_ordinal_dataset$simdata, n = 8)
#>   y         x id time
#> 1 1 0.5855288  1    1
#> 2 2 0.5855288  1    2
#> 3 2 0.5855288  1    3
#> 4 1 0.5855288  1    4
#> 5 1 0.7094660  2    1
#> 6 1 0.7094660  2    2
#> 7 1 0.7094660  2    3
#> 8 1 0.7094660  2    4

Marginal continuation-ratio model

The function rmult.crm simulates clustered ordinal responses under the marginal continuation-ratio model where β_tj0 is the j-th category-specific intercept at measurement occasion t, β_t is the regression parameter vector associated with the covariates at measurement occasion t and F is a cdf. This is accomplished by utilizing the threshold Y_it = j, given Y_it ≥ j ⇔ U_itj^O2 ≤ β_tj0 where U_itj^O2 = −β_t^′x_it + e_itj^O2, and where {e_itj^O2 : i = 1, …, N , t = 1, …, T and j = 1, …, J − 1} satisfy the following three conditions:

e_itj^O2 ∼ F for all i, t and j.
e_i₁t₁j₁^O2 and e_i₂t₂j₂^O2 are independent random variables provided that i₁ ≠ i₂.
e_itj₁^O2 and e_itj₂^O2 are independent random variables provided that j₁ ≠ j₂.

For each subject i, the association structure among the clustered ordinal responses {Y_it : t = 1, …, T} depends on the joint distribution and correlation matrix of {e_itj^O2 : j = 1, …, J and t = 1, …, T}. If the random variables {e_itj^O2 : j = 1, …, J and t = 1, …, T} are independent then so are {Y_it : t = 1, …, T}.

Suppose simulation of clustered ordinal responses under the marginal continuation-ratio probit model with N = 500, T = 4, (β₁₀, β₂₀, β₃₀, β₄₀, β) = (−1.5, −0.5, 0.5, 1.5, 1) and $\mathbf{x}_{it}=x_{it}\overset{iid}{\sim} N(0,1)$ for all i and t is desired. For the dependence structure, assume that {e_i^O2 = (e_i11^O2, …, e_i44^O1)^′ : i = 1, …, N} are iid random vectors from a multivariate normal distribution with mean vector the zero vector and covariance matrix the 16 × 16 correlation matrix with elements$$ \text{corr}(e^{O2}_{it_1j_1},e^{O2}_{it_2j_2}) = \begin{cases} 1 & \text{for } j_1 = j_2 \text{ and } t_1 = t_2\\ 0.24 & \text{for } t_1 \neq t_2\\ 0 & \text{otherwise.}\\ \end{cases} $$

set.seed(1)
# sample size
sample_size <- 500
# cluster size
cluster_size <- 4
# category-specific intercepts
beta_intercepts <- c(-1.5, -0.5, 0.5, 1.5)
# regression parameters associated with covariates
beta_coefficients <- 1
# time-varying covariate
x <- rnorm(sample_size * cluster_size)
# number of ordinal response categories
categories_no <- 5
# correlation matrix for the NORTA method
latent_correlation_matrix <- diag(1, (categories_no - 1) * cluster_size) +
  kronecker(toeplitz(c(0, rep(0.24, categories_no - 2))),
            matrix(1, cluster_size, cluster_size))
# simulation of ordinal responses
simulated_ordinal_dataset <- rmult.crm(clsize = cluster_size,
                                       intercepts = beta_intercepts,
                                       betas = beta_coefficients,
                                       xformula = ~ x,
                                       cor.matrix = latent_correlation_matrix,
                                       link = "probit")
# first six clusters with ordinal responses
head(simulated_ordinal_dataset$Ysim)
#>     t=1 t=2 t=3 t=4
#> i=1   2   1   3   1
#> i=2   1   4   1   1
#> i=3   2   2   1   3
#> i=4   3   5   2   2
#> i=5   2   1   1   1
#> i=6   3   3   4   5

Marginal adjacent-category logit model

The function rmult.acl simulates clustered ordinal responses under the marginal adjacent-category logit model where β_tj0 is the j-th category-specific intercept at measurement occasion t, β_t is the regression parameter vector associated with the covariates at measurement occasion t.

Generation of clustered ordinal responses relies upon utilizing the connection between baseline-category logit models and adjacent-category logit models. In particular, the threshold Y_it = j ⇔ U_itj^O3 = max {U_it1^O3, …, U_itJ^O3} generates clustered nominal responses that satisfy the marginal adjacent-category logit model @ref(eq:MACL), where $$U^{O3}_{itj}=\sum_{k=j}^J\beta_{tk0}+(J-j)\boldsymbol{\beta}_{t}^{\prime} \mathbf {x}_{it}+e^{O3}_{itj},$$ and where the random variables {e_itj^O3 : i = 1, …, N, t = 1, …, T and j = 1, …, J} satisfy the following conditions:

e_itj^O3 follows the standard extreme value distribution for all i, t and j.
e_i₁t₁j₁^O3 and e_i₂t₂j₂^O3 are independent random variables provided that i₁ ≠ i₂.
e_itj₁^O3 and e_itj₂^O3 are independent random variables provided that j₁ ≠ j₂.

For each subject i, the association structure among the clustered ordinal responses {Y_it : t = 1, …, T} depends on the joint distribution and correlation matrix of {e_itj^O3 : t = 1, …, T and j = 1, …, J}. If the random variables {e_itj^O3 : t = 1, …, T and j = 1, …, J} are independent then so are {Y_it : t = 1, …, T}.

Suppose the aim is to simulate ordinal responses from the marginal adjacent-category logit model where N = 500, T = 3, (β₁₀, β₂₀, β₃₀) = (3, 2, 1), (β₁, β₂) = (1, 1) and x_it = (x_i1, x_it2)^′ for all i and t, with $x_{i1}\overset{iid}{\sim} N(0,1)$ and $x_{it2}\overset{iid}{\sim} N(0,1)$. For the dependence structure, suppose that the correlation matrix R in the NORTA method has elements $$ \mathbf{R}_{t_1j_1,t_2j_2}=\begin{cases} 1 & \text{if } t_1=t_2 \text{ and } j_1=j_2\\ 0.95 & \text{if } t_1 \neq t_2 \text{ and } j_1=j_2\\ 0 & \text{otherwise }\\ \end{cases} $$ for all i = 1, …, 500.

# intercepts
beta_intercepts <- c(3, 2, 1)
# parameter vector
beta_coefficients <- c(1, 1)
# sample size
sample_size <- 500 
# cluster size
cluster_size <- 3 
set.seed(321) 
# time-stationary covariate x_{i1}
x1 <- rep(rnorm(sample_size), each = cluster_size)
# time-varying covariate x_{it2}
x2 <- rnorm(sample_size * cluster_size) 
# create covariates dataframe 
xdata <- data.frame(x1, x2) 
# correlation matrix for the NORTA method
equicorrelation_matrix <- toeplitz(c(1, rep(0.95, cluster_size - 1)))
identity_matrix <- diag(4)
latent_correlation_matrix <- kronecker(equicorrelation_matrix, identity_matrix) 
# simulation of clustered ordinal responses
simulated_ordinal_dataset <- rmult.acl(clsize = cluster_size,
                                       intercepts = beta_intercepts,
                                       betas = beta_coefficients,
                                       xformula = ~ x1 + x2, xdata = xdata,
                                       cor.matrix = latent_correlation_matrix)
suppressPackageStartupMessages(library("multgee"))
# fitting a GEE model
ordinal_gee_model <- ordLORgee(y ~ x1 + x2,
                               data = simulated_ordinal_dataset$simdata,
                               id = id, repeated = time, LORstr = "time.exch",
                               link = "acl")
# checking regression coefficients
round(coef(ordinal_gee_model), 2)
#> beta10 beta20 beta30     x1     x2 
#>   2.95   1.97   1.67   1.14   1.00

Correlated binary responses

The function rbin simulates binary responses under the marginal model specification where β_t0 is the intercept at measurement occasion t, β_t is the regression parameter vector associated with the covariates at measurement occasion t and F is a cdf. The threshold Y_it = 1 ⇔ U_it^B ≤ β_t0 + 2β_t^′x_it, generates clustered binary responses that satisfy the marginal model @ref(eq:MB), where and where {e_it^B : i = 1, …, N and t = 1, …, T} are random variables such that:

e_it^B ∼ F for all i and t.
e_i₁t₁^B and e_i₂t₂^B are independent random variables provided that i₁ ≠ i₂.

For each subject i, the association structure among the clustered binary responses {Y_it : t = 1, …, T} depends on the pairwise bivariate distributions and correlation matrix of {e_it^B : t = 1, …, T}. If the random variables {e_it^B : t = 1, …, T} are independent then so are {Y_it : t = 1, …, T}.

Suppose the goal is to simulate clustered binary responses from the marginal probit model Pr (Y_it = 1|x_it) = Φ(0.2x_i) where N = 100, T = 4 and $\mathbf{x}_{it}=x_i\overset{iid}{\sim} N(0,1)$ for all i and t. For the association structure, assume that the random variables e_i^B = (e_i1^B, e_i2^B, e_i3^B, e_i4^B)^′ in @ref(eq:TMB) are iid random vectors from the tetra-variate normal distribution with mean vector the zero vector and covariance matrix the correlation matrix R given by This association configuration defines an exchangeable correlation matrix for the clustered binary responses, i.e. corr(Y_it₁, Y_it₂) = ρ_i for all i and t. The strength of the correlation (ρ_i) is decreasing as the absolute value of the time-stationary covariate x_i increases. For example, ρ_i = 0.7128 when x_i = 0 and ρ_i = 0.7 when x_i = 3 or x_i = −3. Therefore, a strong exchangeable correlation pattern for each subject that does not differ much across subjects is implied with this configuration.

set.seed(123)
# sample size
sample_size <- 100
# cluster size
cluster_size <- 4
# intercept
beta_intercepts <- 0
# regression parameter associated with the covariate
beta_coefficients <- 0.2
# correlation matrix for the NORTA method
latent_correlation_matrix <- toeplitz(c(1, 0.9, 0.9, 0.9))
# time-stationary covariate
x <- rep(rnorm(sample_size), each = cluster_size)
# simulation of clustered binary responses
simulated_binary_dataset <- rbin(clsize = cluster_size,
                                 intercepts = beta_intercepts,
                                 betas = beta_coefficients, xformula = ~ x,
                                 cor.matrix = latent_correlation_matrix,
                                 link = "probit")
library("gee")
# fitting a GEE model
binary_gee_model <- gee(y ~ x, family = binomial("probit"), id = id,
                        data = simulated_binary_dataset$simdata)
#> Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept)           x 
#>   0.1315121   0.2826005
# checking the estimated coefficients
summary(binary_gee_model)$coefficients
#>              Estimate Naive S.E.  Naive z Robust S.E. Robust z
#> (Intercept) 0.1315121 0.06399465 2.055048   0.1106696 1.188331
#> x           0.2826006 0.07191931 3.929412   0.1270285 2.224703

Consider now simulation of correlated binary responses from the marginal logit model where F is the cdf of the standard logistic distribution (mean = 0 and variance = π²/3), N = 100, T = 4 and $\mathbf{x}_{it}=x_i\overset{iid}{\sim} N(0,1)$ for all i and t. This is similar to the marginal model configuration in Example @ref(exm:BinaryProbit) except from the link function. For the dependence structure, assume that the correlation matrix of e_i^B = (e_i1^B, e_i2^B, e_i3^B, e_i4^B)^′ in @ref(eq:TMB) is equal to the correlation matrix R defined in @ref(eq:CMB). To simulate e_i^B without utilizing the NORTA method, one can employ the tetra-variate extreme value distribution (Gumbel 1958). In particular, this is accomplished by setting e_i^B = U_i − V_i for all i, where U_i and V_i are independent random vectors from the tetra-variate extreme value distribution with dependence parameter equal to 0.9, that is $$\Pr\left(U_{i1}\leq u_{i1},U_{i2}\leq u_{i2},U_{i3}\leq u_{i3},U_{i4}\leq u_{i4}\right)=\exp\left\{-\left[\sum_{t=1}^4 \exp{\left(-\frac{u_{it}}{0.9}\right)}\right]^{0.9}\right\}$$ and $$\Pr\left(V_{i1}\leq v_{i1},V_{i2}\leq v_{i2},V_{i3}\leq v_{i3},V_{i4}\leq v_{i4}\right)=\exp\left\{-\left[\sum_{t=1}^4 \exp{\left(-\frac{v_{it}}{0.9}\right)}\right]^{0.9}\right\}.$$ It follows that e_it^B ∼ F for all i and t and corr(e_i^B) = R for all i.

set.seed(8)
# simulation of epsilon variables
library("evd")
simulated_latent_variables1 <- rmvevd(sample_size, dep = sqrt(1 - 0.9),
                                      model = "log", d = cluster_size)
simulated_latent_variables2 <- rmvevd(sample_size, dep = sqrt(1 - 0.9),
                                      model = "log", d = cluster_size)
simulated_latent_variables <- simulated_latent_variables1 -
  simulated_latent_variables2
# simulation of clustered binary responses
simulated_binary_dataset <- rbin(clsize = cluster_size,
                                 intercepts = beta_intercepts,
                                 betas = beta_coefficients, xformula = ~ x,
                                 rlatent = simulated_latent_variables)
# fitting a GEE model
binary_gee_model <- gee(y ~ x, family = binomial("logit"), id = id,
                        data = simulated_binary_dataset$simdata)
#> Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept)           x 
#>  0.04146261  0.09562709
# checking the estimated coefficients
summary(binary_gee_model)$coefficients
#>               Estimate Naive S.E.   Naive z Robust S.E.  Robust z
#> (Intercept) 0.04146261  0.1008516 0.4111249   0.1790511 0.2315686
#> x           0.09562709  0.1107159 0.8637160   0.1949327 0.4905647

No marginal model specification

To achieve simulation of clustered binary, ordinal and nominal responses under no marginal model specification, perform the following intercepts:

Based on the marginal probabilities calculate the intercept of a marginal probit model for binary responses (see Example @ref(exm:no-covariate)) or the category-specific intercepts of a cumulative probit model (see third example in Touloumis 2016) or of a baseline-category logit model for multinomial responses (see Example @ref(exm:Example2NoCovariates)).
Create a pseudo-covariate say x of length equal to the number of cluster size (clsize) times the desired number of clusters of simulated responses (say R), that is x = clsize * R. This step is required in order to identify the desired number of clustered responses.
Set betas = 0 in the core functions rbin (see Example @ref(exm:no-covariate)) or rmult.clm, or set 0 all values of the beta argument that correspond to the category-specific parameters in the core function rmult.bcl (see Example @ref(exm:Example2NoCovariates)).
set xformula = ~ x.
Run the core function to obtain realizations of the simulated clustered responses.

Suppose the goal is to simulate 5000 clustered binary responses with Pr (Y_t = 1) = 0.8 for all t = 1, …, 4. For simplicity, assume that the clustered binary responses are independent.

set.seed(123)
# sample size
sample_size <- 5000
# cluster size
cluster_size <- 4
# intercept
beta_intercepts <- qnorm(0.8)
# pseudo-covariate
x <- rep(0, each = cluster_size * sample_size)
# regression parameter associated with the covariate
beta_coefficients <- 0
# correlation matrix for the NORTA method
latent_correlation_matrix <- diag(cluster_size)
# simulation of clustered binary responses
simulated_binary_dataset <- rbin(clsize = cluster_size,
                                 intercepts = beta_intercepts,
                                 betas = beta_coefficients, xformula = ~x,
                                 cor.matrix = latent_correlation_matrix,
                                 link = "probit")
# simulated marginal probabilities
colMeans(simulated_binary_dataset$Ysim)
#>    t=1    t=2    t=3    t=4 
#> 0.8024 0.7972 0.7948 0.8088

Suppose the aim is to simulate N = 5000 clustered nominal responses with Pr (Y_t = 1) = 0.1, Pr (Y_t = 2) = 0.2, Pr (Y_t = 3) = 0.3 and Pr (Y_t = 4) = 0.4, for all i and t = 1, …, 3. For the sake of simplicity, we assume that the clustered responses are independent.

# sample size
sample_size <- 5000
# cluster size
cluster_size <- 3 
# pseudo-covariate 
x <- rep(0, each = cluster_size * sample_size)
# parameter vector
betas <-  c(log(0.1/0.4), 0, log(0.2/0.4), 0, log(0.3/0.4), 0, 0, 0)
# number of nominal response categories
categories_no <- 4  
set.seed(1) 
# correlation matrix for the NORTA method
latent_correlation_matrix <- kronecker(diag(cluster_size), diag(categories_no)) 
# simulation of clustered nominal responses
simulated_nominal_dataset <- rmult.bcl(clsize = cluster_size,
                                       ncategories = categories_no,
                                       betas = betas, xformula = ~ x,
                                       cor.matrix = latent_correlation_matrix)
# simulated marginal probabilities
apply(simulated_nominal_dataset$Ysim, 2, table) / sample_size
#>      t=1    t=2    t=3
#> 1 0.1000 0.0996 0.1036
#> 2 0.2034 0.2000 0.2000
#> 3 0.2874 0.3130 0.2894
#> 4 0.4092 0.3874 0.4070

A note on the correlation matrix

In SimCorMultRes, the user provides the correlation matrix (denoted as R) for the multivariate normal distribution used in the intermediate step of the NORTA method, rather than the correlation matrix of the latent responses used in the corresponding threshold approach. This choice is motivated by the observation that when all the marginal distributions of the correlated latent responses follow a logistic distribution, the correlation matrix R and the correlation matrix of the latent responses are expected to be similar, as noted by Touloumis (2016). Therefore, in SimCorMultRes, this approximation is employed irrespective of the marginal distribution of the latent responses.

To evaluate the validity of this approximation for the threshold approaches employed in SimCorMultRes, a simulation study was conducted. For a fixed sample size N and a correlation parameter ρ, N independent bivariate random vectors {y_i : i = 1, …, N} from a bivariate normal distribution with mean vector the zero vector and covariance matrix the correlation matrix $$ \mathbf R = \begin{bmatrix} 1 & \rho\\ \rho & 1 \end{bmatrix} $$ were drawn. The sample correlation was used to estimate ρ. Next, the NORTA method was applied to obtain bivariate random vectors {z_i : i = 1, …, N} so that their marginal distribution is the logistic distribution. Their correlation parameter, say ρ_z, was estimated using the corresponding sample correlation. Then, the NORTA method was applied to obtain bivariate random vectors {w_i : i = 1, …, N} so that their marginal distribution is the Gumbel distribution. Their correlation parameter, say ρ_w, was estimated using their sample correlation.

For a fixed value of ρ, this procedure was replicated 10, 000, 000 times to reduce the simulation error and with N = 10, 000 to reduce the sampling error. The three correlation parameters ρ, ρ_z and ρ_w were estimated using their corresponding Monte Carlo counterparts, denoted by ρ̂, ρ̂_z and ρ̂_w, respectively. We let ρ ∈ {0, 0.01, 0.02, …, 0.99}.

The dataframe simulation contains the true correlation parameter ρ (rho) and the Monte Carlo estimates ρ̂ (normal), ρ̂_z (logistic) and ρ̂_w (gumbel) from the simulation study described above. As expected, ρ̂ ≈ ρ regardless of the strength of the correlation parameter. For the case of logistic marginal distributions, the average difference between ρ and ρ̂_z is 0.002, taking the maximum value of 0.0031 at ρ = 0.58. Therefore ρ appears to approximate ρ_z to 2 decimal points. For the case of Gumbel marginal distributions, the average difference between ρ and ρ̂_w is 0.0103, taking the maximum value of 0.0154 at ρ = 0.51. Although ρ appears to approximate well ρ_w, there is some accuracy loss compared to ρ_z. The plot below shows the differences between the true correlation coefficient of the bivariate normal distribution and the simulated correlations for ρ, ρ_z or ρ_w.

Difference between the correlation parameters of the bivariate normal distribution and of the latent variables for three different marginal distributions.

Overall, there is little accuracy loss by specifying the correlation matrix of the multivariate normal distribution in the intermediate step of the NORTA distribution instead of the correlation matrix of correlated latent responses regardless of whether their marginal distributions are either logistic or Gumbel distributions. Users can treat the correlation matrix passed on to the core functions of SimCorMultRes as the correlation matrix of the latent variables.

How to Cite

citation("SimCorMultRes")
To cite 'SimCorMultRes' in publications, please use:

  Touloumis A (2016). "Simulating Correlated Binary and Multinomial
  Responses under Marginal Model Specification: The SimCorMultRes
  Package." _The R Journal_, *8*(2), 79-91. R package version 1.9.1,
  <https://journal.r-project.org/archive/2016/RJ-2016-034/index.html>.

A BibTeX entry for LaTeX users is

  @Article{,
    title = {Simulating Correlated Binary and Multinomial Responses under
         Marginal Model Specification: The SimCorMultRes Package},
    author = {Anestis Touloumis},
    year = {2016},
    journal = {The R Journal},
    volume = {8},
    number = {2},
    note = {R package version 1.9.1},
    pages = {79-91},
    url = {https://journal.r-project.org/archive/2016/RJ-2016-034/index.html},
  }

References

Gumbel, E. J. 1958. Statistics of Extremes. Columbia University Press, New York.

Touloumis, A. 2016. “Simulating Correlated Binary and Multinomial Responses under Marginal Model Specification: The SimCorMultRes Package.” The R Journal 8 (2): 79–91. https://journal.r-project.org/archive/2016/RJ-2016-034/index.html.

Simulating Correlated Binary and Multinomial Responses with SimCorMultRes