Areas of Applications
This package was created to facilitate the task of carrying out
simulation studies and evaluating the performance of statistical methods
for estimating the regression parameters in a marginal model with
clustered binary and multinomial responses. Examples of such statistical
methods include maximum likelihood methods, copula approaches,
quasi-least squares approaches, generalized quasi-likelihood methods and
generalized estimating equations (GEE) approaches among others (see references in Touloumis 2016).
In addition, SimCorMultRes can generate correlated
binary and multinomial random variables conditional on a desired
dependence structure and known marginal probabilities even if these are
not determined by a regression model (see third
example in Touloumis 2016) or to
explore approximations of association measures for discrete variables
that arise as realizations of an underlying continuum (see second example in Touloumis 2016).
Simulation Methods
Let \(Y_{it}\) be the binary or
multinomial response for subject \(i\)
(\(i=1,\ldots,N\)) at measurement
occasion \(t\) (\(t=1,\ldots,T\)), and let \(\mathbf {x}_{it}\) be the associated
covariates vector. We assume that \(Y_{it} \in
\{0,1\}\) for binary responses and \(Y_{it} \in \{1,2,\ldots,J\geq 3\}\) for
multinomial responses.
Correlated nominal responses
The function rmult.bcl simulates nominal responses under
the marginal baseline-category logit model \[\begin{equation}
\log \left[\frac{\Pr(Y_{it}=j |\mathbf {x}_{it})}{\Pr(Y_{it}=J |\mathbf
{x}_{it})}\right]=(\beta_{tj0}-\beta_{tJ0})+(\boldsymbol
{\beta}_{tj}-\boldsymbol{\beta}_{tJ})^{\prime} \mathbf
{x}_{it}=\beta^{\ast}_{tj0}+\boldsymbol{\beta}^{\ast\prime}_{tj}\mathbf
{x}_{it},
(\#eq:MBCLM)
\end{equation}\] where \(\beta_{tj0}\) is the \(j\)-th category-specific intercept at
measurement occasion \(t\) and \(\boldsymbol{\beta}_{tj}\) is the \(j\)-th category-specific parameter vector
associated with the covariates at measurement occasion \(t\). The popular identifiability
constraints \(\beta_{tJ0}=0\) and \(\boldsymbol{\beta}_{tJ}=\mathbf {0}\) for
all \(t\), imply that \(\beta^{\ast}_{tj0}=\beta_{tj0}\) and \(\boldsymbol
{\beta}^{\ast}_{tj}=\boldsymbol{\beta}_{tj}\) for all \(t=1,\ldots,T\) and \(j=1,\ldots,J-1\). The threshold \[Y_{it}=j \Leftrightarrow U^{NO}_{itj}=\max
\{U^{NO}_{it1},\ldots,U^{NO}_{itJ}\}\] generates clustered
nominal responses that satisfy the marginal baseline-category logit
model @ref(eq:MBCLM), where \[U^{NO}_{itj}=\beta_{tj0}+\boldsymbol{\beta}_{tj}^{\prime}
\mathbf {x}_{it}+e^{NO}_{itj},\] and where the random variables
\(\{e^{NO}_{itj}:i=1,\ldots,N \text{, }
t=1,\ldots,T \text{ and } j=1,\ldots,J\}\) satisfy the following
conditions:
- \(e^{NO}_{itj}\) follows the
standard extreme value distribution for all \(i\), \(t\)
and \(j\) (mean \(=\gamma \approx 0.5772\), where \(\gamma\) is Euler’s constant, and variance
\(=\pi^2/6\)).
- \(e^{NO}_{i_1t_1j_1}\) and \(e^{NO}_{i_2t_2j_2}\) are independent random
variables provided that \(i_1 \neq
i_2\).
- \(e^{NO}_{itj_1}\) and \(e^{NO}_{itj_2}\) are independent random
variables provided that \(j_1\neq
j_2\).
For each subject \(i\), the
association structure among the clustered nominal responses \(\{Y_{it}:t=1,\ldots,T\}\) depends on the
joint distribution and correlation matrix of \(\{e^{NO}_{itj}:t=1,\ldots,T \text{ and }
j=1,\ldots,J\}\). If the random variables \(\{e^{NO}_{itj}:t=1,\ldots,T \text{ and }
j=1,\ldots,J\}\) are independent then so are \(\{Y_{it}:t=1,\ldots,T\}\).
Suppose the aim is to simulate nominal responses from the marginal
baseline-category logit model \[\begin{equation*}
\log \left[\frac{\Pr(Y_{it}=j |\mathbf {x}_{it})}{\Pr(Y_{it}=4 |\mathbf
{x}_{it})}\right]=\beta_{j0}+ \beta_{j1} {x}_{i1}+ \beta_{j2} {x}_{it2}
\end{equation*}\] where \(N=500\), \(T=3\), \((\beta_{10},\beta_{11},\beta_{12},\beta_{20},\beta_{21},\beta_{22},\beta_{30},\beta_{31},\beta_{32})=(1,
3, 2, 1.25, 3.25, 1.75, 0.75, 2.75, 2.25)\) and \(\mathbf
{x}_{it}=(x_{i1},x_{it2})^{\prime}\) for all \(i\) and \(t\), with \(x_{i1}\overset{iid}{\sim} N(0,1)\) and
\(x_{it2}\overset{iid}{\sim} N(0,1)\).
For the dependence structure, suppose that the correlation matrix \(\mathbf{R}\) in the NORTA method has
elements \[
\mathbf{R}_{t_1j_1,t_2j_2}=\begin{cases}
1 & \text{if } t_1=t_2 \text{ and } j_1=j_2\\
0.95 & \text{if } t_1 \neq t_2 \text{ and } j_1=j_2\\
0 & \text{otherwise }\\
\end{cases}
\] for all \(i=1,\ldots,500\).
# parameter vector
betas <- c(1, 3, 2, 1.25, 3.25, 1.75, 0.75, 2.75, 2.25, 0, 0, 0)
# sample size
sample_size <- 500
# number of nominal response categories
categories_no <- 4
# cluster size
cluster_size <- 3
set.seed(1)
# time-stationary covariate x_{i1}
x1 <- rep(rnorm(sample_size), each = cluster_size)
# time-varying covariate x_{it2}
x2 <- rnorm(sample_size * cluster_size)
# create covariates dataframe
xdata <- data.frame(x1, x2)
set.seed(321)
library("SimCorMultRes")
# latent correlation matrix for the NORTA method
equicorrelation_matrix <- toeplitz(c(1, rep(0.95,cluster_size - 1)))
identity_matrix <- diag(categories_no)
latent_correlation_matrix <- kronecker(equicorrelation_matrix, identity_matrix)
# simulation of clustered nominal responses
simulated_nominal_dataset <- rmult.bcl(clsize = cluster_size,
ncategories = categories_no,
betas = betas, xformula = ~ x1 + x2,
xdata = xdata,
cor.matrix = latent_correlation_matrix)
suppressPackageStartupMessages(library("multgee"))
# fitting a GEE model
nominal_gee_model <- nomLORgee(y ~ x1 + x2,
data = simulated_nominal_dataset$simdata,
id = id, repeated = time,
LORstr = "time.exch")
# checking regression coefficients
round(coef(nominal_gee_model), 2)
#> beta10 x1:1 x2:1 beta20 x1:2 x2:2 beta30 x1:3 x2:3
#> 1.07 3.18 1.99 1.35 3.40 1.70 0.89 3.06 2.22
Correlated ordinal responses
Simulation of clustered ordinal responses is feasible under either a
marginal cumulative link model or a marginal continuation-ratio
model.
Marginal cumulative link model
The function rmult.clm simulates ordinal responses under
the marginal cumulative link model \[\begin{equation}
\Pr(Y_{it}\le j |\mathbf {x}_{it})=F(\beta_{tj0} +\boldsymbol
{\beta}^{\prime}_t \mathbf {x}_{it})
(\#eq:MCLM)
\end{equation}\] where \(F\) is
a cumulative distribution function (cdf), \(\beta_{tj0}\) is the \(j\)-th category-specific intercept at
measurement occasion \(t\) and \(\boldsymbol \beta_t\) is the regression
parameter vector associated with the covariates at measurement occasion
\(t\). The category-specific intercepts
at each measurement occasion \(t\) are
assumed to be monotone increasing, that is \[-\infty=\beta_{t00} <\beta_{t10} <
\beta_{t20} < \cdots < \beta_{t(J-1)0}<
\beta_{tJ0}=\infty\] for all \(t\). Using the threshold \[Y_{it}=j \Leftrightarrow \beta_{t(j-1)0} <
U^{O1}_{it} \leq \beta_{tj0}\] clustered ordinal responses that
satisfy the marginal cumulative link model @ref(eq:MCLM) are generated,
where \[U^{O1}_{it}=-\boldsymbol{\beta}^{\prime}_t
\mathbf {x}_{it}+e^{O1}_{it},\] and where \(\{e^{O1}_{it}:i=1,\ldots,N \text{ and }
t=1,\ldots,T\}\) are random variables such that:
- \(e^{O1}_{it} \sim F\) for all
\(i\) and \(t\).
- \(e^{O1}_{i_1t_1}\) and \(e^{O1}_{i_2t_2}\) are independent random
variables provided that \(i_1 \neq
i_2\).
For each subject \(i\), the
association structure among the clustered ordinal responses \(\{Y_{it}:t=1,\ldots,T\}\) depends on the
pairwise bivariate distributions and correlation matrix of \(\{e^{O1}_{it}:t=1,\ldots,T\}\). If the
random variables \(\{e^{O1}_{it}:t=1,\ldots,T\}\) are
independent then so are \(\{Y_{it}:t=1,\ldots,T\}\).
Suppose the goal is to simulate correlated ordinal responses from the
marginal cumulative probit model \[\begin{equation*}
\Pr(Y_{it}\le j |\mathbf x_{it})=\Phi(\beta_{j0} + \beta_{t1} {x}_{i})
\end{equation*}\] where \(\Phi\)
denotes the cdf of the standard normal distribution (mean \(=0\) and variance \(=1\)), \(N=500\), \(T=4\), \((\beta_{10},\beta_{20},\beta_{30},\beta_{40})=(-1.5,-0.5,0.5,1.5)\),
\((\beta_{11},\beta_{21},\beta_{31},\beta_{41})=(1,2,3,4)\)
and \(\mathbf x_{it}=x_{i}\overset{iid}{\sim}
N(0,1)\) for all \(i\) and \(t\). For the dependence structure, assume
that \(\mathbf{e}_i^{O1}=(e^{O1}_{i1},e^{O1}_{i2},e^{O1}_{i3},e^{O1}_{i4})^{\prime}\)
are iid random vectors from a tetra-variate normal distribution with
mean vector the zero vector and covariance matrix the correlation matrix
\[\left( {\begin{array}{*{20}c}
1.00 & 0.85 & 0.50 & 0.15 \\
0.85 & 1.00 & 0.85 & 0.50 \\
0.50 & 0.15 & 1.00 & 0.85 \\
0.15 & 0.85 & 0.50 & 1.00
\end{array} } \right).\]
set.seed(12345)
# sample size
sample_size <- 500
# cluster size
cluster_size <- 4
# category-specific intercepts
beta_intercepts <- c(-1.5, -0.5, 0.5, 1.5)
# time-varying regression parameters associated with covariates
beta_coefficients <- matrix(c(1, 2, 3, 4), 4, 1)
# time-stationary covariate
x <- rep(rnorm(sample_size), each = cluster_size)
# latent correlation matrix for the NORTA method
latent_correlation_matrix <- toeplitz(c(1, 0.85, 0.5, 0.15))
# simulation of ordinal responses
simulated_ordinal_dataset <- rmult.clm(clsize = cluster_size,
intercepts = beta_intercepts,
betas = beta_coefficients,
xformula = ~ x,
cor.matrix = latent_correlation_matrix,
link = "probit")
# first eight rows of the simulated dataframe
head(simulated_ordinal_dataset$simdata, n = 8)
#> y x id time
#> 1 1 0.5855288 1 1
#> 2 2 0.5855288 1 2
#> 3 2 0.5855288 1 3
#> 4 1 0.5855288 1 4
#> 5 1 0.7094660 2 1
#> 6 1 0.7094660 2 2
#> 7 1 0.7094660 2 3
#> 8 1 0.7094660 2 4
Marginal continuation-ratio model
The function rmult.crm simulates clustered ordinal
responses under the marginal continuation-ratio model \[\begin{equation}
\Pr(Y_{it}=j |Y_{it} \ge j,\mathbf {x}_{it})=F(\beta_{tj0} +\boldsymbol
{\beta}^{'}_t \mathbf {x}_{it})
(\#eq:MCRM)
\end{equation}\] where \(\beta_{tj0}\) is the \(j\)-th category-specific intercept at
measurement occasion \(t\), \(\boldsymbol \beta_t\) is the regression
parameter vector associated with the covariates at measurement occasion
\(t\) and \(F\) is a cdf. This is accomplished by
utilizing the threshold \[Y_{it}=j, \text{
given } Y_{it} \geq j \Leftrightarrow U^{O2}_{itj} \leq
\beta_{tj0}\] where \[U^{O2}_{itj}=-\boldsymbol {\beta}^{\prime}_t
\mathbf {x}_{it}+e^{O2}_{itj},\] and where \(\{e^{O2}_{itj}:i=1,\ldots,N \text{ , }
t=1,\ldots,T \text{ and } j=1,\ldots,J-1\}\) satisfy the
following three conditions:
- \(e^{O2}_{itj} \sim F\) for all
\(i\), \(t\) and \(j\).
- \(e^{O2}_{i_1t_1j_1}\) and \(e^{O2}_{i_2t_2j_2}\) are independent random
variables provided that \(i_1 \neq
i_2\).
- \(e^{O2}_{itj_1}\) and \(e^{O2}_{itj_2}\) are independent random
variables provided that \(j_1\neq
j_2\).
For each subject \(i\), the
association structure among the clustered ordinal responses \(\{Y_{it}:t=1,\ldots,T\}\) depends on the
joint distribution and correlation matrix of \(\{e^{O2}_{itj}:j=1,\ldots,J \text{ and }
t=1,\ldots,T\}\). If the random variables \(\{e^{O2}_{itj}:j=1,\ldots,J \text{ and }
t=1,\ldots,T\}\) are independent then so are \(\{Y_{it}:t=1,\ldots,T\}\).
Suppose simulation of clustered ordinal responses under the marginal
continuation-ratio probit model \[\begin{equation*}
\Pr(Y_{it}=j |Y_{it} \ge j,\mathbf{x}_{it})=\Phi(\beta_{j0} + \beta
{x}_{it})
\end{equation*}\] with \(N=500\), \(T=4\), \((\beta_{10},\beta_{20},\beta_{30},\beta_{40},\beta)=(-1.5,-0.5,0.5,1.5,1)\)
and \(\mathbf{x}_{it}=x_{it}\overset{iid}{\sim}
N(0,1)\) for all \(i\) and \(t\) is desired. For the dependence
structure, assume that \(\left\{\mathbf{e}_i^{O2}=\left(e^{O2}_{i11},\ldots,e^{O1}_{i44}\right)^{\prime}:i=1,\ldots,N\right\}\)
are iid random vectors from a multivariate normal distribution with mean
vector the zero vector and covariance matrix the \(16 \times 16\) correlation matrix with
elements\[
\text{corr}(e^{O2}_{it_1j_1},e^{O2}_{it_2j_2}) = \begin{cases}
1 & \text{for } j_1 = j_2 \text{ and } t_1 = t_2\\
0.24 & \text{for } t_1 \neq t_2\\
0 & \text{otherwise.}\\
\end{cases}
\]
set.seed(1)
# sample size
sample_size <- 500
# cluster size
cluster_size <- 4
# category-specific intercepts
beta_intercepts <- c(-1.5, -0.5, 0.5, 1.5)
# regression parameters associated with covariates
beta_coefficients <- 1
# time-varying covariate
x <- rnorm(sample_size * cluster_size)
# number of ordinal response categories
categories_no <- 5
# correlation matrix for the NORTA method
latent_correlation_matrix <- diag(1, (categories_no - 1) * cluster_size) +
kronecker(toeplitz(c(0, rep(0.24, categories_no - 2))),
matrix(1, cluster_size, cluster_size))
# simulation of ordinal responses
simulated_ordinal_dataset <- rmult.crm(clsize = cluster_size,
intercepts = beta_intercepts,
betas = beta_coefficients,
xformula = ~ x,
cor.matrix = latent_correlation_matrix,
link = "probit")
# first six clusters with ordinal responses
head(simulated_ordinal_dataset$Ysim)
#> t=1 t=2 t=3 t=4
#> i=1 2 1 3 1
#> i=2 1 4 1 1
#> i=3 2 2 1 3
#> i=4 3 5 2 2
#> i=5 2 1 1 1
#> i=6 3 3 4 5
Marginal adjacent-category logit model
The function rmult.acl simulates clustered ordinal
responses under the marginal adjacent-category logit model \[\begin{equation}
\log\left[\frac{\Pr(Y_{it}=j |\mathbf {x}_{it})}{\Pr(Y_{it}=j+1 |\mathbf
{x}_{it})}\right]=\beta_{tj0} +\boldsymbol {\beta}^{'}_t \mathbf
{x}_{it}
(\#eq:MACL)
\end{equation}\] where \(\beta_{tj0}\) is the \(j\)-th category-specific intercept at
measurement occasion \(t\), \(\boldsymbol \beta_t\) is the regression
parameter vector associated with the covariates at measurement occasion
\(t\).
Generation of clustered ordinal responses relies upon utilizing the
connection between baseline-category logit models and adjacent-category
logit models. In particular, the threshold \[Y_{it}=j \Leftrightarrow U^{O3}_{itj}=\max
\{U^{O3}_{it1},\ldots,U^{O3}_{itJ}\}\] generates clustered
nominal responses that satisfy the marginal adjacent-category logit
model @ref(eq:MACL), where \[U^{O3}_{itj}=\sum_{k=j}^J\beta_{tk0}+(J-j)\boldsymbol{\beta}_{t}^{\prime}
\mathbf {x}_{it}+e^{O3}_{itj},\] and where the random variables
\(\{e^{O3}_{itj}:i=1,\ldots,N \text{, }
t=1,\ldots,T \text{ and } j=1,\ldots,J\}\) satisfy the following
conditions:
- \(e^{O3}_{itj}\) follows the
standard extreme value distribution for all \(i\), \(t\)
and \(j\).
- \(e^{O3}_{i_1t_1j_1}\) and \(e^{O3}_{i_2t_2j_2}\) are independent random
variables provided that \(i_1 \neq
i_2\).
- \(e^{O3}_{itj_1}\) and \(e^{O3}_{itj_2}\) are independent random
variables provided that \(j_1\neq
j_2\).
For each subject \(i\), the
association structure among the clustered ordinal responses \(\{Y_{it}:t=1,\ldots,T\}\) depends on the
joint distribution and correlation matrix of \(\{e^{O3}_{itj}:t=1,\ldots,T \text{ and }
j=1,\ldots,J\}\). If the random variables \(\{e^{O3}_{itj}:t=1,\ldots,T \text{ and }
j=1,\ldots,J\}\) are independent then so are \(\{Y_{it}:t=1,\ldots,T\}\).
Suppose the aim is to simulate ordinal responses from the marginal
adjacent-category logit model \[\begin{equation*}
\log \left[\frac{\Pr(Y_{it}=j |\mathbf {x}_{it})}{\Pr(Y_{it}=j+1
|\mathbf {x}_{it})}\right]=\beta_{j0}+ \beta_{1} {x}_{i1}+ \beta_{2}
{x}_{it2} \end{equation*}\] where \(N=500\), \(T=3\), \((\beta_{10},\beta_{20},\beta_{30})=(3, 2,
1)\), \((\beta_{1},\beta_{2})=(1,
1)\) and \(\mathbf
{x}_{it}=(x_{i1},x_{it2})^{\prime}\) for all \(i\) and \(t\), with \(x_{i1}\overset{iid}{\sim} N(0,1)\) and
\(x_{it2}\overset{iid}{\sim} N(0,1)\).
For the dependence structure, suppose that the correlation matrix \(\mathbf{R}\) in the NORTA method has
elements \[
\mathbf{R}_{t_1j_1,t_2j_2}=\begin{cases}
1 & \text{if } t_1=t_2 \text{ and } j_1=j_2\\
0.95 & \text{if } t_1 \neq t_2 \text{ and } j_1=j_2\\
0 & \text{otherwise }\\
\end{cases}
\] for all \(i=1,\ldots,500\).
# intercepts
beta_intercepts <- c(3, 2, 1)
# parameter vector
beta_coefficients <- c(1, 1)
# sample size
sample_size <- 500
# cluster size
cluster_size <- 3
set.seed(321)
# time-stationary covariate x_{i1}
x1 <- rep(rnorm(sample_size), each = cluster_size)
# time-varying covariate x_{it2}
x2 <- rnorm(sample_size * cluster_size)
# create covariates dataframe
xdata <- data.frame(x1, x2)
# correlation matrix for the NORTA method
equicorrelation_matrix <- toeplitz(c(1, rep(0.95, cluster_size - 1)))
identity_matrix <- diag(4)
latent_correlation_matrix <- kronecker(equicorrelation_matrix, identity_matrix)
# simulation of clustered ordinal responses
simulated_ordinal_dataset <- rmult.acl(clsize = cluster_size,
intercepts = beta_intercepts,
betas = beta_coefficients,
xformula = ~ x1 + x2, xdata = xdata,
cor.matrix = latent_correlation_matrix)
suppressPackageStartupMessages(library("multgee"))
# fitting a GEE model
ordinal_gee_model <- ordLORgee(y ~ x1 + x2,
data = simulated_ordinal_dataset$simdata,
id = id, repeated = time, LORstr = "time.exch",
link = "acl")
# checking regression coefficients
round(coef(ordinal_gee_model), 2)
#> beta10 beta20 beta30 x1 x2
#> 2.95 1.97 1.67 1.14 1.00
Correlated binary responses
The function rbin simulates binary responses under the
marginal model specification \[\begin{equation}
\Pr(Y_{it}=1 |\mathbf {x}_{it})=F(\beta_{t0} +\boldsymbol
{\beta}^{\prime}_{t} \mathbf {x}_{it})
(\#eq:MB)
\end{equation}\] where \(\beta_{t0}\) is the intercept at
measurement occasion \(t\), \(\boldsymbol \beta_t\) is the regression
parameter vector associated with the covariates at measurement occasion
\(t\) and \(F\) is a cdf. The threshold \[Y_{it}=1 \Leftrightarrow U^{B}_{it} \leq
\beta_{t0} + 2 \boldsymbol {\beta}^{\prime}_t \mathbf {x}_{it},\]
generates clustered binary responses that satisfy the marginal model
@ref(eq:MB), where \[\begin{equation}
U^{B}_{it}=\boldsymbol {\beta}^{\prime}_t \mathbf {x}_{it}+e^{B}_{it},
(\#eq:TMB)
\end{equation}\] and where \(\{e^{B}_{it}:i=1,\ldots,N \text{ and }
t=1,\ldots,T\}\) are random variables such that:
- \(e^{B}_{it} \sim F\) for all \(i\) and \(t\).
- \(e^{B}_{i_1t_1}\) and \(e^{B}_{i_2t_2}\) are independent random
variables provided that \(i_1 \neq
i_2\).
For each subject \(i\), the
association structure among the clustered binary responses \(\{Y_{it}:t=1,\ldots,T\}\) depends on the
pairwise bivariate distributions and correlation matrix of \(\{e^{B}_{it}:t=1,\ldots,T\}\). If the
random variables \(\{e^{B}_{it}:t=1,\ldots,T\}\) are
independent then so are \(\{Y_{it}:t=1,\ldots,T\}\).
Suppose the goal is to simulate clustered binary responses from the
marginal probit model \[\Pr(Y_{it}=1
|\mathbf{x}_{it})=\Phi(0.2x_i)\] where \(N=100\), \(T=4\) and \(\mathbf{x}_{it}=x_i\overset{iid}{\sim}
N(0,1)\) for all \(i\) and \(t\). For the association structure, assume
that the random variables \(\mathbf{e}_i^{B}=(e^{B}_{i1},e^{B}_{i2},e^{B}_{i3},e^{B}_{i4})^{\prime}\)
in @ref(eq:TMB) are iid random vectors from the tetra-variate normal
distribution with mean vector the zero vector and covariance matrix the
correlation matrix \(\mathbf{R}\) given
by \[\begin{equation}
\mathbf{R}=\left( {\begin{array}{*{20}c}
1.00 & 0.90 & 0.90 & 0.90 \\
0.90 & 1.00 & 0.90 & 0.90 \\
0.90 & 0.90 & 1.00 & 0.90 \\
0.90 & 0.90 & 0.90 & 1.00
\end{array} } \right).
(\#eq:CMB)
\end{equation}\] This association configuration defines an
exchangeable correlation matrix for the clustered binary responses,
i.e. \(\text{corr}(Y_{it_1},Y_{it_2})=\rho_i\) for
all \(i\) and \(t\). The strength of the correlation (\(\rho_i\)) is decreasing as the absolute
value of the time-stationary covariate \(x_i\) increases. For example, \(\rho_i=0.7128\) when \(x_{i}=0\) and \(\rho_i=0.7\) when \(x_i=3\) or \(x_i=-3\). Therefore, a strong exchangeable
correlation pattern for each subject that does not differ much across
subjects is implied with this configuration.
set.seed(123)
# sample size
sample_size <- 100
# cluster size
cluster_size <- 4
# intercept
beta_intercepts <- 0
# regression parameter associated with the covariate
beta_coefficients <- 0.2
# correlation matrix for the NORTA method
latent_correlation_matrix <- toeplitz(c(1, 0.9, 0.9, 0.9))
# time-stationary covariate
x <- rep(rnorm(sample_size), each = cluster_size)
# simulation of clustered binary responses
simulated_binary_dataset <- rbin(clsize = cluster_size,
intercepts = beta_intercepts,
betas = beta_coefficients, xformula = ~ x,
cor.matrix = latent_correlation_matrix,
link = "probit")
library("gee")
# fitting a GEE model
binary_gee_model <- gee(y ~ x, family = binomial("probit"), id = id,
data = simulated_binary_dataset$simdata)
#> Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept) x
#> 0.1315121 0.2826005
# checking the estimated coefficients
summary(binary_gee_model)$coefficients
#> Estimate Naive S.E. Naive z Robust S.E. Robust z
#> (Intercept) 0.1315121 0.06399465 2.055048 0.1106696 1.188331
#> x 0.2826006 0.07191931 3.929412 0.1270285 2.224703
Consider now simulation of correlated binary responses from the
marginal logit model \[\begin{equation*}
\Pr(Y_{it}=1 |\mathbf{x}_{it})=F(0.2x_i)
\end{equation*}\] where \(F\) is
the cdf of the standard logistic distribution (mean \(=0\) and variance \(=\pi^2/3\)), \(N=100\), \(T=4\) and \(\mathbf{x}_{it}=x_i\overset{iid}{\sim}
N(0,1)\) for all \(i\) and \(t\). This is similar to the marginal model
configuration in Example @ref(exm:BinaryProbit) except from the link
function. For the dependence structure, assume that the correlation
matrix of \(\mathbf{e}_i^{B}=(e^{B}_{i1},e^{B}_{i2},e^{B}_{i3},e^{B}_{i4})^{\prime}\)
in @ref(eq:TMB) is equal to the correlation matrix \(\mathbf{R}\) defined in @ref(eq:CMB). To
simulate \(\mathbf{e}_i^{B}\) without
utilizing the NORTA method, one can employ the tetra-variate extreme
value distribution (Gumbel 1958). In particular, this is
accomplished by setting \(\mathbf{e}_i^{B}=\mathbf{U}_i-\mathbf{V}_i\)
for all \(i\), where \(\mathbf{U}_i\) and \(\mathbf{V}_i\) are independent random
vectors from the tetra-variate extreme value distribution with
dependence parameter equal to \(0.9\),
that is \[\Pr\left(U_{i1}\leq
u_{i1},U_{i2}\leq u_{i2},U_{i3}\leq u_{i3},U_{i4}\leq
u_{i4}\right)=\exp\left\{-\left[\sum_{t=1}^4
\exp{\left(-\frac{u_{it}}{0.9}\right)}\right]^{0.9}\right\}\] and
\[\Pr\left(V_{i1}\leq v_{i1},V_{i2}\leq
v_{i2},V_{i3}\leq v_{i3},V_{i4}\leq
v_{i4}\right)=\exp\left\{-\left[\sum_{t=1}^4
\exp{\left(-\frac{v_{it}}{0.9}\right)}\right]^{0.9}\right\}.\] It
follows that \(e_{it}^{B}\sim F\) for
all \(i\) and \(t\) and \(\textrm{corr}(\mathbf{e}_i^{B})=\mathbf{R}\)
for all \(i\).
set.seed(8)
# simulation of epsilon variables
library("evd")
simulated_latent_variables1 <- rmvevd(sample_size, dep = sqrt(1 - 0.9),
model = "log", d = cluster_size)
simulated_latent_variables2 <- rmvevd(sample_size, dep = sqrt(1 - 0.9),
model = "log", d = cluster_size)
simulated_latent_variables <- simulated_latent_variables1 -
simulated_latent_variables2
# simulation of clustered binary responses
simulated_binary_dataset <- rbin(clsize = cluster_size,
intercepts = beta_intercepts,
betas = beta_coefficients, xformula = ~ x,
rlatent = simulated_latent_variables)
# fitting a GEE model
binary_gee_model <- gee(y ~ x, family = binomial("logit"), id = id,
data = simulated_binary_dataset$simdata)
#> Beginning Cgee S-function, @(#) geeformula.q 4.13 98/01/27
#> running glm to get initial regression estimate
#> (Intercept) x
#> 0.04146261 0.09562709
# checking the estimated coefficients
summary(binary_gee_model)$coefficients
#> Estimate Naive S.E. Naive z Robust S.E. Robust z
#> (Intercept) 0.04146261 0.1008516 0.4111249 0.1790511 0.2315686
#> x 0.09562709 0.1107159 0.8637160 0.1949327 0.4905647
No marginal model specification
To achieve simulation of clustered binary, ordinal and nominal
responses under no marginal model specification, perform the following
intercepts:
Based on the marginal probabilities calculate the intercept of a
marginal probit model for binary responses (see Example
@ref(exm:no-covariate)) or the category-specific intercepts of a
cumulative probit model (see third example in Touloumis 2016) or of a
baseline-category logit model for multinomial responses (see Example
@ref(exm:Example2NoCovariates)).
Create a pseudo-covariate say x of length equal to
the number of cluster size (clsize) times the desired
number of clusters of simulated responses (say R), that is
x = clsize * R. This step is required in order to identify
the desired number of clustered responses.
Set betas = 0 in the core functions
rbin (see Example @ref(exm:no-covariate)) or
rmult.clm, or set 0 all values of the beta
argument that correspond to the category-specific parameters in the core
function rmult.bcl (see Example
@ref(exm:Example2NoCovariates)).
set xformula = ~ x.
Run the core function to obtain realizations of the simulated
clustered responses.
Suppose the goal is to simulate \(5000\) clustered binary responses with
\(\Pr(Y_{t}=1)=0.8\) for all \(t=1,\ldots,4\). For simplicity, assume that
the clustered binary responses are independent.
set.seed(123)
# sample size
sample_size <- 5000
# cluster size
cluster_size <- 4
# intercept
beta_intercepts <- qnorm(0.8)
# pseudo-covariate
x <- rep(0, each = cluster_size * sample_size)
# regression parameter associated with the covariate
beta_coefficients <- 0
# correlation matrix for the NORTA method
latent_correlation_matrix <- diag(cluster_size)
# simulation of clustered binary responses
simulated_binary_dataset <- rbin(clsize = cluster_size,
intercepts = beta_intercepts,
betas = beta_coefficients, xformula = ~x,
cor.matrix = latent_correlation_matrix,
link = "probit")
# simulated marginal probabilities
colMeans(simulated_binary_dataset$Ysim)
#> t=1 t=2 t=3 t=4
#> 0.8024 0.7972 0.7948 0.8088
Suppose the aim is to simulate \(N=5000\) clustered nominal responses with
\(\Pr(Y_{t}=1)=0.1\), \(\Pr(Y_{t}=2)=0.2\), \(\Pr(Y_{t}=3)=0.3\) and \(\Pr(Y_{t}=4)=0.4\), for all \(i\) and \(t=1,\ldots,3\). For the sake of simplicity,
we assume that the clustered responses are independent.
# sample size
sample_size <- 5000
# cluster size
cluster_size <- 3
# pseudo-covariate
x <- rep(0, each = cluster_size * sample_size)
# parameter vector
betas <- c(log(0.1/0.4), 0, log(0.2/0.4), 0, log(0.3/0.4), 0, 0, 0)
# number of nominal response categories
categories_no <- 4
set.seed(1)
# correlation matrix for the NORTA method
latent_correlation_matrix <- kronecker(diag(cluster_size), diag(categories_no))
# simulation of clustered nominal responses
simulated_nominal_dataset <- rmult.bcl(clsize = cluster_size,
ncategories = categories_no,
betas = betas, xformula = ~ x,
cor.matrix = latent_correlation_matrix)
# simulated marginal probabilities
apply(simulated_nominal_dataset$Ysim, 2, table) / sample_size
#> t=1 t=2 t=3
#> 1 0.1000 0.0996 0.1036
#> 2 0.2034 0.2000 0.2000
#> 3 0.2874 0.3130 0.2894
#> 4 0.4092 0.3874 0.4070
A note on the correlation matrix
In SimCorMultRes, the user provides the correlation
matrix (denoted as \(\mathbf{R}\)) for
the multivariate normal distribution used in the intermediate step of
the NORTA method, rather than the correlation matrix of the latent
responses used in the corresponding threshold approach. This choice is
motivated by the observation that when all the marginal distributions of
the correlated latent responses follow a logistic distribution, the
correlation matrix \(\mathbf{R}\) and
the correlation matrix of the latent responses are expected to be
similar, as noted by Touloumis (2016). Therefore, in
SimCorMultRes, this approximation is employed irrespective
of the marginal distribution of the latent responses.
To evaluate the validity of this approximation for the threshold
approaches employed in SimCorMultRes, a simulation study
was conducted. For a fixed sample size \(N\) and a correlation parameter \(\rho\), \(N\) independent bivariate random vectors
\(\{\mathbf y_{i}: i = 1, \ldots, N
\}\) from a bivariate normal distribution with mean vector the
zero vector and covariance matrix the correlation matrix \[
\mathbf R = \begin{bmatrix}
1 & \rho\\
\rho & 1
\end{bmatrix}
\] were drawn. The sample correlation was used to estimate \(\rho\). Next, the NORTA method was applied
to obtain bivariate random vectors \(\{\mathbf
z_{i}: i = 1, \ldots, N \}\) so that their marginal distribution
is the logistic distribution. Their correlation parameter, say \(\rho_{z}\), was estimated using the
corresponding sample correlation. Then, the NORTA method was applied to
obtain bivariate random vectors \(\{\mathbf
w_{i}: i = 1, \ldots, N \}\) so that their marginal distribution
is the Gumbel distribution. Their correlation parameter, say \(\rho_{w}\), was estimated using their
sample correlation.
For a fixed value of \(\rho\), this
procedure was replicated \(10,000,000\)
times to reduce the simulation error and with \(N=10,000\) to reduce the sampling error.
The three correlation parameters \(\rho\), \(\rho_z\) and \(\rho_w\) were estimated using their
corresponding Monte Carlo counterparts, denoted by \(\widehat{\rho}\), \(\widehat{\rho}_z\) and \(\widehat{\rho}_w\), respectively. We let
\(\rho \in \{ 0, 0.01,0.02,\ldots,
0.99\}\).
The dataframe simulation contains the true correlation
parameter \(\rho\) (rho)
and the Monte Carlo estimates \(\widehat{\rho}\) (normal),
\(\widehat{\rho}_z\)
(logistic) and \(\widehat{\rho}_w\) (gumbel)
from the simulation study described above. As expected, \(\widehat{\rho} \approx \rho\) regardless of
the strength of the correlation parameter. For the case of logistic
marginal distributions, the average difference between \(\rho\) and \(\widehat{\rho}_z\) is 0.002, taking the
maximum value of 0.0031 at \(\rho =
0.58\). Therefore \(\rho\)
appears to approximate \(\rho_{z}\) to
2 decimal points. For the case of Gumbel marginal distributions, the
average difference between \(\rho\) and
\(\widehat{\rho}_w\) is 0.0103, taking
the maximum value of 0.0154 at \(\rho =
0.51\). Although \(\rho\)
appears to approximate well \(\rho_{w}\), there is some accuracy loss
compared to \(\rho_{z}\). The plot
below shows the differences between the true correlation coefficient of
the bivariate normal distribution and the simulated correlations for
\(\rho\), \(\rho_z\) or \(\rho_w\).
Overall, there is little accuracy loss by specifying the correlation
matrix of the multivariate normal distribution in the intermediate step
of the NORTA distribution instead of the correlation matrix of
correlated latent responses regardless of whether their marginal
distributions are either logistic or Gumbel distributions. Users can
treat the correlation matrix passed on to the core functions of
SimCorMultRes as the correlation matrix of the latent
variables.