Monday, 21st of January (2019)

Around the Genetic
of
Type 2 Diabetes

A bit of history …

A lot of SNPs discovered …

Flannick & Florez (2016)

Associated with T2D and traits …

Marullo, El-Sayed Moustafa, & Prokopenko (2014)

But, weak correlation of the effects

Scott et al. (2012)

But, weak correlation of the effects

Yaghootkar & Frayling (2013)

Is There
a
Genetic Joint Effect?

"In all cases, the glucose-raising allele was associated with increased risk of T2D, yet fasting glucose effect sizes and T2D ORs were weakly correlated"

Scott et al. (2012)

Why use a Joint Model?

  • To identify biomarker relevant to a disease

  • To identify the effect of a treatment on a disease
    (independently of the association between the disease and the biomarker)

For example:

  • Biomarker: CD4 counts
  • Event: death

We can start with a Cox Model …

(Extended) Cox model: \[\begin{align} \lambda_i(t)=\lambda_0(t) \exp(\beta Y_i(t) + \alpha Z_i + \eta W_i) \end{align}\]

Where:

  • \(\lambda_i(t)\) is the hazard function at time \(t\) for individual \(i\);
  • \(\lambda_0(t)\) is the unspecified baseline hazard function;
  • \(\alpha\) measures the effect of \(Z_i\) on the hazard function;
  • \(\beta\) measures the association between the trajectory function \(Y_i(t)\) and the hazard function.

We can start with a Cox Model …

coxph(Surv(Time, death) ~ drug, data = aids.id)

We can start with a Cox Model …

coxph(Surv(Time, death) ~ drug, data = aids.id)
#> Call:
#> coxph(formula = Surv(Time, death) ~ drug, data = aids.id, x = TRUE)
#> 
#>   n= 467, number of events= 188 
#> 
#>           coef exp(coef) se(coef)     z Pr(>|z|)
#> drugddI 0.2102    1.2339   0.1462 1.437    0.151
#> 
#>         exp(coef) exp(-coef) lower .95 upper .95
#> drugddI     1.234     0.8104    0.9264     1.643
#> 
#> Concordance= 0.531  (se = 0.019 )
#> Rsquare= 0.004   (max possible= 0.99 )
#> Likelihood ratio test= 2.07  on 1 df,   p=0.2
#> Wald test            = 2.07  on 1 df,   p=0.2
#> Score (logrank) test = 2.07  on 1 df,   p=0.1

It works, but biomarkers …

  1. are measured at determined time points (\(t_{ij}\))
  2. can have missing values over time
    • => Imputation?
    • \(\Rightarrow\) Bias introduction
  3. are measured with some degree of error
    • \(\Rightarrow\) Noise in the biomarker trajectory (\(Y_i(t_{ij}) \neq X_i(t_{ij})\))
  4. can be endogenous
    • \(\Rightarrow\) Trajectory can change when the event occurs
    • \(\Rightarrow\) Bias introduction

Hopefully, the Mixed Model is there!

(Generalised) linear mixed effect model: \[Y_{i}(t_{ij})=X_{i}(t_{ij})+\epsilon_{i}(t_{ij})\]

where:

  • \(Y_{i}(t_{ij})\) is the observed value
  • \(X_{i}(t_{ij})\) is the true (unobserved) value of the longitudinal measurement at time \(t_{ij}\) for individual \(i\).
  • \(\epsilon_{i}(t_{ij})\) is a random error term, usually:
    \[\epsilon_{i}(t_{ij})\sim \mathcal{N}(0,\sigma^2)\]

Hopefully, the Mixed Model is there!

lme(sqrt(CD4) ~ obstime*drug - drug, random = ~ 1|patient, data = aids)

Hopefully, the Mixed Model is there!

lme(sqrt(CD4) ~ obstime*drug - drug, random = ~ 1|patient, data = aids)
#> Linear mixed-effects model fit by REML
#>  Data: aids 
#>        AIC      BIC    logLik
#>   2730.729 2756.957 -1360.364
#> 
#> Random effects:
#>  Formula: ~1 | patient
#>         (Intercept)  Residual
#> StdDev:   0.8719493 0.4106749
#> 
#> Fixed effects: sqrt(CD4) ~ obstime * drug - drug 
#>                      Value  Std.Error  DF  t-value p-value
#> (Intercept)      2.5087153 0.04306984 936 58.24761  0.0000
#> obstime         -0.0338783 0.00352643 936 -9.60697  0.0000
#> obstime:drugddI  0.0055813 0.00498557 936  1.11949  0.2632
#>  Correlation: 
#>                 (Intr) obstim
#> obstime         -0.153       
#> obstime:drugddI  0.003 -0.691
#> 
#> Standardized Within-Group Residuals:
#>         Min          Q1         Med          Q3         Max 
#> -5.09514877 -0.46243667  0.01324625  0.48993786  5.17139372 
#> 
#> Number of Observations: 1405
#> Number of Groups: 467

Let's use a Joint Model

fitCOX <- coxph(Surv(Time, death) ~ drug, data = aids.id, x = TRUE)
fitLME <- lme(sqrt(CD4) ~ obstime*drug - drug, random = ~ 1 | patient, data = aids)
fitJOINT <- jointModel(fitLME, fitCOX, timeVar = "obstime", method = "piecewise-PH-aGH")
#> Variance Components:
#>                StdDev
#> (Intercept) 0.8793445
#> Residual    0.4093577
#>
#> Coefficients:
#> Longitudinal Process
#>                   Value Std.Err  z-value p-value
#> (Intercept)      2.5100  0.0434  57.8778 <0.0001
#> obstime         -0.0362  0.0035 -10.3474 <0.0001
#> obstime:drugddI  0.0045  0.0049   0.9173  0.3590
#>
#> Event Process
#>             Value Std.Err z-value p-value                                  
#> drugddI    0.3458  0.1523  2.2715  0.0231                                  
#> Assoct    -1.0866  0.1184 -9.1786 <0.0001  
#> log(xi.1) -1.6560  0.2529 -6.5475                                          
#> log(xi.*)  ......  .....   ......

What is a Joint Model? With a picture!

What is a Joint Model? With a picture!

What is a Joint Model? With a picture!

What is a Joint Model? With a picture!

What is a Joint Model? With a picture!

What is a Joint Model? With a picture!

What is a Joint Model? With a picture!

What is a Joint Model? With a picture!

What can we do with a Joint Model?

Test simultaneously an effect on:

  • a biomarker (\(\gamma\));
  • an event (\(\alpha\));
  • a biomarker and an event (\(\beta\gamma+\alpha\)).

The best part?

A gain in statistical power to detect those effects

  • if \(\beta\neq0\), to detect a joint effect of \(Z\): \(\beta\gamma+\alpha\neq0\)
    (from Chen, Ibrahim, & Chu (2011));
  • compared to the extended Cox model.

What is a Joint Model? With equations!

The standard (joint likelihood) formulation involves two components:

  • a longitudinal component
  • a time-to-event (survival) component.

With:

  • \(n\), the sample size;
  • \(i\), an individual (\(i=1,\cdots,n\));
  • \(m_i\), the number of measurements on individual \(i\);
  • \(t_{ij}\), a time points (\(j=1,\cdots,m_i\)).

The longitudinal component

(Generalised) linear mixed effect model: \[Y_{i}(t_{ij})=X_{i}(t_{ij})+\epsilon_{i}(t_{ij})\]

\(X_{ij}\) is the trajectory function, and could be defined: \[\begin{gather}X_{i}(t_{ij})=\theta_{0i} + \theta_{1i}t_{ij} + \cdots + \theta_{pi}t_{ij}^p &, & \boldsymbol\theta_p \sim \mathcal{N}(\boldsymbol\mu_, \boldsymbol\Sigma)\end{gather}\]

For simplicity here, we assume linearity over time (\(\theta_{0i}+\theta_{1i}t_{ij}\)): \[\begin{gather}Y_{i}(t_{ij})=\theta_{0i}+\theta_{1i}t_{ij}+\gamma Z_i+\delta W_i+\epsilon_{ij} &, & \boldsymbol{\theta} \sim \mathcal{N}_2 (\boldsymbol{\mu},\boldsymbol{\Sigma})\end{gather}\]

The longitudinal component

(Generalised) linear mixed effect model: \[Y_{i}(t_{ij})=\theta_{0i}+\theta_{1i}t_{ij}+\gamma Z_i+\delta W_i+\epsilon_{ij}\]

With:

  • \(Y_{i}(t_{ij})\), the observed value;
  • \(X_{i}(t_{ij})\), the true (unobserved) value of the longitudinal measurement at time \(t_{ij}\) for individual \(i\);
  • \(\epsilon_{ij}\),a random error term, usually: \(\epsilon_{ij}\sim \mathcal{N}(0,\sigma^2)\);
  • \(Z_i\), a vector denoting the genotype of individual \(i\);
  • \(W_i\), a set of adjusting covariates.

The time-to-event component

(Extended) Cox model (proportional hazards): \[\begin{align} \lambda_i(t)&=\lim_{dt \to 0} \frac{P\{t\leq T_i<t+dt|T_i\geq t, \bar{Y_i}(t), Z_i, W_i\}}{dt}\\ &=\lambda_0(t) \exp\{\beta X_{i}(t) + \alpha Z_i + \eta W_i\} \end{align}\]

With:

  • \(\bar{Y_i}(t)=\{Y_i(u),0 \leq u \leq t\}\), the history of the trajectory;
  • \(T_i\), the event time for individual \(i\);
  • \(C_i\), the right censoring time (end of the follow-up);
  • \(\Delta_i\), an event indicator: \(\begin{cases} \Delta_i=0, & \text{if }\ T_i>C_i. \\ \Delta_i=1, & \text{if }\ T_i <= C_i. \end{cases}\)

Hypothesis testing

Null hypothesis: \(\begin{cases}H_0&:& \theta=\theta_0\\H_1&:& \theta\neq\theta_0\end{cases}\)
  • Likelihood Ratio Test
    \[LRT=-2\{\ell(\hat{\theta}_0)-\ell(\hat{\theta})\}\]

  • Wald Test
    \[\begin{gather} W=(\hat{\theta}-\theta_0)^\top \mathcal{I}(\hat{\theta})(\hat{\theta}-\theta_0)\\ \left(\text{Univariate: }(\hat{\theta}_j-\theta_{0j})/\widehat{\text{s.e.}}(\hat{\theta}_j)\right) \end{gather}\]

  • Score Test
    \[U=S^\top(\hat{\theta}_0)\{\mathcal{I}(\hat{\theta}_0)\}^{-1}S(\hat{\theta}_0)\]

Is the
Joint Model Approach
Worth It?

Let's find out with simulations!

Estimators (& Computation time)

  • Are Joint Model estimators good?
    • Bias, variance and RMSE (Root-Mean Square Error)
      \[\begin{align} \operatorname{MSE}(\hat\phi)&= \operatorname{Bias}(\hat\phi)^2 + \operatorname{Var}(\hat\phi)\\ \operatorname{RMSE}(\hat{\phi})&=\sqrt{\operatorname{MSE}(\hat\phi)}\\ &=\sqrt{E\{(\hat{\phi}-\phi)^2\}} \end{align}\] \[\phi=(\beta, \gamma, \alpha)\]
  • Can we do a whole genome analysis…
    … in a reasonable time frame?

A more "naive" approach!

What if, we split the job in two?

\(\Rightarrow\) "Two-Step"? (Tsiatis, DeGruttola, & Wulfsohn, 1995)

  1. (Generalised) linear mixed effect model
    \[\begin{align} Y_{i}(t)&=X_i(t)+ \epsilon_{i}(t)\\ X^*_{i}(t)&=E\{X_{i}(t)|\bar{Y_i}(t), T_i\geq t\} \end{align}\]

  2. (Extended) Cox model (proportional hazards)
    \[h_i(t)=h_0(t) \exp\{\beta X^*_{i}(t)\}\]

Time to generate fake data!

Let's keep it simple, i.e., without covariates:

  • the trajectory: \(Y_{i}(t)=\theta_{0i} + \theta_{1i}t + \gamma Z_i + \epsilon_{i}(t)\)

  • the event: \(\lambda_i(t)=\lambda_0(t) \exp\{\beta X_{i}(t) + \alpha Z_i\}\)

  • the time of event, e.g., the exponential distribution (Austin, 2012):
    \[\begin{gather} H_i(T_i)=\int_0^{T_i}\lambda_0(t) \exp(\beta X_i(t)+\alpha Z_i)dt , & \lambda_0(t)=\lambda\\ F_i(T_i)=1-exp(-H_i(T_i))=u , & u\sim\mathcal{U}(0, 1) \end{gather}\] \[T_i=\frac{1}{\beta\theta_{1i}}\log\left(1-\frac{\beta\theta_{1i}\times \log(1-u)}{\lambda \exp(\beta\theta_{0i}+(\beta\gamma+\alpha)Z_i)}\right)\]

Set the parameters!

Scott et al. (2012)

Yaghootkar & Frayling (2013)

Set the parameters!

Parameters and numerical values used for sensitivity analysis and simulations, based on results from rs17747324 within gene TCF7L2 in the French cohort D.E.S.I.R.
Parameters Values
Number of participants (\(n\)) 4,352
Number of measures (\(m\)) 4
Diabetes incidence rate (\(d\)) 0.0384
Minor allele frequency (\(f\)) 0.244
Random effects (\(\theta\)) \(\sim\mathcal{N}_2\left (\begin{bmatrix}4.55\\0.0108\end{bmatrix} , \begin{bmatrix} 0.143 & -0.00109 \\ -0.00109 & 6.8\times 10^{-04} \end{bmatrix} \right )\)
SNP effect on \(Y_{ij}\) (\(\gamma\)) 0.0229
SNP effect on \(T_i\) (\(\alpha\)) 0.265 (OR=1.3)
Association between \(Y_{ij}\) and \(T_i\) (\(\beta\)) 3.17
Error term (\(\epsilon\)) \(\sim\mathcal{N}(0,0.305^2)\)

How does our fake data looks like?

What could we expect?

Let's do some power calculation (chen_sample_2011):
\[\begin{gather} H_0:\ \beta\gamma+\alpha=0 \\ d=\frac{(z_{\tilde{\beta}}+z_{1-\tilde{\alpha}})^2}{f(1-f)(\beta\gamma+\alpha)^2} \\ z_{\tilde{\beta}}=\pm\sqrt{df(1-f)(\beta\gamma+\alpha)^2}+z_{1-\tilde{\alpha}} \end{gather}\]

With:
  • \(d\), the incidence rate;
  • \(f\), the allele frequency;

  • \(\tilde{\alpha}\), the significance level;
  • \(\tilde{\beta}_{\tilde{\alpha}}\), statistical power at \(\tilde{\alpha}\).

\(\Rightarrow\tilde{\beta}_{\tilde{\alpha}}=46.56\%\) for TCF7L2 (rs17747324).

Back to the Simulations!

Let's see some (animated) pictures!

\(\beta\): Association between \(X_{i}(t)\) and \(S_{i}\)

\(\beta\): Association between \(X_{i}(t)\) and \(S_{i}\)

\(\beta\): Association between \(X_{i}(t)\) and \(S_{i}\)

\(\gamma\): Association between \(Z_{i}\) and \(X_{i}(t)\)

\(\gamma\): Association between \(Z_{i}\) and \(X_{i}(t)\)

\(\gamma\): Association between \(Z_{i}\) and \(X_{i}(t)\)

\(\alpha\): Association beteen the \(Z_{i}\) and \(S_{i}\)

\(\alpha\): Association beteen the \(Z_{i}\) and \(S_{i}\)

\(\alpha\): Association beteen the \(Z_{i}\) and \(S_{i}\)

What about computation time?

Joint Model
Two-Step Model
Sample Size mean (sd)
per SNP
in seconds
100K SNPs
in days
mean (sd)
per SNP
in seconds
100K SNPs
in days
500 51 (3.4) 59 0.71 (0.066) 0.82
2,500 100 (11) 120 3.1 (0.092) 3.6
5,000 180 (25) 210 6.3 (0.17) 7.3
10,000 340 (34) 400 9 (0.22) 10

Few Rounds With Real Data

The French cohort D.E.S.I.R.

Small overview with a Manhattan plot

Because, effect size matters …

What do we expect from the literature?

Scott et al. (2012)

Yaghootkar & Frayling (2013)

It seems great, isn't it?

Well, here are the real results …

Does that make any sense?!

Actually, it does make sense …

rs10830963_G
(MTNR1B)
rs17747324_C
(TCF7L2)
\(\alpha\) -0.4404
(\(p=9.37\times 10^{-04}\))
0.2652
(\(p=4.09\times 10^{-02}\))
\(\beta\) 3.2511
(\(p=3.63\times 10^{-42}\))
3.1703
(\(p=8.93\times 10^{-42}\))
\(\gamma\) 0.0991
(\(p=1.33\times 10^{-23}\))
0.0229
(\(p=3.02\times 10^{-02}\))

To summarise


  • The Joint Model is better than the "Two-Step" approach

  • The "Two-Step" approach is not that bad, especially regarding computation time

  • \(\Rightarrow\) Use "Two-Step" as a screening approach and refine with a Joint Model

Me?

Austin, P. C. (2012). Generating survival times to simulate Cox proportional hazards models with time-varying covariates. Statistics in Medicine, 31(29), 3946–3958. https://doi.org/10.1002/sim.5452

Chen, L. M., Ibrahim, J. G., & Chu, H. (2011). Sample size and power determination in joint modeling of longitudinal and survival data. Statistics in Medicine, 30(18), 2295–2309. https://doi.org/10.1002/sim.4263

Elashoff, R., li, G., & Li, N. (2016). Joint Modeling of Longitudinal and Time-to-Event Data (1st ed.). London, UK: Chapman and Hall/CRC.

Flannick, J., & Florez, J. C. (2016). Type 2 diabetes: Genetic data sharing to advance complex disease research. Nature Reviews Genetics, advance online publication. https://doi.org/10.1038/nrg.2016.56

Marullo, L., El-Sayed Moustafa, J. S., & Prokopenko, I. (2014). Insights into the Genetic Susceptibility to Type 2 Diabetes from Genome-Wide Association Studies of Glycaemic Traits. Current Diabetes Reports, 14(11). https://doi.org/10.1007/s11892-014-0551-8

Rizopoulos, D. (2012). Joint models for longitudinal and time-to-event data: With applications in R. London, UK: CRC Press.

Scott, R. A., Lagou, V., Welch, R. P., Wheeler, E., Montasser, M. E., Luan, J., … Barroso, I. (2012). Large-scale association analyses identify new loci influencing glycemic traits and provide insight into the underlying biological pathways. Nature Genetics, 44(9), 991–1005. https://doi.org/10.1038/ng.2385

Tsiatis, A. A., DeGruttola, V., & Wulfsohn, M. S. (1995). Modeling the Relationship of Survival to Longitudinal Data Measured with Error. Applications to Survival and CD4 Counts in Patients with AIDS. Journal of the American Statistical Association, 90(429), 27–37. https://doi.org/10.2307/2291126

Yaghootkar, H., & Frayling, T. M. (2013). Recent progress in the use of genetics to understand links between type 2 diabetes and related metabolic traits. Genome Biology, 14(3), 203.