High-Dimensional Survival Analysis: Methods and Applications

In the era of precision medicine, time-to-event outcomes such as time to death or progression are routinely collected, along with high-throughput covariates. These high-dimensional data defy classical survival regression models, which are either infeasible to fit or likely to incur low predictability due to over-fitting. To overcome this, recent emphasis has been placed on developing novel approaches for feature selection and survival prognostication. We will review various cutting-edge methods that handle survival outcome data with high-dimensional predictors, highlighting recent innovations in machine learning approaches for survival prediction. We will cover the statistical intuitions and principles behind these methods and conclude with extensions to more complex settings, where competing events are observed. We exemplify these methods with applications to the Boston Lung Cancer Survival Cohort study, one of the largest cancer epidemiology cohorts investigating the complex mechanisms of lung cancer.


INTRODUCTION
Survival analysis is an area of statistics where the random variate is survival time or the time until the occurrence of a specific event, which represents a qualitative change or the transition from one discrete state to another (e.g., alive to deceased).The most often studied event in biomedicine is death, though events of interest in fields ranging from sociology to industry, to engineering, to finance, to astronomy are widely encountered.The goals of survival analysis are to describe the probability of an event occurring by some time, to detect associations between risk factors and events, or to predict survival times based on informative characteristics.What distinguishes survival outcomes from other outcomes is the presence of censoring, meaning that the event of interest may not be observed for all subjects; subjects whose event times are not observed are said to be censored.In practice, the fraction of event times that are censored in a study population can be substantial, prohibiting the direct use of standard regression methods.Estimation methods in survival analysis are built around extracting information from all subjects, censored or not.
In the era of precision medicine, survival outcomes with high-throughput covariates or predictors are routinely collected.These high-dimensional data (i.e., with the number of predictors exceeding the number of observations) challenge classical survival regression models, which are either infeasible to fit or likely to incur low predictability due to over-fitting.Recent emphasis has been placed on developing novel approaches for feature selection and survival prognostication.We will review various methods that handle survival outcome data with high-dimensional predictors, highlighting recently developed machine learning approaches for survival prediction.We will also discuss recent developments for deep learning in survival settings and introduce some new deep learning techniques in the presence of competing or semi-competing outcomes.A competing risk is an event whose occurrence precludes the occurrence of another event of interest (Austin & Fine 2017), while in a semi-competing setting, the occurrence of a non-terminal event (e.g., disease progression) is subject to a terminal event (e.g., death), but not vice versa (Haneuse & Lee 2016).We illustrate a novel deep learning approach for prediction under semi-competing outcomes, and exemplify the method using data from the Boston Lung Cancer Survival Cohort (BLCSC), a large hospital-based cancer epidemiology cohort investigating the molecular mechanisms and clinical pathophysiology of lung cancer (Christiani 2017).
This paper is outlined as follows.In Section 2, we provide a brief overview of some key concepts and notation in survival analysis and introduce the necessary perquisites on which much of the subsequent literature is built.In Section 3, we survey current techniques for fitting survival models with high-dimensional covariates, primarily focusing on methods that perform feature selection under sparsity assumptions.We briefly discuss ultra highdimensional settings and introduce screening methods, and end this section with a discussion of methods for drawing valid inference with high-dimensional covariates.In Section 4, we turn to machine learning for survival prediction.We first discuss the application of common machine learning concepts in these settings, such as a support vector machines, recursive partitioning and survival trees, and ensemble learners such as random survival forests.We briefly review artificial neural networks and extend this notion to survival prediction.In Section 5, we review existing deep learning procedures for competing risk analysis, illustrate a new deep learning approach for predicting semi-competing outcomes, and work through the BLCSC study.We conclude with remarks on future work and open areas.The online Supplementary Material tabulates the reviewed methods and their available software, and presents additional simulation results.

NOTATION
Consider a study consisting of n subjects.The outcome variable is the time to the event of interest, such as death or cancer progression.Events in other contexts can be bankruptcy, COVID-19 infection, graduation, missing a mortgage payment, etc.A time zero also needs to be set carefully, to have proper biological or practical interpretations when helping to address specific scientific questions.For instance, some common choices of time zero in medical studies include date of birth, time of diagnosis, date of randomization in a clinical trial, or first date receiving a treatment.A unique aspect of survival analysis is that the event may go unobserved for some individuals.In particular, right censoring occurs when a subject's follow-up ends before the event can be observed (Figure 1).Though other types of censoring exist, we focus on right censoring, which happens most often in practice.
We denote the ith subject's survival and censoring times respectively by Ti and Ci (i = 1, . . ., n), which are non-negative random variates.For the ith subject, we observe Xi, a p-vector of covariates, Yi = min(Ti, Ci), and the event indicator δi = I(Ti ≤ Ci), where I(•) is an indicator function.We assume that subjects are independent from each other, and that Ti ⊥ Ci, given Xi.Often, the goal of survival analysis is to associate Xi with the distribution of Ti, and, in particular, model the conditional hazard function given Xi, i.e., which measures the instantaneous failure rate at a given time among those who are alive and with Xi.Throughout this review, for simplicity, we assume that Xi is time-invariant, though in many circumstances extensions to time-dependent covariates are possible.

Study Start Study End
Patient 1 Schematic of observations for two example patients, with different entry times, over the course of a study.The event of interest, death, is observed for Patient 1, whereas Patient 2 is censored, as the patient is still alive at the end of the study.

HIGH-DIMENSIONAL SURVIVAL MODELS
In high-dimensional settings, it is not recommended to build prediction models with all of the available features due to the risk of over-fitting.A useful strategy is to select only the most vital features under the assumption of sparsity, meaning that most of the potential predictors are 'unimportant,' with nearly no effect on the outcome (Friedman et al. 2010).
A key question is how to perform variable selection and estimation simultaneously, and the most widely used approaches fall under the class of regularized regression models.Regularization refers to the addition of a penalty term to the objective function, which shrinks the coefficient estimates toward zero and possibly forces some of them to be exactly zero.This mitigates over-fitting and results in parsimonious prediction models (Tibshirani 1996).

Regularized Cox Models
The approach that dominates survival analysis in the biomedical literature is the Cox (1972) proportional hazards model, famed for presenting both a novel hazard model and a novel concept of partial likelihood.The model links the conditional hazard function (1) to Xi via where the baseline hazard, λ0(t), is unspecified, and β = (β1, . . ., βp) is the coefficient vector of Xi to be estimated, with a fixed p < n, by maximizing the partial likelihood, i.e., with P Li(β) being the contribution for subject i who is observed to die: where R(Yi) = {j : Yj ≥ Yi}.In high-dimensional settings, that is, p > n and asymptotically p and n may both go to infinity (Zhao & Yu 2006), directly optimizing the partial likelihood is not feasible because of over-parameterization.Instead, regularized regression adds a penalty term to the negative log partial likelihood, (β), and optimizes a penalized version of objective function: where the penalty P en(β) is controlled by a positive tuning parameter, η, to be selected through cross-validation.A widely recognized family of penalties is based on the lq-norm, |βj| q ) 1/q , q ≥ 0.
Regularization approaches with P en(β) = ||β|| 2 2 , known as ridge regression (Hoerl & Kennard 1970), are applied to the Cox model, and return unique and shrunk estimates (Verweij & Van Houwelingen 1994).However, ridge regression does not promote sparsity, as it cannot shrink individual coefficients to zero.The least absolute shrinkage and selection operator (LASSO) (Tibshirani 1996), with P en(β) = ||β||1, penalizes the absolute sum of the coefficient estimates and has been routinely used for producing sparse models.Its application (Tibshirani 1997) to survival settings, namely, Cox LASSO, has become a widely used approach for high-dimensional survival analysis by performing feature selection and estimation simultaneously (Figure 2).
LASSO has several notable statistical properties.It possesses model selection consistency under certain regularity conditions, in particular, the strong irrepresentable condition when p grows much faster than n (i.e., that the absolute sum of coefficients for the regression of any noise variable on signal variables must be strictly smaller than 1) (Zhao & Yu 2006).It has a Bayesian interpretation by viewing β as having a "double exponential" prior (Tibshirani 2009).However, as the LASSO penalty term is linear in the size of the coefficients, it leads to biased estimates, especially for the coefficients with large absolute values.To remedy this, Zhang & Lu (2007) (Fan & Li 2002).While the SCAD penalty retains the penalization rate of LASSO for small coefficients, it relaxes the rate of penalization smoothly as the absolute value of the coefficient increases.Asymptotically, the SCAD penalty yields √ n-consistent estimates (with a proper rate of η) and possesses oracle properties (if √ nη → 0 and nη → ∞).Strong oracle properties for LASSO and SCAD were established in Bradic et al. (2011), which further proposed a class of nonconvex penalization procedures for the Cox model.Nonconvex regularization, including SCAD, is appealing as it obtains support recovery properties under much weaker assumptions than for l1 penalization (Loh & Wainwright 2017).
Another extension is the elastic net penalty for Cox models (Wu 2012), which combines the LASSO and ridge penalties; but, unlike LASSO, is capable of selecting more predictors than the sample size (Zou & Hastie 2005).This notion was generalized by Vinzamuri & Reddy (2013) with the kernel elastic net Cox regression model, which replaces the ridge penalty with β Σβ.Here, Σ is a p × p radial basis function kernel matrix of predictors, which measures pairwise similarity between predictors.This penalty is meant to encourage correlated predictors to have similar strengths on survival prediction.Other regularized Cox methods include the group Cox LASSO (Kim et al. 2012), which selects groups of related covariates as a whole, and the fused LASSO (Chaturvedi et al. 2014), which penalizes both the coefficient estimates and their successive differences for ordered features (Table 1).
where Y , X, β, and are an n × 1 vector of responses, an n × p covariate matrix, a p × 1 vector of coefficients and an n × 1 vector of zero-mean residual errors, respectively.It estimates β by solving min ||β||1 where ηQ > 0 is a tuning parameter.Empowered by linear programming, the Dantzig selector offers a useful alternative as a regularized estimating equation approach.As a dual problem of LASSO, it often produces the same solution path (Candès & Tao 2007).
On the other hand, accelerated failure time (AFT) models have become a useful alternative to Cox models due to the ease of interpretation (Saikia & Barman 2017).An AFT model links the (log transformed) survival time to covariates via a linear model log(Ti) = X i β + ei, 3.
where the log transformation ensures the parameter space of β is unconstrained, and the distribution of the errors, ei, induces a distribution for Ti (Table 2).For parametric AFT models, maximum likelihood estimation (MLE) can be used for inference.When ei's distribution is unspecified, the models are semi-parametric and the MLE is difficult to obtain, as the likelihood involves infinite dimensional parameters.With a fixed p < n, Buckley & James (1979) proposed an estimating equation approach by imputing the censored outcomes, and solving a least squares problem.For AFT models with high-dimensional predictors, one cannot directly apply LASSO estimation, as the objective function again involves infinite dimensional parameters.Mo-tivated by the work of Candès & Tao (2007) for regularized least squares estimation, Li et al. (2014) used Buckley-James imputation to express AFT estimation as a least squares problem and then applied the Dantzig selector: are imputed outcomes with the Kaplan- Here, the projection matrix Pn = In − 11 /n, where In and 1 are an n × n identity matrix and an n × 1 vector of 1's, is for centering covariates to avoid estimation of the intercept or the expection of ei in model ( 3).An iterative approach is necessary because this is not a linear programming problem, and, like the Dantzig selector for linear models, estimates may be biased and may not possess the oracle property.
To address this, Li et al. (2014) considered an adaptive version of Dantzig selector with data-driven weights that vary inversely with the magnitude of coefficients.They showed that the weighted Dantzig selector has model selection consistency and oracle properties.On the other hand, a Dantzig selector for the Cox model was proposed by Antoniadis et al. (2010) based on partial likelihood score equations.
Note that, in ultra high-dimensional settings where p n, penalized variable selection methods such as those described so far may incur high computational costs, numeric instability, and poor reproducibility (Fan & Lv 2008).As such, variable screening is a crucial first step in identifying predictive biomarkers and reducing the dimensionality of the feature space before applying regularized methods.Feature screening methods such as sure independence screen (SIS) (Fan & Lv 2008) fit marginal regression models for each covariate one at a time, choose a threshold, and retain those covariates with magnitudes of marginal effects above the threshold.In the ultra high-dimensional survival settings, additional censoring issues need to be addressed.Recent advancements in survival feature screening have included sure independence screening (Fan et al. 2010), principled sure independent screening (Zhao & Li 2012), score test screening (Zhao & Li 2014), concordance-measure based screening (Ma et al. 2017), Buckley-James assisted sure screening (Liu et al. 2020), conditional screening (Kang et al. 2017, Hong et al. 2018b), integrated power density screening (Hong et al. 2018a), Lq norm screening (Hong et al. 2020), and forward regression (Hong et al. 2019, Pijyan et al. 2020).See a focused review of survival feature screening in Hong & Li (2017).

Inference with High-Dimensional Covariates
As simultaneous estimation and inference is challenging within the high-dimensional survival framework, we review limited methods available for drawing inference in this area.More broadly, high-dimensional regression inference methods largely fall under post-selection inference and debiased LASSO estimation.Further challenges arise in that post-selection inference is conditional on the selected subset and does not account for variation in model selection.Several authors used debiased LASSO (Van de Geer et al. 2014, Javanmard & Montanari 2014, Yu et al. 2018) to draw inference; however, these methods require estimation of the inverse of a p × p information matrix, which is a daunting task, especially when p > n (Xia et al. 2021a,b).

Selection-Assisted Partial Regression and Smoothing (SPARES).
To address this challenge, Fei et al. (2019) proposed SPARES to draw inference for high-dimensional linear models (2) with p > n.Under this framework, model selection and partial regression are conducted separately on partitioned data, and multiple sample splittings or boostraps are used to account for variations in variable selection and estimation.Specifically, given data D = (X, Y ) and a variable selection procedure Sη, data are split into equally sized D1 and D2.Denote the variables selected by Sη on D2 as S = Sη(D2).On D1 = (X 1 , Y 1 ) and for any j ∈ {1, . . ., p}, Y 1 is regressed on where {•}j denotes the estimate corresponding to variable j.Equation ( 4) is termed the partial regression estimator (Fei et al. 2019).Set β = ( β1, . . ., βp) .The rationale behind this idea is that, if Pr(S ⊃ S0) → 1, where S0 is the true active set, the "one-time" partial regression in (4) returns a consistent estimator for β 0 j , regardless of whether j ∈ S.However, the one-time estimator is highly variable, heavily depending on S and the specific data-split, and does not account for variation in the variables selected.To address this, data are randomly split, say, B times, and partial regression are carried out over these B random splits.Denote by βb the estimate of β based on the bth resample (b = 1, . . ., B).
To draw inference, a nonparametric delta method (Efron 2014, Van der Vaart 2000) is used to estimate the standard error of βj (j = 1, . . ., p) as ŝe , where ĉ ovij is the sample covariance between I bi and βb j , with I bi indicating whether subject i is included in the bth resample (used for partial regression).Approximate 95% confidence intervals are given by βj ± 1.96 ŝe B j , while a two-sided p-value testing H0 : βj = 0 is given by 2 × [1 − Φ(| βj|/ ŝe B j )], where Φ(•) is the distribution function of N (0, 1).SPARES provides a novel inference technique that converts a high-dimensional problem to a low-dimensional regression.It is valid with general selection methods, including LASSO, SCAD, screening, and boosting, as long as they possess selection consistency or the relaxed "sure screening" condition in Fei & Li (2021).Further, this approach is not sensitive to the tuning parameter η in Sη and can be extended to analyze censored outcomes, as detailed below.
As a concrete example, with a subset of 153 patients from the BLCSC study, Hong et al. ( 2019) fit a censored quantile regression model that linked the conditional quantile of overall survival to age (years), sex (0: female; 1: male), pack-years, cancer type (0: adenocarcinoma; 1: non-adenocarcinoma), and cancer stage (0: stage one; 1: stage two or above).Figure 3 displays the point estimates (blue curves) of the quantile-specific regression coefficients and their 95% confidence intervals (lighter blue shaded regions).While methods have been proposed to deal with variable selection for high-dimensional censored quantile regression (HDCQR), including penalized quantile regression (Wang et al. 2013a), adaptive penalized quantile regression (Zheng et al. 2013), model-free variable screening (He et al. 2013), and stochastic integral-based estimating equations (Zheng et al. 2018), none could draw statistical inference with HDCQR.Belloni et al. (2019) provided post-selection inference in high-dimensional quantile regression at some fixed quantile levels; however, the method cannot handle censored outcomes.
To address this issue, Fei et al. (2021) proposed a Fused-HDCQR method, which utilizes a variable selection procedure for HDCQR (Zheng et al. 2018) to reduce the dimension of the data, and applies partial regression to estimate the effect of each predictor, regardless whether it is selected.Estimates are aggregated based on multiple data splits and selections.Specifically, when p > n, Fused-HDCQR adapts the SPARES procedure and fits low dimensional CQR's using Equation ( 6).With B random sample splittings, these estimates, denoted by βb j , b = 1, . . ., B, are aggregated to form the Fused-HDCQR estimator: and a functional delta method (Van der Vaart 2000) can be applied to estimate the variance of βj(τk).The Fused-HDCQR procedure involves repeated fitting of low dimensional regressions, which is computationally feasible and can estimate and conduct hypothesis tests for the heterogeneous effects of various risk factors.The use of Fused-HDCQR is illustrated with the BLCSC data by studying the differential impacts of genetic variants on different quantiles of survival times.For example, Fei et al. (2021) focused on 2,002 candidate SNPs residing in 14 well-known lung cancer related genes and investigated how each SNP played a different role among high-(i.e., lower quantiles of overall survival) versus low-risk (i.e., higher quantiles of overall survival) cancer survivors.With the Fused-HDCQR approach, the estimated coefficient of active smoking ranged from −0.42 (p = 0.0011) to −0.53 (p = 0.0005) as τ changed from 0.2 to 0.5, and then increased to −0.31 (p = 0.038) as τ changed to 0.7, suggesting that active smoking might be more harmful to the high-or median-risk groups than the low-risk group of patients.The results resonated with the strong need to develop effective smoking cession programs among high-risk populations (Barbeau et al. 2006).Further, SNP AX.37793583 T remained significant throughout τ = 0.2 to τ = 0.7, however, its estimated coefficient decreased from 2.75 (τ = 0.2) to 1.39 (τ = 0.7), indicating its heterogeneous impacts on survival, i.e., stronger protective effect at lower quantiles and vice versa, which could not be detected using traditional Cox models (Fei et al. 2021).

MACHINE LEARNING TECHNIQUES FOR SURVIVAL PREDICTION
Significant work has gone into the development of machine learning algorithms that can accommodate survival data.These nonparametric learning approaches can handle nonlinear relationships or higher-order interaction that would otherwise be costly in classical methods, and can improve accuracy in prediction for survival outcomes.

Support Vector Machines
Support vector machines (SVMs) fall under the supervised learning family (Vapnik et al. 1995, Noble 2006) and seek to find a hyperplane that provides maximal separation between groups (Figure 4).Specifically, consider a binary outcome Yi ∈ {−1, 1} for each individual i with a corresponding p-dimensional covariate vector, Xi.The goal of SVM is to identify a hyperplane, H(ψ, a) = {v ∈ R p | ψ, v + a = 0}, separating these two groups so that the margin, 2/||ψ||, can be maximized, where ψ ∈ R p is the slope vector, and •, • denotes the inner product (Figure 4).Often, the two classes may not be separable in the original feature space within R p , and we use F(•) to map the original predictors to a higher dimensional space where the outcomes can be distinguished, in which case, the hyperplane to deal with is H(ψ, a) = {v ∈ R p | ψ, F(v) + a = 0} and, with slight overuse of notation, the dimension of ψ is the same as that of F(v).In practice, F(•) does not have to be obtained explicitly and ψ, F(v) can be calculated by using a reproducing kernel (Wahba et al. 1999).We further introduce a slack variable, ξi = [1 − Yi{ ψ, F(Xi) + a}]+, to dictate the degree to which the ith data point is misclassified, as illustrated in Figure 4.
Support Vector Support Vector

Misclassified Point
Figure 4 A support vector machine to distinguish binary outcomes with two-dimensional covariates v = (v 1 , v 2 ) by a linear separating line.The solid line represents the optimal hyperplane separating the data, while the dotted lines denote the maximal margin defined by the support vectors (encircled nodes) for one group (white circles) versus the other (cyan squares).Misclassified points are labeled in red, with corresponding magnitudes for slack variables, ξ.
SVMs have been extended to model continuous time-to-event data, which are prone to censoring, by predicting the survival time to be ψ, F(Xi) + a. Van Belle et al. (2007) formulated the survival SVM based on the rank concordance between the prediction and observed survival time, Yi, among comparable individuals in the presence of censoring.Specifically, they introduced a comparability indicator, vij = δiI(Yi < Yj), such that the ordering of the observed survival times for subjects i, j can only be determined when vij = 1.For a comparable pair with vij = 1, a concordance in rank is reached if and only if ψ, F(Xj) − ψ, F(Xi) > 0. Allowing varying degrees of pairwise slacks, i.e., when ψ, F(Xj) − ψ, F(Xi) ≤ 0 with vij = 1, across comparable pairs, Van Belle et al. (2007) proposed to solve min and ξij ≥ 0, i, j = 1, . . ., n, where ξij's are pair-specific slacks, whose summation is to be minimized, and γ > 0 is a regularization parameter controlling the maximal margin and misclassification penalties.This formulation can be shown to maximize the Harrell rank-based concordance index (C-index) (Harrell et al. 1982).Hence, it is termed the rank-based SVM approach for survival data and does not estimate the "intercept" a.An alternative regression approach (Shivaswamy et al. 2007) aimed to find a prediction, ψ, F(Xi) + a, for continuous survival times, by identifying a hyperplane that best fit the data that are subject to censoring (Smola & Schölkopf 2004): With censoring indicators incorporated into the constraints, the formulation utilizes available information from both censored and non-censored observations.To make full use of the strengths of both approaches, Van Belle et al. ( 2011) and Pölsterl et al. (2015) further proposed hybrid approaches, combining the penalties imposed by both methods.

Tree-Based Methods
While SVMs are adept at estimating non-linear relationships, they do not scale well for large datasets and often under-perform when the outcomes are noisy.Also there may be no clear interpretations for classifying data points above or below the estimated hyperplane (Somvanshi et al. 2016).Decision trees are an alternative for classifying patients that provide an intuitive interpretation of the hierarchical relationships between predictors.Broadly, classification and regression trees (CART) is an umbrella term for a set of recursive partitioning algorithms, which predict the group membership (classification) or target value (regression) for an observation based on a set of binary decision rules (Figure 5).Gordon & Olshen (1985) first presented survival trees, and Ciampi et al. (1986Ciampi et al. ( , 1987) solidified the notion and established splitting criteria based on the log-rank and likelihood ratio test statistics, respectively, gaining predictive accuracy and interpretability.A recursive partitioning algorithm for generating a survival tree is given as follows.
1. Discretize each covariate to be a binary variable (categorical variables with m levels are expressed as m − 1 dummy variables).2. For every binary covariate, Xj, j = 1, . . ., p, compute the log-rank statistic to test the difference between the survival curves for the two groups defined by Xj. 3. Choose the covariate, say, Xj * , with the largest significant test statistic and partition the full sample (i.e., the root node) into two groups (child nodes) based on Xj * .4. Repeat steps 2-3 for each subset (child node) until reaching the terminal nodes, that is, no covariates produce a significant test statistic and there are enough events (exceeding a prespecified number) in each terminal node.
The resulting terminal nodes split the original sample into distinct groups, who are deemed more homogeneous within each group, and will output survival estimates via Kaplan-Meier estimation in each group.Further variations in splitting are based on metrics that accommodate censored data and by either minimizing within-node homogeneity or maximizing between-node heterogeneity.For example, these metrics can be Martingale residuals (Therneau et al. 1990) or deviance residuals (LeBlanc & Crowley 1992).With an established splitting criterion, to select a final tree, either a full survival tree is 'grown' and 'pruned' or a stopping rule is applied in backward or forward selection (Bou-Hamad et al. 2011).

X3
Terminal Node 3 Terminal Node 4 Illustration of a classification tree with three binary covariates that yields four terminal nodes.

Ensemble Learners
While survival trees provide a fast and intuitive means of studying hierarchical relationships of predictors with outcomes, they are prone to over-fitting and high variability (Hu & Steingrimsson 2018, Steingrimsson et al. 2016).Ensemble learners overcome the instability issues by using techniques such as bagging, boosting, and random forests.Bagging for survival trees was first proposed by Hothorn et al. (2004); in contrast to bagging for classification trees, aggregation is done by averaging survival predictions, rather than a 'majority vote.'Each survival tree is grown so that every terminal node has enough events, which are used to predict the survival function node-wise at each terminal node.Then, for any newcomer, the predictions are averaged over the individual trees to yield an ensemble prediction of their survival function (Figure 6).4.3.2.Boosting.In a similar vein, boosting trains a series of weak learners with the goal of aggregating them into a better ensemble learner (Bühlmann & Hothorn 2007).Hothorn et al. (2006) proposed a gradient boosting algorithm for survival settings.Consider a mortality risk prediction based on covariates, Xi.For an M -step gradient boosting algorithm, a prediction, Fm(Xi), is made at each step, say m = 1, . . ., M , based on a previous prediction, Fm−1(Xi), and an additional weak learner fm(Xi), which is the projection of the Ensemble learning with bootstrap aggregation (bagging) for survival trees.
"residual error" of Fm−1(Xi) to the space spanned by Xi, where 0 < wm ≤ 1 (e.g., wm = 0.1) is the step size, the residual error refers to the gradient of the loss function, e.g., the negative log partial likelihood function in a survival setting, evaluated at Fm−1(Xi), and the number of steps, M , can be viewed as a tuning parameter.
Boosting has two notable differences from bagging.First, boosting trains weak learners sequentially, updating the weights placed on learners iteratively, whereas in bagging individual weak learners such as survival trees are trained independently and in parallel, which are aggregated via majority voting or averaging.Second, boosting is applicable to settings where learners have low variability and high bias, as the performance is improved by redistributing the weights.In contrast, bagging is often applied when individual learners exhibit high variability, but low bias, as it reduces variations arising from individual trees.4.3.3.Random Forests.Yet another class of ensemble learners are random forests (Breiman 2001), which, like bagging, aggregate predictions from individual trees generated over bootstrap resampled datasets.However, differing from bagging, random forests randomly select a subset of features, say p < p features, when generating each tree and use them for the individual tree's growth.By doing so, random forests reduce correlations among individual trees, leading to gains in accuracy (Breiman 2001).The choice of p is problem-specific, which can also be viewed as a tuning parameter.In survival settings, Ishwaran et al. (2008) aggregated the survival predictions arising from each tree by averaging the predicted cumulative hazard functions into an ensemble prediction.
Further notable developments include Ishwaran et al. (2011), which extended random survival forests to high dimensions by incorporating regularization, Ishwaran & Lu (2019), which provided standard errors and confidence intervals for variable importance, and Steingrimsson et al. (2019), which proposed censoring unbiased regression trees and ensembles.

Deep Learning and Artificial Neural Networks
Deep learning has emerged as a powerful tool for risk prediction.This work stems from artificial neural networks that tried to mirror how the human brain functions (Rosenblatt 1958), wherein nodes (or neurons) are connected in a network as a weighted sum of inputs through a series of affine transformations and non-linear activations.
A fully-connected, feed-forward artificial neural network is made up of L layers, with k l neurons in the lth layer (l = 1, . . ., L) (Figure 7).With an input, network predictions are made based on an L-fold composite function, fL Diagram of a feed-forward, fully-connected two-layer artificial neural network, including the hidden (1st) and output (2nd) layer.The input (0-th) layer is not counted as a real neural network layer.
For survival prediction, several deep learning approaches have emerged, beginning with the seminal work of Faraggi & Simon (1995), which adopted a fully-connected, feed-forward neural network to extend the Cox model to nonlinear predictions.Other feed-forward neural networks (Liestbl et al. 1994, Brown et al. 1997, Biganzoli et al. 1998, Eleuteri et al. 2003) used the survival status as a training label, and output predicted survival probabilities.Further developments have been made in Bayesian networks (Bellazzi & Zupan 2008, Lisboa et al. 2003, Fard et al. 2016), convolutional neural networks (Yao et al. 2017, Katzman et al. 2016, 2018, Ranganath et al. 2016), and recurrent neural networks (Yang et al. 2018).

PREDICTION FOR COMPETING AND SEMI-COMPETING RISKS
Many survival processes in real applications involve multiple competing events.Risk prediction in these settings is an up-and-coming field with many potential developments.We focus on two common competing event settings, i.e., competing and semi-competing risks.

Competing Risks
In a competing risk setting, observing an event type, labeled by c ∈ {1, . . ., K}, effectively eliminates the chance of observing other event types happening to the same individual (Young et al. 2020).For example, when studying the survival of patients with cancer, competing events can be cancer-related death (c = 1) or death by cardiac disease (c = 2) (Figure 8); an individual cannot die of cardiac disease once they have died of cancer, and vice versa.For characterizing the risk of competing events, there are two commonly used statistical metrics, namely, the cause-specific hazard and the subdistribution hazard, which target different counterfactual scenarios.The former describes the risk under hypothetical elimination of competing events, while the latter is about the observable risk without elimination of any competing events (Rudolph et al. 2020).Several authors (Lau et al. 2009, Koller et al. 2012) have stated that the subdistribution hazard is useful for predicting the probability of having an event of a type of interest by a given time, termed the cumulative incidence function (CIF), which reflects an individual's actual risks and prognosis.In the following, we focus on the subdistribution hazard, which is derived from CIF, i.e., Fc(t) = Pr(Ti < t, Ci = c), where Ci marks the event type for subject i.Specifically, for each event type c = 1, . . ., K, it is defined as

Time
which denotes the instantaneous risk of failure from event type c among those who have not experienced this type of event.That is, the risk set at t includes those who are event free as well as those who have experienced a competing event (other than type c) by t.The subdistribution hazard model (Fine & Gray 1999) links a subdistribution hazard function to covariates via λc(t|Xi) = λ0c(t) exp(X i β), 7.
where λ0c(t) is the baseline subdistribution hazard function for event type c, and β specifies the effect of Xi on the probability of event c occurring over time.In fact, model ( 7) implies that 1 − Fc(t|Xi) = {1 − F0c(t)} exp(X i β) , where Fc(t|Xi) and F0c(t) are the CIF given Xi and the baseline CIF, respectively.With high-dimensional predictors, several authors (Kawaguchi et al. 2019, Ha et al. 2014, Ahn et al. 2018) proposed regularized subdistribution hazard models for variable selection, and Hou et al. (2019) further performed inference using a one-step debiased LASSO estimator.For prediction, several deep learning works for competing risks have been proposed based on CIFs.For example, DeepHit (Lee et al. 2018) developed a multi-task network to nonparametrically estimate Fc(t|Xi) for c = 1, . . ., K. The network is trained to minimize a loss function, which is constructed based on the joint distribution of the first hitting time for competing events of each subject, while ensuring the concordance of estimates across subjects (Harrell et al. 1982), that is, a patient who died at a given time should have a higher risk at that time than a patient who survived longer.Dynamic DeepHit (Lee et al. 2019) further incorporated longitudinal information for dynamic predictions.Other approaches have included DeepCompete (Aastha & Liu 2020), as well as a hierarchical approach to multi-event survival analysis (Tjandra et al. 2021).

Semi-Competing Risks
Semi-competing risk problems, a variant of competing risk problems, have commonly been encountered in clinical studies.By semi-competing, we mean that the occurrence of one event, i.e., a non-terminal event, is subject to the occurrence of another terminal event, but not vice versa (Figure 9).As the non-terminal event (e.g., cancer progression) is often a strong precursor to the terminal event (death), semi-competing events are often related and, hence, the terminal event may informatively censor the non-terminal event (Jazić et al. 2016).To overcome such informative censoring, researchers either consider only the terminal event (i.e., mortality) or a composite outcome such as progression-free survival, that is, time to progression or death, whichever comes first.
What is lacking here is how to model a predictor's potentially different roles in disease progress and death, while utilizing the crucial information about the sojourn time between progression and death.Even in settings where the non-terminal and terminal event times are only modestly correlated, failing to acknowledge this sojourn time may lead to incorrect inference or biased predictions (Crilly et al. 2021).5.2.1.The Illness-Death Model.Central to the formulation of the semi-competing problem is the illness-death model, a compartment-type model for the rates at which individuals transition from being event-free (e.g., from time of diagnosis) to progression or to death or from progression to death (Andersen et al. 2012).Letting Ti1, Ti2, and Ci denote the times to the non-terminal and terminal events, and the censoring time, respectively, we observe (Yi1, δi1, Yi2, δi2, Xi), i = 1, . . ., n, where Yi2 = min(Ti2, Ci), δi2 = I(Ti2 ≤ Ci), Yi1 = min(Ti1, Yi2), δi1 = I(Ti1 ≤ Yi2), and Xi is a p-vector of covariates.The hazards for each subject at t (since diagnosis) are defined and modeled as where 0 < t1 < t, and ( 8)-( 10), respectively, correspond to the transition from diagnosis to progression prior to death, from diagnosis to death prior to progression, and from progression (that happens at t1) to death (Haneuse & Lee 2016).Here, γi i.i.d ∼ Γ(1/θ, 1/θ) (i.e., both shape and rate are 1/θ so that the mean and variance are respectively 1 and θ), i = 1, . . ., n, is a patient-specific frailty that models the dependence among these three transition processes within subject i, that is, a larger value of θ reflects a stronger dependence.In addition, λ0g (•) , g = 1, 2, 3, are the baseline hazard functions for the three state transitions, respectively, and hg(•), g = 1, 2, 3, are log-risk functions which relate a patient's covariates to each potential transition.The λ0g functions can be taken to be Weibull functions or piecewise constant with jumps at the distinct observed event times.Given ( 8)-( 10), and by integrating out the frailty term, Reeder et al. (2022) derived the marginal likelihood based on n independent subjects as where λgi(s) = λ0g (s) exp {hg(Xi)}and Λgi(t) = t 0 λgi(s)ds for g = 1, 2, 3.

A New Deep
Learning Approach for Semi-Competing Risks.We propose a multitask deep neural network for semi-competing risks (DNN-SCR), by using Equation (11) as the objective function with potentially high-dimensional covariates.DNN-SCR consists of three risk-specific sub-networks, respectively corresponding to the three possible state transitions, and a finite set of trainable parameters for specifying the baseline hazards (i.e., the φ parameters in Figure 10) if we specify Weibull baseline hazards, λ0g(s) = φg1φg2s φ g2 −1 for g = 1, 2, 3, in ( 8)-( 11) as well as the dependence among the three transition processes (i.e., θ in Figure 10).As opposed to the classical models, we opt for flexible, nonparametric estimation of hg(•), g = 1, 2, 3, to better capture potential non-linear dependencies of covariates on semi-competing events and to maximize the predictive accuracy.
In particular, we design three neural network sub-architectures to estimate the h functions nonparametrically as outputs.For identifiability, we require hg(0) = 0, g = 1, 2, 3, where 0 is a p × 1 vector of 0's.Each sub-network is a fully-connected feed-forward neural network with ReLU activation functions and a linear activation in the final layer (Figure 10).The numbers of hidden layers and nodes per layer as well as the dropout and learning rates are optimized as hyperparameters over a grid of values based on predictive performance.We implement our approach using the R interface for the deep learning library TensorFlow (Abadi et al. 2015), with model building and fitting done using Keras API (Chollet et al. 2015).Finite dimensional parameter training is done via the Gradient-Tape API (Agrawal et al. 2019) for automatic differentiation.Intensive simulations have indicated the new method predicts the risks well (Supplement A).Revisiting the BLCSC study, we exemplify our method by studying the impact of clinical and genetic predictors on disease progression and mortality.The subset includes 5,296 patients with non-small cell lung cancer, diagnosed between June 1983 and October 2021.Also included in the dataset are patients' characteristics, namely, age at diagnosis (years), sex (0: male; 1: female), race (0: other; 1: white), ethnicity (0: non-Hispanic; 1: Hispanic), height (meters), weight (kilograms), smoking status (0: never; 1: former; 2: current), packyears, cancer stage (1-4), and two indicators of genetic mutations (EGFR and KRAS).Semi-competing events of cancer progression and death are documented in the data; the date of progression is the date of the first source evidence, including exam, radiology report or pathology.Progression followed by death is observed in 111 (2%) patients, progression but alive at the last followup date is observed in 224 (4%) patients, and death prior to progression is observed among 1,916 (36%) patients.
We estimate the frailty variance, θ, to be 3.15 (bootstrapped 95% CI: 3.02-3.29),suggesting that progression is indeed correlated with death.Figure 11 depicts the h functions (log risks) for the effect of age at diagnosis on each state transition, stratified by sex and initial cancer stage while fixing the other covariates to be at their sample means or modes.There seems to exist a non-linear effect of age that differs by transition, cancer stage and sex.The left panel shows that younger age and more advanced stage is associated with higher hazards for progression; for the transitions from diagnosis or progression to death (the middle and right panels), older age is associated with higher hazards; interestingly, while sex does not seem to play a role in disease progression (the left panel), male patients are more likely to die than female patients after diagnosis (the middle panel) or after progression (the right panel).Finally, more advanced stage is associated with higher hazards for all the transitions.

CONCLUSIONS
We have presented various methods for analyzing survival outcome data with highdimensional predictors.We first provided a primer on time to event data and the unique features of survival analysis that make it distinct from other areas of statistics.We then reviewed regularization approaches for extending classical models such as the Cox, AFT, and censored quantile regression models, which lay the foundation for much of the subsequent work in this field, to high-dimensional settings.We briefly touched on feature screening for ultra high-dimensional predictors and discussed high dimensional inference with survival data.Finally, we focused on machine learning for survival prediction and concluded with methods at the forefront of the field of prognostication with competing event data.
This review is intended to provide a roadmap for readers interested in high-dimensional survival analysis (see Supplement B for tabulation of the reviewed methods and their available software), though our review is by no means exhaustive.This is an exciting and rapidly evolving field, with many open questions and new developments.For example, progress in survival predictions with high-dimensional predictors, including deep learning, active learning, and transfer learning will open new avenues to interdisciplinary breakthroughs in biomedical research and data-driven prognostic methods.Also, our review is mainly focused on frequentist methods, and in the last decade, a significant portion of Bayesian works (Lee 2011, Annest et al. 2009, Wang et al. 2013b, Pungpapong 2021, Bonato et al. 2011) have appeared, which make the field even more exciting.The paper pays tribute to the late Sir D.R. Cox, whose work in survival analysis has fundamentally changed the paradigm of biomedical research and will continue to impact future research for years to come.

Figure 3
Figure 3 Example censored quantile regression analysis on n = 153 patients from the BLCSC study.Estimated local quantile measures (Portnoy 2003) from a Cox model are shown in red, and the reference β = 0 is given in black.The figure is a permitted use of Figure 2 in Hong et al. (2019), published in Precision Clinical Medicine.

4. 3
.1.Bagging.Bootstrap aggregation or bagging refers to a means of training an ensemble learner by resampling the data with replacement, training weak learners (e.g., individual survival trees) in parallel, and combining these results over the multiple bootstrapped samples (Breiman 1996).It has three steps.1. Bootstrapping: Resample from the original data of size n with replacement to form a new sample also of size n, and obtain 'B' such samples.2. Parallel Training: With each bootstrap sample, b = 1, . . ., B, independently train the weak learners in parallel.3. Aggregation: Combine the B individual predictions by averaging over them or by taking a majority vote.

Figure 10
Figure 10Architecture for the proposed semi-competing risk deep neural network.

Figure 11
Figure 11Log risk functions of age at diagnosis on each state transition, stratified by sex (solid versus dashed lines) and initial cancer stage (line color).
Fan & Li (2002)aptive Cox LASSO by utilizing P en(β) = j wj|βj|, with smaller weights, wj, assigned to larger coefficients and vice versa.Graphical representation of Cox LASSO with two dimensional predictors.The blue diamond represents the constraint region |β 1 | + |β 2 | ≤ s for a given s.βMP LE and βLASSO represent the maximum partial likelihood and Cox LASSO estimates, respectively, and the red ellipses are contours of the partial likelihood function.As shown, subject to the l 1 constraint, βLASSO is shrunk to zero compared to βMP LE , and Cox LASSO estimates β 1 to be exactly zero.When p > n, it was suggested to use robust estimates such as ridge regression estimates to determine the wj's.Fan & Li (2002)proposed a smoothly clipped absolute deviation (SCAD) penalty, which is a quadratic spline function of |β| with knots at η and αη.Its derivative w.r.t |β|, i.e.,

Table 1
Examples of regularized Cox regression methods and their penalty terms.

Table 2
Specifications of various parametric accelerated failure time models.