Essential Ingredients and Innovations in the Design and Analysis of Group-Randomized Trials

This article reviews the essential ingredients and innovations in the design and analysis of group-randomized trials. The methods literature for these trials has grown steadily since they were introduced to the biomedical research community in the late 1970s, and we summarize those developments. We review, in addition to the group-randomized trial, methods for two closely related designs, the individually randomized group treatment trial and the stepped-wedge group-randomized trial. After describing the essential ingredients for these designs, we review the most important developments in the evolution of their methods using a new bibliometric tool developed at the National Institutes of Health. We then discuss the questions to be considered when selecting from among these designs or selecting the traditional randomized controlled trial. We close with a review of current methods for the analysis of data from these designs, a case study to illustrate each design, and a brief summary.
were to 1 of 3 study a greening intervention, a trash cleanup no The primary outcome was self-reported mental health, assessed using a validated questionnaire administered to residents lot. one per household for a total of 442 community-dwelling adults. Questionnaires were administered in waves during a six-month window before and after the intervention. The authors reported that sample size calculations considered the anticipated ICC and repeated measurements on participants but did not provide details about how this calculation was done. The statistical analyses included 342 residents not lost to follow-up; the authors reported using random-effects regression models that accounted for the nested design. Investigators compared each active intervention to the control. The intervention effects were modeled as the net difference over time between control and one of the active intervention conditions. Twenty-eight results were reported for a summary measure of mental health and for six subscales. Three significant effects were reported. This study example nested same participants and after intervention. The paper would have been strengthened parameter sample size estimation and values later observed in the data by reporting the model used for primary analysis, and either by focusing on the primary outcome analysis or by adjusting for multiple comparisons.


INTRODUCTION
Randomized trials are the gold standard in biomedical research because they provide the strongest evidence for causal inference by minimizing the risk of bias in the assignment of participants to treatments. This article reviews the development of methods for the design and analysis of group-randomized trials (GRTs). It also reviews developments for two closely related designs, the individually randomized group treatment (IRGT) trial and the stepped-wedge group-randomized trial (SW-GRT). Their key features are described in the Essential Ingredients section. The most important reports for the development of these methods are described in the Evolution of Methods for Design and Analysis section. The issues that guide choices among these three designs and the more familiar individually randomized controlled trial (RCT) are identified in the Decision Points section. The analytic methods that are now employed with these designs are described in the Current Analytic Strategies section. Illustrative examples are provided in the Case Studies section. We close with a brief summary.

ESSENTIAL INGREDIENTS

Group-Randomized Trials
A GRT differs from the RCT by randomizing groups rather than individuals to study conditions, with outcomes measured in participants from each group (105). In the traditional GRT considered here, there is no crossover of groups to different study conditions. Also called a cluster-randomized trial (16,29,33,55), the GRT is the best comparative design available if there is a strong rationale for randomization of groups rather than individuals, such as concern for contamination or use of a group-level intervention.
The key feature of the GRT is the randomization of groups to study conditions. Outcomes on participants from the same group are likely to be correlated as a result of shared experience, common exposures, or participant interaction (83). This correlation violates the independence of errors assumption that underlies the analytic methods for RCTs, requiring methods that are appropriate to the nested design (16,29,33,55,105). The correlation is often measured by the intraclass or intracluster correlation (ICC); we use ICC to refer quite generally to this within-group correlation.
Murray et al. (111) described the design characteristics of recent GRTs involving cancer or cancer-related outcomes. Most compared two study conditions. Most employed a pretest-posttest design, though some used a posttest-only design and others included multiple pretest and/or posttest measurements. Most employed a cohort design observing the same participants from each group at each measurement occasion; others employed a cross-sectional design observing different participants from each group at each measurement occasion. Most employed some form of restricted randomization, including stratification, matching, or constrained randomization.
Despite the abundance of materials on the design and analytic methods for GRTs, state-of-the-practice reviews have repeatedly shown that a large proportion of GRTs fail to account for the ICC both in sample size calculations and in the analysis, even though this omission has adverse consequences for power and type 1 errors (e.g., 2,13,25,26,28,32,34,40,43,69,71,75,76,100,110,111,118,123,126,129,140,145).
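The cost of ignoring the ICC can be illustrated with the familiar design effect for equal-sized groups, 1 + (m - 1) x ICC, where m is the number of participants per group. The sketch below is illustrative only: the function names and the example values (10 groups of 100 participants, ICC = 0.02) are assumptions introduced here and do not come from the reviews cited above.

```python
import math


def design_effect(group_size: float, icc: float) -> float:
    """Variance inflation for equal-sized groups: 1 + (m - 1) * ICC."""
    return 1.0 + (group_size - 1.0) * icc


def effective_sample_size(n_groups: int, group_size: int, icc: float) -> float:
    """Number of independent observations that would carry the same information."""
    return n_groups * group_size / design_effect(group_size, icc)


def naive_se_understatement(group_size: float, icc: float) -> float:
    """Factor by which an analysis that ignores the ICC understates the
    standard error of a condition mean."""
    return math.sqrt(design_effect(group_size, icc))


# Assumed example values: 10 groups per condition, 100 participants per group.
deff = design_effect(100, 0.02)
ess = effective_sample_size(10, 100, 0.02)
print(f"design effect = {deff:.2f}, effective sample size = {ess:.0f}")
```

Under these assumed values, even a small ICC of 0.02 nearly triples the variance, so 1,000 participants in a condition carry about as much information as 336 independent observations.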

Individually Randomized Group Treatment Trials
An IRGT differs from the RCT in that ICC is induced in IRGTs by the method of intervention delivery (116). ICC can arise if participants receive some portion of their intervention in a group format (e.g., exercise class), if multiple participants share the same intervention agent (e.g., the same instructor, therapist, or surgeon), or if participants interact in some other way related to the method of intervention delivery (e.g., through a virtual chat room created for participants in the same study condition). The IRGT is the best comparative design available if randomization of individuals is possible, but ICC is nevertheless induced by features of the method of intervention delivery.
The key feature of the IRGT is that the method of intervention delivery leads to correlation among outcomes taken on groups of participants in the same study condition, creating the same type of ICC seen in GRTs. Investigators must account for the ICC in the sample size to avoid low power and in the analysis to avoid type 1 errors (5,12,78,116,124,141). The situation is even more complex in IRGTs if the method of intervention delivery creates multiple overlapping groups or if the group structure changes over time (3,8,125).
The methods literature for IRGTs is much more limited than for GRTs, and the issues inherent in this design are poorly understood. Recent reviews suggest that most investigators who employ the IRGT design are not aware of these issues and do not use appropriate methods for sample size or analysis (6,89,116,117).

Stepped-Wedge Group-Randomized Trials
An SW-GRT is a type of GRT that has seen increasing use over the last decade and is now commonly used to evaluate service delivery interventions (73). However, SW-GRTs are more complex than GRTs or IRGTs and face greater risks for bias.
The key feature of the SW-GRT is the crossover of all groups from the control condition to the intervention condition in a random order and on a staggered schedule. As in other GRTs, the observations from participants from the same group will be correlated; however, the impact of the ICC is reduced in the SW-GRT compared with that in the GRT or IRGT because groups are crossed with rather than nested within study conditions. At the same time, the intervention effect is confounded with calendar time by design (61,73). Moreover, the effect of the intervention may vary depending on how much time has passed since the intervention was introduced (61,72). Finally, the pattern of correlation over time can be complex (80) because SW-GRTs involve repeated measurements on the same groups and sometimes on the same participants.
Copas et al. (19) recently summarized the design characteristics used in SW-GRTs. In a complete SW-GRT, data are collected initially when all groups are in the control condition, again in each group at each crossover point or step, and often again after all groups are in the intervention condition; in an incomplete SW-GRT, data are not collected from all groups at all steps. SW-GRTs vary in the number of groups that cross over in each step, the number of steps, and the time between crossovers. They may employ restricted randomization, as described above, to balance groups in sets to be randomized to the crossover schedule. In the continuous recruitment short exposure design, participants are recruited over time and exposed for a short period; participants may be measured only once or repeatedly. In the closed cohort design, participants are identified at the beginning of the study, participate throughout, and are measured repeatedly. In the open cohort design, many participants are identified at the beginning, but some may leave, whereas other participants are recruited over time; participants may be measured only once or multiple times.
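The complete design described above can be written as a group-by-period matrix of intervention indicators. The following is a minimal sketch, assuming an equal number of groups crossing over at each step, a single baseline period with all groups in the control condition, and simple (unrestricted) randomization of the crossover order; the function name and arguments are hypothetical.

```python
import random


def stepped_wedge_schedule(n_groups, n_steps, seed=None):
    """Return {group: row}, where row[t] is 1 once the group has crossed over.

    There are n_steps + 1 periods: period 0 is an all-control baseline and
    all groups are in the intervention condition in the final period.
    """
    if n_groups % n_steps != 0:
        raise ValueError("this sketch assumes equal numbers of groups per step")
    rng = random.Random(seed)
    order = list(range(n_groups))
    rng.shuffle(order)  # random order of crossover
    per_step = n_groups // n_steps
    n_periods = n_steps + 1
    schedule = {}
    for position, group in enumerate(order):
        first_intervention_period = position // per_step + 1
        schedule[group] = [
            1 if period >= first_intervention_period else 0
            for period in range(n_periods)
        ]
    return schedule


# Assumed example: 6 groups, 3 steps, 2 groups crossing over at each step.
for group, row in sorted(stepped_wedge_schedule(6, 3, seed=1).items()):
    print(group, row)
```

An incomplete design could be represented in the same structure by marking group-periods without data collection, for example with None in place of 0 or 1.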
The methods literature on SW-GRTs has developed rapidly over the last decade. Reviews have noted deficiencies in the reporting of study design, sample size, analytic methods, and ethical conduct (

EVOLUTION OF METHODS FOR DESIGN AND ANALYSIS
A primary objective of this article is to identify the most influential reports in the evolution of methods for the design and analysis of GRTs, IRGTs, and SW-GRTs. Toward that end, PubMed was searched to identify all methods papers related to GRTs, IRGTs, and SW-GRTs (search parameters are included at the end of Supplemental Table 1). The results were augmented by articles, books, chapters, and other reports known to the authors. PubMed was then searched for other papers by any of the authors of the identified reports, yielding a total of 4,514 candidate items. The lead author reviewed each title and excluded items that were not focused on design or analytic methods for GRTs, IRGTs, or SW-GRTs, leaving 926 items; 799 focused on GRTs, 49 on IRGTs, and 74 on SW-GRTs, and 4 addressed all three designs.
The relative citation ratio (RCR) and citation counts were used to assess the influence of each item. The RCR is a metric that uses relative citation rates to measure influence at the article level (74), standardized across fields of study. The RCR was available for 791 items; for the remaining 135 items, the Web of Science, Scopus, and Google Scholar were used to obtain citation counts. GRT items (N = 25) with an RCR ≥ 7.98 (99th percentile for all PubMed entries) or without an RCR but with ≥200 citations¹ were retained. IRGT items (N = 9) and items addressing all three designs (N = 4) with an RCR ≥ 3.45 (95th percentile) or without an RCR but with ≥100 citations were retained. SW-GRT items (N = 13) with an RCR ≥ 4.91 (97.5th percentile) or without an RCR but with >150 citations were retained. This approach provided 50 methods reports related to these designs that were deemed influential; unless otherwise noted,² the items cited in this section comprise the set of 50 influential reports. Supplemental Table 1 presents the full list of 926 items, ranked within design by RCR and citations, and highlights the 50 items identified as influential.
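The retention rule in the preceding paragraph can be expressed compactly. The sketch below encodes the thresholds as stated in the text, including the strict inequality for SW-GRT citation counts; the dictionary, function name, and design labels are hypothetical conveniences, not part of the authors' workflow.

```python
from typing import Optional

# design label -> (minimum RCR, citation threshold, strict inequality for citations?)
THRESHOLDS = {
    "GRT": (7.98, 200, False),
    "IRGT": (3.45, 100, False),
    "ALL": (3.45, 100, False),    # items addressing all three designs
    "SW-GRT": (4.91, 150, True),  # the text states ">150 citations"
}


def is_influential(design: str, rcr: Optional[float], citations: Optional[int]) -> bool:
    """Apply the RCR threshold when an RCR exists; otherwise fall back to citations."""
    rcr_min, cite_min, strict = THRESHOLDS[design]
    if rcr is not None:
        return rcr >= rcr_min
    if citations is None:
        return False
    return citations > cite_min if strict else citations >= cite_min


# Made-up items for illustration only.
print(is_influential("GRT", 8.10, None))    # RCR available and above threshold
print(is_influential("SW-GRT", None, 150))  # no RCR; 150 is not > 150
```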
Although these items were judged to be influential, they represent the history of the development of these methods and not necessarily the current state of the science. We summarize the state of the science for analysis in the Current Analytic Strategies section.

Items Relevant to All Three Designs
In 1978, Cornfield (20) identified the two penalties associated with group randomization: extra variation associated with the group and limited degrees of freedom for the test of the intervention effect. These two penalties must be addressed in the design and analysis of any GRT, IRGT, or SW-GRT.
Turner et al. (137,138) published reviews of the methods for design and analysis of GRTs, IRGTs, and SW-GRTs in 2017.

Group-Randomized Trials
Several influential papers addressed general issues for GRTs. Those included a 1997 commentary drawing attention to the methodological issues inherent in GRTs (11), a widely cited 1997 paper on optimal design (121), a 2003 paper on the use of GRTs for evaluating the effectiveness of change and improvement strategies (31), and a 2013 paper on methods for process evaluation (50).
Others presented methods for sample size calculations. (54) presented methods based on the coefficient of variation.
¹ The citation count thresholds generally discriminated between the items that fell above and below the RCR threshold for each design. The sliding scale reflected the number of items for each design.
² An asterisk indicates that the citation was not among the 50 influential reports.
Murray (105) published the first textbook on the design and analysis of GRTs followed two years later by the text from Donner & Klar (29). Subsequent textbooks were published by Hayes & Moulton (55) and Eldridge & Kerry (33).
In 2004, the first CONSORT statement on GRTs appeared (17), providing a checklist to identify the methodological information to include in trial reports that is deemed essential to interpreting the trial. An update was published in 2012 (18).
Also in 2004, Murray et al. (113) published a review of methodological issues in GRTs, summarizing work on both design and analytic methods.

Individually Randomized Group Treatment Trials
Several influential reports have addressed the risk of type 1 errors in studies in which participants receive their treatment in a group format or from a shared interventionist. These papers have appeared in the biomedical (78,143), psychological (6,21,99), and educational literature (115).
In 2005, Roberts & Roberts (124) addressed the analytic challenges specific to IRGTs. The investigators noted that mixed models that specified the groups used to deliver the intervention as levels of a random effect provided an appropriate analysis.
In 2008, Boutron et al. (12) extended the CONSORT statement to nonpharmacologic interventions, which include many IRGTs. They pointed to the need to provide details on how the intervention is delivered (e.g., individually, in groups, via a common interventionist) and to address the implications for analysis of having nonindependent observations within one or more study conditions.
In 2017, Heo et al. (68) described sample size methods for IRGTs, focusing on partially nested designs, which reflect a common situation in an IRGT in which participants in the intervention condition receive their treatment in a group format while participants in the control condition do not. Also in 2017, Sterba (132) reviewed modeling developments for a variety of IRGT designs.
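To make the partially nested situation concrete, the sketch below computes an approximate standard error and power for a difference in means when only the intervention condition is delivered in groups. It is a simplified illustration, not the method of Heo et al.: it assumes a common total variance in both conditions, equal-sized intervention groups, and a normal approximation that ignores the limited degrees of freedom. All function and parameter names are hypothetical.

```python
from statistics import NormalDist


def partially_nested_se(sd, n_control, n_groups, group_size, icc):
    """SE of (intervention mean - control mean) when only the intervention
    condition is clustered into n_groups groups of group_size participants."""
    var_control = sd ** 2 / n_control
    var_intervention = (
        sd ** 2 * (1 + (group_size - 1) * icc) / (n_groups * group_size)
    )
    return (var_control + var_intervention) ** 0.5


def approx_power(delta, se, alpha=0.05):
    """Two-sided power under a normal approximation (optimistic when the
    number of groups is small)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(delta) / se - z_crit)


# Assumed example: 100 controls; 10 intervention groups of 10; ICC of 0 vs 0.05.
for icc in (0.0, 0.05):
    se = partially_nested_se(1.0, 100, 10, 10, icc)
    print(f"ICC = {icc}: SE = {se:.4f}, power = {approx_power(0.4, se):.3f}")
```

Because the variance inflation applies to one condition only, unequal allocation with more participants in the clustered condition can be worth considering at the design stage.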

Stepped-Wedge Group-Randomized Trials
Hussey & Hughes (73) first described methods for the SW-GRT design in 2007. Copas et al. (19) later identified three main types of SW-GRTs and discussed design choices with regard to the number and length of the steps, incomplete and complete designs, and randomization methods, including restricted randomization methods. Hemming et al. (61) outlined the rationale, design, analysis, and reporting of SW-GRTs. They noted that the SW-GRT is particularly well suited for evaluations of health service delivery interventions.
Several influential papers have addressed sample size methods for SW-GRTs. Woertman et al. (144) described an approach based on a design effect and noted that the sample size will depend on group size, the ICC, the number of steps, the number of baseline measurements, and the number of measurements between steps; a subsequent letter (60)* corrected an important error in this paper, which was accepted by the authors (23)*. Baio et al. (4) presented simulation methods for sample size estimation. Hemming et al. (63) considered power for several different SW-GRT designs, including incomplete cross-sectional SW-GRTs, SW-GRTs with multiple levels of nesting, and complete SW-GRTs. Girling & Hemming (45) proposed an algorithm to optimize the design of SW-GRTs and showed that for large studies the best design may be a hybrid of GRT and SW-GRT design components. Hemming & Taljaard (64) compared power for GRTs and SW-GRTs when the number of groups is fixed, and they reported that the GRT tends to be more efficient when the ICC is small and that the SW-GRT tends to be more efficient when the ICC is large, dependent on group size. Scott et al. (128) addressed the use of small-sample corrected generalized estimating equations (GEE) for SW-GRTs. They demonstrated the viability of a marginal model in studies having at least 10 groups.
www.annualreviews.org • The Evolution of Group-Randomized Trials
Several state-of-the-practice reviews of SW-GRTs have been published (10,14,22,101), reporting wide variation in data analytic methods and reporting standards. The recent CONSORT statement for SW-GRTs (67) provides reporting guidelines.

More General Material
Though not focused specifically on GRTs, IRGTs, or SW-GRTs, and so not included in the set of 50 influential reports, several well-known books have contributed substantially to the development of design and analytic methods for these designs. In 1953, Lindquist (94) published the first design and analytic text that addressed nested designs. In 1987, Goldstein (46) published the first text on multilevel models for educational and social research, including methods that have come to be used widely in GRTs and IRGTs; subsequent editions have elaborated on those methods (47,48,49). In 1992, Bryk & Raudenbush (15) published their text on hierarchical linear models, with a second edition in 2002 (122).

DECISION POINTS
Numerous issues guide the choice among the four designs under consideration here. Because the RCT and particularly the double-blinded RCT provide the strongest evidence for causal inference among the four designs, it should be adopted if possible (16,29,33,55,105). That said, an RCT is not always viable owing to the nature of the intervention or to factors beyond the investigator's control (e.g., 102,130), and so it is important to understand the alternatives and the decision points to be used in choosing one design over another.
In choosing a design, the investigator should carefully consider the research question, the context of the research, and the nature of the intervention. Figure 1 presents three key questions to guide the selection process.
The first question applies to all situations: Is there a strong rationale for randomizing groups rather than individuals to study conditions? Recognizing that the RCT provides the strongest evidence, the Ottawa Statement on the ethical design and conduct of GRTs calls for investigators to provide a clear rationale for choosing randomization of groups rather than individuals (134). Acceptable reasons include the evaluation of a group-level intervention or the group effects of an individual-level intervention; concern about substantial contamination if multiple study conditions are implemented in the same group; the need to reduce costs, enhance compliance, or secure cooperation of investigators; and administrative convenience.
Contamination is often the primary reason for choosing group randomization. Contamination occurs when participants who are not selected to receive the intervention receive at least some part of it. If the risk of contamination is low, individual randomization is preferred, as an RCT or IRGT will be more efficient than a GRT or SW-GRT. If the risk of contamination is not low, individual randomization is unwise and group randomization is preferred (18,105,134,136). To illustrate, consider a school-based behavioral intervention to prevent cigarette smoking among adolescents. If individual students within a school are randomized to intervention and control, control students are likely to be exposed to components of the intervention, either by teachers or by other students. In this situation, individual randomization is unwise, as risk for contamination is not low; group randomization of schools is preferred so that all participants within the same school are in the same study condition, minimizing the risk of contamination.

Figure 1
Three questions (blue) to guide the selection of the appropriate study design. Footnotes: (a) If the intervention is delivered through a physical group or a virtual group, or through shared interventionists who each work with multiple participants, positive ICC can develop over the course of the trial. (b) There may be logistical reasons to randomize groups, or it may not be possible to deliver the intervention to individuals without substantial risk of contamination. (c) There may be legitimate political or logistical reasons to roll out the intervention to all groups before the end of the trial. Note: Quite generally, an RCT design is preferred to an IRGT design because it is both more efficient and less subject to threats to internal validity. A parallel-arm GRT design is preferred to an SW-GRT design because the SW-GRT is subject to additional threats to internal validity. The article text provides more details. Abbreviations: GRT, group-randomized trial; ICC, intraclass or intracluster correlation; IRGT, individually randomized group treatment; RCT, randomized controlled trial; SW-GRT, stepped-wedge group-randomized trial.

If there is no strong rationale for group randomization, the next question is, Do participants receive their treatment in a group format or from a shared interventionist? Delivery of the intervention in a group format, or through intervention agents who each interact with multiple participants, usually leads to correlated outcomes because it creates the opportunity for shared experience, common exposures, or participant interaction. In the case of a blinded placebo-controlled RCT in a large population (e.g., blood pressure-lowering medication versus placebo), there is no reason to expect that correlation of outcomes might develop, and in this case, the RCT is preferred. In contrast, when the intervention is a group-based treatment [e.g., a group-therapy session for cardiac rehabilitation (131)] or involves other interactions among intervention participants (e.g., through an online community or through a shared interventionist), then the observations taken from participants who have such shared experiences, common exposures, or interactions are expected to become correlated. The trial is an IRGT trial, and the magnitude of the correlation will depend on the frequency and intensity of those experiences. This correlation must be accounted for to avoid a type 1 error (116,124). Correlation of outcomes may arise in one study condition or in multiple study conditions.
If there is a strong rationale for randomization of groups, the next question is, Is there a strong rationale for rolling out the intervention to all groups before the end of the trial? There may be political or logistical reasons for an affirmative answer. For example, a health care system composed of multiple clinics may decide that a change should be made in the way that a particular health service is delivered. The health system may refuse to consider a GRT in which only half of the clinics receive the intervention. If the rollout of the intervention to the clinics can be spread over time, and if the health system agrees that the timing of the rollout can be allocated randomly, an SW-GRT design can provide stronger evidence for causal inference compared with a nonrandomized design with a staggered rollout (19,112). Nevertheless, given that the SW-GRT confounds time and study condition by design, and usually requires more time and more measurements, the GRT design should be used if possible (61,73,112).
Consideration of the three questions in Figure 1 will allow the investigator to select the appropriate design for the research question, given the context of the study and the nature of the intervention. It is advisable to assemble a team of both substantive and methodological experts to consider the issues raised here when choosing among the four general designs. Once that choice is made, many more decisions will be required on details such as the measurement schedule, how participants will be recruited, whether the same participants will be followed over time, the risk of selection bias due to lack of blinding, etc. The reader is referred to excellent texts for a more comprehensive description of these issues (16,29,33,55,103,105).

CURRENT ANALYTIC STRATEGIES

Group-Randomized Trials
Three main analytical approaches can account for the ICC in GRTs: two-stage analysis, mixed-effects regression, and GEE. In a two-stage analysis, the multiple observations from each group are reduced in a first stage to a single score, such as a proportion or mean, and those group scores are compared between study conditions in a second stage using standard methods such as a two-sample t-test, Wilcoxon rank-sum test, or permutation test (44,56,108,119,147). The two-stage analysis is a good choice when the number of groups is small because it yields robust inferences as long as the group sizes are neither very small nor highly variable. A disadvantage is that it is more difficult to adjust for individual-level covariates (16,56). The more common approach is to analyze individual-level data in one stage using a regression model that accounts for the ICC (111). In mixed-effects regression, the groups are modeled as random effects; in GEE, there are no random effects, but the correlation structure is modeled directly. Mixed-effects regression and GEE can more easily adjust for individual-level and group-level covariates and for heterogeneity in group size.
GEE coefficients describe changes in the population average effect given changes in the covariates, including the intervention. Inferences for the treatment effect can be made by relying on the sandwich variance estimator, which is robust to misspecification of the correlation structure; however, this approach requires a sufficient number of groups, preferably at least 50 (39,107), and many GRTs randomize fewer than 50 groups (111). Methods to compensate for small numbers of groups have been developed (37,42,81,93,97,104) but do not always perform well and are not universally available across statistical software packages. Li & Redden (93) found that their performance depends not only on the number of groups, but also on the variation in group sizes. Ford & Westgate (42) proposed methods that outperform previous methods, but they call for further study to provide a valid bias-corrected estimator that can be applied in any situation.
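The two-stage analysis can be sketched in a few lines. The example below reduces each group to its mean in the first stage and, in the second stage, applies an exact permutation test to the difference between condition means; the permutation test is one of the second-stage options named above, and the function name and data are hypothetical.

```python
from itertools import combinations
from statistics import mean


def two_stage_permutation_test(intervention_groups, control_groups):
    """Each argument is a list of groups; each group is a list of outcomes.

    Returns (observed difference in mean group means, two-sided exact p-value).
    """
    # Stage 1: reduce each group to a single summary score (here, the mean).
    means_i = [mean(group) for group in intervention_groups]
    means_c = [mean(group) for group in control_groups]
    observed = mean(means_i) - mean(means_c)

    # Stage 2: enumerate every way of relabeling groups to study conditions.
    pooled = means_i + means_c
    k = len(means_i)
    total = sum(pooled)
    n_as_extreme = 0
    n_permutations = 0
    for indices in combinations(range(len(pooled)), k):
        subtotal = sum(pooled[i] for i in indices)
        diff = subtotal / k - (total - subtotal) / (len(pooled) - k)
        n_permutations += 1
        if abs(diff) >= abs(observed) - 1e-12:  # tolerance for float rounding
            n_as_extreme += 1
    return observed, n_as_extreme / n_permutations


# Made-up data: 3 groups per condition, 2 participants per group.
intervention = [[5, 6], [6, 7], [7, 8]]
control = [[1, 2], [2, 3], [3, 4]]
print(two_stage_permutation_test(intervention, control))
```

With three groups per condition there are only 20 possible allocations, so the smallest attainable two-sided p-value is 0.10; this makes the limited degrees of freedom in small GRTs tangible.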
Mixed-effects coefficients describe changes in the individual given changes in the covariates, conditional on the random effects. As such, they are group-specific coefficients (commonly referred to as subject-specific coefficients in the longitudinal analysis literature); they are identical to the population-average coefficients from GEE in a linear or log model but different in a logistic model. Mixed-effects regression accommodates a smaller number of groups than GEE, though the two-stage approach is preferred for small studies. In addition, mixed-effects regression can easily be extended to account for multiple levels in the data hierarchy. A challenge is specifying the correct model that accounts for all relevant fixed and random effects (105); another is choosing an appropriate method for calculating the degrees of freedom for testing the intervention effect, especially when the number of groups is small (82,127). Li & Redden (92) studied the performance of a variety of methods in the case of binary outcomes and recommended the between-within method, which was robust to variation in group size and provided correct type 1 error rates even when the total number of groups was as low as ten.
For binary outcomes, there is a choice of analyzing treatment effects using absolute risk, relative risk, or odds ratio (139). Reporting of both absolute and relative effect estimates is recommended (18), but models with binomial distributions and log or identity link functions do not always converge. A log-link with a Poisson distribution and robust standard errors may be a suitable alternative (148).
Obtaining reliable ICC estimates for binary outcomes can be difficult. Mixed-effects logistic regression does not yield ICC estimates on the proportions scale (which is the scale required for most sample size formulae), and there is no single formula for converting from the logit to the proportions scale (35). GEE has the advantage of producing ICC estimates on the proportions scale directly. Wu et al. (146) compared several methods of estimating the ICC for binary outcomes and found that estimates were quite different and often negatively biased. They recommended that sample size and analytic methods should allow the ICC to vary by study condition, rather than assume a constant ICC, as is commonly done.
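As a point of reference for what an ICC on the proportions scale looks like, the sketch below implements the one-way ANOVA estimator applied to 0/1 outcomes. This estimator is swapped in here because it needs no model fitting; it is not the GEE approach mentioned above, and the text does not single it out. The function name and the input layout, a list of (events, size) pairs, are assumptions.

```python
def anova_icc_binary(groups):
    """One-way ANOVA estimator of the ICC for a binary outcome.

    groups: list of (events, size) pairs, one pair per group.
    """
    k = len(groups)
    n_total = sum(size for _, size in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more participants than groups")

    p_overall = sum(events for events, _ in groups) / n_total
    # Between-group and within-group sums of squares for 0/1 data.
    ss_between = sum(
        size * (events / size - p_overall) ** 2 for events, size in groups
    )
    ss_within = sum(events * (1 - events / size) for events, size in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Adjusted average group size, which allows for unequal group sizes.
    m0 = (n_total - sum(size ** 2 for _, size in groups) / n_total) / (k - 1)

    denominator = ms_between + (m0 - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variation in the outcome")
    return (ms_between - ms_within) / denominator


# Made-up data: events and group sizes for five groups in one study condition.
print(anova_icc_binary([(12, 40), (20, 45), (9, 38), (25, 50), (15, 42)]))
```

Applying the function separately to the groups in each study condition is one simple way to check whether a constant ICC across conditions is plausible. The estimate can be negative when groups are more alike than chance would predict.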
Hooper et al. (70) provided recommendations for analyzing GRTs that provide a baseline measurement of the outcome. Comparing study conditions in terms of changes from baseline (known as difference in difference) is popular but can be inefficient and misleading. It is more efficient to treat baseline as a covariate using an analysis of covariance (ANCOVA) (84,105). With more than two time points in the model, we recommend random coefficients models (109). As an alternative to specifying fixed effects for study condition, time, and their interaction, omitting the study condition main effect can increase precision; the model should include random effects for group and the interaction between group and time. When using ANCOVA for a cohort design, including both individual-level and group-level versions of the baseline measurement can increase power (84).

Individually Randomized Group Treatment Trials
IRGTs are typically analyzed using mixed-effects regression methods to allow for different correlation structures in the study conditions (5,9,90,116,124). The early work focused on the simplest IRGT design in which it is important to allow for ICC in the intervention condition but not in the control condition. More recent work has addressed both changes in the group structure over time and participants who receive their intervention in more than one small group at the same time. Building on the early work by Goldstein (46,47), methodologists have developed both cross-classified (95,96,125,132) and multiple-membership mixed-effects models (125). Bauer et al. (8) proposed dynamic-group models that reflect changes in the structure or function of groups over time as an alternative to Goldstein's multiple-membership model.

Stepped-Wedge Group-Randomized Trials
SW-GRTs are more complicated to analyze and have increased risks of bias over parallel GRTs. By design, the intervention effect is confounded with time and so the analysis must adjust for time as a covariate. The most popular method of analysis for the cross-sectional SW-GRT is mixed-effects regression with a fixed time-varying indicator for study condition and a fixed categorical effect for time as well as a random intercept for group to account for the ICC (73). However, this model assumes a constant correlation between any two individuals in the same group, no matter how far apart in time they are measured. A more realistic correlation structure would allow the strength of the correlation to decay with increasing time separation. Kasza & Forbes (79) showed that omitting a correlation decay in an SW-GRT with continuous outcomes when such decay exists will underestimate standard errors and increase the type 1 error rate. Kasza et al. (80) proposed a model that allows the ICC to decay exponentially across discrete periods of time and provided SAS code. Although this model does not seem to be possible in other major statistical software packages, and computational issues can arise, it is currently recommended for analyzing data from a cross-sectional SW-GRT. It can also be used for cohort designs by including an additional random intercept for the individual. Grantham et al. (51) examined even more flexible models that allow correlations to decay as a continuous function of time. Extensions allow treatment effects to vary across groups (65,66,72).
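For the simple random-intercept model with categorical time effects, the variance of the estimated intervention effect has a closed form that is commonly attributed to Hussey & Hughes (73). The sketch below implements that expression under stated assumptions: a cross-sectional design, the same number of participants in every group-period, known variance components, and a constant within-group correlation with no decay, so it does not cover the decaying-correlation models discussed above. Function and argument names are hypothetical.

```python
from statistics import NormalDist


def sw_effect_variance(design, sigma2_e, tau2, n_per_cell):
    """Variance of the intervention effect estimate.

    design: list of rows (one per group) of 0/1 intervention indicators by period.
    sigma2_e: individual-level residual variance.
    tau2: between-group (random intercept) variance.
    n_per_cell: participants measured per group in each period.
    """
    n_groups = len(design)
    n_periods = len(design[0])
    sigma2 = sigma2_e / n_per_cell  # variance of a group-period mean
    u = sum(sum(row) for row in design)
    w = sum(sum(row[t] for row in design) ** 2 for t in range(n_periods))
    v = sum(sum(row) ** 2 for row in design)
    numerator = n_groups * sigma2 * (sigma2 + n_periods * tau2)
    denominator = (n_groups * u - w) * sigma2 + (
        u ** 2 + n_groups * n_periods * u - n_periods * w - n_groups * v
    ) * tau2
    return numerator / denominator


def sw_power(effect, variance, alpha=0.05):
    """Two-sided power under a normal approximation."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(effect) / variance ** 0.5 - z_crit)


# Assumed example: 3 groups, 4 periods, one group crossing over at each step.
design = [[0, 1, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
var = sw_effect_variance(design, sigma2_e=1.0, tau2=0.05, n_per_cell=20)
print(f"variance = {var:.4f}, power = {sw_power(0.5, var):.3f}")
```

When tau2 is zero the expression reduces to the ordinary least-squares variance with period fixed effects, which is a convenient check on any implementation.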
Alternative SW-GRT approaches exist that model fixed time effects using parametric or semiparametric curves and allow for time-varying intervention effects (41,65,72,114); however, their properties have not been extensively examined. Scott et al. (128) examined GEE with small sample corrections for cohort SW-GRTs. Additional work is required to examine SW-GRT methods for binary and time-to-event outcomes and with a small number of groups; methods for estimating ICCs and testing goodness of fit are also needed. Reporting guidelines for SW-GRTs have appeared only recently (67).
Nonparametric methods may be more robust to incorrect specification of the correlation structures and when there are few groups. Ji et al. (77) and Wang & De Gruttola (142) proposed using mixed-effects regression to obtain an estimate for the treatment effect followed by permutation tests to obtain confidence intervals and p-values. Thompson et al. (135) avoided mixed-effects regression altogether: The intervention effect was estimated using group-level data from the intervention and control conditions in vertical slices of time, and time-specific estimates were then combined as an inverse-variance weighted average; the allocation of groups to different intervention times was then randomly permuted to obtain p-values and confidence intervals.
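A minimal sketch of a within-period ("vertical") permutation analysis in the spirit of the approach just described is given below. It is not the authors' implementation. The function names and data layout are hypothetical, and the weights use the simplifying assumption that group-level variances are equal across periods and conditions, so that inverse-variance weighting reduces to n1 * n0 / (n1 + n0) for n1 intervention and n0 control groups in a period.

```python
import random
from statistics import mean


def vertical_estimate(means, start_period):
    """Weighted average of within-period intervention-minus-control differences.

    means[g][t]     : group-level outcome mean for group g in period t
    start_period[g] : first period (0-based) in which group g is under intervention
    Periods in which all groups share one condition contribute nothing.
    """
    numerator = denominator = 0.0
    for t in range(len(means[0])):
        treated = [m[t] for m, s in zip(means, start_period) if t >= s]
        control = [m[t] for m, s in zip(means, start_period) if t < s]
        if not treated or not control:
            continue
        # Proportional to the inverse variance under equal group-level variances.
        weight = len(treated) * len(control) / (len(treated) + len(control))
        numerator += weight * (mean(treated) - mean(control))
        denominator += weight
    if denominator == 0.0:
        raise ValueError("no period contains both conditions")
    return numerator / denominator


def permutation_p_value(means, start_period, n_perm=2000, seed=0):
    """Two-sided p-value from re-randomizing groups to crossover times."""
    rng = random.Random(seed)
    observed = vertical_estimate(means, start_period)
    shuffled = list(start_period)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(vertical_estimate(means, shuffled)) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


# Noise-free toy data: secular trend of 1 per period plus an effect of 2.0.
starts = [1, 1, 2, 2, 3, 3]
toy = [[t + (2.0 if t >= s else 0.0) for t in range(4)] for s in starts]
estimate, p_value = permutation_p_value(toy, starts)
```

Because each comparison is made within a single period, the secular trend cancels without being modeled. A confidence interval would require inverting the permutation test over a grid of hypothesized effects, which this sketch omits.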

Sample Size Estimation
Numerous resources exist to guide sample size estimation for GRTs, IRGTs, and SW-GRTs. The NIH recently provided an online sample size calculator for GRTs using continuous or binary outcomes, which includes extensive instructional material (https://researchmethodsresources.nih.gov). Kreidler et al. (85) provided a Web-based program for power computation for linear models, including those applicable to GRTs and IRGTs. Hemming et al. (62) introduced an R-Shiny application that supports sample size calculations for GRTs, IRGTs, and SW-GRTs and permits sensitivity analyses under simple and complex correlation structures; further work is required to verify that these methods work well in the case of noncontinuous outcomes and in the case of a small number of groups. Li et al. (91) introduced alternative sample size methods based on GEE for continuous and binary outcomes. Moerbeek & Teerenstra's (103) 2016 text on power analysis of trials with multilevel data addresses sample size issues for GRTs, IRGTs, and SW-GRTs and includes many ICC estimates.
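To illustrate why the ICC matters for sample size, the following sketch applies the familiar design effect, 1 + (m - 1) × ICC, to a two-condition parallel GRT with a continuous outcome and m members per group. It is a rough approximation under stated assumptions (equal group sizes, a standardized effect size, and normal rather than t critical values) and is not a substitute for the tools cited above; the function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def design_effect(members_per_group, icc):
    """Variance inflation from clustering with equal group sizes."""
    return 1.0 + (members_per_group - 1) * icc


def groups_per_condition(effect_size, members_per_group, icc,
                         alpha=0.05, power=0.80):
    """Approximate groups needed per condition in a two-condition GRT.

    effect_size is the difference in means divided by the outcome SD.
    Uses normal critical values, so it is optimistic when the number
    of groups is small.
    """
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Participants per condition if observations were independent.
    n_independent = 2 * z_sum ** 2 / effect_size ** 2
    # Inflate for clustering, then convert participants to whole groups.
    n_clustered = n_independent * design_effect(members_per_group, icc)
    return ceil(n_clustered / members_per_group)


# Illustrative inputs: 0.25 SD effect, 20 members per group.
print(groups_per_condition(0.25, 20, icc=0.00))
print(groups_per_condition(0.25, 20, icc=0.05))
```

With these illustrative inputs, an ICC of only 0.05 roughly doubles the number of groups required relative to independent observations. With few groups per condition, critical values from a t distribution with degrees of freedom based on the number of groups would raise the requirement further.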

Effect of Greening Vacant Land on Mental Health of Community-Dwelling Adults
This study (130) was a GRT conducted in Philadelphia to evaluate whether interventions to green vacant urban land can improve self-reported mental health. A total of 110 noncontiguous geographic areas containing 541 vacant lots were randomly allocated to 1 of 3 study conditions: a greening intervention, a trash cleanup intervention, or no intervention. The primary outcome was self-reported mental health, assessed using a validated questionnaire administered to residents selected at random from the area surrounding each lot. Only one participant per household was selected, for a total of 442 community-dwelling adults. Questionnaires were administered in waves during a six-month window before and after the intervention. The authors reported that sample size calculations considered the anticipated ICC and repeated measurements on participants but did not provide details about how this calculation was done. The statistical analyses included 342 residents not lost to follow-up; the authors reported using random-effects regression models that accounted for the nested design. Investigators compared each active intervention to the control. The intervention effects were modeled as the net difference over time between the control and one of the active intervention conditions. Twenty-eight results were reported for a summary measure of mental health and for six subscales. Three significant effects were reported. This study is an example of a nested cohort design (105), with the same participants measured before and after the intervention. The paper would have been strengthened by reporting the parameter estimates used in the sample size estimation and the values later observed in the data (18), by reporting the model used for the primary analysis, and either by focusing on the primary outcome analysis or by adjusting for multiple comparisons.

The ACTonHEART Study
This study (131) was proposed as a prospective IRGT to evaluate the efficacy of a brief group-administered intervention integrating education on heart-healthy behaviors with acceptance and mindfulness skills to modify cardiovascular risk factors and psychological well-being among cardiac patients. Approximately 168 patients were to be recruited and randomized individually to usual care or intervention. Block randomization was to be used to balance gender and baseline cardiovascular risk. The primary outcome variables were proposed as mean changes for low-density lipoprotein (LDL) cholesterol, resting systolic blood pressure, body mass index, and psychological well-being, assessed at baseline, 6 weeks, and 6 and 12 months.
This trial was to have two conditions, with nesting of participants within small groups in only one condition. Although patients were randomized individually, the intervention participants received the ACTonHEART intervention in six 90-min group-therapy sessions over 6 weeks. The control group received usual care. The authors reported that the power calculation considered the anticipated ICC. Intention-to-treat analyses were proposed to examine outcome measures, and the ICC was to be considered in the multilevel analysis. The authors also reported that two-tailed tests would be used with an alpha of 0.01 (0.05/4) to account for the four primary outcomes. This study is an example of an IRGT with nesting of participants in small groups in one condition. The paper would have been strengthened by reporting details adequate to replicate the sample size calculation, including the number of therapy groups, the value of the ICC used, and the model proposed for the primary analysis.

Effects of a Free School Breakfast Program on Children's Attendance, Academic Achievement, and Short-Term Hunger
This study (102) was a 1-year SW-GRT in 14 New Zealand primary schools in low-socioeconomic resource areas. The study's purpose was to evaluate the effect of introducing a free, daily school breakfast program on the primary outcome of optimal school attendance, defined as attendance of 95% or higher. Eligible schools were randomly assigned to 1 of 4 sequences (3-4 schools per sequence). Schools randomly allocated to the first sequence implemented the intervention condition immediately upon study entry; thereafter, additional schools crossed over to the intervention condition each term, according to their random assignment. By the final term, all schools were in the intervention condition.
The study was designed to have 85% power to detect a 10% absolute change in the proportion of students with optimal attendance. Sample size methods considered the SW-GRT design and accounted for an anticipated ICC within schools of 0.05. The analysis was performed according to the intention-to-treat principle. The analysis used generalized linear mixed models for categorical outcomes and general linear mixed models for continuous outcomes. The analysis accounted for time as well as for the ICC within schools. Repeated measurements on the same child within the same school over time were also considered in the multilevel analysis. The effect on the primary outcome was not statistically significant, nor were the effects on any of the secondary outcomes but one.
This study is an example of a closed cohort SW-GRT: All participants were identified at the beginning of the trial and participated until the end of the study, with repeated measurements on the same participants (19). The paper would have been strengthened by reporting full details for the sample size calculation to allow replication and by addressing the issue of decay of the over-time correlation, though that was not standard practice when this study was designed.

SUMMARY
GRTs randomize groups to study conditions and measure outcomes in participants from those groups to evaluate the effect of an intervention. The GRT is the best comparative design available if the intervention cannot be delivered separately to individuals, when there is a substantial risk of contamination, or when there are other strong reasons for preferring group randomization. The presence of positive ICC violates the independence of errors assumption that underlies the analytic methods for RCTs, so different methods are required to protect the type 1 error rate (16,29,33,55,105).
IRGTs randomize individuals to study conditions; however, the intervention is delivered in a group setting or by shared interventionists, inducing correlation similar to that in a GRT. The IRGT is the best comparative design available if randomization of individuals is possible but the method of intervention delivery is expected to create positive ICC. Special analytic methods are required to account for the correlation to protect the type 1 error rate (5,12,78,116,124,141).
SW-GRTs randomize the order in which groups cross over from the control condition to the intervention condition and provide both control and intervention observations for each group. The SW-GRT is helpful if a GRT is not possible, though the GRT is preferred because it faces fewer threats to internal validity. Because groups are crossed with conditions, the impact of the positive ICC expected within each group is diminished; thus SW-GRTs often provide more power or require fewer groups than a GRT, particularly if the ICC is large.
All three designs have a more complex correlation structure than does the traditional RCT and require sample size estimation and data analytic methods that reflect this structure. They are analyzed most often by mixed-effects regression models, but they can also be analyzed using GEE, two-stage methods, and randomization-based methods. Model-based inference depends on the correct specification of the correlation structure, whereas GEE-based inference depends on the correct specification of the data structure (24). Degrees of freedom are based on the number of groups rather than the number of participants for both GRTs and SW-GRTs; for IRGTs, degrees of freedom are based on the number of groups for any conditions that use intervention delivery methods that induce within-group correlation and on the number of participants for any conditions that do not.
The RCR (74) and citation counts were used to identify 50 influential methods reports for these designs, and we have highlighted the contributions of those reports. We offer three questions to guide the choice from among these three designs and the traditional RCT. Finally, we have summarized current analytic strategies for the three designs and presented case studies as examples.

DISCLOSURE STATEMENT
The authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.