
# Gatekeeping Strategies for Clinical Trials


Gatekeeping Strategies for Clinical Trials That Do Not Require All Primary Effects to Be Significant

Alexei Dmitrienko (Eli Lilly and Company)

Walter W. Offen (Eli Lilly and Company)

Peter H. Westfall (Texas Tech University)

# Summary

- In this paper we describe methods for addressing multiplicity issues arising in the analysis of clinical trials with multiple endpoints and/or multiple dose levels.
- Efficient “gatekeeping strategies” for multiplicity problems of this kind are developed. One family of hypotheses (comprising the primary objectives) is treated as a “gatekeeper,” and the other family or families (comprising secondary and tertiary objectives) are tested only if one or more gatekeeper hypotheses have been rejected. We discuss methods for constructing gatekeeping testing procedures using weighted Bonferroni tests, weighted Simes tests, and weighted resampling-based tests, all within a closed testing framework.
- The new strategies are illustrated using an example from a clinical trial with co-primary endpoints and an example from a dose-finding study with multiple endpoints. Power comparisons with competing methods show that the gatekeeping methods are more powerful when the primary objective of the trial must be met.

## 1 Introduction

Hypotheses tested in clinical trials are commonly divided into primary and secondary. The primary hypothesis is related to the primary trial endpoint, which describes the most important features of the disease under study. O’Neill [1] defines the primary endpoint as “a clinical endpoint that provides evidence sufficient to fully characterize clinically the effect of a treatment in a manner that would support a regulatory claim for the treatment.” In many cases, the primary hypothesis test determines the overall conclusion from the trial. Secondary hypotheses also play an important role in characterizing the effects of the study drug. However, a significant improvement in a secondary endpoint in isolation is not generally considered as substantial evidence of therapeutic benefit.

The interpretation of a positive finding with respect to a secondary outcome variable depends heavily on its clinical importance. The following general types of secondary outcome variables are based on D’Agostino [2]:

Type I. Separate components of the primary trial objective. These may be very important secondary endpoints that are difficult to incorporate into the formal power calculation because the expected improvement is relatively small. All-cause mortality plays this role in a large number of trials, including cardiovascular and critical care studies.

Type II. Endpoints that help interpret the primary findings. These endpoints are very helpful for understanding the big picture, e.g., understanding the benefits with respect to various aspects of the disease. The effect of osteoarthritis drugs is typically measured using pain and physical function indices; however, there are other important outcome variables such as patient global assessment and quality-of-life measures.

A similar classification of secondary endpoints is proposed in the guideline entitled “Points to consider on multiplicity issues in clinical trials” published by the Committee for Proprietary Medicinal Products (CPMP) [3].

With this classification in mind, it is critically important to prospectively define not only the hypotheses of interest but also a decision rule that will be used to guide the decision-making process at study completion (Chi [4]). The decision rule can have a flexible hierarchical structure with several data-driven clinical decision paths. The clinical statistician “translates” this decision rule into a statistical procedure that incorporates the interrelationships among the primary and secondary trial hypotheses.

Multiplicity problems arising in the context of primary and secondary trial hypotheses can be effectively dealt with using hierarchical testing procedures, also known as “gatekeeping procedures.” Consider two families of outcome variables, e.g., primary variables that can lead to new regulatory claims (gatekeeper) and secondary variables that may become the basis for additional claims. The gatekeeper family is tested without an adjustment for the other family, and the second family is examined only if the gatekeeper has been successfully passed; see Bauer et al. [5], Westfall and Krishen [6], and Gong, Pinheiro and DeMets [7].

Gatekeeping procedures proposed in the literature have been designed for the case when the gatekeeper family is passed only if all of the hypotheses in the family have been rejected, which we refer to as “serial” gatekeeping procedures. Serial procedures can be too restrictive; for example, the requirement to reject all primary trial hypotheses before performing the secondary analyses may be inappropriate when the co-primary endpoints can lead to separate regulatory claims. Similarly, the analysis of individual doses of an experimental drug in a dose-finding setting can be enhanced by using a hierarchical testing strategy, e.g., one examines the higher doses first and studies the lower doses if at least one of the higher doses (but not necessarily both) has shown a significant difference from the control.

Thus, while it is well known that one can proceed in serial fashion, it is apparently not known that gatekeeping tests also can be performed in “parallel” fashion, where one may proceed to the secondary family when at least one of the primary tests exhibits significance. The main contribution of this paper is the development of such parallel gatekeeping procedures.

An alternative to gatekeeping strategies is the prospective alpha allocation scheme (PAAS) proposed by Moyé [8], in which the primary and secondary endpoints are tested simultaneously at levels less than the common 0.05 threshold, ensuring that the total threshold is still 0.05. This approach has been designed for the cases when the secondary endpoints may potentially provide the basis for a new regulatory claim. If the primary effects are truly null, the PAAS strategy is more powerful than a gatekeeping strategy. On the other hand, when some of the primary hypotheses are false, the power gains of gatekeeping strategies can be substantial, as we demonstrate in Section 6.

In this paper, we develop parallel gatekeeping methods using the powerful closed testing principle of Marcus et al. [9]. Section 2 introduces closed gatekeeping procedures based on the weighted version of the Bonferroni test. Section 3 discusses extensions based on the weighted Simes test and parametric resampling. The described gatekeeping strategies are illustrated in Sections 4 and 5 using two clinical trial examples, one with two co-primary endpoints and the other a dose-finding study. Section 6 compares the power of the various testing procedures.

## 2 Bonferroni gatekeeping procedures

Consider a family of null hypotheses $H_1,\dots,H_m$ and assume that the hypotheses are grouped into two families, $F_1 = \{H_1,\dots,H_k\}$ and $F_2 = \{H_{k+1},\dots,H_m\}$. The first family serves as a gatekeeper in the sense that $F_1$ is tested without an adjustment for $F_2$ and $H_{k+1},\dots,H_m$ are examined only if the gatekeeper has been successfully passed. It will be assumed throughout the paper that $F_1$ is a parallel gatekeeper, i.e., $H_{k+1},\dots,H_m$ can be tested if at least one hypothesis in $F_1$ has been rejected. In the clinical trial context, $F_1$ and $F_2$ can represent sets of the primary and secondary trial hypotheses, respectively. We are interested in constructing a testing strategy that controls the familywise error rate (FWE) with respect to both families of hypotheses in the strong sense (Hochberg and Tamhane [10]).

To apply the closed testing principle to the problem of testing $F_1$ and $F_2$, consider an arbitrary intersection hypothesis $H$ in the closed family associated with $H_1,\dots,H_m$. Let $p_1,\dots,p_m$ denote the raw p-values for $H_1,\dots,H_m$. Further, let $p_H$ denote the p-value associated with $H$, obtained from a test procedure whose size is no more than $\alpha$ when $H$ is true. The closed testing principle states that an original hypothesis is rejected provided all of the p-values associated with the intersection hypotheses containing it are significant. Therefore, the adjusted p-value associated with the hypothesis $H_i$ equals $\tilde p_i = \max_{H \in \mathcal{H}_i} p_H$, where $\mathcal{H}_i$ denotes the set of all intersection hypotheses that contain $H_i$. The closed testing procedure rejects $H_i$ when $\tilde p_i \le \alpha$, and strongly controls the FWE at the $\alpha$ level over the combined family $(H_1,\dots,H_m)$. Note that the adjusted p-value depends on the test procedure chosen to test $H$; the key to our gatekeeping procedures is the particular choice of tests for each $H$, as we now describe.

Parallel gatekeeping procedures can be defined using weighted Bonferroni tests for the intersection hypotheses. Select an intersection hypothesis $H$ and consider a set of $m$ weights $v_1(H),\dots,v_m(H)$ such that

$$0 \le v_i(H) \le 1, \qquad v_i(H) = 0 \ \text{if} \ \delta_i(H) = 0, \qquad \sum_{i=1}^{m} v_i(H) \le 1.$$

Here $\delta_i(H) = 1$ if $H \in \mathcal{H}_i$ and 0 otherwise. The weighted Bonferroni p-value associated with $H$ is given by

$$p_H = \min_{1 \le i \le m} \bigl(\delta_i(H)\, p_i / v_i(H)\bigr),$$

where $\delta_i(H)\, p_i / v_i(H) = 1$ if $v_i(H) = 0$, and $H$ is rejected if $p_H \le \alpha$. By the Bonferroni inequality, the size of this test is no greater than $\alpha$. Therefore, the resulting closed testing procedure for the original hypotheses controls the FWE in the strong sense for any set of weight vectors.
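The weighted Bonferroni p-value defined above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name, the input checks, the cap at 1, and the raw p-values in the usage example are all assumptions introduced here.

```python
def weighted_bonferroni_p(p_values, weights, in_H):
    """Weighted Bonferroni p-value for one intersection hypothesis H.

    p_values[i] is the raw p-value p_i, weights[i] is v_i(H), and
    in_H[i] is delta_i(H) (1 if H_i is a component of H, else 0).
    """
    if any(v < 0 or v > 1 for v in weights) or sum(weights) > 1 + 1e-12:
        raise ValueError("weights must lie in [0, 1] and sum to at most 1")
    if any(v > 0 and not d for v, d in zip(weights, in_H)):
        raise ValueError("v_i(H) must be 0 whenever delta_i(H) is 0")
    # Terms with zero weight are set to 1, following the convention in the text.
    terms = [p / v if v > 0 else 1.0 for p, v in zip(p_values, weights)]
    # Capping at 1 is an added convenience so the result is a valid p-value.
    return min(1.0, min(terms))


# Hypothetical raw p-values for H_1..H_4 and weights for H = H_1 ∩ H_3.
p_H = weighted_bonferroni_p(
    p_values=[0.02, 0.30, 0.004, 0.50],
    weights=[0.9, 0.0, 0.1, 0.0],
    in_H=[1, 0, 1, 0],
)
print(round(p_H, 4))  # the smaller of 0.02 / 0.9 and 0.004 / 0.1
```

The intersection hypothesis is rejected at level $\alpha$ when the returned value does not exceed $\alpha$.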

The PAAS procedure fits with the current framework as follows. Suppose the weights are fixed for all hypotheses ($v_i(H) \equiv v_i$) and that $H$ is tested using $p_H^{(\mathrm{PAAS})} = \min_{1 \le i \le m} \delta_i(H)\, p_i / v_i$. Then “reject all $H_i$ for which $p_i \le v_i\alpha$” is equivalent to the closed testing procedure using p-values $p_H^{(\mathrm{PAAS})}$ for the intersection hypotheses. Viewed as a closed testing procedure, one can see that (i) the weights for the PAAS procedure are identical for all intersections, and (ii) the PAAS method uses extremely conservative tests for some of the intersection hypotheses; for example, the levels of the singletons are $v_i\alpha$, potentially much less than $\alpha$. More power can be obtained using more powerful tests for the intersections; this is the essential contribution of Holm [19].

In what follows we describe an FWE-controlling closed procedure for testing $F_1$ and $F_2$ that meets the following parallel gatekeeping criteria:

- Condition 1. The adjusted p-values for the gatekeeper hypotheses $H_1,\dots,H_k$ do not depend on the significance of the p-values associated with $H_{k+1},\dots,H_m$.
- Condition 2. The adjusted p-values associated with the secondary hypotheses $H_{k+1},\dots,H_m$ are greater than the minimum of $\tilde p_1,\dots,\tilde p_k$. This means that the hypotheses in $F_2$ can be tested if at least one hypothesis in $F_1$ has been rejected.

The following algorithm shows how to choose the weight vectors to satisfy Conditions 1 and 2. Suppose that $w_1,\dots,w_m$ represent the relative importance of the null hypotheses in $F_1$ and $F_2$ with $w_1 + \dots + w_k = 1$ and $w_{k+1} + \dots + w_m = 1$. For example, $w_1$ may be set to 0.8 if $H_1$ corresponds to the most important primary trial endpoint and 0.2 may be distributed evenly across the remaining gatekeeper hypotheses associated with the less important outcome variables. Using the $w_i$, we define weights $v_i(H)$ to use in a Bonferroni test of hypothesis $H$. Every hypothesis $H$ will be tested using a potentially different set of weights $(v_1(H),\dots,v_m(H))$.

**Algorithm 1. Selecting the weights for the parallel Bonferroni gatekeeping procedure**

Select a hypothesis in the closed family and denote it by $H$. Consider the following three mutually exclusive cases:

Case 1. If $H$ contains all $k$ gatekeeper hypotheses, i.e., $H \in \cap_{i=1}^{k} \mathcal{H}_i$, let $v_i(H) = w_i$, $i = 1,\dots,k$, and $v_i(H) = 0$, $i = k+1,\dots,m$.

Case 2. If $H$ contains $r$ gatekeeper hypotheses ($1 \le r \le k-1$), i.e., $H \in (\cup_{i=1}^{k} \mathcal{H}_i) \cap (\cap_{i=1}^{k} \mathcal{H}_i)^c$, let $v_i(H) = w_i\delta_i(H)$, $i = 1,\dots,k$, and

$$v_i(H) = w_i\delta_i(H)\Bigl(1 - \sum_{j=1}^{k} w_j\delta_j(H)\Bigr) \Big/ \sum_{j=k+1}^{m} w_j\delta_j(H), \quad i = k+1,\dots,m.$$

Case 3. If $H$ does not contain any gatekeeper hypotheses, i.e., $H \in (\cup_{i=1}^{k} \mathcal{H}_i)^c$, let $v_i(H) = 0$, $i = 1,\dots,k$, and

$$v_i(H) = w_i\delta_i(H) \Big/ \sum_{j=k+1}^{m} w_j\delta_j(H), \quad i = k+1,\dots,m.$$

For example, suppose there are two primary and two secondary hypotheses, with equal weights assumed for all tests. There are $2^4 - 1 = 15$ intersection hypotheses $H$. Table 1 shows the weights $v_i(H)$ that are used for each test. The parallel gatekeeping procedure simply uses weighted Bonferroni tests for every intersection, with weights as shown.

It is also instructive to compare the proposed weighting scheme with the weighting scheme underlying serial gatekeeping procedures discussed by Westfall and Krishen [6]. Westfall and Krishen showed that serial gatekeeping strategies can be set up by sequentially carrying out two weighted Holm tests [19], i.e., by using the following algorithm for defining weight vectors in the closed test for $H_1,\dots,H_m$:

**Algorithm 2. The weighting scheme for the serial Bonferroni gatekeeping procedure**

Select a hypothesis in the closed family and denote it by $H$. Consider the following two mutually exclusive cases:

Case 1. If $H$ contains at least one gatekeeper hypothesis, i.e., $H \in \cup_{i=1}^{k} \mathcal{H}_i$, let $v_i(H) = w_i\delta_i(H) / \sum_{j=1}^{k} w_j\delta_j(H)$, $i = 1,\dots,k$, and $v_i(H) = 0$, $i = k+1,\dots,m$.

Case 2. If $H$ does not contain any gatekeeper hypotheses, i.e., $H \in (\cup_{i=1}^{k} \mathcal{H}_i)^c$, let $v_i(H) = 0$, $i = 1,\dots,k$, and $v_i(H) = w_i\delta_i(H) / \sum_{j=k+1}^{m} w_j\delta_j(H)$, $i = k+1,\dots,m$.

One can verify that this choice of the weight vectors ensures that the adjusted p-values associated with $H_{k+1},\dots,H_m$ are greater than the maximum of $\tilde p_1,\dots,\tilde p_k$. In other words, one can test the hypotheses in $F_2$ only if all hypotheses in $F_1$ have been rejected. To construct a parallel gatekeeping procedure, one needs to modify the serial weighting scheme by assigning smaller weights to $H_1,\dots,H_k$ when $H$ contains some but not all gatekeeper hypotheses. For example, assume that $H$ is the intersection of $H_1,\dots,H_{k-1}$ and $H_{k+1},\dots,H_m$ but does not contain $H_k$.
The weight vector associated with the serial gatekeeping approach (see Algorithm 2) is

$$\Bigl(w_1 \Big/ \sum_{j=1}^{k-1} w_j,\ \dots,\ w_{k-1} \Big/ \sum_{j=1}^{k-1} w_j,\ 0,\ \dots,\ 0\Bigr).$$

It is important to note that weights for $H_1,\dots,H_{k-1}$ are defined in such a way that they add up to 1. This immediately implies that the weights assigned to the hypotheses in $F_2$ are set to zero. To modify the serial weighting scheme, note that the pre-specified weights associated with the gatekeeper hypotheses contained in $H$ do not add up to 1 in this scenario, i.e., $w_1 + \dots + w_{k-1} < 1$. This gives us an ability to incorporate the secondary p-values into the decision rule. Specifically, the gatekeeper hypotheses will receive the pre-specified weights, i.e., $w_1,\dots,w_{k-1}$, and the remainder (i.e., $1 - \sum_{j=1}^{k-1} w_j$) will be distributed among the secondary hypotheses according to their importance, i.e., the following weight vector will be constructed:

$$\Bigl(w_1,\ \dots,\ w_{k-1},\ 0,\ w_{k+1}\Bigl(1 - \sum_{j=1}^{k-1} w_j\Bigr),\ \dots,\ w_m\Bigl(1 - \sum_{j=1}^{k-1} w_j\Bigr)\Bigr).$$

An important property of the parallel weighting scheme defined in Algorithm 1 is that the adjusted p-values associated with the gatekeeper hypotheses are given by $\tilde p_i = p_i / w_i$, $i = 1,\dots,k$. This implies that the hypotheses in $F_1$ are tested using the weighted Bonferroni tests that take into account the relative importance of the hypotheses. As a result, the gatekeeper hypotheses will be rejected whenever their Bonferroni-adjusted p-values are significant regardless of the values of $p_{k+1},\dots,p_m$ and thus Condition 1 is satisfied. Further, it is easy to demonstrate that the adjusted p-values $\tilde p_{k+1},\dots,\tilde p_m$ are greater than the minimum of $\tilde p_1,\dots,\tilde p_k$. The closed testing procedure based on the weighting scheme in Algorithm 1 satisfies Condition 2 and therefore it presents a valid parallel gatekeeping procedure.
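To make the two weighting schemes concrete, the following Python sketch evaluates Algorithm 1 (parallel) and Algorithm 2 (serial) for every intersection hypothesis in the example with two primary and two secondary hypotheses and equal weights. The function names and the 0/1 tuple encoding of $\delta_i(H)$ are choices made here for illustration. The sketch also assumes that all $w_i$ are positive and assigns zero secondary weight when $H$ contains no secondary hypothesis, which is what the constraint $v_i(H) = 0$ if $\delta_i(H) = 0$ requires.

```python
from itertools import product


def parallel_weights(delta, w, k):
    """Algorithm 1: weights v_i(H) for the parallel gatekeeping procedure.

    delta[i] is 1 if H_i is a component of H; w holds the importance weights
    (summing to 1 within each family); the first k hypotheses are gatekeepers.
    """
    m = len(w)
    primary_mass = sum(w[j] * delta[j] for j in range(k))
    secondary_mass = sum(w[j] * delta[j] for j in range(k, m))
    v_primary = [w[i] * delta[i] for i in range(k)]
    # Case 1 (all gatekeepers in H), or no secondary hypothesis in H:
    # nothing is passed on to the secondary family.
    if all(delta[:k]) or secondary_mass == 0:
        return v_primary + [0.0] * (m - k)
    # Cases 2 and 3: the weight not used by the gatekeepers in H is split
    # among the secondary hypotheses in H in proportion to their importance.
    leftover = 1.0 - primary_mass
    return v_primary + [w[i] * delta[i] * leftover / secondary_mass
                        for i in range(k, m)]


def serial_weights(delta, w, k):
    """Algorithm 2: weights v_i(H) for the serial gatekeeping procedure."""
    m = len(w)
    if any(delta[:k]):
        # Case 1: the gatekeepers in H share all of the weight.
        mass = sum(w[j] * delta[j] for j in range(k))
        return [w[i] * delta[i] / mass for i in range(k)] + [0.0] * (m - k)
    # Case 2: H contains secondary hypotheses only.
    mass = sum(w[j] * delta[j] for j in range(k, m))
    return [0.0] * k + [w[i] * delta[i] / mass for i in range(k, m)]


# Two primary and two secondary hypotheses, equal weights within each family.
w, k = [0.5, 0.5, 0.5, 0.5], 2
closed_family = [d for d in product((1, 0), repeat=len(w)) if any(d)]
for delta in closed_family:
    label = "".join(map(str, delta))
    print(label, parallel_weights(delta, w, k), serial_weights(delta, w, k))
```

The difference between the schemes shows up only when $H$ contains some but not all gatekeepers: for $H_1 \cap H_3 \cap H_4$, the serial scheme puts all of the weight on $H_1$, whereas the parallel scheme keeps $w_1$ on $H_1$ and passes the remainder to $H_3$ and $H_4$.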

## 3 Gatekeeping procedures based on Simes and resampling tests

The gatekeeping procedure introduced in the previous section is based on the Bonferroni test and thus ignores the correlation among the individual test statistics. It is troubling that (i) Bonferroni-based tests are conservative for large correlation and (ii) Bonferroni-based tests may all be insignificant even when all unadjusted p-values are significant. Resampling-based procedures [11] can alleviate problem (i), while use of the Simes test [12] can alleviate problem (ii).

Simes proposed the following test for an intersection hypothesis $H$ in the closed family. Let $t$ denote the number of elemental hypotheses contained in $H$ and let $p_{(1)H} \le p_{(2)H} \le \dots \le p_{(t)H}$ denote the ordered p-values for the elemental hypotheses. The Simes p-value is then given by $p_H = t \min_{1 \le j \le t} p_{(j)H} / j$. Exact Type I error control was proven under independence by Simes [12] and conservative Type I error control was established under positive dependency among p-values by Sarkar [13]. This test always produces p-values as small or smaller than the Bonferroni test, for which $p_H = t\, p_{(1)H}$. The Simes test is popular because of its improved power and because of its close connection with procedures that control the false discovery rate (Benjamini and Hochberg [14]).

Since our methodology uses weighted Bonferroni tests, we use the weighted Simes test proposed by Benjamini and Hochberg [15]. Consider an intersection hypothesis $H$ and a weight vector $v_i(H)$, $i = 1,\dots,m$. Let $t$ and $p_{(1)H},\dots,p_{(t)H}$ be defined as above and let $v_{(1)H},\dots,v_{(t)H}$ denote the weights corresponding to the ordered p-values. The weighted Simes p-value is equal to

$$p_H = \min_{1 \le l \le t} \frac{p_{(l)H}}{\sum_{i=1}^{l} v_{(i)H}}.$$

Proof of Type I error control for this procedure under positive regression dependency is given by Kling and Benjamini [16]; thus any closed testing procedure that uses such tests for the intersection hypotheses controls the FWE strongly under the same conditions.

Resampling-based p-values for $H_1,\dots,H_m$ can be obtained by using parametric resampling. The resulting adjusted p-values are useful (i) to assess robustness of the Simes-based and Bonferroni-based tests to correlation that may be typical for clinical trials, and (ii) to provide alternative large-sample multiplicity-adjusted p-values that directly incorporate correlation. To obtain the parametric resampling-based p-values, consider an intersection hypothesis $H$ and let $p_H$ denote the observed p-value (weighted Bonferroni or weighted Simes) for $H$. Assuming the usual multivariate normal MANOVA assumptions for our data, with true correlation matrix $\rho$, the “true p-value” is $p_H(\rho) = P(P_H \le p_H \mid \rho)$. An approximate p-value is obtained as the plug-in estimator $p_H(\hat\rho)$, which can easily be simulated as follows (see Westfall and Young [11], p. 122-125, and Westfall et al. [17], p. 130-131, for details):

1. Given a consistent estimate of $\rho$ (denoted by $\hat\rho$), generate $B$ sets of $n$ independent identically distributed $N(0, \hat\rho)$ vectors, where $n$ is the combined sample size and $B$ is the number of simulations.
2. Compute a vector of raw p-values for $H_1,\dots,H_m$ from each simulated data set. Calculate the combined Bonferroni or Simes p-value for each intersection hypothesis in the closed family using the weight vectors defined by Algorithm 1. The obtained p-value for $H$ from the $i$th simulated data set will be denoted by $p^*_H(i)$.
3. The resampling-based p-value for $H$ is equal to

$$p_H(\hat\rho) = \frac{1}{B} \sum_{i=1}^{B} \delta\bigl(p^*_H(i) \le p_H\bigr),$$

where $\delta(\cdot)$ is the indicator function.
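The weighted Simes formula above can be sketched as follows, next to the weighted Bonferroni p-value for comparison. This is an illustrative sketch only: the function names and the example p-values are hypothetical, and components with zero weight are simply skipped, which is one way of handling them (they add nothing to the cumulative weight, so they cannot lower the minimum).

```python
def weighted_bonferroni(p, v):
    """Weighted Bonferroni p-value: min of p_i / v_i over positive weights."""
    terms = [pi / vi for pi, vi in zip(p, v) if vi > 0]
    return min(1.0, min(terms)) if terms else 1.0


def weighted_simes(p, v):
    """Weighted Simes p-value: min over l of p_(l) / (v_(1) + ... + v_(l)).

    p and v hold the raw p-values and weights of the components of H.
    """
    # Order the components by p-value, carrying each weight with its p-value.
    pairs = sorted((pi, vi) for pi, vi in zip(p, v) if vi > 0)
    best, cumulative_weight = 1.0, 0.0
    for pi, vi in pairs:
        cumulative_weight += vi
        best = min(best, pi / cumulative_weight)
    return best


# Two components of H, equally weighted; both raw p-values are below 0.05.
p, v = [0.04, 0.03], [0.5, 0.5]
print(weighted_bonferroni(p, v))  # about 0.06: not significant at 0.05
print(weighted_simes(p, v))       # about 0.04: significant at 0.05
```

The example reproduces problem (ii) from the text on a small scale: the Bonferroni test of the intersection fails at the 0.05 level even though both raw p-values are significant, whereas the Simes test rejects.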

## 4 A clinical trial with co-primary endpoints

Consider a clinical trial in patients with acute respiratory distress syndrome (ARDS). The trial is conducted to compare one dose of a new drug to placebo. The therapeutic benefits of experimental treatments in ARDS trials are commonly measured using the number of days alive and off mechanical ventilation during a 28-day study period and 28-day all-cause mortality rate (see [18] for a detailed description of a recent trial in ARDS patients). Let $H_1$ and $H_2$ denote the null hypotheses of no treatment effect with respect to the number of ventilator-free days and 28-day all-cause mortality. It typically takes fewer patients to detect a clinically relevant improvement in the number of ventilator-free days compared to 28-day mortality. For this reason, the number of ventilator-free days often serves as the primary endpoint in ARDS trials. However, either of these two endpoints can be used to make regulatory claims. Additionally, there is interest in including information about the drug effects on the number of days the patients were out of the intensive care unit (ICU-free days) and general quality of life in the product label. Denote the secondary hypotheses associated with these secondary endpoints by $H_3$ and $H_4$. We wish to develop a parallel gatekeeping procedure that will test the secondary hypotheses only if at least one primary hypothesis has been rejected.

Suppose that the weights for the two gatekeeper hypotheses are given by $w_1 = 0.9$ and $w_2 = 0.1$ and the secondary hypotheses are equally weighted, i.e., $w_3 = w_4 = 0.5$. To define the adjusted p-values for $H_1,\dots,H_4$, consider the closed family associated with the four hypotheses of interest. The closed family includes 15 intersection hypotheses. It is convenient to adopt the following binary representation of the intersection hypotheses. If an intersection hypothesis equals $H_1$, it will be denoted by $H_{1000}$. Similarly,

$$H_{1100} = H_1 \cap H_2, \quad H_{1010} = H_1 \cap H_3, \quad H_{1001} = H_1 \cap H_4, \quad \text{etc.}$$

The decision matrix in Table 2 serves as a useful tool that facilitates the computation of p-values for each intersection hypothesis in the closed family and also the adjusted p-values associated with the four original hypotheses. The p-values shown in Table 2 are based on the weighted Bonferroni rule. To implement the Simes gatekeeping procedure, one needs to test the individual intersection hypotheses using the weighted Simes test introduced in Section 3.

Each row in Table 2 corresponds to an intersection hypothesis. The p-values associated with the intersection hypotheses are defined using the weighting scheme defined in Algorithm 1.

The adjusted p-value for $H_1$, $H_2$, $H_3$ and $H_4$ equals the largest p-value in the corresponding column, i.e.,

$$
\begin{aligned}
\tilde p_1 &= \max[p_{1111}, p_{1110}, p_{1101}, p_{1100}, p_{1011}, p_{1010}, p_{1001}, p_{1000}], \\
\tilde p_2 &= \max[p_{1111}, p_{1110}, p_{1101}, p_{1100}, p_{0111}, p_{0110}, p_{0101}, p_{0100}], \\
\tilde p_3 &= \max[p_{1111}, p_{1110}, p_{1011}, p_{1010}, p_{0111}, p_{0110}, p_{0011}, p_{0010}], \\
\tilde p_4 &= \max[p_{1111}, p_{1101}, p_{1011}, p_{1001}, p_{0111}, p_{0101}, p_{0011}, p_{0001}].
\end{aligned}
$$

Inferences with respect to $H_1$, $H_2$, $H_3$ and $H_4$ are performed by comparing these adjusted p-values with the pre-specified $\alpha$.
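The column-maximum rule can be sketched end to end in Python for the Bonferroni version of the procedure. The weights are those of the ARDS example ($w_1 = 0.9$, $w_2 = 0.1$, $w_3 = w_4 = 0.5$); the raw p-values are hypothetical and are not taken from Table 3. The helper names are illustrative, and the weight function restates Algorithm 1 in compact form so that the block is self-contained.

```python
from itertools import product


def parallel_weights(delta, w, k):
    """Weights v_i(H) from Algorithm 1 (first k hypotheses are gatekeepers)."""
    m = len(w)
    prim = sum(w[j] * delta[j] for j in range(k))
    sec = sum(w[j] * delta[j] for j in range(k, m))
    v = [w[i] * delta[i] for i in range(k)]
    if all(delta[:k]) or sec == 0:
        return v + [0.0] * (m - k)
    return v + [w[i] * delta[i] * (1.0 - prim) / sec for i in range(k, m)]


def bonferroni_p(p, v):
    """Weighted Bonferroni p-value; zero-weight terms count as 1."""
    return min(1.0, min(pi / vi if vi > 0 else 1.0 for pi, vi in zip(p, v)))


def adjusted_p_values(p, w, k):
    """Closed-testing adjusted p-values: column maxima of the decision matrix."""
    m = len(p)
    adjusted = [0.0] * m
    for delta in product((1, 0), repeat=m):  # one row per intersection
        if not any(delta):
            continue  # the empty intersection is not part of the closed family
        p_H = bonferroni_p(p, parallel_weights(delta, w, k))
        for i in range(m):
            if delta[i]:  # H is in the column of every hypothesis it contains
                adjusted[i] = max(adjusted[i], p_H)
    return adjusted


w = [0.9, 0.1, 0.5, 0.5]            # weights from the ARDS example
raw = [0.024, 0.002, 0.026, 0.002]  # hypothetical raw p-values, not Table 3
adj = adjusted_p_values(raw, w, k=2)
for name, p_raw, p_adj in zip(("H1", "H2", "H3", "H4"), raw, adj):
    print(name, p_raw, round(p_adj, 4))
```

With the Bonferroni rule, the gatekeeper adjusted p-values reduce to $p_i / w_i$, and no secondary adjusted p-value can fall below the smaller of the two gatekeeper adjusted p-values, which is how the sketch reflects Conditions 1 and 2.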

Table 3 presents the raw and adjusted p-values produced by the Bonferroni and Simes gatekeeping procedures under three scenarios. Table 3 shows that all four Bonferroni- and Simes-adjusted p-values are significant at the 0.05 level under the assumptions of Scenario 1. This means that the gatekeeping procedures have rejected both gatekeeper hypotheses and continued to test the secondary hypotheses, both of which were also rejected. It is worth noting that the adjusted p-values for the mortality endpoint are considerably larger than the corresponding raw p-value. This is caused by the fact that this endpoint was considered less important than the number of ventilator-free days and was assigned a small weight ($w_2 = 0.1$). Further, the adjusted p-values for the quality of life assessment are also larger than the corresponding raw p-value. This happened because the gatekeeping procedures adjusted the raw p-value upward to make it consistent with the p-value associated with the more important gatekeeper hypothesis. In general, the amount by which secondary p-values are adjusted upward is determined largely by the magnitude of raw p-values associated with the gatekeeper hypotheses. Note also that the weighted Simes test always produces p-values that are as small or smaller than those of the weighted Bonferroni test. As a result, the Simes gatekeeping procedure produced adjusted p-values that are uniformly smaller than the corresponding Bonferroni-adjusted p-values, without sacrificing Type I error control.

Considering Scenario 2, we see that an increase in the raw p-value for the number of ventilator-free days did not affect the magnitude of the Bonferroni-adjusted p-value for mortality. This highlights the fact that the gatekeeper hypotheses are tested independently of each other when the Bonferroni gatekeeping procedure is used.
Since both gatekeeping procedures have rejected the mortality hypothesis, the secondary analyses were undertaken and, as a result, the secondary hypothesis with the highly significant p-value was rejected. Again, the adjusted p-value for the quality of life assessment is substantially larger than the raw one. The magnitude of the upward adjustment simply reflects the fact that the more important gatekeeper null hypothesis is likely to be true and thus the Bonferroni and Simes procedures need more evidence against this secondary hypothesis to reject it.

Scenario 3 illustrates an important property of the Simes gatekeeping procedure. This procedure is known to reject all null hypotheses whenever all raw p-values are significant. Thus, the Simes gatekeeping procedure rejected all four null hypotheses in Scenario 3, whereas the Bonferroni procedure failed to detect significance with respect to the number of ventilator- and ICU-free days.

## 5 A clinical trial with multiple endpoints and multiple doses

Table 4 summarizes the results of a dose-finding study in patients with hypertension. The study was conducted to evaluate effects of low, medium and high doses of an investigational drug compared to placebo. The effects were measured by computing the reduction in systolic and diastolic blood pressure (SBP and DBP) measurements.

The design of the testing strategy is of course done prior to data collection, but having the data visible helps to clarify what is done. In this study, it was felt a priori that (i) SBP is more indicative of true effect than DBP, and hence was placed higher in the hierarchy, and (ii) both the medium and high doses were considered equally important, and potentially equally powerful, while the lower dose was considered less likely to exhibit significance. Accordingly, the following four families of null hypotheses were considered: $F_1$ was comprised of the null hypotheses related to the High vs. Placebo and Medium vs. Placebo comparisons for SBP; $F_2$ was comprised of the null hypotheses related to the High vs. Placebo and Medium vs. Placebo comparisons for DBP; $F_3$ contained the null hypothesis for the Low vs. Placebo comparison for SBP; and $F_4$ contained the null hypothesis for the Low vs. Placebo comparison for DBP. The null hypotheses in $F_1$ and $F_2$ were tested in parallel fashion and were equally weighted within each family, reflecting equal importance of the high and medium doses.

Table 5 displays the raw and adjusted p-values for the individual tests. The raw p-values were computed using the pooled-variance ANOVA two-sided contrast tests. The hierarchical strategy is better in this application than is the usual Bonferroni-Holm [19] or Simes-Hommel procedures [20]; the adjusted p-values for the Simes-Hommel procedure are (in the vertical order of the table) 0.0348, 0.0032, 0.0573, 0.0080, 0.0430 and 0.0848, showing generally less significance. On the other hand, this example shows a case where the serial gatekeeping strategy would have worked better, since both hypotheses are significant in the first gate $F_1$. However, had the p-values for the first gate not both been significant, the serial strategy would not have allowed continuation, as is allowed with our parallel testing procedure.

The resampling-based calculations shown in Table 5 used N=50,000,000, so that the Monte Carlo error is very small. We can see that, while resampling generally makes the adjusted p-values smaller, as is guaranteed for Bonferroni tests and has been proven recently for weighted Simes tests with positively dependent test statistics [21], the results are not much affected by correlation. Thus, as correlation has little effect on the adjusted p-values, we can recommend general use of the procedures without parametric resampling, except perhaps in borderline cases.
(On the other hand, if non-normality is a great concern, then one should consider non-parametric resampling.)

## 6 Power comparisons

A study was conducted to compare the performance of the Bonferroni, Simes and PAAS testing procedures in the case of four null hypotheses grouped into two families (e.g., two primary and two secondary hypotheses). It was assumed that the four individual null hypotheses are of the form

$$H_i = \{\mu_i = 0\}, \quad i = 1,2,3,4,$$

where $\mu_1,\dots,\mu_4$ represent the means of four normally distributed random variables $X_1,\dots,X_4$ with standard deviation 1 and common correlation coefficient $\rho$. The unadjusted two-sided p-values were defined as $p_i = 2\min[\Phi(X_i), 1 - \Phi(X_i)]$, $i = 1,\dots,4$, where $\Phi$ denotes the cumulative distribution function of the standard normal distribution.

The Bonferroni and Simes gatekeeping procedures were performed under the assumption that the null hypotheses were equally weighted within each family, i.e., $w_1 = \dots = w_4 = 0.5$. The adjusted p-values associated with these two gatekeeping procedures were computed as outlined in Sections 2 and 3. Two versions of the PAAS method were used. First, the four null hypotheses were weighted equally and the adjusted p-values were defined as $\tilde p_i = 4p_i$, $i = 1,\dots,4$. Second, the two primary hypotheses received more weight and the PAAS-adjusted p-values were defined as follows:

$$\tilde p_1 = 2.5p_1, \quad \tilde p_2 = 2.5p_2, \quad \tilde p_3 = 10p_3, \quad \tilde p_4 = 10p_4.$$

Results for Bonferroni and Simes were obtained using simulation with 1,000,000 samples; all PAAS results were calculated analytically.

Table 6 summarizes the results of the study. It shows that the PAAS procedures are more powerful than the gatekeeping procedures with respect to the secondary analyses when the primary null hypotheses are both true ($\mu_1 = \mu_2 = 0$), but that there are substantial power gains from using the Bonferroni and Simes gatekeeping procedures for testing the primary variables.
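A small Monte Carlo sketch can illustrate the comparison for the primary family alone, i.e., the probability of rejecting at least one primary hypothesis. It uses the adjustments stated above: $\tilde p_i = p_i / w_i = 2p_i$ for the Bonferroni gatekeeping procedure, $4p_i$ for equally weighted PAAS and $2.5p_i$ for primary-weighted PAAS. The means, the correlation, the number of simulations and the seed are hypothetical choices made here and do not correspond to the settings behind Table 6. The correlated normals are generated with a one-factor construction, which assumes $\rho \ge 0$.

```python
import math
import random


def phi(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def two_sided_p(x):
    """Unadjusted two-sided p-value 2 * min[Phi(x), 1 - Phi(x)]."""
    return 2.0 * min(phi(x), 1.0 - phi(x))


# Multipliers applied to the primary raw p-values by each procedure.
MULTIPLIERS = {
    "Bonferroni gatekeeping": 2.0,   # p_i / w_i with w_1 = w_2 = 0.5
    "PAAS, primary-weighted": 2.5,
    "PAAS, equal weights": 4.0,
}


def primary_power(mu, rho, multipliers, n_sim=20000, alpha=0.05, seed=1):
    """Estimate P(at least one primary hypothesis is rejected) per procedure."""
    rng = random.Random(seed)
    hits = {name: 0 for name in multipliers}
    for _ in range(n_sim):
        shared = rng.gauss(0.0, 1.0)  # common factor inducing correlation rho
        xs = [m + math.sqrt(rho) * shared
              + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0) for m in mu]
        ps = [two_sided_p(x) for x in xs]
        for name, c in multipliers.items():
            if min(c * p for p in ps) <= alpha:
                hits[name] += 1
    return {name: h / n_sim for name, h in hits.items()}


# Hypothetical setting: both primary means equal 2.5, correlation 0.3.
for name, power in primary_power((2.5, 2.5), 0.3, MULTIPLIERS).items():
    print(f"{name}: {power:.3f}")
```

Because the three procedures differ only in the multiplier applied to the same primary p-values, the procedure with the smallest multiplier rejects at least as often as the others in every simulated trial, so the ordering of the estimates does not depend on the Monte Carlo error.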
In particular, if the regulatory mandate is to show a significance in the primary analysis, then the gatekeeping procedures show uniformly higher probability of meeting the regulatory mandate, as shown in the “F1” column.

It is important to emphasize that the PAAS procedures treat the four hypotheses in the two families as co-primary and test them simultaneously. This approach is justified when secondary endpoints may provide the basis for a new regulatory claim (Type I secondary endpoints by the D’Agostino classification [2]). Under the gatekeeping strategies considered here, one does not test the secondary hypotheses unless at least one primary null hypothesis has been rejected. In other words, the gatekeeping approach assumes that the secondary findings will not lead to separate regulatory claims but can only provide supportive evidence for the claims based on the primary endpoints (Type II secondary endpoints by the D’Agostino classification).

As suggested by Westfall and Krishen [6], the greatest gains in efficiency for the gatekeeping approach occur when the primary hypotheses have higher power; in our simulation we see that the gatekeeping procedures beat the PAAS procedures even for the secondary endpoints in these cases. An example of such a case is in dose-finding, where the higher doses are expected to exhibit higher power than the lower doses.

However, in fairness, it is important to note that for many disorders the secondary endpoints have greater power than the primary endpoints. Examples include oncology, where regulatory agencies may insist that sponsors demonstrate a benefit with respect to survival (time to death), although a key secondary endpoint, time-to-progressive-disease, generally has greater power. Another example is in depression, where regulators have insisted on the 17-item Hamilton Depression Scale total score as the primary outcome measure.
The literature contains several subscales thathave been shown to have greater power in discriminating between active drug andplacebo [22]. So it may be true that under such scenarios the PAAS method wouldhave greater power at detecting significant differences from a set of two primary andtwo secondary endpoints, for example. However, regulatory agencies generally wouldnot consider the study as supporting efficacy unless the primary null hypothesis wasrejected.7 DiscussionThis paper presents a framework for performing parallel gatekeeping inferences andoutlines applications of gatekeeping procedures in clinical trials. Using the closedtesting principle, we show how to construct powerful testing procedures that allowone to proceed to lower levels of the hierarchy when not all of the co-primary testsare significant. The benefits of the parallel gatekeeping approach are (i) regulatoryacceptability in cases where a front gate of co-primary endpoints must be passedwith at least one significance, and (ii) good power. The gatekeeping approach is mostappropriate when the secondary analyses do not lead to separate regulatory claimsbut play a supportive role.While the method is eas and can therefore recommend it over the Bonferroni method. Furthermore, we findthat the Simes gatekeeping procedure is relatively robust to correlation structure in aclinical example when correlations induced by multiple dose group comparisons andmultiple endpoints, and therefore using that modification to accommodate correlationmay not generally be needed.References[1] O’Neill RT. Secondary endpoints cannot be validly analyzed if the primary end-point does not demonstrate clear statistical significance. Controlled Clinical Tri-als 1997; 18:550-556.[2] D’Agostino RB. Controlling alpha in clinical trials: the case for secondary end-points. Statistics in Medicine 2000; 19:763-766.[3] Committee for Proprietary Medicinal Products. Points to consider on multiplic-ity issues in clinical trials. 
London.

[4] Chi GYH. Multiple testings: multiple comparisons and multiple endpoints. Drug Information Journal 1998; 32:1347S-1362S.

[5] Bauer P, Röhmel J, Maurer W, Hothorn L. Testing strategies in multi-dose experiments including active control. Statistics in Medicine 1998; 17:2133-2146.

[6] Westfall PH, Krishen A. Optimally weighted, fixed sequence, and gatekeeping multiple testing procedures. Journal of Statistical Planning and Inference 2001; 99:25-40.

[7] Gong J, Pinheiro JC, DeMets DL. Estimating significance level and power comparisons for testing multiple endpoints in clinical trials. Controlled Clinical Trials 2000; 21:313-329.

[8] Moyé LA. Alpha calculus in clinical trials: considerations and commentary for the new millennium. Statistics in Medicine 2000; 19:767-779.

[9] Marcus R, Peritz E, Gabriel KR. On closed testing procedure with special reference to ordered analysis of variance. Biometrika 1976; 63:655-660.

[10] Hochberg Y, Tamhane AC. Multiple Comparison Procedures. Wiley: New York, 1987.

[11] Westfall PH, Young SS. Resampling-Based Multiple Testing: Examples and Methods for P-Value Adjustment. Wiley: New York, 1993.


[12] Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986; 73:751-754.

[13] Sarkar S. Some probability inequalities for ordered MTP2 random variables: A proof of the Simes conjecture. Annals of Statistics 1998; 26:494-504.

[14] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B 1995; 57:289-300.

[15] Benjamini Y, Hochberg Y. Multiple hypothesis testing and weights. Scandinavian Journal of Statistics 1997; 24:407-418.

[16] Kling Y, Benjamini Y. A cost-based approach to multiplicity control. Unpublished manuscript. 2002.

[17] Westfall PH, Ho SY, Prillaman BA. Properties of multiple intersection-union tests for multiple endpoints in combination therapy trials. Journal of Biopharmaceutical Statistics 2001; 11:125-138.

[18] ARDS Network. Ventilation with lower tidal volumes for acute lung injury and the acute respiratory distress syndrome. New England Journal of Medicine 2000; 342:1301-1308.

[19] Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 1979; 6:65-70.

[20] Hommel G. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 1988; 75:383-386.

[21] Benjamini Y. Personal communication. 2002.

[22] Faries D, Herrera J, Rayamajhi J, DeBrota D, Demitrack M, Potter WZ. The responsiveness of the Hamilton depression rating scale. Journal of Psychiatric Research 2000; 34:3-10.
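Table 1. Weights assigned to the intersection hypothesis tests

| Intersection hypothesis | Weight for H1 | Weight for H2 | Weight for H3 | Weight for H4 |
|---|---|---|---|---|
| H1 ∩ H2 ∩ H3 ∩ H4 | 0.5 | 0.5 | 0.0 | 0.0 |
| H1 ∩ H2 ∩ H3 | 0.5 | 0.5 | 0.0 | 0.0 |
| H1 ∩ H2 ∩ H4 | 0.5 | 0.5 | 0.0 | 0.0 |
| H1 ∩ H2 | 0.5 | 0.5 | 0.0 | 0.0 |
| H1 ∩ H3 ∩ H4 | 0.5 | 0.0 | 0.25 | 0.25 |
| H1 ∩ H3 | 0.5 | 0.0 | 0.5 | 0.0 |
| H1 ∩ H4 | 0.5 | 0.0 | 0.0 | 0.5 |
| H1 | 0.5 | 0.0 | 0.0 | 0.0 |
| H2 ∩ H3 ∩ H4 | 0.0 | 0.5 | 0.25 | 0.25 |
| H2 ∩ H3 | 0.0 | 0.5 | 0.5 | 0.0 |
| H2 ∩ H4 | 0.0 | 0.5 | 0.0 | 0.5 |
| H2 | 0.0 | 0.5 | 0.0 | 0.0 |
| H3 ∩ H4 | 0.0 | 0.0 | 0.5 | 0.5 |
| H3 | 0.0 | 0.0 | 1.0 | 0.0 |
| H4 | 0.0 | 0.0 | 0.0 | 1.0 |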

Table 2. Decision matrix for the parallel Bonferroni gatekeeping procedure

| Intersection hypothesis | P-value for intersection hypothesis | H1 | H2 | H3 | H4 |
|---|---|---|---|---|---|
| H1111 | p1111 = min(p1/0.9, p2/0.1) | p1111 | p1111 | p1111 | p1111 |
| H1110 | p1110 = min(p1/0.9, p2/0.1) | p1110 | p1110 | p1110 | 0 |
| H1101 | p1101 = min(p1/0.9, p2/0.1) | p1101 | p1101 | 0 | p1101 |
| H1100 | p1100 = min(p1/0.9, p2/0.1) | p1100 | p1100 | 0 | 0 |
| H1011 | p1011 = min(p1/0.9, p3/0.05, p4/0.05) | p1011 | 0 | p1011 | p1011 |
| H1010 | p1010 = min(p1/0.9, p3/0.1) | p1010 | 0 | p1010 | 0 |
| H1001 | p1001 = min(p1/0.9, p4/0.1) | p1001 | 0 | 0 | p1001 |
| H1000 | p1000 = p1/0.9 | p1000 | 0 | 0 | 0 |
| H0111 | p0111 = min(p2/0.1, p3/0.45, p4/0.45) | 0 | p0111 | p0111 | p0111 |
| H0110 | p0110 = min(p2/0.1, p3/0.9) | 0 | p0110 | p0110 | 0 |
| H0101 | p0101 = min(p2/0.1, p4/0.9) | 0 | p0101 | 0 | p0101 |
| H0100 | p0100 = p2/0.1 | 0 | p0100 | 0 | 0 |
| H0011 | p0011 = min(p3/0.5, p4/0.5) | 0 | 0 | p0011 | p0011 |
| H0010 | p0010 = p3 | 0 | 0 | p0010 | 0 |
| H0001 | p0001 = p4 | 0 | 0 | 0 | p0001 |

Note: The table shows p-values associated with the intersection hypotheses. The columns H1 to H4 form the right-hand panel (original hypotheses). The adjusted p-values for the original hypotheses H1, H2, H3 and H4 are defined as the largest p-value in the corresponding column in the right-hand panel of the table (see Equation (1)).
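
As a worked instance of the rule stated in the note, consider H4 with the Scenario 2 raw p-values reported in Table 3 (p1 = 0.084, p2 = 0.003, p3 = 0.026, p4 = 0.002). The symbol $\tilde{p}_4$ for the adjusted p-value is notation assumed here for illustration and may differ from that of Equation (1). Only the eight intersection hypotheses that contain H4 contribute to its column:

$$
\begin{aligned}
p_{1111} = p_{1101} &= \min(0.084/0.9,\ 0.003/0.1) = 0.0300,\\
p_{1011} &= \min(0.084/0.9,\ 0.026/0.05,\ 0.002/0.05) = 0.0400,\\
p_{1001} &= \min(0.084/0.9,\ 0.002/0.1) = 0.0200,\\
p_{0111} &= \min(0.003/0.1,\ 0.026/0.45,\ 0.002/0.45) \approx 0.0044,\\
p_{0101} &= \min(0.003/0.1,\ 0.002/0.9) \approx 0.0022,\\
p_{0011} &= \min(0.026/0.5,\ 0.002/0.5) = 0.0040,\\
p_{0001} &= 0.0020,\\
\tilde{p}_4 &= \max\{p_{1111},\, p_{1101},\, p_{1011},\, p_{1001},\, p_{0111},\, p_{0101},\, p_{0011},\, p_{0001}\} = 0.0400.
\end{aligned}
$$

The maximum is attained at H1011, and the result agrees with the Bonferroni adjusted p-value listed for Quality of life under Scenario 2 in Table 3.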

Table 3. Bonferroni and Simes gatekeeping procedures in the acute respiratory distress syndrome trial

| Scenario | Family | Endpoint | Weight | Raw p-value | Adjusted p-value (Bonferroni) | Adjusted p-value (Simes) |
|---|---|---|---|---|---|---|
| 1 | Primary | Vent-free days | 0.9 | 0.024 | 0.0267 | 0.0260 |
| 1 | Primary | Mortality | 0.1 | 0.003 | 0.0300 | 0.0260 |
| 1 | Secondary | ICU-free days | 0.5 | 0.026 | 0.0289 | 0.0260 |
| 1 | Secondary | Quality of life | 0.5 | 0.002 | 0.0267 | 0.0253 |
| 2 | Primary | Vent-free days | 0.9 | 0.084 | 0.0933 | 0.0840 |
| 2 | Primary | Mortality | 0.1 | 0.003 | 0.0300 | 0.0300 |
| 2 | Secondary | ICU-free days | 0.5 | 0.026 | 0.0933 | 0.0840 |
| 2 | Secondary | Quality of life | 0.5 | 0.002 | 0.0400 | 0.0400 |
| 3 | Primary | Vent-free days | 0.9 | 0.048 | 0.0533 | 0.0480 |
| 3 | Primary | Mortality | 0.1 | 0.003 | 0.0300 | 0.0300 |
| 3 | Secondary | ICU-free days | 0.5 | 0.026 | 0.0533 | 0.0480 |
| 3 | Secondary | Quality of life | 0.5 | 0.002 | 0.0400 | 0.0400 |
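
The decision matrix of Table 2 can be evaluated mechanically. The following is a minimal Python sketch, not the authors' software: it transcribes the intersection weights from Table 2, computes each weighted Bonferroni intersection p-value, and takes the column-wise maximum to obtain adjusted p-values. The names `TABLE2_WEIGHTS`, `intersection_p_value`, `adjusted_p_values` and `SCENARIOS` are hypothetical and chosen only for this illustration. The Simes column of Table 3 is not covered.

```python
# Weights for each intersection hypothesis, transcribed from Table 2.
# Key: membership of (H1, H2, H3, H4) in the intersection (1 = included).
# Value: weight applied to the raw p-values (p1, p2, p3, p4).
TABLE2_WEIGHTS = {
    (1, 1, 1, 1): (0.9, 0.1, 0.0, 0.0),
    (1, 1, 1, 0): (0.9, 0.1, 0.0, 0.0),
    (1, 1, 0, 1): (0.9, 0.1, 0.0, 0.0),
    (1, 1, 0, 0): (0.9, 0.1, 0.0, 0.0),
    (1, 0, 1, 1): (0.9, 0.0, 0.05, 0.05),
    (1, 0, 1, 0): (0.9, 0.0, 0.1, 0.0),
    (1, 0, 0, 1): (0.9, 0.0, 0.0, 0.1),
    (1, 0, 0, 0): (0.9, 0.0, 0.0, 0.0),
    (0, 1, 1, 1): (0.0, 0.1, 0.45, 0.45),
    (0, 1, 1, 0): (0.0, 0.1, 0.9, 0.0),
    (0, 1, 0, 1): (0.0, 0.1, 0.0, 0.9),
    (0, 1, 0, 0): (0.0, 0.1, 0.0, 0.0),
    (0, 0, 1, 1): (0.0, 0.0, 0.5, 0.5),
    (0, 0, 1, 0): (0.0, 0.0, 1.0, 0.0),
    (0, 0, 0, 1): (0.0, 0.0, 0.0, 1.0),
}


def intersection_p_value(p, weights):
    """Weighted Bonferroni p-value: smallest p_i / w_i over positive weights."""
    return min(p_i / w_i for p_i, w_i in zip(p, weights) if w_i > 0)


def adjusted_p_values(p):
    """Adjusted p-value of H_i: the largest intersection p-value among the
    intersection hypotheses containing H_i (one column of Table 2)."""
    adjusted = [0.0] * len(p)
    for members, weights in TABLE2_WEIGHTS.items():
        p_int = intersection_p_value(p, weights)
        for i, is_member in enumerate(members):
            if is_member:
                adjusted[i] = max(adjusted[i], p_int)
    return adjusted


# Raw p-values (Vent-free days, Mortality, ICU-free days, Quality of life)
# for the three scenarios of Table 3.
SCENARIOS = {
    "Scenario 1": (0.024, 0.003, 0.026, 0.002),
    "Scenario 2": (0.084, 0.003, 0.026, 0.002),
    "Scenario 3": (0.048, 0.003, 0.026, 0.002),
}

for name, raw in SCENARIOS.items():
    print(name, [f"{value:.4f}" for value in adjusted_p_values(raw)])
```

Rounded to four decimals, the values printed for each scenario agree with the Bonferroni column of Table 3. The weights are hard-coded rather than derived from a general rule so that the sketch depends only on what Table 2 states.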