Lines are open and interactive, so please utilize your mute button. If you do not have a mute button, please press star-6 to self-mute. I'd like to turn the call over to Ms. Diane St. Germain. You may begin.

>> Diane St. Germain: Thank you. Hello. On behalf of the International Society for Quality of Life Research and the National Cancer Institute, I would like to welcome you to the webinar series, Best Practices for Integrating Patient-Reported Outcomes in Oncology Clinical Trials. The webinar series consists of six webinars, each approximately 45 minutes in length. The webinars are designed to be viewed independently to meet the individual learning needs of the participants, or the series can be viewed sequentially in its entirety. Today's webinar is the fourth in the series and is titled "How to Develop the Statistical Plan and Sample Size Calculation for the Patient-Reported Outcome Component of a Clinical Study," presented by Dr. Amylou Dueck and Dr. Diane Fairclough.

>> Amylou Dueck: Great. Hi, I'm Amylou Dueck. I'm a biostatistician at Mayo Clinic, and today I'll be talking about how to develop the statistical plan and sample size for the PRO component of a clinical study with my co-presenter, Dr. Diane Fairclough. Before we begin, we're going to assume that, for the particular study whose statistical component you're working on, the PRO hypotheses, the tools, and the administration time points have already been identified. And really, the overarching theme for the presentation is that the statistical section is very similar to that for other clinical endpoints. So, the PRO endpoint should be considered in the hierarchy of all the other clinical endpoints for a given study. In particular, it may be a primary endpoint if you're looking at a regulatory submission using the PRO endpoint. It could be a secondary endpoint, where, if the clinical endpoints show the same result, you might turn to the secondary PRO outcomes to make a decision about choosing among treatments. Or it could be informational as well, for example, looking at the impact of side effects. But all in all, the entire PRO component should be integrated with the aims, the endpoint, and the methodology in order to complete the final output, which would be the power estimation. So, here is the outline for our talk. We're going to start by talking about the components of the QOL statistical section, and then we'll move into talking about endpoints, statistical analyses, and data collection. Then we'll get into the more statistical issues, particularly multiple testing, the use of summary measures and longitudinal analyses, and the topic of missing data, and finally we'll talk about sample size and power considerations for QOL endpoints. We'll end with some additional resources. But first, what's in a QOL statistical section?
Well, just like any other statistical section, we would expect to see endpoints, a clear delineation of the sample size, and a clear statistical analysis plan, which should include your statistical analyses, how you're planning to handle multiplicity, as well as how you're handling missing data. And then you should have some clear power calculations. Other things that I look for when I'm reviewing QOL statistical sections are listed here on this slide. They might not specifically be in the stat section, but I do look for them somewhere in the protocol, and this can include the QOL background and hypotheses. I like to see which patients are being targeted for the QOL component. Is it every single patient coming onto this study, or is it a specific subset? Are there other eligibility requirements, for example, requiring some minimal level on a scale in order to look at something like improvement, or whether the QOL component is mandatory or optional? I look for a description of the questionnaire, including validation and reliability information, and specifically I do like to see references. I look for what languages are available. I look for the scoring algorithm. I look for how to interpret the endpoints, for example, references to clinical significance. I look for a clear delineation of the time points, and also how the tools are going to be administered.

Let's start with defining the endpoint and selecting a statistical analysis. In terms of defining the endpoint, even if you know exactly what time frame you want to administer it over, you still have lots of options for defining that endpoint. And, really, it's that definition of the endpoint that determines the proper statistical analysis plan, so what tests you'll actually use to analyze the endpoint. And obviously, we need to know what analysis you're running in order to carry out the power calculation.

In terms of identifying that endpoint, I often think about the setting of the study. Is the intervention of limited duration? For example, is this a three- to six-month type of intervention, or is it an extended-duration type of study where you're giving treatment for five years or giving treatment until progression? I also think about whether the study is in the adjuvant setting, or mostly curative, or whether we're talking about a palliative-type intervention. That can all influence how you define your endpoints. For the limited-duration trials, these are mostly the adjuvant-type studies, or curative-intent type studies, the endpoint should reflect the outcomes at the end of the trial. Endpoints measured during the trial can be useful to characterize toxicity, and if the long-term endpoints are equivalent, then these short-term endpoints could be useful. But again, it depends on the setting of the study. For the extended-duration trials, the typical examples would be palliative-type studies, or studies where you're treating until progression, the PROs can be useful to describe the expected toxicity and the impact on QOL if your other clinical endpoints, like survival or disease progression, are equivalent. Interesting time points to try to capture are those right near the end of treatment, so near the time of progression or just immediately thereafter. You might consider using an area under the curve over a predefined period, rather than a comparison at a specific time point, because we have people on study for differing amounts of time. In the adjuvant setting for extended-duration trials, you might focus on outcomes after a window of time. So, you might not want to focus on the early toxicity period, but rather look at the more stable, longer-term outcomes in those sorts of studies.

So, as I said before, the endpoint that you select really dictates the types of statistical analyses that make sense. This table shows the range of possible endpoints and the suggested statistical analysis that would be performed for each. For example, if we picked a fixed time point (here the example is the FACT-G total score at six months), you might compare that between arms using a t-test or a Wilcoxon rank sum test, or, alternatively, you could use analysis of covariance. Alternatively, there are ways to modify that fixed time point into a percent change from baseline at a fixed time point, and you can construct various summary measures, as listed here. You see that the type of analysis varies depending on that endpoint. Is it continuous? Is it a binary outcome? Is it a longitudinal-type outcome? But we see a range of possible analyses here; all would be valid and considered adequate depending on the type of endpoint you're selecting.
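To make the fixed-time-point row of that table concrete, here is a minimal sketch in Python of comparing a PRO score at a fixed time point between two arms with a t-test, a Wilcoxon rank sum test, and an analysis of covariance adjusting for baseline. The file name and the columns (arm, fact_g_baseline, fact_g_6mo) are hypothetical stand-ins, not names from the study discussed here.

    import pandas as pd
    from scipy import stats
    import statsmodels.formula.api as smf

    # Hypothetical analysis data set: one row per patient with arm assignment,
    # baseline FACT-G total score, and FACT-G total score at six months.
    df = pd.read_csv("pro_scores.csv")   # assumed columns: arm, fact_g_baseline, fact_g_6mo
    a = df.loc[df["arm"] == "A", "fact_g_6mo"].dropna()
    b = df.loc[df["arm"] == "B", "fact_g_6mo"].dropna()

    # Two-sample t-test and Wilcoxon rank sum (Mann-Whitney) test at the fixed time point.
    t_stat, t_p = stats.ttest_ind(a, b)
    u_stat, u_p = stats.mannwhitneyu(a, b, alternative="two-sided")
    print(t_p, u_p)

    # Analysis of covariance: six-month score modeled on arm, adjusting for the baseline score.
    ancova = smf.ols("fact_g_6mo ~ arm + fact_g_baseline", data=df).fit()
    print(ancova.summary())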
So, this graphic shows an example of a study where pain was measured weekly over 12 weeks, just to illustrate that even when the assessment time points are specified, there are still numerous different endpoints you could choose to analyze. For example, we could look at a comparison at 12 weeks, and this might employ a t-test or Wilcoxon rank sum test, an analysis of variance, or an analysis of covariance type of test. Similarly, we could look at a more intermediate time point; we could look at six weeks. We could look at an area under the curve, where we're comparing the total area under the curve between arms up to a particular time point. Or we could compare the actual longitudinal profiles over time between arms.
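As a brief sketch of the area-under-the-curve option, the fragment below computes a per-patient AUC of the weekly pain scores by the trapezoidal rule and compares it between arms. The long-format columns (patient_id, arm, week, pain) are assumptions for illustration, and patients with very different lengths of follow-up would need the additional care discussed above.

    import numpy as np
    import pandas as pd
    from scipy import stats

    # Hypothetical long-format data: one row per patient per week (patient_id, arm, week, pain).
    long = pd.read_csv("weekly_pain.csv")

    def patient_auc(g):
        # Trapezoidal area under the pain curve over the observed weeks (0 through 12).
        g = g.sort_values("week").dropna(subset=["pain"])
        return np.trapz(g["pain"], g["week"])

    auc = long.groupby(["patient_id", "arm"]).apply(patient_auc).rename("auc").reset_index()

    # Compare the per-patient AUCs between arms, here with a Wilcoxon rank sum test.
    a = auc.loc[auc["arm"] == "A", "auc"]
    b = auc.loc[auc["arm"] == "B", "auc"]
    print(stats.mannwhitneyu(a, b, alternative="two-sided"))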

Other considerations for a given analysis plan should take into account any other design features. For example, if your study is a cluster-randomized design, then the proper analysis should incorporate that clustering feature of the design. In terms of intent-to-treat versus per-protocol, the intent-to-treat approach is the gold standard; it's considered unbiased due to the randomization and is really the closest to evaluating the effectiveness of a particular intervention. Some recommend a per-protocol analysis because it's closer to true efficacy, but you really need to be careful here because it is subject to selection bias. I typically use per-protocol types of analyses for descriptive purposes only, and I'm very cautious about drawing comparisons based on per-protocol type definitions.

All right, let's talk about other types of analyses, particularly responder analysis. A responder analysis is one where we define a threshold for a change from baseline in a continuous variable that is clinically meaningful. We'll see some examples in a minute, but we really need to be careful about defining those thresholds, particularly depending on whether the study is blinded or not. In an un-blinded study, requiring a small amount of change may not actually capture clinically meaningful change, so the responder definition could change depending on the setting and on whether blinding is present. But basically, in this type of analysis, you essentially classify patients as either responders or non-responders, and the comparison is then based on that binary outcome. The proper statistical test could be a chi-squared test, a Fisher's exact test, a Mantel-Haenszel test, or any other type of analysis that can be performed on a binary outcome. There are pluses and minuses to this approach. On the plus side, it simplifies the statistical analysis and interpretation, and it ensures that a statistically significant result actually represents a clinically meaningful benefit; you're capturing clinical meaningfulness by defining the endpoint upfront. On the negative side, converting a continuous endpoint into a binary outcome does increase the sample size requirements; you're losing a little bit of power by condensing the endpoint. Also, the threshold can be somewhat arbitrary, and we'll see some examples of that. Interpretation can also be problematic if the response is temporary: if you have just a quick peak right at the very start of the study, it may not actually represent a more global difference between arms, and you could be detecting something that is not necessarily what you had intended to capture.

So, this is just an example showing two different responder definitions from a study that looked at pain, the first being pain intensity palliation and the second being pain intensity progression. You see that these two responder definitions actually incorporate duration, so they require two consecutive follow-up visits, and in this particular instance they also incorporate analgesic use, in addition to a requirement for change in the actual pain intensity. The same holds on the pain progression side. Another analysis you may incorporate into your statistical analysis plan when you're using responder definitions is the use of cumulative distribution functions. Essentially, this plot shows every possible cut point, so what you're doing is checking that your responder analysis is not dependent on just the single cut point that you've selected. Specifically, this graph shows the outcomes in a continuous manner, but it also delineates, for three different plausible cut points, the difference among the three treatment arms, showing that it's really not just a single cut point that's showing the difference; you're actually seeing it at multiple different cut points.
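As a rough sketch of what such a responder analysis and cut-point check might look like in practice, the fragment below classifies patients as responders at an illustrative 2-point threshold, compares arms with a chi-squared test, and evaluates the empirical cumulative distribution of change scores by arm so that other cut points can be inspected. The threshold and column names are assumptions for illustration, not the definitions from the pain study on the slide.

    import pandas as pd
    from scipy.stats import chi2_contingency
    from statsmodels.distributions.empirical_distribution import ECDF

    df = pd.read_csv("pro_changes.csv")   # hypothetical columns: arm, change_from_baseline
    THRESHOLD = 2                         # illustrative clinically meaningful improvement

    # Classify responders; per the sample size discussion later, patients with
    # missing change scores could instead be counted as non-responders.
    df["responder"] = df["change_from_baseline"] >= THRESHOLD
    table = pd.crosstab(df["arm"], df["responder"])
    chi2, p, dof, expected = chi2_contingency(table)
    print(table, p)                       # scipy.stats.fisher_exact is an alternative for sparse tables

    # Empirical CDFs of the change scores by arm, to check that the between-arm
    # difference is not driven by the single chosen cut point.
    for arm, g in df.groupby("arm"):
        ecdf = ECDF(g["change_from_baseline"].dropna())
        print(arm, ecdf(THRESHOLD))       # proportion of patients at or below the threshold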
Other types of analyses you may think about including in your statistical analysis plan, which are somewhat unique to PROs and which you maybe don't see for some of the clinical endpoints, would be confirming some of the PRO measurement properties. This is an example where a responder definition was written in up front for the study; it was defined as a greater than or equal to 50 percent reduction in the total symptom score for this particular measure. What this analysis looks at is the relationship between the responder definition and patients' subjective global ratings of how much they've changed over time. And what this shows is that among patients considered responders, 86 percent felt that they were much improved or very much improved, versus 77 percent of non-responders who felt that they were minimally improved, unchanged, minimally worse, much worse, or very much worse. Essentially, this is showing that the difference really was clinically meaningful to patients, that this was a noticeable difference in their scores.

All right, let's talk about data collection, which is also important when we're constructing our statistical analysis plan. There are two components: I think both about how we want to administer the PRO and about what additional data we need to carry out our analysis, because both of these need to be written upfront into the protocol. In terms of administration, I often recommend that the PROs be administered before other clinical assessments or procedures, to avoid an immediate bias if a patient is informed of a negative or a positive outcome. I also think about strategies to minimize missing data, and I try to implement those upfront in the protocol. So, I think about what time points are feasible; I think about how compliance is going to be monitored prospectively; and I think about how the system for data collection can be implemented or modified in order to minimize missing data. In terms of other data elements that I collect in addition to the PRO scores, I often collect the date and time of administration. I collect the mode of administration or other administration details, such as the language of administration. I collect compliance information, including reasons for missing data if there's a missed assessment at some point for a given patient.
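Because those compliance data feed directly into the later analysis and reporting, it can help to compute compliance rates prospectively from the collected administration records. Below is a small sketch with an assumed tracking-file schema, tabulating completed versus expected questionnaires, and the reasons for missing ones, at each scheduled time point.

    import pandas as pd

    # Hypothetical tracking file: one row per patient per scheduled assessment, with
    # columns patient_id, time_point, completed (1 or 0), and reason_missing.
    forms = pd.read_csv("pro_tracking.csv")

    # Compliance at each time point: completed questionnaires over expected questionnaires.
    compliance = forms.groupby("time_point")["completed"].agg(expected="size", completed="sum")
    compliance["rate"] = compliance["completed"] / compliance["expected"]
    print(compliance)

    # Reasons for missed assessments, the kind of detail the CONSORT PRO flow diagram
    # discussed below asks for.
    print(forms.loc[forms["completed"] == 0, "reason_missing"].value_counts())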

I also think about what auxiliary data I might need if I'm planning on using imputation later in my analysis strategy. The other thing I think about, just to highlight some of these data requirements, is how I'm going to actually present these results after the fact, and I want to make sure that I'm capturing the data I need in order to fill in the details when I go to present them. So, I often think about the CONSORT PRO extension, which was published last year and is the topic of a later webinar in this series. But briefly, this is the CONSORT diagram recommended in the PRO extension, and the things that pop out include citing specific reasons for missing booklets and including information about language, if assessments were missed because of language. I also think about reasons for loss to follow-up and making sure that I capture that information as well. And finally, at the very bottom, we ultimately need to know how many patients are evaluable, and we need to be able to delineate, all through this flow diagram, what's happening at each level in terms of who's missing data and who is actually providing data. These are important to think about in terms of your data collection. So, with that, I'm going to hand this over to Diane Fairclough.

>> Diane Fairclough: Welcome. I am from the Colorado School of Public Health and am also a biostatistician. So, multiplicity is generated by multiple endpoints. In cancer trials, we have survival, disease control, and toxicity, as well as our PROs, and the PROs need to be evaluated in the context of all of those endpoints. How we incorporate them depends on the goals of the study. In registration trials, we're going to have to have very strict control of the Type I error. In early development, we might have looser control, because we expect confirmation in a later study. But our strategy is going to depend on a lot of issues, a few of which are: which of the endpoints would most influence clinical practice? In palliative care it might be quality of life, but quality of life is unlikely to outweigh survival in a stage I population. We also need to think through the path between the intervention and the distant outcomes, which outcomes are the most proximal and which are the most distal. Okay. Part of the strategy is to establish a hierarchy of endpoints, so we'll have primary or co-primary endpoints, secondary endpoints, and then exploratory endpoints. If a PRO is going to be part of a claim, in most cases it will need to be designated as a primary endpoint. There are multiple procedures for adjustment of the alpha values to control the Type I error. There's a whole family called gate-keeping or sequential approaches, in which the secondary endpoint is only tested if the primary endpoint is statistically significant. We just mentioned some cases where we would actually want to look at secondary endpoints when the primary endpoint is not significant, but that needs to be absolutely and explicitly stated beforehand if that's going to be the strategy; developing that strategy post hoc is just not going to fly with the scientific community. And then there are the typical alpha adjustment methods that we all know. Bonferroni is fairly conservative, and we can be a little bit less conservative if we use step-up or step-down methods such as Holm or Hochberg, which are probably the most well-known.
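As a small illustration of those adjustment methods, and only a sketch with made-up p-values, the statsmodels multipletests helper applies Bonferroni, Holm, and related step-up or step-down corrections to a set of endpoint p-values.

    from statsmodels.stats.multitest import multipletests

    # Hypothetical unadjusted p-values for several PRO endpoints tested at alpha = 0.05.
    pvals = [0.012, 0.030, 0.047, 0.200]

    for method in ("bonferroni", "holm", "simes-hochberg"):
        reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
        print(method, p_adj.round(3), reject)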

And then, also note that multiplicity can arise from multiple treatment groups, from subgroup analyses, and from looking at multiple time points. One part of the strategy for handling multiple endpoints is to consider summary measures or statistics. Summaries over time have the advantage that they significantly reduce the number of tests. Examples are the area under the curve, which really translates to the average difference between the curves, or a mean of the post-baseline measurements minus baseline. The choice will depend on what we expect and what's going to be important. Other summaries include those that are over measures. The FACT-G total score is a sum of all of the sub-scales of the FACT-G; the SF-36 physical component summary is a weighted average of all eight sub-scales of the SF-36. These can be harder to interpret in some cases, if perhaps one or two of the components change and the rest don't, or if components change in different directions. But if there are consistent changes over all of the components, then the summary actually improves the power to detect differences.

So, for example, what would you expect might be the best summary for this expected trajectory? Basically, we're seeing a steady decline over time, and so the slope would actually be a good summary statistic if we expected this kind of trajectory. Alternatively, in some cases we might expect a decline with a plateau, or an improvement with a plateau, and then the summary statistic, particularly if we're interested in what happens in the later part of the study, might be the average of the later assessments minus the baseline. And finally, we might see a temporary decline, and we might consider the area under the curve for this. In each case, we think through what we expect the trajectory to be and match our endpoint to that.

So, now transitioning to models for longitudinal data. There are two primary models: mixed-effects growth curve models and the mixed model for repeated measures. The primary conceptual difference between these is how we think about time. If we think about time as being continuous, which is typical in extended-duration trials, it might be weeks from randomization or weeks from diagnosis, then we would consider a mixed-effects growth curve model. The advantage is that it can accommodate mistimed assessments, which are really quite likely in many cancer trials as treatments get delayed or follow-up intervals lengthen, maybe due to a rest time necessary to overcome hematopoietic toxicity. In this case, the covariance structure is usually modeled with a random intercept and slope plus residual error.
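For that continuous-time case, a mixed-effects growth curve model with a random intercept and slope can be fit, for example, with statsmodels in Python (equivalent procedures exist in SAS, SPSS, Stata, and R, as noted shortly). The sketch assumes a long-format data set with hypothetical columns patient_id, arm, week, and score.

    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical long-format data: one row per patient per assessment,
    # with columns patient_id, arm, week (continuous time), and score.
    long = pd.read_csv("pro_long.csv").dropna(subset=["score"])

    # Growth curve model: fixed effects for arm, time, and their interaction;
    # random intercept and slope per patient, plus residual error.
    model = smf.mixedlm("score ~ week * arm", data=long,
                        groups=long["patient_id"], re_formula="~week")
    fit = model.fit()
    print(fit.summary())   # the week:arm term estimates the between-arm difference in slopes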

The mixed model for repeated measures, in contrast, thinks about time as ordered events. It's more typical of a study that has a small number of assessments, three, four, maybe five in total, with the assessments classified as events: for example, baseline, early on therapy, late on therapy, and then off therapy. The covariance structure is then typically an unstructured k-by-k covariance matrix. These models are now so standard that they can be implemented in pretty much every software package: SAS, SPSS, Stata, R, which covers the majority of analysis packages. An important aspect of these models is that they are likelihood-based and that they assume the missingness is missing at random, which translates right into talking about what we do about missingness.

So, missing data in oncology studies are rarely missing completely at random, particularly the missingness that's associated with disease progression or mortality. It might be missing completely at random for administrative missingness, where somebody forgets to do the assessment, but it's rarely random for the unpreventable missing data. So, what do we do? Well, the consensus is to use methods that assume missing at random, with supplementary sensitivity analyses. The missing at random methods are generally biased when estimating change over time within treatment groups when there's substantial drop-out due to disease progression or mortality, but they are often robust when comparing the differences between two treatment groups. This is particularly true because we never compare an active treatment to a placebo in oncology trials; we're always comparing two active treatments. There will be very rare exceptions where the sensitivity analyses are not needed. I was actually associated with one adjuvant trial where we had less than 5 percent missing data over the whole study, and a sensitivity analysis in that setting is unlikely to make a difference. So, what do we do? First, we avoid methods that assume missing completely at random, and these include limiting the analysis to only those patients who complete all the assessments. We should also avoid, and this is one that I see done often, though I don't think people appreciate that it makes stronger assumptions about missingness, repeated univariate analyses. That could be looking at a single time point, or a time point relative to baseline; the missing data assumptions are much stronger for that type of analysis than for using all available data in a mixed model. The other thing is to avoid criteria that exclude significant numbers of participants, for example, requiring at least one follow-up assessment when there's a high rate of early drop-out. If the drop-out between baseline and the next assessment is one or two percent, that's not likely to impact the results; but if it's 25 percent, that's going to create significant bias by excluding all those patients.

Pretty much all the methods that are recommended for sensitivity analysis only really work if we can convert a missing not at random problem into a missing at random problem. In imputation, we basically hope that we have auxiliary information such that, conditional on that information, the data become missing at random; assessments by a caregiver might be one of those variables, or time to death might be another. Pattern mixture models assume that the data are missing at random within each pattern, and if we can satisfy that assumption, then we can convert the problem and deal with the missing data.

And the joint models assume that the data are missing at random conditionally on the jointly modeled outcomes; I'll go into a little more detail on each of these. But pre-planning is important to make sure that we've actually measured that auxiliary information. So, multiple imputation is becoming a very popular method, and it's readily available in most software. But, unfortunately, a lot of the time it makes the problem appear to go away without really solving it, and it's a lot of work to get results that would be similar to a reference model, those mixed models, unless there are auxiliary data that are strongly correlated with both the outcome and the missing data mechanism. Otherwise, the values imputed under the missing at random assumption, and the results, will be similar to the maximum likelihood model. Note that these auxiliary data would not be covariates like age or gender, but rather something more like a surrogate variable, something that would be affected by the intervention. It might be, as I mentioned before, a caregiver's assessment, or it might be time to disease progression or to death. So, specifying in your analysis plan that multiple imputation will be used is simply insufficient. I've definitely seen analysis plans where that's all they say, but we really need to know the method of imputation. There are six, seven, eight different methods: probably the best known is MCMC, but alternatives include predictive mean matching, regression, and the approximate Bayesian bootstrap. So, many methods, and you need to be specific about which one. You also need to be specific about the strategy you're going to use for longitudinal assessments. Most multiple imputation strategies were designed for cross-sectional data, so the question is: are you going to throw all your longitudinal assessments into the pot and impute them all at one time, or are you going to first impute the baseline missing values, then the second time point's missing values conditional on the baseline, and handle it sequentially? And then, what ancillary information will be included in the multiple imputation?
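As one rough sketch of the level of specificity an analysis plan might spell out, here is a multiple imputation example using the MICE implementation in statsmodels, which imputes via chained equations with predictive mean matching, and which pools the analysis-model estimates across imputations by Rubin's rules. The column names, the assumed auxiliary caregiver score, and the number of imputations are illustrative choices, not recommendations.

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.imputation import mice

    # Hypothetical wide-format data with missing values in the follow-up PRO score;
    # caregiver_score is the auxiliary variable thought to be related to both the
    # outcome and the missing data mechanism.
    df = pd.read_csv("pro_wide.csv")[["arm", "score_baseline", "score_week12", "caregiver_score"]]
    df["arm"] = (df["arm"] == "B").astype(int)   # assumed two arms, coded numerically for the imputer

    imp = mice.MICEData(df)   # chained equations; imputation via predictive mean matching
    analysis = mice.MICE("score_week12 ~ arm + score_baseline + caregiver_score", sm.OLS, imp)
    results = analysis.fit(10, 20)   # 10 burn-in cycles, 20 imputations, pooled by Rubin's rules
    print(results.summary())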

The second method that's popular for sensitivity analyses is pattern mixture models. The advantage is that it's easy to visualize the trajectory within patterns, and the estimates are unbiased within the patterns, unless the patterns are pooled; and in a lot of actual applications you do see people pooling many patterns so that they can estimate all of the parameters along the entire time frame. The challenge is then justifying the assumptions that are used to estimate all the parameters. Let me illustrate that. This is data where the means are displayed by the time of the last assessment; the star represents those that only had the baseline assessment. And you can think of different strategies for extrapolating these curves that might be reasonable, something that would seem sensible. But this is another study where I've plotted the same type of patterns, and how would you define the restrictions for this study? I don't know how I would, and I definitely don't know how I would have guessed how to do it prior to the beginning of the study, before I actually saw these patterns. So, those are the challenges of pattern mixture models: specifying how the patterns will be defined, and how the parameters that typically need to be extrapolated will be estimated. The third typical strategy is joint models, where the outcome of interest is modeled jointly with auxiliary data. That might be, for example, a caregiver assessment, or time to an event such as disease progression or death. Again, the assumption is that the data are missing at random conditional on the observed outcome data and the auxiliary data, and it requires pre-planning to make sure that those data are collected.

So, the final section we're going to talk about is sample size and power. The way you do your calculations should be transparent, and the hardest part of the procedure is actually specifying the effect that you feel is clinically significant and that the study should be powered to detect. Sometimes it's based on a minimally important difference or a clinically meaningful difference; thinking about what's clinically meaningful on an individual basis versus on a group basis is a bit of a challenge. There is a kind of generic effect size that we've been using in quality of life of half a standard deviation. Cohen defines that as a moderate effect, and it's been empirically observed enough times that it's a reasonable poor man's approach: if you have no other information, that's where you start. But whatever you do, you should cite the source of that difference. The Type I error, or alpha, is typically .05, but it might vary if you're using a Bonferroni correction or some other method for handling multiplicity. Power is typically set at either 80 or 90 percent. The sample size calculations also need to incorporate how we're going to handle missing data, or the expected missing data rate. For the simplest case, if we were just comparing means at one pre-specified time point and we wanted to detect half a standard deviation, we would need 64 patients per arm, or 128 patients total. If we then expected 20 percent dropout by that time point, we would inflate for the missing data, and that would require 154 patients enrolled.
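That arithmetic can be reproduced with a standard power routine; below is a small sketch using statsmodels, where the 20 percent inflation is applied multiplicatively as in the example (dividing the total by 0.8 would be a slightly more conservative convention).

    import math
    from statsmodels.stats.power import TTestIndPower

    # Two-sided t-test, effect size of half a standard deviation, alpha 0.05, 80 percent power.
    n_per_arm = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                            power=0.80, alternative="two-sided")
    n_per_arm = math.ceil(n_per_arm)        # about 64 per arm
    n_total = 2 * n_per_arm                 # 128 total

    # Inflate enrollment for an expected 20 percent dropout by the analysis time point.
    n_enrolled = math.ceil(n_total * 1.2)   # about 154 patients enrolled
    print(n_per_arm, n_total, n_enrolled)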

If, alternatively, our outcome were binary, we would use a typical chi-square calculation. If we wanted to detect 30 percent versus 15 percent with 80 percent power, we'd need 121 patients per arm, or 242 total. Do we also inflate this for missing data? Well, maybe; but if we're going to use the responder definition that Amylou talked about earlier, we would actually classify patients with missing data as non-responders. Either way, that needs to be clear in our statement of how we've derived the expected sample size.

What do we do with longitudinal endpoints? Well, the first step is to really define what the endpoint is and how it is defined as a function of the repeated measures. Is there a critical point in time, so that we might just look at the post-treatment assessments? Is there a summary measure, an AUC, or a mean of the post-baseline assessments? The example I have here uses the average of the last two assessments minus the baseline as the summary measure. The second step is defining the expected differences at each time point. In this trial, we expect no difference at baseline, but possibly a five-point difference at each of the follow-up assessments, and then we have to specify a working covariance. In this case, I'm going to assume that the standard deviation, sigma, is 15, and I'm going to look at different scenarios for the correlation over time, between 0.5 and 0.8, which is a pretty typical range for patient-reported outcomes; so this is a fairly simple working covariance assumption. The third step is to calculate the delta associated with our summary measure and the standard deviation of that summary measure, and to do so we take linear combinations of those values. We take the weighted combination of the expected values 0, 5, 5, 5, and it turns out that's going to be five. Then we do some pre- and post-multiplication of the covariance structure, and the result is 1.5 times (1 minus the correlation) times the standard deviation squared. The fourth step is the easy part: we put that delta, the difference, and the standard deviation into a standard sample size calculation. In this case, we're going to use rho equal to 0.5, which will be the most conservative of our sample size calculations. And, going back up, we also still need to inflate that for drop-out.

So, I'd like to thank you. We have given you a 30,000-foot view of a very complex issue, occasionally dropping down to ground level, but there is a long way to go from this brief presentation. Some other resources that you ought to consider looking at as first steps are the FDA guidance document, some general papers on PRO statistical considerations, and, as my conflict-of-interest disclosure, my book on missing data in the design and analysis of quality of life studies. Thank you.

>> Diane St. Germain: Okay. Thank you very much. That concludes this webinar in the series. Thank you.