<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>1477-7525-2-26</ui>
   <ji>1477-7525</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>Sample size and power estimation for studies with health related quality of life outcomes: a comparison of four methods using the SF-36</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Walters</snm>
               <mi>J</mi>
               <fnm>Stephen</fnm>
               <insr iid="I1"/>
               <email>s.j.walters@shef.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Sheffield Health Economics Group, School of Health and Related Research, University of Sheffield, Regent Court, 30 Regent St, Sheffield, United Kingdom, S1 4DA</p>
            </ins>
         </insg>
         <source>Health and Quality of Life Outcomes</source>
         <issn>1477-7525</issn>
         <pubdate>2004</pubdate>
         <volume>2</volume>
         <issue>1</issue>
         <fpage>26</fpage>
         <url>http://www.hqlo.com/content/2/1/26</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">15161494</pubid>
               <pubid idtype="doi">10.1186/1477-7525-2-26</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>16</day>
               <month>4</month>
               <year>2004</year>
            </date>
         </rec>
         <acc>
            <date>
               <day>25</day>
               <month>5</month>
               <year>2004</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>25</day>
               <month>5</month>
               <year>2004</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2004</year>
         <collab>Walters; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.</collab>
      </cpyrt>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>We describe and compare four different methods for estimating sample size and power, when the primary outcome of the study is a Health Related Quality of Life (HRQoL) measure. These methods are: 1. assuming a Normal distribution and comparing two means; 2. using a non-parametric method; 3. Whitehead's method based on the proportional odds model; 4. the bootstrap. We illustrate the various methods, using data from the SF-36. For simplicity this paper deals with studies designed to compare the effectiveness (or superiority) of a new treatment compared to a standard treatment at a single point in time. The results show that if the HRQoL outcome has a limited number of discrete values (&lt; 7) and/or the expected proportion of cases at the boundaries is high (scoring 0 or 100), then we would recommend using Whitehead's method (Method 3). Alternatively, if the HRQoL outcome has a large number of distinct values and the proportion at the boundaries is low, then we would recommend using Method 1. If a pilot or historical dataset is readily available (to estimate the shape of the distribution) then bootstrap simulation (Method 4) based on this data will provide a more accurate and reliable sample size estimate than conventional methods (Methods 1, 2, or 3). In the absence of a reliable pilot set, bootstrapping is not appropriate and conventional methods of sample size estimation or simulation will need to be used. Fortunately, with the increasing use of HRQoL outcomes in research, historical datasets are becoming more readily available. Strictly speaking, our results and conclusions only apply to the SF-36 outcome measure. Further empirical work is required to see whether these results hold true for other HRQoL outcomes. 
However, the SF-36 has many features in common with other HRQoL outcomes: it is multi-dimensional, has ordinal or discrete response categories with upper and lower bounds, and has skewed distributions. We therefore believe that these results and conclusions using the SF-36 will be appropriate for other HRQoL measures.</p>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Introduction</p>
         </st>
         <p>Health Related Quality of Life (HRQoL) measures are becoming more frequently used in clinical trials as primary endpoints. Investigators are now asking statisticians for advice on how to plan (e.g. estimate sample size) and analyse studies using HRQoL measures.</p>
         <p>Sample size calculations are now mandatory for many research protocols and are required to justify the size of clinical trials in papers before they will be accepted by journals <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. Thus, when an investigator is designing a study to compare the outcomes of an intervention, an essential step is the calculation of sample sizes that will allow a reasonable chance (power) of detecting a predetermined difference (effect size) in the outcome variable, at a given level of statistical significance. Sample size is critically dependent on the purpose of the study, the outcome measure and how it is summarised, the proposed effect size and the method of calculating the test statistic <abbrgrp><abbr bid="B2">2</abbr></abbrgrp>. For simplicity in this paper we will assume that we are interested in comparing the effectiveness (or superiority) of a new treatment compared to a standard treatment at a single point in time.</p>
         <p>HRQoL measures such as the Short Form (SF)-36, Nottingham Health Profile (NHP) and European Organisation for Research and Treatment of Cancer (EORTC) QLQ-C30 are described in Fayers and Machin <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> and are usually measured on an ordered categorical (ordinal) scale. This means that responses to individual questions are usually classified into a small number of response categories, which can be ordered, for example, poor, moderate and good. In planning and analysis, the responses are often analysed by assigning equally spaced numerical scores to the ordinal categories (e.g. 0 = 'poor', 1 = 'moderate' and 2 = 'good') and the scores across similar questions are then summed to generate a HRQoL measurement. These 'summated scores' are usually treated as if they were from a continuous distribution and were Normally distributed. We will also assume that there exists an underlying continuous latent variable that measures HRQoL (although not necessarily Normally distributed), and that the actual measured outcomes are ordered categories that reflect contiguous intervals along this continuum.</p>
         <p>However, this ordinal scaling of HRQoL measures may lead to several problems in determining sample size and analysing the data <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr></abbrgrp>. The advantages in being able to treat HRQoL scales as continuous and Normally distributed are simplicity in sample size estimation and statistical analysis. Therefore, it is important to examine such simplifying assumptions for different instruments and their scales. Since HRQoL outcome measures may not meet the distributional requirements (usually that the data have a Normal distribution) of parametric methods of sample size estimation and analysis, conventional statistical advice would suggest that non-parametric methods be used to analyse HRQoL data <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>.</p>
         <p>The bootstrap is an important non-parametric method for estimating sample size and analysing data (including hypothesis testing, standard error and confidence interval estimation) <abbrgrp><abbr bid="B6">6</abbr></abbrgrp>. The bootstrap is a data based simulation method for statistical inference, which involves repeatedly drawing random samples from the original data, with replacement. It seeks to mimic, in an appropriate manner, the way the sample is collected from the population in the bootstrap samples from the observed data. The 'with replacement' means that any observation can be sampled more than once. HRQoL outcome measures actually generate data with discrete, bounded and non-standard distributions. So, in theory, computer intensive methods such as the bootstrap that make no distributional assumptions may therefore be more appropriate for estimating sample size and analysing HRQoL data than conventional statistical methods.</p>
         <p>Conventional methods of sample size estimation for studies with HRQoL outcomes are extensively discussed in Fayers and Machin <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. However, they did not use the bootstrap to estimate sample sizes for studies with HRQoL outcomes. The aim of this paper is therefore to describe and compare four different methods, including the bootstrap, for estimating sample size and power when the primary outcome is a HRQoL measure.</p>
         <p>To illustrate this, we use some HRQoL data from a randomised controlled trial, the Community Postnatal Support Worker Study (CPSW), which aimed to compare the difference in health status in a group of women who were offered postnatal support (intervention) from a community midwifery support worker compared with a control group of women who were not offered support <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. The primary outcome (used to estimate sample size for this study) was the general health dimension of the SF-36 at 6 weeks postnatally.</p>
      </sec>
      <sec>
         <st>
            <p>Methods</p>
         </st>
         <sec>
            <st>
               <p>SF-36 Health Survey</p>
            </st>
            <p>The SF-36 is the most commonly used health status measure in the world today <abbrgrp><abbr bid="B8">8</abbr></abbrgrp>. It originated in the USA <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>, but has been validated for use in the UK <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. It contains 36 questions measuring health across eight different dimensions &#8211; physical functioning (PF), role limitation because of physical health (RP), social functioning (SF), vitality (VT), bodily pain (BP), mental health (MH), role limitation because of emotional problems (RE) and general health (GH). Responses to each question within a dimension are combined to generate a score from 0 to 100, where 100 indicates "good health". Thus, the SF-36 generates a profile of HRQoL outcomes on eight dimensions (see Figure <figr fid="F1">1</figr>).</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Distribution of the eight SF-36 dimensions in the Sheffield population, females aged 16&#8211;45 (n = 487) [10]</p>
               </caption>
               <text>
                  <p>Distribution of the eight SF-36 dimensions in the Sheffield population, females aged 16&#8211;45 (n = 487) [10].</p>
               </text>
               <graphic file="1477-7525-2-26-1"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Which sample size formulae?</p>
            </st>
            <p>In principle, there are no major differences in planning a study using HRQoL outcomes, such as the SF-36, to those using conventional clinical outcomes. Pocock <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> outlines five key questions regarding sample size:</p>
            <p>1. What is the main purpose of the trial?</p>
            <p>2. What is the principal measure of patient outcome?</p>
            <p>3. How will the data be analysed to detect a treatment difference?</p>
            <p>4. What type of results does one anticipate with standard treatment?</p>
            <p>5. How small a treatment difference is it important to detect and with what degree of certainty?</p>
            <p>Given answers to all of the five questions above, we can then calculate a sample size.</p>
            <p>The choice of sample size formula depends strictly on the way the data will be analysed, which in turn depends on specific characteristics of the data. For this reason this paper is not only a comparison of four methods of sample size calculation, but also a comparison of the power of four different methods of analysis. We describe four methods of sample size estimation when using the SF-36 in comparative clinical trials of two treatments (Table <tblr tid="T1">1</tblr>). The first method (Method 1) assumes the various individual dimensions of the SF-36 are continuous and Normally distributed. The second method (Method 2) assumes the SF-36 dimensions are continuous. The third method (Method 3) assumes the SF-36 is an ordered categorical outcome. The fourth method (Method 4) uses a bootstrap approach.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Effect size and sample size formulae</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Method 1</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Method 2</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Method 3</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Assumptions</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Normally distributed continuous data</p>
                     </c>
                     <c ca="left">
                        <p>Non-normally distributed continuous data</p>
                     </c>
                     <c ca="left">
                        <p>Ordinal data, constant and relatively small odds ratio, large sample size</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Summary Measure</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Mean and mean difference</p>
                     </c>
                     <c ca="left">
                        <p>Median</p>
                     </c>
                     <c ca="left">
                        <p>Odds Ratio (<it>OR</it><sub><it>Ordinal</it></sub>)</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Hypothesis test</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>Two-independent samples <it>t</it>-test</p>
                     </c>
                     <c ca="left">
                        <p>Mann-Whitney <it>U </it>test</p>
                     </c>
                     <c ca="left">
                        <p>Mann-Whitney <it>U </it>test or equivalent proportional odds model</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Effect Size</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <graphic file="1477-7525-2-26-i1.gif"/>
                        </p>
                     </c>
                     <c ca="left">
                        <p><it>p</it><sub><it>Noether </it></sub>= <it>Pr</it>(<it>Y </it>><it>X</it>)</p>
                     </c>
                     <c ca="left">
                        <p>
                           <graphic file="1477-7525-2-26-i2.gif"/>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>Sample size formulae</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <graphic file="1477-7525-2-26-i3.gif"/>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <graphic file="1477-7525-2-26-i4.gif"/>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <graphic file="1477-7525-2-26-i5.gif"/>
                        </p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><it>&#916;</it><sub>Normal </sub>is the standardised effect size index, <it>&#956;</it><sub>T </sub>and <it>&#956;</it><sub>C </sub>are the expected group means of outcome variable under the null and alternative hypotheses and <it>&#963; </it>is the standard deviation of outcome variable (assumed the same under the null and alternative hypotheses). p<sub>Noether </sub>is an estimate of the probability that an observation drawn at random from population Y would exceed an observation drawn at random from population X. Let <it>&#960;</it><sub>iT </sub>be the probability of being in category i in Group T and <it>&#947;</it><sub>iT </sub>be the expected cumulative probability of being in category i or less in Group T (i.e. <it>&#947;</it><sub>iT </sub>= Pr(Y &#8804; y<sub>i</sub>)). <graphic file="1477-7525-2-26-i6.gif"/> is the combined mean (of the proportion of patients expected in groups T and C) for each category i. z<sub>1-<it>&#945;</it>/2 </sub>and z<sub>1-<it>&#946; </it></sub>are the appropriate values from the standard Normal distribution for the 100 (1 - <it>&#945;</it>/2) and 100 (1 - <it>&#946;</it>) percentiles respectively. Number of subjects per group n for a two-sided significance level <it>&#945; </it>and power 1 - <it>&#946;</it>.</p>
               </tblfn>
            </tbl>
            <p>In this paper the bootstrap has two roles. It is one of the four methods of sample size calculation and consequently analysis but it is also the method used to estimate the power curves presented in the figures. The bootstrap, in the way used in this paper, is a procedure for evaluating the performance of the statistical procedures, including tests. The bootstrap is non parametric in the sense that it evaluates the performance of any test statistic without making assumptions about the form of the distribution. For the methods of sample size estimation, we consider three test statistics, Methods 1, 2 and 3 and evaluate two of them in two ways, one using the usual assumptions (Normality or continuity), and the other by generating bootstrap distributions from the data.</p>
         </sec>
         <sec>
            <st>
               <p>Method 1 Normally distributed continuous data &#8211; comparing two means</p>
            </st>
            <p>Suppose we have two independent random samples x<sub>1</sub>, x<sub>2</sub>,...,x<sub>m </sub>and y<sub>1</sub>, y<sub>2</sub>,...,y<sub>n</sub>, of HRQoL data of size m and n respectively. The x's and y's are random samples from continuous HRQoL distributions having cumulative distribution functions (cdfs), F<sub>X </sub>and F<sub>Y </sub>respectively. We will consider situations where the distributions have the same shape, but the locations may differ. Thus if <it>&#948; </it>denotes the location difference (i.e. mean (y) - mean (x) = <it>&#948;</it>), then F<sub>Y</sub>(y) = F<sub>X</sub>(y - <it>&#948;</it>), for every y. We shall focus on the null hypothesis H<sub>0</sub>: <it>&#948; </it>= 0 against the alternative H<sub>A</sub>: <it>&#948; </it>&#8800; 0. We can test these hypotheses using an appropriate significance test (e.g. <it>t</it>-test). With a Normal distribution under the location shift assumption and with n = m, the necessary sample size to achieve a power of 1-<it>&#946; </it>is given in Table <tblr tid="T1">1</tblr>.</p>
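            <p>As a minimal sketch of this calculation, the Python fragment below uses the widely quoted Normal-approximation formula n = 2(z<sub>1-<it>&#945;</it>/2 </sub>+ z<sub>1-<it>&#946;</it></sub>)<sup>2</sup>/<it>&#916;</it><sup>2</sup> per group. The formula in Table <tblr tid="T1">1</tblr> is supplied as an image and may include a small additional correction term, so this should be read as an approximation rather than a transcription of the table. The function name and the input values (a 5-point difference with a standard deviation of 20) are illustrative assumptions and are not taken from the CPSW data.</p>
            <p><![CDATA[
```python
from math import ceil
from statistics import NormalDist


def n_per_group_normal(delta, sigma, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-sided comparison of two means.

    Assumes Normally distributed outcomes, equal group sizes and a common
    standard deviation `sigma`. `delta` is the difference in means to detect.
    """
    if delta == 0:
        raise ValueError("delta must be non-zero")
    z = NormalDist()
    effect = delta / sigma                  # standardised effect size
    z_alpha = z.inv_cdf(1 - alpha / 2)      # two-sided significance level
    z_beta = z.inv_cdf(power)               # power = 1 - beta
    return ceil(2 * (z_alpha + z_beta) ** 2 / effect ** 2)


# Hypothetical inputs: detect a 5-point difference when the SD is 20.
n = n_per_group_normal(delta=5.0, sigma=20.0)
print(n)
```
]]></p>
            <p>Because the required n is inversely proportional to the square of the standardised effect size, halving the difference to be detected roughly quadruples the number of subjects needed per group.</p>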
         </sec>
         <sec>
            <st>
               <p>Method 2 continuous data using non-parametric methods</p>
            </st>
            <p>If the HRQoL outcome data (i.e. the GH dimension of the SF-36) is assumed continuous and plausibly not sampled from a Normal distribution then the most popular (not necessarily the most efficient), non-parametric test for comparing two independent samples is the two-sample Mann-Whitney U (also known as the Wilcoxon rank sum test) <abbrgrp><abbr bid="B12">12</abbr></abbrgrp>.</p>
            <p>Suppose (as before) we have two independent random samples of x's and y's and we want to test the hypothesis that the two samples have come from the same population against the alternative that the Y observations tend to be larger than the X observations. As a test statistic we can use the Mann-Whitney (MW) statistic U, i.e., U = #(y<sub>j </sub>> x<sub>i</sub>), i = 1,...,m; j = 1,...,n, which is a count of the number of times the y<sub>j</sub>s are greater than the x<sub>i</sub>s. The magnitude of U has a meaning, because U/nm is an estimate of the probability that an observation drawn at random from population Y would exceed an observation drawn at random from population X, i.e. Pr(Y > X).</p>
            <p>Noether <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> derived a sample size formula for the MW test (see Table <tblr tid="T1">1</tblr>), using an effect size p<sub>Noether</sub>, (i.e. Pr(Y > X)), that makes no assumptions about the distribution of the data (except that it is continuous), and can be used whenever the sampling distribution of the test statistic U can be closely approximated by the Normal distribution, an approximation that is usually quite good except for very small n <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>.</p>
            <p>Hence to determine the sample size, we have to find the 'effect size' p<sub>Noether </sub>or the equivalent statistic Pr(Y > X). There are several ways of estimating p<sub>Noether </sub>under various assumptions. One non-parametric possibility is p<sub>Noether </sub>= U/nm. Unfortunately, this can only be estimated after we have collected the data and calculated the U statistic, or by computer simulation (as we shall see later). If we assume that X ~ N(<it>&#956;</it><sub>X</sub>, <it>&#963;</it><sup>2</sup><sub>X</sub>) and Y ~ N(<it>&#956;</it><sub>Y</sub>, <it>&#963;</it><sup>2</sup><sub>Y</sub>) then a parametric estimate of Pr(Y > X) using the sample estimates of the mean and variance (<graphic file="1477-7525-2-26-i7.gif"/>) is given by <abbrgrp><abbr bid="B15">15</abbr></abbrgrp>:</p>
            <p>
               <graphic file="1477-7525-2-26-i8.gif"/>
            </p>
            <p>where <it>&#934; </it>is the Normal cumulative distribution function.</p>
            <p>If we assume the SF-36 is Normally distributed then equation 1 allows the calculation of two comparable 'effect sizes' p<sub>Noether </sub>and <it>&#916;</it><sub>Normal </sub>thus enabling the two methods of sample size estimation to be directly contrasted. If the SF-36 is not Normally distributed then we cannot use equation 1 to calculate comparable effect sizes and must rely on the empirical estimates of p<sub>Noether </sub>= U/nm calculated post hoc from the data. Alternatively, under the location shift assumption, we can use bootstrap methods to estimate p<sub>Noether</sub>.</p>
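            <p>The two steps of Method 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not a transcription of the formulae, which appear only as images. For Pr(Y > X) it uses the standard result for two independent Normal variables, <it>&#934;</it>((<it>&#956;</it><sub>Y </sub>- <it>&#956;</it><sub>X</sub>)/(<it>&#963;</it><sup>2</sup><sub>X </sub>+ <it>&#963;</it><sup>2</sup><sub>Y</sub>)<sup>1/2</sup>). For the sample size it uses the equal-allocation form of Noether's formula, n = (z<sub>1-<it>&#945;</it>/2 </sub>+ z<sub>1-<it>&#946;</it></sub>)<sup>2</sup>/(6(p<sub>Noether </sub>- 0.5)<sup>2</sup>) per group. The function names and numerical inputs are hypothetical.</p>
            <p><![CDATA[
```python
from math import ceil, sqrt
from statistics import NormalDist

Z = NormalDist()  # standard Normal distribution


def p_noether_normal(mu_x, mu_y, sd_x, sd_y):
    """Pr(Y > X) for independent Normal X and Y (parametric estimate)."""
    return Z.cdf((mu_y - mu_x) / sqrt(sd_x ** 2 + sd_y ** 2))


def n_per_group_noether(p, alpha=0.05, power=0.80):
    """Subjects per group for the Mann-Whitney test, equal allocation assumed."""
    if p == 0.5:
        raise ValueError("p = 0.5 corresponds to no effect")
    z_sum = Z.inv_cdf(1 - alpha / 2) + Z.inv_cdf(power)
    return ceil(z_sum ** 2 / (6 * (p - 0.5) ** 2))


# Hypothetical inputs: means 60 and 65, common SD of 20.
p = p_noether_normal(mu_x=60.0, mu_y=65.0, sd_x=20.0, sd_y=20.0)
n = n_per_group_noether(p)
print(round(p, 3), n)
```
]]></p>
            <p>Expressing the same hypothetical mean difference as both <it>&#916;</it><sub>Normal </sub>and p<sub>Noether </sub>in this way is what allows the sample sizes from Methods 1 and 2 to be compared on an equal footing when Normality is assumed.</p>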
         </sec>
         <sec>
            <st>
               <p>Method 3 &#8211; Ordinal data and Whitehead's Odds Ratio</p>
            </st>
            <p>Whitehead <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> has derived a method for estimating sample sizes for ordinal data and suggested the odds ratio (OR<sub>Ordinal</sub>), which is the odds of a subject being in a given category or lower in one group compared with the odds in the other group, as an effect size. To use Whitehead's formulae the proportion of subjects in each scale category for one of the groups must also be specified.</p>
            <p>Suppose there are two groups T and C and the HRQoL outcome measure of interest Y has k ordered categories y<sub>i </sub>denoted by i = 1,2,...,k. Let <it>&#960;</it><sub>iT </sub>be the probability of a randomly chosen subject being in category i in Group T and <it>&#947;</it><sub>iT </sub>be the expected cumulative probability of being in category i or less in Group T (i.e. <it>&#947;</it><sub>iT </sub>= Pr(Y &#8804; y<sub>i</sub>)). For category i, where i takes values from 1 to k-1, the OR<sub>i </sub>is given in Table <tblr tid="T1">1</tblr>.</p>
            <p>The assumption of proportional odds specifies that the OR<sub>i </sub>will be the same for all categories from i = 1 to k-1. As the derivation of the sample size formulae and analysis of data is based on the Mann-Whitney U test, Whitehead's method can be regarded as a 'non-parametric' approach, although it still relies on the assumption of a constant OR for the data. Whitehead's method also assumes a relatively small log odds ratio and a large sample size, which will often be the case in HRQoL studies where dramatic effects are unlikely <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>. Table <tblr tid="T1">1</tblr> gives the number of subjects per group n<sub>Ordinal </sub>for a two-sided significance level <it>&#945; </it>and power 1-<it>&#946;</it>.</p>
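            <p>A hedged sketch of Whitehead's calculation is given below. It assumes the equal-allocation form of the formula, n<sub>Ordinal </sub>= 6(z<sub>1-<it>&#945;</it>/2 </sub>+ z<sub>1-<it>&#946;</it></sub>)<sup>2</sup>/[(log OR<sub>Ordinal</sub>)<sup>2</sup>(1 - sum of the cubed mean category proportions)], and it derives the Group T proportions from the Group C proportions by applying a constant odds ratio to the cumulative odds. The direction in which the odds ratio is defined here (odds of being in category i or lower in T divided by the same odds in C), the function names and the four-category example distribution are all assumptions made for illustration.</p>
            <p><![CDATA[
```python
from math import ceil, log
from statistics import NormalDist


def treatment_probs(control_probs, odds_ratio):
    """Category probabilities in Group T implied by proportional odds.

    Assumes every control category has positive probability and that
    odds_ratio = odds(T in category i or lower) / odds(C in category i or lower).
    """
    probs_t, cum_c, prev_cum_t = [], 0.0, 0.0
    for pi_c in control_probs[:-1]:
        cum_c += pi_c                                # cumulative probability in C
        odds_t = odds_ratio * cum_c / (1 - cum_c)    # same OR at every cut-point
        cum_t = odds_t / (1 + odds_t)
        probs_t.append(cum_t - prev_cum_t)
        prev_cum_t = cum_t
    probs_t.append(1 - prev_cum_t)                   # last category takes the rest
    return probs_t


def n_per_group_whitehead(control_probs, odds_ratio, alpha=0.05, power=0.80):
    """Subjects per group for an ordinal outcome, equal allocation assumed."""
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    probs_t = treatment_probs(control_probs, odds_ratio)
    mean_probs = [(c + t) / 2 for c, t in zip(control_probs, probs_t)]
    denom = log(odds_ratio) ** 2 * (1 - sum(p ** 3 for p in mean_probs))
    return ceil(6 * z_sum ** 2 / denom)


# Hypothetical four-category control distribution and odds ratio.
control = [0.2, 0.3, 0.3, 0.2]
print(n_per_group_whitehead(control, odds_ratio=0.5))
```
]]></p>
            <p>The term in the mean category proportions is largest when subjects are spread evenly across categories, which is why an even distribution gives the smallest required sample size for a given odds ratio.</p>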
            <p>Whitehead's <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> method for sample determination is derived from the proportional odds model. The proportional odds model is equivalent to the MW test when there is only a 0/1 (or group) variable in the regression <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. The advantage of the proportional odds model, over the MW test is that it allows the estimation of confidence intervals for the treatment group effect and for the adjustment of the HRQoL outcome for other covariates.</p>
            <p>If the number of categories is large, it is difficult to postulate the proportion of subjects who would fall in a given category. Whitehead <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> points out that there is little increase in power (and hence saving in the number of subjects recruited) to be gained by increasing the number of categories beyond five. An even distribution of subjects within categories leads to the greatest efficiency.</p>
            <p>Shepstone <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> demonstrates how the three seemingly different effect size measures <it>&#916;</it><sub>Normal</sub>, OR<sub>Ordinal </sub>and p<sub>Noether</sub>, which are all numerical expressions of treatment efficacy can be combined into a common scale. If Y and X are the values of an outcome (higher values more preferable) for randomly selected individuals from the Treatment and Control groups respectively, then A<sub>YX </sub>= Pr(Y > X), i.e. the probability that the Treatment patient has an outcome preferable to that of the Control patient, is equivalent to the effect size statistic p<sub>Noether</sub>. If we let A<sub>XY </sub>= Pr(X > Y), i.e. the probability that a random individual from group 2 (Control) has a better outcome than a random individual from group 1 (Treatment), then</p>
            <p><it>&#955; </it>= <it>A</it><sub><it>YX </it></sub>- <it>A</it><sub><it>XY </it></sub>= Pr(<it>Y </it>><it>X</it>) - Pr(<it>X </it>><it>Y</it>) &#160;&#160;&#160; (2)</p>
            <p>and</p>
            <p>
               <graphic file="1477-7525-2-26-i9.gif"/>
            </p>
            <p>Shepstone <abbrgrp><abbr bid="B18">18</abbr></abbrgrp> shows that for ordinal and continuous outcomes <it>A</it><sub><it>YX </it></sub>- <it>A</it><sub><it>XY </it></sub>= <it>&#955; </it>and <it>A</it><sub><it>YX </it></sub>/ <it>A</it><sub><it>XY </it></sub>= <it>&#952; </it>are equivalent to the Absolute Risk Reduction (ARR) and OR for binary outcomes. A<sub>XY </sub>and A<sub>YX</sub>, or their equivalent statistics Pr(X > Y) and Pr(Y > X) can be calculated by either a parametric approach for continuous outcomes (Equation 1) via a theoretical distribution (e.g. Normal) or a non-parametric approach without any distributional assumptions via the Mann-Whitney U statistic. (Since A<sub>XY </sub>and A<sub>YX </sub>can be estimated by <graphic file="1477-7525-2-26-i10.gif"/> and <graphic file="1477-7525-2-26-i11.gif"/> where U<sub>XY </sub>and U<sub>YX </sub>are the values of the Mann-Whitney U statistics).</p>
            <p>If the outcomes are continuous and/or can be fully ranked and there are no ties in the data then Pr(X = Y) = 0 and <it>&#955; </it>= A<sub>YX </sub>- A<sub>XY </sub>= Pr(Y > X) - Pr(X > Y) and <it>&#952; </it>= A<sub>YX</sub>/A<sub>XY </sub>= Pr(Y > X)/Pr(X > Y) can be estimated exactly. Conversely, if there are a large number of ties in the data, i.e. x<sub>i </sub>= y<sub>i</sub>, (which is likely for HRQoL outcomes, with their discrete response categories) then Pr(X = Y) > 0. In this case any pairs for which x<sub>i </sub>= y<sub>i</sub>, contribute 1/2 a unit to both U<sub>YX </sub>and U<sub>XY</sub>. Hence the two A statistics A<sub>YX </sub>and A<sub>XY </sub>can only be estimated approximately and thus the approximate estimates of <it>&#952; </it>and <it>&#955; </it>in the case of ties will be denoted by <it>&#952;</it>' and <it>&#955;</it>' respectively.</p>
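            <p>The tie-handling rule described above can be illustrated with a short Python sketch. It counts all m &#215; n pairs directly, gives half a unit to each side for tied pairs, and then forms the approximate statistics <it>&#955;</it>' and <it>&#952;</it>'. The function name and the small example samples are hypothetical and are not taken from the trial data.</p>
            <p><![CDATA[
```python
def shepstone_effect_sizes(x, y):
    """Estimate A_YX, A_XY, lambda' and theta' from two samples.

    A pair with y > x counts for Y, a pair with x > y counts for X, and a
    tied pair contributes half a unit to each of U_YX and U_XY.
    """
    wins_y = wins_x = ties = 0
    for xi in x:
        for yj in y:
            if yj > xi:
                wins_y += 1
            elif yj < xi:
                wins_x += 1
            else:
                ties += 1
    pairs = len(x) * len(y)
    a_yx = (wins_y + 0.5 * ties) / pairs     # estimate of Pr(Y > X)
    a_xy = (wins_x + 0.5 * ties) / pairs     # estimate of Pr(X > Y)
    theta = a_yx / a_xy if a_xy > 0 else float("inf")
    return {"A_YX": a_yx, "A_XY": a_xy, "lambda": a_yx - a_xy, "theta": theta}


# Hypothetical scores on a 0 to 100 scale, with some tied values.
control = [50, 60, 70, 80]
treatment = [60, 70, 80, 90]
print(shepstone_effect_sizes(control, treatment))
```
]]></p>
            <p>With the half-unit rule the two A statistics always sum to one, so <it>&#955;</it>' and <it>&#952;</it>' are both determined by A<sub>YX </sub>alone.</p>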
         </sec>
         <sec>
            <st>
               <p>Method 4 &#8211; Computer simulation &#8211; the bootstrap</p>
            </st>
            <p>Methods 1 and 2 assume the HRQoL outcome is continuous and the simple location shift model is appropriate. Here this would imply that, on a certain scale, the difference in effect of the intervention compared to the control is constant or, at least, that the intervention shifts the distribution of the HRQoL scores under the control to the right (or to the left if the intervention is harmful) while keeping its shape. However, the boundedness of the SF-36 outcomes renders this location shift assumption questionable, especially if the proportion of cases at the upper limit is high. Therefore, we used bootstrap methods to compare the power of the t-test and MW test with allowance for ties for detecting a shift in location using three dimensions of the SF-36 (GH, RP and VT) as outcomes <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>. These three dimensions illustrate the different distributions of HRQoL outcomes that are likely to occur in practice.</p>
            <p>Suppose (as before) we have two independent random samples of x's and y's from continuous distributions having cdfs, F<sub>X </sub>and F<sub>Y </sub>respectively. Again we will consider situations where the distributions have the same shape, but the locations may differ; i.e. mean (y) - mean (x) = <it>&#948;</it>. If we focus on the null hypothesis H<sub>0</sub>: <it>&#948; </it>= 0 against the alternative H<sub>A</sub>: <it>&#948; </it>&#8800; 0, then we can test this hypothesis using an appropriate significance test (i.e. t-test, Mann-Whitney or proportional odds model). However, we did not evaluate the proportional odds model as part of the bootstrap. This was because the proportional odds model is equivalent to the MW test when there is only a 0/1 variable in the regression, and the p-values from the MW test and the significance of the regression coefficient for the group variable are identical <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>.</p>
            <p>The bootstrap strategy is to use pilot data to provide a non-parametric estimate <graphic file="1477-7525-2-26-i12.gif"/> of F and to use a simulation method for finding the power of the test associated with any specified sample size n if the data follow the estimated distribution functions under the null and alternative hypotheses. If we denote the distribution function estimate by <graphic file="1477-7525-2-26-i13.gif"/>, under the alternative hypothesis <it>&#948;</it>, we can estimate the approximate power, <graphic file="1477-7525-2-26-i14.gif"/> (<it>G, &#948;, &#945;, n</it>) by the following computer simulation procedure <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>.</p>
            <sec>
               <st>
                  <p>Algorithm 1</p>
               </st>
               <p>Power and sample size estimation using the bootstrap</p>
               <p>1. Draw a random sample with replacement of size 2n from F. The first n observations in the sample form a simulated sample of <b>x</b>'s, denoted by x<sub>1</sub>*,...,x<sub>n</sub>*, with estimated cdf <graphic file="1477-7525-2-26-i12.gif"/>*. Then <it>&#948; </it>is added to each of the other n observations in the sample to form the simulated sample of <b>y</b>'s, denoted by y<sub>1</sub>*,...,y<sub>n</sub>*, with estimated cdf <graphic file="1477-7525-2-26-i13.gif"/>*. (The y*'s and x*'s have been generated from the same distribution except that the distribution of the y*'s is shifted <it>&#948; </it>units to the right.)</p>
               <p>2. The test statistic, Mann-Whitney or t-test, is calculated for the <b>x</b>*'s and <b>y</b>*'s, yielding t(<b>x*, y*</b>). If t(<b>x*, y*</b>) &#8805; T<sub>1-<it>&#945;</it>/2 </sub>(where T<sub>1-<it>&#945;</it>/2 </sub>is the critical value of the test statistic), a success is recorded; otherwise a failure is recorded.</p>
               <p>3. Steps 1 and 2 are repeated B times. The estimated power of the test, <graphic file="1477-7525-2-26-i14.gif"/> (<it>G, &#948;, &#945;, n</it>), is approximated by the proportion of successes among the B repetitions. (In all cases discussed in this paper, B = 10,000.)</p>
               <p>The bootstrap procedure described in Algorithm 1 assumes a simple location shift model. For bounded HRQoL outcomes the procedure is in principle the same, but more imagination is needed to specify the effect of the new treatment in comparison with the control treatment. Under the simple location shift model, an individual improvement of <it>&#948; </it>points in HRQoL is assumed; for bounded HRQoL outcome scores we have to assume an effect <it>&#948; </it>(x) such that x + <it>&#948; </it>(x) remains in the interval determined by the lower and upper boundaries of the HRQoL outcome (between 0 and 100 in the case of the SF-36 GH dimension). One option is to assume a constant treatment effect whenever possible. We assumed a constant additional treatment effect of 5 points up to a GH score of 95: the shifted scores of patients with a GH score of 95 or more were truncated at 100.</p>
               <p>The software Resampling Stats was used for implementing Algorithm 1 <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. The bootstrap computer simulation procedure involved using SF-36 data from a general population survey based on 487 women aged 16&#8211;45 as the pilot dataset (Figure <figr fid="F1">1</figr>) <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
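               <p>The steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Resampling Stats program used in the study: the pilot array is a hypothetical stand-in for the pilot data, the function and parameter names are invented for illustration, each test is judged by comparing its two-sided p-value with <it>&#945; </it>rather than by comparing the statistic with a critical value, and the optional upper argument applies the bounded shift by capping shifted scores at the scale maximum.</p>

```python
import numpy as np
from scipy import stats


def bootstrap_power(pilot, delta, n, alpha=0.05, B=1000, test="t",
                    upper=None, seed=0):
    """Estimate power by resampling pilot data (sketch of Algorithm 1)."""
    rng = np.random.default_rng(seed)
    pilot = np.asarray(pilot, dtype=float)
    successes = 0
    for _ in range(B):
        # Step 1: draw 2n observations with replacement from the pilot data.
        sample = rng.choice(pilot, size=2 * n, replace=True)
        x = sample[:n]
        y = sample[n:] + delta          # shift the second half by delta
        if upper is not None:
            y = np.minimum(y, upper)    # bounded shift: cap at the scale maximum
        # Step 2: apply the chosen two-sided test.
        if test == "t":
            p_value = stats.ttest_ind(x, y).pvalue
        else:
            # SciPy's normal approximation includes a correction for ties.
            p_value = stats.mannwhitneyu(x, y, alternative="two-sided").pvalue
        if alpha > p_value:
            successes += 1
    # Step 3: power is the proportion of rejections among the B repetitions.
    return successes / B


# Hypothetical pilot data: scores on a 0-100 scale in steps of 5.
# These are simulated values, NOT the survey data used in the paper.
_rng = np.random.default_rng(1)
pilot = np.clip(np.round(_rng.normal(75, 20, size=487) / 5) * 5, 0, 100)

print(bootstrap_power(pilot, delta=5, n=250, B=500, test="t", upper=100))
print(bootstrap_power(pilot, delta=5, n=250, B=500, test="mw", upper=100))
```

               <p>Increasing B towards the 10,000 replications used in the paper reduces the Monte Carlo error of the estimated power at the cost of run time.</p>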
            </sec>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <sec>
            <st>
               <p>Sample size estimation &#8211; Method 1</p>
            </st>
            <p>When planning the CPSW study we went through Pocock's <abbrgrp><abbr bid="B11">11</abbr></abbrgrp> five key questions regarding sample size.</p>
            <sec>
               <st>
                  <p>What is the main purpose of the trial?</p>
               </st>
               <p>To assess whether additional postnatal support by trained Community Postnatal Support Workers could have a positive effect on the mother's general health.</p>
            </sec>
            <sec>
               <st>
                  <p>What is the principal measure of patient outcome?</p>
               </st>
               <p>The primary outcome was the SF-36 general health perception (GH) dimension at six weeks postnatally.</p>
            </sec>
            <sec>
               <st>
                  <p>How will the data be analysed to detect a treatment difference?</p>
               </st>
               <p>We believed that the mean difference in GH scores between the two groups was an appropriate comparative summary measure and that a two-independent samples <it>t</it>-test would be used to analyse the data.</p>
            </sec>
            <sec>
               <st>
                  <p>What type of results does one anticipate with standard treatment?</p>
               </st>
                <p>Unfortunately, no information was available from previous studies of new mothers to calculate a sample size based on the GH dimension of the SF-36. Therefore, as the CPSW study was only going to involve women of childbearing age, we estimated the standard deviation of the GH outcome from a previous survey of the Sheffield general population using females aged 16 to 45 (n = 487) (Figure <figr fid="F1">1g</figr>). This gave us an estimated SD of 20 <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
            </sec>
            <sec>
               <st>
                  <p>How small a treatment difference is it important to detect and with what degree of certainty?</p>
               </st>
               <p>Using the GH dimension of the SF-36, a five-point difference is the smallest score change achievable by an individual and considered as "clinically and socially relevant" <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>.</p>
               <p>Using Method 1, assuming a standard deviation <it>&#963; </it>of 20 and that a location shift or mean difference (<it>&#956;</it><sub>ET </sub>- <it>&#956;</it><sub>EC</sub>) of 5 or more points between the two groups is clinically and practically relevant, gives a standardised effect size, <it>&#916;</it><sub>Normal</sub>, of 0.25. Using this standardised effect size with a two-sided 5% significance level and 80% power gives the estimated required number of subjects per group as 253.</p>
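                <p>The arithmetic behind this figure can be checked with a short Python sketch. It assumes that Method 1 uses the usual Normal-approximation formula for comparing two means with equal allocation, including the small correction term z<sup>2</sup>/4; the formula itself is given in Table 1, which is not reproduced in this section, so the exact form used here is an assumption, as is the function name.</p>

```python
from math import ceil
from statistics import NormalDist


def n_per_group_t_test(effect_size, alpha=0.05, power=0.80):
    """Subjects per group for a two-sided two-sample comparison of means.

    Assumed form: n = 2 (z_{1-alpha/2} + z_{1-beta})^2 / Delta^2
                      + z_{1-alpha/2}^2 / 4
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2 + z_alpha ** 2 / 4
    return ceil(n)


# Standardised effect size: mean difference of 5 over a standard deviation of 20.
delta_normal = 5 / 20
print(delta_normal, n_per_group_t_test(delta_normal))
```

                <p>Under these assumptions the sketch returns 253 subjects per group for a standardised effect size of 0.25.</p>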
            </sec>
         </sec>
         <sec>
            <st>
               <p>Sample size estimation &#8211; Method 2</p>
            </st>
            <p>Suppose we believe the GH dimension to be continuous, but not Normally distributed, and are intending to compare GH scores in the two groups with a Mann-Whitney U test (with allowances for ties). Therefore, Noether's method will be appropriate. As before, if we assume a mean difference of 5 and a standard deviation of 20 for the GH dimension of the SF-36, then using equation 1 leads to a parametric estimate of the effect size p<sub>Noether </sub>= Pr(Y > X) of 0.57 and consequently Pr(X > Y) of 0.43. Substituting p<sub>Noether </sub>= 0.57 in the formula for Method 2 (in Table <tblr tid="T1">1</tblr>) with a two-sided 5% significance level and 80% power gives the estimated number of subjects per group as 267.</p>
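            <p>A short Python sketch of this calculation follows. Equation 1 and the Method 2 formula are not reproduced in this section, so two assumptions are made: that equation 1 is the Normal-theory expression Pr(Y > X) = &#934;(<it>&#948;</it>/(&#8730;2 <it>&#963;</it>)), and that the Method 2 formula is Noether's formula for equal group sizes. The function names are illustrative, and the effect size is rounded to two decimal places as in the text.</p>

```python
from math import ceil, sqrt
from statistics import NormalDist


def p_noether_parametric(mean_diff, sd):
    """Assumed form of equation 1: Pr(Y > X) for two Normal populations
    with a common standard deviation sd and means mean_diff apart."""
    return NormalDist().cdf(mean_diff / (sqrt(2) * sd))


def n_per_group_noether(p, alpha=0.05, power=0.80):
    """Assumed form of Method 2 (Noether, equal allocation):
    n = (z_{1-alpha/2} + z_{1-beta})^2 / (6 (p - 0.5)^2)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil((z_alpha + z_beta) ** 2 / (6 * (p - 0.5) ** 2))


# Mean difference of 5 and standard deviation of 20, rounded as in the text.
p = round(p_noether_parametric(5, 20), 2)
n = n_per_group_noether(p)
print(p, n)
```

            <p>With p rounded to 0.57 the sketch gives 267 subjects per group; using the unrounded probability gives a slightly smaller number, which shows how sensitive the estimate is to the second decimal place of the effect size.</p>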
            <p>Method 1 has given us a slightly smaller sample size estimate than Method 2. The two methods can be regarded as equivalent when the two populations are Normally distributed, with equal variances. In this case, the MW test will require about 5% more observations than the two-sample t-test to provide the same power against the same alternative. For non-Normal populations, especially those with long tails, the MW test may not require as many observations as the two-sample t-test <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>.</p>
            <p>Empirically, calculating a parametric estimate of Pr(Y > X) from the observed effect size data (using the observed sample means and standard deviations) leads to values very similar to the non-parametric estimate. For example, for the GH dimension in the CPSW data in Table <tblr tid="T2">2</tblr>, the observed non-parametric estimate of Pr(Y > X) was 0.542 compared to a parametric estimate of 0.537.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>CPSW Study [7] Observed Effect Sizes for Control vs. Intervention Groups</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c ca="left">
                        <p>
                           <b>SF-36 Dimension</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Group</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>n</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>mean</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>sd</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Mean Diff <it>&#948;</it></b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>
                              <it>&#916;</it>
                              <sub>Normal</sub>
                           </b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>P<sub>Noether </sub>Parametric</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>Pr(Y > X) Non-parametric</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Physical</p>
                     </c>
                     <c ca="left">
                        <p>Control</p>
                     </c>
                     <c ca="left">
                        <p>241</p>
                     </c>
                     <c ca="left">
                        <p>89.9</p>
                     </c>
                     <c ca="left">
                        <p>14.5</p>
                     </c>
                     <c ca="left">
                        <p>2.6</p>
                     </c>
                     <c ca="left">
                        <p>0.17</p>
                     </c>
                     <c ca="left">
                        <p>0.548</p>
                     </c>
                     <c ca="left">
                        <p>0.561</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Function</p>
                     </c>
                     <c ca="left">
                        <p>Intervention</p>
                     </c>
                     <c ca="left">
                        <p>254</p>
                     </c>
                     <c ca="left">
                        <p>87.3</p>
                     </c>
                     <c ca="left">
                        <p>15.8</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Role</p>
                     </c>
                     <c ca="left">
                        <p>Control</p>
                     </c>
                     <c ca="left">
                        <p>241</p>
                     </c>
                     <c ca="left">
                        <p>74.3</p>
                     </c>
                     <c ca="left">
                        <p>38.1</p>
                     </c>
                     <c ca="left">
                        <p>9.1</p>
                     </c>
                     <c ca="left">
                        <p>0.23</p>
                     </c>
                     <c ca="left">
                        <p>0.566</p>
                     </c>
                     <c ca="left">
                        <p>0.568</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Physical</p>
                     </c>
                     <c ca="left">
                        <p>Intervention</p>
                     </c>
                     <c ca="left">
                        <p>254</p>
                     </c>
                     <c ca="left">
                        <p>65.2</p>
                     </c>
                     <c ca="left">
                        <p>39.5</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Bodily</p>
                     </c>
                     <c ca="left">
                        <p>Control</p>
                     </c>
                     <c ca="left">
                        <p>241</p>
                     </c>
                     <c ca="left">
                        <p>75.6</p>
                     </c>
                     <c ca="left">
                        <p>23.7</p>
                     </c>
                     <c ca="left">
                        <p>4</p>
                     </c>
                     <c ca="left">
                        <p>0.17</p>
                     </c>
                     <c ca="left">
                        <p>0.547</p>
                     </c>
                     <c ca="left">
                        <p>0.552</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Pain</p>
                     </c>
                     <c ca="left">
                        <p>Intervention</p>
                     </c>
                     <c ca="left">
                        <p>254</p>
                     </c>
                     <c ca="left">
                        <p>71.6</p>
                     </c>
                     <c ca="left">
                        <p>23.8</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>General</p>
                     </c>
                     <c ca="left">
                        <p>Control</p>
                     </c>
                     <c ca="left">
                        <p>241</p>
                     </c>
                     <c ca="left">
                        <p>77.7</p>
                     </c>
                     <c ca="left">
                        <p>17.7</p>
                     </c>
                     <c ca="left">
                        <p>2.4</p>
                     </c>
                     <c ca="left">
                        <p>0.13</p>
                     </c>
                     <c ca="left">
                        <p>0.537</p>
                     </c>
                     <c ca="left">
                        <p>0.542</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Health</p>
                     </c>
                     <c ca="left">
                        <p>Intervention</p>
                     </c>
                     <c ca="left">
                        <p>254</p>
                     </c>
                     <c ca="left">
                        <p>75.3</p>
                     </c>
                     <c ca="left">
                        <p>18.5</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Vitality</p>
                     </c>
                     <c ca="left">
                        <p>Control</p>
                     </c>
                     <c ca="left">
                        <p>241</p>
                     </c>
                     <c ca="left">
                        <p>51.1</p>
                     </c>
                     <c ca="left">
                        <p>20.7</p>
                     </c>
                     <c ca="left">
                        <p>1.3</p>
                     </c>
                     <c ca="left">
                        <p>0.06</p>
                     </c>
                     <c ca="left">
                        <p>0.517</p>
                     </c>
                     <c ca="left">
                        <p>0.514</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>Intervention</p>
                     </c>
                     <c ca="left">
                        <p>254</p>
                     </c>
                     <c ca="left">
                        <p>49.8</p>
                     </c>
                     <c ca="left">
                        <p>21.7</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Social</p>
                     </c>
                     <c ca="left">
                        <p>Control</p>
                     </c>
                     <c ca="left">
                        <p>241</p>
                     </c>
                     <c ca="left">
                        <p>81.6</p>
                     </c>
                     <c ca="left">
                        <p>22.7</p>
                     </c>
                     <c ca="left">
                        <p>4.7</p>
                     </c>
                     <c ca="left">
                        <p>0.20</p>
                     </c>
                     <c ca="left">
                        <p>0.556</p>
                     </c>
                     <c ca="left">
                        <p>0.561</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Function</p>
                     </c>
                     <c ca="left">
                        <p>Intervention</p>
                     </c>
                     <c ca="left">
                        <p>254</p>
                     </c>
                     <c ca="left">
                        <p>76.9</p>
                     </c>
                     <c ca="left">
                        <p>24.2</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Role</p>
                     </c>
                     <c ca="left">
                        <p>Control</p>
                     </c>
                     <c ca="left">
                        <p>241</p>
                     </c>
                     <c ca="left">
                        <p>77.9</p>
                     </c>
                     <c ca="left">
                        <p>36.4</p>
                     </c>
                     <c ca="left">
                        <p>1.1</p>
                     </c>
                     <c ca="left">
                        <p>0.03</p>
                     </c>
                     <c ca="left">
                        <p>0.509</p>
                     </c>
                     <c ca="left">
                        <p>0.515</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Emotional</p>
                     </c>
                     <c ca="left">
                        <p>Intervention</p>
                     </c>
                     <c ca="left">
                        <p>254</p>
                     </c>
                     <c ca="left">
                        <p>76.8</p>
                     </c>
                     <c ca="left">
                        <p>35.5</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Mental</p>
                     </c>
                     <c ca="left">
                        <p>Control</p>
                     </c>
                     <c ca="left">
                        <p>241</p>
                     </c>
                     <c ca="left">
                        <p>72.9</p>
                     </c>
                     <c ca="left">
                        <p>17.2</p>
                     </c>
                     <c ca="left">
                        <p>-0.2</p>
                     </c>
                     <c ca="left">
                        <p>-0.01</p>
                     </c>
                     <c ca="left">
                        <p>0.497</p>
                     </c>
                     <c ca="left">
                        <p>0.499</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Health</p>
                     </c>
                     <c ca="left">
                        <p>Intervention</p>
                     </c>
                     <c ca="left">
                        <p>254</p>
                     </c>
                     <c ca="left">
                        <p>73.1</p>
                     </c>
                     <c ca="left">
                        <p>16.7</p>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Effect size <it>&#916;</it><sub>Normal </sub>= mean difference divided by the pooled standard deviation. Effect size p<sub>Noether </sub>= Pr(Y<sub>Control </sub>> X<sub>Intervention</sub>), based on U/nm, where U = MW test statistic (with allowance for ties). Parametric estimate of Pr(Y > X) based on equation 1.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Sample size estimation &#8211; Method 3</p>
            </st>
            <p>Assuming a mean difference of 5 (i.e. <graphic file="1477-7525-2-26-i15.gif"/>) and a common standard deviation of 20 (i.e. <graphic file="1477-7525-2-26-i16.gif"/>) for the GH dimension of the SF-36, then equation (1) leads to a parametric estimate of the effect size p<sub>Noether </sub>= Pr(Y > X) of 0.57. This in turn leads to a parametric estimate of the ARR (from equation 2), <it>&#955;</it>' = 0.57 - 0.43 = 0.14 and an estimated OR (from equation 3), <it>&#952;</it>' = 0.57/0.43 = 1.33.</p>
            <p>If we assume OR<sub>Ordinal </sub>= OR = 1.33 then the assumption of proportional odds specifies that the OR<sub>Ordinal, i </sub>will be the same for all 34 categories of the GH dimension. If we also assume that the proportion of subjects in each category in the control group is the same as in Figure <figr fid="F1">1g</figr>, then under the assumption of proportional odds with OR<sub>Ordinal </sub>= 1.33, the anticipated cumulative proportions (<it>&#947;</it><sub>iT</sub>) for each category of treatment T are given by:</p>
            <p>
               <graphic file="1477-7525-2-26-i17.gif"/>
            </p>
            <p>After calculating the cumulative proportions (<it>&#947;</it><sub>iT</sub>), the anticipated proportions falling into each treatment category, <it>&#960;</it><sub><it>iT </it></sub>can be determined from the difference in successive <it>&#947;</it><sub>iT</sub>. Finally, the combined mean (<graphic file="1477-7525-2-26-i6.gif"/>) of the proportions of treatments C and T for each category is calculated.</p>
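            <p>These steps can be sketched in Python as follows. The formula for the cumulative proportions appears above only as an image, so the form used here, <it>&#947;</it><sub>iT </sub>= <it>&#947;</it><sub>iC </sub>/ (<it>&#947;</it><sub>iC </sub>+ OR (1 - <it>&#947;</it><sub>iC</sub>)) with categories ordered from worst to best health, is an assumption chosen so that an OR above 1 moves subjects towards higher scores. The control proportions in the example are hypothetical, not those of Figure 1g, and the function names are illustrative.</p>

```python
import numpy as np


def treatment_proportions(pi_control, odds_ratio):
    """Category proportions under treatment for a proportional odds shift.

    Assumed form: gamma_T = gamma_C / (gamma_C + OR * (1 - gamma_C)),
    where gamma is the cumulative proportion from the lowest category up.
    """
    pi_control = np.asarray(pi_control, dtype=float)
    gamma_c = np.cumsum(pi_control)
    gamma_t = gamma_c / (gamma_c + odds_ratio * (1.0 - gamma_c))
    # Proportions are differences between successive cumulative proportions.
    return np.diff(gamma_t, prepend=0.0)


def sum_mean_proportions_cubed(pi_control, pi_treatment):
    """Sum over categories of the cubed mean of the two groups' proportions."""
    pi_bar = (np.asarray(pi_control, dtype=float)
              + np.asarray(pi_treatment, dtype=float)) / 2.0
    return float(np.sum(pi_bar ** 3))


# Hypothetical control distribution over four ordered categories.
pi_c = [0.1, 0.2, 0.4, 0.3]
pi_t = treatment_proportions(pi_c, odds_ratio=1.33)
print(pi_t, sum_mean_proportions_cubed(pi_c, pi_t))
```

            <p>With many sparsely populated categories, as for the GH dimension, the sum of cubed mean proportions is close to zero, so it has little influence on the resulting sample size.</p>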
            <p>Substituting OR<sub>Ordinal </sub>= 1.33 and <graphic file="1477-7525-2-26-i18.gif"/> = 0.0067 with a two-sided 5% significance level and 80% power gives the estimated number of subjects per group as 584. Given this sample size, and with the distribution shown in Figure <figr fid="F1">1g</figr> and an OR of 1.33, we can work out what the corresponding mean values are. The estimated mean GH score was 77.6 in the treatment group and 75.0 in the control group. This is an estimated mean difference of 2.6 points, which is smaller than the five-point mean difference used earlier.</p>
            <p>It is difficult to translate a shift in means into a shift in the probabilities on an ordinal scale without several assumptions. If we assume the proportions in each category in the control group are as shown in Figure <figr fid="F1">1g</figr> and a proportional odds shift, then an OR<sub>Ordinal </sub>of 1.63 is approximately equal to a mean shift of 5.0. This leads to <graphic file="1477-7525-2-26-i18.gif"/> = 0.007 and a sample size estimate of 199 subjects per group with two-sided 5% significance and 80% power. Given this sample size, the corresponding estimated mean GH scores are 74.8 and 79.8 in the control and treatment groups respectively.</p>
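            <p>The second sample size estimate can be checked with a short Python sketch. It assumes that the Method 3 formula in Table 1 (not reproduced in this section) has the form commonly attributed to Whitehead for ordered categorical outcomes with equal allocation, n = 6 (z<sub>1-<it>&#945;</it>/2 </sub>+ z<sub>1-<it>&#946;</it></sub>)<sup>2 </sup>/ [(log OR)<sup>2 </sup>(1 - &#931; of the cubed mean proportions)]. The function and argument names are illustrative.</p>

```python
from math import ceil, log
from statistics import NormalDist


def n_per_group_ordinal(odds_ratio, sum_pibar_cubed, alpha=0.05, power=0.80):
    """Subjects per group for an ordinal outcome under proportional odds.

    Assumed form (equal allocation):
    n = 6 (z_{1-alpha/2} + z_{1-beta})^2
        / ((log OR)^2 * (1 - sum of cubed mean category proportions))
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 6 * (z_alpha + z_beta) ** 2 / (
        log(odds_ratio) ** 2 * (1 - sum_pibar_cubed)
    )
    return ceil(n)


# Inputs quoted in the text: OR of 1.63 and a cubed-proportion sum of 0.007.
print(n_per_group_ordinal(1.63, 0.007))
```

            <p>Under this assumed form the sketch returns 199 subjects per group. Because the odds ratio enters through its squared logarithm, small changes in the assumed OR change the required sample size substantially.</p>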
         </sec>
         <sec>
            <st>
               <p>Method 4 &#8211; Bootstrap sample size estimation</p>
            </st>
            <p>Figure <figr fid="F1">1g</figr> shows the skewed distribution of the GH dimension and that the underlying assumption of Normality of the distribution required for Method 1 may not be appropriate. Furthermore, the GH dimension is bounded by 0 and 100. Thus, if a new mother already has a GH score of 100 in the control group, then under the intervention no extra improvement can be seen, at least not by the GH dimension of the SF-36. Seven percent of women (35/487) in the Sheffield data had a GH score of 100 and 14.2% (70/487) had a score of 95 or more.</p>
            <p>Figure <figr fid="F2">2</figr> shows the estimated power curves for Methods 1, 2 and 3 and the two bootstrap methods (t and MW tests) at the 5% two-sided significance level for detecting a location shift (mean difference) <it>&#948; </it>= 5 in the SF-36 GH dimension using the data from the general population as our pilot sample, for sample sizes per group varying from 50 to 600. For these general population data a location shift of <it>&#948; </it>= 5 is equivalent to a standardised effect size <it>&#916;</it><sub>Normal </sub>= 0.25 and p<sub>Noether </sub>= Pr(Y > X) = 0.57. The bootstrap methods taking into account the bounded and non-Normal distribution of the data suggest a mean difference d of 4.5 and p = Pr(Y > X) = 0.58.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Estimated power curves for the SF-36 General Health dimension using general population data (females aged 16&#8211;45), based on <it>&#945; </it>= 0.05 (two-sided) with 10,000 bootstrap replications</p>
               </caption>
               <text>
                  <p><b>Estimated power curves for the SF-36 General Health dimension using general population data (females aged 16&#8211;45), based on <it>&#945; </it>= 0.05 (two-sided) with 10,000 bootstrap replications </b>SF-36 General Health Dimension (General Population Females aged 16&#8211;45); n= 487; mean = 74.8; sd = 19.6; 14.2% scoring 95 or more.</p>
               </text>
               <graphic file="1477-7525-2-26-2"/>
            </fig>
            <p>The GH dimension (Figure <figr fid="F1">1g</figr>) of the SF-36 has a large number (> 30) of discrete values or categories, most of which are occupied, and the proportion scoring 0 or 100 is low. Figure <figr fid="F2">2</figr> suggests that the MW test is more powerful than the t-test for the GH dimension based on the bootstrap results for the bounded shift. The power curves shown in Figure <figr fid="F2">2</figr> do not diverge too greatly, and thus the location shift hypothesis is a useful working model.</p>
            <p>In contrast, Figure <figr fid="F3">3</figr> shows the estimated power curves for another dimension of the SF-36: RP, which can only take one of five discrete values (as shown in Figure <figr fid="F1">1c</figr>), for detecting a simple location shift (mean difference) <it>&#948; </it>= 5. For these data a simple location shift of <it>&#948; </it>= 5 is equivalent to a standardised effect size <it>&#916;</it><sub>Normal </sub>= 0.17 and p<sub>Noether </sub>= Pr(Y > X) = 0.55. Since three-quarters of the pilot sample scored 100, the bootstrap methods under the location shift model, taking into account the bounded and non-Normal distribution of the data suggest a mean difference d of 1.2 and p = Pr(Y > X) = 0.51. The power curves shown in Figure <figr fid="F3">3</figr> diverge greatly and the simple location shift hypothesis may not be appropriate for this outcome. Figure <figr fid="F3">3</figr> clearly shows the value of the bootstrap in investigating the impact of the bounded HRQoL distributions on the power of the hypothesis test.</p>
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Estimated power curves for the SF-36 Role Physical dimension using general population data (females aged 16&#8211;45), based on <it>&#945; </it>= 0.05 (two-sided) with 10,000 bootstrap replications</p>
               </caption>
               <text>
                  <p><b>Estimated power curves for the SF-36 Role Physical dimension using general population data (females aged 16&#8211;45), based on <it>&#945; </it>= 0.05 (two-sided) with 10,000 bootstrap replications </b>SF-36 Role Limitations Physical Dimension (General Population Females aged 16&#8211;45); n= 487; mean = 85.5; sd = 29.1; 75.4% scoring 100.</p>
               </text>
               <graphic file="1477-7525-2-26-3"/>
            </fig>
            <p>Finally, Figure <figr fid="F4">4</figr> shows the estimated power curves for the Vitality dimension of the SF-36. This computer simulation suggests that if the distribution of the HRQoL dimension is reasonably symmetric (Figure <figr fid="F1">1b</figr>), and the proportion of patients at each bound is low, then under the location shift alternative hypothesis, the t-test appears to be slightly more powerful than the MW test at detecting differences in means.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>Estimated power curves for the SF-36 Energy/Vitality dimension using general population data (females aged 16&#8211;45), based on <it>&#945; </it>= 0.05 (two-sided) with 10,000 bootstrap replications</p>
               </caption>
               <text>
                  <p><b>Estimated power curves for the SF-36 Energy/Vitality dimension using general population data (females aged 16&#8211;45), based on <it>&#945; </it>= 0.05 (two-sided) with 10,000 bootstrap replications </b>SF-36 Energy Dimension (General Population Females aged 16&#8211;45); n= 487; mean = 59.3; sd = 21.1; 1% scoring 100.</p>
               </text>
               <graphic file="1477-7525-2-26-4"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Use of the bootstrap to estimate Type I error</p>
            </st>
            <p>The bootstrap methodology provides an ideal opportunity to consider Type I errors. Resampling Algorithm 1 can easily be adapted for this. It simply involves modifying step 1 and not adding <it>&#948; </it>to the second simulated sample of patients. Under the true null hypothesis of no difference in distributions, the actual Type I error rate can be computed as the proportion of simulated samples whose significance level is at or below the nominal value. For a nominal Type I error rate of <it>&#945; </it>= 0.05, we would expect 5% (or 0.05) of the bootstrap samples to give a (false-positive) significant result under the true null hypothesis of no difference in distributions. The robustness of each test can then be determined by comparing the actual Type I error rates to the nominal Type I error rates.</p>
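            <p>The procedure described above can be sketched in Python as follows. This is a minimal sketch rather than the code used for this paper: it assumes NumPy and SciPy are available, the function name bootstrap_type1_error is hypothetical, the pilot data are synthetic values drawn to resemble the RP distribution, and the default of 2,000 replications is chosen only to keep the run short (10,000 were used for the results reported here).</p>

```python
import numpy as np
from scipy import stats


def bootstrap_type1_error(pilot, n_per_group, n_boot=2000, alpha=0.05, seed=0):
    """Estimate actual Type I error rates of the t-test and MW test.

    Both groups are resampled with replacement from the same pilot data,
    so the null hypothesis of no difference in distributions is true.
    """
    rng = np.random.default_rng(seed)
    pilot = np.asarray(pilot, dtype=float)
    reject_t = 0
    reject_mw = 0
    for _ in range(n_boot):
        x = rng.choice(pilot, size=n_per_group, replace=True)
        y = rng.choice(pilot, size=n_per_group, replace=True)
        if np.ptp(np.concatenate([x, y])) == 0:
            # Every value identical: neither test can reject.
            continue
        p_t = stats.ttest_ind(x, y).pvalue
        p_mw = stats.mannwhitneyu(x, y, alternative="two-sided").pvalue
        # A NaN p-value compares as False and is counted as a non-rejection.
        reject_t += int(alpha >= p_t)
        reject_mw += int(alpha >= p_mw)
    return reject_t / n_boot, reject_mw / n_boot


# Synthetic stand-in for a pilot dataset (hypothetical, RP-like proportions).
pilot_rng = np.random.default_rng(1)
pilot = pilot_rng.choice([0, 25, 50, 75, 100], size=487,
                         p=[0.06, 0.05, 0.06, 0.08, 0.75])
t_rate, mw_rate = bootstrap_type1_error(pilot, n_per_group=50)
print(t_rate, mw_rate)
```

            <p>With a finite number of replications the estimated rates carry Monte Carlo error, so small departures from 0.05 should not be over-interpreted.</p>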
            <p>Statistical tests are said to be robust if the observed Type I error rates are close to the pre-selected, or nominal, Type I error rates in the presence of violations of assumptions <abbrgrp><abbr bid="B24">24</abbr><abbr bid="B25">25</abbr></abbrgrp>. Sullivan and D'Agostino <abbrgrp><abbr bid="B26">26</abbr></abbrgrp> describe a test as 'robust' if the actual significance level does not exceed the nominal significance level by more than 10% (e.g. less than or equal to 0.055 when the nominal significance level is 0.05). They describe a test as 'liberal' if the observed significance exceeds the nominal level by more than 10%. Finally, they describe a test as 'conservative' if the actual significance level is less than the nominal level. A 'conservative' test is of less concern, as the probability of making a Type I error is controlled.</p>
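            <p>These three labels can be written as a small Python function. The function name classify_test and its arguments are illustrative assumptions. Note that a level below the nominal value also satisfies the 10% bound, so the sketch reports 'conservative' only as a more specific label for such cases.</p>

```python
def classify_test(actual, nominal=0.05, tolerance=0.10):
    """Label an actual significance level relative to the nominal level."""
    if actual > nominal * (1.0 + tolerance):
        return "liberal"       # exceeds nominal by more than 10%
    if nominal > actual:
        return "conservative"  # below nominal; Type I error is controlled
    return "robust"            # between nominal and nominal + 10%


# Two values taken from Table 3, plus a made-up value to show the third label.
for level in (0.0543, 0.0393, 0.0600):
    print(level, classify_test(level))
```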
            <p>The overall actual significance levels relative to a nominal level of 0.05 under the null hypothesis of no treatment differences for the GH and RP dimensions are displayed in Table <tblr tid="T3">3</tblr> for a variety of sample sizes. Both tests (t-test and MW) are 'robust' tests of the equality of means (and distributions) for both the GH and RP outcomes.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Actual significance levels for <it>t</it>-test and <it>MW </it>test relative to nominal <it>&#945; </it>= 0.05: using general population data (females aged 16&#8211;45) for the GH and RP dimensions [10]</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c ca="left">
                        <p>
                           <b>Sample sizes</b>
                        </p>
                     </c>
                     <c cspan="2" ca="left">
                        <p>
                           <b>GH dimension</b>
                        </p>
                     </c>
                     <c cspan="2" ca="left">
                        <p>
                           <b>RP Dimension</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="left">
                        <p>
                           <b>t-test</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>MW test</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>t-test</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>
                           <b>MW test</b>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>300, 300</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0511</p>
                     </c>
                     <c ca="left">
                        <p>0.0490</p>
                     </c>
                     <c ca="left">
                        <p>0.0504</p>
                     </c>
                     <c ca="left">
                        <p>0.0495</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>250, 250</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0520</p>
                     </c>
                     <c ca="left">
                        <p>0.0535</p>
                     </c>
                     <c ca="left">
                        <p>0.0497</p>
                     </c>
                     <c ca="left">
                        <p>0.0516</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>200, 200</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0543</p>
                     </c>
                     <c ca="left">
                        <p>0.0484</p>
                     </c>
                     <c ca="left">
                        <p>0.0508</p>
                     </c>
                     <c ca="left">
                        <p>0.0521</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>150, 150</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0514</p>
                     </c>
                     <c ca="left">
                        <p>0.0527</p>
                     </c>
                     <c ca="left">
                        <p>0.0507</p>
                     </c>
                     <c ca="left">
                        <p>0.0510</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>100, 100</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0502</p>
                     </c>
                     <c ca="left">
                        <p>0.0515</p>
                     </c>
                     <c ca="left">
                        <p>0.0518</p>
                     </c>
                     <c ca="left">
                        <p>0.0534</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>75, 75</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0493</p>
                     </c>
                     <c ca="left">
                        <p>0.0515</p>
                     </c>
                     <c ca="left">
                        <p>0.0535</p>
                     </c>
                     <c ca="left">
                        <p>0.0500</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>50, 50</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0506</p>
                     </c>
                     <c ca="left">
                        <p>0.0522</p>
                     </c>
                     <c ca="left">
                        <p>0.0476</p>
                     </c>
                     <c ca="left">
                        <p>0.0512</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>25, 25</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0490</p>
                     </c>
                     <c ca="left">
                        <p>0.0492</p>
                     </c>
                     <c ca="left">
                        <p>0.0526</p>
                     </c>
                     <c ca="left">
                        <p>0.0478</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>10, 10</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>0.0428</p>
                     </c>
                     <c ca="left">
                        <p>0.0464</p>
                     </c>
                     <c ca="left">
                        <p>0.0393</p>
                     </c>
                     <c ca="left">
                        <p>0.0456</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p><it>&#945; </it>= 0.05 (two-sided); 10,000 bootstrap replications.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Extensions of the use of the bootstrap &#8211; odds ratio shifts rather than simple location shift</p>
            </st>
            <p>When using the proportional odds model to estimate sample size, Whitehead <abbrgrp><abbr bid="B16">16</abbr></abbrgrp> and Julious et al <abbrgrp><abbr bid="B27">27</abbr></abbrgrp> have pointed out that there is little increase in power (and hence saving in the number of subjects recruited) to be gained by increasing the number of categories in a proportional odds model beyond five. Although the model is robust to mild departures from the assumption of proportional odds, with increasing numbers of categories it is less likely that the proportional odds assumption remains true. Therefore, to illustrate this point, we shall use the five discrete category outcome of the RP dimension of the SF-36 to show the effect of the bootstrap sample size estimator when the alternative to the null hypothesis is an odds ratio transformation rather than a simple location shift.</p>
            <p>Figure <figr fid="F5">5</figr> shows the power curves for t-test and MW test for the RP dimension of the SF-36 assuming the alternative hypothesis is a proportional odds shift in HRQoL of OR<sub>Ordinal </sub>= 1.50. As one would expect, the bootstrap power curves in Figure <figr fid="F5">5</figr> show that the MW test or the equivalent proportional odds model is more powerful than the t-test when the alternative hypothesis is an odds ratio shift, although the differences in power for a given sample size are small.</p>
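            <p>An odds ratio shift of this kind can be sketched in Python as follows. This is an illustration under assumptions rather than the procedure used to produce Figure 5: it assumes the categories are ordered from worst (0) to best (100), that OR = 1.5 divides the cumulative odds of being in a given category or below, and that both groups are drawn from the category probabilities implied by the rounded cumulative proportions for the RP outcome instead of being resampled from the pilot data. All function names are hypothetical, and 500 simulations are used only to keep the run short.</p>

```python
import numpy as np
from scipy import stats

SCORES = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
CUM_CONTROL = np.array([0.06, 0.11, 0.17, 0.25, 1.0])  # rounded values


def odds_ratio_shift(cum_control, odds_ratio):
    """Divide the cumulative odds of each category by the odds ratio."""
    cum_control = np.asarray(cum_control, dtype=float)
    return cum_control / (cum_control + odds_ratio * (1.0 - cum_control))


def category_probs(cum):
    """Convert cumulative proportions to category probabilities."""
    return np.diff(cum, prepend=0.0)


def simulated_power(n_per_group, odds_ratio=1.5, n_sim=500, alpha=0.05, seed=0):
    """Estimate power of the t-test and MW test under an odds ratio shift."""
    rng = np.random.default_rng(seed)
    p_control = category_probs(CUM_CONTROL)
    p_treat = category_probs(odds_ratio_shift(CUM_CONTROL, odds_ratio))
    hits_t = 0
    hits_mw = 0
    for _ in range(n_sim):
        x = rng.choice(SCORES, size=n_per_group, p=p_control)
        y = rng.choice(SCORES, size=n_per_group, p=p_treat)
        hits_t += int(alpha >= stats.ttest_ind(x, y).pvalue)
        hits_mw += int(
            alpha >= stats.mannwhitneyu(x, y, alternative="two-sided").pvalue
        )
    return hits_t / n_sim, hits_mw / n_sim


cum_treat = odds_ratio_shift(CUM_CONTROL, 1.5)
print(np.round(category_probs(cum_treat), 3))
print(simulated_power(450))
```

            <p>Because the shift acts on the cumulative odds, the treated group stays on the same five categories and within the 0 to 100 bounds, which a simple location shift cannot guarantee.</p>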
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Estimated power curves for the SF-36 Role Physical dimension to detect an Odds Ratio shift using general population data (females aged 16&#8211;45) based on <it>&#945; </it>= 0.05 (two-sided) with 10,000 bootstrap replications</p>
               </caption>
               <text>
                  <p><b>Estimated power curves for the SF-36 Role Physical dimension to detect an Odds Ratio shift using general population data (females aged 16&#8211;45) based on <it>&#945; </it>= 0.05 (two-sided) with 10,000 bootstrap replications </b>Five category SF-36 Role Physical outcome (General population Females aged 16&#8211;45); n = 487, <it>&#947; </it>= (.06, .11, .17, .25, 1.0)</p>
               </text>
               <graphic file="1477-7525-2-26-5"/>
            </fig>
            <p>Sample sizes of over 450 patients per group are required to have an 80% chance of detecting this 'small to moderate' odds ratio (OR = 1.5) effect as statistically significant at the 5% two-sided level. With such 'large' sample sizes statistical theory, via the Central Limit Theorem (CLT), guarantees that the sample means will be approximately Normally distributed, which ensures the relatively good performance of the t-test in detecting an odds ratio shift. The robustness of the two independent samples t-test when applied to three-, four- and five-point ordinal scaled data has previously been demonstrated by Heeren and D'Agostino <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> for far smaller sample sizes than this (as small as 20 per group).</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <sec>
            <st>
               <p>Choice of sample size method with HRQoL outcomes</p>
            </st>
            <p>It is important to make maximum use of the information available from other related studies or extrapolation from other unrelated studies. The more precise the information the better we can design the trial. We would recommend that researchers planning a study with HRQoL measures as the primary outcome pay careful attention to any evidence on the validity and frequency distribution of the HRQoL measures and their dimensions.</p>
            <p>The frequency distribution of HRQoL dimension scores from previous studies should be assessed to see what methods should be used for sample size calculations and analysis. If the HRQoL outcome has a limited number of discrete values (say less than seven categories e.g. RP and RE dimensions, in the case of the SF-36, Figures <figr fid="F1">1c</figr> and <figr fid="F1">1f</figr>) and/or the proportion of cases at the upper bound (i.e. scoring 100) is high (e.g. PF, SF and BP dimensions in our general population sample example dataset, Figures <figr fid="F1">1a,1e,1f</figr>), then we would recommend using Method 3 to estimate the required sample size (Figure <figr fid="F6">6</figr>) <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. In this case, the alternative hypothesis of a location shift model is questionable and the proportional odds model will provide a suitable alternative with such bounded discrete outcomes.</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Choice of sample size estimation flow diagram</p>
               </caption>
               <text>
                  <p>Choice of sample size estimation flow diagram</p>
               </text>
               <graphic file="1477-7525-2-26-6"/>
            </fig>
            <p>If the HRQoL outcome has a larger number of discrete values (greater than or equal to seven categories say), most of which are occupied and the proportion of cases at the upper or lower bounds (i.e. scoring 0 or 100, in the case of the SF-36) is low (e.g. VT, GH and MH dimensions in our general population sample example dataset, Figures <figr fid="F1">1b,1g</figr> and <figr fid="F1">1h</figr>), then the simple location shift model is a useful working hypothesis. We would therefore recommend using Methods (1) or (2) to estimate the required sample size (Figure <figr fid="F6">6</figr>).</p>
            <p>Computer simulation has suggested that if the distributions of the HRQoL dimensions are reasonably symmetric (Figure <figr fid="F1">1b</figr>), and the proportion of patients at each bound is low, then under the location shift alternative hypothesis, the t-test appears to be slightly more powerful than the MW test at detecting differences in means (Figure <figr fid="F4">4</figr>). Therefore, if the distribution of the HRQoL outcomes is symmetric or expected to be reasonably symmetric and the proportion of patients at the upper or lower bounds is low then Method 1 could be used for sample size calculations and analysis (Figure <figr fid="F6">6</figr>). The use of parametric methods for analysis (i.e. t-test) also enables the relatively easy estimation of confidence intervals, which is regarded as good statistical practice <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>.</p>
            <p>If the distribution of the HRQoL outcome is expected to be skewed then the MW test appears to be more powerful at detecting a location shift (difference in means) than the t-test (Figures <figr fid="F2">2</figr> and <figr fid="F3">3</figr>). Therefore, in these circumstances, the MW test is preferable to the t-test and Method 2 could be used for sample size calculations and analysis. However, using Method 2 for sample size estimation requires the effect size to be defined in terms of Pr(Y > X), which is difficult to quantify and interpret. Pragmatically we would recommend Method 1 as the effect size <it>&#916;</it><sub>Normal </sub>is rather easier to quantify and interpret than the effect size p<sub>Noether </sub>required for sample size estimation using Method 2 (Figure <figr fid="F6">6</figr>).</p>
            <p>If the HRQoL data have a symmetric distribution, the mean and median will tend to coincide so either measure is a suitable summary measure of location. If the HRQoL data have an asymmetric distribution, then conventional statistical advice would suggest that the median is the preferred summary statistic <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. However, a case when the mean and mean difference might be preferred (even for skewed outcome data) as a summary measure is when health care providers are deciding whether or not to offer a new treatment to their population. The mean (along with the sample size) provides information about the total benefit (and total cost) from treating all patients, which is needed as the basis for health care policy decisions <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. We cannot estimate the total benefit (or cost) from the sample median.</p>
            <p>If the sample size is "sufficiently large" then statistical theory, using the CLT, guarantees that the sample means will be approximately Normally distributed. Thus, if the investigator is planning a large study and the sample mean is an appropriate summary measure of the HRQoL outcome, then pragmatically there is no need to worry about the distribution of the HRQoL outcome and we can use conventional methods to calculate sample sizes. Although the Normal distribution is strictly only the limiting form of the sampling distribution of the sample mean as the sample size n increases to infinity, it provides a remarkably good approximation to the sampling distribution even when n is small and the distribution of the data is far from Normal. Generally, if n is greater than 25, these approximations will be good. However, if the underlying distribution is symmetric, unimodal and continuous a value of n as small as 4 can yield a very adequate approximation <abbrgrp><abbr bid="B29">29</abbr></abbrgrp>. If a reliable pilot or historical dataset of HRQoL data is readily available (to estimate the shape of the distribution) then bootstrap simulation (Method 4) will provide a more accurate and reliable sample size estimate than Methods 1 to 3 (Figure <figr fid="F6">6</figr>) <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>.</p>
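            <p>The CLT argument above can be explored with a short simulation. The Python sketch below is illustrative only: it draws repeated samples from a strongly skewed five-category distribution resembling RP (probabilities taken from the rounded cumulative proportions, which is an assumption) and reports the skewness of the resulting sample means for several values of n. The function name is hypothetical.</p>

```python
import numpy as np
from scipy import stats

SCORES = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
PROBS = np.array([0.06, 0.05, 0.06, 0.08, 0.75])  # approximate RP proportions


def skewness_of_sample_means(values, probs, n, n_sim=20000, seed=0):
    """Skewness of the simulated sampling distribution of the mean."""
    rng = np.random.default_rng(seed)
    samples = rng.choice(values, size=(n_sim, n), p=probs)
    return float(stats.skew(samples.mean(axis=1)))


for n in (5, 25, 100):
    print(n, round(skewness_of_sample_means(SCORES, PROBS, n), 2))
```

            <p>The skewness of the sample means shrinks towards zero as n grows, which is the sense in which the Normal approximation improves with sample size even for data as far from Normal as these.</p>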
            <p>We had a reliable historical data set of over 400 subjects so we had a large sample to estimate the distributions (cdfs) F<sub>x </sub>and F<sub>y </sub>under the null and alternative hypotheses using Method 4. Lesaffre et al <abbrgrp><abbr bid="B31">31</abbr></abbrgrp> show that the bootstrap can give fairly unbiased estimates of power, though with large variability for small pilot samples. In the absence of a reliable pilot set, bootstrapping is not appropriate and we will need to use conventional methods of sample size estimation or simulation models <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>. Fortunately, with the increasing use of HRQoL outcomes in research, historical datasets are becoming more readily available. White and Thompson <abbrgrp><abbr bid="B33">33</abbr></abbrgrp> suggest the estimation of <graphic file="1477-7525-2-26-i12.gif"/> (and hence <graphic file="1477-7525-2-26-i13.gif"/>) should be derived from a pilot dataset, and that the use of baseline data or related data sets (which we have used) is somewhat less satisfactory. They suggest a third possibility for estimating <graphic file="1477-7525-2-26-i12.gif"/> is to use follow-up data viewed in a blinded manner, although only when the blinding can demonstrably be preserved.</p>
            <p>Strictly speaking, our results and conclusions only apply to the SF-36 outcome measure. Further empirical work is required to see whether or not these results hold true for other HRQoL outcomes, populations and interventions. However, the SF-36 has many features in common with other HRQoL outcomes, such as the NHP and QLQ-C30, i.e. multi-dimensional, ordinal or discrete response categories with upper and lower bounds, and skewed distributions; therefore, we see no theoretical reason why these results and conclusions with the SF-36 may not be appropriate for other HRQoL measures.</p>
            <p>Throughout this paper, we only considered the situation where a single dimension of HRQoL is used at a single endpoint. We have assumed a rather simple form of the alternative hypothesis that the new treatment/intervention would improve HRQoL compared to the control/standard therapy. This form of hypothesis (superiority vs. equivalence) may be more complicated than actually presented. However, the assumption of a simple form of the alternative hypothesis that new treatment/intervention would improve HRQoL compared to the control/standard therapy, is not unrealistic for most superiority trials and is frequently used for other clinical outcomes.</p>
            <p>We have based the calculations above on the assumption that there is a single identifiable endpoint, or HRQoL outcome, upon which treatment comparisons are based (in our case the GH dimension of the SF-36). Sometimes there is more than one endpoint of interest; HRQoL outcomes are multi-dimensional (e.g. the SF-36 has eight dimensions including GH). If one of these dimensions is regarded as more important than the others, it can be named as the primary endpoint and the sample size estimates calculated accordingly. The remainder should be consigned to exploratory analyses or descriptions only <abbrgrp><abbr bid="B3">3</abbr></abbrgrp>. Fairclough gives a more comprehensive discussion of multiple endpoints and suggests several methods for analysing HRQoL outcomes <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>.</p>
            <p>More work is required on what constitutes a clinically meaningful effect size for the SF-36 and other HRQoL outcomes. There is an extensive literature on the important issue of clinically meaningful change and the minimum important difference (MID) for HRQoL outcomes. As the subject of this paper is the use of computer intensive methods such as the bootstrap we have played down the issue of the MID. Again for brevity and practical purposes of sample size estimation this paper has assumed the MID for the SF-36 outcome is around five points for each dimension. This is an important issue in sample size estimation. The interested reader is referred to a series of papers from the Clinical Significance Consensus Meeting Group for more detailed discussion <abbrgrp><abbr bid="B35">35</abbr></abbrgrp>.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>Finally, we would stress the importance of a sample size calculation (with all its attendant assumptions), and that any such estimate is better than no sample size calculation at all, particularly in a trial protocol <abbrgrp><abbr bid="B36">36</abbr></abbrgrp>. The mere fact of calculation of a sample size means that a number of fundamental issues have been thought about: what is the main outcome variable, what is a clinically important effect, and how is it measured? The investigator is also likely to have specified the method and frequency of data analysis. Thus, protocols that are explicit about sample size are easier to evaluate in terms of scientific quality and the likelihood of achieving objectives.</p>
      </sec>
   </bdy>
   <bm>
      <refgrp>
         <bibl id="B1">
            <aug>
               <au>
                  <snm>Altman</snm>
                  <fnm>DG</fnm>
               </au>
               <au>
                  <snm>Machin</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Bryant</snm>
                  <fnm>TN</fnm>
               </au>
               <au>
                  <snm>Gardner</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>Statistics with Confidence: Confidence intervals and statistical guidelines</source>
            <publisher>London: British Medical Journal</publisher>
            <edition>2</edition>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B2">
            <aug>
               <au>
                  <snm>Machin</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Campbell</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Fayers</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Pinol</snm>
                  <fnm>APY</fnm>
               </au>
            </aug>
            <source>Sample Size Tables for Clinical Studies</source>
            <publisher>Oxford: Blackwell Science</publisher>
            <edition>2</edition>
            <pubdate>1997</pubdate>
         </bibl>
         <bibl id="B3">
            <aug>
               <au>
                  <snm>Fayers</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Machin</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Quality of Life Assessment, Analysis and Interpretation</source>
            <publisher>Chichester: Wiley</publisher>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Design and Analysis of Trials with Quality of Life as an Outcome: a practical guide</p>
            </title>
            <aug>
               <au>
                  <snm>Walters</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Campbell</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Lall</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Journal of Biopharmaceutical Statistics</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <issue>3</issue>
            <fpage>155</fpage>
            <lpage>176</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1081/BIP-100107655</pubid>
                  <pubid idtype="pmpid">11725929</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Methods for determining sample sizes for studies involving health-related quality of life measures: a tutorial</p>
            </title>
            <aug>
               <au>
                  <snm>Walters</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Campbell</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Paisley</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Health Services &amp; Outcomes Research Methodology</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <fpage>83</fpage>
            <lpage>99</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1023/A:1020102612073</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <aug>
               <au>
                  <snm>Efron</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>RJ</fnm>
               </au>
            </aug>
            <source>An Introduction to the Bootstrap</source>
            <publisher>New York: Chapman &amp; Hall</publisher>
            <pubdate>1993</pubdate>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Costs and effectiveness of community postnatal support workers: randomised controlled trial</p>
            </title>
            <aug>
               <au>
                  <snm>Morrell</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Spiby</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Stewart</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Walters</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Morgan</snm>
                  <fnm>A</fnm>
               </au>
            </aug>
            <source>British Medical Journal</source>
            <pubdate>2000</pubdate>
            <volume>321</volume>
            <fpage>593</fpage>
            <lpage>598</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1136/bmj.321.7261.593</pubid>
                  <pubid idtype="pmpid" link="fulltext">10977833</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <aug>
               <au>
                  <snm>Staquet</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Hays</snm>
                  <fnm>RD</fnm>
               </au>
               <au>
                  <snm>Fayers</snm>
                  <fnm>PM</fnm>
               </au>
            </aug>
            <source>Quality of Life Assessment in Clinical Trials: Methods and Practice</source>
            <publisher>Oxford: Oxford University Press</publisher>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B9">
            <title>
               <p>The MOS 36-item short-form health survey (SF-36). I. Conceptual framework and item selection</p>
            </title>
            <aug>
               <au>
                  <snm>Ware</snm>
                  <fnm>JE</fnm>
                  <suf>Jr</suf>
               </au>
               <au>
                  <snm>Sherbourne</snm>
                  <fnm>CD</fnm>
               </au>
            </aug>
            <source>Medical Care</source>
            <pubdate>1992</pubdate>
            <volume>30</volume>
            <fpage>473</fpage>
            <lpage>483</lpage>
            <xrefbib>
               <pubid idtype="pmpid">1593914</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Validating the SF-36 health survey questionnaire: new outcome measure for primary care</p>
            </title>
            <aug>
               <au>
                  <snm>Brazier</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Harper</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Jones</snm>
                  <fnm>NMB</fnm>
               </au>
               <au>
                  <snm>O'Cathain</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Thomas</snm>
                  <fnm>KJ</fnm>
               </au>
               <au>
                  <snm>Usherwood</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Westlake</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>British Medical Journal</source>
            <pubdate>1992</pubdate>
            <volume>305</volume>
            <fpage>160</fpage>
            <lpage>164</lpage>
            <xrefbib>
               <pubid idtype="pmpid">1285753</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <aug>
               <au>
                  <snm>Pocock</snm>
                  <fnm>SJ</fnm>
               </au>
            </aug>
            <source>Clinical Trials: A Practical Approach</source>
            <publisher>Chichester: Wiley</publisher>
            <pubdate>1983</pubdate>
         </bibl>
         <bibl id="B12">
            <aug>
               <au>
                  <snm>Lehmann</snm>
                  <fnm>EL</fnm>
               </au>
            </aug>
            <source>Nonparametrics: Statistical Methods Based on Ranks</source>
            <publisher>San Francisco: Holden-Day</publisher>
            <pubdate>1975</pubdate>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Sample Size Determination for Some Common Nonparametric Tests</p>
            </title>
            <aug>
               <au>
                  <snm>Noether</snm>
                  <fnm>GE</fnm>
               </au>
            </aug>
            <source>Journal of the American Statistical Association</source>
            <pubdate>1987</pubdate>
            <volume>82</volume>
            <issue>398</issue>
            <fpage>645</fpage>
            <lpage>647</lpage>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Determining the Appropriate Sample Size for Nonparametric Tests for Location Shift</p>
            </title>
            <aug>
               <au>
                  <snm>Hamilton</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Collings</snm>
                  <fnm>BJ</fnm>
               </au>
            </aug>
            <source>Technometrics</source>
            <pubdate>1991</pubdate>
            <volume>33</volume>
            <issue>3</issue>
            <fpage>327</fpage>
            <lpage>337</lpage>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Alternative Estimation Procedures for Pr(X &lt; Y) in Categorised Data</p>
            </title>
            <aug>
               <au>
                  <snm>Simonoff</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Hochberg</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Reiser</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Biometrics</source>
            <pubdate>1986</pubdate>
            <volume>42</volume>
            <fpage>895</fpage>
            <lpage>907</lpage>
            <xrefbib>
               <pubid idtype="pmpid">3814730</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Sample size calculations for ordered categorical data</p>
            </title>
            <aug>
               <au>
                  <snm>Whitehead</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Statistics in Medicine</source>
            <pubdate>1993</pubdate>
            <volume>12</volume>
            <fpage>2257</fpage>
            <lpage>2271</lpage>
            <note>[published erratum appears in <it>Stat Med </it>1994 Apr 30;13(8):871].</note>
            <xrefbib>
               <pubid idtype="pmpid">8134732</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <aug>
               <au>
                  <snm>Campbell</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>Statistics at Square Two: Understanding Modern Statistical Applications in Medicine</source>
            <publisher>London: British Medical Journal</publisher>
            <pubdate>2001</pubdate>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Re-conceptualising and Generalising the Absolute Risk Difference: A unification of Effect Sizes, Odds Ratios and Number-Needed-to-Treat</p>
            </title>
            <aug>
               <au>
                  <snm>Shepstone</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Journal of Epidemiology &amp; Community Health</source>
            <pubdate>2001</pubdate>
            <volume>55</volume>
            <issue>(Suppl 1) 1a</issue>
            <fpage>A7</fpage>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Estimating the Power of the Two-Sample Wilcoxon Test for Location Shift</p>
            </title>
            <aug>
               <au>
                  <snm>Collings</snm>
                  <fnm>BJ</fnm>
               </au>
               <au>
                  <snm>Hamilton</snm>
                  <fnm>MA</fnm>
               </au>
            </aug>
            <source>Biometrics</source>
            <pubdate>1988</pubdate>
            <volume>44</volume>
            <fpage>847</fpage>
            <lpage>860</lpage>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Sample Sizes for the SF-6D Preference Based Measure of Health from the SF-36: A Comparison of Two Methods</p>
            </title>
            <aug>
               <au>
                  <snm>Walters</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Brazier</snm>
                  <fnm>JE</fnm>
               </au>
            </aug>
            <source>Health Services &amp; Outcomes Research Methodology</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>35</fpage>
            <lpage>47</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1023/A:1025876827228</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <aug>
               <au>
                  <snm>Simon</snm>
                  <fnm>JL</fnm>
               </au>
            </aug>
            <source>Resampling Stats: Users Guide. v5.02</source>
            <publisher>Arlington: Resampling Stats Inc</publisher>
            <pubdate>2000</pubdate>
         </bibl>
         <bibl id="B22">
            <aug>
               <au>
                  <snm>Ware</snm>
                  <fnm>JE</fnm>
                  <suf>Jr</suf>
               </au>
               <au>
                  <snm>Snow</snm>
                  <fnm>KK</fnm>
               </au>
               <au>
                  <snm>Kosinski</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Gandek</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>SF-36 Health Survey Manual and Interpretation Guide</source>
            <publisher>Boston, MA: The Health Institute, New England Medical Center</publisher>
            <pubdate>1993</pubdate>
         </bibl>
         <bibl id="B23">
            <aug>
               <au>
                  <snm>Elashoff</snm>
                  <fnm>JD</fnm>
               </au>
            </aug>
            <source>nQuery Advisor Version 3.0 User's Guide</source>
            <publisher>Los Angeles: Statistical Solutions</publisher>
            <pubdate>1999</pubdate>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Robustness and power of analysis of covariance applied to data distorted from Normality by floor effects</p>
            </title>
            <aug>
               <au>
                  <snm>Sullivan</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>D'Agostino</snm>
                  <fnm>RB</fnm>
               </au>
            </aug>
            <source>Statistics in Medicine</source>
            <pubdate>1996</pubdate>
            <volume>15</volume>
            <fpage>477</fpage>
            <lpage>496</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/(SICI)1097-0258(19960315)15:5&lt;477::AID-SIM217>3.0.CO;2-R</pubid>
                  <pubid idtype="pmpid">8668873</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Robustness of the two independent samples <it>t</it>-test when applied to ordinal scaled data</p>
            </title>
            <aug>
               <au>
                  <snm>Heeren</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>D'Agostino</snm>
                  <fnm>RB</fnm>
               </au>
            </aug>
            <source>Statistics in Medicine</source>
            <pubdate>1987</pubdate>
            <volume>6</volume>
            <fpage>79</fpage>
            <lpage>90</lpage>
            <xrefbib>
               <pubid idtype="pmpid">3576020</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <title>
               <p>Robustness and power of analysis of covariance applied to ordinal scaled data as arising in randomized controlled trials</p>
            </title>
            <aug>
               <au>
                  <snm>Sullivan</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>D'Agostino</snm>
                  <fnm>RB</fnm>
               </au>
            </aug>
            <source>Statistics in Medicine</source>
            <pubdate>2003</pubdate>
            <volume>22</volume>
            <fpage>1317</fpage>
            <lpage>1334</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/sim.1433</pubid>
                  <pubid idtype="pmpid" link="fulltext">12687657</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Sample sizes for randomized trials measuring quality of life in cancer patients</p>
            </title>
            <aug>
               <au>
                  <snm>Julious</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>George</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Machin</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Stephens</snm>
                  <fnm>RJ</fnm>
               </au>
            </aug>
            <source>Quality of Life Research</source>
            <pubdate>1997</pubdate>
            <volume>6</volume>
            <fpage>109</fpage>
            <lpage>117</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1023/A:1026481815304</pubid>
                  <pubid idtype="pmpid">9161110</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B28">
            <title>
               <p>How should cost data in pragmatic randomised trials be analysed?</p>
            </title>
            <aug>
               <au>
                  <snm>Thompson</snm>
                  <fnm>SG</fnm>
               </au>
               <au>
                  <snm>Barber</snm>
                  <fnm>JA</fnm>
               </au>
            </aug>
            <source>British Medical Journal</source>
            <pubdate>2000</pubdate>
            <volume>320</volume>
            <fpage>1197</fpage>
            <lpage>1200</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1136/bmj.320.7243.1197</pubid>
                  <pubid idtype="pmpid" link="fulltext">10784550</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B29">
            <aug>
               <au>
                  <snm>Hogg</snm>
                  <fnm>RV</fnm>
               </au>
               <au>
                  <snm>Tanis</snm>
                  <fnm>EA</fnm>
               </au>
            </aug>
            <source>Probability and Statistical Inference</source>
            <publisher>New York: Macmillan</publisher>
            <edition>3</edition>
            <pubdate>1988</pubdate>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Approximating the Power of Wilcoxon's Rank-Sum Test Against Shift Alternatives</p>
            </title>
            <aug>
               <au>
                  <snm>Troendle</snm>
                  <fnm>JF</fnm>
               </au>
            </aug>
            <source>Statistics in Medicine</source>
            <pubdate>1999</pubdate>
            <volume>18</volume>
            <fpage>2763</fpage>
            <lpage>2773</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/(SICI)1097-0258(19991030)18:20&lt;2763::AID-SIM197>3.3.CO;2-D</pubid>
                  <pubid idtype="pmpid" link="fulltext">10521865</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Calculation of power and sample size with bounded outcome scores</p>
            </title>
            <aug>
               <au>
                  <snm>Lesaffre</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Scheys</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Frohlich</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Bluhmki</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Statistics in Medicine</source>
            <pubdate>1993</pubdate>
            <volume>12</volume>
            <fpage>1063</fpage>
            <lpage>1078</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8341866</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Regression with Bounded Outcome Score: Evaluation of Power by Bootstrap and Simulation in a Chronic Myelogenous Leukaemia Clinical Trial</p>
            </title>
            <aug>
               <au>
                  <snm>Tsodikov</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Hasenclever</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Loeffler</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Statistics in Medicine</source>
            <pubdate>1998</pubdate>
            <volume>17</volume>
            <fpage>1909</fpage>
            <lpage>1922</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/(SICI)1097-0258(19980915)17:17&lt;1909::AID-SIM890>3.0.CO;2-0</pubid>
                  <pubid idtype="pmpid" link="fulltext">9777686</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>Choice of test for comparing two groups, with particular application to skewed outcomes</p>
            </title>
            <aug>
               <au>
                  <snm>White</snm>
                  <fnm>IR</fnm>
               </au>
               <au>
                  <snm>Thompson</snm>
                  <fnm>SG</fnm>
               </au>
            </aug>
            <source>Statistics in Medicine</source>
            <pubdate>2003</pubdate>
            <volume>22</volume>
            <fpage>1205</fpage>
            <lpage>1215</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1002/sim.1420</pubid>
                  <pubid idtype="pmpid" link="fulltext">12687651</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B34">
            <aug>
               <au>
                  <snm>Fairclough</snm>
                  <fnm>DL</fnm>
               </au>
            </aug>
            <source>Design and Analysis of Quality of Life Studies in Clinical Trials</source>
            <publisher>New York: Chapman &amp; Hall</publisher>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B35">
            <title>
               <p>Group vs individual approaches to understanding the clinical significance of differences or changes in quality of life</p>
            </title>
            <aug>
               <au>
                  <snm>Cella</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Bullinger</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Scott</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Barofsky</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <cnm>and the Clinical Significance Consensus Meeting Group</cnm>
               </au>
            </aug>
            <source>Mayo Clinic Proceedings</source>
            <pubdate>2002</pubdate>
            <volume>77</volume>
            <issue>4</issue>
            <fpage>384</fpage>
            <lpage>392</lpage>
            <xrefbib>
               <pubid idtype="pmpid">11936936</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B36">
            <title>
               <p>Statistical review by research ethics committees</p>
            </title>
            <aug>
               <au>
                  <snm>Williamson</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Hutton</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Bliss</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Blunt</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Campbell</snm>
                  <fnm>MJ</fnm>
               </au>
               <au>
                  <snm>Nicholson</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Journal of the Royal Statistical Society, Series A</source>
            <pubdate>2000</pubdate>
            <volume>163</volume>
            <fpage>5</fpage>
            <lpage>13</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1111/1467-985X.00152</pubid>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>
