Statistical power analysis

Disclaimer: Most of the contents on this page were directly copied from Wikipedia.

The power of a statistical test is the probability that it correctly rejects the null hypothesis when the null hypothesis is false (i.e. the probability of not committing a Type II error). That is,

\[ \mbox{power} = \mathbb P\big( \mbox{reject null hypothesis} \big| \mbox{null hypothesis is false} \big) \]

It can be equivalently thought of as the probability of correctly accepting the alternative hypothesis when the alternative hypothesis is true - that is, the ability of a test to detect an effect, if the effect actually exists.

The power is in general a function of the possible distributions, often determined by a parameter, under the alternative hypothesis. As the power increases, the chances of a Type II error occurring decrease. The probability of a Type II error occurring is referred to as the false negative rate (β) and the power is equal to 1−β. The power is also known as the sensitivity.
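
As an illustration of power as a function of the parameter under the alternative, the following sketch computes the power of a two-sided one-sample z-test for several true means. The test, the known standard deviation `sigma`, the sample size `n = 25`, and the function name `z_test_power` are all assumptions chosen for the example, not something prescribed by this page.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def z_test_power(mu_alt, mu_null=0.0, sigma=1.0, n=25, alpha=0.05):
    """Power of a two-sided one-sample z-test when the true mean is mu_alt.

    Assumes normally distributed data with a known standard deviation.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Under the alternative, the test statistic is Normal(shift, 1).
    shift = (mu_alt - mu_null) * sqrt(n) / sigma
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - shift)
    lower_tail = STD_NORMAL.cdf(-z_crit - shift)
    return upper_tail + lower_tail


for mu in (0.0, 0.2, 0.4, 0.6, 0.8):
    power = z_test_power(mu)
    # beta is the Type II error rate, so power = 1 - beta.
    print(f"true mean {mu:.1f}: power = {power:.3f}, beta = {1 - power:.3f}")
```

When the true mean equals the null value, the rejection probability is just the significance level; it rises toward 1 as the true mean moves away from the null.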

Power analysis can be used to calculate the minimum sample size required so that one can be reasonably likely to detect an effect of a given size. Power analysis can also be used to calculate the minimum effect size that is likely to be detected in a study using a given sample size. In addition, the concept of power is used to make comparisons between different statistical testing procedures: for example, between a parametric and a nonparametric test of the same hypothesis.
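
Both calculations can be sketched for a two-group comparison using the usual normal approximation, in which the per-group sample size is roughly 2((z_{1-α/2} + z_{power})/d)² for a standardized effect size d. The function names are hypothetical, and the approximation ignores the small-sample correction that a t-test would need, so the results are slightly optimistic.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_per_group(d, power=0.80, alpha=0.05):
    """Approximate per-group sample size to detect standardized effect d."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def min_detectable_effect(n, power=0.80, alpha=0.05):
    """Approximate smallest standardized effect detectable with n per group."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return (z_alpha + z_power) * sqrt(2 / n)


for d in (0.2, 0.5, 0.8):
    print(f"effect size {d}: about {n_per_group(d)} observations per group")

for n in (20, 50, 200):
    print(f"n = {n} per group: detectable effect of about "
          f"{min_detectable_effect(n):.2f}")
```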

Factors influencing power

Statistical power may depend on a number of factors. Some of these factors may be particular to a specific testing situation, but at a minimum, power nearly always depends on the following three factors:

the statistical significance criterion used in the test
the magnitude of the effect of interest in the population
the sample size used to detect the effect

A significance criterion is a statement of how unlikely a positive result must be, if the null hypothesis of no effect is true, for the null hypothesis to be rejected. The most commonly used criteria are probabilities of 0.05 (5%, 1 in 20), 0.01 (1%, 1 in 100), and 0.001 (0.1%, 1 in 1000). If the criterion is 0.05, the probability of the data implying an effect at least as large as the observed effect when the null hypothesis is true must be less than 0.05, for the null hypothesis of no effect to be rejected. One easy way to increase the power of a test is to carry out a less conservative test by using a larger significance criterion, for example 0.10 instead of 0.05. This increases the chance of rejecting the null hypothesis (i.e. obtaining a statistically significant result) when the null hypothesis is false, that is, reduces the risk of a Type II error (false negative regarding whether an effect exists). But it also increases the risk of obtaining a statistically significant result (i.e. rejecting the null hypothesis) when the null hypothesis is not false; that is, it increases the risk of a Type I error (false positive).
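
The trade-off between the two error types can be made concrete with a small calculation. This sketch assumes a two-sided two-sample z-test with a standardized effect of 0.5 and 30 observations per group; those numbers and the function name are illustrative assumptions only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sample_power(d, n_per_group, alpha):
    """Power of a two-sided two-sample z-test for standardized effect d."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


for alpha in (0.01, 0.05, 0.10):
    power = two_sample_power(d=0.5, n_per_group=30, alpha=alpha)
    # The Type I error rate is alpha itself; the Type II rate is 1 - power.
    print(f"alpha = {alpha:.2f}: Type I risk = {alpha:.2f}, "
          f"power = {power:.3f}, Type II risk = {1 - power:.3f}")
```

Loosening the criterion raises the power in every row, but only by accepting a proportionally larger Type I risk.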

The magnitude of the effect of interest in the population can be quantified in terms of an effect size, where there is greater power to detect larger effects. An effect size can be a direct estimate of the quantity of interest, or it can be a standardized measure that also accounts for the variability in the population. For example, in an analysis comparing outcomes in a treated and control population, the difference of outcome means \(Y-X\) would be a direct measure of the effect size, whereas \((Y-X)/\sigma\) where σ is the common standard deviation of the outcomes in the treated and control groups, would be a standardized effect size. If constructed appropriately, a standardized effect size, along with the sample size, will completely determine the power. An unstandardized (direct) effect size will rarely be sufficient to determine the power, as it does not contain information about the variability in the measurements.
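
To see why the raw difference alone is not enough, the sketch below holds the difference in means fixed at 5 units and varies only the common standard deviation. The means, standard deviations, and sample size are made-up values, and the power formula is the normal approximation for a two-sided two-sample test.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def standardized_effect(mean_treated, mean_control, sigma):
    """Difference in means divided by the common standard deviation."""
    return (mean_treated - mean_control) / sigma


def power_from_d(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test for effect size d."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


# Same raw difference (105 - 100 = 5), two different amounts of variability.
for sigma in (10.0, 20.0):
    d = standardized_effect(105.0, 100.0, sigma)
    print(f"sigma = {sigma:.0f}: standardized effect = {d:.2f}, "
          f"power with 64 per group = {power_from_d(d, 64):.3f}")
```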

The sample size determines the amount of sampling error inherent in a test result. Other things being equal, effects are harder to detect in smaller samples. Increasing sample size is often the easiest way to boost the statistical power of a test.
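
Power can also be estimated by simulation rather than by formula, which makes the role of sampling error visible. The following Monte Carlo sketch assumes a one-sample z-test with known standard deviation and a true mean of 0.3; the function name, seed, and number of simulated studies are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulated_power(n, true_mean=0.3, sigma=1.0, alpha=0.05,
                    n_sims=2000, seed=0):
    """Fraction of simulated studies that reject the null mean of zero."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = fmean(sample) * sqrt(n) / sigma
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_sims


for n in (10, 40, 160):
    print(f"n = {n}: estimated power = {simulated_power(n):.3f}")
```

The same true effect is detected far more often as the sample grows, because the sample mean varies less from study to study.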

The precision with which the data are measured also influences statistical power. Consequently, power can often be improved by reducing the measurement error in the data.
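
One simple way to picture this is the classical additive error model, in which independent measurement noise adds to the true variance and so shrinks the standardized effect. That model, along with the numbers and names below, is an assumption made for this sketch; real measurement error can behave differently.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_with_measurement_error(true_diff, sigma_true, sigma_error,
                                 n_per_group, alpha=0.05):
    """Approximate two-sample power when outcomes carry additive noise."""
    # Independent noise adds variances: observed variance = true + error.
    sigma_observed = sqrt(sigma_true ** 2 + sigma_error ** 2)
    d_observed = true_diff / sigma_observed
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = d_observed * sqrt(n_per_group / 2)
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


for sigma_error in (0.0, 0.5, 1.0, 2.0):
    power = power_with_measurement_error(0.5, 1.0, sigma_error, 64)
    print(f"measurement error sd = {sigma_error:.1f}: power = {power:.3f}")
```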

The design of an experiment or observational study often influences the power. For example, in a two-sample testing situation with a given total sample size n, it is optimal to have equal numbers of observations from the two populations being compared (as long as the variances in the two populations are the same). In regression analysis and Analysis of Variance, there is an extensive theory, and practical strategies, for improving the power based on optimally setting the values of the independent variables in the model.
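
The claim about equal allocation can be checked numerically. The standard error of the difference in means is σ·sqrt(1/n1 + 1/n2), which is smallest when n1 = n2 for a fixed total. The sketch assumes equal variances, a total of 100 observations, and a true difference of 0.5 standard deviations; all of these are illustrative values.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_for_allocation(n1, n2, diff=0.5, sigma=1.0, alpha=0.05):
    """Approximate two-sample power for group sizes n1 and n2."""
    se = sigma * sqrt(1 / n1 + 1 / n2)  # standard error of the mean difference
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = diff / se
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


total = 100
for n1 in (50, 70, 90):
    n2 = total - n1
    print(f"split {n1}/{n2}: power = {power_for_allocation(n1, n2):.3f}")

# Search every possible split of the fixed total for the most powerful one.
best_n1 = max(range(1, total), key=lambda k: power_for_allocation(k, total - k))
print(f"most powerful split: {best_n1}/{total - best_n1}")
```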

Interpretation

Although there are no formal standards for power (sometimes referred to as \(\pi\)), most researchers assess the power of their tests using \(\pi = 0.80\) as a standard for adequacy. This convention implies a four-to-one trade-off between \(\beta\)-risk and \(\alpha\)-risk. (\(\beta\) is the probability of a Type II error and \(\alpha\) is the probability of a Type I error; 0.2 and 0.05 are conventional values for \(\beta\) and \(\alpha\).) However, there will be times when this 4-to-1 weighting is inappropriate. In medicine, for example, tests are often designed in such a way that no false negatives (Type II errors) will be produced. But this inevitably raises the risk of obtaining a false positive (a Type I error). The rationale is that it is better to tell a healthy patient “we may have found something - let's test further”, than to tell a diseased patient “all is well”.

Power analysis is appropriate when the concern is with the correct rejection, or not, of a null hypothesis. In many contexts, the issue is less about determining whether there is a difference and more about getting a refined estimate of the population effect size. For example, if we were expecting a population correlation between intelligence and job performance of around .50, a sample size of about 30 will give us approximately 80% power (alpha = .05, two-tail) to reject the null hypothesis of zero correlation. However, in doing this study we are probably more interested in knowing whether the correlation is .30 or .60 or .50. In this context we would need a much larger sample size in order to reduce the confidence interval of our estimate to a range that is acceptable for our purposes. Techniques similar to those employed in a traditional power analysis can be used to determine the sample size required for the width of a confidence interval to be less than a given value.
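
A precision-based calculation of this kind can be sketched with the Fisher z-transformation, under which atanh(r) is approximately normal with standard error 1/sqrt(n − 3). The function names and the target width of 0.2 are assumptions for the example, and the interval is an approximation rather than an exact result.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def correlation_ci(r, n, conf=0.95):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z_crit = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half_width = z_crit / sqrt(n - 3)
    z = atanh(r)
    # Build the interval on the z scale, then map back to the r scale.
    return tanh(z - half_width), tanh(z + half_width)


def n_for_ci_width(r, max_width, conf=0.95):
    """Smallest n whose approximate interval is no wider than max_width."""
    n = 4
    while True:
        lower, upper = correlation_ci(r, n, conf)
        if upper - lower <= max_width:
            return n
        n += 1


for n in (20, 30, 100):
    lower, upper = correlation_ci(0.5, n)
    print(f"n = {n}: 95% interval for r = .50 is ({lower:.2f}, {upper:.2f})")

print("n needed for an interval no wider than 0.2:", n_for_ci_width(0.5, 0.2))
```

A sample that is adequate for rejecting a zero correlation still leaves an interval too wide to distinguish .30 from .60, which is the point made in the paragraph above.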

Many statistical analyses involve the estimation of several unknown quantities. In simple cases, all but one of these quantities is a nuisance parameter. In this setting, the only relevant power pertains to the single quantity that will undergo formal statistical inference. In some settings, particularly if the goals are more “exploratory”, there may be a number of quantities of interest in the analysis. For example, in a multiple regression analysis we may include several covariates of potential interest. In situations such as this where several hypotheses are under consideration, it is common that the powers associated with the different hypotheses differ. For instance, in multiple regression analysis, the power for detecting an effect of a given size is related to the variance of the covariate. Since different covariates will have different variances, their powers will differ as well.
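
The dependence on covariate variance can be illustrated with a large-sample approximation in which the standard error of a coefficient is about σ/(sd(x)·sqrt(n)). This simplification assumes the covariates are uncorrelated with one another and that the residual standard deviation σ is known; the function name and numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def coefficient_power(beta, sd_x, n, sigma=1.0, alpha=0.05):
    """Approximate power to detect a regression coefficient beta.

    Assumes uncorrelated covariates and a known residual sd (sigma).
    """
    se = sigma / (sd_x * sqrt(n))  # approximate standard error of the estimate
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = beta / se
    return (1 - STD_NORMAL.cdf(z_crit - shift)) + STD_NORMAL.cdf(-z_crit - shift)


# Same coefficient and sample size, covariates with different spreads.
for sd_x in (0.5, 1.0, 2.0):
    power = coefficient_power(beta=0.2, sd_x=sd_x, n=100)
    print(f"covariate sd = {sd_x:.1f}: power = {power:.3f}")
```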

It is also important to consider the statistical power of a hypothesis test when interpreting its results. A test's power is the probability of correctly rejecting the null hypothesis when it is false; a test's power is influenced by the choice of significance level for the test, the size of the effect being measured, and the amount of data available. A hypothesis test may fail to reject the null, for example, if a true difference exists between two populations being compared by a t-test but the effect is small and the sample size is too small to distinguish the effect from random chance. Many clinical trials, for instance, have low statistical power to detect differences in adverse effects of treatments, since such effects are rare and the number of affected patients is very small.
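
The clinical-trial point can be illustrated with the standard normal-approximation power formula for comparing two proportions. The adverse-event rates of 1% and 2% and the arm sizes are invented for the example, and the normal approximation is itself rough when events are this rare, so the figures are indicative only.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_proportion_power(p_control, p_treated, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided test comparing two proportions."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    p_bar = (p_control + p_treated) / 2
    # Standard error under the null (pooled) and under the alternative.
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)
    se_alt = sqrt(p_control * (1 - p_control) / n_per_arm
                  + p_treated * (1 - p_treated) / n_per_arm)
    diff = abs(p_treated - p_control)
    # The negligible probability of rejecting in the wrong direction is ignored.
    return STD_NORMAL.cdf((diff - z_crit * se_null) / se_alt)


for n in (200, 500, 1000, 3000):
    power = two_proportion_power(0.01, 0.02, n)
    print(f"{n} patients per arm: power = {power:.3f}")
```

Even a doubling of the adverse-event rate is likely to be missed in a trial of a few hundred patients per arm under these assumptions.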

A priori vs. post hoc analysis

Power analysis can either be done before (a priori or prospective power analysis) or after (post hoc or retrospective power analysis) data are collected. A priori power analysis is conducted prior to the research study, and is typically used in estimating sufficient sample sizes to achieve adequate power. Post-hoc power analysis is conducted after a study has been completed, and uses the obtained sample size and effect size to determine what the power was in the study, assuming the effect size in the sample is equal to the effect size in the population. Whereas the utility of prospective power analysis in experimental design is universally accepted, the usefulness of retrospective techniques is controversial. Falling for the temptation to use the statistical analysis of the collected data to estimate the power will result in uninformative and misleading values. In particular, it has been shown that post-hoc power in its simplest form is a one-to-one function of the p-value attained. This has been extended to show that all post-hoc power analyses suffer from what is called the “power approach paradox” (PAP), in which a study with a null result is thought to show MORE evidence that the null hypothesis is actually true when the p-value is smaller, since the apparent power to detect an actual effect would be higher. In fact, a smaller p-value is properly understood to make the null hypothesis LESS likely to be true.
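
The one-to-one relationship between post hoc power and the p-value is easy to see for a two-sided z-test: the observed effect is recovered from the p-value and then plugged back in as if it were the true effect. The choice of a z-test and the function name are assumptions made for this sketch.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def observed_power(p_value, alpha=0.05):
    """Post hoc power of a two-sided z-test, computed from the p-value alone."""
    # The p-value determines the magnitude of the observed test statistic.
    z_observed = STD_NORMAL.inv_cdf(1 - p_value / 2)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    # Treat the observed statistic as the true shift and compute power.
    upper_tail = 1 - STD_NORMAL.cdf(z_crit - z_observed)
    lower_tail = STD_NORMAL.cdf(-z_crit - z_observed)
    return upper_tail + lower_tail


for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:.2f}: post hoc power = {observed_power(p):.3f}")
```

No input other than the p-value and the significance level is needed, so the computed power adds no information beyond the p-value; a result that just reaches p = 0.05 always yields post hoc power of about one half.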

Software for Power and Sample Size Calculations

Numerous programs are available for performing power and sample size calculations (see http://www.epibiostat.ucsf.edu/biostat/sampsize.html). These include

WebPower
nQuery Advisor
Russ Lenth's power and sample-size page
Power and sample size.
A large set of power and sample size routines are included in R and Stata
The specialized programs listed above are easier to use for people who are not familiar with the more general packages. nQuery, PASS, SAS and Stata are commercial products; the other programs listed above are freely available.
