


2020-09-19 23:31


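I posted earlier about a sports nutrition website that my supervisor has bookmarked. It is a personal website,
and its statistics section is remarkably detailed and rich! It is a pity that few people have paid attention to it; this website really is ...

I often see fellow forum members asking how to determine sample size. The sample size has to be determined from your experimental design.
Here I am posting the part on how to calculate sample size, for everyone's reference. Please, according to your own experimental design, ...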

The traditional approach to estimation of sample size is based on statistical
significance of your outcome measure. You have to specify the smallest effect you
want to detect, the Type I and Type II error rates, and the design of the study.
I present here new formulae for the resulting estimates of sample size. I also include
new ways to adjust for validity and reliability, and I finish with sample sizes
required for several complex cross-sectional designs.

I also advocate a new approach to sample-size estimation based on width of the
confidence interval of your outcome measure. In this new approach, your concern is
with the precision of your estimate of the effect, not with the statistical
significance of the effect. The formulae on these pages still apply, but you halve
the sample sizes.

--------------------------------------------------------------------------------
The Smallest Effect Worth Detecting
I've already spent a whole page on magnitudes of effects. You should go back and
make sure you understand it before proceeding. Or take a risk and read on!

Let's look at a simple example of the smallest effect worth detecting. Your research
project includes the question of differences in height of adults in two regions.
This sounds like a trivial project, but hey, the difference might be caused by a
nutritional deficit, environmental toxin, level of physical activity, or whatever.
OK, what difference in height would you consider to be the smallest difference worth
noticing or commenting on? Almost everyone reading this paragraph will automatically
start thinking either in inches or centimeters. So what's your choice? An inch, or
2.5 cm? Sounds like a nice round figure! Let's go with it for now.

To use my approach to sample-size estimation, you convert this difference into a
value for the effect-size statistic. To do that, you divide it by the standard
deviation, expressed in the same units. The standard deviation here is just the usual
measure of spread, except that we have two groups. So let's assume we have an average
of the standard deviation in both groups. Let's say it is 2 inches, or 5 cm. So,
if you want to detect 2.5 cm, and the standard deviation is 5.0 cm, the smallest
effect worth detecting is 2.5/5.0, or 0.5.
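Here is a minimal Python sketch of that conversion, assuming the difference and the standard deviation are in the same units. The function name effect_size is made up for illustration; it is not from the original page.

```python
def effect_size(difference, sd):
    """Standardized effect size: raw difference divided by the standard deviation."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return difference / sd


# The height example: 2.5 cm with an SD of 5.0 cm, or 1 inch with an SD of 2 inches.
print(effect_size(2.5, 5.0))  # 0.5
print(effect_size(1.0, 2.0))  # 0.5
```

Either set of units gives the same effect size, which is the point of standardizing.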

I'll talk about what I mean by detecting in a minute. First, more about the smallest
effect. You'll discover shortly that the required number of subjects is quite
sensitive to the magnitude of the smallest worthwhile effect. In fact, halving the
magnitude quadruples the number of subjects required to detect it. So the way you
decide on the smallest effect is important. How did we arrive at that minimum
difference of 2.5 cm? In my experience, most researchers dream up a number that sounds
plausible, just like we did here. Well, sorry, but you just can't do it like that.
In fact, you don't have the freedom to choose the minimum effect. In all but a few
special cases, it's the threshold for small effects on the scale of magnitudes: 0.2
for the Cohen effect-size statistic, 10% for a frequency difference, and 0.1 for
a correlation. You need the same sample size to detect each of these effects, and
as we'll see, it's 800 subjects for a simple cross-sectional study in the
old-fashioned way of doing the figuring. It's even more than 800 when you factor
in the validity of your variables. But don't panic. We'll also see that there are
ways of reducing this number, sometimes drastically.

--------------------------------------------------------------------------------
Type I and II Error Rates
Now, what do I mean by detecting? Simply that if the real difference between the
two groups in the population is 2.5 cm (an effect size of 0.5), you want to be sure
that it will turn up as statistically significant in the sample that you draw for
your study. If it doesn't turn up as statistically significant, you have failed to
detect something that you were interested in. Make sense? So our definition of
statistical significance, and our idea of what it means to be sure that it will turn
up, both impact on the required sample size.

First, statistical significance. The difference is statistically significant, by
definition, if the 95% confidence interval does not overlap zero, or if the p value
for the effect is less than 0.05. Values of 95% or 0.05 are also equivalent to a
Type I error rate of 5%: in other words, the rate of false alarms in the absence
of any population effect will be 5%. We don't have any choice here. It has to be
5%, or preferably less, but most researchers opt for 5%. If you want a lower rate
of false alarms, say 1%, you will need more subjects.

Now, what about being sure that the effect will turn up? In other words, if the effect
really is 2.5 cm in the populations, how sure do we want to be that the difference
observed in our sample will be statistically significant? We don't have any choice
here, either. We have to be at least 80% sure of detecting the smallest effect. To
put it another way, the power of the study to detect the smallest effect has to be
at least 80%. Or to put it yet one more way, the Type II error rate--the rate of
failed alarms for the smallest effect--is set at 20% or less. That's one chance in
five of missing the thing you're looking for!?! Sounds a bit high, but keep in mind
that it is the rate for the smallest worthwhile effect. The chance of missing larger
effects is smaller. Once again, if you want to make the error rate lower, say 10%,
you will need more subjects.
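To make the two error rates concrete, here is a rough Python sketch of power for a comparison of two equal-sized groups. It uses a normal approximation, which is an assumption of this sketch and not necessarily how the figures on this page were worked out. The function name approx_power and its parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, total_n, alpha=0.05):
    """Approximate chance of a statistically significant result (two-sided test).

    Assumes two equal groups of total_n / 2 subjects each and a normally
    distributed test statistic.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # The standardized difference has a standard error of 2 / sqrt(total_n).
    shift = effect_size * sqrt(total_n) / 2
    # Add both tails: significant in the right direction or in the wrong one.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


print(round(approx_power(0.0, 800), 3))  # no effect: false-alarm rate of 0.05
print(round(approx_power(0.2, 800), 3))  # smallest effect: power of about 0.8
```

With no population effect the function returns the Type I error rate, and with the smallest effect it returns one minus the Type II error rate.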

--------------------------------------------------------------------------------
Research Design
We're stuck with having to detect 0.2 for the effect-size statistic, 10% for a
frequency difference, or 0.1 for a correlation. And we're stuck with false and failed
alarms of 5% and 20%. All that's left now is how we're going to go about it: the
research design. When it comes to sample sizes, there are only two sorts of research
design: cross-sectional and longitudinal.

Cross-sectional designs include correlational, case-control, and any other design
with single observations for each subject. Some so-called prospective designs, where
subjects are followed up over time, are cross-sectional if there is only one value
for each variable for each subject. Cross-sectional studies need heaps of subjects,
and the number is affected by the validity of the variables.

Longitudinal designs include time series, experiments, controlled trials,
crossovers, and anything else where the dependent variable is measured twice or more.
The data have to be subjected to repeated-measures analysis. The usual thing with
these designs is a measurement before and after you do something, to see if what
you do has any effect. Whether or not you have a control group, it is always the
case that subjects
post measurements on the subjects. Longitudinal designs generally need far fewer
subjects than cross-sectional designs, depending on the reliability of the dependent
variable.


Sample Size for Cross-Sectional Studies

For variables with perfect validity, you can now look up tables or run special
software to see how many subjects you need. (G*power is a great little free program
for the purpose.) Or use the following simple formula I have worked out:

For Type I and II errors of 5% and 20%, the total number of subjects N is given by:

N = 32/ES^2, where ES is the smallest effect size worth detecting.
Example: for ES = 0.2, the total N is 800, which means 400 in each group for a
case-control study or a study comparing males and females. So for our study of
differences in height, we'd need 400 in each group.
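The formula (32 divided by the square of the effect size) is easy to put into a few lines of Python. This is only a sketch; the function name total_subjects is made up, and rounding up to a whole subject is my own choice.

```python
from math import ceil


def total_subjects(effect_size, constant=32):
    """Total N for a simple cross-sectional study: constant / ES^2, rounded up."""
    if effect_size <= 0:
        raise ValueError("effect size must be positive")
    # round() guards against floating-point noise before rounding up.
    return ceil(round(constant / effect_size ** 2, 9))


print(total_subjects(0.2))  # 800, i.e. 400 in each of two groups
print(total_subjects(0.5))  # 128
print(total_subjects(0.1))  # 3200: halving the effect quadruples N
```

The last line shows how sensitive the number is to the smallest effect: going from 0.2 to 0.1 takes you from 800 to 3200 subjects.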

What about if the outcome is a difference in the frequency of something in the two
groups, for example the frequency of clinical obesity. The minimum worthwhile
difference is 10% (e.g. 25% in one group and 35% in the other). You just think about
that difference as being equivalent to an effect size of 0.2, and plug it into the
formula: 400 in each group again.

And finally what about sample size to detect a correlation, for example the
correlation between physical activity and body fat? Same story: 800 subjects to
detect the minimum worthwhile correlation of 0.1, because a correlation of 0.1 is
equivalent to an effect size of 0.2. For larger correlations use the scale of
magnitudes to convert the correlation to an equivalent effect size, then plug it
into the formula.

For the rare cases where you have the luxury of Type I and II errors of 1% and 10%
respectively, the number is nearly double: N = 60/ES^2.
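If you are curious where constants like 32 and 60 could come from, the sketch below reproduces them approximately from normal-distribution z values. This derivation is an assumption on my part (a standard normal approximation for two equal groups); the page itself only says the formula was worked out by the author. The function name sample_size_constant is hypothetical.

```python
from statistics import NormalDist


def sample_size_constant(alpha=0.05, power=0.80):
    """Constant C in N = C / ES^2 under a two-sided normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the Type I error
    z_beta = z.inv_cdf(power)           # value for the Type II error (1 - power)
    return 4 * (z_alpha + z_beta) ** 2


print(round(sample_size_constant(0.05, 0.80), 1))  # a little under 32
print(round(sample_size_constant(0.01, 0.90), 1))  # a little under 60
```

Both values come out slightly below the round numbers in the formulae, so 32 and 60 look like sensible rounded-up versions.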

Validity of the variables can have a major impact on sample size in cross-sectional
studies. The lower the validity, the more
subjects you need to detect the signal. If the validity correlation of the dependent
variable is v (Pearson, intraclass, or kappa), the number of subjects increases to
N/v^2.

To detect a correlation between variables with validities v and w, the number is
N/(v^2 w^2). Sample sizes may therefore have to be doubled or quadrupled when effects
are represented by psychometric or other variables that have modest (~0.7) validity.
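A short Python sketch of the validity adjustment, dividing N by the squared validity of each variable. The function name adjust_for_validity is made up for illustration, and rounding up is my own choice.

```python
from math import ceil


def adjust_for_validity(n, v, w=1.0):
    """Inflate sample size n for validity correlations v and w (1.0 = perfect)."""
    for validity in (v, w):
        if not 0 < validity <= 1:
            raise ValueError("validity must be between 0 (exclusive) and 1")
    return ceil(round(n / (v ** 2 * w ** 2), 9))


print(adjust_for_validity(800, 1.0))       # 800: perfect validity changes nothing
print(adjust_for_validity(800, 0.7))       # 1633: roughly double
print(adjust_for_validity(800, 0.7, 0.7))  # 3332: roughly quadruple
```

With validities around 0.7 the numbers roughly double for one imperfect variable and quadruple for two, as stated above.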

Sample Size for Longitudinal Studies

In our first example on this page, we had a cross-sectional design in which we were
interested in the difference in height between people in two regions. Now, in a
longitudinal design, we might want to know whether a stretching exercise makes people
taller. Can you see that the same concept of minimum effect size still holds here?
If we thought one inch was the smallest difference worth detecting between groups,
then it has to be the smallest difference we would like to see as a result of our
stretching exercise. (It might need a medieval rack to make people a whole inch
taller!)

Once again we don't have a choice about that minimum effect: it's still an effect
size of 0.2 standard deviations, and the standard deviation is still the usual
standard deviation of the subjects. At the moment we have only one group of subjects,
and the standard deviation before we put people on the rack is usually about the
same as after the rack. So you can think about the minimum effect size as a fraction
of either standard deviation. But note well: do not use the standard deviation of
the before-after difference score.

Reliability of the dependent variable is the final piece of the jigsaw. The higher
the reliability, the more reproducible are the values for each subject when you
retest them, which makes it more likely you will detect a change in their values.
So the higher the reliability, the fewer subjects you need to detect the minimum effect.
Read the earlier section on sample size for an experiment for an overview of the
role of typical error in sample-size estimation, and for an important detail about
the conditions in a reliability study aimed at estimating sample size.

The rest of this section contains details of formulae that you may not need to worry
about. You can use two forms of reliability in the formulae: retest correlation and
within-subject variation.

Using the Retest Correlation

First, a couple of cautions. The retest correlation is for retests with the same
time between the tests as you intend to have in your experiment. For example, if
you are doing an intervention that lasts 2 months, you need a 2-month retest
correlation. Don't use a 1-day retest correlation unless you have good grounds for
believing that it will be the same as a 2-month retest correlation. Also, the spread
between the subjects in your study has to be similar to the spread between the
subjects in the reliability study. If the spread is different, the value of the retest
correlation coefficient will be inappropriate. In that case you will need to
calculate the appropriate value by combining the within (s) and between (S) standard
deviations for your subjects using this formula:
retest correlation r = (S^2 - s^2)/S^2.
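In Python the formula for the retest correlation looks like this, with s the within-subject and S the between-subject standard deviation. The function name retest_correlation is hypothetical, and the example values are the ones used further down this page.

```python
def retest_correlation(within_sd, between_sd):
    """r = (S^2 - s^2) / S^2, from within (s) and between (S) standard deviations."""
    if between_sd <= 0 or within_sd < 0:
        raise ValueError("standard deviations must be positive")
    return (between_sd ** 2 - within_sd ** 2) / between_sd ** 2


print(round(retest_correlation(2.0, 5.0), 2))   # 0.84
print(round(retest_correlation(2.0, 10.0), 2))  # 0.96
```

A bigger spread between subjects pushes the correlation up even though the within-subject variation is unchanged, which is why the spread in the reliability study has to match the spread in your own study.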

Right, here's the strategy for working out the required sample size when you know
the retest correlation:

Work out the sample size of an equivalent cross-sectional study, N, as shown above.
It's 800 in the traditional approach using statistical significance, or 400 using
my new approach of adequate precision of estimation for trivial effects.
Determine the reliability r of the outcome measure by consulting the literature or
doing a separate study.
For a simple design consisting of a single pre and post measurement on each subject,
and no control group, the number of subjects is:
n = (1 - r)N/2
This formula applies also to simple crossover designs, in which subjects receive
an experimental treatment and a control treatment. (One half get the experimental
treatment first; the other half get the control treatment first.)
If there is a control group, the total number of subjects required is:
n = 2(1 - r)N
Yes, you need four times the number of subjects when there is a control group, not
twice the number. Hard to accept, I know.
To take into account the validity of the outcome measure, multiply the above formulae
by 1/v^2, where v is the concurrent validity correlation (the correlation between
the observed value and the true value of the variable). The simplest estimate of
the concurrent validity is the square root of the concurrent reliability correlation
for the outcome measure, so you simply divide the above formulae by the concurrent
reliability correlation. In general, the concurrent reliability will be greater than
the retest reliability
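The strategy above can be sketched in Python as follows. The function and parameter names are made up for illustration, and rounding up to a whole subject is my own choice.

```python
from math import ceil


def subjects_needed(r, cross_sectional_n=800, control_group=False, validity=1.0):
    """Subjects for a pre-post design, given the retest correlation r.

    Without a control group (or for a simple crossover): (1 - r) * N / 2.
    With a control group: 2 * (1 - r) * N in total.
    The result is divided by validity^2 to allow for imperfect validity.
    """
    if not 0 <= r <= 1:
        raise ValueError("retest correlation must be between 0 and 1")
    if not 0 < validity <= 1:
        raise ValueError("validity must be between 0 (exclusive) and 1")
    if control_group:
        n = 2 * (1 - r) * cross_sectional_n
    else:
        n = (1 - r) * cross_sectional_n / 2
    return ceil(round(n / validity ** 2, 9))


print(subjects_needed(0.84))                          # 64
print(subjects_needed(0.84, control_group=True))      # 256, four times as many
print(subjects_needed(0.96, cross_sectional_n=400))   # 8
print(subjects_needed(0.84, validity=0.9))            # 80
```

The third line uses N = 400, the starting point for the approach based on precision of estimation.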

Using the Within-Subject Variation

You can also think about the difference between the post and pre means in terms of
the within-subject variation (standard deviation). For example, if the performance
of an individual athlete varies by 1% (the within-subject standard deviation
expressed as a coefficient of variation), how many athletes should you test to detect
a 1% change in performance, or a 2% change, or a 0.5% change? Here is the formula:

To detect a fraction f of a within-subject standard deviation with 5% false alarms
and 20% failed alarms:
n = 64/f^2 with a full control group
n = 16/f^2 for crossovers or experiments without a control group.
Another way to represent the same formulae is to replace f with d/s, where d is the
smallest worthwhile post-pre difference you want to detect, and s is the
within-subject standard deviation:
n = 64 s^2/d^2 with a full control group
n = 16 s^2/d^2 for crossovers or experiments without a control group.
Remember to halve these numbers when you justify sample size using the new approach
based on acceptable precision of the outcome.

Example: You want to detect (p=0.05, 80% power) a 2% change in performance when the
coefficient of variation is 2%. The corresponding value of f is 1.0, which means
you'd need to test 16 athletes in a crossover design, or 32 in each of a control
and experimental group. Or it's 8 or 16+16, if you justify sample size using precision
of estimation.
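Here is the same calculation as a small Python sketch. The function name and the flags are hypothetical; the constants 64 and 16 and the halving for the precision approach are taken from the text above.

```python
from math import ceil


def subjects_from_within_sd(d, s, control_group=True, precision_approach=False):
    """Subjects needed to detect a change d when the within-subject SD is s.

    d and s must be in the same units (for example, both in percent).
    """
    if d <= 0 or s <= 0:
        raise ValueError("d and s must be positive")
    f = d / s
    n = (64 if control_group else 16) / f ** 2
    if precision_approach:
        n /= 2
    return ceil(round(n, 9))


# A 2% change with a 2% coefficient of variation, so f = 1.0.
print(subjects_from_within_sd(2, 2, control_group=False))  # 16 in a crossover
print(subjects_from_within_sd(2, 2, control_group=True))   # 64 in total, 32 + 32
print(subjects_from_within_sd(2, 2, control_group=False, precision_approach=True))  # 8
print(subjects_from_within_sd(2, 2, control_group=True, precision_approach=True))   # 32, or 16 + 16
```

Note that the figure with a control group is the total across both groups.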

What's the smallest value of f worth detecting? Is it 1.0? Not an easy question!
To answer it, you usually have to bring in the between-subject variation one way
or another. Why? Because you can't get away from the fact that the magnitude of a
change in the value of a variable usually has to be thought about in terms of the
variation in the values of that variable between subjects. That's what minimum
worthwhile effect sizes are all about. For example, if the between-subject variation
is 5%, the smallest difference worth detecting is 0.2*5% or 1%. So, if your
within-subject variation is 2%, you have to chase an f of 0.5. But if the
between-subject variation is 10%, the smallest worthwhile effect is 0.2*10% or 2%,
so you chase an f of 1.0.

Once you bring the between-subject variation back into the picture, you have all
the ingredients for expressing the reliability as a retest correlation, so you can
use the formulae with the retest correlation. For example, a within of 2% and a
between of 5% implies a retest correlation of (5^2 - 2^2)/5^2 or (25 - 4)/25 or 0.84. A
within of 2% and a between of 10% implies a correlation of (100 - 4)/100, or 0.96.
Use these correlations in the formulae for sample size and you'll get the same answers
as in the formulae using f. But if you have a reasonable notion of the smallest
worthwhile change in a variable without explicitly knowing the between-subject
standard deviation or the correlation, use the formula with d and s (or f).
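You can check for yourself that the two routes give the same answer. The Python sketch below computes the number of subjects once from f and once from the retest correlation, for the two examples in the text (within-subject variation of 2%, between-subject variation of 5% or 10%). The function names are made up, and the smallest worthwhile effect is taken as 0.2 of the between-subject standard deviation.

```python
def n_via_f(within_sd, smallest_change, control_group=False):
    """n = 16 / f^2 (no control group) or 64 / f^2 (with control group)."""
    f = smallest_change / within_sd
    return (64 if control_group else 16) / f ** 2


def n_via_retest(within_sd, between_sd, control_group=False):
    """Same n, starting from N = 32 / ES^2 and the retest correlation."""
    es = 0.2  # smallest worthwhile effect, in between-subject SDs
    big_n = 32 / es ** 2
    r = (between_sd ** 2 - within_sd ** 2) / between_sd ** 2
    return 2 * (1 - r) * big_n if control_group else (1 - r) * big_n / 2


for between in (5.0, 10.0):
    change = 0.2 * between  # smallest worthwhile change in percent
    print(between, round(n_via_f(2.0, change), 6), round(n_via_retest(2.0, between), 6))
```

Both columns agree: 64 subjects when the between-subject variation is 5%, and 16 when it is 10% (no control group in either case).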

There is certainly one situation where it's better to use the within-subject
variation: estimation of sample size in studies of athletic performance. When
athletes are subjects and competitive performance is the outcome, the smallest
worthwhile effect is an enhancement that increases the medal prospects of a top
athlete, not the average athlete. For sports like track and field, this minimum
effect is about 0.5 of the typical variation in a top athlete's performance between
events. For example, if the typical variation between events is 1.0%, then you're
interested in enhancements of about 0.5%. So if you use a lab test with the same
typical error as the competitive event, f in the above formulae is simply 0.5, so
you would need 64/0.5^2, or 256 subjects for a fully controlled study. That's bad
enough, but if your lab test has a typical variation of 2.0%, f is 0.5/2.0, which
means 1024 subjects! Oh no! Clearly you need very reliable lab tests if you want
to detect the smallest effects that matter to top athletes. See this Sportscience
article for more information:

Hopkins WG, Hawley JA, Burke LM (1999). Researching worthwhile performance
enhancements. Sportscience 3,

Sample Size for Complex Cross-Sectional Studies

I'll deal with two groups of unequal size, more than two groups, and more than one
independent variable. Anything else requires simulation.

Two Groups of Unequal Size

Up to this point I have assumed equal numbers in each group, because that gives the
most power to detect a difference between the groups. But sometimes unequal numbers
are justified.

The simplest case is where you have far more in one group than another. For example,
you already have the heights for thousands of control subjects from all over the
country, and you want to compare these with the heights of people from a particular
region you are interested in. So, how many subjects do you need in that particular
group? And the answer is... as few as one-quarter the usual number! But you will
need to test, or have the data for, an infinitely large
group for the number to be that low. How big is infinite? For the purposes of
statistical power, about 5 times as many as in the special-interest group is close
enough.

I have a formula, but understanding how to apply it will need a lot of thought. If
you have samples of size n1 and n2, then your study will have the power equivalent
to a study with a sample size of N equally divided between two groups, where:

N = 4 n1 n2/(n1 + n2)
For example, if you have data for 1000 controls (= n1), and 800 (= N) is the number
you would normally require for equal-sized groups, then the above formula shows that
you need to test only 250 cases (= n2). If you make n1 very large, the formula
simplifies to N = 4 n2, or n2 = N/4, which is one-quarter the usual total number.
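The formula and its rearrangement for n2 can be sketched in Python like this. The function names are hypothetical, and the rearrangement n2 = N n1/(4 n1 - N) is simply the formula above solved for n2.

```python
from math import ceil


def equivalent_total_n(n1, n2):
    """Equal-groups total N with the same power as groups of size n1 and n2."""
    return 4 * n1 * n2 / (n1 + n2)


def cases_needed(n1, target_n):
    """Size of the second group needed to match an equal-groups total of target_n."""
    if 4 * n1 <= target_n:
        raise ValueError("n1 must be more than target_n / 4")
    return ceil(round(target_n * n1 / (4 * n1 - target_n), 9))


print(cases_needed(1000, 800))         # 250 cases against 1000 controls
print(equivalent_total_n(1000, 250))   # 800.0
print(equivalent_total_n(400, 400))    # 800.0, the usual equal-groups design
print(cases_needed(10 ** 9, 800))      # approaches 800 / 4 for a huge control group
```

The check in cases_needed reflects the limit in the text: however large the other group, the first group can never be smaller than about one-quarter of the usual total.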

More Than Two Groups

Suppose we wanted to compare the heights of people in more than two regions. What
should we do about the sample size? Do we need more than 400 in each region, less
than 400, or just 400? And the answer is... it depends on what estimates or contrasts
you want to perform.

If you are interested in comparing one particular region with another particular
region, you will still need 400 in each of those regions to keep the same power to
detect a difference. The fact that you have all those other regions in the analysis
matters not a jot, I'm afraid. They don't increase the power of the design unless
the number in each region is about 10 or less, which it never should be!

If you are interested in comparing one particular region with the mean of every other,
you've got the usual two-group design, but with 400 subjects in the region of interest
and 400 divided up equally into the other regions.

If you want to do every possible comparison between pairs of regions, or between
pairs of groups of regions, things start to get complicated. As far as I can see,
with six regions, say, only five completely independent comparisons are possible.
So if you are concerned about inflation of the Type I error, you will need to apply
Bonferroni's correction by reducing the p value to 0.05/5, or 0.01. Alas, a smaller
p value means a bigger sample size. It's difficult to work out exactly what it should
go up to, because somehow or other the inflated Type II error should also be taken
into account. Certainly, nearly doubling the group size from the usual 400 would
be a good start in this example, because as we've already seen on this page, that
would be equivalent to a p value of 0.01 and a Type II error of 10%, instead of the
usual 0.05 and 20%.

More Than One Independent Variable

Suppose you intend to measure half a dozen things like age, sex, body fat, whatever,
and you want to know the effect of each of them on severity of injury in a particular
sport. How many subjects do you need?

Before we get clever with complex models for this question, let's take in the big
view. If we treat each variable as a separate issue, it should be obvious that there
will be a problem with inflation of the Type I error: none of the variables you've
measured might predict severity of injury in the population, but if you have enough
variables, there's a good chance one will predict injury in your sample. So you'll
need to reduce your p value using Bonferroni's 0.05/n, where n is the number of
independent variables. This correction will be too severe if the independent
variables are correlated, but I don't know how to adjust for that.
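A small Python sketch of the Bonferroni threshold, along with the chance of at least one false alarm. The second function assumes the tests are independent, which is an assumption of this sketch; as noted above, correlated variables make the correction too severe. Both function names are made up.

```python
def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test p value threshold: alpha divided by the number of tests."""
    if n_tests < 1:
        raise ValueError("need at least one test")
    return alpha / n_tests


def chance_of_any_false_alarm(n_tests, alpha=0.05):
    """Chance of one or more false alarms across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


n_variables = 6  # half a dozen independent variables
print(round(bonferroni_threshold(n_variables), 4))        # 0.0083
print(round(chance_of_any_false_alarm(n_variables), 2))   # about 0.26 uncorrected
corrected = bonferroni_threshold(n_variables)
print(round(chance_of_any_false_alarm(n_variables, corrected), 3))  # back under 0.05
```

Without the correction, six unrelated variables give you about a one-in-four chance of at least one false alarm.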

When you analyze the data, you should look at the effect of the independent variables
separately to start with, but you will also end up using multiple linear regression,
analysis of covariance, or some other complex model, with all the independent
variables on the right-hand side of the model. As I explained on the first page
devoted to complex models, you are now asking a question about how much each variable
contributes to the severity of injury in the presence of (when you control for) the
others. How many subjects do you need to answer this question? Theoretically the
extra independent variables shouldn't make much difference, but I've checked by
simulation to make sure. You need one extra subject for each extra independent
variable. With five extra variables, that makes five extra subjects. Forget it. With
a thousand or so subjects, five won't make any difference.

Here's a different problem involving more than one independent variable, where you
don't have to worry about increasing the sample size to reduce the Type I error.
Suppose you are currently predicting competitive performance from four lab and field
tests, and you want to know whether it's worth adding an expensive fifth test to
the test battery. For this sort of problem, you would model the data by doing a
multiple linear regression, with the expensive test as the last independent variable
in the model. So, how many subjects? It's a specific extra variable in this case,
so there is no inflation of the Type I error, so the sample size is still about 800.
But if all the field tests were in there on an equal footing, and you wanted to know
which ones to drop out of the test battery, then it's back to the bigger sample size
of the previous example. In this case you'd use stepwise regression with a reduced
p value for entry of variables into the model.









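This article was updated on 2020-09-19 23:31. It was provided by the author and does not represent the position of this website. If you repost it, please cite the source: https://www.bjmy2z.cn/gaokao/404731.html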

