Bench Press

The Crossroads of Science and Tech

Statistical goofs


Science News recently put out a very interesting article about the numerous mistakes that many doctors and scientists have made in evaluating statistics. Given the importance of statistical analyses in research today (who doesn’t worship at the “altar of p < 0.05”?), I was pretty shocked at how poor a typical scientist’s statistical training is.

The article highlights a few key common misconceptions to watch out for:

  1. The opposite of a false-positive is not necessarily a true-positive: How many times have you heard the explanation that a p-value of 0.05 means that “it is at least 95 percent certain that the observed difference between groups, or sets of samples, is real and could not have arisen by chance”? Well, that’s an understandable but unfortunate error. A p-value of 0.05 implies that there is a 5% chance that the result observed is what you would get in that particular experiment if the opposite of what you believe is true: in other words, the probability of a false-positive. However, it does not mean there is a 95% chance that the hypothesis is correct. To use an example from the Science News article: suppose there is a drug test which detects steroid use in athletes. This test has a 5% false-positive rate (kind of like having a p-value of 0.05). Suppose a specific athlete comes back with a positive drug test. What is the chance that he or she is actually a steroid user? By now, you know the answer is not 95%, but what is it? The answer depends on how many athletes actually use steroids and on how reliably the test catches the ones who do. To put numbers on it, suppose 5% of a group of 400 tested athletes are actual steroid users and the test correctly flags 95% of users. Then:
    • There are 20 athletes (5%) who use steroids and 380 (95%) who don’t
    • Of the 20 athletes that are actual steroid users, the test will correctly identify 19 (95%) as users
    • Of the 380 athletes that are not actual steroid users, the test will incorrectly identify 19 (5%) as steroid users.

    The final result? Of the 38 athletes identified as steroid users, 50% (19) will be false positives! So, even though the test has a false-positive rate of 5% (the analogue of a p-value of 0.05), the actual probability that a positive result is correct is only 50%, not 95%. Keep that in mind the next time you hear someone use a p-value to assert they have a better than 95% shot at being correct.

  2. Statistical significance does not necessarily mean actually significant: The poster-children for this type of error are the numerous articles floating around claiming that random food/environmental factor XYZ causes a “significantly increased risk” of cancer/heart disease/death/something bad. Just because something observed is highly unlikely to be explained away by chance doesn’t mean that the actual impact itself is significant in any practical sense of the word. An increase in a risk might be real, but if it’s an increase in risk of cancer from 0.01% to 0.011%, I’d hardly call that significant.
  3. Large numbers of experiments means large numbers of false positives. The archetype for this type of error is the large genome-wide study done to find genetic fingerprints which tend to go with a particular disease. If you are studying 20,000 genes with a test that uses a p-value threshold of 0.05, elementary multiplication suggests that you should expect around 1,000 genes (5%) to show up as hits even though they aren’t actually related to the disease at all! This isn’t to say that so-called genome-wide association studies are all bunk (the best studies will use multiple means to verify and assess if a gene is related to a given condition), but it should be the first thing that you think about when evaluating claims on a large data set based on standard statistical tools.
  4. “Statistically significant” isn’t always statistically significant. Imagine two clinical trials comparing Drug A and Drug B with placebo. Although the data show that both Drug A and Drug B provide improvements over placebo, only Drug A demonstrates a statistically significant improvement. Does this then mean that Drug A is statistically significantly better than Drug B, which the trial suggested does not provide a statistically significant improvement over placebo? The answer is that it depends on the level of improvement each drug showed. The point of that little mental exercise, however, is that the status of being “statistically significant” doesn’t confer any special significance or power when making a different comparison. The Science News article gives an illustration of the real-world consequences:

    A number of studies have suggested that children and adolescents taking antidepressants face an increased risk of suicidal thoughts or behavior… One set of such studies, for instance, found that with the antidepressant Paxil, trials recorded more than twice the rate of suicidal incidents for participants given the drug compared with those given the placebo. For another antidepressant, Prozac, trials found fewer suicidal incidents with the drug than with the placebo. So it appeared that Paxil might be more dangerous than Prozac.

    But actually, the rate of suicidal incidents was higher with Prozac than with Paxil. The apparent safety advantage of Prozac was due not to the behavior of kids on the drug, but to kids on placebo — in the Paxil trials, fewer kids on placebo reported incidents than those on placebo in the Prozac trials.

    As the previous example makes clear, our ability to compare the results of two separate statistical comparisons is extremely limited and subject to all sorts of extra considerations. This is one reason why many statisticians are skeptical of meta-analyses (studies which combine the data from multiple studies), and it clearly illustrates why scientists and doctors everywhere need to bone up on their statistical training and their reading of the fine print on the studies they use.
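
For anyone who wants to check the arithmetic in points 1 and 3, here is a minimal Python sketch. The function names are my own, chosen only for illustration. The 95% detection rate is the one implied by the steroid example, and the gene calculation assumes, purely for the sake of argument, that none of the 20,000 genes is truly related to the disease.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Probability that a positive result is a true positive (Bayes' rule)."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)


def expected_false_positives(n_null_tests, alpha):
    """Expected number of hits among tests where there is no real effect."""
    return n_null_tests * alpha


# Point 1: 5% of athletes use steroids, the test flags 95% of users,
# and 5% of non-users test positive anyway.
ppv = positive_predictive_value(
    prevalence=0.05, sensitivity=0.95, false_positive_rate=0.05
)
print(f"P(user | positive test) = {ppv:.2f}")

# Point 3: 20,000 genes, each tested at a 0.05 threshold, assuming
# (hypothetically) that none of them is related to the disease.
print(f"Expected false hits: {expected_false_positives(20_000, 0.05):.0f}")
```

Changing `prevalence` is the quickest way to see how strongly the answer depends on the base rate: the rarer steroid use is, the less a positive test tells you.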

Read the list carefully, and don’t make these mistakes!


  • Alexander Zien

    This comment again highlights the importance of
    * proper understanding and calculus of conditional probabilities, and
    * appropriate accounting for multiple testing.

    Sadly, it also does so by its flaws:

    1. A p-value is *not* “the probability of a false-positive”, as stated above. It is the *conditional* probability of a false-positive *given* the null-hypothesis.

    2. An increase in risk of cancer from 0.01% to 0.011% is an increase of cancer victims by 10%. I'd definitely call this significant.

    3. Large numbers of experiments do *not* necessarily mean large numbers of false positives. It all depends on the cut-off criterion. The claimed statement is true for a constant p-value threshold, but wrong for a constant FDR (false discovery rate) threshold. In fact, this is the definition / goal / benefit of FDRs.

    4. The proper statement would be “statistical significance is not transitive”. Thus statistically significant results cannot easily be combined to obtain new statistically significant results on related questions. However each statistically significant result in itself always remains statistically significant independent of all related tests.
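
The contrast drawn in point 3 of this comment, between a constant p-value threshold and a constant FDR threshold, can be sketched in Python. This is an illustration under stated assumptions, not the method of any particular study: it uses the Benjamini-Hochberg step-up procedure as one common way to control the FDR, and the p-values are simulated under the hypothetical scenario that no gene has a real effect.

```python
import random


def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= (k / m) * fdr.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    return sorted(order[:cutoff_rank])


random.seed(0)
# Hypothetical data: 20,000 tests with no real effect anywhere, so the
# p-values are uniformly distributed between 0 and 1.
p_values = [random.random() for _ in range(20_000)]

fixed_hits = [i for i, p in enumerate(p_values) if p < 0.05]
bh_hits = benjamini_hochberg(p_values, fdr=0.05)

print("Hits at a fixed p < 0.05 threshold:", len(fixed_hits))
print("Hits with Benjamini-Hochberg at FDR 0.05:", len(bh_hits))
```

With a fixed threshold the number of false hits grows in proportion to the number of tests, whereas the FDR procedure tightens its cutoff as the number of tests grows.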

  • http://www.benjamintseng.com/ Ben

    I apologize for the imprecision in my language re: #1, #3, and #4; those distinctions were not lost on me (e.g. on #1: “A p-value of 0.05 implies that there is a 5% chance that the result observed is what you would get in that particular experiment if the opposite of what you believe [null hypothesis] is true”), but the wording I chose was meant to simplify the concepts (perhaps overly so).

    I disagree with you on #2, though that may be a difference of opinion between us re: the difference between relative and absolute risk.
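
The relative-versus-absolute distinction behind the disagreement on #2 comes down to two ways of reading the same pair of numbers. A small Python sketch makes both readings explicit. The function name and the population of one million are hypothetical, picked only to make the scale concrete.

```python
def risk_change(baseline, exposed):
    """Return (absolute_increase, relative_increase) for two risks as fractions."""
    absolute = exposed - baseline
    relative = absolute / baseline
    return absolute, relative


baseline = 0.0001   # 0.01% risk
exposed = 0.00011   # 0.011% risk
absolute, relative = risk_change(baseline, exposed)

population = 1_000_000  # hypothetical population size, for illustration only
print(f"Relative increase: {relative:.0%}")
print(f"Extra cases per {population:,} people: {absolute * population:.0f}")
```

Both figures describe the same change; which one matters more depends on the question being asked.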