Archive for the ‘statistical algorithm’ tag
Are you positive it’s positive?
As genomes have been sequenced over the past few decades scientists have looked for new ways to analyze and interpret the wealth of information. They’ve developed numerous algorithms with goals ranging from organizing evolutionary family trees (inspired by plagiarism detecting software) to aligning genetic sequences. All of this to answer the numerous questions that can now be asked thanks to sequence databases. One of the many things scientists have attempted to study is positive selection in protein-coding genes.
Positive selection of advantageous gene mutation is particularly interesting to scientists as it can provide insight into the function of new genes. However, positive selection is difficult to detect and analyze as neutral and deleterious mutations predominate advantageous mutations in frequency. Initially scientists looked for positive selection by simply comparing the ratio (/omega) of nonsynonymous nucleotide substitutions (dN) to the number of synonymous nucleotide substitutions (dS) between homologous protein-coding gene sequences while utilizing Fisher exact tests to accept or reject a null hypothesis of neutral selection1.
Over the years scientists developed additional statistical analyses to infer positive selection. Two of the most popular methods are the branch-site method (BSM) and site-specific method. The BSM utilizes a likelihood ratio test to detect positive selection within a given phylogenic branch. The site-specific method on the other hand utilizes /omega to look for specific amino acid substitutions that are positively selected. Both of these methods have been utilized in hundreds of papers and seemingly provided a great deal of insight into potential points of positive selection within various genomes. What would you say then when told that both of these methods contain significant flaws which provide an inordinate number of false positives?

Bovine Rhodopsin protein with predicted sites in red and experimentally determined in blue. (Adapted from Yokoyama et al. 2008 PNAS)
That’s exactly what Masatoshi Nei and his group believe to have shown in a recent paper evaluating the reliability of the branch-site and site-specific methods. Nei’s group utilized several controlled computer simulations as well as data collected by Shozo Yokoyama, at Emory University, on dim-light vision opsins in vertebrates2 in their studies determining that both the branch-site and site-specific methods yielded far too many false positives. Nei and his group contend:
This low rate of predictability occurs because most of the current statistical methods are designed to identify codon sites with high /omega values, which may not have anything to do with functional changes. The codon sites showing functional changes generally do not show a high /omega value. To understand adaptive evolution, some form of experimental confirmation is necessary.
From this paper it looks like scientists looking for high /omega values may have been chasing ghosts by assuming that amino acid changes result in functional changes indicating proof of positive selection. The potential impact this will have on hundreds of papers is stunning. In the end the take home message is that statistical analyses, no matter how elegant, have their limits and ought to be utilized in conjunction with experimental data as much as possible.
(Sources: 1 – Reliabilities of identifying positive selection by the branch-site and the site-prediction methods , 2 – Elucidation of phenotypic adaptations: Molecular analyses of dim-light vision proteins in vertebrates )
updated: Had to change all the &omega to /omega because WordPress kept changing it into ? for some reason…bah
Go Computers Go
Many problems in science require computing power which goes beyond mere number crunching and extends into the realm of “artificial intelligence.” For years, what researchers considered to be the ultimate test of artificial intelligence was the ability to defeat a chess pro (something which was formally resolved in the favor of our machine overlords when IBM’s Deep Blue beat out reigning chess champ Gary Kasparov). I’ve always been confused by this for, as complex as chess is, there is a game which exists which is so much more complex than chess it makes chess look like a game of tic-tac-toe.
![]()
That game is Go. Invented in China and reaching the West through Japan, Go is a game with relatively simple rules, but requires a depth of understanding to master (personal note: one of the brains behind Bench Press, Eric, is an avid fan of the game) given the intricacies of gameplay and the structure of the board. This makes it exceedingly difficult for a computer using brute-force methods to defeat even a moderately skilled Go player. For instance:
- In chess, different pieces have specific limitations on where they can move. There are very few limitations on where Go pieces can be placed on a board.
- In chess, pieces can be removed via capture, and those pieces can never return. No such limitation exists in Go. A piece can be removed, but it can just as easily come back in a later turn.
- The Go board is a 19 x 19 grid compared to a chess board which merely has 8 x 8 squares. This translates into ~100-200 possible moves each turn of Go compared with ~30-40 in chess.
What does this mean? A quick Wikipedia search shows that an estimated 10170 possible end-states and 10360-10700 possible games (compared to a measly 1050 end-states and 10120 games for chess). To give a sense of how large these numbers are, there are an estimated 1080 atoms in the observable universe!
The sheer complexity of the game and the ability of human masters to intuitively understand and visualize the board (as Dartmouth artificial intelligence professor Bob Hearn puts it in an interview with Wired Science, “Go is a game of living things, and you talk about it that way, as if the patterns might be alive”) led many to believe that developing a program capable of beating humans at Go would thus be a high sign of artificial intelligence. After all, what computer can possibly brute force search far enough ahead in a game of Go to beat a human?
Well, as it turns out, creating a computer algorithm that can understand a game of Go is exceedingly difficult. But, while ingenuity and intuition are difficult for computers, simulating it with number-crunching on carefully conducted statistical simulations is in a computer’s list of tricks. New programs based on compiling the results of millions of Monte Carlo simulations (a computational technique revolving around crunching the results of many random tests) have succeeded where dozens of previous attempts at introducing human-pattern-recognition heuristics failed. Instead of attempting to analyze every possible move or feably trying to understand the layout of a game, these Monte Carlo Go programs crunch through the results of their random games to determine quick statistical rules of play which help guide still further Monte Carlo simulations – the result of which is a computer which gets more and more knowledgeable about how Go games may result.
The result? On August 7, 2008, for the first time in history, the Monte Carlo Go program MoGo beat 8-dan (the second highest ranking possible) professional Go player Kim Myungwang. In all fairness to Kim, the program had a 9-stone handicap (a lead you give a beginner). But, I think the key takeaway that can be learned here is the power of statistical algorithms to mimick (and potentially surpass) human ingenuity.
And, with new methods making supercomputer power much more accessible like crowdsourcing, distributed computing, and alternative chip architectures, that’s something which scientists, doctors, and engineers may all hopefully benefit from in the near future.