Towards Making Sense of "The Tree of Life"

Stephane Guindon
Department of Statistics, The University of Auckland, New Zealand

I started working on PhyML during my second year as a PhD student. The article describing the first part of my PhD thesis had just been published, and I felt it was the right time to take a risk and try something that at first seemed out of my depth: implementing a program that calculates the phylogenetic likelihood function. In 2002, very few programs were based on the likelihood principle. Calculating this function looked like a tough challenge, but the underlying algorithm (Felsenstein's pruning algorithm (Felsenstein 1981)) is beautiful, and I was eager to test my programming skills on such a nice problem. I was based in Montpellier, in the south of France, at that time, but my wife lived in Paris, which meant I was spending a lot of time away from the lab. This freedom gave me the opportunity to immerse myself completely in my task. I remember being in Paris, not far from the Sacré-Cœur, crunching numbers and, for the first time, having my own program return the very same likelihood value as that produced by PAML (Yang 2007) and PHYLIP (Felsenstein 2005), the reference programs for likelihood-based phylogenetics. This felt like a very significant victory to me. I was hooked.
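For readers who have not met it, the pruning algorithm can be illustrated with a minimal sketch. This is not PhyML's code. It assumes a single alignment site, the Jukes-Cantor (JC69) substitution model, and a toy tree encoding chosen purely for illustration: a leaf is its observed base, and an internal node is a list of (child, branch length) pairs. The names `jc69_prob`, `partial` and `site_likelihood` are hypothetical.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """JC69 probability of going from base index i to base index j
    along a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e

def partial(node):
    """Conditional likelihoods at a node: entry x is the probability of
    the tip data below the node, given that the node is in state x."""
    if isinstance(node, str):
        # Leaf: the observed base has probability 1, all others 0.
        return [1.0 if b == node else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, t in node:
        child_partial = partial(child)  # computed once per child (post-order)
        for x in range(4):
            # Sum over the child's possible states y, weighted by P(x -> y | t).
            result[x] *= sum(jc69_prob(x, y, t) * child_partial[y]
                             for y in range(4))
    return result

def site_likelihood(tree):
    """Likelihood of one site: average the root's conditional likelihoods
    over the uniform JC69 stationary frequencies."""
    return sum(0.25 * value for value in partial(tree))

# Toy example: ((A:0.1, C:0.2):0.05, G:0.3)
tree = [([("A", 0.1), ("C", 0.2)], 0.05), ("G", 0.3)]
print(site_likelihood(tree))
```

The recursion visits each node once, so the cost grows linearly with the number of sequences rather than with the number of possible ancestral state combinations. In practice the per-site likelihoods are multiplied across the alignment, usually as a sum of logarithms.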

I thus continued programming, trying to accommodate larger data sets and to apply more sophisticated parameter estimation techniques. It quickly became apparent, though, that conventional algorithms would not allow me to analyze data sets with more than ~10 sequences. Other methods that do not rely on the likelihood framework could easily go up to ~100 sequences but lacked accuracy. A significant speed-up in likelihood-based phylogenetic analyses was therefore badly needed. The core of my program relied on functions that would modify the current solution one step at a time, with each step applying the same operation to a new part of the phylogenetic tree. In order to save computing time, I decided to slightly modify that core and apply these multiple local operations all at the same time. Surprisingly, the results turned out to be very encouraging: the new algorithm was not only as accurate as the other likelihood-based programs, it was also an order of magnitude faster. I remember proudly showing the first results to Olivier Gascuel, my PhD supervisor. He was quite enthusiastic too and suggested further optimization strategies that significantly improved the algorithm. PhyML was born.

Olivier then wrote most of the paper while I ran extensive simulations comparing the speed and accuracy of PhyML to those of its competitors. We first sent the article to PNAS, which rejected it after only a couple of days. We then tried Systematic Biology and asked Bruce Rannala to act as associate editor. He and the reviewers did a wonderful job that substantially improved the original draft. The paper was finally published in October 2003 (Guindon and Gascuel 2003). I started receiving requests from the first users of PhyML asking me to add new options to the program. People seemed to like the software overall, and the article started to get cited. We reached 1,000 citations in 2007 and 5,000 four to five years after that, which made the paper the second most-cited in the field of ecology and evolution in the 1998-2008 period.

In my opinion, the key ingredients in this project were a bit of risk-taking at the beginning, a lot of effort spent on programming (read: fixing bugs!) and a great student-supervisor relationship. We were lucky to come up with this new algorithm at a time when DNA sequencing was gathering considerable pace and people were eager to speed up their phylogenetic analyses while keeping the same level of accuracy. Good timing was therefore paramount too.

References

Felsenstein, J. (2005). PHYLIP (Phylogeny Inference Package) version 3.6. Department of Genome Sciences, University of Washington, Seattle.