Oxford University Press, Molecular Biology and Evolution, 7(30), p. 1675-1686, 2013
Positive selection is widely estimated from protein-coding sequences using the nonsynonymous/synonymous rate ratio ω. Increasingly elaborate codon models are used in a likelihood framework for this estimation. Although there is widespread concern about the robustness of ω estimates, more effort is needed to quantify this robustness, especially for complex models. Here we focused on the branch-site codon model and investigated its robustness on a large set of simulated data. First, we investigated the impact of sequence divergence. We found evidence of underestimation of the synonymous substitution rate (dS) for values as small as 0.5, with a slight increase in false positives for the branch-site test. When dS increases further, underestimation of dS worsens, but false positives decrease. Interestingly, the detection of true positives follows a similar distribution, with a maximum at intermediate values of dS. Thus high dS is more of a concern for loss of power (false negatives) than for false positives of the test. Second, we investigated the impact of GC content. We found no significant difference in false positives between high-GC (up to ∼80%) and low-GC (∼30%) genes. Moreover, neither shifts in GC content on a specific branch nor major shifts in GC content along the gene sequence generate many false positives. Our results confirm that the branch-site test is very conservative.
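For readers unfamiliar with how a branch-site test yields a positive or negative call, the following minimal sketch may help. It assumes the common convention in which a null model (ω fixed at 1 on the foreground branch) is compared with an alternative model (ω free to exceed 1) by a likelihood ratio test, with the statistic referred to a χ² distribution with one degree of freedom. The function names and the log-likelihood values are hypothetical illustrations, not taken from the study; in practice the log-likelihoods would come from a codon-model program.

```python
import math


def lrt_statistic(lnl_null: float, lnl_alt: float) -> float:
    """Return 2 * (lnL_alt - lnL_null), clamped at 0.

    The alternative model nests the null, so a negative difference can
    only reflect optimisation noise and is treated as 0 here.
    """
    return max(0.0, 2.0 * (lnl_alt - lnl_null))


def chi2_sf_1df(x: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom.

    For 1 df, P(X > x) = erfc(sqrt(x / 2)), so no external library is needed.
    """
    return math.erfc(math.sqrt(x / 2.0))


def branch_site_lrt(lnl_null: float, lnl_alt: float, alpha: float = 0.05):
    """Return (statistic, p_value, significant) for one gene and one branch.

    Assumption: a chi-square with 1 df is used as the reference distribution.
    """
    stat = lrt_statistic(lnl_null, lnl_alt)
    p_value = chi2_sf_1df(stat)
    return stat, p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical log-likelihoods for a single simulated alignment.
    stat, p_value, significant = branch_site_lrt(lnl_null=-1000.0, lnl_alt=-998.0)
    print(f"2*dlnL = {stat:.2f}, p = {p_value:.4f}, significant = {significant}")
```

In a simulation study of this kind, a false positive would correspond to a significant result on data simulated without positive selection, and a false negative to a non-significant result on data simulated with it.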