If you use this web service, please cite:
Morel, M., Lemoine, F., Zhukova, A. and Gascuel, O., “Accurate detection of Convergent Mutations in Large Protein Alignments with ConDor”, doi: https://doi.org/10.1101/2021.06.30.450558
Running a ConDor analysis
ConDor analyses convergent evolution using 2 independent components:
- Emergence component: select mutations that emerge more often than expected under neutral evolution
- Correlation component: select mutations that correlate with the convergent phenotype under study
You can therefore choose among 3 run modes:
- emergence to run only the Emergence component
- correlation to run only the Correlation component
- condor to run both components (default)
To run a ConDor analysis you need to provide 4 files:
- An amino acid alignment (FASTA) including outgroup sequences.
- A phylogenetic tree (NEWICK) corresponding to the alignment. The names in the tree and alignment must be IDENTICAL. Outgroup sequences must be in the tree.
- A text file with the outgroup sequence names, each name on a new line (names separated with “\n”).
- A text file with the sequence names which have the convergent phenotype (or a predictor for it), each name on a new line (names separated with “\n”). This file is optional if you only want to run the Emergence component.
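Before submitting a job, it can help to check locally that the four files agree with each other. The following is a minimal sketch, not part of ConDor: the function names are hypothetical, the FASTA name is assumed to be the first whitespace-delimited token of each header, the Newick leaf names are assumed to be unquoted, and requiring the phenotype names to be present in the tree is an assumption rather than a documented rule.

```python
import re


def fasta_names(fasta_text):
    """Collect sequence names (first token of each '>' header line)."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def newick_leaf_names(newick_text):
    """Collect leaf names: unquoted labels that directly follow '(' or ','.

    Internal node labels follow ')' and are therefore not captured.
    """
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick_text))


def list_names(list_text):
    """Read a name list with one name per line, ignoring blank lines."""
    return {line.strip() for line in list_text.splitlines() if line.strip()}


def check_inputs(fasta_text, newick_text, outgroup_text, phenotype_text=None):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    alignment = fasta_names(fasta_text)
    tree = newick_leaf_names(newick_text)
    if alignment != tree:
        problems.append(
            "name mismatch: only in alignment "
            f"{sorted(alignment - tree)}, only in tree {sorted(tree - alignment)}"
        )
    missing_outgroup = list_names(outgroup_text) - tree
    if missing_outgroup:
        problems.append(f"outgroup names not in tree: {sorted(missing_outgroup)}")
    if phenotype_text is not None:
        # Assumption: phenotype names should also be present in the tree.
        missing_pheno = list_names(phenotype_text) - tree
        if missing_pheno:
            problems.append(f"phenotype names not in tree: {sorted(missing_pheno)}")
    return problems
```

An empty list returned by `check_inputs` means that no inconsistency was detected by these simple checks; it does not guarantee that the files are otherwise valid.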
Optionally, you can choose:
- The model of evolution to run the analysis with (it will be used for the ancestral reconstruction, branch-length optimization and simulations). We recommend inferring the model with ModelFinder. If no model is provided, the workflow will automatically run ModelFinder.
- A gamma law or the FreeRate model for rate heterogeneity, and whether to add invariant sites to the model. We recommend FreeRate for more precise simulations.
- The minimum number of sequences in which a mutation must be found to be tested, as well as the minimum number of emergence events of a mutation (EEMs). For example, if you choose 10 sequences and 2 EEMs, a mutation will be tested by ConDor if it is found in 10 or more sequences and in strictly more than 2 distinct clades.
- The significance threshold after multiple testing correction (default = 0.1 for the Emergence component and 10 for BayesTraits).
- A run name and an email address to be alerted when the analysis is finished.
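The sequence and EEM thresholds described in the options above combine an inclusive bound with a strict one, which is easy to misread. As an illustration only (the function name and its parameters are hypothetical, and the default values are simply the numbers from the example, not necessarily the service defaults), the rule can be sketched as:

```python
def is_tested(nbseq, n_eem, min_seq=10, min_eem=2):
    """Return True if a mutation would be tested under the stated rule.

    nbseq: number of sequences carrying the mutation (inclusive bound).
    n_eem: number of emergence events of the mutation (strict bound).
    """
    return nbseq >= min_seq and n_eem > min_eem
```

With 10 sequences and 2 EEMs as thresholds, a mutation found in 10 sequences with 3 EEMs is tested, whereas one with exactly 2 EEMs is not.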
Interpreting output files
Two output files (TSV) are produced by ConDor:
- Tested_results.tsv: all mutations tested by ConDor with multiple metrics and statistics. The columns are described below.
- Significant_results.tsv: only mutations whose p-value and log Bayes Factor passed the acceptance threshold and are thus considered convergent.
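If you want to re-filter Tested_results.tsv yourself with other thresholds, a sketch along the following lines may help. It rests on several assumptions that are not confirmed by this documentation: the file is tab-delimited with a header row, the column names are those listed in this documentation, the adjusted p-value column used for acceptance is configurable (here `adjust_pvalue` by default), and a mutation passes when its p-value is at most the threshold and its log Bayes Factor is at least the threshold. The function name is hypothetical.

```python
import csv


def significant_rows(handle, p_column="adjust_pvalue",
                     p_threshold=0.1, bf_threshold=10.0):
    """Keep rows whose adjusted p-value and log Bayes Factor pass both thresholds.

    handle: an open text file (or any iterable of lines) in tab-delimited format.
    The inequality directions (<= and >=) are assumptions of this sketch.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    kept = []
    for row in reader:
        p_ok = float(row[p_column]) <= p_threshold
        bf_ok = float(row["BF"]) >= bf_threshold
        if p_ok and bf_ok:
            kept.append(row)
    return kept
```

A typical call would be `significant_rows(open("Tested_results.tsv", newline=""), bf_threshold=2.0)`; rows with missing or non-numeric values in these columns would need extra handling.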
Example results can be found here.
For this example, we used the dataset from Besnard et al. (2009), which was also used in the PCOC paper (Rey et al., 2018). It consists of 79 sequences of the PEPC protein in sedges (plant species at the C3/C4 transition) and the corresponding tree. You can retrieve the data here. We chose to infer the model with ModelFinder and to test mutations found in at least 2 sequences and with more than 2 EEMs. We did not add the option for invariant sites, but we modelled rate heterogeneity with FreeRate (+R4). We set the significance thresholds at 0.1 for the p-value and 2 for the log Bayes Factor (logBF).
The columns of the output files are:
- pastml_root: Ancestral amino acid reconstructed at this position by PastML.
- consensus_root: Amino acid that is most frequent at this position.
- position: Position in the alignment.
- mut: Amino acid tested for convergence at this position.
- max_anc: Ancestral amino acid from which EEMs most often originate.
- ref_EEM: Number of EEMs for the tested amino acid.
- nbseq: Number of sequences exhibiting this amino acid at this position.
- evol_rate: Rate of evolution of the position.
- genetic_distance: Minimal number of DNA substitutions in the codon to switch between the two amino acids.
- substitution rate: Value that indicates how exchangeable two amino acids are. If they switch very easily (high substitution rate), we expect many EEMs in the simulations, and the mutation is then difficult to detect even if it is truly convergent. The substitution rate is given by the matrix of the substitution model (e.g. HIVb and MtZoa in the paper).
- findability: Inverse of the substitution rate.
- type_substitution: Category of the mutation: convergent (originating from several ancestral amino acids), parallel (always originating from the same ancestral amino acid) or revertant (going back to the root amino acid). Note that a mutation can be both convergent and revertant, or parallel and revertant.
- details: Ancestral amino acid(s) for the EEMs and how many EEMs are issued from it (them).
- loss: Number of times this newly acquired amino acid is lost (i.e. it becomes the ancestral amino acid in another EEM).
- loss_details: Amino acids towards which losses are observed.
- max_simu: Maximum number of EEMs in the simulations.
- variance: Variance of the number of EEMs in the simulations.
- mean: Mean of the number of EEMs in the simulations.
- pvalue_raw: Raw p-value: the number of simulations with more EEMs than observed (ref-emerge), divided by the total number of simulations.
- adjust_pvalue: Adjusted p-value according to the Holm-Bonferroni correction.
- adjust_pvalue_fdr: Adjusted p-value according to the Benjamini-Hochberg correction.
- detected_EEM: Whether or not the mutation passed the acceptance threshold for the Emergence component.
- posmut: Position joined with the amino acid tested for convergence at this position.
- log-dep: Log likelihood of BayesTraits for the dependence model.
- log-indep: Log likelihood of BayesTraits for the independence model.
- BF: Log Bayes Factor.
- correlation: Positive or negative, according to the phenotype.
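To make the relationship between pvalue_raw, adjust_pvalue and adjust_pvalue_fdr more concrete, here is a minimal sketch of an empirical p-value followed by textbook Holm-Bonferroni and Benjamini-Hochberg adjustments. It is not ConDor's own code: the function names are hypothetical, the p-value follows the wording of the column description literally (simulations with strictly more EEMs than observed), and ConDor's implementation may differ in details such as tie handling.

```python
def empirical_pvalue(observed_eem, simulated_eems):
    """Fraction of simulations with strictly more EEMs than observed."""
    exceed = sum(1 for s in simulated_eems if s > observed_eem)
    return exceed / len(simulated_eems)


def holm(pvalues):
    """Holm-Bonferroni step-down adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * pvalues[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg step-up adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        # The k-th smallest p-value (k = rank + 1) is multiplied by m / k.
        running_min = min(running_min, pvalues[i] * m / (rank + 1))
        adjusted[i] = running_min
    return adjusted
```

Both adjustment functions return values in the same order as the input, so they can be written back next to the raw p-values of the tested mutations.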