Most bioinformatic analyses start by building sequence alignments by means of scoring matrices. Many scoring matrices rest on the implicit approximation that protein sequence evolution is a succession of Point Accepted Mutations (PAM) (Dayhoff et al., 1978), in which each substitution occurs independently of the history of the sequence, that is, with a probability that depends only on the initial and final amino acids. However, different protein sites evolve at different rates (Echave et al., 2016), and this feature, although included in many phylogenetic reconstruction algorithms, is generally neglected when building or using substitution matrices. Moreover, substitutions at different protein sites are known to be entangled by coevolution (de Juan et al., 2013).
This thesis is devoted to the analysis of the consequences of neglecting these effects and to the development of models of protein sequence evolution capable of incorporating them. We introduce a simple procedure for including among-site rate variability in PAM-like scoring matrices through a mean-field-like framework, and we show that rate variability leads to nontrivial evolutionary dynamics when whole protein sequences are considered. We also propose a procedure for deriving a substitution rate matrix from Single Nucleotide Polymorphisms (SNPs): we first test the statistical compatibility between frequent genetic variants within a species and substitutions accumulated between species, and we then show that the matrix built from SNPs faithfully describes substitution rates for short evolutionary times, provided that rate variability is taken into account. Finally, we present a simple model, inspired by coevolution, that simultaneously predicts the along-chain correlation of substitutions and the time variability of substitution rates. This model is based on the idea that a mutation at a site enhances the probability of fixing mutations at other protein sites in its spatial proximity, but only for a certain amount of time.