Information about On Some Measures of Genetic Distance Based on Rates of Nucleotide...

An old class paper illustrating the use of the Laplace transform in population biology.

DNA nucleotide substitution models

Abstract

We present a general DNA base-nucleotide substitution model and discuss three special cases: the three-substitution-type (3ST), two-substitution-type (2ST), and Jukes-Cantor models.

On Some Measures of Genetic Distance Based on Rates of Nucleotide Substitution

Introduction

The genetic distance between two populations is defined as a concept related to the time since the two populations diverged from a common ancestral population (Weir, 1990). A number of methods have been proposed to estimate it. They are based on the allele frequencies in the two populations, on the rate of amino acid substitution in protein sequence data from the two populations, or on the rates of nucleotide substitution in DNA sequence data from the two populations.

Measures of genetic distance that use allele frequencies are estimates based on some geometric transformation of those frequencies (Cavalli-Sforza and Edwards, 1967; Cavalli-Sforza and Bodmer, 1971; Edwards, 1971; Nei, 1977, 1978; Li and Nei, 1977; Smith, 1977). Some of these measures are purely geometric and do not involve any genetic concept at all, e.g., the measure proposed by Cavalli-Sforza and Bodmer (Weir, 1990). On the other hand, the ones proposed by Edwards (1971) and by Nei (1977) can be shown to be related to the concept of the fixation index (Hartl and Clark, 1989).

A measure of genetic distance based on amino acid substitution in protein sequence data was proposed by Jukes and Cantor in 1969. This method arose partly because amino acid sequence data were abundant at the time. Some geneticists argue that this measure should be preferred since proteins are the subject of mutations.

The development of DNA sequencing methods by Maxam and Gilbert and by Sanger et al. in 1977 brought about more methods for measuring genetic distance. The estimates from these methods are based on the rates of nucleotide substitution in DNA sequence data. These are the methods we consider in this paper. We will formulate the general

model, examine some special cases, give some numerical examples, and finally examine the validity of these models based on their assumptions.

The General Model

We now start by formulating the general model. Let $S_1$ and $S_2$ be two nucleotide sequences with a common ancestral sequence. We consider a pair of homologous sites from $S_1$ and $S_2$ and examine how much they have diverged from each other during their descent from the ancestral sequence $T$ years ago (Figure 1). The evolutionary base substitution model we use is shown in Figure 2. We use RNA codes for the nucleotides, so that the pyrimidines are uracil (U) and cytosine (C), and the purines are adenine (A) and guanine (G).

The types and rates of base substitution are summarized in Table 1. A substitution of a purine by a purine or of a pyrimidine by a pyrimidine is called a transition (TS). If a pyrimidine is substituted by a purine or vice versa, the substitution is called a transversion (TV). We distinguish between two types of transversion, TV1 and TV2, both shown in Table 1. The classification of a TV by type becomes easier if we look at Figure 2: the transversions that go vertically up or down are TV1, and those that go diagonally are TV2.

When comparing the homologous sites of $S_1$ and $S_2$ at any time $t > 0$, there are 16 possible nucleotide base pairings, 12 of which involve mismatched base pairs. If the mismatch looks like a transition pair in Table 1, we call it a TS-type mismatch. We have a TV1-type mismatch if the mismatch looks like a Type 1 transversion listed in Table 1. The TV2-type mismatch is defined in the same manner. We summarize these in Table 2. In Table 2, for $t > 0$,

$$S(t) = \sum_{i=1}^{4} S_i(t) = \text{probability of no difference at a site} \tag{1}$$

$$P(t) = \sum_{i=1}^{4} P_i(t) = \text{probability of a TS-type difference at a site} \tag{2}$$

$$Q(t) = \sum_{i=1}^{4} Q_i(t) = \text{probability of a TV1-type difference at a site} \tag{3}$$

$$R(t) = \sum_{i=1}^{4} R_i(t) = \text{probability of a TV2-type difference at a site} \tag{4}$$

Hence,

$$Q(t) + R(t) = \sum_{i=1}^{4} \left(R_i(t) + Q_i(t)\right) = \text{probability of a TV-type difference at a site.} \tag{5}$$

We sometimes refer to the probabilities above as the match probabilities. We also define the following probabilities, which we sometimes refer to as the base probabilities:

$$U(t) = \text{relative frequency of uracil,} \tag{6}$$

$$C(t) = \text{relative frequency of cytosine,} \tag{7}$$

$$A(t) = \text{relative frequency of adenine,} \tag{8}$$

$$G(t) = \text{relative frequency of guanine in a strand,} \tag{9}$$

so that

$$U(t) + C(t) + A(t) + G(t) = 1. \tag{10}$$

Note that the probabilities in (1)-(4) and (6)-(9) are all time-dependent. We also have

the following relations:

$$S(t) = U^2(t) + C^2(t) + A^2(t) + G^2(t) \tag{11}$$

$$P(t) = 2U(t)C(t) + 2A(t)G(t) \tag{12}$$

$$Q(t) = 2U(t)A(t) + 2C(t)G(t) \tag{13}$$

$$R(t) = 2U(t)G(t) + 2C(t)A(t) \tag{14}$$

Using the rates of substitution and the base probabilities, the mean rate of substitution at a specific site over the time interval $(0, T]$ is given by

$$k = \sum_{i=1}^{4} \frac{\alpha_i + \beta_i + \gamma_i}{T} \int_0^T B_i(t)\,dt \tag{15}$$

where $B_1(t) = U(t)$, $B_2(t) = C(t)$, $B_3(t) = A(t)$ and $B_4(t) = G(t)$, and the integrals divided by $T$ are the average probabilities of finding a given base at a given site during the time interval $(0, T]$. A measure of genetic distance is therefore given by

$$K = 2Tk \tag{16}$$

where $k$ is as defined in (15), $T$ is the time since the two sequences started diverging from the ancestral sequence, and the factor of 2 is due to the fact that we are considering two branches that diverged.

We now formulate the general model and proceed in a manner similar to that of Takahata and Kimura (1981). At any time $t \in [0, T]$, consider a time interval $\Delta t$ short enough that, if the mutation rate is small, higher-order terms in $\Delta t$ and the occurrence of a double substitution at a specific site may be neglected. We have

$$U(t + \Delta t) = U(t) - \alpha_1 (\Delta t) U(t) + \alpha_2 (\Delta t) C(t) + \beta_2 (\Delta t) A(t) + \gamma_2 (\Delta t) G(t) - \gamma_1 (\Delta t) U(t) - \beta_1 (\Delta t) U(t) \tag{17}$$

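which we can rewrite as

$$\frac{U(t + \Delta t) - U(t)}{\Delta t} = -(\alpha_1 + \beta_1 + \gamma_1) U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t). \tag{18}$$

Taking the limit as $\Delta t$ approaches zero, (18) gives

$$\frac{dU(t)}{dt} = -(\alpha_1 + \beta_1 + \gamma_1) U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t). \tag{19}$$

Doing this for the other three probabilities, we get the following system of differential equations:

$$\frac{dU(t)}{dt} = -(\alpha_1 + \beta_1 + \gamma_1) U(t) + \alpha_2 C(t) + \beta_2 A(t) + \gamma_2 G(t) \tag{20}$$

$$\frac{dC(t)}{dt} = \alpha_1 U(t) - (\alpha_2 + \beta_3 + \gamma_3) C(t) + \gamma_4 A(t) + \beta_4 G(t) \tag{21}$$

$$\frac{dA(t)}{dt} = \beta_1 U(t) + \gamma_3 C(t) - (\alpha_3 + \beta_2 + \gamma_4) A(t) + \alpha_4 G(t) \tag{22}$$

$$\frac{dG(t)}{dt} = \gamma_1 U(t) + \beta_3 C(t) + \alpha_3 A(t) - (\alpha_4 + \beta_4 + \gamma_2) G(t). \tag{23}$$

Writing (20)-(23) in matrix form gives

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} = \begin{pmatrix} -(\alpha_1 + \beta_1 + \gamma_1) & \alpha_2 & \beta_2 & \gamma_2 \\ \alpha_1 & -(\alpha_2 + \beta_3 + \gamma_3) & \gamma_4 & \beta_4 \\ \beta_1 & \gamma_3 & -(\alpha_3 + \beta_2 + \gamma_4) & \alpha_4 \\ \gamma_1 & \beta_3 & \alpha_3 & -(\alpha_4 + \beta_4 + \gamma_2) \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}. \tag{24}$$

Using the fact that the sum of the base probabilities is equal to 1, so that $G(t) = 1 - U(t) - C(t) - A(t)$, the matrix equation reduces to

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix} = \begin{pmatrix} -(\alpha_1 + \beta_1 + \gamma_1 + \gamma_2) & \alpha_2 - \gamma_2 & \beta_2 - \gamma_2 \\ \alpha_1 - \beta_4 & -(\alpha_2 + \beta_3 + \gamma_3 + \beta_4) & \gamma_4 - \beta_4 \\ \beta_1 - \alpha_4 & \gamma_3 - \alpha_4 & -(\alpha_3 + \beta_2 + \gamma_4 + \alpha_4) \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix} + \begin{pmatrix} \gamma_2 \\ \beta_4 \\ \alpha_4 \end{pmatrix} \tag{25}$$

which can be written as

$$\frac{d}{dt}\mathbf{B}_1(t) = \mathbf{Q}_1 \mathbf{B}_1(t) + \mathbf{C}_1. \tag{26}$$

Solving this system of differential equations entails solving for the eigenvalues of $\mathbf{Q}_1$. Although it is easy to get the eigenvalues of the 3 × 3 matrix $\mathbf{Q}_1$, the matrix equation in (26) is still difficult to solve, since only the final conditions of the base probabilities can be approximated and the initial conditions are unknown. One way to avoid this problem is to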

express the base probabilities in terms of the match probabilities. The matrix equation involving the match probabilities is easier to solve since the initial conditions for the match probabilities are $P_i(0) = Q_i(0) = R_i(0) = 0$, $i = 1, \ldots, 4$, and $S(0) = 1$. Once the expressions for the match probabilities have been found, we can solve for the mean rate of base substitution $k$ and hence the estimate of genetic distance $K$.

Inherent in these models of evolutionary base nucleotide substitution are the following four assumptions:

(1) The two sequences diverged from a common ancestor, that is, $P_i(0) = Q_i(0) = R_i(0) = 0$, $i = 1, \ldots, 4$, and $S(0) = 1$.
(2) The two sequences are stochastically identical and independent, and within each sequence a substitution at one site in no way affects a substitution at some other site.
(3) The homologous sites chosen from the two sequences are of the same fixed length during their descent from the common ancestor.
(4) Some of the rates are equal, which reduces the number of parameters in the model. Since this assumption differs among the three models that we are going to consider, it is stated as each model is considered rather than here.

The 3ST Model

The first special case that we consider is the three-substitution-type (3ST) model. This model is due to Kimura (1981) and is the most general of the three models we consider in detail in this paper. The two other models we consider later are special cases of this model. The fourth assumption in the 3ST model is that the TS-type substitutions all have rate $\alpha$, and that the TV-type substitutions have rates $\beta$ and $\gamma$ depending on the specific type, as shown in Figure 3. Under the 3ST model, Tables 1 and 2 can be simplified, and their simplified forms are given below as Tables 3 and 4, respectively.
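The pairing rules of Tables 3 and 4 can be sketched in a few lines of Python. This is an illustration only: the function names `classify_site` and `observed_match_proportions` are hypothetical, and the sketch assumes two aligned, gap-free sequences of equal length written in RNA codes (a `T` is treated as `U`).

```python
# Unordered base pairs for each mismatch type, following Tables 3 and 4.
TS_PAIRS = {frozenset("UC"), frozenset("AG")}   # transition-type mismatches
TV1_PAIRS = {frozenset("UA"), frozenset("CG")}  # type 1 transversion mismatches
TV2_PAIRS = {frozenset("UG"), frozenset("CA")}  # type 2 transversion mismatches


def classify_site(base1, base2):
    """Return 'S', 'P', 'Q', or 'R' for one pair of homologous bases."""
    if base1 == base2:
        return "S"
    pair = frozenset((base1, base2))
    if pair in TS_PAIRS:
        return "P"
    if pair in TV1_PAIRS:
        return "Q"
    if pair in TV2_PAIRS:
        return "R"
    raise ValueError(f"unexpected base pair: {base1!r}, {base2!r}")


def observed_match_proportions(seq1, seq2):
    """Observed proportions of S-, P-, Q-, and R-type sites in two aligned sequences."""
    # Normalise to upper-case RNA codes (assumption: DNA input uses T for U).
    seq1 = seq1.upper().replace("T", "U")
    seq2 = seq2.upper().replace("T", "U")
    if len(seq1) != len(seq2) or len(seq1) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    counts = {"S": 0, "P": 0, "Q": 0, "R": 0}
    for base1, base2 in zip(seq1, seq2):
        counts[classify_site(base1, base2)] += 1
    n = len(seq1)
    return {kind: count / n for kind, count in counts.items()}
```

For example, `observed_match_proportions("UCAG", "UUAC")` counts two matching sites, one TS-type mismatch (C/U), and one TV1-type mismatch (G/C). The resulting proportions are the observed counterparts of $S$, $P$, $Q$, and $R$.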

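The system of differential equations in (20)-(23) simplifies to

$$\frac{dU(t)}{dt} = -(\alpha + \beta + \gamma) U(t) + \alpha C(t) + \beta A(t) + \gamma G(t) \tag{27}$$

$$\frac{dC(t)}{dt} = \alpha U(t) - (\alpha + \beta + \gamma) C(t) + \gamma A(t) + \beta G(t) \tag{28}$$

$$\frac{dA(t)}{dt} = \beta U(t) + \gamma C(t) - (\alpha + \beta + \gamma) A(t) + \alpha G(t) \tag{29}$$

$$\frac{dG(t)}{dt} = \gamma U(t) + \beta C(t) + \alpha A(t) - (\alpha + \beta + \gamma) G(t) \tag{30}$$

and its corresponding matrix form is

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} = \begin{pmatrix} -(\alpha + \beta + \gamma) & \alpha & \beta & \gamma \\ \alpha & -(\alpha + \beta + \gamma) & \gamma & \beta \\ \beta & \gamma & -(\alpha + \beta + \gamma) & \alpha \\ \gamma & \beta & \alpha & -(\alpha + \beta + \gamma) \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix}, \tag{31}$$

which has the same form as (24). Considering the fact that the sum of the base probabilities is 1, we can simplify (31) to

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix} = \begin{pmatrix} -(\alpha + \beta + 2\gamma) & \alpha - \gamma & \beta - \gamma \\ \alpha - \beta & -(\alpha + 2\beta + \gamma) & \gamma - \beta \\ \beta - \alpha & \gamma - \alpha & -(2\alpha + \beta + \gamma) \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \end{pmatrix} + \begin{pmatrix} \gamma \\ \beta \\ \alpha \end{pmatrix}. \tag{32}$$

We can also rewrite (32) in the form of (26). The matrix equation in (32) is not difficult to solve since the eigenvalues are easily obtainable. The problem is that we do not know the initial conditions for the base probabilities, since we do not know the base frequencies of the ancestral sequence. As mentioned before, a way to avoid this problem is to consider the match probabilities instead, because their initial conditions are given by the first assumption (A1) of our model. Using the relationships between the base probabilities and the match probabilities given in (11)-(14), it can be shown that

$$\frac{d}{dt}\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} = \begin{pmatrix} -2(2\alpha + \beta + \gamma) & -2(\alpha - \gamma) & -2(\alpha - \beta) \\ -2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) & -2(\beta - \alpha) \\ -2(\gamma - \beta) & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma) \end{pmatrix} \begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} + \begin{pmatrix} 2\alpha \\ 2\beta \\ 2\gamma \end{pmatrix}, \tag{33}$$

which in matrix form is

$$\frac{d}{dt}\mathbf{T}(t) = \mathbf{Q}_2 \mathbf{T}(t) + \mathbf{C}_2. \tag{34}$$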
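As a sanity check on the step from the base probabilities to the match probabilities, the sketch below differentiates $P$, $Q$, and $R$ in two ways: by the product rule applied to (12)-(14) with the base derivatives (27)-(30), and by a linear system of the form (34). The function names are hypothetical, and the coefficients in `match_derivatives_linear` are the ones obtained by carrying out that product-rule calculation by hand, so treat them as an assumption to compare against your own derivation. The two results should agree for any base frequencies that sum to 1.

```python
def base_derivatives(bases, alpha, beta, gamma):
    """Right-hand sides of (27)-(30) for bases = (U, C, A, G)."""
    U, C, A, G = bases
    s = alpha + beta + gamma
    dU = -s * U + alpha * C + beta * A + gamma * G
    dC = alpha * U - s * C + gamma * A + beta * G
    dA = beta * U + gamma * C - s * A + alpha * G
    dG = gamma * U + beta * C + alpha * A - s * G
    return dU, dC, dA, dG


def match_probabilities(bases):
    """P, Q, R from the base probabilities, as in (12)-(14)."""
    U, C, A, G = bases
    return 2*U*C + 2*A*G, 2*U*A + 2*C*G, 2*U*G + 2*C*A


def match_derivatives_product_rule(bases, alpha, beta, gamma):
    """dP/dt, dQ/dt, dR/dt via the product rule on (12)-(14)."""
    U, C, A, G = bases
    dU, dC, dA, dG = base_derivatives(bases, alpha, beta, gamma)
    dP = 2 * (dU*C + U*dC) + 2 * (dA*G + A*dG)
    dQ = 2 * (dU*A + U*dA) + 2 * (dC*G + C*dG)
    dR = 2 * (dU*G + U*dG) + 2 * (dC*A + C*dA)
    return dP, dQ, dR


def match_derivatives_linear(bases, alpha, beta, gamma):
    """Same derivatives from a linear system in P, Q, R (needs U+C+A+G = 1)."""
    P, Q, R = match_probabilities(bases)
    dP = (-2*(2*alpha + beta + gamma)*P - 2*(alpha - gamma)*Q
          - 2*(alpha - beta)*R + 2*alpha)
    dQ = (-2*(beta - gamma)*P - 2*(alpha + 2*beta + gamma)*Q
          - 2*(beta - alpha)*R + 2*beta)
    dR = (-2*(gamma - beta)*P - 2*(gamma - alpha)*Q
          - 2*(alpha + beta + 2*gamma)*R + 2*gamma)
    return dP, dQ, dR


if __name__ == "__main__":
    bases = (0.4, 0.3, 0.2, 0.1)           # arbitrary frequencies summing to 1
    rates = (0.03, 0.02, 0.01)             # arbitrary alpha, beta, gamma
    print(match_derivatives_product_rule(bases, *rates))
    print(match_derivatives_linear(bases, *rates))
```

The agreement relies on $U^2 + C^2 + A^2 + G^2 = 1 - P - Q - R$, which holds only when the four base probabilities sum to 1.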

We now derive the expression for $P(t)$ in (33). The expressions for $Q(t)$ and $R(t)$ can be obtained in very much the same manner. Recall that in (11)-(14) we have

$$P(t) = \text{probability of a TS-type difference at a homologous site} \tag{35}$$

$$\phantom{P(t)} = 2C(t)U(t) + 2A(t)G(t). \tag{36}$$

Using the product rule,

$$\frac{dP(t)}{dt} = 2\left[\frac{dU(t)}{dt} C(t) + \frac{dC(t)}{dt} U(t)\right] + 2\left[\frac{dG(t)}{dt} A(t) + \frac{dA(t)}{dt} G(t)\right]. \tag{37}$$

If we substitute the expressions for the derivatives of the base probabilities given in (27)-(30), we have

$$\frac{dP(t)}{dt} = 2\left\{-2\left(C(t)U(t) + A(t)G(t)\right)(\alpha + \beta + \gamma) + 2\beta\left(A(t)C(t) + G(t)U(t)\right) + 2\gamma\left(A(t)U(t) + G(t)C(t)\right) + \alpha\left(A^2(t) + C^2(t) + U^2(t) + G^2(t)\right)\right\}. \tag{38}$$

Using the definitions (12)-(14) and the fact that $A^2(t) + C^2(t) + U^2(t) + G^2(t) = 1 - P(t) - Q(t) - R(t)$, we can simplify (38) to obtain

$$\frac{dP(t)}{dt} = 2\left\{-(2\alpha + \beta + \gamma)P(t) + (\beta - \alpha)R(t) + (\gamma - \alpha)Q(t) + \alpha\right\} \tag{39}$$

which is what we want.

We now solve the matrix equation in (34). Define the following Laplace transform:

$$\mathcal{L}[\mathbf{T}(t)] = \mathcal{L}\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} = \begin{pmatrix} p(s) \\ q(s) \\ r(s) \end{pmatrix} = \mathcal{T}(s). \tag{40}$$

Applying the Laplace transform to (34), we get

$$s\mathcal{T}(s) - \mathbf{T}(0) = \mathbf{Q}_2 \mathcal{T}(s) + \frac{1}{s}\mathbf{C}_2 \tag{41}$$

which we can rewrite as

$$-\frac{1}{s}\mathbf{C}_2 = (\mathbf{Q}_2 - s\mathbf{I}_3)\mathcal{T}(s), \tag{42}$$

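where we have used the fact that $\mathbf{T}(0) = \mathbf{0}$, and $\mathbf{I}_3$ is the 3 × 3 identity matrix. The problem of solving the system of differential equations in (34) is now reduced to solving a system of algebraic equations in the three unknowns $p(s)$, $q(s)$, and $r(s)$. We solve for these three unknowns and then apply the inverse Laplace transform to get the solutions for $P(t)$, $Q(t)$, and $R(t)$. Using Cramer's rule, we get

$$p(s) = \frac{1}{\Delta}\begin{vmatrix} -2\alpha/s & -2(\alpha - \gamma) & -2(\alpha - \beta) \\ -2\beta/s & -2(\alpha + 2\beta + \gamma) - s & -2(\beta - \alpha) \\ -2\gamma/s & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma) - s \end{vmatrix} \tag{43}$$

$$q(s) = \frac{1}{\Delta}\begin{vmatrix} -2(2\alpha + \beta + \gamma) - s & -2\alpha/s & -2(\alpha - \beta) \\ -2(\beta - \gamma) & -2\beta/s & -2(\beta - \alpha) \\ -2(\gamma - \beta) & -2\gamma/s & -2(\alpha + \beta + 2\gamma) - s \end{vmatrix} \tag{44}$$

$$r(s) = \frac{1}{\Delta}\begin{vmatrix} -2(2\alpha + \beta + \gamma) - s & -2(\alpha - \gamma) & -2\alpha/s \\ -2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) - s & -2\beta/s \\ -2(\gamma - \beta) & -2(\gamma - \alpha) & -2\gamma/s \end{vmatrix} \tag{45}$$

where

$$\Delta = \begin{vmatrix} -2(2\alpha + \beta + \gamma) - s & -2(\alpha - \gamma) & -2(\alpha - \beta) \\ -2(\beta - \gamma) & -2(\alpha + 2\beta + \gamma) - s & -2(\beta - \alpha) \\ -2(\gamma - \beta) & -2(\gamma - \alpha) & -2(\alpha + \beta + 2\gamma) - s \end{vmatrix}. \tag{46}$$

Upon simplifying and expressing the results in partial fractions, we get

$$p(s) = \frac{1}{4s} - \frac{1/4}{s + 4(\alpha + \beta)} - \frac{1/4}{s + 4(\alpha + \gamma)} + \frac{1/4}{s + 4(\beta + \gamma)} \tag{47}$$

$$q(s) = \frac{1}{4s} - \frac{1/4}{s + 4(\alpha + \beta)} + \frac{1/4}{s + 4(\alpha + \gamma)} - \frac{1/4}{s + 4(\beta + \gamma)} \tag{48}$$

$$r(s) = \frac{1}{4s} + \frac{1/4}{s + 4(\alpha + \beta)} - \frac{1/4}{s + 4(\alpha + \gamma)} - \frac{1/4}{s + 4(\beta + \gamma)}. \tag{49}$$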
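Because each term $1/(s + c)$ inverts to $e^{-ct}$ and $1/s$ inverts to 1, the partial fractions in (47)-(49) can be inverted term by term. The sketch below writes out the resulting time-domain functions and checks numerically that they start at zero and satisfy a linear system of the form (34). The function names, the step size `h`, and the sample rates are illustrative assumptions, and the coefficient matrix in `rhs_match_system` is the one obtained from the product-rule derivation, not a quotation from the original paper.

```python
import math


def match_probabilities_3st(t, alpha, beta, gamma):
    """Term-by-term inverse Laplace transform of (47)-(49)."""
    e1 = math.exp(-4 * (alpha + beta) * t)
    e2 = math.exp(-4 * (alpha + gamma) * t)
    e3 = math.exp(-4 * (beta + gamma) * t)
    P = 0.25 * (1 - e1 - e2 + e3)
    Q = 0.25 * (1 - e1 + e2 - e3)
    R = 0.25 * (1 + e1 - e2 - e3)
    return P, Q, R


def rhs_match_system(P, Q, R, alpha, beta, gamma):
    """Right-hand side Q2*T + C2 of the match-probability system."""
    dP = (-2*(2*alpha + beta + gamma)*P - 2*(alpha - gamma)*Q
          - 2*(alpha - beta)*R + 2*alpha)
    dQ = (-2*(beta - gamma)*P - 2*(alpha + 2*beta + gamma)*Q
          - 2*(beta - alpha)*R + 2*beta)
    dR = (-2*(gamma - beta)*P - 2*(gamma - alpha)*Q
          - 2*(alpha + beta + 2*gamma)*R + 2*gamma)
    return dP, dQ, dR


def max_ode_residual(t, alpha, beta, gamma, h=1e-6):
    """Largest gap between a central-difference derivative and the ODE right-hand side."""
    ahead = match_probabilities_3st(t + h, alpha, beta, gamma)
    behind = match_probabilities_3st(t - h, alpha, beta, gamma)
    numeric = [(a - b) / (2 * h) for a, b in zip(ahead, behind)]
    exact = rhs_match_system(*match_probabilities_3st(t, alpha, beta, gamma),
                             alpha, beta, gamma)
    return max(abs(n - e) for n, e in zip(numeric, exact))


if __name__ == "__main__":
    print(match_probabilities_3st(0.0, 0.03, 0.02, 0.01))  # initial condition
    print(max_ode_residual(5.0, 0.03, 0.02, 0.01))          # should be tiny
```

As $t$ grows, all three functions approach 1/4, which is the saturation level expected when the two sites have become effectively unrelated.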

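Applying the inverse Laplace transform to (47)-(49), we get the following solutions to the system in (34):

$$P(t) = \mathcal{L}^{-1}\{p(s)\} = \frac{1}{4}\left(1 - e^{\lambda_1 t} - e^{\lambda_2 t} + e^{\lambda_3 t}\right) \tag{50}$$

$$Q(t) = \mathcal{L}^{-1}\{q(s)\} = \frac{1}{4}\left(1 - e^{\lambda_1 t} + e^{\lambda_2 t} - e^{\lambda_3 t}\right) \tag{51}$$

$$R(t) = \mathcal{L}^{-1}\{r(s)\} = \frac{1}{4}\left(1 + e^{\lambda_1 t} - e^{\lambda_2 t} - e^{\lambda_3 t}\right), \tag{52}$$

where $\lambda_1 = -4(\alpha + \beta)$, $\lambda_2 = -4(\alpha + \gamma)$, and $\lambda_3 = -4(\beta + \gamma)$. Under the 3ST model, the equation for $k$ in (15) can be expressed as

$$k = \sum_{i=1}^{4} \frac{\alpha + \beta + \gamma}{T} \int_0^T B_i(t)\,dt = \alpha + \beta + \gamma, \tag{53}$$

where we have used the fact that the sum of the base probabilities is equal to 1. Note that the assumption that some of the rates are equal played a crucial role in allowing us to factor $\alpha + \beta + \gamma$ out of the summation and get a simple expression for $k$. For $K$, we obtain

$$K = 2T(\alpha + \beta + \gamma). \tag{54}$$

Since $1 - 2P(t) - 2Q(t) = e^{\lambda_1 t}$, $1 - 2P(t) - 2R(t) = e^{\lambda_2 t}$, and $1 - 2Q(t) - 2R(t) = e^{\lambda_3 t}$, we can solve (50)-(52) for $\lambda_1$, $\lambda_2$, and $\lambda_3$ to get

$$4(\alpha + \beta)t = -\ln\left(1 - 2P(t) - 2Q(t)\right) \tag{55}$$

$$4(\alpha + \gamma)t = -\ln\left(1 - 2P(t) - 2R(t)\right) \tag{56}$$

$$4(\beta + \gamma)t = -\ln\left(1 - 2Q(t) - 2R(t)\right), \tag{57}$$

and hence, for any time $t \in [0, T]$,

$$8(\alpha + \beta + \gamma)t = -\ln\left\{[1 - 2P(t) - 2Q(t)][1 - 2P(t) - 2R(t)][1 - 2Q(t) - 2R(t)]\right\} \tag{58}$$

$$K = 2kt \tag{59}$$

$$\phantom{K} = -\frac{1}{4}\ln\left\{[1 - 2P(t) - 2Q(t)][1 - 2P(t) - 2R(t)][1 - 2Q(t) - 2R(t)]\right\}. \tag{60}$$

The variance for this estimate of $K$ is also given in the paper of Kimura (1981). We have

$$\sigma_K^2 = \frac{1}{n}\left[a^2 P(t) + b^2 Q(t) + c^2 R(t) - \left(aP(t) + bQ(t) + cR(t)\right)^2\right] \tag{61}$$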

where $n$ is the number of nucleotide sites compared and

$$a = \frac{1}{2}\left[\frac{1}{1 - 2P(t) - 2Q(t)} + \frac{1}{1 - 2P(t) - 2R(t)}\right] \tag{62}$$

$$b = \frac{1}{2}\left[\frac{1}{1 - 2P(t) - 2Q(t)} + \frac{1}{1 - 2Q(t) - 2R(t)}\right] \tag{63}$$

$$c = \frac{1}{2}\left[\frac{1}{1 - 2P(t) - 2R(t)} + \frac{1}{1 - 2Q(t) - 2R(t)}\right]. \tag{64}$$

The 2ST Model

We now proceed to a special case of this model, which again is due to Kimura (1980). We call it the two-substitution-type (2ST) model. It was discussed by Kimura in a paper published a year before the one on the 3ST model. (In the original paper, this model is actually nameless; we call it the 2ST model for convenience.) Because the 2ST model is a special case of the 3ST model, we just give the results and do not go into the details.

The fourth assumption here is that the transition rate is $\alpha$ and the transversion rate is $\beta$. Under this assumption the diagram in Figure 3 simplifies further to the diagram in Figure 4. The tables for the base substitutions and the match probabilities are given as Tables 5 and 6 below. The probability of a TS-type mismatch is given by $P(t)$ and the probability of a TV-type mismatch is given by $QR(t) = Q(t) + R(t)$. That is, we have lumped together the TV1-type and TV2-type mismatches. The matrix equation in (24) under the 2ST model is

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} = \begin{pmatrix} -(\alpha + 2\beta) & \alpha & \beta & \beta \\ \alpha & -(\alpha + 2\beta) & \beta & \beta \\ \beta & \beta & -(\alpha + 2\beta) & \alpha \\ \beta & \beta & \alpha & -(\alpha + 2\beta) \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} \tag{65}$$

and the corresponding matrix equation involving the match probabilities is

$$\frac{d}{dt}\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} = \begin{pmatrix} -2(2\alpha + 2\beta) & -2(\alpha - \beta) & -2(\alpha - \beta) \\ 0 & -2(\alpha + 3\beta) & -2(\beta - \alpha) \\ 0 & -2(\beta - \alpha) & -2(\alpha + 3\beta) \end{pmatrix} \begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} + \begin{pmatrix} 2\alpha \\ 2\beta \\ 2\beta \end{pmatrix}. \tag{66}$$

If we now lump $Q(t)$ and $R(t)$ together as $QR(t)$, adding the last two rows of (66) gives the matrix equation in (67), which involves only a 2 × 2 matrix instead of the previous 3 × 3 matrix:

$$\frac{d}{dt}\begin{pmatrix} P(t) \\ QR(t) \end{pmatrix} = \begin{pmatrix} -2(2\alpha + 2\beta) & -2(\alpha - \beta) \\ 0 & -8\beta \end{pmatrix} \begin{pmatrix} P(t) \\ QR(t) \end{pmatrix} + \begin{pmatrix} 2\alpha \\ 4\beta \end{pmatrix}. \tag{67}$$

To solve (67), we use the initial conditions $P(0) = QR(0) = 0$. As solutions we have

$$P(t) = \frac{1}{4} - \frac{1}{2}e^{\lambda_1 t} + \frac{1}{4}e^{\lambda_2 t} \tag{68}$$

$$QR(t) = \frac{1}{2} - \frac{1}{2}e^{\lambda_2 t} \tag{69}$$

where $\lambda_1 = -4(\alpha + \beta)$ and $\lambda_2 = -8\beta$. Under the 2ST model, $k = \alpha + 2\beta$. We can solve (68) and (69) for $\alpha t$ and $\beta t$ and therefore obtain our estimate $K$. We have

$$K = 2kt = 2(\alpha + 2\beta)t \tag{70}$$

$$\phantom{K} = -\frac{1}{4}\ln\left\{[1 - 2P(t) - QR(t)]^2\,[1 - 2QR(t)]\right\}. \tag{71}$$

The variance of this estimate is given by

$$\sigma_K^2 = \frac{1}{n}\left[a^2 P(t) + b^2 QR(t) - \left(aP(t) + bQR(t)\right)^2\right] \tag{72}$$

where

$$a = \frac{1}{1 - 2P(t) - QR(t)} \tag{73}$$

$$b = \frac{1}{2}\left[\frac{1}{1 - 2P(t) - QR(t)} + \frac{1}{1 - 2QR(t)}\right]. \tag{74}$$

The Jukes-Cantor Model

The simplest possible model is due to Jukes and Cantor (1969). The model was primarily formulated to describe protein evolution by looking at the rate of amino acid substitution. It turns out that this model can also be used to describe base substitution. The fourth assumption here is that all the rates of substitution are equal, i.e., $\alpha = \alpha_i = \beta_i = \gamma_i$, $i = 1, \ldots, 4$. Figure 2 then becomes Figure 5 below. Under the Jukes-Cantor model, Tables 1 and 2 can be simplified to Tables 7 and 8, respectively.

The matrix equation in (24) under the Jukes-Cantor model is

$$\frac{d}{dt}\begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix} \begin{pmatrix} U(t) \\ C(t) \\ A(t) \\ G(t) \end{pmatrix} \tag{75}$$

and the matrix equation involving the match probabilities is

$$\frac{d}{dt}\begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} = \begin{pmatrix} -8\alpha & 0 & 0 \\ 0 & -8\alpha & 0 \\ 0 & 0 & -8\alpha \end{pmatrix} \begin{pmatrix} P(t) \\ Q(t) \\ R(t) \end{pmatrix} + \begin{pmatrix} 2\alpha \\ 2\alpha \\ 2\alpha \end{pmatrix}. \tag{76}$$

If we define $PQR(t) = P(t) + Q(t) + R(t)$, we have the differential equation

$$\frac{d}{dt}PQR(t) = -8\alpha\,PQR(t) + 6\alpha \tag{77}$$

which has as a solution

$$PQR(t) = \frac{3}{4}\left(1 - e^{-8\alpha t}\right). \tag{78}$$

Under the Jukes-Cantor model, $k = 3\alpha$ and the estimate $K$ is

$$K = 2kt = 6\alpha t = -\frac{3}{4}\ln\left(1 - \frac{4}{3}PQR(t)\right) \tag{79}$$

which can be obtained by solving for $\alpha t$ in (78). The variance for $K$ under the Jukes-Cantor model was derived by Kimura and Ohta (1972) and is given by

$$\sigma_{JC}^2 = \frac{\left(1 - PQR(t)\right)PQR(t)}{n\left(1 - 4PQR(t)/3\right)^2}. \tag{80}$$

We illustrate the three models by comparing the human and mouse protein kinase inhibitor sequences. These two nucleotide sequences were recently sequenced by Olsen and Uhler (1991). The sequences are more than a thousand base pairs long, but only 231 of these are part of the coding region. Our analysis is limited to these 231 base pairs. The sequences are shown in Figure 6. Of the 231 bp, only 15 show mismatches. These are

summarized in Table 9. Usually, the estimate $K$ is computed separately by codon position, because the models assume that substitutions are independent of each other while there is evidence that adjacent substitutions are actually not independent. This is not done here, since we have quite a small number of base pairs and the mismatches are quite far apart (except for the ones occurring at positions 200 and 201). The estimate under each model is shown in Table 10. The estimates differ little from one model to another, and the variances are also similar.

Estimates of genetic distance using some other nucleotide sequences are also available. Tavaré (1986) obtained estimates using human and mouse α-fetoprotein and serum albumin nucleotide sequences. The results he got for the human-mouse α-fetoprotein nucleotide sequences are reproduced below as Table 11. The data consist of 1824 base pairs, and hence it was possible for him to compute the estimates by codon position. Note that the estimates tend to be largest for the third codon position and smallest for the second codon position. Tavaré showed in his paper that the estimates are not homogeneous if we consider the codon positions as strata. Unfortunately, we cannot do the same thing in our analysis here since we have just 231 bp and 15 mismatches.

All three models of evolutionary base substitution that we have discussed here are far from perfect, and their weaknesses lie in the second and third assumptions made to formulate the models. The second assumption states that the nucleotide sequences are stochastically identical and independent of each other. It is quite possibly true that nucleotide sequences evolve in a manner stochastically independent of each other, but there is evidence that they are in fact not stochastically identical. For example, Wu and Li (1985) noticed that the substitution rates in rodents are much higher than those in humans. Even within a sequence, there is evidence that the rates are much higher in some spots ("hot spots") than in others (Miyata and Yasunaga, 1981; Brown and Clegg, 1983) and that the rates differ between the sense and antisense strands (Wu and Maeda, 1987). There is also evidence that a substitution at one site does affect the rate of substitution at an adjacent site in phage T4 (Koch, 1971). It would be interesting to know if the same

holds for higher organisms. This last fact is also one of the reasons why substitution rates are computed by codon position if the data allow.

The third assumption is that the diverging nucleotide sequences are both of a fixed length, and hence it does not take into account mutations resulting from deletions and insertions. This assumption also does not take into account the possibility of concerted evolution, which brings about the presence of multigene families, or the duplication and divergence in multigene families.

There have been efforts to devise models that address these shortcomings while remaining mathematically tractable. Needleman and Wunsch (1970), for example, proposed a model which assigns weights to substitutions, insertions, and deletions. Unfortunately, the weights assigned were arbitrary and had no genetic basis. The main problem that these models of evolutionary base nucleotide substitution face is that when all of the mechanisms of evolution are included in the model, the model becomes mathematically intractable with present computer technology. Since computer technology is still advancing, it is hoped that a model incorporating most, if not all, of the mechanisms discussed can be formulated in the near future.

References

Brown, A., & Clegg, M. (1983). Analysis of variation in related DNA sequences. In B. Weir (Ed.), Statistical data analysis (pp. 107–132). New York: Marcel-Dekker.
Cavalli-Sforza, L., & Bodmer, W. (1971). The genetics of human populations. San Francisco: W. H. Freeman.
Cavalli-Sforza, L., & Edwards, A. (1967). Phylogenetic analysis: models and estimation procedures. American Journal of Human Genetics, 19, 233–257.
Edwards, A. (1971). The distance between populations on the basis of gene frequencies. Biometrics, 27, 873–881.
Jukes, T., & Cantor, C. (1969). Evolution of protein molecules. In H. N. Munro (Ed.), Mammalian protein metabolism (pp. 21–132). New York: Academic Press.
Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111–120.
Kimura, M. (1981). Estimation of evolutionary distances between homologous nucleotide sequences. Proceedings of the National Academy of Sciences USA, 78, 454–458.
Kimura, M., & Ohta, T. (1972). On the stochastic model for estimation of mutational distance between homologous proteins. Journal of Molecular Evolution, 2, 87–90.
Koch, R. (1971). The influence of neighbouring base pairs upon base-pair substitution mutation rates. Proceedings of the National Academy of Sciences USA, 68, 773–776.
Maxam, A., & Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the National Academy of Sciences USA, 74, 560–564.
Miura, R. (Ed.). (1986). Lectures on mathematics in the life sciences. Rhode Island: American Mathematical Society.
Miyata, T., & Yasunaga, T. (1981). Rapidly evolving mouse α-globin-related pseudogenes. Proceedings of the National Academy of Sciences USA, 78, 450–453.

Munro, H. N. (Ed.). (1969). Mammalian protein metabolism. New York: Academic Press.
Needleman, S., & Wunsch, C. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48, 443–453.
Nei, M. (1977). F-statistics and analysis of gene diversity in subdivided populations. Annals of Human Genetics, 41, 225–233.
Olsen, S., & Uhler, M. (1991a). (nucleotide sequence of the human protein kinase inhibitor). Molecular Endocrinology. (manuscript submitted)
Olsen, S., & Uhler, M. (1991b). (nucleotide sequence of the mouse protein kinase inhibitor). Journal of Biological Chemistry. (in press)
Sanger, F., Nicklen, S., & Coulson, A. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences USA, 74, 5463–5467.
Takahata, N., & Kimura, M. (1981). A model of evolutionary base substitutions and its application with special reference to rapid change in pseudogenes. Genetics, 98, 641–657.
Tavaré, S. (1986). Some probabilistic and statistical problems in the analysis of DNA sequences. In R. Miura (Ed.), Lectures on mathematics in the life sciences (pp. 57–86). Rhode Island: American Mathematical Society.
Weir, B. (Ed.). (1983). Statistical data analysis. New York: Marcel-Dekker.
Weir, B. (1990). Genetic data analysis: methods for discrete population data. Sunderland, Massachusetts: Sinauer Associates.
Wu, C., & Li, W. (1985). Evidence for higher rates of nucleotide substitution in rodents than in man. Proceedings of the National Academy of Sciences USA, 82, 1741–1745.
Wu, C., & Maeda, N. (1987). Inequality in mutation rates of the two strands of DNA. Nature, 327, 169–170.

Table 1
Types and rates of nucleotide substitution.

Types          Transition (TS)   Transversion (TV1)   Transversion (TV2)
Initial base   U   C   A   G     U   A   C   G        U   G   C   A
New base       C   U   G   A     A   U   G   C        G   U   A   C
Rates          α1  α2  α3  α4    β1  β2  β3  β4       γ1  γ2  γ3  γ4

Table 2
Possible nucleotide base pairings at a specific homologous site for t > 0.

Types           Same          TS-type       TV1-type      TV2-type
Sequence 1      U  C  A  G    U  C  A  G    U  A  C  G    U  G  C  A
Sequence 2      U  C  A  G    C  U  G  A    A  U  G  C    G  U  A  C
Probabilities   S1 S2 S3 S4   P1 P2 P3 P4   Q1 Q2 Q3 Q4   R1 R2 R3 R4

Table 3
Types and rates of nucleotide substitution under the 3ST model.

Types          Transition (TS)   Transversion (TV1)   Transversion (TV2)
Initial base   U   C   A   G     U   A   C   G        U   G   C   A
New base       C   U   G   A     A   U   G   C        G   U   A   C
Rates          α   α   α   α     β   β   β   β        γ   γ   γ   γ

Table 4
Possible nucleotide base pairings at a specific homologous site for t > 0 under the 3ST model.

Types           Same          TS-type       TV1-type      TV2-type
Sequence 1      U  C  A  G    U  C  A  G    U  A  C  G    U  G  C  A
Sequence 2      U  C  A  G    C  U  G  A    A  U  G  C    G  U  A  C
Probabilities   S             P             Q             R

Table 5
Types and rates of nucleotide substitution under the 2ST model.

Types          Transition (TS)   Transversion (TV1)   Transversion (TV2)
Initial base   U   C   A   G     U   A   C   G        U   G   C   A
New base       C   U   G   A     A   U   G   C        G   U   A   C
Rates          α   α   α   α     β   β   β   β        β   β   β   β

Table 6
Possible nucleotide base pairings at a specific homologous site for t > 0 under the 2ST model.

Types           Same          TS-type       TV1-type      TV2-type
Sequence 1      U  C  A  G    U  C  A  G    U  A  C  G    U  G  C  A
Sequence 2      U  C  A  G    C  U  G  A    A  U  G  C    G  U  A  C
Probabilities   S             P             QR (TV1-type and TV2-type combined)

Table 7
Types and rates of nucleotide substitution under the Jukes-Cantor model.

Types          Transition (TS)   Transversion (TV1)   Transversion (TV2)
Initial base   U   C   A   G     U   A   C   G        U   G   C   A
New base       C   U   G   A     A   U   G   C        G   U   A   C
Rates          α   α   α   α     α   α   α   α        α   α   α   α

Table 8
Possible nucleotide base pairings at a specific homologous site for t > 0 under the Jukes-Cantor model.

Types           Same          TS-type       TV1-type      TV2-type
Sequence 1      U  C  A  G    U  C  A  G    U  A  C  G    U  G  C  A
Sequence 2      U  C  A  G    C  U  G  A    A  U  G  C    G  U  A  C
Probabilities   S             PQR (all three mismatch types combined)

Table 9
Nucleotide mismatches observed after time T since divergence between human and mouse protein kinase inhibitor (pki).

Types              Transition (TS)   Transversion (TV1)   Transversion (TV2)
Human pki          U   C   A   G     U   A   C   G        U   G   C   A
Mouse pki          C   U   G   A     A   U   G   C        G   U   A   C
Numbers observed   5   0   3   2     0   1   1   6        0   1   1   2

Table 10
Estimates of the genetic distance K under the different models being considered.

Model          K           Standard error
Jukes-Cantor   0.0682288   0.0178312
2ST            0.0686475   0.0180611
3ST            0.0686535   0.0180644
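The three estimators behind a table of this kind, (60), (71), and (79), can be sketched as follows. This is a sketch under stated assumptions rather than the code used for the table: the function names are hypothetical, the variance coefficients are taken to be the partial derivatives of K with respect to the observed proportions (the delta-method form, which gives a squared denominator in the Jukes-Cantor case), and `n` is the number of sites compared.

```python
import math


def _safe_log(x):
    """Natural log that fails loudly when the proportions are too large."""
    if x <= 0:
        raise ValueError("proportions too large for this model (log of non-positive value)")
    return math.log(x)


def k_jukes_cantor(pqr, n):
    """K and its standard error from the total mismatch proportion PQR = P + Q + R."""
    w = 1 - 4 * pqr / 3
    K = -0.75 * _safe_log(w)
    variance = pqr * (1 - pqr) / (n * w * w)   # assumed delta-method form
    return K, math.sqrt(variance)


def k_2st(P, QR, n):
    """K and its standard error from the TS proportion P and TV proportion QR."""
    w1 = 1 - 2 * P - QR
    w2 = 1 - 2 * QR
    K = -0.25 * (2 * _safe_log(w1) + _safe_log(w2))
    a = 1 / w1                                  # dK/dP
    b = 0.5 * (1 / w1 + 1 / w2)                 # dK/dQR
    variance = (a*a*P + b*b*QR - (a*P + b*QR) ** 2) / n
    return K, math.sqrt(variance)


def k_3st(P, Q, R, n):
    """K and its standard error from the TS, TV1, and TV2 proportions."""
    w1 = 1 - 2 * P - 2 * Q
    w2 = 1 - 2 * P - 2 * R
    w3 = 1 - 2 * Q - 2 * R
    K = -0.25 * (_safe_log(w1) + _safe_log(w2) + _safe_log(w3))
    a = 0.5 * (1 / w1 + 1 / w2)                 # dK/dP
    b = 0.5 * (1 / w1 + 1 / w3)                 # dK/dQ
    c = 0.5 * (1 / w2 + 1 / w3)                 # dK/dR
    variance = (a*a*P + b*b*Q + c*c*R - (a*P + b*Q + c*R) ** 2) / n
    return K, math.sqrt(variance)


if __name__ == "__main__":
    # Hypothetical proportions for illustration only (not the data of Table 9).
    n, P, Q, R = 500, 0.04, 0.01, 0.02
    print("Jukes-Cantor:", k_jukes_cantor(P + Q + R, n))
    print("2ST:         ", k_2st(P, Q + R, n))
    print("3ST:         ", k_3st(P, Q, R, n))
```

A useful consistency check is a round trip: generate the proportions from the closed-form solutions of a model for chosen rates and time, feed them to the matching estimator, and confirm that it returns 2kt. When Q and R are equal, the 3ST and 2ST estimates of K coincide.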

Table 11
Estimates of the genetic distance Ki, where i = 1, 2, or 3 is the ith codon position, under the different models considered in Tavaré (1986). The sequence data are those of human and mouse α-fetoprotein.

Model          K1               K2               K3
Jukes-Cantor   0.1752 (.0186)   0.1387 (.0162)   .6566 (.0483)
3ST            0.1760 (.0188)   0.1389 (.0163)   .7230 (.0642)

(The parenthesized quantities are standard errors.)

Figure Captions

Figure 1. Divergence of sequences S1 and S2 from some common ancestor.
Figure 2. Types and rates of nucleotide substitutions.
Figure 3. Types and rates of nucleotide substitutions: 3ST Model.
Figure 4. Types and rates of nucleotide substitutions: 2ST Model.
Figure 5. Types and rates of nucleotide substitutions: Jukes-Cantor Model.
Figure 6. The nucleotide sequences of the coding region of the mouse protein kinase inhibitor (Mpki.M) and the human protein kinase inhibitor (Hpki.2). The 15 mismatches are indicated with bars (Olsen and Uhler, 1991a, 1991b).

[Figure 1: an ancestral sequence splitting into two branches, each of length T, ending in S1 and S2.]

[Figure 2: the bases U and C (top) and A and G (bottom) joined by arrows labelled with the rates α1-α4 (horizontal), β1-β4 (vertical), and γ1-γ4 (diagonal).]

[Figure 3: the same diagram with rates α (horizontal), β (vertical), and γ (diagonal).]

[Figure 4: the same diagram with rates α (horizontal) and β (vertical and diagonal).]

[Figure 5: the same diagram with all rates equal to α.]
