Parallel Algorithm for Phylogeny: Introduction to Phylogenetics part 3

This post draws on the same source as part 2; Blogger had a fit at the size of the post, so it continues here.

Homology is the term used to describe a trait in two or more species that is derived from a common ancestor. Two morphological structures are called homologous if they are built by the same evolutionary pathway.

Orthology describes DNA sequences in different species that descend from a single sequence in their common ancestor, so two similar DNA sequences can be evidence of a recent common ancestor. The comparison works best on long DNA sequences, as there is less chance that the similarity arose by coincidence. Two such sequences are called orthologous.

Homology and orthology can be connected, but the link is not reliable. Evolution can re-task genes to new functions. When we find two species with similar DNA (orthology) but different morphological traits (homology), it is an indication of an "interesting" evolutionary event.

Figure 1: an example of homologous characteristics deriving from a common ancestor.

Analogy describes similar characteristics that do not derive from a common ancestor.

Figure 2: Analogous characteristics. The ichthyosaur and the dolphin look morphologically the same; however, the phylogenetic tree shows that they do not share a recent common ancestor (we know that dolphins evolved from land mammals and mammals did not evolve from ichthyosaurs).

A broader term for analogy is homoplasy, a term to describe any similarity that does not derive from common ancestry.

How to build a phylogenetic tree
First you must collect specimens from the species that you want to place onto the phylogenetic tree. For each specimen you must determine the state in which particular traits are found; these traits are called characters.

Morphological traits are easier to track in samples that are closely related (such as a bunch of butterflies). However, if you are dealing with a diverse set of specimens (such as elephants and trees), it is better to rely on DNA.

Next you build a matrix with the specimens on the rows and the characters on the columns. The characters selected should be meaningful and useful in defining a relationship. Systematics teaches us that only shared derived traits are useful for classification.
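As a minimal sketch of such a matrix, here is one way to hold it in Python. The taxa, the characters and the 0/1 coding (0 = absent, 1 = present) are made up for illustration and are not taken from the source.

```python
# Hypothetical taxa and characters, chosen only for illustration.
characters = ["jaws", "four_limbs", "hair"]

# Rows are specimens, columns are characters; 0 = absent, 1 = present.
matrix = {
    "lamprey": [0, 0, 0],
    "trout":   [1, 0, 0],
    "lizard":  [1, 1, 0],
    "mouse":   [1, 1, 1],
}

def column(matrix, characters, name):
    """Return the state of one character for every specimen."""
    j = characters.index(name)
    return {taxon: row[j] for taxon, row in matrix.items()}

four_limbs = column(matrix, characters, "four_limbs")
```

Reading down a column shows which specimens share a character state, which is what the later steps work from.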

The ingroup is the collection of species that you are focused on, which share a derived character.

The outgroup consists of relatives of the ingroup that do not share the character in question.

So derived characters shared by the ingroup but not shared with the outgroup are informative, while primitive characters (from the ancient past) are less informative. Therefore it is important to know which characters are derived and which are primitive. This leads to a circular problem: you cannot tell a derived character from a primitive one without a tree, and you cannot build a tree without knowing whether a character is derived or primitive.

Solutions to the problem:
Parsimony: Build all versions of the tree and apply Occam's Razor, selecting the simplest: the tree with the least convergence and the fewest changes in character state.

Maximum Likelihood: Choose the tree that would make the characters you did observe most likely. This method requires an assumption of how evolutionary change happens over time.

Bayesian inference: Choose the tree that is most probable (branches, branch lengths, and character distributions) given the observed data and a prior expectation of what the tree should look like. This method uses statistics and probability to estimate the correct tree.

Hybrid: Use a combination of the three listed solutions.
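"Build all versions of the tree" can be sketched for a handful of species. The code below is an illustration only, assuming rooted, strictly bifurcating trees written as nested tuples; the function names are my own. It grows every tree by attaching each new species to every possible branch of every smaller tree.

```python
def insert_everywhere(tree, new):
    """Yield every tree made by attaching `new` to one branch of `tree`."""
    # Attach above the current node (as a sister to the whole subtree).
    yield (new, tree)
    # Attach somewhere inside the left or right subtree.
    if isinstance(tree, tuple):
        left, right = tree
        for grown in insert_everywhere(left, new):
            yield (grown, right)
        for grown in insert_everywhere(right, new):
            yield (left, grown)

def all_rooted_trees(taxa):
    """Yield every rooted bifurcating tree on the given list of taxa."""
    if len(taxa) == 1:
        yield taxa[0]
        return
    first, rest = taxa[0], taxa[1:]
    for smaller in all_rooted_trees(rest):
        yield from insert_everywhere(smaller, first)

three = list(all_rooted_trees(["A", "B", "C"]))
four = list(all_rooted_trees(["A", "B", "C", "D"]))
print(len(three), len(four))  # prints "3 15"
```

Each extra species multiplies the number of trees, which is why exhaustive search only works for very small sets.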

A deeper look into phylogenetics:

Synapomorphy: "A shared character state indicating that two species belong to the same group." A trait derived from a common ancestor, such as the forelimbs in tetrapods, which are not found in lobe-finned fish.

Symplesiomorphy: an uninformative ancestral shared trait.

Some important points:
1) Looking the same is not informative.
2) Informative traits are shared and derived.
3) What is shared and what is derived depends on the context of what part of the tree you are looking at. Therefore what is informative is dependent on context.
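The three points above can be sketched in code. This is an illustration under stated assumptions, not a standard routine: characters are binary, the outgroup state is taken as the primitive state, and the taxa, characters and label wording are invented for the example.

```python
# Hypothetical 0/1 character matrix; "lamprey" plays the outgroup.
characters = ["notochord", "jaws", "four_limbs", "hair"]
matrix = {
    "lamprey": [1, 0, 0, 0],
    "trout":   [1, 1, 0, 0],
    "lizard":  [1, 1, 1, 0],
    "mouse":   [1, 1, 1, 1],
}

def classify_characters(matrix, characters, outgroup):
    """Label each character, treating the outgroup state as primitive."""
    ingroup = [t for t in matrix if t != outgroup]
    labels = {}
    for j, name in enumerate(characters):
        primitive = matrix[outgroup][j]
        derived = [t for t in ingroup if matrix[t][j] != primitive]
        if not derived:
            labels[name] = "symplesiomorphy"           # shared but primitive
        elif len(derived) == 1:
            labels[name] = "derived in one taxon"      # derived but not shared
        elif len(derived) == len(ingroup):
            labels[name] = "synapomorphy of ingroup"   # unites the whole ingroup
        else:
            labels[name] = "synapomorphy of subgroup"  # groups taxa within the ingroup
    return labels

labels = classify_characters(matrix, characters, "lamprey")
```

Only "four_limbs" helps to arrange the three ingroup species here. "jaws" would become useful if the question were moved one level up the tree, which is the context dependence described in point 3.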

To build the tree from the character matrix, follow the change in character states between species:
"We start to build a phylogenetic tree by considering the elemental step—the
change of a trait from one state to another (Figure 13.16). We denote the things
whose relationships are being analyzed—the genes, species, or larger groups—
by capital letters, A, B, C . . ., and the traits being used to determine relationship
by numbers, 1, 2, 3 . . .."

How do we infer the correct tree from the character matrix? We start with the observable character matrix and build the tree from it.

Figure 3: an example of how a character matrix relates and is transformed into a phylogenetic tree.

Trees are rooted by choice of outgroup.

A tree can be generated by focusing on the synapomorphy or the symplesiomorphy as the basis for branches. Once the trees have been made we can use parsimony or maximum likelihood to choose the one we think is correct. After the tree is chosen we must determine the root, which is done by choosing an outgroup that is closely related to the clade but lies outside it. So if we are establishing a clade of mammals, then using a fish or a lizard as the outgroup would be appropriate to establish an ancestor relationship. "Rooting the tree establishes the direction of character change within the ingroup." The character states in the outgroup are considered to be primitive.

Outgroup comparison: A method used to root a phylogenetic tree and to establish the direction of character change.

Figure 4: Rooting a phylogenetic tree

Figure 5: Selecting a tree using parsimony; the tree with the fewest step changes is more likely to be correct.
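Counting step changes can be sketched with Fitch's small-parsimony algorithm, which I am using here as one common way to do the count; the source does not name a specific method. The taxa A to D, the 0/1 matrix and the three candidate trees are invented for the example, and trees are rooted nested tuples.

```python
def fitch(tree, states):
    """Return (possible ancestral states, minimum changes) for one character."""
    if isinstance(tree, str):                  # a leaf: its observed state, no changes
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                                 # children can agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, matrix):
    """Total number of state changes the tree needs over all characters."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for j in range(n_chars):
        states = {taxon: row[j] for taxon, row in matrix.items()}
        total += fitch(tree, states)[1]
    return total

# Hypothetical character matrix and candidate trees.
matrix = {"A": [0, 0, 0], "B": [1, 0, 0], "C": [1, 1, 0], "D": [1, 1, 1]}
candidates = [
    ("A", ("B", ("C", "D"))),
    ("A", ("C", ("B", "D"))),
    ("A", ("D", ("B", "C"))),
]
scores = [parsimony_score(t, matrix) for t in candidates]
best = min(candidates, key=lambda t: parsimony_score(t, matrix))
print(scores)  # prints "[3, 4, 4]"
```

The first tree needs three changes and the other two need four, so parsimony picks the first.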

Using maximum likelihood means that we need a way to calculate the probability of the observed characters given a candidate tree. This requires a model of how probable each change in character state is along a branch. "For each candidate tree the probability is calculated of finding any two
sequences at opposite ends of a branch (for example a T at one end and a C at
the other). This is done for all branches, the probabilities are multiplied together
to get the likelihood for the whole tree, and the best tree is then the one with the
maximum likelihood (Felsenstein 1988). The method is logically appealing and
computationally expensive. Its range of application is increasing as computers
improve."
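The multiplication described in the quote can be sketched as follows. This is a simplification under loud assumptions: I use the Jukes-Cantor substitution model (the source does not name a model), the bases at both ends of every branch are taken as known, and branch lengths are in expected substitutions per site. A real implementation sums over the unknown bases at internal nodes.

```python
import math

def jc69_prob(a, b, t):
    """Jukes-Cantor probability of seeing base b at the end of a branch of
    length t that starts with base a."""
    decay = math.exp(-4.0 * t / 3.0)
    if a == b:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

def tree_log_likelihood(branches):
    """Sum of log probabilities over branches, i.e. the log of their product.

    branches is a list of (base_at_one_end, base_at_other_end, branch_length).
    """
    return sum(math.log(jc69_prob(a, b, t)) for a, b, t in branches)

# Two hypothetical candidate trees, each reduced to its list of branches.
tree_one = [("T", "T", 0.1), ("T", "C", 0.1)]   # needs one change
tree_two = [("T", "C", 0.1), ("T", "C", 0.1)]   # needs two changes
better = max([tree_one, tree_two], key=tree_log_likelihood)
```

Logs are summed rather than multiplying raw probabilities because the product over many branches and sites quickly becomes too small for floating point.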

Constructing even a modest tree requires immense computational power, hence the reason for this project.

Though a tree is built from a simple matrix, as species and character traits are added the problem becomes immense (see Figure 6; the number of possible trees grows faster than O(2^n)).

Figure 6: Computational estimates of various sized phylogenetic trees. (Note to the reader: there are only around 10^80 protons, electrons and neutrons in the observable universe; compare that to the number of possible trees for 500 taxa, which is around 10^1280. No wonder creative computer algorithms are necessary! Given a computer running at a teraflop (10^12 operations per second) and checking one tree per operation, it would take about 10^1268 seconds to process them all. The universe is only about 4 x 10^17 seconds old, so that is about 10^1250 lifetimes of the universe! That amount of time is so big that I can't get my head around it; it may as well be infinite. Thankfully I can think of some algorithms to bring that number down, and we shall have to see if they work ;) but even the ones that leap to mind would still be limited to about 25 species.)
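The size of these numbers can be checked with a short calculation. The sketch below assumes rooted, strictly bifurcating trees, for which the count for n taxa is the double factorial (2n - 3)!! = 3 x 5 x ... x (2n - 3). The rate of 10^12 trees per second is an assumed figure for illustration. Python integers are used throughout because the values overflow floating point.

```python
def rooted_tree_count(n):
    """Number of rooted bifurcating trees for n >= 2 labelled taxa: (2n - 3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):   # 3, 5, ..., 2n - 3
        count *= k
    return count

def order_of_magnitude(value):
    """Power of ten of a positive integer (number of digits minus one)."""
    return len(str(value)) - 1

print(rooted_tree_count(4))    # prints 15
print(rooted_tree_count(10))   # prints 34459425

trees_500 = rooted_tree_count(500)
seconds = trees_500 // 10**12            # assumed: 10^12 trees checked per second
lifetimes = seconds // (4 * 10**17)      # age of the universe in seconds
print(order_of_magnitude(trees_500),
      order_of_magnitude(seconds),
      order_of_magnitude(lifetimes))
```

Even ten taxa give over 34 million rooted trees, so exhaustive search stops being practical long before 500.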

That is probably a good place to say: To be continued in part 4

Wednesday, 9 June 2010