Saturday 10 July 2010

Tree of life algorithmic challenge part 1

Five steps of tree of life generation
1) Gather data on species to be compared
Gather DNA information: For this project genome databases will be queried for DNA sequences
2) Align DNA sequences
Use string comparison algorithm to align DNA - String Edit for a O > (2^N) comparison series.
http://www.cs.princeton.edu/courses/archive/spr05/cos126/assignments/sequence.html
3) Build multitude of trees
Use distance attributes from string alignment to determine optimal solution.
4) Compare trees to obtain "true" tree
5) Perform post tree analysis

Performance Criteria
1) Run Time
2) Space Requirements
3) Statistical Performance
4) Topological Accuracy - How well the tree matches the true tree
5) Accuracy in relation to real world data

"A model is identifiable if it is uniquely characterized by the probability distribution it defines."

"A phylogenetic reconstruction method is statistically consistent under a model if the probability that the method reconstructs the true tree goes to 1 as the sequence length increases."

Markov models are identifiable.

"Phylogeny estimation typically is done under identifiable models."

• "Neighbour Joining is polynomial time, and statistically consistent."
• "Maximum Parsimony is NP-hard, and even exact solutions are not statistically consistent."
• "Maximum Likelihood is NP-hard, but exact solutions are statistically consistent"

"Neighbour Joining has poor performance on large diameter trees"

"Neighbor joining (and some other distance-based methods) will return the true tree with high probability provided sequence lengths are exponential in the diameter of the tree (Erdos et al., Atteson). Exponential lower bound for caterpillar trees: Lacey and Chang."

"Maximum likelihood will return the true tree with high probability provided sequence lengths are exponential in the number of taxa (Steel and Szekely)."

"Better accuracy is obtained through good heuristics for NP-hard optimization (esp. maximum likelihood)"